-
Notifications
You must be signed in to change notification settings - Fork 140
[RFC] Resolve DMA boot bug in BDW in SST driver #1842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This reverts commit 0d2135e. There is a known bug in broadwell where sometimes the DMA does not resume correctly. This bug has been observed on 5.X kernels. The root causes has been identified as this workaround. Removing this workaround prevents the DSP from crashing on broadwell through suspend resume cycles. This appears to be a race condition where some devices are more susceptible than others. Change-Id: Id0e434186143547acd1ab3f6e308f6462e55cabc Signed-off-by: Jon Flatley <jflat@chromium.org> Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
|
Adding @keyonjie who authored the original patch. |
|
@keyonjie I double-checked the sequences listed in the commit message and the actual code and they are not aligned at all. "we must set VDRTCTL0.D3PGD to 1 (D3 power gating disabled) at first startup and keep it all the time" val &= ~(SST_VDRTCL0_D3PGD | SST_VDRTCL0_D3SRAMPGD);
writel(val, sst->addr.pci_cfg + SST_VDRTCTL0);I basically think the code is invalid and needs to be reverted. |
|
Hi all, I am not sure if it is an good idea to revert this major fix which was recommended from FW team and verified works for years. BTW, can anyone explain how can the DMA boot(or FW downloading) bug be related with this WA? |
Basically those wordings were from FW team(as you know, FW is a black box to us at that age), per my understanding, here "keep it all the time" actually means "keep it all the D0 time".
And preserving this D3 power gating for D3 entry is also emphasized from FW friends.
I think I explain above, just sent you the email thread about the background of the original patch. |
No one knows... the point is that with these "improvements" we see a failure so either |
understood, @plbossart can you share me what failure(bug link?) we are seeing, let me take a look to it. |
|
@keyonjie see the context link for dmesg trace What happens is that the DMA driver used to load the sst firmware fails to come online. The linked line fails with DW_PARAMS as 0x0 |
|
@cujomalainey thanks for sharing. IMHO, we'd better to proceed it like this:
|
|
@keyonjie some devices apparently hit it more frequently than others. We have a samus that has a pretty good reproduction rate. If I remember correctly this causes a similar DMA issue on buddy (broadwell all in one platform) that also reproduces it (again if I remember correctly) constantly. We have reached out to Andy previously and he was unable to figure it out. |
Thanks for information @cujomalainey Let me try if we can find a samus or buddy here. So to reproduce it, I need Chrome OS or Ubuntu installed is fine? just keep rebooting or doing suspend/resume loops? |
lgirdwood
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no issue in leaving the granular items of PM D3 entry to PCI/BIOS. I assume this also works on non coreboot platforms like Dell XPS.
|
To give further background - DW DMAC is shared between host and DSP FW on BDW. It can be programmed by both, there is no SW/FW arbitration over ownership. Host drivers will use DMA to load FW into DSP and then "hand over" control to FW. It is important that the DMAC is inactive/OFF (wrt host) when FW is performing DMAC (re)initialisation. |
This is interesting. And could be useful for our SDMA driver. How does this hand over happens? |
|
@keyonjie that is correct, we never tested Ubuntu here, only ChromeOS |
|
@lgirdwood we don't have any dell XPS here to test with unfortunately. |
|
@keqiaozhang @fredoh9 do we have any Broadwell Samus chromebook? Can we run Ubuntu on it? |
|
@keqiaozhang can you help check if we can get similar errors with stress on WSB/Dell XPS + ubuntu(unselect SOF, select SST driver) on our side? |
There is no logic enforcing this, it's just the driver does not programm the DAM IP after FW has loaded. There is nothing stopping other kernel users binding to the DMA engine and using it. |
|
@plbossart @cujomalainey I tried suspend/resume remotely on Samus+Ubuntu with topic/sof-dev kernel run with sst driver more than 20 cycles, didn't see the DMA probe error issue, but see i915 suspend errors, did you observe the similar: |
|
@keyonjie I just had this on SAMUS Plain vanilla ubuntu 5.3 kernel |
|
and again |
|
OK, trying commit 4d856f7 now. |
|
@keyonjie no I have not see i915 issues |
|
@cujomalainey thanks for the information, that is important hint. Though I don't know how the boot beep is implemented, I guess it uses the DW DMA to transfer data to the SSP FIFO, better to check and make sure it is shut down correctly after boot beep done. |
|
@keyonjie i doubt it is, that is the issue I had on CHT see thesofproject/sof#2148, if you want to take a look at the code here is a reference commit https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/238859/ |
|
So far I've only been able to test the v4.14 based version of this patch, but I can say pretty conclusively that the patch helps there. Without this patch, my system (with the beep turned on) was reproducing the issue 100%. With the patch, I wasn't able to reproduce the issue after several reboots. I'm having a bit of trouble with my system at the moment, but I'll try and recover it and continue testing with the patch applied to a more recent baseline, but wanted to give you that feedback at least. |
That makes sense, thanks @cujomalainey. |
Thanks for the feedback, waiting for your result with recent code base. |
|
I've confirmed that the upstream version of this patch resolves this issue for me. With the patch's baseline commit: 30887e1 soundwire: bus: align with upstream I'm able to reproduce the issue 100% on my samus. With the commit applied here: e699a33 Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue" I've been completely unable to reproduce the issue. |
Hi @rzwisler reverting the commit you mentioned is not the way we want to go, can you help check if the issue can be fixed with this commit applied(keep the commit "ASoC: Intel: Work around to fix HW D3 potential crash issue" from being reverted) 0d2135e |
|
Sorry, I'm confused. By "this commit" I thought you meant the commit that's added by this pull request: https://github.com/thesofproject/linux/pull/1842/commits Which is: e699a33 Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue" right? |
discard the commit from PR#1842, and apply the commit from PR#1873 |
|
I tested with the commit here: 151eab2 ASoC: Intel: haswell-dsp: refine D0<->D3 transition sequence and my system still fails 100% of the time on boot: dmesg | grep haswell[ 41.124599] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed |
Thanks for the update, so we still need to dig it more to root cause it. |
|
Adding Cezary who might be helpful to look from the FW side also. @crojewsk Hi @crojewsk , we are hitting DMA probing fail issue on SAMUS Chrome OS as discussed above, My commit 0d2135e was written to follow the sequence suggested by your team member(I will forward you the thread), @plbossart raised some comments and corrections, and here #1873 is what I just wrote days back to try to fix. The latest test result from @rzwisler shows that my refining commit #1873 doesn't help on this issue, @crojewsk can you help to take a look on it? |
|
Hello, could you checkout my wpt_dx branch. It is based on recent broonie/for-next (v5.6-rc6) with my latest fixes for bdw based machine boards and a WIP patch addressing incorrect D0 <-> D3 flow. I didn't add the setup-defaults method yet - will do it later. This too is not streamlined with the recommended flow. Hope it's not some DMA race that started to occur due to delays appended after each crucial registry write to DSP hardware. Code version from 3.14 lacks some of these. Did basic tests on my BDW-Y RVP, though the issue never reproduced there in the first place. |
|
I tested with this commit: ee506e8e2e01 (crojewsk/wpt_dx) [WIP] ASoC: Intel: haswell: Power transitions update and verified that the issue still occurs: [ 14.824629] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed |
|
I'll prepare patch with set-default method appended. Also, you can remove changes done in hsw_block_disable/ enable by Keyon in the initial patch so we are aware if this regression is indeed caused by _set_set_D0() -or- by mem block handling instead. One more thing - regardless of outcome, I'm leaning toward delaying/ quickening sst_dma_new. Flow present in wpt-dx is the recommended one and one WPT-LP has been released with. Awaiting your response. |
|
Another one: logs you dump are quite minimal. to: /drivers/dma/dw/Makefile if that's =y. At least grep for: "DW_PARAMS", thank you. |
|
Uploaded v2. Please re-test. Explanation later. |
|
@cujomalainey @rzwisler any info regarding wpt-dx v2? |
|
@rzwisler thanks for testing, I have multiple P1 bugs that I have been dealing with. |
|
The following is true:
Offending change: 'Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue"' is NAKed as it only covers the problem up and actually brings back the undefined behavior: some registers (e.g.: APLLSE) are describing LPT rather than WPT, only god knows what is actually happening during power-transitions when driver issues incorrect writes and leaves the regs of interest alone. Keyon's initial patch (the 5yr old one) does not resolve the HW D3 issue at all as it ignores the recommended sequence; probably comes just down to shutting down the core than anything else. Resolution:
|
|
In the wake of above, it's unsettling but important to mention: This issue is NOT tied to SOF and should not be posted here. Official Intel-client channel should have been used for tracking and resolving the issue. This goes to @cujomalainey and @rzwisler as you should be familiar with the procedure - Cedrik's team and Harsha are your contacts. Moreover, I've requested additional testing right after receiving feedback for my initial patch - no comments have been added. The next day after I got SAMUS into my home issue has been bisected and fixed. Don't believe standard bisection could not be issued during all (a month) this time - you do not have to wait on my explicit request. Sitting idle does not solve anything. This message has been forwarded to all necessary subjects. |
The first issue on Broadwell was filed on alsa-devel on July 29, 2019, see "[BUG] bdw-rt5650 DSP boot timeout" You were in copy and even replied to this thread. There was no follow-up except for Keyon trying to revisit the sequence from 5 years ago when Google folks identified the problematic initialization sequence. We appreciate the work you did last week but please stop telling world+dog that the SOF folks are evil.
@crojewsk giving lessons doesn't help and no one stayed idle. There was a DRM-based issue that impacted bisects and on Samus a very specific configuration is needed to support suspend-resume. |
We did bring this up with Intel last summer (July/August). According to the meeting minutes you attended the call, and both Cedrik and Harsha were on the CC. I'll forward you the minutes. |
Update D0 <-> D3 sequence to correctly transition hardware and DSP core from and to D3. On top of that, set SHIM registers to their recommended defaults during D0 and D3 proceduces as HW does not reset registers for us. Connected to: [alsa-devel][BUG] bdw-rt5650 DSP boot timeout https://mailman.alsa-project.org/pipermail/alsa-devel/2019-July/153098.html Github issue ticket reference: thesofproject#1842 Tested on: - BDW-Y RVP with rt286 - SAMUS with rt5677 Proposed solution (both in July 2019 and on github): 'Revert "ASoC: Intel: Work around to fix HW d3 potential crash issue"' is NAKed as it only covers the problem up and actually brings back the undefined behavior: some registers (e.g.: APLLSE) are describing LPT offsets rather than WPT ones. In consequence, during power-transitions driver issues incorrect writes and leaves the regs of interest alone. Existing patch - the non-revert - does not resolve the HW D3 issue at all as it ignores the recommended sequence and does not initialize hardware registers as expected. And thus, leaving things as are is also unacceptable. Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
Update D0 <-> D3 sequence to correctly transition hardware and DSP core from and to D3. On top of that, set SHIM registers to their recommended defaults during D0 and D3 proceduces as HW does not reset registers for us. Connected to: [alsa-devel][BUG] bdw-rt5650 DSP boot timeout https://mailman.alsa-project.org/pipermail/alsa-devel/2019-July/153098.html Github issue ticket reference: thesofproject#1842 Tested on: - BDW-Y RVP with rt286 - SAMUS with rt5677 Proposed solution (both in July 2019 and on github): 'Revert "ASoC: Intel: Work around to fix HW d3 potential crash issue"' is NAKed as it only covers the problem up and actually brings back the undefined behavior: some registers (e.g.: APLLSE) are describing LPT offsets rather than WPT ones. In consequence, during power-transitions driver issues incorrect writes and leaves the regs of interest alone. Existing patch - the non-revert - does not resolve the HW D3 issue at all as it ignores the recommended sequence and does not initialize hardware registers as expected. And thus, leaving things as are is also unacceptable. Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com> Tested-by: Ross Zwisler <zwisler@google.com> Link: https://lore.kernel.org/r/20200330194520.13253-1-cezary.rojewski@intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
|
Apologies for drudging up an old thread @crojewsk we applied your patch to our 4.14 kernel (clean application) and we got a bunch of failed tests back from the test team, mainly headset mic recording. We can't test with upstream kernels right now as our toolchains can't build upstream right now but figured you might want to check it out. |
jira LE-1907 Rebuild_History Non-Buildable kernel-4.18.0-294.el8 Rebuild_CHGLOG: - [sound] ALSA: ASoC: Intel: haswell: Power transition refactor (Jaroslav Kysela) [1869536] Rebuild_FUZZ: 94.00% commit-author Cezary Rojewski <cezary.rojewski@intel.com> commit 8ec7d60 Update D0 <-> D3 sequence to correctly transition hardware and DSP core from and to D3. On top of that, set SHIM registers to their recommended defaults during D0 and D3 proceduces as HW does not reset registers for us. Connected to: [alsa-devel][BUG] bdw-rt5650 DSP boot timeout https://mailman.alsa-project.org/pipermail/alsa-devel/2019-July/153098.html Github issue ticket reference: thesofproject/linux#1842 Tested on: - BDW-Y RVP with rt286 - SAMUS with rt5677 Proposed solution (both in July 2019 and on github): 'Revert "ASoC: Intel: Work around to fix HW d3 potential crash issue"' is NAKed as it only covers the problem up and actually brings back the undefined behavior: some registers (e.g.: APLLSE) are describing LPT offsets rather than WPT ones. In consequence, during power-transitions driver issues incorrect writes and leaves the regs of interest alone. Existing patch - the non-revert - does not resolve the HW D3 issue at all as it ignores the recommended sequence and does not initialize hardware registers as expected. And thus, leaving things as are is also unacceptable. Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com> Tested-by: Ross Zwisler <zwisler@google.com> Link: https://lore.kernel.org/r/20200330194520.13253-1-cezary.rojewski@intel.com Signed-off-by: Mark Brown <broonie@kernel.org> (cherry picked from commit 8ec7d60) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
For context #1659 (comment)
Looking for verification now that Intel would accept this revert now that is has been observed by Pierre first hand. If so I will do a verification this so we can merge it. Comments welcome. This is a cherry-picked version of the patch we landed in 4.14 and launched samus with and we have seen no issues and due to lack of development on those drivers it cherry-picked cleanly.
NOTE: needs retesting on 5.X still, will do next week if I get a 👍