
@cujomalainey

For context #1659 (comment)

Looking for verification that Intel would now accept this revert, now that it has been observed by Pierre first hand. If so, I will verify it so we can merge it. Comments welcome. This is a cherry-picked version of the patch we landed in 4.14 and launched samus with; we have seen no issues, and due to the lack of development on those drivers it cherry-picked cleanly.

NOTE: needs retesting on 5.X still, will do next week if I get a 👍

This reverts commit 0d2135e.

There is a known bug in broadwell where sometimes the DMA does not
resume correctly. This bug has been observed on 5.X kernels. The root
cause has been identified as this workaround. Removing this workaround
prevents the DSP from crashing on broadwell through suspend/resume
cycles. This appears to be a race condition to which some devices are
more susceptible than others.

Change-Id: Id0e434186143547acd1ab3f6e308f6462e55cabc
Signed-off-by: Jon Flatley <jflat@chromium.org>
Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
@kv2019i
Collaborator

kv2019i commented Mar 2, 2020

Adding @keyonjie who authored the original patch.

@plbossart
Member

@keyonjie I double-checked the sequences listed in the commit message against the actual code, and they are not aligned at all.
For example, the code does not keep D3PGD set to 1, as the sequence clearly states:

"we must set VDRTCTL0.D3PGD to 1 (D3 power gating disabled) at first startup and keep it all the time"

	val &= ~(SST_VDRTCL0_D3PGD | SST_VDRTCL0_D3SRAMPGD);
	writel(val, sst->addr.pci_cfg + SST_VDRTCTL0);

I basically think the code is invalid and needs to be reverted.

@keyonjie

keyonjie commented Mar 2, 2020

Hi all, I am not sure it is a good idea to revert this major fix, which was recommended by the FW team and has been verified to work for years. BTW, can anyone explain how the DMA boot (or FW download) bug could be related to this WA?

@keyonjie

keyonjie commented Mar 2, 2020

> @keyonjie I double-checked the sequences listed in the commit message and the actual code and they are not aligned at all.
> For example, the code does not keep D3PGD set to one as clearly explained in the sequences:
>
> "we must set VDRTCTL0.D3PGD to 1 (D3 power gating disabled) at first startup and keep it all the time"

Basically, that wording came from the FW team (as you know, the FW was a black box to us at that time). My understanding is that "keep it all the time" actually means "keep it all the D0 time".

> 	val &= ~(SST_VDRTCL0_D3PGD | SST_VDRTCL0_D3SRAMPGD);

Preserving this D3 power gating for D3 entry was also emphasized by our FW friends.

> 	writel(val, sst->addr.pci_cfg + SST_VDRTCTL0);
>
> I basically think the code is invalid and needs to be reverted.

I think I explained this above; I just sent you the email thread with the background of the original patch.
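A minimal user-space sketch of Keyon's reading of the FW guidance: D3PGD is set on D0 entry and kept set while in D0, and both D3PGD and D3SRAMPGD are cleared on D3 entry, which is what the quoted line does. The bit positions are assumptions for illustration, and plain `uint32_t` values stand in for `readl()`/`writel()` on `SST_VDRTCTL0`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only: check the real
 * SST_VDRTCL0_D3PGD / SST_VDRTCL0_D3SRAMPGD definitions before relying
 * on these values. */
#define VDRTCL0_D3PGD     (1u << 0) /* 1 = D3 power gating disabled */
#define VDRTCL0_D3SRAMPGD (1u << 1) /* 1 = D3 SRAM power gating disabled */

/* D0 entry: set D3PGD and leave it set for the whole time the DSP stays
 * in D0 ("keep it all the D0 time"). Other bits are preserved. */
static uint32_t vdrtctl0_enter_d0(uint32_t val)
{
	return val | VDRTCL0_D3PGD;
}

/* D3 entry: clear both "gating disabled" bits so D3 power gating can
 * take effect; same operation as the quoted driver line
 *   val &= ~(SST_VDRTCL0_D3PGD | SST_VDRTCL0_D3SRAMPGD); */
static uint32_t vdrtctl0_enter_d3(uint32_t val)
{
	return val & ~(VDRTCL0_D3PGD | VDRTCL0_D3SRAMPGD);
}
```

Under this reading the commit message and the code only agree if the clear runs exclusively on the D3 path, which is the point to verify against the driver.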

@plbossart
Member

> Hi all, I am not sure if it is an good idea to revert this major fix which was recommended from FW team and verified works for years. BTW, can anyone explain how can the DMA boot(or FW downloading) bug be related with this WA?

No one knows... the point is that with these "improvements" we see a failure, so either
a) we revert
b) we fix.
If we don't fix, we revert... Very simple decision.

@keyonjie

keyonjie commented Mar 3, 2020

> > Hi all, I am not sure if it is an good idea to revert this major fix which was recommended from FW team and verified works for years. BTW, can anyone explain how can the DMA boot(or FW downloading) bug be related with this WA?
>
> No one knows... the point is that with these "improvements" we see a failure so either
> a) we revert
> b) we fix.
> If we don't fix we revert... Very simple decision.

Understood. @plbossart, can you share what failure (bug link?) we are seeing? Let me take a look at it.

@cujomalainey
Author

@keyonjie see the context link for dmesg trace

What happens is that the DMA driver used to load the SST firmware fails to come online: the check at the linked line fails because DW_PARAMS reads back as 0x0.

https://github.com/torvalds/linux/blob/3b319ee220a8795406852a897299dbdfc1b09911/drivers/dma/dw/core.c#L1070
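A simplified model of why a zero DW_PARAMS is fatal, assuming the field layout used by the dw_dmac driver (bit 28 as the "encoded parameters present" flag, bits 10:8 as channel count minus one). The helper `dw_params_nr_channels()` is hypothetical and is not the driver's code; it returns `-EINVAL` (-22 on Linux), which is consistent with the "sst_dma_new failed -22" messages reported in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed field layout, modeled on the DW_PARAMS_* definitions in
 * drivers/dma/dw/regs.h; verify against the kernel under test. */
#define DW_PARAMS_NR_CHAN 8  /* bits 10:8: number of channels - 1 */
#define DW_PARAMS_EN      28 /* 1 = encoded parameters are present */

/* Returns the channel count, or -EINVAL when the register carries no
 * usable configuration, e.g. when it reads back as 0 as reported here.
 * Hypothetical helper for illustration only. */
static int dw_params_nr_channels(uint32_t dw_params)
{
	if (!((dw_params >> DW_PARAMS_EN) & 1))
		return -EINVAL;
	return (int)((dw_params >> DW_PARAMS_NR_CHAN) & 7) + 1;
}
```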

@keyonjie

keyonjie commented Mar 3, 2020

@cujomalainey thanks for sharing.

IMHO, we'd better proceed like this:

  1. Figure out reliable steps to reproduce it.
  2. Get help from the DW DMA maintainers, Andy and Vinod, to explain why the DW_PARAMS register reads back 0, to try to find the root cause of the DMA probing failure.
  3. My original commit was meant to reinforce the register configuration sequence for boot and D0<->D3 transitions, based on the recommendation from the FW/HW team. Without it, we might hit a potential DSP D3 crash, per what the FW/HW folks said.
    So even if it is proven that reverting the commit fixes the DMA probing failure, we need to figure out which line of the commit leads to this failure, and then change only that specific line.
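Step 3 amounts to bisecting the suspect commit hunk by hunk. A sketch of that search, assuming the failure is monotonic (once the triggering hunk is applied it stays reproducible) and that the tester can apply the first k hunks on top of the known-good baseline. `first_bad_hunk()` and `simulated_fails()` are hypothetical names, and the simulated predicate only stands in for a real reboot or suspend/resume run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tester-supplied predicate: apply the first k hunks of the suspect
 * commit on the known-good baseline, run the reproduction loop, and
 * return true if the failure showed up. */
typedef bool (*fails_with_first_k_fn)(size_t k, void *ctx);

/* Binary search for the smallest k in [1, n] whose build fails.
 * Preconditions: the baseline (k = 0) passes, the full commit (k = n)
 * fails, and the result is monotonic in k. Returns the 1-based index of
 * the first bad hunk after about log2(n) test runs. */
static size_t first_bad_hunk(size_t n, fails_with_first_k_fn fails, void *ctx)
{
	size_t good = 0, bad = n;

	assert(n > 0);
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (fails(mid, ctx))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}

/* Simulated tester for illustration: hunk *ctx is the culprit, so every
 * build that includes it fails. */
static bool simulated_fails(size_t k, void *ctx)
{
	const size_t *culprit = ctx;

	return k >= *culprit;
}
```

With an intermittent failure, each `fails()` call needs enough cycles to be trusted; a single passing run is weak evidence.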

@cujomalainey
Author

@keyonjie some devices apparently hit it more frequently than others. We have a samus that has a pretty good reproduction rate. If I remember correctly, this causes a similar DMA issue on buddy (a Broadwell all-in-one platform), which also reproduces it constantly (again, if I remember correctly). We reached out to Andy previously and he was unable to figure it out.

@keyonjie

keyonjie commented Mar 3, 2020

> @keyonjie some devices apparently hit it more frequently than others. We have a samus that has a pretty good reproduction rate. If I remember correctly this causes a similar DMA issue on buddy (broadwell all in one platform) that also reproduces it (again if I remember correctly) constantly. We have reached out to Andy previously and he was unable to figure it out.

Thanks for the information, @cujomalainey. Let me see if we can find a samus or buddy here.

To reproduce it, do I need Chrome OS, or is Ubuntu fine? Should I just keep rebooting or run suspend/resume loops?

Member

@lgirdwood lgirdwood left a comment


I have no issue with leaving the granular items of PM D3 entry to PCI/BIOS. I assume this also works on non-coreboot platforms like the Dell XPS.

@lgirdwood
Member

To give further background - DW DMAC is shared between host and DSP FW on BDW. It can be programmed by both, there is no SW/FW arbitration over ownership. Host drivers will use DMA to load FW into DSP and then "hand over" control to FW. It is important that the DMAC is inactive/OFF (wrt host) when FW is performing DMAC (re)initialisation.

@dbaluta
Collaborator

dbaluta commented Mar 3, 2020

> To give further background - DW DMAC is shared between host and DSP FW on BDW. It can be programmed by both, there is no SW/FW arbitration over ownership. Host drivers will use DMA to load FW into DSP and then "hand over" control to FW. It is important that the DMAC is inactive/OFF (wrt host) when FW is performing DMAC (re)initialisation.

This is interesting, and it could be useful for our SDMA driver. How does this handover happen?

@cujomalainey
Author

@keyonjie that is correct, we never tested Ubuntu here, only ChromeOS

@cujomalainey
Author

@lgirdwood we don't have any dell XPS here to test with unfortunately.

@keyonjie

keyonjie commented Mar 4, 2020

@keqiaozhang @fredoh9 do we have any Broadwell Samus chromebook? Can we run Ubuntu on it?

@keyonjie

keyonjie commented Mar 4, 2020

@keqiaozhang can you help check whether we can get similar errors under stress on WSB/Dell XPS + Ubuntu (unselect SOF, select the SST driver) on our side?

[   53.154287] haswell-pcm-audio haswell-pcm-audio: initialising Audio DSP IPC
[   53.154302] haswell-pcm-audio haswell-pcm-audio: initialising audio DSP id 0x3438
[   53.164474] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed
[   53.164478] haswell-pcm-audio haswell-pcm-audio: sst_dma_new failed -22
[   53.165332] bdw-rt5677 bdw-rt5677: bdw_rt5677_probe entry
[   53.165335] bdw-rt5677 bdw-rt5677: bdw_rt5677_probe end
[   53.165338] bdw-rt5677 bdw-rt5677: snd_soc_add_dai_link start
[   53.165340] bdw-rt5677 bdw-rt5677: ASoC: binding System PCM
[   53.165342] bdw-rt5677 bdw-rt5677: ASoC: platform component haswell-pcm-audio not found for System PCM
[   53.165344] bdw-rt5677 bdw-rt5677: ASoC: sanity check failed System PCM, ret -517

@lgirdwood
Member

> > To give further background - DW DMAC is shared between host and DSP FW on BDW. It can be programmed by both, there is no SW/FW arbitration over ownership. Host drivers will use DMA to load FW into DSP and then "hand over" control to FW. It is important that the DMAC is inactive/OFF (wrt host) when FW is performing DMAC (re)initialisation.
>
> This is interesting. And could be useful for our SDMA driver. How does this hand over happens?

There is no logic enforcing this; the driver simply does not program the DMA IP after the FW has loaded. There is nothing stopping other kernel users from binding to the DMA engine and using it.
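A toy model of the convention @lgirdwood describes, with hypothetical names: the host owns the DMAC while downloading firmware, then quiesces it and hands it to the FW, and the FW may only (re)initialise it while the host side is idle. As stated above, nothing in hardware or in the driver actually enforces these checks; the model only makes the expected ordering explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the "handover" is a convention only, not an
 * enforced ownership mechanism. */
enum dmac_owner { DMAC_OWNER_HOST, DMAC_OWNER_FW };

struct dmac_model {
	enum dmac_owner owner;
	bool host_active; /* host still has channels programmed/enabled */
};

/* Host uses the DMAC to download firmware into the DSP. */
static bool host_start_fw_download(struct dmac_model *d)
{
	if (d->owner != DMAC_OWNER_HOST)
		return false; /* FW owns the DMAC; the host must keep out */
	d->host_active = true;
	return true;
}

/* Host finishes the download, quiesces the DMAC and hands it over. */
static void host_handover(struct dmac_model *d)
{
	d->host_active = false;
	d->owner = DMAC_OWNER_FW;
}

/* FW (re)initialises the DMAC; only safe once the host side is idle. */
static bool fw_init_dmac_is_safe(const struct dmac_model *d)
{
	return d->owner == DMAC_OWNER_FW && !d->host_active;
}
```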

@keyonjie

keyonjie commented Mar 5, 2020

@plbossart @cujomalainey I tried suspend/resume remotely on Samus + Ubuntu, running a topic/sof-dev kernel with the SST driver, for more than 20 cycles. I didn't see the DMA probe error, but I do see i915 suspend errors. Did you observe something similar?

[25993.633840] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[25993.634906] printk: Suspending console(s) (use no_console_suspend to debug)
[25993.660806] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[25993.660924] sd 0:0:0:0: [sda] Stopping disk
[25993.771463] ------------[ cut here ]------------
[25993.771465] CPU PWM1 enabled
[25993.771553] WARNING: CPU: 1 PID: 2210 at drivers/gpu/drm/i915/display/intel_display_power.c:4518 hsw_enable_pc8+0x5d3/0x650 [i915]
[25993.771555] Modules linked in: snd_soc_sst_bdw_rt5677_mach snd_soc_sst_haswell_pcm snd_soc_sst_firmware snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_sst_acpi snd_soc_acpi_intel_match snd_soc_acpi cdc_ether usbnet r8152 i915 i2c_algo_bit chromeos_laptop drm_kms_helper syscopyarea snd_soc_rt5677 sysfillrect intel_pch_thermal sysimgblt x86_pkg_temp_thermal regmap_i2c intel_powerclamp snd_soc_rl6231 fb_sys_fops snd_seq_midi snd_soc_rt5677_spi snd_seq_midi_event drm snd_rawmidi mei_me snd_soc_core mei snd_compress snd_seq snd_pcm snd_seq_device snd_timer snd soundcore spi_pxa2xx_platform atmel_mxt_ts efivarfs xhci_pci xhci_hcd
[25993.771591] CPU: 1 PID: 2210 Comm: kworker/u8:41 Not tainted 5.6.0-rc3+ #574
[25993.771593] Hardware name: GOOGLE Samus/Samus, BIOS MrChromebox-4.10 10/28/2019
[25993.771601] Workqueue: events_unbound async_run_entry_fn
[25993.771654] RIP: 0010:hsw_enable_pc8+0x5d3/0x650 [i915]
[25993.771658] Code: c0 75 22 e8 7f 3d e6 ff e9 bc fb ff ff e8 f2 5c d2 ce 0f 0b e9 db fa ff ff e8 e6 5c d2 ce 0f 0b e9 ce fb ff ff e8 da 5c d2 ce <0f> 0b e9 98 fb ff ff e8 ce 5c d2 ce 0f 0b e9 e6 fa ff ff e8 c2 5c
[25993.771661] RSP: 0018:ffffa7f2419bfd70 EFLAGS: 00010282
[25993.771664] RAX: 0000000000000000 RBX: ffff9bc7721b0000 RCX: 0000000000000007
[25993.771666] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff9bc776c98790
[25993.771668] RBP: ffff9bc7721b07b0 R08: 000017a42583bcd3 R09: ffffffff90aaa3f4
[25993.771670] R10: 00000000000002d7 R11: 00000000000319dc R12: ffff9bc7721b02d8
[25993.771672] R13: ffff9bc7721b02e8 R14: 0000000000000000 R15: 0000000000000002
[25993.771675] FS:  0000000000000000(0000) GS:ffff9bc776c80000(0000) knlGS:0000000000000000
[25993.771677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25993.771679] CR2: 0000555d4f0f8b40 CR3: 000000010e80a004 CR4: 00000000003606e0
[25993.771681] Call Trace:
[25993.771715]  i915_drm_suspend_late+0x5d/0x120 [i915]
[25993.771723]  ? pci_pm_poweroff_late+0x30/0x30
[25993.771727]  dpm_run_callback+0x4a/0x140
[25993.771732]  __device_suspend_late+0xdc/0x1b0
[25993.771736]  async_suspend_late+0x16/0x90
[25993.771741]  async_run_entry_fn+0x32/0x140
[25993.771745]  process_one_work+0x1d3/0x380
[25993.771749]  worker_thread+0x45/0x3c0
[25993.771754]  kthread+0xf6/0x130
[25993.771758]  ? process_one_work+0x380/0x380
[25993.771762]  ? kthread_park+0x80/0x80
[25993.771769]  ret_from_fork+0x35/0x40
[25993.771774] ---[ end trace 11d36126a563916c ]---
[25993.771797] ------------[ cut here ]------------
[25993.771799] PCH PWM1 enabled
[25993.771881] WARNING: CPU: 1 PID: 2210 at drivers/gpu/drm/i915/display/intel_display_power.c:4523 hsw_enable_pc8+0x63f/0x650 [i915]

@plbossart
Member

@keyonjie I just had this on SAMUS

[    2.866762] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed
[    2.866767] haswell-pcm-audio haswell-pcm-audio: sst_dma_new failed -22
[    2.874423] bdw-rt5677 bdw-rt5677: ASoC: failed to init link System PCM: -517
[    2.945261] intel_rapl_common: Found RAPL domain package
[    2.945263] intel_rapl_common: Found RAPL domain core
[    2.945265] intel_rapl_common: Found RAPL domain uncore
[    2.945266] intel_rapl_common: Found RAPL domain dram
[    2.951577] bdw-rt5677 bdw-rt5677: ASoC: failed to init link System PCM: -517

Plain vanilla Ubuntu 5.3 kernel.

@plbossart
Member

and again

[    3.243505] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed
[    3.243509] haswell-pcm-audio haswell-pcm-audio: sst_dma_new failed -22
[    3.262504] bdw-rt5677 bdw-rt5677: ASoC: failed to init link System PCM: -517
[    4.479826] iwlwifi 0000:01:00.0 wlp1s0: renamed from wlan0
[    4.482075] intel_rapl_common: Found RAPL domain package
[    4.482077] intel_rapl_common: Found RAPL domain core
[    4.482078] intel_rapl_common: Found RAPL domain uncore
[    4.482079] intel_rapl_common: Found RAPL domain dram
[    4.490543] bdw-rt5677 bdw-rt5677: ASoC: failed to init link System PCM: -517

@keyonjie

keyonjie commented Mar 6, 2020

OK, trying commit 4d856f7 now.

@cujomalainey
Author

@keyonjie no, I have not seen i915 issues.

@keyonjie

keyonjie commented Mar 17, 2020

@cujomalainey thanks for the information, that is an important hint. Although I don't know how the boot beep is implemented, I guess it uses the DW DMA to transfer data to the SSP FIFO; it would be good to check that the DMA is shut down correctly after the boot beep is done.
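One way to act on this suggestion, sketched under assumptions: before the host DMA driver probes, confirm that no DW DMAC channel is still enabled (for example by a firmware boot beep), using a bounded number of polls. The assumption that the low 8 bits of the channel-enable register hold per-channel enable flags should be checked against the DMAC databook; `wait_dmac_idle()` and `simulated_ch_en()` are hypothetical, and a real driver would sleep between reads (e.g. with `readl_poll_timeout()`).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Reads the DMAC channel-enable register; supplied by the caller
 * (a simulated register here, MMIO in a real driver). */
typedef uint32_t (*read_ch_en_fn)(void *ctx);

/* Hypothetical pre-probe check: wait until no channel is left enabled,
 * giving up after max_polls reads. Assumes bits 7:0 of CH_EN are the
 * per-channel enable flags. */
static int wait_dmac_idle(read_ch_en_fn read_ch_en, void *ctx,
			  unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if ((read_ch_en(ctx) & 0xffu) == 0)
			return 0;
	}
	return -ETIMEDOUT;
}

/* Simulated register for illustration: channel 0 stays enabled for
 * *remaining_reads reads, then the channel drains. */
static uint32_t simulated_ch_en(void *ctx)
{
	unsigned int *remaining_reads = ctx;

	if (*remaining_reads == 0)
		return 0;
	(*remaining_reads)--;
	return 0x1;
}
```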

@cujomalainey
Author

cujomalainey commented Mar 17, 2020

@keyonjie I doubt it is; that is the issue I had on CHT, see thesofproject/sof#2148. If you want to take a look at the code, here is a reference commit: https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/238859/

@rzwisler

So far I've only been able to test the v4.14 based version of this patch, but I can say pretty conclusively that the patch helps there. Without this patch, my system (with the beep turned on) was reproducing the issue 100%. With the patch, I wasn't able to reproduce the issue after several reboots.

I'm having a bit of trouble with my system at the moment, but I'll try and recover it and continue testing with the patch applied to a more recent baseline, but wanted to give you that feedback at least.

@keyonjie

> @keyonjie i doubt it is, that is the issue I had on CHT see thesofproject/sof#2148, if you want to take a look at the code here is a reference commit https://chromium-review.googlesource.com/c/chromiumos/platform/depthcharge/+/238859/

That makes sense, thanks @cujomalainey.

@keyonjie

> So far I've only been able to test the v4.14 based version of this patch, but I can say pretty conclusively that the patch helps there. Without this patch, my system (with the beep turned on) was reproducing the issue 100%. With the patch, I wasn't able to reproduce the issue after several reboots.
>
> I'm having a bit of trouble with my system at the moment, but I'll try and recover it and continue testing with the patch applied to a more recent baseline, but wanted to give you that feedback at least.

Thanks for the feedback, waiting for your result with recent code base.

@rzwisler

I've confirmed that the upstream version of this patch resolves this issue for me. With the patch's baseline commit:

30887e1 soundwire: bus: align with upstream

I'm able to reproduce the issue 100% on my samus.

With the commit applied here:

e699a33 Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue"

I've been completely unable to reproduce the issue.

@keyonjie

> I've confirmed that the upstream version of this patch resolves this issue for me. With the patch's baseline commit:
>
> 30887e1 soundwire: bus: align with upstream
>
> I'm able to reproduce the issue 100% on my samus.
>
> With the commit applied here:
>
> e699a33 Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue"
>
> I've been completely unable to reproduce the issue.

Hi @rzwisler, reverting the commit you mentioned is not the way we want to go. Can you help check whether the issue can be fixed with this commit applied, keeping the commit "ASoC: Intel: Work around to fix HW D3 potential crash issue" (0d2135e) in place rather than reverting it?

@rzwisler

Sorry, I'm confused. By "this commit" I thought you meant the commit that's added by this pull request:

https://github.com/thesofproject/linux/pull/1842/commits

Which is:

e699a33 Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue"

right?

@keyonjie

> Sorry, I'm confused. By "this commit" I thought you meant the commit that's added by this pull request:
>
> https://github.com/thesofproject/linux/pull/1842/commits
>
> Which is:
>
> e699a33 Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue"
>
> right?

Discard the commit from PR #1842 and apply the commit from PR #1873.

@rzwisler

I tested with the commit here:

151eab2 ASoC: Intel: haswell-dsp: refine D0<->D3 transition sequence

and my system still fails 100% of the time on boot:

dmesg | grep haswell

[ 41.124599] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed
[ 41.164464] haswell-pcm-audio haswell-pcm-audio: sst_dma_new failed -22

@keyonjie

> I tested with the commit here:
>
> 151eab2 ASoC: Intel: haswell-dsp: refine D0<->D3 transition sequence
>
> and my system still fails 100% of the time on boot:
>
> dmesg | grep haswell
>
> [ 41.124599] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed
> [ 41.164464] haswell-pcm-audio haswell-pcm-audio: sst_dma_new failed -22

Thanks for the update; so we still need to dig deeper to find the root cause.

@keyonjie

Adding Cezary who might be helpful to look from the FW side also. @crojewsk

Hi @crojewsk, we are hitting a DMA probing failure on SAMUS Chrome OS, as discussed above,
and it has been verified that reverting my 5-year-old commit 0d2135e fixes the issue.

My commit 0d2135e was written to follow the sequence suggested by a member of your team (I will forward you the thread). @plbossart raised some comments and corrections, and #1873 is what I wrote a few days ago to try to fix them.

The latest test result from @rzwisler shows that my refining commit #1873 doesn't help with this issue. @crojewsk, can you help take a look at it?

@crojewsk
Member

Hello, could you check out my wpt_dx branch? It is based on recent broonie/for-next (v5.6-rc6) with my latest fixes for BDW-based machine boards and a WIP patch addressing the incorrect D0 <-> D3 flow.

I didn't add the setup-defaults method yet; I will do it later. This too is not streamlined with the recommended flow.

I hope it's not some DMA race that started to occur due to the delays appended after each crucial register write to the DSP hardware. The code version from 3.14 lacks some of these.

I did basic tests on my BDW-Y RVP, though the issue never reproduced there in the first place.

@rzwisler

I tested with this commit:

ee506e8e2e01 (crojewsk/wpt_dx) [WIP] ASoC: Intel: haswell: Power transitions update

and verified that the issue still occurs:

[ 14.824629] haswell-pcm-audio haswell-pcm-audio: error: DMA device register failed
[ 14.824641] haswell-pcm-audio haswell-pcm-audio: sst_dma_new failed -22

@crojewsk
Copy link
Member

crojewsk commented Mar 26, 2020

I'll prepare a patch with the set-default method appended.
Meanwhile, @rzwisler: alter the patch, remove/reduce the delays and retest. I currently don't have access to a SAMUS, so I cannot reproduce anything, but you can. Your issue is tied to the power-up flow, i.e. _set_state_D0().

Also, you can remove the changes Keyon made in hsw_block_disable/enable in the initial patch, so we know whether this regression is indeed caused by _set_state_D0() or by the memory block handling instead.

One more thing: regardless of the outcome, I'm leaning toward delaying/quickening sst_dma_new. The flow present in wpt-dx is the recommended one, and the one WPT-LP was released with.

Awaiting your response.

@crojewsk
Member

Another thing: the logs you dump are quite minimal.
Please enable dyndbg for dw_dmac, either via modprobe.d if it's built as a module (=m), or, if it's built in (=y), by appending:

	ccflags-y += -DDEBUG

to /drivers/dma/dw/Makefile.

At the very least, grep for "DW_PARAMS". Thank you.

@crojewsk
Copy link
Member

Uploaded v2. Please re-test. Explanation later.

@crojewsk
Copy link
Member

@cujomalainey @rzwisler any info regarding wpt-dx v2?

@rzwisler

@crojewsk: Confirmed that this commit:

701a469 (crojewsk/wpt_dx) [WIP v2] ASoC: Intel: haswell: Power transitions update

solves the issue for me. Thanks!

@cujomalainey
Author

@rzwisler thanks for testing, I have multiple P1 bugs that I have been dealing with.

@crojewsk
Member

The following is true:

  • issue is connected to internal incorrect hw initialization procedure
  • said initialization needs to be addressed from SW side
  • setting dw_dmac and dw_dmac_core from =y to =m impacts issue reproduction
  • connected to initial "beep" sound, happening ~6-10 sec after system start
  • reproduces on both ChromeOS and Ubuntu (tested on v19.10)

Offending change:

static int hsw_set_dsp_D0(struct sst_dsp *sst)
{
(...)
	/* disable all clock gating */
	writel(0x0, sst->addr.pci_cfg + SST_VDRTCTL2);

'Revert "ASoC: Intel: Work around to fix HW D3 potential crash issue"' is NAKed, as it only covers the problem up and actually brings back undefined behavior: some registers (e.g. APLLSE) describe LPT rather than WPT, and only god knows what actually happens during power transitions when the driver issues incorrect writes and leaves the registers of interest alone.

Keyon's initial patch (the 5-year-old one) does not resolve the HW D3 issue at all, as it ignores the recommended sequence; it probably comes down to shutting down the core more than anything else.

Resolution:

  • adjust the D0 <-> D3 flow to the recommended sequence
  • SW initializes HW registers to the recommended defaults on power-state transitions
  • the /haswell implementation is to be updated to properly distinguish between LPT and WPT
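The third resolution item could take a shape like the following sketch: each PCH variant carries its own table of register offsets and recommended defaults, so the D0/D3 paths never apply LPT offsets on a WPT part. Every offset and value below is a placeholder, not a real SHIM or PCI-config value, and `apply_defaults()` writes into an array that stands in for `writel()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-variant description; all offsets/values are
 * placeholders for illustration only. */
enum pch_variant { PCH_LPT, PCH_WPT };

struct reg_default {
	uint32_t offset; /* byte offset into the register block */
	uint32_t value;  /* recommended default for that register */
};

struct variant_desc {
	const char *name;
	const struct reg_default *defaults;
	size_t n_defaults;
};

static const struct reg_default lpt_defaults[] = {
	{ 0x00, 0x00000001 }, { 0x08, 0x00000000 },
};
static const struct reg_default wpt_defaults[] = {
	{ 0x00, 0x00000003 }, { 0x0c, 0x00000010 }, { 0x10, 0x00000000 },
};

static const struct variant_desc variants[] = {
	[PCH_LPT] = { "LPT", lpt_defaults,
		      sizeof(lpt_defaults) / sizeof(lpt_defaults[0]) },
	[PCH_WPT] = { "WPT", wpt_defaults,
		      sizeof(wpt_defaults) / sizeof(wpt_defaults[0]) },
};

/* Writes the variant's defaults into a simulated 32-bit register file
 * (standing in for writel() on real hardware). Returns the number of
 * registers written. */
static size_t apply_defaults(enum pch_variant v, uint32_t *regs,
			     size_t regs_words)
{
	const struct variant_desc *d = &variants[v];
	size_t i;

	for (i = 0; i < d->n_defaults; i++) {
		size_t idx = d->defaults[i].offset / 4;

		assert(idx < regs_words);
		regs[idx] = d->defaults[i].value;
	}
	return d->n_defaults;
}
```

Keeping the offsets in per-variant data rather than shared `#define`s is one way to make an LPT/WPT mix-up like the APLLSE case visible at review time.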

@crojewsk
Copy link
Member

In the wake of the above, it's unsettling but important to mention:

This issue is NOT tied to SOF and should not be posted here. The official Intel client channel should have been used for tracking and resolving the issue. This goes to @cujomalainey and @rzwisler, as you should be familiar with the procedure: Cedrik's team and Harsha are your contacts.
The ticket was posted on Feb 29, over a month ago, while IGK was notified only last week, and by 'accident' during my short talk with Keyon. Dealing with old and usually forgotten issues on already released products is hard; that's why it's paramount that we communicate and work together. This message goes to the SOF folks.

Moreover, I requested additional testing right after receiving feedback on my initial patch, and no comments were added. The day after I got a SAMUS at home, the issue was bisected and fixed. I don't believe a standard bisection could not have been run during this whole month; you do not have to wait for my explicit request. Sitting idle does not solve anything.

This message has been forwarded to all relevant people.

@plbossart
Member

plbossart commented Mar 30, 2020

> In the wake of above, it's unsettling but important to mention:
>
> This issue is NOT tied to SOF and should not be posted here. Official Intel-client channel should have been used for tracking and resolving the issue. This goes to @cujomalainey and @rzwisler as you should be familiar with the procedure - Cedrik's team and Harsha are your contacts.
> Ticket has been posted on Feb 29, that is over a month ago while IGK has been notified only last week and by 'accident' during my short talk with Keyon. Dealing with old and usually forgotten issues on already released products is hard - that's why it's paramount that we communicate and work together. This message goes to SOF guys.

The first issue on Broadwell was filed on alsa-devel on July 29, 2019; see "[BUG] bdw-rt5650 DSP boot timeout".

You were in copy and even replied to this thread. There was no follow-up except for Keyon trying to revisit the sequence from 5 years ago when Google folks identified the problematic initialization sequence.

We appreciate the work you did last week, but please stop telling world+dog that the SOF folks are evil.

> Moreover, I've requested additional testing right after receiving feedback for my initial patch - no comments have been added. The next day after I got SAMUS into my home issue has been bisected and fixed. Don't believe standard bisection could not be issued during all (a month) this time - you do not have to wait on my explicit request. Sitting idle does not solve anything.
>
> This message has been forwarded to all necessary subjects.

@crojewsk giving lessons doesn't help, and no one stayed idle. There was a DRM-based issue that impacted bisects, and on Samus a very specific configuration is needed to support suspend/resume.

@plbossart plbossart closed this Mar 30, 2020
@rzwisler

> In the wake of above, it's unsettling but important to mention:
>
> This issue is NOT tied to SOF and should not be posted here. Official Intel-client channel should have been used for tracking and resolving the issue. This goes to @cujomalainey and @rzwisler as you should be familiar with the procedure - Cedrik's team and Harsha are your contacts.

We did bring this up with Intel last summer (July/August). According to the meeting minutes you attended the call, and both Cedrik and Harsha were on the CC. I'll forward you the minutes.

fengguang pushed a commit to 0day-ci/linux that referenced this pull request Mar 30, 2020
Update D0 <-> D3 sequence to correctly transition hardware and DSP core
from and to D3. On top of that, set SHIM registers to their recommended
defaults during D0 and D3 procedures as HW does not reset registers for
us.

Connected to:
[alsa-devel][BUG] bdw-rt5650 DSP boot timeout
https://mailman.alsa-project.org/pipermail/alsa-devel/2019-July/153098.html

Github issue ticket reference:
thesofproject#1842

Tested on:
- BDW-Y RVP with rt286
- SAMUS with rt5677

Proposed solution (both in July 2019 and on github):
'Revert "ASoC: Intel: Work around to fix HW d3 potential crash issue"'
is NAKed as it only covers the problem up and actually brings back the
undefined behavior: some registers (e.g.: APLLSE) are describing LPT
offsets rather than WPT ones. In consequence, during power-transitions
driver issues incorrect writes and leaves the regs of interest alone.

Existing patch - the non-revert - does not resolve the HW D3 issue at
all as it ignores the recommended sequence and does not initialize
hardware registers as expected. And thus, leaving things as are is also
unacceptable.

Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
morimoto pushed a commit to morimoto/linux that referenced this pull request Apr 20, 2020

Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
Tested-by: Ross Zwisler <zwisler@google.com>
Link: https://lore.kernel.org/r/20200330194520.13253-1-cezary.rojewski@intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
@cujomalainey
Author

Apologies for dredging up an old thread.

@crojewsk we applied your patch to our 4.14 kernel (it applied cleanly) and got a bunch of failed tests back from the test team, mainly headset mic recording. We can't test with upstream kernels right now, as our toolchains can't build upstream at the moment, but figured you might want to check it out.

@cujomalainey cujomalainey deleted the bdw-fix branch April 28, 2020 23:39
PlaidCat added a commit to ctrliq/kernel-src-tree that referenced this pull request Sep 12, 2024
jira LE-1907
Rebuild_History Non-Buildable kernel-4.18.0-294.el8
Rebuild_CHGLOG: - [sound] ALSA: ASoC: Intel: haswell: Power transition refactor (Jaroslav Kysela) [1869536]
Rebuild_FUZZ: 94.00%
commit-author Cezary Rojewski <cezary.rojewski@intel.com>
commit 8ec7d60


	Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
	Tested-by: Ross Zwisler <zwisler@google.com>
Link: https://lore.kernel.org/r/20200330194520.13253-1-cezary.rojewski@intel.com
	Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit 8ec7d60)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
8 participants