-
Notifications
You must be signed in to change notification settings - Fork 140
SoundWire/ASoC-DPCM: debug incorrect states #3098
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SoundWire/ASoC-DPCM: debug incorrect states #3098
Conversation
|
@plbossart The issue is that we will disconnect a BE and FE in dpcm_fe_dai_close() -> dpcm_fe_dai_cleanup() -> dpcm_be_disconnect(). Ideally, the process is
And it will be no issue when the sequence is
|
42303f6 to
645ae57
Compare
|
@bardliao with the update I am able to run more tests, but still some unrelated issues when running the multiple pipelines tests more than once -> fail on the second run. There is likely a leak somewhere that is exposed by the deep-buffer topology. |
|
@plbossart I don't see the same issue as yours. But sometimes I can see |
|
@bardliao @ranj063 it seems we have an issue with the SOF changes that are not yet upstream. With the same patches on top of Mark Brown's for-next branch https://github.com/plbossart/sound/tree/sdw-upstream/debug-dpcm-states I get good results with sof-tests. I get a CONNECT IPC error with the SOF patches. |
|
@bardliao @ranj063 with additional traces I showed the the ALH2.OUT (the dai on the mixer path) is not freed in the first suspend, so not re-initialized on resume -> IPC error on the second run. This only happens with the command "TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg ~/sof-test/test-case/multiple-pipeline.sh -f p" |
|
@plbossart I tried two different devices but still can't reproduce the issue on my side. Does it need a specific FW version? |
|
@bardliao this PR requires the SOF topology changes in thesofproject/sof#4611 |
|
@bardliao @ranj063 With the added traces, I was able to show that there is an imbalance at the ALH DAI level. When we open two PCM streams, we end-up doing the DAI HW_PARAMS twice, but the HW_FREE happens only once. This is done at a higher level than drivers/soundwire/, not sure what the issue is. See attached log (look for the '<<<' marker) |
Yes, but it is all about topology, not FW. Anyway, I can see the "doing the DAI HW_PARAMS twice, but the HW_FREE happens only once" issue, but somehow the test result of "~/sof-test/test-case/multiple-pipeline.sh -f p" is PASS on my side. The issue is that Another issue issue is that when we start 2 pcms and stop 1 manually, TRIGGER_STOP will be launched and lead to IO error on the other pcm. See dmesg_io_error.txt I think we should move the operation of be->dpcm[stream].be_start before be state check. Because the second |
It fails the second time you run the test, because we don't reinitialize ALH2.OUT due to the refcount !=0 after the first pas.
This is a big question. It's my understanding that the hw_params can be called multiple times, but if there are more than one users it makes no sense.
You're probably right indeed, we should track the number of requests, not if a state change actually happens. |
75d76f8 to
809ce53
Compare
|
@bardliao @RanderWang @ranj063 the updated PR seems to pass sof-test and monkey tests. Please take a look. |
I think I know why the test is PASS on my side now. swidget->use_count = 0; is only set when |
|
@plbossart I ran run-all-tests.sh with thesofproject/sof#4611 topology and found no regression. BTW, I met below issue when running test_suspend-resume-with-capture test. But I don't think it is a regression. |
bummer. We've seen this before but it's the first time I see this report since we reduce the SoundWire reset delay to the default 4096 cycles. we must be missing something. Just to be clear, this on a CML device, yes? |
Ah yes indeed. Thanks for testing this configuration though, it's useful since I don't have this setup on CML. |
2a59396 to
c25d3e4
Compare
|
@bardliao @ranj063 the updates from today help avoid a leak on the ALH DAI, but we still have a fundamental issue that the entire pipeline is 'active' so when we suspend the refcount is not zero! That's not a SoundWire issue... Last log: [ 104.495491] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_tear_down_pipelines: error: widget PIPELINE.1.ALH2.OUT is still in use 5 |
kv2019i
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this took some time to go through. I picked one concern on repeated hw_params call, but other than that, nothing popped up. The sanity checks look like very welcome additions.
c25d3e4 to
f39353c
Compare
| if (!dma) | ||
| return -EIO; | ||
|
|
||
| if (dma->hw_params) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart can I handle this part in SOF by using the configred flag in
linux/sound/soc/sof/intel/hda.c
Line 44 in c4bbda2
| int hda_ctrl_dai_widget_setup(struct snd_soc_dapm_widget *w) |
Then we won't have to repeat it for all DAIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I think you need this anyway. I've added an extra patch to my PR to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you lost me here. the patch in your PR seems to address all DAIs, so what I added in the SoundWire-specific part seems like dead code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but my code wouldnt deal with the PDI stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't get the answer @ranj063. Wouldn't your change result in the SDW DAI hw_free() being called?
I am worried of a possible double-free if you free on the SOF side and I redo it on the SoundWire side - since we are using different variables to decide to do a free...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no @plbossart. My change only deals with the use_counts for the DAI widget. It will call only call hda_ctrl_dai_widget_free().
There won't be a double free because the first call will reset the configured flag for the DAI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the call graph @ranj063 there will be a double free...
intel_hw_params
intel_hw_free
intel_free_stream
sdw_dai_config_ipc
hda_ctrl_dai_widget_free
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to revert "ASoC: SOF: handle paused streams during system suspend" now. Otherwise I will see
=== PAUSE === Suspended. Trying resume. Failed. Restarting stream. aplay: suspend:1691: suspend: prepare error: Invalid argument
[ 880.519173] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x30100000: GLB_TPLG_MSG: PIPE_NEW
[ 880.519273] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx error for 0x30100000 (msg/reply size: 48/20): -22
[ 880.519275] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to load widget PIPELINE.31.ALH258.OUT
[ 880.519276] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to restore pipeline after resume -22
[ 880.519277] PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -22
[ 880.519284] sof-audio-pci-intel-tgl 0000:00:1f.3: PM: failed to resume async: error -22
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bardliao there must be something we missed, IIRC you are using firmware 1.8 and we're doing something different. Gah.
|
Really bad test results on some devices - but not all - with Intel test 6131. Who knew it'd be so hard to add a stupid mixer. |
When more than one FE is connected to a BE, e.g. in a mixing use case, the BE can be triggered multiple times when the FE are opened/started concurrently. This race condition is problematic in the case of SoundWire BE dailinks, and this is not desirable in a general case. The code carefully checks when the BE can be stopped or hw_free'ed, but the trigger code does not use any mutual exclusion. Fix by using the same spinlock already used to check FE states, and set the state before the trigger. In case of errors, the initial state will be restored. This patch does not change how the triggers are handled, it only makes sure the states are handled in critical sections. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
On start/pause_release/resume, when more than one FE is connected to the same BE, it's possible that the trigger is sent more than once. This is not desirable, we only want to trigger a BE once, which is straightforward to implement with a refcount. For stop/pause/suspend, the problem is more complicated: the check implemented in snd_soc_dpcm_can_be_free_stop() may fail due to a conceptual deadlock when we trigger the BE before the FE. In this case, the FE states have not yet changed, so there are corner cases where the TRIGGER_STOP is never sent - the dual case of start where multiple triggers might be sent. This patch suggests an unconditional trigger in all cases, without checking the FE states, using a refcount protected by a spinlock. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
A BE connected to more than one FE, e.g. in a mixer case, can go through the following transitions. play FE1 -> BE state is START pause FE1 -> BE state is PAUSED play FE2 -> BE state is START stop FE2 -> BE state is STOP release FE1 -> BE state is START stop FE1 -> BE state is STOP play FE1 -> BE state is START pause FE1 -> BE state is PAUSED play FE2 -> BE state is START release FE1 -> BE state is START stop FE2 -> BE state is START stop FE1 -> BE state is STOP The existing code for PAUSE_RELEASE only allows for the case where the BE is paused, which clearly would not work in the sequences above. Extend the allowed states to restart the BE when PAUSE_RELEASE is received. Reported-by: Bard Liao <bard.liao@intel.com> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Before suspending, walk through all the widgets to make sure all refcounts are zero. If not, the resume will not work and random errors will be reported. Adding this paranoia check will help identify leaks and broken sequences. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This reverts commit 048ecd7. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
The stream management currently flags an 'inconsistent state' error when a change is requested multiple times. This was added on purpose to identify programming mistakes. In hindsight, there was no real reason to fail if the logic at the ASoC-DPCM level invokes the same callback multiple times. It's perfectly acceptable to just return and not flag an error when there is nothing to do. The main concern with the state management is to trap errors such as trying to enable a stream that was not prepared first. This patch suggests allowing the stream functions to be idempotent, i.e. they can be called multiple times. Note that the prepare case was already handling multiple calls, this was added in commit c32464c ("soundwire: stream: only prepare stream when it is configured.") Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
It's not unusual for ALSA/ASoC hw_params callbacks to be invoked multiple times. This can happen when the application sets different configurations, but also in the case of multiple FEs connected to a single BE, the ASoC-DPCM framework will propagate hw_params on the BE dailink for each FE configuration. Unfortunately this is not supported by the SoundWire 'stream' API today, since it mixes allocation and configurations. In current implementations the SoundWire configurations are fixed, so we can just filter additional calls. If/when the configurations start varying, we will need a complete overhaul of the stream API to split allocation and configurations in two completely separate steps. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
We don't really need to pass a substream to the callback, we only need the direction. No functionality change, only simplification to enable improve suspend with paused streams. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This patch provides both a simplification of the suspend flows and a better balanced operation during suspend/resume transition, as part of the transition of Sound Open Firmware (SOF) to dynamic pipelines: the DSP resources are only enabled when required instead of enabled on startup. The exiting code relies on a convoluted way of dealing with suspend signals. Since there is no .suspend DAI callback, we used the component .suspend and marked all the component DAI dmas as 'suspended'. The information was used in the .prepare stage to differentiate resume operations from xrun handling, and only reinitialize SHIM registers and DMA in the former case. While this solution has been working reliably for about 2 years, there is a much better solution consisting in trapping the TRIGGER_SUSPEND in the .trigger DAI ops. The DMA is still marked in the same way for the .prepare op to run, but in addition the callbacks sent to DSP firmware are now balanced. Normal operation: hw_params -> intel_params_stream hw_free -> intel_free_stream suspend -> intel_free_stream prepare -> intel_params_stream This balanced operation was not required with existing SOF firmware relying on static pipelines instantiated at every boot. With the on-going transition to dynamic pipelines, it's however a requirement to keep the use count for the DAI widget balanced across all transitions. The component suspend is not removed but instead modified to deal with a corner case: when a substream is PAUSED, the ALSA core does not throw the TRIGGER_SUSPEND. This is problematic since the refcount for all pipelines and widgets is not balanced, leading to issues on resume. The trigger callback keeps track of the 'paused' state with a new flag, which is tested during the component suspend called later to release the remaining DSP resources. These resources will be re-enabled in the .prepare step. The IPC used in the TRIGGER_SUSPEND to release DSP resources is not a problem since the BE dailink is already marked as non-atomic. Co-developed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
f39353c to
0559ef6
Compare
kv2019i
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now. Nothing pops in CI results.
| be->dpcm[stream].state = SND_SOC_DPCM_STATE_START; | ||
| spin_unlock_irqrestore(&fe->card->dpcm_lock, flags); | ||
|
|
||
| ret = soc_pcm_trigger(be_substream, cmd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes me a bit suspicious: doesn't it mean that we set the state to START before actually executing the trigger, and since we have dropped the lock, we can now be preempted in an inconsistent state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem might happen if there is an error, and in the mean time the state changed. That would result in whatever pre-empted the current thread to be inconsistent indeed. but I have no solution here, there is no good way to prevent this - we don't want to lock during a trigger.
|
|
||
| break; | ||
| case SNDRV_PCM_TRIGGER_PAUSE_RELEASE: | ||
| spin_lock_irqsave(&fe->card->dpcm_lock, flags); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to the commit message: in the first case when pausing FE1 caused BE to go into PAUSED, shouldn't stopping FE2 return the BE into PAUSED instead of pulling it into STOP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I replied on this but for completeness we don't have a stack of states, it's just easier to figure out what to do without keeping track of the past.
| case SNDRV_PCM_TRIGGER_PAUSE_RELEASE: | ||
| spin_lock_irqsave(&fe->card->dpcm_lock, flags); | ||
| if (be->dpcm[stream].state != SND_SOC_DPCM_STATE_PAUSED) { | ||
| if (be->dpcm[stream].state != SND_SOC_DPCM_STATE_START && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we want to trigger RELEASE if we're already in START, i.e. active?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think of the case with two streams and a mixer. one of the two streams may be paused with the other isn't, so the BE remains ACTIVE. When we receive the TRIGGER_RELEASE, it's provided by the BE but there's nothing to do.
| struct sdw_cdns *cdns = snd_soc_dai_get_drvdata(dai); | ||
|
|
||
| pm_runtime_mark_last_busy(cdns->dev); | ||
| pm_runtime_put_autosuspend(cdns->dev); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to commit: can we maybe explain what was wrong with that commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's just a mechanical revert-and-redo. The explanations should be found on the replacement: this didn't work with the suspended-while-paused state.
|
replaced by #3146 |
This is a companion PR to thesofproject/sof#4611, where we added a 'Deep buffer" device for the headset.
This PR fixes issues on trigger_start where a BE could be trigger twice.
Still an issue with the logic on stop with multiple pipelines, both FEs think the other is active and the BE is never stopped.
dmesg.log