Skip to content

Conversation

@plbossart
Copy link
Member

This is a companion PR to thesofproject/sof#4611, where we added a 'Deep buffer" device for the headset.

This PR fixes issues on trigger_start where a BE could be trigger twice.

Still an issue with the logic on stop with multiple pipelines, both FEs think the other is active and the BE is never stopped.

test command:
TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg  ~/sof-test/test-case/multiple-pipeline.sh -f p
[   22.647889]  SDW0-Playback: ASoC: trigger BE SDW0-Playback cmd 0
[   22.647894]  Jack Out DeepBuffer: snd_soc_dpcm_check_state: Jack Out DeepBuffer: state 4 states[0] 4
[   22.647897]  SDW0-Playback: ASoC: trigger BE SDW0-Playback cmd 0: BE cannot stop 4
[   22.647898]  Jack Out: ASoC: post trigger FE Jack Out cmd 0
[   22.647902] sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 0 dir 0 cmd 0
[   22.647908] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP

[   22.647951]  Jack Out: snd_soc_dpcm_check_state: Jack Out: state 4 states[0] 4
[   22.647953]  SDW0-Playback: ASoC: trigger BE SDW0-Playback cmd 0: BE cannot stop 4
[   22.647955]  Jack Out DeepBuffer: ASoC: post trigger FE Jack Out DeepBuffer cmd 0
[   22.647957] sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 31 dir 0 cmd 0
[   22.648060] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[   22.648068] sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x140000 successful
[   22.648072] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[   22.648171] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PCM_FREE

[   22.648177]  Jack Out: ASoC: hw_free FE Jack Out
[   22.648177] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[   22.648179] sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: free stream 0 dir 0
[   22.648188]  Jack Out DeepBuffer: snd_soc_dpcm_check_state: Jack Out DeepBuffer: state 4 states[0] 4

[   22.648195]  Jack Out: ASoC: close FE Jack Out
[   22.648197] sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: close stream 0 dir 0
[   22.648199]  SDW0-Playback: ASoC: BE SDW0-Playback event 2 dir 0
[   22.648224]  Jack Out: ASoC: BE playback disconnect check for SDW0-Playback
[   22.648226]  Jack Out: freed DSP playback path Jack Out -> SDW0-Playback
[   22.648227]  Jack Out: reparent playback path Jack Out DeepBuffer -> SDW0-Playback
[   22.648269] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[   22.648275] sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x1c0]=0x440000 successful
[   22.648276] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[   22.648529] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[   22.648536] sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x180]=0x240000 successful
[   22.648541] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[   22.648712] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PCM_FREE

[   22.648720]  Jack Out DeepBuffer: ASoC: hw_free FE Jack Out DeepBuffer
[   22.648723] sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: free stream 31 dir 0
[   22.648751] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[   22.648775] sdw_deprepare_stream: SDW0 Pin2-Playback: inconsistent state state 3
[   22.648779] sdw_deprepare_stream: SDW0 Pin2-Playback: state 3

dmesg.log

@bardliao
Copy link
Collaborator

@plbossart The issue is that we will disconnect a BE and FE in dpcm_fe_dai_close() -> dpcm_fe_dai_cleanup() -> dpcm_be_disconnect(). Ideally, the process is

  1. Trigger stop
  2. Close
    But when we stop 2 PCM with the same BE in a really short time, it will become
  3. Trigger stop pcm 0,0 -> We will see "BE cannot stop" here since pcm 0,31 is still opened
  4. Trigger stop pcm 0,31 -> We sill see "BE cannot stop" here since pcm 0,0 is still opened
  5. Close pcm 0,0
  6. Close pcm 0,31

And it will be no issue when the sequence is

  1. Trigger stop pcm 0,0 -> We will see "BE cannot stop" here since pcm 0,31 is still opened
  2. Close pcm 0,0
  3. Trigger stop pcm 0,31
  4. Close pcm 0,31

@plbossart plbossart force-pushed the sdw/debug-dpcm-states branch 2 times, most recently from 42303f6 to 645ae57 Compare August 16, 2021 22:42
@plbossart
Copy link
Member Author

plbossart commented Aug 16, 2021

@bardliao with the update I am able to run more tests, but still some unrelated issues when running the multiple pipelines tests more than once -> fail on the second run. There is likely a leak somewhere that is exposed by the deep-buffer topology.

root@plb:~# TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg  ~/sof-test/test-case/multiple-pipeline.sh -f c
/usr/bin/jq
2021-08-16 22:50:33 UTC [INFO] Starting /usr/local/bin/sof-logger -t -l /etc/sof/sof-cml.ldc -o /root/sof-test/logs/multiple-pipeline/2021-08-16-17:50:32-1910/slogger.txt
2021-08-16 22:50:33 UTC [INFO] /root/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg to run the test case
2021-08-16 22:50:33 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2021-08-16 22:50:33 UTC [INFO] Run command to get pipeline parameters
2021-08-16 22:50:33 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg -f 'type:any' -b ' pcm:HDA Digital' -s 0 -e
2021-08-16 22:50:33 UTC [INFO] ===== Testing: (Loop: 1/1) =====
2021-08-16 22:50:33 UTC [INFO] /root/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg to run the test case
2021-08-16 22:50:33 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2021-08-16 22:50:33 UTC [INFO] Run command to get pipeline parameters
2021-08-16 22:50:33 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg -f 'type:capture' -b ' pcm:HDA Digital' -s 0 -e
2021-08-16 22:50:33 UTC [INFO] Testing: Jack In [hw:0,1]
2021-08-16 22:50:33 UTC [COMMAND] arecord   -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/null -q
2021-08-16 22:50:33 UTC [INFO] Testing: Microphone [hw:0,4]
2021-08-16 22:50:33 UTC [COMMAND] arecord   -D hw:0,4 -c 2 -r 48000 -f S16_LE /dev/null -q
2021-08-16 22:50:33 UTC [INFO] /root/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg to run the test case
2021-08-16 22:50:33 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2021-08-16 22:50:33 UTC [INFO] Run command to get pipeline parameters
2021-08-16 22:50:33 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg -f 'type:playback' -b ' pcm:HDA Digital' -s 0 -e
2021-08-16 22:50:33 UTC [INFO] Testing: Jack Out [hw:0,0]
2021-08-16 22:50:33 UTC [COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2021-08-16 22:50:33 UTC [INFO] Testing: Jack Out DeepBuffer [hw:0,31]
2021-08-16 22:50:33 UTC [COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2021-08-16 22:50:33 UTC [INFO] wait 0.5s for aplay_opts()
2021-08-16 22:50:33 UTC [INFO] checking pipeline status
2021-08-16 22:50:33 UTC [INFO] Letting playback/capture run for 5s
2021-08-16 22:50:38 UTC [INFO] checking pipeline status again
2021-08-16 22:50:38 UTC [COMMAND] pkill -9 aplay arecord
/root/sof-test/test-case/../case-lib/lib.sh: line 354:  1453 Killed                  arecord $SOF_ALSA_OPTS $SOF_ARECORD_OPTS "$@"
/root/sof-test/test-case/../case-lib/lib.sh: line 354:  1446 Killed                  arecord $SOF_ALSA_OPTS $SOF_ARECORD_OPTS "$@"
/root/sof-test/test-case/../case-lib/lib.sh: line 348:  1481 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/root/sof-test/test-case/../case-lib/lib.sh: line 348:  1473 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
declare -- cmd="journalctl_cmd --since=@1629154228"
2021-08-16 22:50:41 UTC [INFO] pkill -TERM sof-logger
Terminated
2021-08-16 22:50:42 UTC [INFO] nlines=2173 /root/sof-test/logs/multiple-pipeline/2021-08-16-17:50:32-1910/slogger.txt
2021-08-16 22:50:42 UTC [INFO] Test Result: PASS!
root@plb:~# TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg  ~/sof-test/test-case/multiple-pipeline.sh -f c
/usr/bin/jq
2021-08-16 22:50:52 UTC [INFO] Starting /usr/local/bin/sof-logger -t -l /etc/sof/sof-cml.ldc -o /root/sof-test/logs/multiple-pipeline/2021-08-16-17:50:52-19417/slogger.txt
2021-08-16 22:50:52 UTC [INFO] /root/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg to run the test case
2021-08-16 22:50:52 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2021-08-16 22:50:52 UTC [INFO] Run command to get pipeline parameters
2021-08-16 22:50:52 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg -f 'type:any' -b ' pcm:HDA Digital' -s 0 -e
2021-08-16 22:50:52 UTC [INFO] ===== Testing: (Loop: 1/1) =====
2021-08-16 22:50:52 UTC [INFO] /root/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg to run the test case
2021-08-16 22:50:52 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2021-08-16 22:50:52 UTC [INFO] Run command to get pipeline parameters
2021-08-16 22:50:52 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg -f 'type:capture' -b ' pcm:HDA Digital' -s 0 -e
2021-08-16 22:50:52 UTC [INFO] Testing: Jack In [hw:0,1]
2021-08-16 22:50:52 UTC [COMMAND] arecord   -D hw:0,1 -c 2 -r 48000 -f S16_LE /dev/null -q
2021-08-16 22:50:52 UTC [INFO] Testing: Microphone [hw:0,4]
2021-08-16 22:50:52 UTC [COMMAND] arecord   -D hw:0,4 -c 2 -r 48000 -f S16_LE /dev/null -q
2021-08-16 22:50:52 UTC [INFO] /root/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg to run the test case
2021-08-16 22:50:52 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2021-08-16 22:50:52 UTC [INFO] Run command to get pipeline parameters
2021-08-16 22:50:52 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg -f 'type:playback' -b ' pcm:HDA Digital' -s 0 -e
2021-08-16 22:50:52 UTC [INFO] Testing: Jack Out [hw:0,0]
2021-08-16 22:50:52 UTC [COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2021-08-16 22:50:52 UTC [INFO] Testing: Jack Out DeepBuffer [hw:0,31]
2021-08-16 22:50:52 UTC [INFO] wait 0.5s for aplay_opts()
2021-08-16 22:50:52 UTC [COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
arecord: main:852: arecord: main:852: audio open error: Device or resource busy
aplay: main:852: audio open error: Device or resource busyaplay: main:852: 
audio open error: Device or resource busy
audio open error: Device or resource busy
2021-08-16 22:50:52 UTC [INFO] checking pipeline status
2021-08-16 22:50:52 UTC [ERROR] Running process count is 0, but 4 is expected
2021-08-16 22:50:52 UTC [ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2021-08-16 22:50:52 UTC [ERROR]  func_error_exit()  @  /root/sof-test/test-case/multiple-pipeline.sh
2021-08-16 22:50:52 UTC [ERROR]  ps_checks()  @  /root/sof-test/test-case/multiple-pipeline.sh:135
2021-08-16 22:50:52 UTC [ERROR]  main()  @  /root/sof-test/test-case/multiple-pipeline.sh:171
2021-08-16 22:50:53 UTC [INFO] Starting /usr/local/bin/sof-logger  -l /etc/sof/sof-cml.ldc -o /root/sof-test/logs/multiple-pipeline/2021-08-16-17:50:52-19417/etrace.txt
2021-08-16 22:50:55 UTC [INFO] pkill -TERM sof-logger
Terminated
2021-08-16 22:50:56 UTC [INFO] nlines=3 /root/sof-test/logs/multiple-pipeline/2021-08-16-17:50:52-19417/etrace.txt
2021-08-16 22:50:56 UTC [INFO] Test Result: FAIL!
[   38.475563] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_set_up_pipelines: connecting src BUF1.1 to sink ALH2.OUT
[   38.475572] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x30030000: GLB_TPLG_MSG: COMP_CONNECT
[   38.475671] sof-audio-pci-intel-cnl 0000:00:1f.3: error: ipc error for 0x30030000 size 12
[   38.475683] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_set_up_pipelines: failed IPC: -22
[   38.475692] sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to restore pipeline after resume -22

@bardliao
Copy link
Collaborator

@plbossart I don't see the same issue as yours. But sometimes I can see soundwire_intel soundwire_intel.link.1: Bus clash for data word.

@plbossart
Copy link
Member Author

@bardliao @ranj063 it seems we have an issue with the SOF changes that are not yet upstream. With the same patches on top of Mark Brown's for-next branch https://github.com/plbossart/sound/tree/sdw-upstream/debug-dpcm-states I get good results with sof-tests. I get a CONNECT IPC error with the SOF patches.

@plbossart
Copy link
Member Author

@bardliao @ranj063 with additional traces I showed the the ALH2.OUT (the dai on the mixer path) is not freed in the first suspend, so not re-initialized on resume -> IPC error on the second run. This only happens with the command

"TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg ~/sof-test/test-case/multiple-pipeline.sh -f p"

dmesg.log

@bardliao
Copy link
Collaborator

@plbossart I tried two different devices but still can't reproduce the issue on my side. Does it need a specific FW version?

@plbossart
Copy link
Member Author

@bardliao this PR requires the SOF topology changes in thesofproject/sof#4611

@plbossart
Copy link
Member Author

@bardliao @ranj063 With the added traces, I was able to show that there is an imbalance at the ALH DAI level. When we open two PCM streams, we end-up doing the DAI HW_PARAMS twice, but the HW_FREE happens only once.

This is done at a higher level than drivers/soundwire/, not sure what the issue is.

See attached log (look for the '<<<' marker)

dmesg.log

@plbossart
Copy link
Member Author

@bardliao @ranj063 it's really bad if you look at the added traces. the ALH HW_PARAMS can be called multiple times. That will cause not only a refcount imbalance but worse it will add the stream multiple times and reserve the PDI multiple times.

Wow. That's completely broken.

dmesg.log

@bardliao
Copy link
Collaborator

@bardliao this PR requires the SOF topology changes in thesofproject/sof#4611

Yes, but it is all about topology, not FW. Anyway, I can see the "doing the DAI HW_PARAMS twice, but the HW_FREE happens only once" issue, but somehow the test result of "~/sof-test/test-case/multiple-pipeline.sh -f p" is PASS on my side.

The issue is that be->dpcm[stream].state is SND_SOC_DPCM_STATE_HW_PARAMS when the second dpcm_be_dai_hw_params is called. IMHO dpcm_be_dai_hw_params doesn't need to do soc_pcm_hw_params when the be->dpcm[stream].state is already SND_SOC_DPCM_STATE_HW_PARAMS.
Removing (be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_PARAMS) && can fix the issue.

Another issue issue is that when we start 2 pcms and stop 1 manually, TRIGGER_STOP will be launched and lead to IO error on the other pcm. See dmesg_io_error.txt

I think we should move the operation of be->dpcm[stream].be_start before be state check. Because the second
dpcm_be_dai_trigger will "continue" in normal case.

@plbossart
Copy link
Member Author

@bardliao this PR requires the SOF topology changes in thesofproject/sof#4611

Yes, but it is all about topology, not FW. Anyway, I can see the "doing the DAI HW_PARAMS twice, but the HW_FREE happens only once" issue, but somehow the test result of "~/sof-test/test-case/multiple-pipeline.sh -f p" is PASS on my side.

It fails the second time you run the test, because we don't reinitialize ALH2.OUT due to the refcount !=0 after the first pas.

The issue is that be->dpcm[stream].state is SND_SOC_DPCM_STATE_HW_PARAMS when the second dpcm_be_dai_hw_params is called. IMHO dpcm_be_dai_hw_params doesn't need to do soc_pcm_hw_params when the be->dpcm[stream].state is already SND_SOC_DPCM_STATE_HW_PARAMS.
Removing (be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_PARAMS) && can fix the issue.

This is a big question. It's my understanding that the hw_params can be called multiple times, but if there are more than one users it makes no sense.

Another issue issue is that when we start 2 pcms and stop 1 manually, TRIGGER_STOP will be launched and lead to IO error on the other pcm. See dmesg_io_error.txt

I think we should move the operation of be->dpcm[stream].be_start before be state check. Because the second
dpcm_be_dai_trigger will "continue" in normal case.

You're probably right indeed, we should track the number of requests, not if a state change actually happens.

@plbossart plbossart force-pushed the sdw/debug-dpcm-states branch from 75d76f8 to 809ce53 Compare August 19, 2021 22:21
@plbossart
Copy link
Member Author

@bardliao @RanderWang @ranj063 the updated PR seems to pass sof-test and monkey tests. Please take a look.

@bardliao
Copy link
Collaborator

@bardliao this PR requires the SOF topology changes in thesofproject/sof#4611

Yes, but it is all about topology, not FW. Anyway, I can see the "doing the DAI HW_PARAMS twice, but the HW_FREE happens only once" issue, but somehow the test result of "~/sof-test/test-case/multiple-pipeline.sh -f p" is PASS on my side.

It fails the second time you run the test, because we don't reinitialize ALH2.OUT due to the refcount !=0 after the first pas.

I think I know why the test is PASS on my side now. swidget->use_count = 0; is only set when v->abi_version < SOF_ABI_VER(3, 19, 0) and I tested it with SOF v3.18.

@bardliao
Copy link
Collaborator

@plbossart I ran run-all-tests.sh with thesofproject/sof#4611 topology and found no regression.
Note that I tested with SOF v1.8 because my DUT doesn't support community key.

BTW, I met below issue when running test_suspend-resume-with-capture test.

2021-08-20 09:10:53 UTC [ERROR] Caught kernel log error
===========================>>
[ 8291.132370] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 0
[ 8291.133425] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 1
[ 8291.135019] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 2
[ 8291.136415] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 3
[ 8291.138008] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 4
[ 8291.139426] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 5
[ 8291.141022] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 6
[ 8291.142412] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 7
[ 8291.144007] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 8
[ 8291.145409] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 9
[ 8291.147003] kernel: soundwire_intel soundwire_intel.link.0: intel_resume failed: MCP_CONTROL_HW_RST is not cleared
<<===========================

But I don't think it is a regression.

@plbossart
Copy link
Member Author

@plbossart I ran run-all-tests.sh with thesofproject/sof#4611 topology and found no regression.
Note that I tested with SOF v1.8 because my DUT doesn't support community key.

BTW, I met below issue when running test_suspend-resume-with-capture test.

2021-08-20 09:10:53 UTC [ERROR] Caught kernel log error
===========================>>
[ 8291.132370] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 0
[ 8291.133425] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 1
[ 8291.135019] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 2
[ 8291.136415] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 3
[ 8291.138008] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 4
[ 8291.139426] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 5
[ 8291.141022] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 6
[ 8291.142412] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 7
[ 8291.144007] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 8
[ 8291.145409] kernel: soundwire_intel soundwire_intel.link.0: intel_resume: MCP_CONTROL_HW_RST is not cleared at iteration 9
[ 8291.147003] kernel: soundwire_intel soundwire_intel.link.0: intel_resume failed: MCP_CONTROL_HW_RST is not cleared
<<===========================

But I don't think it is a regression.

bummer. We've seen this before but it's the first time I see this report since we reduce the SoundWire reset delay to the default 4096 cycles. we must be missing something.

Just to be clear, this on a CML device, yes?

@plbossart
Copy link
Member Author

@bardliao this PR requires the SOF topology changes in thesofproject/sof#4611

Yes, but it is all about topology, not FW. Anyway, I can see the "doing the DAI HW_PARAMS twice, but the HW_FREE happens only once" issue, but somehow the test result of "~/sof-test/test-case/multiple-pipeline.sh -f p" is PASS on my side.

It fails the second time you run the test, because we don't reinitialize ALH2.OUT due to the refcount !=0 after the first pas.

I think I know why the test is PASS on my side now. swidget->use_count = 0; is only set when v->abi_version < SOF_ABI_VER(3, 19, 0) and I tested it with SOF v3.18.

Ah yes indeed. Thanks for testing this configuration though, it's useful since I don't have this setup on CML.

@plbossart
Copy link
Member Author

@bardliao @ranj063 the updates from today help avoid a leak on the ALH DAI, but we still have a fundamental issue that the entire pipeline is 'active' so when we suspend the refcount is not zero! That's not a SoundWire issue...

Last log:

[ 104.495491] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_tear_down_pipelines: error: widget PIPELINE.1.ALH2.OUT is still in use 5
[ 104.495493] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_tear_down_pipelines: error: widget ALH2.OUT is still in use 1
[ 104.495494] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_tear_down_pipelines: error: widget BUF1.1 is still in use 1
[ 104.495496] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_tear_down_pipelines: error: widget BUF1.0 is still in use 1
[ 104.495497] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_tear_down_pipelines: error: widget PGA1.0 is still in use 1
[ 104.495499] sof-audio-pci-intel-cnl 0000:00:1f.3: sof_tear_down_pipelines: error: widget PCM0P is still in use 1

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this took some time to go through. I picked one concern on repeated hw_params call, but other than that, nothing popped up. The sanity checks look like very welcome additions.

@plbossart plbossart force-pushed the sdw/debug-dpcm-states branch from c25d3e4 to f39353c Compare August 26, 2021 23:22
@plbossart plbossart requested a review from lyakh as a code owner August 26, 2021 23:22
@plbossart
Copy link
Member Author

@bardliao Can you take this update for a spin? I included @ranj063's patches so that we can test a single PR, it's easier.

I have been testing on SoundWire with and without the deep-buffer solution, no issue. I will try the nocodec tests with the daily test devices.

if (!dma)
return -EIO;

if (dma->hw_params) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart can I handle this part in SOF by using the configred flag in

int hda_ctrl_dai_widget_setup(struct snd_soc_dapm_widget *w)

Then we won't have to repeat it for all DAIs

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I think you need this anyway. I've added an extra patch to my PR to handle it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you lost me here. the patch in your PR seems to address all DAIs, so what I added in the SoundWire-specific part seems like dead code?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but my code wouldnt deal with the PDI stuff

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't get the answer @ranj063. Wouldn't your change result in the SDW DAI hw_free() being called?

I am worried of a possible double-free if you free on the SOF side and I redo it on the SoundWire side - since we are using different variables to decide to do a free...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no @plbossart. My change only deals with the use_counts for the DAI widget. It will call only call hda_ctrl_dai_widget_free().

There won't be a double free because the first call will reset the configured flag for the DAI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the call graph @ranj063 there will be a double free...

intel_hw_params
  intel_hw_free
    intel_free_stream
      sdw_dai_config_ipc
        hda_ctrl_dai_widget_free

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to revert "ASoC: SOF: handle paused streams during system suspend" now. Otherwise I will see

=== PAUSE ===                                                             Suspended. Trying resume. Failed. Restarting stream. aplay: suspend:1691: suspend: prepare error: Invalid argument
[  880.519173] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x30100000: GLB_TPLG_MSG: PIPE_NEW
[  880.519273] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx error for 0x30100000 (msg/reply size: 48/20): -22
[  880.519275] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to load widget PIPELINE.31.ALH258.OUT
[  880.519276] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to restore pipeline after resume -22
[  880.519277] PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -22
[  880.519284] sof-audio-pci-intel-tgl 0000:00:1f.3: PM: failed to resume async: error -22

dmesg.txt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bardliao there must be something we missed, IIRC you are using firmware 1.8 and we're doing something different. Gah.

@plbossart
Copy link
Member Author

Really bad test results on some devices - but not all - with Intel test 6131.

Who knew it'd be so hard to add a stupid mixer.

plbossart and others added 9 commits August 27, 2021 15:39
When more than one FE is connected to a BE, e.g. in a mixing use case,
the BE can be triggered multiple times when the FE are opened/started
concurrently. This race condition is problematic in the case of
SoundWire BE dailinks, and this is not desirable in a general
case. The code carefully checks when the BE can be stopped or
hw_free'ed, but the trigger code does not use any mutual exclusion.

Fix by using the same spinlock already used to check FE states, and
set the state before the trigger. In case of errors,  the initial
state will be restored.

This patch does not change how the triggers are handled, it only makes
sure the states are handled in critical sections.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
On start/pause_release/resume, when more than one FE is connected to
the same BE, it's possible that the trigger is sent more than
once. This is not desirable, we only want to trigger a BE once, which
is straightforward to implement with a refcount.

For stop/pause/suspend, the problem is more complicated: the check
implemented in snd_soc_dpcm_can_be_free_stop() may fail due to a
conceptual deadlock when we trigger the BE before the FE. In this
case, the FE states have not yet changed, so there are corner cases
where the TRIGGER_STOP is never sent - the dual case of start where
multiple triggers might be sent.

This patch suggests an unconditional trigger in all cases, without
checking the FE states, using a refcount protected by a spinlock.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
A BE connected to more than one FE, e.g. in a mixer case, can go
through the following transitions.

play FE1    -> BE state is START
pause FE1   -> BE state is PAUSED
play FE2    -> BE state is START
stop FE2    -> BE state is STOP
release FE1 -> BE state is START
stop FE1    -> BE state is STOP

play FE1    -> BE state is START
pause FE1   -> BE state is PAUSED
play FE2    -> BE state is START
release FE1 -> BE state is START
stop FE2    -> BE state is START
stop FE1    -> BE state is STOP

The existing code for PAUSE_RELEASE only allows for the case where the
BE is paused, which clearly would not work in the sequences above.

Extend the allowed states to restart the BE when PAUSE_RELEASE is
received.

Reported-by: Bard Liao <bard.liao@intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Before suspending, walk through all the widgets to make sure all
refcounts are zero. If not, the resume will not work and random errors
will be reported. Adding this paranoia check will help identify leaks
and broken sequences.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This reverts commit 048ecd7.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
The stream management currently flags an 'inconsistent state' error
when a change is requested multiple times. This was added on purpose
to identify programming mistakes.

In hindsight, there was no real reason to fail if the logic at the
ASoC-DPCM level invokes the same callback multiple times. It's
perfectly acceptable to just return and not flag an error when there
is nothing to do. The main concern with the state management is to
trap errors such as trying to enable a stream that was not prepared
first.

This patch suggests allowing the stream functions to be idempotent,
i.e. they can be called multiple times.

Note that the prepare case was already handling multiple calls, this
was added in commit c32464c ("soundwire: stream: only prepare
stream when it is configured.")

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
It's not unusual for ALSA/ASoC hw_params callbacks to be invoked
multiple times. This can happen when the application sets different
configurations, but also in the case of multiple FEs connected to a
single BE, the ASoC-DPCM framework will propagate hw_params on the BE
dailink for each FE configuration.

Unfortunately this is not supported by the SoundWire 'stream' API
today, since it mixes allocation and configurations.

In current implementations the SoundWire configurations are fixed, so
we can just filter additional calls. If/when the configurations start
varying, we will need a complete overhaul of the stream API to split
allocation and configurations in two completely separate steps.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
We don't really need to pass a substream to the callback, we only need
the direction. No functionality change, only simplification to enable
improve suspend with paused streams.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
This patch provides both a simplification of the suspend flows and a
better balanced operation during suspend/resume transition, as part of
the transition of Sound Open Firmware (SOF) to dynamic pipelines: the
DSP resources are only enabled when required instead of enabled on
startup.

The exiting code relies on a convoluted way of dealing with suspend
signals. Since there is no .suspend DAI callback, we used the
component .suspend and marked all the component DAI dmas as
'suspended'. The information was used in the .prepare stage to
differentiate resume operations from xrun handling, and only
reinitialize SHIM registers and DMA in the former case.

While this solution has been working reliably for about 2 years, there
is a much better solution consisting in trapping the TRIGGER_SUSPEND
in the .trigger DAI ops. The DMA is still marked in the same way for
the .prepare op to run, but in addition the callbacks sent to DSP
firmware are now balanced.

Normal operation:
hw_params -> intel_params_stream
hw_free   -> intel_free_stream

suspend    -> intel_free_stream
prepare    -> intel_params_stream

This balanced operation was not required with existing SOF firmware
relying on static pipelines instantiated at every boot. With the
on-going transition to dynamic pipelines, it's however a requirement
to keep the use count for the DAI widget balanced across all
transitions.

The component suspend is not removed but instead modified to deal with
a corner case: when a substream is PAUSED, the ALSA core does not
throw the TRIGGER_SUSPEND. This is problematic since the refcount for
all pipelines and widgets is not balanced, leading to issues on
resume. The trigger callback keeps track of the 'paused' state with a
new flag, which is tested during the component suspend called later to
release the remaining DSP resources. These resources will be
re-enabled in the .prepare step.

The IPC used in the TRIGGER_SUSPEND to release DSP resources is not a
problem since the BE dailink is already marked as non-atomic.

Co-developed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
@plbossart plbossart force-pushed the sdw/debug-dpcm-states branch from f39353c to 0559ef6 Compare August 27, 2021 20:49
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now. Nothing pops in CI results.

be->dpcm[stream].state = SND_SOC_DPCM_STATE_START;
spin_unlock_irqrestore(&fe->card->dpcm_lock, flags);

ret = soc_pcm_trigger(be_substream, cmd);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes me a bit suspicious: doesn't it mean that we set the state to START before actually executing the trigger, and since we have dropped the lock, we can now be preempted in an inconsistent state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem might happen if there is an error, and in the mean time the state changed. That would result in whatever pre-empted the current thread to be inconsistent indeed. but I have no solution here, there is no good way to prevent this - we don't want to lock during a trigger.


break;
case SNDRV_PCM_TRIGGER_PAUSE_RELEASE:
spin_lock_irqsave(&fe->card->dpcm_lock, flags);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to the commit message: in the first case when pausing FE1 caused BE to go into PAUSED, shouldn't stopping FE2 return the BE into PAUSED instead of pulling it into STOP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I replied on this but for completeness we don't have a stack of states, it's just easier to figure out what to do without keeping track of the past.

case SNDRV_PCM_TRIGGER_PAUSE_RELEASE:
spin_lock_irqsave(&fe->card->dpcm_lock, flags);
if (be->dpcm[stream].state != SND_SOC_DPCM_STATE_PAUSED) {
if (be->dpcm[stream].state != SND_SOC_DPCM_STATE_START &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want to trigger RELEASE if we're already in START, i.e. active?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think of the case with two streams and a mixer. one of the two streams may be paused with the other isn't, so the BE remains ACTIVE. When we receive the TRIGGER_RELEASE, it's provided by the BE but there's nothing to do.

struct sdw_cdns *cdns = snd_soc_dai_get_drvdata(dai);

pm_runtime_mark_last_busy(cdns->dev);
pm_runtime_put_autosuspend(cdns->dev);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to commit: can we maybe explain what was wrong with that commit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just a mechanical revert-and-redo. The explanations should be found on the replacement: this didn't work with the suspended-while-paused state.

@plbossart
Copy link
Member Author

replaced by #3146

@plbossart plbossart closed this Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants