Skip to content

Conversation

@plbossart
Copy link
Member

Updated from initial upstream contribution by Heikki Krogerius

@plbossart
Copy link
Member Author

turns out the helper is not needed, as mentioned by @jwrdegoede in previous comments.

Rework and update (put_device() use was incorrect).

@plbossart plbossart force-pushed the fix/device-properties2 branch from 8d68b39 to 4983c3b Compare March 29, 2021 17:45
bardliao
bardliao previously approved these changes Mar 30, 2021
lyakh
lyakh previously approved these changes Mar 30, 2021
Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really familiar with this API, but FWIW LGTM

@plbossart
Copy link
Member Author

@jwrdegoede any comments? I struggled on the atom drivers since they all use a different structure, but I think this is correct now.

@plbossart
Copy link
Member Author

plbossart commented Apr 1, 2021

also forgot to ask @andy-shev for feedback.

@andy-shev
Copy link

also forgot to ask @andy-shev for feedback.

Code looks good. Just want to be sure I understand the lifetime constraints. You have a board file (where this code is applied) which lives shorter or longer than codec device. Is it correct? And actually which is the right one (shorter or longer)? In some cases one may use https://elixir.bootlin.com/linux/v5.12-rc5/source/drivers/base/swnode.c#L1076 but with care.

@plbossart
Copy link
Member Author

also forgot to ask @andy-shev for feedback.

Code looks good. Just want to be sure I understand the lifetime constraints. You have a board file (where this code is applied) which lives shorter or longer than codec device. Is it correct? And actually which is the right one (shorter or longer)? In some cases one may use https://elixir.bootlin.com/linux/v5.12-rc5/source/drivers/base/swnode.c#L1076 but with care.

The platform devices and their drivers (modified here) depend on the functionality provided by the codec device/drivers, so the machine drivers will live for shorter time than the codec ones. The machine platform device is created AFTER we detect the presence of a codec device, and conversely we remove the machine platform device BEFORE removing the codec device.

That was the main argument behind NOT using device managed APIs, as Heikki did in his initial patch submitted upstream, as we are adding the properties on the codec device but from the machine driver, so when we remove the machine device the properties are not removed. It's much better to add/remove properties in the same scope, we've been there before and using devm_ creates lots of issues in our load/unload tests. That's what we are doing for SoundWire machine drivers, but for old Baytrail stuff we didn't, so we have to add an explicit remove step in the machine platform driver.

@andy-shev
Copy link

also forgot to ask @andy-shev for feedback.

Code looks good. Just want to be sure I understand the lifetime constraints. You have a board file (where this code is applied) which lives shorter or longer than codec device. Is it correct? And actually which is the right one (shorter or longer)? In some cases one may use https://elixir.bootlin.com/linux/v5.12-rc5/source/drivers/base/swnode.c#L1076 but with care.

The platform devices and their drivers (modified here) depend on the functionality provided by the codec device/drivers, so the machine drivers will live for shorter time than the codec ones. The machine platform device is created AFTER we detect the presence of a codec device, and conversely we remove the machine platform device BEFORE removing the codec device.

That was the main argument behind NOT using device managed APIs, as Heikki did in his initial patch submitted upstream, as we are adding the properties on the codec device but from the machine driver, so when we remove the machine device the properties are not removed. It's much better to add/remove properties in the same scope, we've been there before and using devm_ creates lots of issues in our load/unload tests. That's what we are doing for SoundWire machine drivers, but for old Baytrail stuff we didn't, so we have to add an explicit remove step in the machine platform driver.

Right, that explains how above patch is organazed and which APIs it uses. Thanks! To me it's clear and correct solution.

@plbossart
Copy link
Member Author

Something's wrong here, there is a refcount error when removing the device software node

[  119.949838] bytcr_rt5640 bytcr_rt5640: before device remove software node
[  119.950358] ------------[ cut here ]------------
[  119.950395] refcount_t: underflow; use-after-free.
[  119.950483] WARNING: CPU: 1 PID: 1317 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
[  119.950541] Modules linked in: snd_seq_dummy snd_hrtimer iptable_nat snd_soc_sst_bytcr_rt5640 intel_powerclamp iwlmvm iwlwifi mei_txe mei snd_sof_acpi_intel_byt(-) snd_sof_intel_ipc snd_sof_acpi snd_sof_xtensa_dsp snd_soc_rt5670 snd_sof snd_soc_rt5651 ledtrig_audio snd_soc_rt5645 snd_intel_sst_acpi snd_intel_sst_core snd_soc_rt5640 snd_soc_rl6231 snd_soc_sst_atom_hifi2_platform snd_soc_acpi_intel_match snd_soc_acpi regmap_i2c snd_soc_core snd_intel_dspcfg snd_intel_sdw_acpi snd_compress snd_seq snd_seq_device snd_pcm snd_timer snd soundcore zram mmc_block i915 i2c_algo_bit xhci_pci drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_hcd drm sdhci_acpi sdhci fuse
[  119.951352] CPU: 1 PID: 1317 Comm: rmmod Not tainted 5.12.0-rc4-test+ #8
[  119.951391] Hardware name: ZOTAC XXXXXX/Cherry Trail FFD, BIOS 5.11 09/28/2016
[  119.951419] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
[  119.951465] Code: 05 0e 0a 3b 01 01 e8 41 92 65 00 0f 0b c3 80 3d fc 09 3b 01 00 75 95 48 c7 c7 80 17 5b 8f c6 05 ec 09 3b 01 01 e8 22 92 65 00 <0f> 0b c3 80 3d db 09 3b 01 00 0f 85 72 ff ff ff 48 c7 c7 d8 17 5b
[  119.951501] RSP: 0018:ffffb0c2813a3d78 EFLAGS: 00010286
[  119.951544] RAX: 0000000000000000 RBX: ffff9b80c5fd5e28 RCX: 0000000000000027
[  119.951638] RDX: 0000000000000027 RSI: 0000000058e38e39 RDI: ffff9b81349d7738
[  119.951676] RBP: ffff9b80c5936810 R08: ffff9b81349d7730 R09: 0000000000000000
[  119.951706] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[  119.951735] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  119.951764] FS:  00007f38a6fe4b80(0000) GS:ffff9b8134800000(0000) knlGS:0000000000000000
[  119.951799] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  119.951829] CR2: 00007f38a7051b40 CR3: 000000003eb1a000 CR4: 00000000001006e0
[  119.951860] Call Trace:
[  119.951896]  snd_byt_rt5640_mc_remove+0x2f/0x43 [snd_soc_sst_bytcr_rt5640]
[  119.951963]  platform_remove+0x1a/0x40
[  119.952014]  device_release_driver_internal+0xf1/0x1d0
[  119.952065]  bus_remove_device+0xed/0x160
[  119.952113]  device_del+0x187/0x3e0
[  119.952169]  platform_device_del.part.0+0xe/0x60
[  119.952216]  platform_device_unregister+0x17/0x30
[  119.952263]  snd_sof_device_remove+0x53/0xf0 [snd_sof]
[  119.952367]  sof_acpi_remove+0x2f/0x40 [snd_sof_acpi]
[  119.952414]  platform_remove+0x1a/0x40
[  119.952459]  device_release_driver_internal+0xf1/0x1d0
[  119.952508]  driver_detach+0x42/0x90
[  119.952554]  bus_remove_driver+0x56/0xd0
[  119.952667]  __x64_sys_delete_module+0x18f/0x250
[  119.952728]  ? exit_to_user_mode_prepare+0x2f/0x130
[  119.952784]  do_syscall_64+0x33/0x40
[  119.952827]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  119.952869] RIP: 0033:0x7f38a711135b
[  119.952904] Code: 73 01 c3 48 8b 0d 15 1b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e5 1a 0c 00 f7 d8 64 89 01 48
[  119.952939] RSP: 002b:00007ffc837cd388 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  119.952983] RAX: ffffffffffffffda RBX: 000055969ecf0760 RCX: 00007f38a711135b
[  119.953013] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055969ecf07c8
[  119.953042] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  119.953070] R10: 00007f38a7184ac0 R11: 0000000000000206 R12: 00007ffc837cd5c0
[  119.953098] R13: 00007ffc837cd89a R14: 000055969ecf02a0 R15: 000055969ecf0760
[  119.953178] ---[ end trace 3f804da710f70803 ]---
[  119.953213] bytcr_rt5640 bytcr_rt5640: before after remove software node

Not sure what's happening here. full dmesg here
dmesg.txt

@plbossart
Copy link
Member Author

same issue with SoundWire devices, and this was already reported in CI results but the results are PASS
https://sof-ci.01.org/linuxpr/PR2810/build5522/devicetest/?model=TGLU_RVP_SDW&testcase=check-kmod-load-unload

@marc-hb @fredoh9 @aiChaoSONG any idea why we pass a test with such a blatant refcount issue?

@marc-hb
Copy link
Collaborator

marc-hb commented Apr 12, 2021

https://sof-ci.01.org/linuxpr/PR2810/build5522/devicetest/?model=TGLU_RVP_SDW&testcase=check-kmod-load-unload
@marc-hb @fredoh9 @aiChaoSONG any idea why we pass a test with such a blatant refcount issue?

Probably because we don't have a culture of "testing the tests" and we're just not trying hard to find issues in general. A test that never reported any failure hasn't been tested and probably does not work, yet for some reasons we consider passing tests as good enough to make even major tests changes. I've been trying to change this "green failure" culture for some time but finally understood it's not anyone's priority. We didn't even have CI for sof-test when I joined, so not all is bad :-)

I see the dmesg.txt at that link does not have any "error" keyword, that could be why. What is the log level of this Call Trace?


We love extending ignore_str to avoid looking at issues we want to believe don't matter but that seems to be unfortunately the only time we do some "test testing". There are extremely few people contributing to sof-test with code and reviews (just check the https://github.com/thesofproject/sof-test/graphs/contributors and other stats) it is not considered as a priority. The quality bar is also fairly low, for instance:

Why are some other cases have DMESG_LOG_START_LINE? actually I don't know, ask the owner!

Etc., these are just some relevant examples from the top of my head.

fengguang pushed a commit to 0day-ci/linux that referenced this pull request Apr 13, 2021
On Tue, Apr 13, 2021 at 03:20:45PM +0300, Heikki Krogerus wrote:
> On Mon, Apr 12, 2021 at 03:36:20PM -0500, Pierre-Louis Bossart wrote:
> > I took the code and split it in two for BYT/CHT (modified to remove devm_)
> > and SoundWire parts (added as is).
> >
> > thesofproject#2810
> >
> > Both cases result in a refcount error on device_remove_sof when removing the
> > platform device. I don't understand the code well enough to figure out what
> > happens, but it's likely a case of the software node being removed twice?
>
> Right. Because you are injecting the node to an already existing
> device, the node does not get linked with the device in sysfs. That
> would increment the reference count in a normal case. It all happens
> in the function software_node_notify(). Driver core calls it when a
> device is added and also when it's removed. In this case it is only
> called when it's removed.
>
> I think the best way to handle this now is to simply not decrementing
> the ref count when you add the properties, so don't call
> fwnode_handle_put() there (but add a comment explaining why you are
> not calling it).

No, sorry... That's just too hacky. Let's not do that after all.

We can also fix this in the software node code. I'm attaching a patch
that should make it possible to inject the software nodes also
afterwards safely. This is definitely also not without its problems,
but we can go with that if it works. Let me know.

> For a better solution you would call device_reprobe() after you have
> injected the software node, but before that you need to modify
> device_reprobe() so it calls device_platform_notify() (which it really
> should call in any case). But this should probable be done later,
> separately.
>
> thanks,
>
> P.S.
>
> Have you guys considered the possibility of describing the connections
> between all these components by using one of the methods that we now
> have for that in kernel, for example device graph? It can now be
> used also with software nodes (OF graph and ACPI device graph are of
> course already fully supported).

Br,

--
heikki

From 3d3dca1f80941f8975390bc8f488176d00acef22 Mon Sep 17 00:00:00 2001
From: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Date: Tue, 13 Apr 2021 16:28:34 +0300
Subject: [PATCH] software node: Allow node addition to already existing device

If the node is added to an already exiting device, the node
needs to be also linked to the device separately.

This will make sure the reference count is kept in balance
also when the node is injected to a device afterwards.

Reported-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Fixes: e68d011 ("software node: Introduce device_add_software_node()")
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
@plbossart plbossart dismissed stale reviews from lyakh and bardliao via c102eb6 April 13, 2021 14:47
@plbossart plbossart force-pushed the fix/device-properties2 branch from 4983c3b to c102eb6 Compare April 13, 2021 14:47
@plbossart plbossart requested review from bardliao and lyakh April 19, 2021 20:24
@plbossart
Copy link
Member Author

The first patch in this PR was merged by GregKH, so we can merge and contribute the two other patches upstream after v5.13-rc1

@plbossart
Copy link
Member Author

rebased to make SoundWire error go away

@plbossart plbossart force-pushed the fix/device-properties2 branch from c102eb6 to 72848ae Compare April 19, 2021 20:27
The function device_add_properties() is going to be removed.
Replacing it with software node API equivalents.

Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
@plbossart plbossart force-pushed the fix/device-properties2 branch from 72848ae to 33126f3 Compare May 14, 2021 20:38
@plbossart
Copy link
Member Author

Rebased on top of v5.13-rc1 to make initial commit go away. This should be good to merge and send upstream now.

Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing significant, maybe a minor future clean up

The function device_add_properties() is going to be removed.
Replacing it with software node API equivalents.

Co-developed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
@plbossart plbossart force-pushed the fix/device-properties2 branch from 33126f3 to 5f50121 Compare May 24, 2021 18:49
@plbossart plbossart requested a review from lyakh May 24, 2021 18:50
@plbossart
Copy link
Member Author

I'll go ahead and merge, this PR has been around for quite some time and I want to send the patches upstream.

@plbossart plbossart merged commit caca198 into thesofproject:topic/sof-dev May 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants