Skip to content

Conversation

@keyonjie
Copy link
Contributor

@keyonjie keyonjie commented Oct 10, 2021

The series includes 2 fixes:

  1. return failure in case ipc_init() fails.
  2. dma-trace: avoid multiple allocations to fix fuzzy IPC.

@keyonjie keyonjie requested a review from akloniex as a code owner October 10, 2021 00:39
@keyonjie keyonjie force-pushed the master_fix branch 3 times, most recently from eeaa77d to 5e86bce Compare October 11, 2021 06:02
@keyonjie keyonjie changed the title dma-trace: avoid multiple enabling to fix fuzzy IPC Minor fixes Oct 11, 2021
Copy link
Collaborator

@paulstelian97 paulstelian97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First two look like they should be squashed. I'll accept this whether they're squashed or not though.


/* return if already initialized */
if (buffer->addr)
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't you reset all the fields? The test on the address will not clear the rest, so you could be in a weird state....

        buffer->w_ptr = buffer->addr;
	buffer->r_ptr = buffer->addr;
	buffer->end_addr = (char *)buffer->addr + buffer->size;
	buffer->avail = 0;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't you reset all the fields? The test on the address will not clear the rest, so you could be in a weird state....

The intention of my change here is only to test and avoid multiple initializations , not to clear buffer, the clearing will be perform in dma_trace_buffer_free() e.g. "memset(buffer, 0, sizeof(*buffer));"

        buffer->w_ptr = buffer->addr;
	buffer->r_ptr = buffer->addr;
	buffer->end_addr = (char *)buffer->addr + buffer->size;
	buffer->avail = 0;

/* return if already initialized */
if (buffer->addr)
return 0;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I am at it, if rballoc initializes to zero, while if there a bzero() line 239?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I am at it, if rballoc initializes to zero, while if there a bzero() line 239?

good point, @lyakh has an argument that we should not perform 0 initialization for rballoc() and leave the decision to be made by the caller. Maybe let me add a new rbzalloc() for 0-initialized allocation.

@keyonjie
Copy link
Contributor Author

First two look like they should be squashed. I'll accept this whether they're squashed or not though.

Thanks, I would leave them as they are, as I want the testbench part to be separated for easier bisection.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SOF_MEM_ZONE_SYS_SHARED works the same way as SOF_MEM_ZONE_SYS in that it panics if allocation fails. No need to check return value, this is never done on purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SOF_MEM_ZONE_SYS_SHARED works the same way as SOF_MEM_ZONE_SYS in that it panics if allocation fails. No need to check return value, this is never done on purpose.

We should not depend on the panic inside the allocator, I will merge the _SYS_SHARED zone to BUFFER zone in PR #4747.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we rely on them in many locations and I have been reminded about that at least twice and since then I myself repeated that request to others. I don't find this particularly pretty, but I also see the logic behind that design. However, if you move those allocations to a non-system zone, we might have more work to do to clean those up. I currently count more than 15 allocations from SOF_MEM_ZONE_SYS_SHARED in my local tree. If you really change them all, I think a separate patch would be good for that, addressing them all.
Also, those allocations are of type "allocate once at init and never free" so, maybe we can switch most or all of them to use .data or .bss?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyakh who corrected? this really seems to be a policy call. if SYS_SHARED has guaranteed allocations don't have to be checked (like glib in Linux, you panic if you fail to allocate), this should be followed universally. or if not, it should be the same all over the place.

This is basicly a fundamental design question for any reusable C code. Either you check allocs or you do not. Both are deployed successfully, but you have to stick to the plan. If you start mixing both approaches, then it gets really complicated.

Personally, I don't know the history, but I'm a bit skeptical of the usefulness of SYS/SYS_SHARED. Some code/checks is saved, but it seems non-trivial to enforce the policy when SYS/SYS_SHARED should be used. E..g IPC is really application code and one can argue whether it should use the SYS heap at all, or should it be limited to RTOS usage. OTOH, in most products using SOF, IPC is mandatory part of the system, so use of SYS makes sense, but this is now a product call. And so forth and forth, seems complex when scaled up to more and more products.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for filing #4865 to track the TODOs, @lyakh .

@kv2019i agree that most of the sys/sys_shared usage today are actually runtime usage for us, imagine that those secondary cores related should be freed at powering down a specific core, DMA trace and be disabled. On the other hand, we reserve relative big zones for sys/sys_runtime for APL today, that's one of the reason that we are hitting OOM issues there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to check now as Zephyr mode requires a check.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i I'm sure this is the current official design: SYS/SYS_SHARED failures cause a panic, they must not be checked. And I haven't seen any violations of this in the code, if I had, I'd probably complain.

I'm not against changing that, but then we have to declare that change properly, not as a minor fix, and change it globally.

As I commented elsewhere in this PR - we need a formal API, once we have one, we can follow it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, we need to assume we are using Zephyr allocator and this could fail. No harm in xtos mode as we transition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to not like this one either. Does the description itself of the rballoc function state that the buffer is zeroed? If it's not stated in there it isn't happening from the point of view of the caller.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @paulstelian97 , this caught my eye as well, but seems Keyon now added that to rballoc documentation in alloc.h.

OTOH, L236 is a bit different. @keyonjie , rballoc() doesn't guarantee the zeros are written to memory. At least the Zephyr version of rballoc doesn't guarantee it. Is this still intentional to remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paulstelian97 @kv2019i maybe let me to keep the L236 at this point, though writeback it without any writing here looks a bit odd.

Copy link
Collaborator

@paulstelian97 paulstelian97 Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking keep both 235 (bzero) and 236 (writeback). No need to remove the bzero when you don't have a guarantee (but instead an xtos implementation detail) that rballoc zeroes things. Just keep it around (is there such a big performance impact that it matters?)

Or remove both if there's no real need for the buffer to be zeroed. Just don't omit bzero but expect zeros, that is all. bzero and the writeback should work together (either both or neither in the code)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need both bzero() and WB here, but can we also make sure Zephyr wrapper has the same behavior with buffer allocations (wrt to bzero())

Copy link
Collaborator

@kv2019i kv2019i Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, or @keyonjie just drop this patch from the series? This isn't performance critical code and this is not really making the code easier to understand or maintain, so not sure of the benefit.

Looking at other dma-trace.c , I start to wonder what is the benefit allocation the buffer from cached memory. DSP only writes to the memory once, and then DMA hw reads (uncached) so DSP always needs to writeback the data. GIven you always need to writeback, it's not clear there's any benefit to using a cached mapping. But it does make the code harder to read and more error prone.

FYI to @marc-hb and @lyakh who've been looking at trace issues.

Copy link
Member

@plbossart plbossart Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, let's be consistent. Either rballoc() zeroes out the buffer or it doesn't. I don't see why this is done in a piecemeal operation related to dma-trace.

I have a really hard time btw with the notion of "minor fixes' for this PR. the moment you change the dma trace it's no longer minor but can have large consequences on the work of others. Can we please learn from the snaffus in the last 3 months and be careful?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @plbossart. After you call calloc() / kzalloc() you do not immediately call memset(0) on the same memory.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A few questions (inline).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @paulstelian97 , this caught my eye as well, but seems Keyon now added that to rballoc documentation in alloc.h.

OTOH, L236 is a bit different. @keyonjie , rballoc() doesn't guarantee the zeros are written to memory. At least the Zephyr version of rballoc doesn't guarantee it. Is this still intentional to remove?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyakh who corrected? this really seems to be a policy call. if SYS_SHARED has guaranteed allocations don't have to be checked (like glib in Linux, you panic if you fail to allocate), this should be followed universally. or if not, it should be the same all over the place.

This is basicly a fundamental design question for any reusable C code. Either you check allocs or you do not. Both are deployed successfully, but you have to stick to the plan. If you start mixing both approaches, then it gets really complicated.

Personally, I don't know the history, but I'm a bit skeptical of the usefulness of SYS/SYS_SHARED. Some code/checks is saved, but it seems non-trivial to enforce the policy when SYS/SYS_SHARED should be used. E..g IPC is really application code and one can argue whether it should use the SYS heap at all, or should it be limited to RTOS usage. OTOH, in most products using SOF, IPC is mandatory part of the system, so use of SYS makes sense, but this is now a product call. And so forth and forth, seems complex when scaled up to more and more products.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bzero() will help for buffers as this has caught out a few users already (one recently fixed just prior to v1.9)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to check now as Zephyr mode requires a check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need both bzero() and WB here, but can we also make sure Zephyr wrapper has the same behavior with buffer allocations (wrt to bzero())

Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a really hard time btw with the notion of "minor fixes' for this PR. the moment you change the dma trace it's no longer minor but can have large consequences on the work of others. Can we please learn from the snaffus in the last 3 months and be careful?

+1, a non-boilerplate and non-misleading PR title is really the least that can be expected. I found this PR only because I'm now running this pseudo-code / workaround but it's not sustainable:

if PR_subject ~= 'Minor changes' and author == Keyon:
    PR_subject = 'Not Minor unknown'

Also, please cc: me on any trace-related PR. Not because I can help (likely not) but to give me advance notice of potential regressions like d8a19a4 or f2c13f5

@keyonjie
Copy link
Contributor Author

@marc-hb @plbossart the initial PR was intended to be minor. Anyway, let me move controversial ones out and take it back to minor again.

@plbossart to unify the zeroing of the allocated buffer, I will make another PR to proceed that.

Add checks and return failure if buffer allocation fails to ipc_init().

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Check and return failure, in case of failure happens in ipc_init().

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
The fuzzy IPC testing reports there is memory leak if dma_trace_config
IPC is sent multiple times, we should just return if the DMA trace
buffer is already allocated, to avoid allocating new buffers and the
leakage of the old buffers.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
@lyakh
Copy link
Collaborator

lyakh commented Oct 14, 2021

The bzero() will help for buffers as this has caught out a few users already (one recently fixed just prior to v1.9)

@lgirdwood shouldn't those users be fixed? If we're at this, shall we maybe take a step back and decide what we want our heap allocation API look like? If we're moving towards Zephyr, should we say, that we basically only have xmalloc() anc xcalloc() with flags for caching etc.? And each *alloc() can fail, so, it must be checked? Laying out such an API would help reviewers follow those changes.

@lyakh
Copy link
Collaborator

lyakh commented Oct 14, 2021

@keyonjie as a side-track: I think it would be worthwhile looking through all SYS / SYS_SHARED allocations to check if they can be replaced with static objects. Since they're never freed, those with constant size and universally required on each boot, could be replaced.

@keyonjie
Copy link
Contributor Author

@keyonjie as a side-track: I think it would be worthwhile looking through all SYS / SYS_SHARED allocations to check if they can be replaced with static objects. Since they're never freed, those with constant size and universally required on each boot, could be replaced.

@lyakh there is one troublesome thing that if we want to use uncached buffer to sharing among cores, we have to use e.g. platform_shared_get() if the object is a static object one.

Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when they're actually minor, bug fixes deserve a more specific name than "Minor fixes". At the very least : "Minor IPC and DMA trace fixes" so this PR can be identified among others.

Small change requested for the trace fix.

PS: I would not call "minor" a memory leak issue, especially not in this sort of environment.

int err;

/* the existence of the trace buffer means the dma trace already initialized */
if (d->dmatb.addr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a safer dma_trace_initialized()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi , @plbossart , others: there is no IPC to _disable() the trace, correct? I found some IPC only for trace_off() but none to unwind this.

Just making sure running dma_trace_enable() a second time is never a valid use case and that this fix is purely "defensive programming". If correct the comment should say that this is never expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi , @plbossart , others: there is no IPC to _disable() the trace, correct?

It is under work by @ranj063.

I found some IPC only for trace_off() but none to unwind this.

Just making sure running dma_trace_enable() a second time is never a valid use case and that this fix is purely "defensive programming". If correct the comment should say that this is never expected.

It is going to be valid with the SOF client support, but if we have the #4849 then yes, this is purely safe coding against broken or old host


/* the existence of the trace buffer means the dma trace already initialized */
if (d->dmatb.addr)
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct, you must proceed to dma_trace_start() to get the new buffer parameters into use and reconfigure the channel.

if (!d->dmatb.addr) {
	err = dma_trace_buffer_init(d);

	if (err < 0) {
		mtrace_printf(LOG_LEVEL_ERROR, "dma_trace_enable: buffer_init failed");
		goto out;
	}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keyonjie, @marc-hb, see my PR to fix this memory leak and allow the dtrace re-configuration: #4879

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keyonjie, @marc-hb, see my PR to fix this memory leak and allow the dtrace re-configuration: #4879

Thanks @ujfalusi. This was reported by CI fuzzy IPC test, is it OK to leave this to you together with your dtrace refinement? If so, let me remove it from here.

int err;

/* the existence of the trace buffer means the dma trace already initialized */
if (d->dmatb.addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi , @plbossart , others: there is no IPC to _disable() the trace, correct?

It is under work by @ranj063.

I found some IPC only for trace_off() but none to unwind this.

Just making sure running dma_trace_enable() a second time is never a valid use case and that this fix is purely "defensive programming". If correct the comment should say that this is never expected.

It is going to be valid with the SOF client support, but if we have the #4849 then yes, this is purely safe coding against broken or old host

@ujfalusi ujfalusi self-requested a review October 15, 2021 05:31
@lyakh
Copy link
Collaborator

lyakh commented Oct 15, 2021

@keyonjie as a side-track: I think it would be worthwhile looking through all SYS / SYS_SHARED allocations to check if they can be replaced with static objects. Since they're never freed, those with constant size and universally required on each boot, could be replaced.

@lyakh there is one troublesome thing that if we want to use uncached buffer to sharing among cores, we have to use e.g. platform_shared_get() if the object is a static object one.

Right, but only once.

@lgirdwood
Copy link
Member

@keyonjie I've lost track of what's going on here, is there any rework required or is this no longer required ?

@keyonjie
Copy link
Contributor Author

@keyonjie I've lost track of what's going on here, is there any rework required or is this no longer required ?

@lgirdwood this was split from #4747 , don't have chance to proceed the refinement part4 PR, so is this.

Copy link
Contributor

@ujfalusi ujfalusi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dtrace fix is still not correct.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First 2 patches seem fine, could be a separate PR whilst the final patch is being aligned.

@ujfalusi
Copy link
Contributor

First 2 patches seem fine, could be a separate PR whilst the final patch is being aligned.

@lgirdwood, fwiw PR #4879 deals with the memleak along with other problems which could happen if the release dtrace message is not sent.

@plbossart
Copy link
Member

it doesn't seem like this PR is anywhere close to a consensus. @lgirdwood @keyonjie what's the plan here?

@keyonjie
Copy link
Contributor Author

keyonjie commented Dec 1, 2021

it doesn't seem like this PR is anywhere close to a consensus. @lgirdwood @keyonjie what's the plan here?

The change is needed in case we want the #4747 get proceed, they are for memory zones refinement/simplification only and not mandatory needed. I will proceed them if have time before Christmas, otherwise will close them.

@keyonjie
Copy link
Contributor Author

I have no plan to work on this anymore, closing it.

@keyonjie keyonjie closed this Dec 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants