CI MESSAGE: [44656169]: BUILD STARTED |
|
CI MESSAGE: [44667089]: BUILD STARTED |
|
CI MESSAGE: [44667089]: BUILD FAILED |
|
CI MESSAGE: [44656169]: BUILD PASSED |
8fe96ee to
d879927
|
CI MESSAGE: [44698015]: BUILD STARTED |
|
CI MESSAGE: [44698471]: BUILD STARTED |
|
CI MESSAGE: [44717682]: BUILD STARTED |
|
CI MESSAGE: [44717682]: BUILD FAILED |
|
CI MESSAGE: [44719838]: BUILD STARTED |
|
CI MESSAGE: [44698471]: BUILD FAILED |
|
CI MESSAGE: [44719838]: BUILD FAILED |
|
CI MESSAGE: [44719838]: BUILD PASSED |
193e754 to
e485702
|
CI MESSAGE: [45440403]: BUILD STARTED |
Greptile Summary
This PR introduces a
Key findings:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Op as Operator (image_decoder.h)
participant Facade as ThreadPoolFacade
participant Job as Job (std::optional)
participant TP as NewThreadPool (ThreadPoolBase)
Op->>Facade: AddWork(task1)
Facade->>Job: emplace() if empty
Facade->>Job: AddTask(task1)
Op->>Facade: RunAll(wait=false)
Facade->>Job: Run(*tp_, false)
Note over Job: executor_ set to &tp<br/>tasks submitted to TP
Job-->>TP: BeginBulkAdd / Submit tasks
Note over Facade: ⚠️ job_ NOT reset (wait=false)<br/>executor_ is now non-null
Op->>Op: do other work (e.g. last task in caller thread)
Op->>Facade: WaitForWork()
Facade->>Job: Wait()
Note over Job: Blocks until all tasks done
Facade->>Job: job_.reset() ✅ safe here
Note over Facade: --- BUG SCENARIO ---
Op->>Facade: AddWork(task2) [after RunAll(false), before WaitForWork]
Facade->>Job: AddTask(task2)
Note over Job: ❌ throws std::logic_error<br/>"job already started"
Note over Facade: --- DESTRUCTOR BUG ---
Facade->>Facade: ~ThreadPoolFacade() → RunAll(true)
Facade->>Job: Run(*tp_, true) [job_ still set from RunAll(false)]
Note over Job: ❌ throws in noexcept destructor<br/>→ std::terminate()
Last reviewed commit: 8709e96 |
if (UseNewThreadPool()) {
std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
if (graph_info_.num_cpu > 0) {
Debug std::cerr left in production code
This diagnostic print was presumably added during development but should not be shipped:
if (UseNewThreadPool()) {
std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
if (graph_info_.num_cpu > 0) {
if (UseNewThreadPool()) {
if (graph_info_.num_cpu > 0) {
Remove the std::cerr line entirely — it will spam every user's stderr whenever they set DALI_USE_NEW_THREAD_POOL=1.
@greptileai This will be removed before merging, but is necessary to validate that the flag is properly set and used in CI.
Got it! That makes sense for CI validation. Consider adding a TODO comment on that line so it's clear it needs to be removed:
if (UseNewThreadPool()) {
std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
if (graph_info_.num_cpu > 0) {
// TODO: Remove before merge - only for CI validation
std::cerr << "\n!!! Forced use of NewThreadPool !!!" << std::endl;
This way it won't accidentally slip through code review.
|
CI MESSAGE: [45440403]: BUILD FAILED |
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
---- Signed-off-by: Michał Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
|
CI MESSAGE: [45700206]: BUILD STARTED |
void ThreadPoolFacade::RunAll(bool wait) {
  if (job_) {
    auto atexit = AtScopeExit([&]() {
      if (wait)
        job_.reset();
    });
    job_->Run(*tp_, wait);
  }
}
RunAll(false) leaves job_ in an unusable state
When RunAll(wait=false) is called, the AtScopeExit lambda only resets job_ when wait=true (line 101–102). This means job_ is left pointing to a Job whose executor_ has been set by the call to job_->Run(*tp_, false) on line 104.
This causes two downstream failures:
- Subsequent AddWork calls throw: Job::AddTask (in thread_pool_base.cc:114) unconditionally throws std::logic_error("This job has already been started - cannot add more tasks to it") when executor_ != nullptr. The OldThreadPool contract explicitly allows more work to be enqueued after RunAll(false), so callers that rely on this (a legitimate use of the ThreadPool interface) will get an unexpected exception.
- Destructor calls std::terminate(): The destructor calls RunAll() (wait=true). If job_ is still set from a previous RunAll(false), this calls job_->Run(*tp_, true) on an already-started job, which throws std::logic_error("This job has already been started.") (see thread_pool_base.cc:85–86). Since the destructor is declared noexcept, the exception propagates to std::terminate().
This is actively relevant: image_decoder.h calls tp_->RunAll(false) at lines 423 and 772, where tp_ is ThreadPool* — which will be a ThreadPoolFacade when DALI_USE_NEW_THREAD_POOL=1 is set.
The fix is to track the in-flight job separately (e.g. in a std::optional<Job> running_job_) and reset job_ unconditionally after Run, so that new work always goes into a fresh Job:
void ThreadPoolFacade::RunAll(bool wait) {
  if (job_) {
    // Move the current job into a separate slot so new AddWork calls get a fresh job
    auto running = std::move(job_);
    job_.reset();  // a moved-from std::optional still holds a value, so clear it explicitly
    running->Run(*tp_, wait);  // submit to thread pool
    // If wait is true, Wait() was already called inside Run - nothing more to do.
    // If !wait, the running job should be retained (running_job_) until WaitForWork is called
  }
}
A complete solution would need a running_job_ member to support WaitForWork() after RunAll(false).
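The destructor concern rests on a language rule: since C++11 a destructor is `noexcept` by default, so an exception that escapes it ends in `std::terminate()`. A minimal sketch of that rule, plus one way a destructor could contain a failure from its cleanup step, is given here. `ImplicitDtor`, `ThrowingDtor`, and `GuardedCleanup` are hypothetical types for illustration and are not part of DALI.

```cpp
#include <stdexcept>
#include <type_traits>

struct ImplicitDtor {
  ~ImplicitDtor() {}  // no exception specification: implicitly noexcept
};

struct ThrowingDtor {
  ~ThrowingDtor() noexcept(false) {}  // explicitly allowed to throw
};

static_assert(std::is_nothrow_destructible_v<ImplicitDtor>,
              "destructors are noexcept unless declared otherwise");
static_assert(!std::is_nothrow_destructible_v<ThrowingDtor>,
              "noexcept(false) opts out of the default");

// Hypothetical type whose cleanup step fails, standing in for a call such as
// Run() on an already-started job.
class GuardedCleanup {
 public:
  explicit GuardedCleanup(bool *failed) : failed_(failed) {}

  ~GuardedCleanup() {
    try {
      Cleanup();
    } catch (const std::logic_error &) {
      *failed_ = true;  // record the failure instead of letting it escape
    }
  }

 private:
  void Cleanup() { throw std::logic_error("This job has already been started."); }

  bool *failed_;
};
```

Whether such a failure should be swallowed, logged, or allowed to terminate the process as a deliberate misuse check is a design decision for the code under review.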
@greptileai RunAll(false) must be followed by a call to WaitForWork.
When calling RunAll(false), the job will be retained (see the condition in the lambda passed to AtScopeExit) for use in a subsequent call to WaitForWork. Not calling WaitForWork before the facade is destroyed is an error and rightfully throws an exception.
When calling RunAll(true), job_ is destroyed at scope exit, which allows more work to be scheduled afterwards.
Thank you for the clarification. I see now - the contract requires WaitForWork() after RunAll(false), and the existing code in image_decoder.h already follows this correctly (lines 423-426: RunAll(false) → setup_task(-1) → WaitForWork()). The destructor exception is intentional to catch misuse.
My original comment was based on incorrectly assuming the new thread pool should have the same fire-and-forget semantics as OldThreadPool for RunAll(false). The explicit WaitForWork() requirement makes the async contract clearer and is working as designed.
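The contract settled on in this thread can be sketched as a small state machine. `SketchJob` and `SketchFacade` below are hypothetical stand-ins, not the DALI classes. As a loudly stated simplification, tasks execute inline inside `Wait()` rather than on worker threads, and `job_` is reset only on the success path instead of through `AtScopeExit`. Only the lifetime of `job_` is meant to be illustrated.

```cpp
#include <functional>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Job discussed above.
class SketchJob {
 public:
  void AddTask(std::function<void()> task) {
    if (started_)
      throw std::logic_error("This job has already been started - cannot add more tasks to it");
    tasks_.push_back(std::move(task));
  }

  // Marks the job as started; a real pool would hand the tasks to workers here.
  void Run(bool wait) {
    if (started_)
      throw std::logic_error("This job has already been started.");
    started_ = true;
    if (wait)
      Wait();
  }

  // Simplification: tasks execute inline here instead of on worker threads.
  void Wait() {
    for (auto &task : tasks_)
      task();
    tasks_.clear();
  }

 private:
  std::vector<std::function<void()>> tasks_;
  bool started_ = false;
};

// Hypothetical facade that keeps job_ alive after RunAll(false).
class SketchFacade {
 public:
  void AddWork(std::function<void()> task) {
    if (!job_)
      job_.emplace();
    job_->AddTask(std::move(task));
  }

  void RunAll(bool wait) {
    if (!job_)
      return;
    job_->Run(wait);
    if (wait)
      job_.reset();  // finished: the next AddWork starts a fresh job
  }

  void WaitForWork() {
    if (!job_)
      return;
    job_->Wait();
    job_.reset();  // the started job is released only here
  }

 private:
  std::optional<SketchJob> job_;
};

// Returns true if AddWork between RunAll(false) and WaitForWork throws.
bool AddWorkAfterAsyncRunThrows() {
  SketchFacade tp;
  tp.AddWork([] {});
  tp.RunAll(false);
  bool threw = false;
  try {
    tp.AddWork([] {});  // the job has started, so this is rejected
  } catch (const std::logic_error &) {
    threw = true;
  }
  tp.WaitForWork();   // required after RunAll(false)
  tp.AddWork([] {});  // accepted again: a fresh job is created
  tp.RunAll(true);
  return threw;
}
```

Under these assumptions the facade becomes reusable only after `WaitForWork()` or `RunAll(true)`, which matches the `RunAll(false)`, caller-thread work, `WaitForWork()` sequence described for image_decoder.h.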
config_.thread_pool_threads,
config_.device.value_or(CPU_ONLY_DEVICE_ID),
config_.set_affinity,
"Executorv_v2");
Typo in thread pool name — double v
The thread pool is named "Executorv_v2" in both the NewThreadPool and OldThreadPool creation paths, which contains a double v. This name appears in thread names visible in profiling tools (nsys, perf, etc.) and log messages, so a typo here is noticeable. It should be "Executor_v2".
"Executorv_v2");
"Executor_v2");
config_.thread_pool_threads,
config_.device.value_or(CPU_ONLY_DEVICE_ID),
config_.set_affinity,
"Executorv_v2");
Same typo on the OldThreadPool path
"Executorv_v2");
"Executor_v2");
|
CI MESSAGE: [45700206]: BUILD FAILED |
Category:
Refactoring (Redesign of existing code that doesn't affect functionality)
Description:
This change does the following:
- ThreadPool (now called ThreadPool)
- ThreadPool to OldThreadPool
- NewThreadPool and a ThreadPool Facade that aggregates a pointer to a NewThreadPool and Job object
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
New qa tests script
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A