Skip to content

Main thread spread exception when thread-mgr enabled#1889

Merged
wenyongh merged 20 commits intobytecodealliance:mainfrom
xujuntwt95329:main_thread_spread
Jan 20, 2023
Merged

Main thread spread exception when thread-mgr enabled#1889
wenyongh merged 20 commits intobytecodealliance:mainfrom
xujuntwt95329:main_thread_spread

Conversation

@xujuntwt95329
Copy link
Collaborator

No description provided.

wasm_runtime_set_exception(exec_env->module_inst,
"the result conversion is failed");
#if WASM_ENABLE_THREAD_MGR != 0
wasm_cluster_spread_exception(exec_env);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also spread exception when wasm_runtime_prepare_call_function fails, L1767?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to our discussion, I‘ve moved the spread into wasm_set_exception function, so we don't need to process the spread elsewhere

clear_wasi_proc_exit_exception(AOTModuleInstance *module_inst)
{
#if WASM_ENABLE_LIBC_WASI != 0
#if (WASM_ENABLE_LIBC_WASI != 0) && (WASM_ENABLE_THREAD_MGR == 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why skip clear the exception for multi-threads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored, thanks~

clear_wasi_proc_exit_exception(WASMModuleInstance *module_inst)
{
#if WASM_ENABLE_LIBC_WASI != 0
#if (WASM_ENABLE_LIBC_WASI != 0) && (WASM_ENABLE_THREAD_MGR == 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, why skip clear for multi-threads?

@yamt
Copy link
Contributor

yamt commented Jan 16, 2023

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.
  • i guess traverse_list should require a lock. (cluster->lock?) but it isn't a fault of this PR.

@xujuntwt95329 xujuntwt95329 changed the title Main thread spread Main thread spread exception when thread-mgr enabled Jan 16, 2023
@xujuntwt95329
Copy link
Collaborator Author

Thanks for the suggestions.

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.

I think we can add a status field to record the status of a cluster. When we spreading the exception:

  1. get cluster->lock
  2. set cluster's status to exception
  3. spread the exception

Then if there are any other threads creating new thread at the meantime, we directly fail it if the cluster's status is exception

@yamt @wenyongh how do you think about this solution?

  • i guess traverse_list should require a lock. (cluster->lock?) but it isn't a fault of this PR.

Yes, and traverse_list is used for different lists, I‘ll analyze every situation and add the locks if necessary

@yamt
Copy link
Contributor

yamt commented Jan 16, 2023

Thanks for the suggestions.

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.

I think we can add a status field to record the status of a cluster. When we spreading the exception:

1. get `cluster->lock`

2. set cluster's status to `exception`

3. spread the exception

Then if there are any other threads creating new thread at the meantime, we directly fail it if the cluster's status is exception

@yamt @wenyongh how do you think about this solution?

i think it works.

  • i guess traverse_list should require a lock. (cluster->lock?) but it isn't a fault of this PR.

Yes, and traverse_list is used for different lists, I‘ll analyze every situation and add the locks if necessary

thank you.

@wenyongh
Copy link
Collaborator

Thanks for the suggestions.

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.

I think we can add a status field to record the status of a cluster. When we spreading the exception:

  1. get cluster->lock
  2. set cluster's status to exception
  3. spread the exception

Then if there are any other threads creating new thread at the meantime, we directly fail it if the cluster's status is exception

@yamt @wenyongh how do you think about this solution?

It looks good to me.

bh_assert(module_inst_comm->module_type == Wasm_Module_Bytecode
|| module_inst_comm->module_type == Wasm_Module_AoT);

const char *exception = wasm_get_exception(module_inst);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had better declare the variable at the beginning of the function, a concern is that old version compiler might report warning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

#if WASM_ENABLE_THREAD_MGR != 0
wasm_cluster_spread_exception(
wasm_clusters_search_exec_env((WASMModuleInstanceCommon *)module_inst),
exception ? false : true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When exception is NULL, does it mean to clear exception of other threads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@xujuntwt95329
Copy link
Collaborator Author

When adding lock for traverse_list, we will meet dead lock issue: in wasm_cluster_terminate_all_except_self, we get the lock, and call os_thread_join to wait, but the exiting thread will also try to get the lock so dead lock occurred.

I use this strategy to avoid this:

  • add a new field processing in cluster
  • in wasm_cluster_terminate_all_except_self we set processing to true, and unlock the lock before calling thread_join
  • other threads can still get this lock, but in wasm_cluster_spawn_exec_env and wasm_cluster_create_thread it will fail directly if processing is true
  • wasm_cluster_terminate_all_except_self clear processing flag once it finished

wasm_exec_env_destroy(exec_env);
(void)clear_wasi_proc_exit_exception(module_inst);
(void)clear_wasi_proc_exit_exception(
(WASMModuleInstanceCommon *)module_inst);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems no need to clear wasi proc exit exception here. WASI module doesn't set internal start func index,instead it exports a function named "_start". If here it really call wasi proc exit,we can also let instantiation process failed. @lum1n0us what is your opinion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with that, and updated

}
#endif

bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can set to static if the above is ok

os_mutex_unlock(&cluster->lock);

traverse_list(&cluster->exec_env_list, terminate_thread_visitor, NULL);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does traverse_list add lock for the list? why not remove L845 and L849

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terminate_thread_visitor, we will call os_thread_join to wait for other threads to exit, and the exited thread need to get cluster->lock for accessing the list, so we can't hold the lock during terminate_thread_visitor

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks.

Copy link
Collaborator

@wenyongh wenyongh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wenyongh wenyongh merged commit cadf9d0 into bytecodealliance:main Jan 20, 2023
@wenyongh wenyongh mentioned this pull request Jan 20, 2023
vickiegpt pushed a commit to vickiegpt/wamr-aot-gc-checkpoint-restore that referenced this pull request May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants