Skip to content

Conversation

@yelite
Copy link
Contributor

@yelite yelite commented Oct 26, 2023

This PR changes how to get the Session from DPackedFunc or DModule. Without this, there will be data corruption in the message channel, if Python gc happens in a different thread than the thread that owns the Disco session, which is typical in a multi-thread environment like LLM inference server.

An explanation on how such data corruption could occur:

  1. Every time dref.session gets called. A new Session Python object will be created, due to how FFI works
    return _ffi_api.DRefSession(self) # type: ignore # pylint: disable=no-member
  2. When we want to call the model through disco, we need to first obtain the DPackedFunc through DModule . This process calls the _get_cached_method on a new Session object (follows (1))
    func = self.session._get_cached_method("runtime.ModuleGetFunction")
  3. In hasattr(self, "_cache") , a call is made to the tvm::ReflectionVTable::GetAttr to check if it exists in the TVM object
    if not hasattr(self, "_cache"):
  4. This call will fail and raise error, which gets caught by
    raise_last_ffi_error()
    . hasattr will swallow this error and return False.
  5. During the creation of the FFI error in (4), traceback objects are created to help trace into C++ function.
    return types.TracebackType(tb, new_frame.tb_frame, new_frame.tb_lasti, new_frame.tb_lineno)
    Holding a reference to the frame object creates circular references that goes 'frame -> parent frames -> frame that holds the traceback object -> traceback -> frame'. All local variables in the call stack will be indirectly referenced and cannot be freed by ref counting. This is okay because Python gc will collect these objects once they are unreachable.
  6. (5) prevents the DPackedFunc returned from _get_cached_method to be freed through Python ref counting, because it's in the frame that's indirectly referred by the circular reference in (5).
  7. If Python gc happens in the main thread, the DPackedFunc from (6) will be collected, because there is no references to the newly-created Session, other than the reference path rooted from the circular reference between traceback and frame (from (5)).
  8. (7) will result in writing to the pipe in the main thread from the DRef destructor.
    Downcast<Session>(this->session)->DeallocReg(reg_id);
    If other disco activity happens in another thread, there will be multiple threads writing to the same pipe without synchronization. Data will be corrupted.

This PR fixes (1) and (3), so that the data corruption will not occur anymore.

cc @junrushao

Copy link
Member

@junrushao junrushao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for finding this a super complicated bug!

@junrushao junrushao merged commit 6408cc4 into apache:unity Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants