Add support for asynchronous embeddings export #394
    if status["status"] != "Completed":
        raise JobError(status, self)
Why raise a JobError if the job is not completed? Perhaps it's still running?
My thought process was that the usage pattern would be the following:
export_job = dataset.export_embeddings()
export_job.sleep_until_complete(False)
result = export_job.result_urls()
We could also just wait for the result URLs inside result_urls(), but then I'd want to highlight somehow that obtaining the results could take a long time.
Alright, that makes sense, I didn't notice the AsyncJob inheritance.
This is a neat idea, to have customized job result classes.
Maybe we should add a wait_for_completion parameter. It might even be the default to wait for the job to complete.
Good idea, let's do that
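A minimal sketch of what the agreed wait_for_completion parameter could look like. FakeExportJob, its polling counter, and the example URL are hypothetical stand-ins so the snippet runs without a backend, and JobError here is a simplified placeholder; the real AsyncJob and JobError signatures in the library may differ.

```python
import time
from typing import Dict, List


class JobError(Exception):
    """Simplified placeholder for the library's JobError (assumed signature)."""

    def __init__(self, status: Dict[str, str]) -> None:
        super().__init__(f"Job not completed: {status}")
        self.status = status


class FakeExportJob:
    """Hypothetical stand-in for EmbeddingsExportJob, with no network calls."""

    def __init__(self, polls_until_done: int) -> None:
        self._polls_left = polls_until_done

    def status(self) -> Dict[str, str]:
        # Each call simulates one poll; the job completes after N polls.
        if self._polls_left > 0:
            self._polls_left -= 1
            return {"status": "Running"}
        return {"status": "Completed"}

    def sleep_until_complete(self, poll_interval_s: float = 0.0) -> None:
        # Block until the simulated job stops reporting "Running".
        while self.status()["status"] == "Running":
            time.sleep(poll_interval_s)

    def result_urls(self, wait_for_completion: bool = True) -> List[str]:
        # Waiting is the default, as suggested in the review.
        if wait_for_completion:
            self.sleep_until_complete()
        status = self.status()
        if status["status"] != "Completed":
            raise JobError(status)
        return ["https://example.com/embeddings-part-0"]
```

With the default, callers get the URLs in one call; passing wait_for_completion=False keeps the original fail-fast behaviour for callers that poll on their own.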
jean-lucas
left a comment
LGTM 👍
gatli
left a comment
Looks good to me! Let's address the instantiation of the EmbeddingsExportJob (and, for that matter, any AsyncJob) so that people can trigger this in one process and poll in another.
  poetry run black --check .
  - run:
-     name: Ruff Lint Check # See pyproject.tooml [tool.ruff]
+     name: Ruff Lint Check # See pyproject.toml [tool.ruff]
Merci 🙏
nucleus/async_job.py
Outdated
class EmbeddingsExportJob(AsyncJob):
    def result_urls(self) -> List[str]:
I'm wondering how you would instantiate this in another process. I think we need a classmethod from_id that would allow you to spin this up in one environment and then poll in another using just the job_id.
Good point, that would be used through the NucleusClient.list_jobs method though, right? So something like this:
jobs = NucleusClient.list_jobs()
export_job = EmbeddingsExportJob.from_job_id(jobs[0].job_id)
Added a from_id to the AsyncJob, but I couldn't make it more typesafe (e.g. the client argument is still inferred as Any). Do you have any ideas on how to improve this?
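One possible way to tighten the typing, sketched under assumptions: a typing.Protocol describes just the client surface the job needs, and a bound TypeVar lets from_id return the subclass it was called on. JobClientLike, fetch_result_urls, DummyClient, and the *Sketch class names are all hypothetical and not part of the library; importing the real client class under typing.TYPE_CHECKING is another common way to avoid a circular import.

```python
from typing import List, Protocol, Type, TypeVar


class JobClientLike(Protocol):
    """Hypothetical structural type: only what the job needs from a client."""

    def fetch_result_urls(self, job_id: str) -> List[str]: ...


# Bound TypeVar so from_id() on a subclass is typed as that subclass.
T = TypeVar("T", bound="AsyncJobSketch")


class AsyncJobSketch:
    """Illustrative stand-in for AsyncJob, not the library class."""

    def __init__(self, job_id: str, client: JobClientLike) -> None:
        self.job_id = job_id
        self.client = client

    @classmethod
    def from_id(cls: Type[T], job_id: str, client: JobClientLike) -> T:
        # Rebuild a job handle from its ID alone, e.g. in another process.
        return cls(job_id, client)


class EmbeddingsExportJobSketch(AsyncJobSketch):
    def result_urls(self) -> List[str]:
        return self.client.fetch_result_urls(self.job_id)


class DummyClient:
    """In-memory client that satisfies JobClientLike for demonstration."""

    def fetch_result_urls(self, job_id: str) -> List[str]:
        return [f"https://example.com/{job_id}/part-0"]
```

Because the Protocol is structural, the real client would not need to inherit from it; a type checker only verifies that the methods the job calls are present with matching signatures.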


No description provided.