
fix(datasets): fixed importing large amount of small files #1119

Merged: jsam merged 3 commits into master from 1118_import_small_files, Mar 24, 2020
Conversation

@jsam (Contributor) commented Mar 19, 2020

closes: #1118

Even though this fixes the reported issue in an acceptable manner, the UX remains unaddressed. The ticket for the UX part is still open (#1082), and in this case it's very clear that we should pay more attention to the UX of this progress bar. On my machine the progress bar disappears so fast that most of the time I have the impression the process is hanging, even though it's working hard to import the data.

@jsam jsam requested a review from a team as a code owner March 19, 2020 16:25
@jsam jsam force-pushed the 1118_import_small_files branch from 5df4a92 to d805cbe Compare March 24, 2020 12:36
Comment thread: renku/core/management/datasets.py (Outdated)

  files = []
- max_workers = min(os.cpu_count() + 4, 8)
+ max_workers = min(os.cpu_count() - 2, 4) or 1
Contributor:
Why do you decrease the number of workers? These are the values recommended by Python and they have worked fine so far. Also, os.cpu_count() - 2 can be negative when there is only one core.

@jsam (Contributor, Author) commented Mar 24, 2020

Where is this recommendation coming from? I reduced it because the wait inside each thread has to equal the thread-pool size; otherwise I hit the upper-bound rate limit (5000 req/h).

edit: I fixed the -2 bug, now it's -1 - thanks!
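The `or 1` guard is what the fix hinges on. A small sketch (core counts parameterized for illustration; function names are mine, not the repo's) of how the flagged -2 version and the corrected -1 version behave:

```python
def pool_size(cpu_count):
    """The corrected worker-count expression, parameterized over the
    core count for illustration."""
    # min(cpus - 1, 4) caps the pool at 4 workers; the `or 1` guard
    # catches the zero case on single-core machines.
    return min(cpu_count - 1, 4) or 1

def buggy_pool_size(cpu_count):
    """The earlier -2 version the reviewer flagged."""
    # On one core this returns -1 (truthy, so `or 1` never fires),
    # and ThreadPoolExecutor rejects a negative max_workers.
    return min(cpu_count - 2, 4) or 1

print(pool_size(1))        # → 1 (the `or 1` guard kicks in)
print(pool_size(16))       # → 4
print(buggy_pool_size(1))  # → -1, the bug
```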

Contributor:

I got it from https://docs.python.org/3/library/concurrent.futures.html and then just reduced the max value. Basically, since downloading is an I/O-bound operation, the Python library used 5x the CPU count as the default thread count up to Python 3.7, and from 3.8 onward it was adjusted to be a bit lower.
I believe UX will be a bit degraded with these values, but we can adjust later if that turns out to be the case.
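For reference, the defaults being discussed are ThreadPoolExecutor's; a sketch of the two formulas (the pre-3.8 default and the one used from 3.8 onward):

```python
import os

cpus = os.cpu_count() or 1

# Default max_workers before Python 3.8: 5x the core count, on the
# theory that I/O-bound threads spend most of their time waiting.
legacy_default = cpus * 5

# Default from Python 3.8 onward: capped so many-core machines do
# not spawn an excessive number of threads.
modern_default = min(32, cpus + 4)

print(legacy_default, modern_default)
```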

)
except error.HTTPError as e: # pragma nocover

exec_time = (time.time() * 1e+3 - start) // 1e+3
Contributor:
I am not sure this multiplication and division by 1000 is needed. We could use the output of time.time() directly and then check exec_time < 1 for the same effect; the logic would be easier to understand.

@jsam (Contributor, Author):

It's a matter of convention in the repo, which (as far as I'm aware) uses ms everywhere. I agree that your suggestion is more readable, but comparing floats like that is also more dangerous due to rounding error, so I would stick to ms, as that is the convention in the repo.

Comment thread: renku/core/management/datasets.py (Outdated)

  # If execution time was less or equal to zero seconds,
  # block the thread a bit to avoid being rate limited.
  if exec_time == 0:
      time.sleep(max(os.cpu_count() - 2, 4))
Contributor:

This waits at least 4 seconds between consecutive downloads, which is too much. Since we have at most 8 threads, isn't it enough to just wait 1 second max (or 1 - exec_time, if the above comment is applied)?

@jsam (Contributor, Author) commented Mar 24, 2020

I tried that logic, and on my fiber connection I was still hitting the rate limit very easily (after a few minutes of importing). With this logic you normalize the request rate across the thread pool, which results in not being rate limited (14 req/s). The additional problem with 1 - exec_time is that once you hit a rate limit and eventually continue, you hit the rate limit again, and those subsequent rate-limit hits are extremely punishing: I was waiting up to 45 minutes to be allowed to make a new request. That's why this normalization across the thread pool is needed.
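A hedged sketch of the normalization being described (pool size, sleep value, and the fetch placeholder are illustrative, not the repo's exact code): any request that completes in under a second makes its worker sleep for the pool width in seconds, so the pool settles at roughly one request per worker per pool-width window:

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # illustrative; the PR derives this from os.cpu_count()

def fetch(name):
    """Placeholder for a single file download."""
    start = time.time() * 1e3
    # ... the HTTP request would be issued here ...
    exec_time = (time.time() * 1e3 - start) // 1e3
    if exec_time == 0:
        # Fast responses wait for the pool width in seconds, which
        # spreads the aggregate request rate evenly across workers.
        time.sleep(MAX_WORKERS)
    return name

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(fetch, ["file-%d" % i for i in range(4)]))
print(results)
```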

Contributor:

My guess is that there might be other rate limits in place; if it were only this 14 req/s, then with 8 threads and at most one request per second each, we should be well below it. Anyway, keeping it like this is OK, although not very user-friendly.

@jsam jsam merged commit 8d61473 into master Mar 24, 2020
@jsam jsam deleted the 1118_import_small_files branch March 24, 2020 17:59


Development

Successfully merging this pull request may close these issues.

Too many requests error for dataset import from Zenodo

2 participants