fix(datasets): fixed importing a large number of small files #1119
Conversation
5df4a92 to d805cbe
```diff
  files = []
- max_workers = min(os.cpu_count() + 4, 8)
+ max_workers = min(os.cpu_count() - 2, 4) or 1
```
Why do you decrease the number of workers? These are the values recommended by Python and they have worked fine so far. Also, `os.cpu_count() - 2` can be negative when there is only one core.
Where is this recommendation coming from? I reduced it because the sleep in each thread has to be equal to the thread pool size; otherwise I hit the rate limit on the upper bound (5000 req/h).

edit: I fixed the -2 bug, it's now -1. Thanks!
I got it from https://docs.python.org/3/library/concurrent.futures.html and then just reduced the max value. Basically, since downloading is an I/O-bound operation, the Python library used 5x more download threads up to 3.5, and it was adjusted to be a bit lower from 3.6 onwards.

I believe UX will be a bit degraded with these values, but we can adjust later if that turns out to be the case.
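For context, the two formulas under discussion can be sketched as follows (this is an illustrative sketch, not code from the PR; the `or 1` fallback for `os.cpu_count()` returning `None` is my addition):

```python
import os

# os.cpu_count() can return None on some platforms, so fall back to 1.
cpus = os.cpu_count() or 1

# ThreadPoolExecutor's own default since Python 3.8, tuned for
# I/O-bound work such as downloads.
default_workers = min(32, cpus + 4)

# The PR's capped variant after the -2 -> -1 fix; `or 1` keeps the
# pool size positive on a single-core machine, where min(0, 4) == 0.
max_workers = min(cpus - 1, 4) or 1
```

Since `cpus >= 1`, `cpus - 1` is never negative, so `or 1` only ever fires on the single-core case.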
```python
    )
except error.HTTPError as e:  # pragma nocover
```

```python
exec_time = (time.time() * 1e+3 - start) // 1e+3
```
I am not sure this multiplication and division by 1000 is needed. We can use the output of `time.time()` directly and then check `exec_time < 1` to get the same effect, and the logic is easier to understand.
It's a matter of convention in the repo, which (as far as I'm aware) is milliseconds everywhere. I do agree that your suggestion is more readable, but it's also more dangerous to compare floats like that because of rounding error. So I would stick to ms, as that is the convention in the repo.
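The millisecond convention being defended can be sketched like this (the download call itself is elided; variable names follow the diff):

```python
import time

start = time.time() * 1e+3  # timestamp in milliseconds
# ... the download would happen here ...

# Floor-dividing by 1e+3 yields a whole number of elapsed seconds,
# so a sub-second request gives exactly 0.0, which is safe to
# compare with ==, unlike an arbitrary float duration.
exec_time = (time.time() * 1e+3 - start) // 1e+3
```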
```python
# If execution time was less or equal to zero seconds,
# block the thread a bit to avoid being rate limited.
if exec_time == 0:
    time.sleep(max(os.cpu_count() - 2, 4))
```
This waits at least 4 seconds between consecutive downloads, which is too much. Since we have at most 8 threads, isn't it enough to wait at most 1 second (or `1 - exec_time`, if the comment above is applied)?
I tried that logic, and on my fiber connection I was still hitting the rate limit very easily (after a few minutes of import). With this logic you normalize the request rate across the thread pool, which results in not being rate limited (14 req/s). The additional problem with `1 - exec_time` is that once you hit a rate limit and eventually continue, you hit the rate limit again, and those subsequent rate limit hits are extremely punishing: I was waiting up to 45 minutes to be allowed to make a new request. That's why this normalization across the thread pool is needed.
My guess is that there are other rate limits in place; if it were only this 14 req/s, then with 8 threads and at most one request per second each, we should be well below it. Anyway, keeping it like this is OK, although not very user friendly.
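The normalization idea the thread converges on can be sketched as follows. Names here (`POOL_SIZE`, `download`, `fetch_one`) are illustrative, not from the PR, and the HTTP call is stubbed out. The point: if each of the N pool workers sleeps N seconds after a sub-second request, the pool as a whole averages roughly one request per second, which stays under a 5000 req/h (about 1.4 req/s) limit.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4

def download(url):
    # Stand-in for the real HTTP request.
    return url

def fetch_one(url, pool_size=POOL_SIZE):
    start = time.time() * 1e+3                       # start time in ms
    result = download(url)
    exec_time = (time.time() * 1e+3 - start) // 1e+3  # whole seconds
    if exec_time == 0:
        # Sub-second request: block this worker for pool_size seconds
        # so the aggregate request rate across the pool is normalized.
        time.sleep(pool_size)
    return result

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    # pool_size=1 here only to keep the demo fast.
    results = list(pool.map(lambda u: fetch_one(u, pool_size=1), ["a", "b"]))
```

The per-worker sleep also spaces out retries after a 403/429, which is what avoids the compounding rate-limit penalties described above.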
closes: #1118
Even though this fixes the reported issue in an acceptable manner, the UX remains unaddressed. The ticket for the UX part is still open (#1082), and in this case it's very clear that we should pay more attention to the UX of this progress bar. On my machine the progress bar disappears so fast that for most of the time I have the impression the process is hanging, even though it's working hard to import the data.