
fix(datasets): fixed importing large amount of small files #1119

Merged: jsam merged 3 commits into master from 1118_import_small_files, Mar 24, 2020
Conversation

@jsam (Contributor) commented Mar 19, 2020

closes: #1118

Even though this fixes the reported issue in an acceptable manner, the UX remains unaddressed. The ticket for the UX part is still open (#1082), and in this case it's very clear that we should pay more attention to the UX of this progress bar. On my machine the progress bar disappears so fast that most of the time I have the impression the process is hanging, even though it's working hard to import the data.

@jsam jsam requested a review from a team as a code owner March 19, 2020 16:25
@jsam jsam force-pushed the 1118_import_small_files branch from 5df4a92 to d805cbe Compare March 24, 2020 12:36
Comment thread: renku/core/management/datasets.py (Outdated)

  files = []
- max_workers = min(os.cpu_count() + 4, 8)
+ max_workers = min(os.cpu_count() - 2, 4) or 1
Contributor:
Why do you decrease the number of workers? These are the values recommended by Python and they have worked fine so far. Also, os.cpu_count() - 2 can be negative when there is only one core.

@jsam (Contributor, Author) commented Mar 24, 2020

Where is this recommendation coming from? I reduced it because the wait inside each thread has to equal the thread-pool size; otherwise I hit the upper-bound rate limit (5000 req/h).

edit: I fixed the -2 bug, now it's -1 - thanks!
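The `or 1` guard is what the fix hinges on. A small sketch (core counts parameterized for illustration; function names are mine, not the repo's) of how the flagged -2 version and the corrected -1 version behave:

```python
def pool_size(cpu_count):
    """The corrected worker-count expression, parameterized over the
    core count for illustration."""
    # min(cpus - 1, 4) caps the pool at 4 workers; the `or 1` guard
    # catches the zero case on single-core machines.
    return min(cpu_count - 1, 4) or 1

def buggy_pool_size(cpu_count):
    """The earlier -2 version the reviewer flagged."""
    # On one core this returns -1 (truthy, so `or 1` never fires),
    # and ThreadPoolExecutor rejects a negative max_workers.
    return min(cpu_count - 2, 4) or 1

print(pool_size(1))        # → 1 (the `or 1` guard kicks in)
print(pool_size(16))       # → 4
print(buggy_pool_size(1))  # → -1, the bug
```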

Contributor:

I got it from https://docs.python.org/3/library/concurrent.futures.html and then just reduced the max value. Basically, since downloading is an I/O-bound operation, the Python library used 5x the CPU count as the default thread count up to Python 3.7, and from 3.8 onward it was adjusted to be a bit lower.
I believe UX will be a bit degraded with these values, but we can adjust later if that turns out to be the case.
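For reference, the defaults being discussed are ThreadPoolExecutor's; a sketch of the two formulas (the pre-3.8 default and the one used from 3.8 onward):

```python
import os

cpus = os.cpu_count() or 1

# Default max_workers before Python 3.8: 5x the core count, on the
# theory that I/O-bound threads spend most of their time waiting.
legacy_default = cpus * 5

# Default from Python 3.8 onward: capped so many-core machines do
# not spawn an excessive number of threads.
modern_default = min(32, cpus + 4)

print(legacy_default, modern_default)
```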

)
except error.HTTPError as e: # pragma nocover

exec_time = (time.time() * 1e+3 - start) // 1e+3
Contributor:
I am not sure this multiplication and division by 1000 is needed. We could use the output of time.time() directly and then check exec_time < 1 for the same effect; the logic would be easier to understand.

@jsam (Contributor, Author):

It's a matter of convention in the repo, which (as far as I'm aware) uses ms everywhere. I agree that your suggestion is more readable, but comparing floats like that is also more dangerous due to rounding error, so I would stick to ms, as that is the convention in the repo.

Comment thread: renku/core/management/datasets.py (Outdated)

  # If execution time was less or equal to zero seconds,
  # block the thread a bit to avoid being rate limited.
  if exec_time == 0:
      time.sleep(max(os.cpu_count() - 2, 4))
Contributor:

This waits at least 4 seconds between consecutive downloads, which is too much. Since we have at most 8 threads, isn't it enough to just wait 1 second max (or 1 - exec_time, if the above comment is applied)?

@jsam (Contributor, Author) commented Mar 24, 2020

I tried that logic, and on my fiber connection I was still hitting the rate limit very easily (after a few minutes of importing). With this logic you normalize the request rate across the thread pool, which results in not being rate limited (14 req/s). The additional problem with 1 - exec_time is that once you hit a rate limit and eventually continue, you hit the rate limit again, and those subsequent rate-limit hits are extremely punishing: I was waiting up to 45 minutes to be allowed to make a new request. That's why this normalization across the thread pool is needed.
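A hedged sketch of the normalization being described (pool size, sleep value, and the fetch placeholder are illustrative, not the repo's exact code): any request that completes in under a second makes its worker sleep for the pool width in seconds, so the pool settles at roughly one request per worker per pool-width window:

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # illustrative; the PR derives this from os.cpu_count()

def fetch(name):
    """Placeholder for a single file download."""
    start = time.time() * 1e3
    # ... the HTTP request would be issued here ...
    exec_time = (time.time() * 1e3 - start) // 1e3
    if exec_time == 0:
        # Fast responses wait for the pool width in seconds, which
        # spreads the aggregate request rate evenly across workers.
        time.sleep(MAX_WORKERS)
    return name

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(fetch, ["file-%d" % i for i in range(4)]))
print(results)
```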

Contributor:

My guess is that there might be other rate limits in place; if it were only this 14 req/s, then with 8 threads and at most one request per second each, we should be well below it. Anyway, keeping it like this is OK, although not very user-friendly.

@jsam jsam merged commit 8d61473 into master Mar 24, 2020
@jsam jsam deleted the 1118_import_small_files branch March 24, 2020 17:59


Development

Successfully merging this pull request may close these issues.

Too many requests error for dataset import from Zenodo

2 participants