This repository was archived by the owner on Feb 26, 2025. It is now read-only.

Add support for IO-blocks#183

Merged
mgeplf merged 8 commits into master from sandbox/srivas/io_improvements
Mar 31, 2022

Conversation

@sergiorg-hpc
Contributor

@sergiorg-hpc sergiorg-hpc commented Mar 15, 2022

The improvements introduced in #174 and available in libsonata v0.1.11 came with a specific limitation when fetching a small number of widely dispersed GIDs from very large reports. The {min,max} optimization, inspired by ROMIO's data sieving, would in such cases force the implementation to retrieve large amounts of data without considering the gaps between the position ranges of the GIDs. For example, if we fetch only 2 GIDs that happen to be located in the very first entry of the dataset and in the very last one, the library would read the complete report file just to filter the two selected values in memory.

To overcome these constraints while still reducing the number of IO requests issued to the parallel file system, this PR introduces several changes and optimizations:

  • The {min,max} range is now replaced by a collection of ranges, or IO blocks, each containing a subset of the GIDs. The gap between consecutive IO blocks is fixed at 64MB (or 4 blocks on GPFS), allowing the code to adapt to the number of GIDs requested and their distribution in the file. For instance, a higher number of GIDs generally implies a lower number of IO blocks, and vice versa.
  • A flat-map-inspired implementation replaces the std::map previously used to index the GIDs and their positions. The goal is to reduce the overhead when opening populations from very large report files with millions of GIDs. It also allows us to re-index the content by different keys without moving the actual data: the implementation stores the values in several std::vector instances and keeps a dedicated index referring into them. When creating the IO blocks, it is then enough to re-index the content by position range and to group the entries based on this new index.
  • Other minor changes are included, such as consistent behaviour between requesting all GIDs and requesting a subset of them, clarifying comments, and more.
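To make the block-grouping idea concrete, here is a minimal, hypothetical sketch (illustrative names, not the actual libsonata code) of coalescing sorted, non-overlapping position ranges into IO blocks: a range is merged into the current block when the gap to the previous range stays below a threshold (64MB in the PR), and otherwise starts a new block:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// [first, second) offsets of one GID's data within the report dataset.
using Range = std::pair<std::size_t, std::size_t>;

// Coalesce sorted, non-overlapping ranges into IO blocks: extend the
// current block while the gap to the next range is below `gap_limit`,
// otherwise start a new block. Fewer, larger blocks mean fewer IO
// requests to the parallel file system.
std::vector<Range> coalesce(const std::vector<Range>& ranges, std::size_t gap_limit) {
    std::vector<Range> blocks;
    for (const auto& r : ranges) {
        if (!blocks.empty() && r.first - blocks.back().second < gap_limit) {
            blocks.back().second = r.second;  // small gap: extend current block
        } else {
            blocks.push_back(r);  // large gap: start a new IO block
        }
    }
    return blocks;
}
```

With this shape, a large gap limit collapses everything into a single {min,max}-style block, while a small limit keeps widely dispersed GIDs in separate blocks, which is exactly the flexibility described above.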

Using the same ~60TB report utilized in #174 and #179, the following table illustrates the performance difference between v0.1.11 and the proposed PR when fetching a single timestep while varying the number of GIDs requested. The Open column reflects the cost of opening the population as part of the elapsed execution time, the Select column shows the cost of the operations related to selecting the GIDs in Python, and the Get column shows the cost of the actual data retrieval. The total cost is reflected in Elapsed. All measurements are in seconds:

v0.1.11
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      1 |  5.347993 |  4.720139 |  0.626175 |  0.001667
     10 | 11.422943 |  5.391727 |  0.781538 |  5.249666
    100 | 11.814635 |  5.087994 |  0.763447 |  5.963183
   1000 | 12.090420 |  5.105253 |  0.774254 |  6.210903
  10000 | 12.084519 |  5.047278 |  0.779182 |  6.258047
 100000 | 12.789811 |  5.165905 |  0.785131 |  6.838766
1000000 | 18.066422 |  5.148455 |  0.834916 | 12.083038
4234929 | 33.317796 |  5.097557 |  0.414970 | 27.805258

PR
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      1 |  1.558976 |  0.933911 |  0.623337 |  0.001715
     10 |  1.450080 |  0.733224 |  0.628178 |  0.088669
    100 |  4.454307 |  0.749991 |  0.638895 |  3.065413
   1000 |  7.777399 |  0.803403 |  0.632553 |  6.341434
  10000 |  7.836496 |  0.821947 |  0.635333 |  6.379207
 100000 |  8.150157 |  0.815197 |  0.636635 |  6.698314
1000000 | 12.450882 |  0.806348 |  0.706298 | 10.938223
4234929 | 21.047557 |  0.780823 |  0.272458 | 19.994266

From the results, it can be observed that the use of multiple blocks and the integration of a flat map considerably reduce the execution time, especially for small numbers of GIDs. Beyond 100 GIDs, the implementation finds only a single IO block to fetch from storage, so its behaviour is equivalent to the original {min,max} single-block implementation of v0.1.11. This allows us to be more flexible with small numbers of GIDs and more aggressive with large numbers of GIDs, preventing overheads on the parallel file system. This is clearly noticeable in the Get column of the table above (note that performance also improves because the PR contains the changes from #179).

The following table shows the number of blocks created by the implementation for each of the tests, illustrating how we can now be more flexible for requests with few GIDs and more aggressive for requests with many GIDs:

PR
-----------------
   GIDs |  Blocks
      1 |       1
      2 |       2
     10 |      10
    100 |      32
   1000 |       1
  10000 |       1
 100000 |       1
1000000 |       1
4234929 |       1

If we now evaluate the edge case mentioned at the beginning of this description, the entry with 2 GIDs in the following table shows the performance of fetching the very first and the very last GID of the same ~60TB report for a single timestep:

v0.1.11
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      2 | 11.180127 |  4.742555 |  0.205418 |  6.232141

PR
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      2 |  1.103376 |  0.893741 |  0.209070 |  0.000555

@sergiorg-hpc sergiorg-hpc self-assigned this Mar 15, 2022
@sergiorg-hpc sergiorg-hpc added the WIP work in progress label Mar 15, 2022
@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch 2 times, most recently from 9977401 to 651d090 Compare March 15, 2022 17:25
@sergiorg-hpc sergiorg-hpc removed the WIP work in progress label Mar 15, 2022
@sergiorg-hpc
Contributor Author

sergiorg-hpc commented Mar 16, 2022

Note that I'll update this PR after #182 is merged. In my opinion, the logical plan would be to merge #182 first, then merge #183 once it is ready, and finally refactor the code as proposed in #181. We could also combine #181 with this PR (#183) and do both together, although I'd prefer to keep them separate for now.

Member

@matz-e matz-e left a comment


Had an initial look over this, looks good to me!

Thank you as always for your work!

@mgeplf
Contributor

mgeplf commented Mar 17, 2022

Nice speedup!

In my opinion, the logical plan would be to ...

That makes sense; don't worry about #181, I can just redo that when I have time. I just merged #182, so we should be good to go.

@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch from 93fbd60 to fa0cc30 Compare March 22, 2022 17:00
@sergiorg-hpc
Contributor Author

sergiorg-hpc commented Mar 23, 2022

A quick update: I have adapted the code of the PR after rebasing from master, including the use of std::next and std::advance with iterators instead of raw pointers. I have also replaced the ENV variable for block_gap_size with a parameter, which is also exposed through Python. The parameter includes a boundary check to prevent users from affecting performance on GPFS, and there are new CI tests to guarantee that we do not accidentally alter this behaviour in the future. Finally, I have verified that performance remains equivalent to what we had with raw pointers and the ENV variable.
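As a small illustration of the iterator change mentioned above, the following sketch (illustrative names, not libsonata code) shows how std::next and std::advance can replace raw pointer arithmetic:

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Illustrative only: stepping through a container with the standard
// iterator helpers instead of raw pointer arithmetic such as `ptr + 2`.
int third_element(const std::vector<int>& v) {
    auto it = std::next(v.begin(), 2);  // returns an iterator 2 positions past begin()
    return *it;
}

int element_at(const std::vector<int>& v, std::ptrdiff_t n) {
    auto it = v.begin();
    std::advance(it, n);  // advances the iterator in place by n positions
    return *it;
}
```

Both helpers work with any iterator category, so the same code keeps working if the underlying container is later changed from a contiguous vector to something else.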

I will wait for further comments from your side. Also, thanks to @matz-e, @mgeplf, and @iomaganaris for the comments and insight.

@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch from fd73c96 to 704b1cf Compare March 30, 2022 07:50
Contributor

@mgeplf mgeplf left a comment


Looks good; I'd just like #187 to be merged first, so that the change in the order of the report reading is checked.

result.min_max_range.second = std::max(result.min_max_range.second, range.second);
element_ids_count += (range.second - range.first);
};
if (block_gap_limit < 4194304) {
Contributor


Is this a hard requirement, or more of a GPFS-specific suggestion wrt performance?

Contributor Author


No, not exactly. Each GPFS block is 16MB, so having a limit matching the block size is reasonable: if the gap is smaller than that, you are most likely going to fetch the whole block anyway.

If we leave this too open, for instance by allowing users to set 0 or something small, then we would be back in the same situation we were in when we started this whole {min,max} optimization (i.e., libsonata triggering millions of IO transactions). Therefore, I'd prefer to set a magic-number limit, like 1 GPFS block, to at least bound how low this parameter can be set.
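A minimal sketch of such a boundary check (hypothetical names; the constant matches the one in the quoted snippet, whose units depend on the element size the report uses):

```cpp
#include <cstddef>

// Hypothetical boundary check, as discussed above: never let the
// user-supplied block gap drop below a floor (the constant from the
// quoted snippet, intended to correspond to one GPFS block), so a
// tiny or zero gap cannot degenerate into millions of IO requests.
constexpr std::size_t MIN_BLOCK_GAP = 4194304;

std::size_t clamp_block_gap(std::size_t requested) {
    return requested < MIN_BLOCK_GAP ? MIN_BLOCK_GAP : requested;
}
```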

Contributor


I'm really hoping no-one ever tweaks the parameter in the first place, but that's an aside.

I'm just wondering if we're making it too GPFS-centric; but I guess the same value makes sense on an NVMe node?

Member


I think it makes sense for other systems, still. Even on NVMe drives, we might want larger reads for lower latency…

Contributor Author

@sergiorg-hpc sergiorg-hpc Mar 31, 2022


As @matz-e says, it shouldn't matter, because here we would just force the use of the {min,max} optimization as much as possible. For NVMe drives, larger requests are also better for reducing latency.

@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch from 704b1cf to 6820997 Compare March 30, 2022 14:37
mgeplf
mgeplf previously approved these changes Mar 31, 2022
Contributor

@mgeplf mgeplf left a comment


Awesome, and thanks for adding the python tests, it's appreciated.

@sergiorg-hpc
Contributor Author

Awesome, and thanks for adding the python tests, it's appreciated.

Thank you, @mgeplf.

By the way, I forgot to add a small thing in the Python tests, can you approve once again, please?

@mgeplf mgeplf merged commit 0f8927a into master Mar 31, 2022
@mgeplf mgeplf deleted the sandbox/srivas/io_improvements branch March 31, 2022 09:40