This repository was archived by the owner on Feb 26, 2025. It is now read-only.

Add support for IO-blocks#183

Merged
mgeplf merged 8 commits into master from sandbox/srivas/io_improvements
Mar 31, 2022

Conversation

@sergiorg-hpc
Contributor

@sergiorg-hpc sergiorg-hpc commented Mar 15, 2022

The improvements introduced in #174 and available in libsonata v0.1.11 came with a specific limitation when fetching a small number of widely dispersed GIDs from very large reports. The {min,max} optimization, inspired by ROMIO's data sieving, would in such cases force the implementation to retrieve large amounts of data without considering the gaps between the position ranges of the GIDs. For example, if we fetch only 2 GIDs that happen to be located in the very first entry of the dataset and in the very last one, the library would read the complete report file just to filter the two selected values in memory.

To overcome these constraints while still reducing the number of IO requests issued to the parallel file system, this PR introduces several changes and optimizations:

  • The {min,max} range is now replaced by a collection of ranges, or IO blocks, each containing a subset of the GIDs. The gap between consecutive IO blocks is fixed at 64MB (or 4 blocks on GPFS), allowing the code to adapt to the number of GIDs requested and their distribution in the file. For instance, a higher number of GIDs generally implies a lower number of IO blocks, and vice versa.
  • A flat-map-inspired implementation replaces the std::map previously used to index the GIDs and their positions. The goal is to reduce the overhead when opening populations from very large report files with millions of GIDs. It also allows us to re-index the content by different keys without moving the actual data: the implementation stores the values in several std::vector instances and keeps a dedicated index referring into them. When creating the IO blocks, it is then enough to re-index the content by position range and to group the entries based on this new index.
  • Other minor changes are included, such as consistent behaviour between requesting all GIDs and requesting a subset of them, clarifying comments, and more.
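To make the block-grouping idea concrete, here is a minimal, hypothetical sketch (illustrative names, not the actual libsonata code) of coalescing sorted, non-overlapping position ranges into IO blocks: a range is merged into the current block when the gap to the previous range stays below a threshold (64MB in the PR), and otherwise starts a new block:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// [first, second) offsets of one GID's data within the report dataset.
using Range = std::pair<std::size_t, std::size_t>;

// Coalesce sorted, non-overlapping ranges into IO blocks: extend the
// current block while the gap to the next range is below `gap_limit`,
// otherwise start a new block. Fewer, larger blocks mean fewer IO
// requests to the parallel file system.
std::vector<Range> coalesce(const std::vector<Range>& ranges, std::size_t gap_limit) {
    std::vector<Range> blocks;
    for (const auto& r : ranges) {
        if (!blocks.empty() && r.first - blocks.back().second < gap_limit) {
            blocks.back().second = r.second;  // small gap: extend current block
        } else {
            blocks.push_back(r);  // large gap: start a new IO block
        }
    }
    return blocks;
}
```

With this shape, a large gap limit collapses everything into a single {min,max}-style block, while a small limit keeps widely dispersed GIDs in separate blocks, which is exactly the flexibility described above.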

Using the same ~60TB report utilized in #174 and #179, the following table illustrates the performance difference between v0.1.11 and the proposed PR when fetching a single timestep while varying the number of GIDs requested. The Open column reflects the cost of opening the population as part of the elapsed execution time, the Select column shows the cost of the operations related to selecting the GIDs in Python, and the Get column shows the cost of the actual data retrieval. The total cost is reflected in Elapsed. All measurements are in seconds:

v0.1.11
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      1 |  5.347993 |  4.720139 |  0.626175 |  0.001667
     10 | 11.422943 |  5.391727 |  0.781538 |  5.249666
    100 | 11.814635 |  5.087994 |  0.763447 |  5.963183
   1000 | 12.090420 |  5.105253 |  0.774254 |  6.210903
  10000 | 12.084519 |  5.047278 |  0.779182 |  6.258047
 100000 | 12.789811 |  5.165905 |  0.785131 |  6.838766
1000000 | 18.066422 |  5.148455 |  0.834916 | 12.083038
4234929 | 33.317796 |  5.097557 |  0.414970 | 27.805258

PR
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      1 |  1.558976 |  0.933911 |  0.623337 |  0.001715
     10 |  1.450080 |  0.733224 |  0.628178 |  0.088669
    100 |  4.454307 |  0.749991 |  0.638895 |  3.065413
   1000 |  7.777399 |  0.803403 |  0.632553 |  6.341434
  10000 |  7.836496 |  0.821947 |  0.635333 |  6.379207
 100000 |  8.150157 |  0.815197 |  0.636635 |  6.698314
1000000 | 12.450882 |  0.806348 |  0.706298 | 10.938223
4234929 | 21.047557 |  0.780823 |  0.272458 | 19.994266

From the results, it can be observed that the use of multiple blocks and the integration of a flat map considerably reduce the execution time, especially for small numbers of GIDs. Beyond 100 GIDs, the implementation finds only a single IO block to fetch from storage, so its behaviour is equivalent to the original {min,max} single-block implementation of v0.1.11. This allows us to be more flexible with small numbers of GIDs and more aggressive with large numbers of GIDs, preventing overheads on the parallel file system. This is clearly noticeable in the Get column of the table above (note that performance also improves because the PR contains the changes from #179).

The following table shows the number of blocks created by the implementation for each of the tests, illustrating how we can now be more flexible for requests with few GIDs and more aggressive for requests with many GIDs:

PR
-----------------
   GIDs |  Blocks
      1 |       1
      2 |       2
     10 |      10
    100 |      32
   1000 |       1
  10000 |       1
 100000 |       1
1000000 |       1
4234929 |       1

If we now evaluate the edge case mentioned at the beginning of this description, the entry with 2 GIDs in the following table shows the performance of fetching the very first and the very last GID of the same ~60TB report for a single timestep:

v0.1.11
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      2 | 11.180127 |  4.742555 |  0.205418 |  6.232141

PR
-------------------------------------------------------
   GIDs |   Elapsed |      Open |    Select |       Get
      2 |  1.103376 |  0.893741 |  0.209070 |  0.000555

@sergiorg-hpc sergiorg-hpc self-assigned this Mar 15, 2022
@sergiorg-hpc sergiorg-hpc added the WIP work in progress label Mar 15, 2022
@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch 2 times, most recently from 9977401 to 651d090 Compare March 15, 2022 17:25
@sergiorg-hpc sergiorg-hpc removed the WIP work in progress label Mar 15, 2022
@sergiorg-hpc
Contributor Author

sergiorg-hpc commented Mar 16, 2022

Note that I'll update this PR after #182 is merged. In my opinion, the logical plan would be to merge #182 first, then merge #183 once it is ready, and finally refactor the code as proposed in #181. We could also combine #181 with this PR (#183) and do both together, although I'd prefer to keep them separate for now.

Member

@matz-e matz-e left a comment


Had an initial look over this, looks good to me!

Thank you as always for your work!

@mgeplf
Contributor

mgeplf commented Mar 17, 2022

Nice speedup!

In my opinion, the logical plan would be to ...

That makes sense; don't worry about #181, I can just redo that when I have time. I just merged #182, so we should be good to go.

@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch from 93fbd60 to fa0cc30 Compare March 22, 2022 17:00
@sergiorg-hpc
Contributor Author

sergiorg-hpc commented Mar 23, 2022

A quick update: I have adapted the code of the PR after rebasing from master, including the use of std::next and std::advance with iterators instead of raw pointers. I have also replaced the ENV variable for block_gap_size with a parameter, which is also exposed through Python. The parameter includes a boundary check to prevent users from affecting performance on GPFS, and there are new CI tests to guarantee that we do not accidentally alter this behaviour in the future. Finally, I have verified that performance remains equivalent to what we had with raw pointers and the ENV variable.
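As a small illustration of the iterator change mentioned above, the following sketch (illustrative names, not libsonata code) shows how std::next and std::advance can replace raw pointer arithmetic:

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Illustrative only: stepping through a container with the standard
// iterator helpers instead of raw pointer arithmetic such as `ptr + 2`.
int third_element(const std::vector<int>& v) {
    auto it = std::next(v.begin(), 2);  // returns an iterator 2 positions past begin()
    return *it;
}

int element_at(const std::vector<int>& v, std::ptrdiff_t n) {
    auto it = v.begin();
    std::advance(it, n);  // advances the iterator in place by n positions
    return *it;
}
```

Both helpers work with any iterator category, so the same code keeps working if the underlying container is later changed from a contiguous vector to something else.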

I will wait for further comments from your side. Also, thanks to @matz-e, @mgeplf, and @iomaganaris for the comments and insight.

@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch from fd73c96 to 704b1cf Compare March 30, 2022 07:50
Contributor

@mgeplf mgeplf left a comment


Looks good; I'd just like #187 to be merged first, so that the change in the order of the report reading is checked.

result.min_max_range.second = std::max(result.min_max_range.second, range.second);
element_ids_count += (range.second - range.first);
};
if (block_gap_limit < 4194304) {
Contributor


Is this a hard requirement, or more of a GPFS-specific suggestion wrt performance?

Contributor Author


No, not exactly. Each GPFS block is 16MB, so having a limit matching the block size is reasonable: if the gap is smaller than that, you are most likely going to fetch the whole block anyway.

If we leave this too open, for instance by allowing users to set 0 or something small, then we would be back in the same situation we were in when we started this whole {min,max} optimization (i.e., libsonata triggering millions of IO transactions). Therefore, I'd prefer to set a magic-number limit, like 1 GPFS block, to at least bound how low this parameter can be set.
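A minimal sketch of such a boundary check (hypothetical names; the constant matches the one in the quoted snippet, whose units depend on the element size the report uses):

```cpp
#include <cstddef>

// Hypothetical boundary check, as discussed above: never let the
// user-supplied block gap drop below a floor (the constant from the
// quoted snippet, intended to correspond to one GPFS block), so a
// tiny or zero gap cannot degenerate into millions of IO requests.
constexpr std::size_t MIN_BLOCK_GAP = 4194304;

std::size_t clamp_block_gap(std::size_t requested) {
    return requested < MIN_BLOCK_GAP ? MIN_BLOCK_GAP : requested;
}
```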

Contributor


I'm really hoping no-one ever tweaks the parameter in the first place, but that's an aside.

I'm just wondering if we're making it too GPFS-centric; but I guess the same value makes sense on an NVMe node?

Member


I think it makes sense for other systems, still. Even on NVMe drives, we might want larger reads for lower latency…

Contributor Author

@sergiorg-hpc sergiorg-hpc Mar 31, 2022


As @matz-e says, it shouldn't matter, because here we would just force the use of the {min,max} optimization as much as possible. For NVMe drives, larger requests are also better for reducing latency.

@sergiorg-hpc sergiorg-hpc force-pushed the sandbox/srivas/io_improvements branch from 704b1cf to 6820997 Compare March 30, 2022 14:37
mgeplf
mgeplf previously approved these changes Mar 31, 2022
Contributor

@mgeplf mgeplf left a comment


Awesome, and thanks for adding the python tests, it's appreciated.

@sergiorg-hpc
Contributor Author

Awesome, and thanks for adding the python tests, it's appreciated.

Thank you, @mgeplf.

By the way, I forgot to add a small thing in the Python tests, can you approve once again, please?

@mgeplf mgeplf merged commit 0f8927a into master Mar 31, 2022
@mgeplf mgeplf deleted the sandbox/srivas/io_improvements branch March 31, 2022 09:40