Conversation
force-pushed from 9977401 to 651d090
Note that I'll update this PR after #182 is merged. In my opinion, the logical plan would be to merge #182 first, then try to merge #183 once it is ready, and finally refactor the code as proposed in #181. We could also combine #181 with this PR #183 and do both things together, although I'd prefer to keep them separate for now.
matz-e
left a comment
Had an initial look over this, looks good to me!
Thank you as always for your work!
force-pushed from 93fbd60 to fa0cc30
A quick update. I have adapted the code of the PR after rebasing. I will wait for further comments from your side. Also, thanks to @matz-e, @mgeplf, and @iomaganaris for the comments / insight.
force-pushed from fd73c96 to 704b1cf
```cpp
        result.min_max_range.second = std::max(result.min_max_range.second, range.second);
        element_ids_count += (range.second - range.first);
    };
    if (block_gap_limit < 4194304) {
```
is this a hard requirement, or more of a GPFS specific suggestion wrt performance?
No, not exactly. Each GPFS block is 16MB, so having a limit matching the block size is reasonable: if the gap is smaller than that, you are most likely going to fetch the whole block anyway.
If we leave this a bit open, for instance by allowing users to set 0 or something small, then we would be in the same situation as when we started doing this whole {min,max} optimization (i.e., libsonata triggering millions of IO transactions). Therefore, I'd prefer if we could set a magic-number limit, like 1 GPFS block, to at least bound how low this parameter can be set.
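A minimal sketch of that suggestion (the names and constant are illustrative, not libsonata's actual API): clamp whatever the user configures to at least one GPFS block, so the gap limit can never degenerate back into millions of tiny IO transactions.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical lower bound: one GPFS block (16MB, per the discussion above)
// is the smallest gap limit a user can effectively configure.
constexpr std::size_t kGpfsBlockSize = 16UL * 1024 * 1024;

std::size_t clampGapLimit(std::size_t requested) {
    // Values below one block would reintroduce tiny scattered reads,
    // so round them up to the block size.
    return std::max(requested, kGpfsBlockSize);
}
```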
I'm really hoping no-one ever tweaks the parameter in the first place, but that's an aside.
I'm just wondering if we're making it too GPFS-centric; but I guess the same value makes sense on an NVMe node?
I think it still makes sense for other systems. Even on NVMe, we might want to have larger reads for lower latency…
As @matz-e says, it shouldn't matter, because here we would just force the use of the {min,max} optimization as much as possible. For NVMe drives, larger requests are also better for reducing latency.
force-pushed from 704b1cf to 6820997
mgeplf
left a comment
Awesome, and thanks for adding the python tests, it's appreciated.
force-pushed from 96db720 to 4b79fde
Thank you, @mgeplf. By the way, I forgot to add a small thing in the Python tests; can you approve once again, please?
The improvements provided in #174 and available in libsonata `v0.1.11` introduced a specific limitation when fetching a small number of widely-dispersed GIDs in very large reports. The `{min,max}` optimization, inspired by ROMIO's Data Sieving, would in such cases force the implementation to retrieve large amounts of data without considering the gap between the position ranges of the GIDs. This means that, if we fetch only 2 GIDs that happen to be located in the very first entry of the dataset and in the very last one, the library would read the complete report file just to filter the two selected values in memory.

To overcome these constraints while still reducing the number of IO requests issued to the parallel file system, the current PR introduces several changes and optimizations:

- The `{min,max}` range is now replaced by a collection of ranges or IO blocks, in which each range contains a subset of GIDs. The gap between each IO block is fixed at 64MB (or 4 blocks in GPFS), allowing the code to adapt to the number of GIDs requested and their distribution in the file. For instance, a higher number of GIDs would generally imply a lower number of IO blocks, and vice versa.
- A flat map replaces the `std::map` that was utilized to index the GIDs and their positions. The idea is to reduce the overhead when opening populations from very large report files with millions of GIDs. Moreover, it allows us to re-index the content based on different keys, without moving the actual data. For this purpose, the implementation uses several `std::vector` and a dedicated index that refers to the values contained in these vectors. When creating the IO blocks, it is enough to re-index the content by the position ranges and to create the groups based on this new index.

Using the same ~60TB report utilized in #174 and #179, the following table illustrates the performance difference between `v0.1.11` and the proposed PR when fetching a single timestep while varying the number of GIDs requested. The `Open` column reflects the cost of opening the population as part of the elapsed execution time, the `Select` column shows the cost of the operations related to the selection of the GIDs in Python, and the `Get` column illustrates the cost of the actual retrieval of the data. The total cost is reflected in `Elapsed`. All measurements are given in seconds.

From the results, it can be observed that the use of multiple blocks and the integration of a flat map reduce the execution-time overhead considerably, especially for small numbers of GIDs. Beyond 100 GIDs, the implementation finds only a single IO block to fetch from storage and, thus, its behaviour is equivalent to the original `{min,max}` single-block implementation of `v0.1.11`. This allows us to be more flexible with small numbers of GIDs, and more aggressive with large numbers of GIDs to prevent overheads in the parallel file system. This is clearly noticeable in the `Get` column of the table above (note that the performance is also improved because the PR contains the changes from #179).

The following table reflects the number of blocks created by the implementation for each of the tests, showing how we can now be more flexible in requests with a small number of GIDs and more aggressive in requests with a large number of GIDs:
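To make the block creation concrete, here is a minimal sketch (not libsonata's actual code; the names and gap value are illustrative) of how sorted GID position ranges can be coalesced into IO blocks whenever the gap between consecutive ranges stays below the configured limit:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // [first, second) file positions

// Merge position ranges into IO blocks: two consecutive ranges fall into
// the same block when the gap between them is below `gap_limit`.
std::vector<Range> coalesceRanges(std::vector<Range> ranges, std::size_t gap_limit) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<Range> blocks;
    for (const auto& r : ranges) {
        if (!blocks.empty() && r.first <= blocks.back().second + gap_limit) {
            // Small gap: extend the current IO block to cover this range too.
            blocks.back().second = std::max(blocks.back().second, r.second);
        } else {
            // Large gap: start a new IO block.
            blocks.push_back(r);
        }
    }
    return blocks;
}
```

With a larger `gap_limit`, more ranges collapse into fewer, larger blocks, which matches the behaviour described above: a few widely-dispersed GIDs yield several small blocks, while many GIDs quickly merge into a single block.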
If we now evaluate the edge case mentioned at the beginning of this description, the entry with 2 GIDs in the following table illustrates the performance of fetching the very first and the very last GID of the same ~60TB report for a single timestep:
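As an aside, the flat-map indexing described earlier can be sketched as follows (hypothetical names, not the PR's actual data structures): the GIDs and their position ranges live in parallel `std::vector`s, and a separate index of offsets is re-sorted by a different key without moving the underlying data.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct FlatIndex {
    std::vector<unsigned long> gids;
    std::vector<std::size_t> starts;  // first file position of each GID's range
    std::vector<std::size_t> ends;    // one-past-last file position

    // Return the order of entries sorted by file position; the data
    // vectors themselves are never reshuffled.
    std::vector<std::size_t> orderByPosition() const {
        std::vector<std::size_t> idx(gids.size());
        for (std::size_t i = 0; i < idx.size(); ++i) {
            idx[i] = i;
        }
        std::sort(idx.begin(), idx.end(),
                  [&](std::size_t a, std::size_t b) { return starts[a] < starts[b]; });
        return idx;
    }
};
```

Creating the IO blocks then amounts to walking the entries in this position order and grouping them, without ever touching the original GID-ordered data.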