ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files #10266
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? See also:
pitrou left a comment
Thanks @amol-. I think this needs to be reworked slightly; see below.
docs/source/python/memory.rst (Outdated)
To more efficiently read big data from disk, we can memory map the file, so that
the array can directly reference the data on disk and avoid copying it to memory.
In such case the memory consumption is greatly reduced and it's possible to read
arrays bigger than the total memory
This is rather misleading. The data is loaded back to memory when it is being read. It's just that it's read lazily, so the costs are not paid up front (and the cost is not paid for data that is not accessed).
What memory mapping can avoid is an intermediate copy when reading the data. So it is more performant in that sense.
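To illustrate the copy-avoidance point, here is a minimal sketch (the example.arrow path and the x column name are illustrative, and pa.total_allocated_bytes() only tracks Arrow's own allocator, so numbers will vary):

```python
import pyarrow as pa

# Write a small Arrow IPC file that we can memory-map later (path is illustrative).
arr = pa.array(range(1_000_000), type=pa.int64())
schema = pa.schema([pa.field("x", pa.int64())])
with pa.OSFile("example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        writer.write_batch(pa.RecordBatch.from_arrays([arr], names=["x"]))

# Regular read: the file contents are copied into buffers allocated by Arrow.
before = pa.total_allocated_bytes()
with pa.OSFile("example.arrow", "rb") as source:
    table_copy = pa.ipc.open_file(source).read_all()
print("allocated by normal read:", pa.total_allocated_bytes() - before)

# Memory-mapped read: the buffers reference the mapping directly, so Arrow's
# own allocator reports (almost) no additional memory.
before = pa.total_allocated_bytes()
with pa.memory_map("example.arrow") as source:
    table_mmap = pa.ipc.open_file(source).read_all()
print("allocated by memory-mapped read:", pa.total_allocated_bytes() - before)
```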
I see what you mean. What I was trying to say is that Arrow doesn't have to allocate memory itself, as it can directly point to the memory-mapped buffer, whose allocation is managed by the system. Also, the memory-mapped buffer can be paged out more easily by the system without write-back cost, as it's not flagged as dirty memory, which makes it possible to deal with files bigger than memory even in the absence of a swap file. I'll try to rephrase this in a less misleading way.
I rephrased it to make it clearer that in absolute terms you won't be consuming less memory, but the system will be able to page it out more easily.
Would memory mapping be more efficient than a system with swap enabled? You mention that there are potential write-back savings, but why would the page be flagged as dirty in a swap scenario? In either case it seems we are talking about read-only access to a file larger than physical memory.
There are benefits when you are only reading the data (say, to compute means or other aggregations on it). If you are using memory mapping and have to read more data than fits in memory, the kernel can evict the pages that are no longer in use without the cost of writing them to swap, because they are already available in the memory-mapped file and can be paged back in directly from the mapping.
On the other hand, if you were relying on swap and had read the file normally, then when the data doesn't fit into memory the kernel has to incur the cost of writing it to swap; otherwise it would be unable to page it out, since there would be no copy (as far as the memory manager is concerned) from which to page it back in.
So memory mapping avoids the cost of writing to the swap file when you are exhausting memory.
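As a rough illustration of that read-only pattern, a memory-mapped table can be scanned with compute functions while its buffers keep referencing the mapping. A sketch, assuming the hypothetical example.arrow IPC file with an int64 column x from the earlier snippet:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Read-only analytics over a memory-mapped IPC file ("example.arrow" is
# illustrative): the column buffers point into the mapping, so pages the
# kernel evicts can be faulted back in from the file itself, with no swap
# write-back involved.
with pa.memory_map("example.arrow") as source:
    table = pa.ipc.open_file(source).read_all()
    print(pc.mean(table["x"]))  # computes over the mapped data
```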
docs/source/python/memory.rst (Outdated)
In such case the operating system will be able to page in the mapped memory
lazily and page it out without any write back cost when under pressure,
allowing to more easily read arrays bigger than the total memory.
I don't know what user we target here, but "page in" and "page out" are not commonly understood, I think (of course, we can't start explaining in detail how memory works here, but this section will typically be read by people who might not fully understand what memory mapping is or how it works, and who just think they can use it to avoid memory allocation).
I rephrased it this way mostly because there were concerns in previous comments about the "avoid memory allocation" wording, since the memory is getting allocated anyway; it can just be swapped out at any time without any additional write-back cost, so you can avoid OOMs even if you have exhausted memory or swap space.
@jorisvandenbossche @pitrou I think I did my best to address the remaining comments, could you take another look? :D
pitrou left a comment
This looks much better to me, thank you.
For example to write an array of 10M integers, we could write it in 1000 chunks
of 10000 entries:

.. ipython:: python
I'm still lukewarm about using ipython blocks here.
I'm not fond of the ipython directive either, but we have a dedicated Jira issue ( https://issues.apache.org/jira/browse/ARROW-13159 ); for now I adhered to what seemed to be the practice in the rest of that file.
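The body of that ipython block isn't reproduced in this excerpt; a minimal sketch of what writing 10M integers in 1000 batches of 10,000 entries could look like (the bigfile.arrow path and the x column name are illustrative, not taken from the PR):

```python
import pyarrow as pa

# Write 10 million int64 values as 1000 record batches of 10,000 entries each.
schema = pa.schema([pa.field("x", pa.int64())])
with pa.OSFile("bigfile.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for chunk in range(1000):
            start = chunk * 10_000
            batch = pa.RecordBatch.from_arrays(
                [pa.array(range(start, start + 10_000), type=pa.int64())],
                names=["x"],
            )
            writer.write_batch(batch)

# Memory-mapping the result lets readers reference the data without an
# up-front copy into Arrow-allocated buffers.
with pa.memory_map("bigfile.arrow") as source:
    table = pa.ipc.open_file(source).read_all()
    print(table.num_rows)  # 10000000
```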
pitrou left a comment
+1. Thanks for the update @amol-!