
Conversation

@amol-
Member

@amol- amol- commented May 7, 2021

howitgets

@github-actions

github-actions bot commented May 7, 2021

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@amol- amol- changed the title from "[Doc][Python] Improve documentation regarding dealing with memory mapped files" to "ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files" May 7, 2021
@github-actions

github-actions bot commented May 7, 2021

Member

@pitrou pitrou left a comment


Thanks @amol- . I think this needs to be reworked slightly, see below.

To more efficiently read big data from disk, we can memory map the file, so that
the array can directly reference the data on disk and avoid copying it to memory.
In such case the memory consumption is greatly reduced and it's possible to read
arrays bigger than the total memory
Member


This is rather misleading. The data is loaded back to memory when it is being read. It's just that it's read lazily, so the costs are not paid up front (and the cost is not paid for data that is not accessed).

What memory mapping can avoid is an intermediate copy when reading the data. So it is more performant in that sense.
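For illustration, a minimal sketch of the difference being discussed (assuming an already-written Arrow IPC file, here hypothetically named `example.arrow`; `pa.total_allocated_bytes()` reports Arrow's own allocations):

    import pyarrow as pa

    # Regular read: the file contents are copied into buffers allocated by Arrow.
    with pa.OSFile('example.arrow', 'rb') as source:
        loaded = pa.ipc.open_file(source).read_all()
    print(pa.total_allocated_bytes())  # roughly the size of the data

    # Memory-mapped read: the resulting table references the mapping directly,
    # so Arrow itself allocates almost nothing and pages are faulted in lazily
    # by the OS only when the data is actually accessed.
    with pa.memory_map('example.arrow', 'rb') as source:
        mapped = pa.ipc.open_file(source).read_all()
    print(pa.total_allocated_bytes())  # close to zero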

Member Author

@amol- amol- May 13, 2021


I see what you mean. What I was trying to say is that Arrow doesn't have to allocate memory itself, as it can directly point to the memory-mapped buffer, whose allocation is managed by the system. Also, the memory-mapped buffer can be paged out more easily by the system without any write-back cost, since it's not flagged as dirty memory, thus allowing us to deal with files bigger than memory even in the absence of a swap file. I'll try to rephrase this in a less misleading way.

Member Author


I rephrased it to make it clearer that in absolute terms you won't be consuming less memory, but the system will be able to page it out more easily.

Member


Would memory mapping be more efficient than a system with swap enabled? You mention that there are potential write-back savings, but why would the page be flagged as dirty in a swap scenario? In either case it seems we are talking about read-only access to a file larger than physical memory.

Member Author


There are benefits because, if you are only reading the data (say, to compute means or whatever on it) through a memory mapping and you have to read more data than fits in your memory, the kernel can evict the pages that are no longer in use without the cost of writing them to swap, because they are already available in the memory-mapped file and can thus be paged back in directly from the mapping.

On the other side, if you were relying on swap and had read the file normally, then when the data you have to read doesn't fit into memory, the kernel has to incur the cost of writing it to swap; otherwise it would be unable to page it out, as there would be no copy (as far as the memory manager is concerned) that would allow paging it back in.

So memory mapping avoids the cost of writing to the swap file when you are exhausting memory.
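As a sketch of the kind of access pattern being described (a hypothetical IPC file `bigfile.arrow` with a single numeric column, read through pyarrow's file reader one record batch at a time):

    import pyarrow as pa
    import pyarrow.compute as pc

    # Scan a memory-mapped IPC file batch by batch to compute a mean.
    # Pages are faulted in as each batch is touched and can be dropped again
    # by the OS without any write-back, since the file is the backing store.
    with pa.memory_map('bigfile.arrow', 'rb') as source:
        reader = pa.ipc.open_file(source)
        total = 0
        count = 0
        for i in range(reader.num_record_batches):
            column = reader.get_batch(i).column(0)
            total += pc.sum(column).as_py()
            count += len(column)
        print(total / count)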

Comment on lines 330 to 332
In such case the operating system will be able to page in the mapped memory
lazily and page it out without any write back cost when under pressure,
allowing to more easily read arrays bigger than the total memory.
Member


I don't know what user we target here, but "page in" and "page out" are not commonly understood, I think (of course, we can't start explaining in detail how memory works here, but this section will typically be read by people who might not fully understand what memory mapping is or how it works, and who just think they can use it to avoid memory allocation).

Member Author


I rephrased it this way mostly because there were concerns in previous comments about the "avoid memory allocation" wording: the memory is getting allocated anyway, it can just be swapped out at any time without any additional write-back cost, and thus you can avoid OOMs even if you exhaust memory or swap space.

@amol-
Member Author

amol- commented Jul 21, 2021

@jorisvandenbossche @pitrou I think I did my best to address the remaining comments, could you take another look? :D

Member

@pitrou pitrou left a comment


This looks much better to me, thank you.

For example to write an array of 10M integers, we could write it in 1000 chunks
of 10000 entries:

.. ipython:: python
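The body of that ipython block isn't reproduced above; a minimal sketch of such a chunked write (assuming pyarrow's IPC file writer, with hypothetical file and column names) could look like:

    import pyarrow as pa

    schema = pa.schema([pa.field('nums', pa.int32())])

    # Write 10M integers as 1000 record batches of 10000 entries each, so the
    # whole array never has to be materialized in memory at once.
    with pa.OSFile('bigfile.arrow', 'wb') as sink:
        with pa.ipc.new_file(sink, schema) as writer:
            for chunk in range(1000):
                batch = pa.record_batch(
                    [pa.array(range(chunk * 10000, (chunk + 1) * 10000), pa.int32())],
                    schema=schema,
                )
                writer.write(batch)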
Member


I'm still lukewarm about using ipython blocks here.

Member Author


I'm not fond of the ipython directive either, but we have a dedicated Jira issue ( https://issues.apache.org/jira/browse/ARROW-13159 ); for now I adhered to what seemed to be the practice in the rest of that file.

Member

@pitrou pitrou left a comment


+1. Thanks for the update @amol- !
