
Conversation

@amol-
Member

@amol- amol- commented May 7, 2021

howitgets

@github-actions

github-actions bot commented May 7, 2021

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@amol- amol- changed the title from "[Doc][Python] Improve documentation regarding dealing with memory mapped files" to "ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files" May 7, 2021
@github-actions

github-actions bot commented May 7, 2021

Member

@pitrou pitrou left a comment


Thanks @amol- . I think this needs to be reworked slightly, see below.

To more efficiently read big data from disk, we can memory map the file, so that
the array can directly reference the data on disk and avoid copying it to memory.
In such case the memory consumption is greatly reduced and it's possible to read
arrays bigger than the total memory
Member


This is rather misleading. The data is loaded back to memory when it is being read. It's just that it's read lazily, so the costs are not paid up front (and the cost is not paid for data that is not accessed).

What memory mapping can avoid is an intermediate copy when reading the data. So it is more performant in that sense.
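For illustration, a minimal sketch of the difference being discussed (assuming an already-written Arrow IPC file, here hypothetically named `example.arrow`; `pa.total_allocated_bytes()` reports Arrow's own allocations):

    import pyarrow as pa

    # Regular read: the file contents are copied into buffers allocated by Arrow.
    with pa.OSFile('example.arrow', 'rb') as source:
        loaded = pa.ipc.open_file(source).read_all()
    print(pa.total_allocated_bytes())  # roughly the size of the data

    # Memory-mapped read: the resulting table references the mapping directly,
    # so Arrow itself allocates almost nothing and pages are faulted in lazily
    # by the OS only when the data is actually accessed.
    with pa.memory_map('example.arrow', 'rb') as source:
        mapped = pa.ipc.open_file(source).read_all()
    print(pa.total_allocated_bytes())  # close to zero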

Member Author

@amol- amol- May 13, 2021


I see what you mean. What I was trying to say is that Arrow doesn't have to allocate memory itself, as it can directly point to the memory-mapped buffer, whose allocation is managed by the system. Also, the memory-mapped buffer can be paged out more easily by the system without any write-back cost, since it's not flagged as dirty memory, thus allowing us to deal with files bigger than memory even in the absence of a swap file. I'll try to rephrase this in a less misleading way.

Member Author


I rephrased it to make it clearer that in absolute terms you won't be consuming less memory, but the system will be able to page it out more easily.

Member


Would memory mapping be more efficient than a system with swap enabled? You mention that there are potential write-back savings, but why would the page be flagged as dirty in a swap scenario? In either case it seems we are talking about read-only access to a file larger than physical memory.

Member Author


There are benefits because, if you are only reading the data (say, to compute means or whatever on it) through a memory mapping and you have to read more data than fits in your memory, the kernel can evict the pages that are no longer in use without the cost of writing them to swap, because they are already available in the memory-mapped file and can thus be paged back in directly from the mapping.

On the other side, if you were relying on swap and had read the file normally, then when the data you have to read doesn't fit into memory, the kernel has to incur the cost of writing it to swap; otherwise it would be unable to page it out, as there would be no copy (as far as the memory manager is concerned) that would allow paging it back in.

So memory mapping avoids the cost of writing to the swap file when you are exhausting memory.
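As a sketch of the kind of access pattern being described (a hypothetical IPC file `bigfile.arrow` with a single numeric column, read through pyarrow's file reader one record batch at a time):

    import pyarrow as pa
    import pyarrow.compute as pc

    # Scan a memory-mapped IPC file batch by batch to compute a mean.
    # Pages are faulted in as each batch is touched and can be dropped again
    # by the OS without any write-back, since the file is the backing store.
    with pa.memory_map('bigfile.arrow', 'rb') as source:
        reader = pa.ipc.open_file(source)
        total = 0
        count = 0
        for i in range(reader.num_record_batches):
            column = reader.get_batch(i).column(0)
            total += pc.sum(column).as_py()
            count += len(column)
        print(total / count)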

Comment on lines 330 to 332
In such case the operating system will be able to page in the mapped memory
lazily and page it out without any write back cost when under pressure,
allowing to more easily read arrays bigger than the total memory.
Member


I don't know what user we target here, but "page in" and "page out" are not commonly understood, I think (of course, we can't start explaining in detail how memory works here, but this section will typically be read by people who might not fully understand what memory mapping is or how it works, and who just think they can use it to avoid memory allocation).

Member Author


I rephrased it this way mostly because there were concerns in previous comments about the "avoid memory allocation" wording: the memory is getting allocated anyway, it can just be swapped out at any time without any additional write-back cost, and thus you can avoid OOMs even if you exhaust memory or swap space.

@amol-
Member Author

amol- commented Jul 21, 2021

@jorisvandenbossche @pitrou I think I did my best to address the remaining comments, could you take another look? :D

Member

@pitrou pitrou left a comment


This looks much better to me, thank you.

For example to write an array of 10M integers, we could write it in 1000 chunks
of 10000 entries:

.. ipython:: python
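The body of that ipython block isn't reproduced above; a minimal sketch of such a chunked write (assuming pyarrow's IPC file writer, with hypothetical file and column names) could look like:

    import pyarrow as pa

    schema = pa.schema([pa.field('nums', pa.int32())])

    # Write 10M integers as 1000 record batches of 10000 entries each, so the
    # whole array never has to be materialized in memory at once.
    with pa.OSFile('bigfile.arrow', 'wb') as sink:
        with pa.ipc.new_file(sink, schema) as writer:
            for chunk in range(1000):
                batch = pa.record_batch(
                    [pa.array(range(chunk * 10000, (chunk + 1) * 10000), pa.int32())],
                    schema=schema,
                )
                writer.write(batch)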
Member


I'm still lukewarm about using ipython blocks here.

Member Author


I'm not fond of the ipython directive either, but we have a dedicated Jira issue ( https://issues.apache.org/jira/browse/ARROW-13159 ); for now I adhered to what seemed to be the practice in the rest of that file.

Member

@pitrou pitrou left a comment


+1. Thanks for the update @amol- !
