Change open modes to be more in line with Python's open() function by mgeier · Pull Request #60 · bastibe/python-soundfile

mgeier · 2014-08-06T13:27:06Z

Until very recently I had the opinion that if a feature is available in both libsndfile and plain old Python, we should do it in libsndfile's way.

It seems that my opinion is changing ...

I wanted to fix SoundFile.__init__() (see #59) and got the idea that probably the real problem is that libsndfile tries to handle different situations within a single SFM_RDWR mode.

One solution to the whole problem would be to use Python's native file modes instead of the ones provided by libsndfile.
Initially, I thought this wouldn't be possible (see commit message of a3295f7), but it's actually only impossible if read-write mode has to be one single mode.

I'm proposing to allow the following modes:

read-only: 'rb', 'r' (for convenience, we would internally add 'b' to that because we only ever deal with binary files)
write-only: 'wb', 'w', 'xb' (raises FileExistsError to avoid accidental overwriting), 'x'
read/write existing: 'r+b', 'r+'
read/write new: 'w+b', 'w+', 'x+b', 'x+'

In addition, we could allow mode=None where the mode is obtained from the mode attribute of the user-provided file-like object.

All other Python file modes should be disallowed, especially 'a' (because I think it wouldn't play nicely with libsndfile) and 't'.
Our current 'rw' mode should also be disallowed.

In the documentation, we could state that "'b' is always implied" and use the variants without 'b' for simplicity.
However, this would be a difference to Python, where 't' is implied.
The default could be mode='r', as before.

All this would have the additional advantage that 'x' mode would be possible, making sure that no files are accidentally overwritten.

In order to implement this, we would have to use Python's open() function and then pass the opened file to libsndfile.
This could be done either as file object or as file descriptor.
I guess libsndfile does its own buffering (?), therefore it would make sense to use open(..., buffering=0).
I guess the "virtual I/O" callback functions are a little less efficient, so it may be better to get the file descriptor with fileno() and pass this to libsndfile.

In order to be consistent with Python's open(), we should also make sure that the read and write positions are modified accordingly after opening.
For the same reason, we should probably also drop the which argument from seek(), as already suggested by @bastibe (~~I don't remember where exactly~~ update: https://github.com/bastibe/PySoundFile/pull/35#issuecomment-49274456).
It would also finally make sense to add a tell() method (see #44).

bastibe · 2014-08-03T10:42:10Z

This is a very good idea. If in doubt, I always prefer the pythonic way to the sndfile way. I pretty much agree with everything you wrote.

This would be a backwards incompatible change (though a rather trivial one). I think it is well-justified though, since it makes using PySoundFile more intuitive to Pythonistas. After all, PySoundFile is still tagged alpha.

All other Python file modes should be disallowed, especially 'a' (because I think it wouldn't play nicely with libsndfile) and 't'.

Why would we not want to use 'a'? It might be very handy to be able to append audio to the end of a file.

This could be done either as file object or as file descriptor.
I guess the "virtual I/O" callback functions are a little less efficient, so it may be better to get the file descriptor.

Are file descriptors really consistent between Windows and Unix? We should definitely test all the corner cases on all OSes before committing to file descriptors or file objects.

I guess libsndfile does its own buffering (?), therefore it would make sense to use open(..., buffering=0).

We should definitely test this thoroughly as well. Especially on non-SSD hardware. I have never done any performance measurements for PySoundFile. Maybe it would make sense to build a small benchmark program for comparing different versions/OSes.

mgeier · 2014-08-03T21:50:08Z

Why would we not want to use 'a'?

I think it just cannot work.
'a' standing for itself means that we cannot read anything from the file. Not even the header. So we don't know the samplerate and channels and stuff.

'a+' might work in some cases, but I'm not sure if the OS lets us read the header in this case.

Also this sentence from the reference sounds like it's not a good idea:
"... and 'a' for appending (which on some Unix systems, means that all writes append to the end of the file regardless of the current seek position)."

I tried it anyway but libsndfile gave me this error:

Error : cannot open embedded file read/write.

I'm not quite sure what that means, but it sounds like libsndfile cannot really deal with this situation.

It might be very handy to be able to append audio to the end of a file.

Sure, but you can have this quite easily by using 'r+' and f.seek(0, sf.SEEK_END).

Are file descriptors really consistent between Windows and Unix?

No idea. But I didn't find anything in the docs that suggests any problems.
Of course, we should test this.

I have no idea how to sensibly test the buffering stuff.
But I guess the presence of sf_write_sync() suggests that there is some buffering going on within libsndfile.
Further, I guess that adding buffering on the Python-side will most likely make performance worse, not better.

But if you have an idea how to test this, we can of course try!

I started implementing the suggested changes and it seems to work quite well for the most part.
As an effect of merging the read/write positions we'll have do do some seek()ing before and after each read()/write(), but I guess a few more function calls are not a problem there.

I think the most severe change will be that we have to delete any possibly available file content (including the file header) in 'w+' mode.
This is done automatically when we use open() to open a given file by name.
I can probably also use open() for file descriptors, especially since os.ftruncate() seems to be only available for Unix.

I don't know yet how to handle file-like objects in this case.
We could call seek(0) and truncate(0) on them, but this may not work on all file-like objects.
Alternatively, we could just disallow 'w+' mode.
Or we could do nothing and write in the documentation that in this case the file-like object must be empty initially.

What do you think?

bastibe · 2014-08-04T11:54:28Z

I didn't realize that truncating the file deletes the header as well. But it makes perfect sense, of course.

It might be worth considering to "do what I mean" instead of doing it the same way as open(). For example, we could interpret 'a' as open in write mode, but put the write pointer to EOF. In a similar way, 'w' could delete all the audio data, but keep the header intact. We would likely do this without actually using open(). This would mean that the old 'r', 'w', 'rw' modes would map to the new 'r', 'a', 'a+' modes, and we would add the truncating 'w' and 'w+' modes, the non-truncating 'r+', and the exclusive creation modes 'x' and 'x+'.

Basically, we would apply the Unix file modes on the data, but not on the header. Personally, I quite like this. I think it makes sense that truncating a SoundFile does not delete its header. What is your opinion on this?

I'll try to implement a small performance benchmark next weekend. Maybe this will tell us something about the differences on different platforms and implementation.

As for 'w+' and file-like objects, I am not quite sure what you mean. If the user passes a file-like object to us, I think we should simply pass that to sndfile and see what happens. If sndfile can't handle it, it will tell the user. But I might be misunderstanding your intentions.

bastibe · 2014-08-04T12:03:48Z

That is a lot of modes. Let's summarize:

'r' read-only, pointer at BOF.
'r+' read-write, pointer at BOF
'w' write-only, truncate
'w+' read-write, truncate
'x' write-only, truncate, Error if existing
'x+' read-write, truncate, Error if existing
'a' write-only, pointer at EOF
'a+' read-write, pointer at EOF

Did I miss any?

mgeier · 2014-08-05T09:04:45Z

Did I miss any?

No, I think those are all.
Note, however, that 'x' is only supported since Python 3.3 (I wasn't aware of this before).
Also note that 'a' may or may not allow seeking, depending on the platform.
To be more specific, seek() may not raise an error, but just silently do nothing, leading to quite unexpected behavior if SoundFile.seek() is used before writing.

Basically, we would apply the Unix file modes on the data, but not on the header. Personally, I quite like this. I think it makes sense that truncating a SoundFile does not delete its header. What is your opinion on this?

I think we should embrace either the Python modes or the libsndfile modes. What you're describing sounds like a mixture of both.
If we use the Python modes, we should stay actually compatible with Python. E.g. if I open a file with the built-in open() function with a certain mode, I expect that I can use the exact same mode when I pass the file object on to sf.open(). This wouldn't work anymore with your suggestions.

The real problem, however, will be that most of your suggested modes actually need SFM_RDWR mode, which isn't supported for all file types!

... we could interpret 'a' as open in write mode, but put the write pointer to EOF.

But we would actually have to use SFM_RDWR, which isn't what most people would expect.

... 'w' could delete all the audio data, but keep the header intact.

This would also need SFM_RDWR.
Also, it would be impossible to e.g. overwrite a stereo file with mono data, or to change the subtype.

And both w and a would then have the exact same problem that I initially wanted to fix with this issue: If the file doesn't exist yet, some arguments have to be given, if not, they must not.
And we have no (straightforward) way to distinguish these 2 cases.

This would mean that the old 'r', 'w', 'rw' modes would map to the new 'r', 'a', 'a+' modes

The old 'w' a.k.a. SFM_WRITE does actually truncate the file, so it wouldn't be equivalent to your suggested new 'a'.

I think 'a' just doesn't work together with libsndfile, but 'a+' might.
However, it would be a strange asymmetry if 'a' is missing and I'm not sure if it's worth the extra code and documentation if the same thing can easily be done with 'r+' and seek().

As for 'w+' and file-like objects, I am not quite sure what you mean. If the user passes a file-like object to us, I think we should simply pass that to sndfile and see what happens. If sndfile can't handle it, it will tell the user. But I might be misunderstanding your intentions.

No, you're right.
I initially thought that 'w' means that we have to make sure that the file gets truncated.
In the meantime I checked the behavior of the built-in open() function when a file descriptor is passed in. In this case, the mode argument is basically ignored. The corresponding file is not truncated in 'w' mode, nor is seek(fd, 0) called. 'x' is also ignored, because if there is a file descriptor, there is always a file (but 'x' doesn't raise an error in this case).

Therefore, I think we can just do it like you suggested, pass it to libsndfile and see what happens.

I didn't find a formal specification for the behavior of the open() modes with regards to file descriptors.
This is the closest thing I could find to a "specification": https://docs.python.org/3.4/library/functions.html#open and it isn't very specific about file descriptors.

bastibe · 2014-08-06T08:40:48Z

This seems to be fraught with ambiguities and unexpected behavior. I think we should experiment with the actual behavior of sndfile and see what does or doesn't work.

If most of the common functionality works as expected, I am all for using pythonic modes. If there are too many weird cases, we should probably stick with sndfile's modes.

Regardless, we can rename 'rw' to 'r+', unify the read/write pointer, and introduce a tell function.

mgeier · 2014-08-06T13:30:25Z

I implemented the discussed changes, I think it works quite well.
I also extended the test cases, but it's quite likely that I missed something.

So far, I've tested it on Debian Linux with Python 2.7.8 and 3.4.1.

mgeier · 2014-08-06T15:27:30Z

I just tried it on OSX 10.6 and the tests are passing (Python 2.7 and 3.3).

bastibe · 2014-08-06T15:44:49Z

This looks very good! I would have thought this to be a lot more complicated.

bastibe · 2014-08-08T19:23:41Z

Yes, this looks better.

This is a terrific contribution! While the impact on actual usage is quite minor, I think that this level of attention to detail goes a long way in making PySoundFile more pythonic!

mgeier · 2014-08-12T14:51:53Z

It seems that using file descriptors is problematic on Windows (see #63), therefore we shouldn't use file descriptors when the user specifies a file name.

We could simply use the actual Python file object, which would be a trivial change to this PR.
The problem with that is two-fold (as far as I can tell):

the virtual I/O interface may be a little less efficient because of all the callbacks
the error messages are more obscure because the errors happen in the callbacks

Point 1. is probably irrelevant, as calling those callback functions will probably need much less resources than reading/writing the data.
Point 2. might be more relevant to our users, but probably they can live with it.

Another possibility would be to re-investigate the options from #59.
I listed 3 options there, probably there are more?

mgeier · 2014-08-12T15:02:09Z

Another option, which just came to my mind, would be to use sf_open() only for those modes where libsndfile and Python agree.
This would be 'r', 'w' and 'r+', if I'm not mistaken.
Python's open() function is only really needed for 'w+' (to truncate the file if it exists) and 'x'/'x+' (to raise an error if the file exists already).

I don't know if that really improves anything, though.
Probably not ...

bastibe · 2014-08-12T15:44:28Z

A third option would be to try to use file descriptors, and fall back to virtual I/O if that fails (and don't tell anyone that this basically means "every time on Windows"). I don't think that the error message issue with virtual I/O is particularly bad, since there error messages in the callbacks are very unlikely. We'd have to investigate performance, though.

mgeier · 2014-08-13T08:21:11Z

A third option would be to try to use file descriptors, and fall back to virtual I/O if that fails

I don't know if this would work ... I suspect that if sf_open_fd() raises an error, the file descriptor might have been closed or not. So it may or may not work if we call sf_open_virtual() afterwards. I don't think we should rely on a sane state after failure.

I think we would have the same problem with the second option in #59.

But probably my worries are unjustified?

bastibe · 2014-08-13T09:46:37Z

I guess we'll have to try it to find out.

mgeier · 2014-08-14T07:44:02Z

I switched from file descriptors to virtual I/O in c971db2.

To my great surprise, the tests now run about 40% faster on my Linux computer!
I still think the speed difference will be irrelevant for real applications, but it's an interesting factoid nevertheless.

Information about what is allowed in the "file" argument should be written into the docstring, not into the error message.

mgeier · 2014-08-14T09:14:03Z

I just tested it on Windows with

py.test -k "not fd"

This deselects the 38 test cases which use file descriptors. The other 109 test cases pass!

I wouldn't disable those tests automatically on Windows, because there may be some systems where it actually works.
We should add the test invocation to the documentation, though.

bastibe · 2014-08-14T09:23:55Z

Cool!

... in order to be compatible with Python file objects: https://docs.python.org/3.4/library/io.html#io.IOBase.seekable

mgeier · 2014-08-15T10:36:52Z

What do you think about this PR versus #65 (both as solution to #59)?

mgeier · 2014-08-18T15:24:45Z

As mentioned in #67, I added a check for seekable() before calling seek() (046023c).

Now it should work even if we add an exception to seek() as discussed in #67.

So what about this PR?
I think it's ready for merging ...
Is it better than #65?

bastibe · 2014-08-18T19:56:42Z

I vote for merging #60. This looks very good!

Change open modes to be more in line with Python's open() function

mgeier mentioned this pull request Aug 1, 2014

Handle 'rw' files correctly in __init__() #59

Closed

Change 'r'/'w'/'rw' to 'r'/'w'/'r+'/'w+'/'x'/'x+'

1998fea

Avoid catch-all exception handling

019cff9

mgeier mentioned this pull request Aug 12, 2014

Using file descriptors on Windows? #63

Closed

Use sf_open_virtual() instead of sf_open_fd() if file name is given

c971db2

Shorter exception message on invalid file argument

6e67830

Information about what is allowed in the "file" argument should be written into the docstring, not into the error message.

mgeier mentioned this pull request Aug 14, 2014

Change argument checking in SoundFile.__init__() #65

Closed

mgeier added 2 commits August 14, 2014 14:55

Fix name property

c6b8275

Change seekable from property to method

8e3f6ee

... in order to be compatible with Python file objects: https://docs.python.org/3.4/library/io.html#io.IOBase.seekable

This was referenced Aug 14, 2014

Add documentation for potential contributors #66

Merged

Check usage of seek() for non-seekable file formats #67

Merged

Check if seekable() before calling seek()

046023c

mgeier added a commit that referenced this pull request Aug 19, 2014

Merge pull request #60 from bastibe/open-rw

85c9efb

Change open modes to be more in line with Python's open() function

mgeier merged commit 85c9efb into master Aug 19, 2014

mgeier deleted the open-rw branch August 19, 2014 08:15

mgeier mentioned this pull request Aug 19, 2014

Allow pipe read/write #68

Merged

mgeier mentioned this pull request Nov 27, 2014

Allow mode='x' even for Python < 3.3 #94

Merged

mgeier mentioned this pull request Feb 2, 2017

DOC: Change 'rw' to 'r+' in README #191

Merged

Conversation

mgeier commented Aug 6, 2014

Uh oh!

bastibe commented Aug 3, 2014

Uh oh!

mgeier commented Aug 3, 2014

Uh oh!

bastibe commented Aug 4, 2014

Uh oh!

bastibe commented Aug 4, 2014

Uh oh!

mgeier commented Aug 5, 2014

Uh oh!

bastibe commented Aug 6, 2014

Uh oh!

mgeier commented Aug 6, 2014

Uh oh!

mgeier commented Aug 6, 2014

Uh oh!

bastibe commented Aug 6, 2014

Uh oh!

bastibe commented Aug 8, 2014

Uh oh!

mgeier commented Aug 12, 2014

Uh oh!

mgeier commented Aug 12, 2014

Uh oh!

bastibe commented Aug 12, 2014

Uh oh!

mgeier commented Aug 13, 2014

Uh oh!

bastibe commented Aug 13, 2014

Uh oh!

mgeier commented Aug 14, 2014

Uh oh!

mgeier commented Aug 14, 2014

Uh oh!

bastibe commented Aug 14, 2014

Uh oh!

mgeier commented Aug 15, 2014

Uh oh!

mgeier commented Aug 18, 2014

Uh oh!

bastibe commented Aug 18, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants