Skip to content

Change open modes to be more in line with Python's open() function#60

Merged
mgeier merged 7 commits intomasterfrom
open-rw
Aug 19, 2014
Merged

Change open modes to be more in line with Python's open() function#60
mgeier merged 7 commits intomasterfrom
open-rw

Conversation

@mgeier
Copy link
Contributor

@mgeier mgeier commented Aug 6, 2014

Until very recently I had the opinion that if a feature is available in both libsndfile and plain old Python, we should do it in libsndfile's way.

It seems that my opinion is changing ...

I wanted to fix SoundFile.__init__() (see #59) and got the idea that probably the real problem is that libsndfile tries to handle different situations within a single SFM_RDWR mode.

One solution to the whole problem would be to use Python's native file modes instead of the ones provided by libsndfile.
Initially, I thought this wouldn't be possible (see commit message of a3295f7), but it's actually only impossible if read-write mode has to be one single mode.

I'm proposing to allow the following modes:

  • read-only: 'rb', 'r' (for convenience, we would internally add 'b' to that because we only ever deal with binary files)
  • write-only: 'wb', 'w', 'xb' (raises FileExistsError to avoid accidental overwriting), 'x'
  • read/write existing: 'r+b', 'r+'
  • read/write new: 'w+b', 'w+', 'x+b', 'x+'

In addition, we could allow mode=None where the mode is obtained from the mode attribute of the user-provided file-like object.

All other Python file modes should be disallowed, especially 'a' (because I think it wouldn't play nicely with libsndfile) and 't'.
Our current 'rw' mode should also be disallowed.

In the documentation, we could state that "'b' is always implied" and use the variants without 'b' for simplicity.
However, this would be a difference to Python, where 't' is implied.
The default could be mode='r', as before.

All this would have the additional advantage that 'x' mode would be possible, making sure that no files are accidentally overwritten.

In order to implement this, we would have to use Python's open() function and then pass the opened file to libsndfile.
This could be done either as file object or as file descriptor.
I guess libsndfile does its own buffering (?), therefore it would make sense to use open(..., buffering=0).
I guess the "virtual I/O" callback functions are a little less efficient, so it may be better to get the file descriptor with fileno() and pass this to libsndfile.

In order to be consistent with Python's open(), we should also make sure that the read and write positions are modified accordingly after opening.
For the same reason, we should probably also drop the which argument from seek(), as already suggested by @bastibe (I don't remember where exactly update: https://github.com/bastibe/PySoundFile/pull/35#issuecomment-49274456).
It would also finally make sense to add a tell() method (see #44).

@bastibe
Copy link
Owner

bastibe commented Aug 3, 2014

This is a very good idea. If in doubt, I always prefer the pythonic way to the sndfile way. I pretty much agree with everything you wrote.

This would be a backwards incompatible change (though a rather trivial one). I think it is well-justified though, since it makes using PySoundFile more intuitive to Pythonistas. After all, PySoundFile is still tagged alpha.

All other Python file modes should be disallowed, especially 'a' (because I think it wouldn't play nicely with libsndfile) and 't'.

Why would we not want to use 'a'? It might be very handy to be able to append audio to the end of a file.

This could be done either as file object or as file descriptor.
I guess the "virtual I/O" callback functions are a little less efficient, so it may be better to get the file descriptor.

Are file descriptors really consistent between Windows and Unix? We should definitely test all the corner cases on all OSes before committing to file descriptors or file objects.

I guess libsndfile does its own buffering (?), therefore it would make sense to use open(..., buffering=0).

We should definitely test this thoroughly as well. Especially on non-SSD hardware. I have never done any performance measurements for PySoundFile. Maybe it would make sense to build a small benchmark program for comparing different versions/OSes.

@mgeier
Copy link
Contributor Author

mgeier commented Aug 3, 2014

Why would we not want to use 'a'?

I think it just cannot work.
'a' standing for itself means that we cannot read anything from the file. Not even the header. So we don't know the samplerate and channels and stuff.

'a+' might work in some cases, but I'm not sure if the OS lets us read the header in this case.

Also this sentence from the reference sounds like it's not a good idea:
"... and 'a' for appending (which on some Unix systems, means that all writes append to the end of the file regardless of the current seek position)."

I tried it anyway but libsndfile gave me this error:

Error : cannot open embedded file read/write.

I'm not quite sure what that means, but it sounds like libsndfile cannot really deal with this situation.

It might be very handy to be able to append audio to the end of a file.

Sure, but you can have this quite easily by using 'r+' and f.seek(0, sf.SEEK_END).

Are file descriptors really consistent between Windows and Unix?

No idea. But I didn't find anything in the docs that suggests any problems.
Of course, we should test this.

I have no idea how to sensibly test the buffering stuff.
But I guess the presence of sf_write_sync() suggests that there is some buffering going on within libsndfile.
Further, I guess that adding buffering on the Python-side will most likely make performance worse, not better.

But if you have an idea how to test this, we can of course try!

I started implementing the suggested changes and it seems to work quite well for the most part.
As an effect of merging the read/write positions we'll have do do some seek()ing before and after each read()/write(), but I guess a few more function calls are not a problem there.

I think the most severe change will be that we have to delete any possibly available file content (including the file header) in 'w+' mode.
This is done automatically when we use open() to open a given file by name.
I can probably also use open() for file descriptors, especially since os.ftruncate() seems to be only available for Unix.

I don't know yet how to handle file-like objects in this case.
We could call seek(0) and truncate(0) on them, but this may not work on all file-like objects.
Alternatively, we could just disallow 'w+' mode.
Or we could do nothing and write in the documentation that in this case the file-like object must be empty initially.

What do you think?

@bastibe
Copy link
Owner

bastibe commented Aug 4, 2014

I didn't realize that truncating the file deletes the header as well. But it makes perfect sense, of course.

It might be worth considering to "do what I mean" instead of doing it the same way as open(). For example, we could interpret 'a' as open in write mode, but put the write pointer to EOF. In a similar way, 'w' could delete all the audio data, but keep the header intact. We would likely do this without actually using open(). This would mean that the old 'r', 'w', 'rw' modes would map to the new 'r', 'a', 'a+' modes, and we would add the truncating 'w' and 'w+' modes, the non-truncating 'r+', and the exclusive creation modes 'x' and 'x+'.

Basically, we would apply the Unix file modes on the data, but not on the header. Personally, I quite like this. I think it makes sense that truncating a SoundFile does not delete its header. What is your opinion on this?

I'll try to implement a small performance benchmark next weekend. Maybe this will tell us something about the differences on different platforms and implementation.

As for 'w+' and file-like objects, I am not quite sure what you mean. If the user passes a file-like object to us, I think we should simply pass that to sndfile and see what happens. If sndfile can't handle it, it will tell the user. But I might be misunderstanding your intentions.

@bastibe
Copy link
Owner

bastibe commented Aug 4, 2014

That is a lot of modes. Let's summarize:

  • 'r' read-only, pointer at BOF.
  • 'r+' read-write, pointer at BOF
  • 'w' write-only, truncate
  • 'w+' read-write, truncate
  • 'x' write-only, truncate, Error if existing
  • 'x+' read-write, truncate, Error if existing
  • 'a' write-only, pointer at EOF
  • 'a+' read-write, pointer at EOF

Did I miss any?

@mgeier
Copy link
Contributor Author

mgeier commented Aug 5, 2014

Did I miss any?

No, I think those are all.
Note, however, that 'x' is only supported since Python 3.3 (I wasn't aware of this before).
Also note that 'a' may or may not allow seeking, depending on the platform.
To be more specific, seek() may not raise an error, but just silently do nothing, leading to quite unexpected behavior if SoundFile.seek() is used before writing.

Basically, we would apply the Unix file modes on the data, but not on the header. Personally, I quite like this. I think it makes sense that truncating a SoundFile does not delete its header. What is your opinion on this?

I think we should embrace either the Python modes or the libsndfile modes. What you're describing sounds like a mixture of both.
If we use the Python modes, we should stay actually compatible with Python. E.g. if I open a file with the built-in open() function with a certain mode, I expect that I can use the exact same mode when I pass the file object on to sf.open(). This wouldn't work anymore with your suggestions.

The real problem, however, will be that most of your suggested modes actually need SFM_RDWR mode, which isn't supported for all file types!

... we could interpret 'a' as open in write mode, but put the write pointer to EOF.

But we would actually have to use SFM_RDWR, which isn't what most people would expect.

... 'w' could delete all the audio data, but keep the header intact.

This would also need SFM_RDWR.
Also, it would be impossible to e.g. overwrite a stereo file with mono data, or to change the subtype.

And both w and a would then have the exact same problem that I initially wanted to fix with this issue: If the file doesn't exist yet, some arguments have to be given, if not, they must not.
And we have no (straightforward) way to distinguish these 2 cases.

This would mean that the old 'r', 'w', 'rw' modes would map to the new 'r', 'a', 'a+' modes

The old 'w' a.k.a. SFM_WRITE does actually truncate the file, so it wouldn't be equivalent to your suggested new 'a'.

I think 'a' just doesn't work together with libsndfile, but 'a+' might.
However, it would be a strange asymmetry if 'a' is missing and I'm not sure if it's worth the extra code and documentation if the same thing can easily be done with 'r+' and seek().

As for 'w+' and file-like objects, I am not quite sure what you mean. If the user passes a file-like object to us, I think we should simply pass that to sndfile and see what happens. If sndfile can't handle it, it will tell the user. But I might be misunderstanding your intentions.

No, you're right.
I initially thought that 'w' means that we have to make sure that the file gets truncated.
In the meantime I checked the behavior of the built-in open() function when a file descriptor is passed in. In this case, the mode argument is basically ignored. The corresponding file is not truncated in 'w' mode, nor is seek(fd, 0) called. 'x' is also ignored, because if there is a file descriptor, there is always a file (but 'x' doesn't raise an error in this case).

Therefore, I think we can just do it like you suggested, pass it to libsndfile and see what happens.

I didn't find a formal specification for the behavior of the open() modes with regards to file descriptors.
This is the closest thing I could find to a "specification": https://docs.python.org/3.4/library/functions.html#open and it isn't very specific about file descriptors.

@bastibe
Copy link
Owner

bastibe commented Aug 6, 2014

This seems to be fraught with ambiguities and unexpected behavior. I think we should experiment with the actual behavior of sndfile and see what does or doesn't work.

If most of the common functionality works as expected, I am all for using pythonic modes. If there are too many weird cases, we should probably stick with sndfile's modes.

Regardless, we can rename 'rw' to 'r+', unify the read/write pointer, and introduce a tell function.

@mgeier
Copy link
Contributor Author

mgeier commented Aug 6, 2014

I implemented the discussed changes, I think it works quite well.
I also extended the test cases, but it's quite likely that I missed something.

So far, I've tested it on Debian Linux with Python 2.7.8 and 3.4.1.

@mgeier
Copy link
Contributor Author

mgeier commented Aug 6, 2014

I just tried it on OSX 10.6 and the tests are passing (Python 2.7 and 3.3).

@bastibe
Copy link
Owner

bastibe commented Aug 6, 2014

This looks very good! I would have thought this to be a lot more complicated.

@bastibe
Copy link
Owner

bastibe commented Aug 8, 2014

Yes, this looks better.

This is a terrific contribution! While the impact on actual usage is quite minor, I think that this level of attention to detail goes a long way in making PySoundFile more pythonic!

@mgeier
Copy link
Contributor Author

mgeier commented Aug 12, 2014

It seems that using file descriptors is problematic on Windows (see #63), therefore we shouldn't use file descriptors when the user specifies a file name.

We could simply use the actual Python file object, which would be a trivial change to this PR.
The problem with that is two-fold (as far as I can tell):

  1. the virtual I/O interface may be a little less efficient because of all the callbacks
  2. the error messages are more obscure because the errors happen in the callbacks

Point 1. is probably irrelevant, as calling those callback functions will probably need much less resources than reading/writing the data.
Point 2. might be more relevant to our users, but probably they can live with it.

Another possibility would be to re-investigate the options from #59.
I listed 3 options there, probably there are more?

@mgeier
Copy link
Contributor Author

mgeier commented Aug 12, 2014

Another option, which just came to my mind, would be to use sf_open() only for those modes where libsndfile and Python agree.
This would be 'r', 'w' and 'r+', if I'm not mistaken.
Python's open() function is only really needed for 'w+' (to truncate the file if it exists) and 'x'/'x+' (to raise an error if the file exists already).

I don't know if that really improves anything, though.
Probably not ...

@bastibe
Copy link
Owner

bastibe commented Aug 12, 2014

A third option would be to try to use file descriptors, and fall back to virtual I/O if that fails (and don't tell anyone that this basically means "every time on Windows"). I don't think that the error message issue with virtual I/O is particularly bad, since there error messages in the callbacks are very unlikely. We'd have to investigate performance, though.

@mgeier
Copy link
Contributor Author

mgeier commented Aug 13, 2014

A third option would be to try to use file descriptors, and fall back to virtual I/O if that fails

I don't know if this would work ... I suspect that if sf_open_fd() raises an error, the file descriptor might have been closed or not. So it may or may not work if we call sf_open_virtual() afterwards. I don't think we should rely on a sane state after failure.

I think we would have the same problem with the second option in #59.

But probably my worries are unjustified?

@bastibe
Copy link
Owner

bastibe commented Aug 13, 2014

I guess we'll have to try it to find out.

@mgeier
Copy link
Contributor Author

mgeier commented Aug 14, 2014

I switched from file descriptors to virtual I/O in c971db2.

To my great surprise, the tests now run about 40% faster on my Linux computer!
I still think the speed difference will be irrelevant for real applications, but it's an interesting factoid nevertheless.

Information about what is allowed in the "file" argument should be
written into the docstring, not into the error message.
@mgeier
Copy link
Contributor Author

mgeier commented Aug 14, 2014

I just tested it on Windows with

py.test -k "not fd"

This deselects the 38 test cases which use file descriptors. The other 109 test cases pass!

I wouldn't disable those tests automatically on Windows, because there may be some systems where it actually works.
We should add the test invocation to the documentation, though.

@bastibe
Copy link
Owner

bastibe commented Aug 14, 2014

Cool!

@mgeier
Copy link
Contributor Author

mgeier commented Aug 15, 2014

What do you think about this PR versus #65 (both as solution to #59)?

@mgeier
Copy link
Contributor Author

mgeier commented Aug 18, 2014

As mentioned in #67, I added a check for seekable() before calling seek() (046023c).

Now it should work even if we add an exception to seek() as discussed in #67.

So what about this PR?
I think it's ready for merging ...
Is it better than #65?

@bastibe
Copy link
Owner

bastibe commented Aug 18, 2014

I vote for merging #60. This looks very good!

mgeier added a commit that referenced this pull request Aug 19, 2014
Change open modes to be more in line with Python's open() function
@mgeier mgeier merged commit 85c9efb into master Aug 19, 2014
@mgeier mgeier deleted the open-rw branch August 19, 2014 08:15
@mgeier mgeier mentioned this pull request Aug 19, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants