Skip to content

Several changes based on #18#22

Closed
mgeier wants to merge 31 commits intobastibe:masterfrom
mgeier:many-changes
Closed

Several changes based on #18#22
mgeier wants to merge 31 commits intobastibe:masterfrom
mgeier:many-changes

Conversation

@mgeier
Copy link
Contributor

@mgeier mgeier commented Mar 17, 2014

This pull request contains most of the changes of #18, but with more documentation and hopefully with a little less noise.

I also fixed 2 bugs regarding the virtual IO feature (963c1fa and 4e068f3, see also #19), I hope I didn't make it worse.

In addition to #18, I included following features:

  • available_formats() and available_subtypes() (88ab0ef)
  • allow string arguments for format/subtype/endian (c3d0df7)

mgeier added 22 commits March 16, 2014 21:16
All constants are now in one place

In addition, _getAttributeNames() was added which helps auto-completion in
IDEs.
In new-style classes the base class __setattr__() has to be called
instead of inserting the value into self.__dict__.

See http://docs.python.org/2/reference/datamodel.html#customizing-attribute-access
This is relevant for tools that do introspection, e.g. for
auto-completion in IDEs.
The first constructor argument can now be:
- a string (path to a file)
- an int (a file descriptor)
- a file-like object

The argument virtual_io is not needed, because this can be detected
automatically.

Closes #19
@bastibe
Copy link
Owner

bastibe commented Mar 20, 2014

Whoa, that's a lot of changes! That's almost a complete rewrite! Are you interested in becoming a committer? How can I contact you privatly about this?

@mgeier
Copy link
Contributor Author

mgeier commented Mar 21, 2014

Whoa, that's a lot of changes! That's almost a complete rewrite! Are you interested in becoming a committer?

I'd be honored. But as long as you're open for discussion it's also OK if everything goes through your hands.

How can I contact you privatly about this?

my first name dot my last name at gmail dot com

mgeier added 6 commits March 23, 2014 14:32
_FormatType, _SubtypeType and _EndianType are now combined into _Format.
_ModeType was renamed to _Mode.

Code for string input is now inside these classes.

Format constants are now directly assigned in the module namespace, which
should make clearer where they come from.

The dictionaries _formats, _subtypes and _endians are not necessary anymore.

Many assertions were replaced by exceptions.
... and change assertion to exception.
@bastibe
Copy link
Owner

bastibe commented Mar 30, 2014

I was thinking more along theses lines:

class _Format(int):
    """An int with a name that only allows certain values.

    To define a new valid int, use `_Format._define(value, name)`.
    Only defined values can be instantiated, and they will repr as
    name.

    Valid `_Format`s can be instantiated by value or name.

    """
    _formats = {}

    @classmethod
    def _define(cls, id, name):
        cls._formats[id] = name
        return _Format(id)

    def __new__(cls, value):
        if value is None:
            return None
        if isinstance(value, str):
            for k, v in cls._formats.items():
                if value.upper() == v.upper():
                    value = k
                    break
            else:
                raise ValueError("Invalid format string: {}".format(value))
        if value not in cls._formats:
            raise ValueError("Invalid format: {}".format(value))
        return super(_Format, cls).__new__(cls, value)

    def __repr__(self):
        return self._formats.get(self, super(_Format, self).__repr__())

    __str__ = __repr__

WAV   = _Format._define(0x010000, 'WAV' )  # Microsoft WAV format (little endian default).
AIFF  = _Format._define(0x020000, 'AIFF')  # Apple/SGI AIFF format (big endian).
AU    = _Format._define(0x030000, 'AU'  )  # Sun/NeXT AU format (big endian).
RAW   = _Format._define(0x040000, 'RAW' )  # RAW PCM data.
PAF   = _Format._define(0x050000, 'PAF' )  # Ensoniq PARIS file format.
SVX   = _Format._define(0x060000, 'SVX' )  # Amiga IFF / SVX8 / SV16 format.
NIST  = _Format._define(0x070000, 'NIST')  # Sphere NIST format.
...

This will behave just like a normal int when used, but it will only allow _defined values to be instanciated. Thus, it acts as error checking when creating a format.

But I do want to support undefined values!

Why? If sndfile were to introduce a new format, it would be quite easy to define.

This breaks the DRY principle.

Not really, as it does not really duplicate code anywhere. It is merely a string and a variable name that are the same, but which are used for different purposes. On the other hand, it is explicit, which is better than implicit, as per PEP20.

@mgeier
Copy link
Contributor Author

mgeier commented Mar 31, 2014

I admit that my suggestion is less explicit.
But I still think that your's breaks DRY because the names are repeated, even if once as a name for a Python object and the second time as a string literal.
Getting one from the other is not so far-fetched in Python because (at least conceptually) all names are string literals in some kind of dictionary (in our case globals()).

As a side note, I also kept the line length to 79 characters, as of PEP8, which is the reason why I defined the alias _fmt.

Anyway, let's assume PEP20 and DRY cancel each other out, the question of supporting undefined values still remains open and may decide our choice of implementation.

Supporting future version of libsndfile is just one aspect, probably not the most important one.

Another one is the combination of type/subtype/endian.
With your suggestion this will cease to work.
If a user has a combined type ID (from some unknown source) and wants to use it in PySoundFile, there's no way to do that.

I think it's easier for many users to have separate arguments format, subtype and endian, but for other users it might be easier to used OR'ed-together values like format=sf.WAVEX|sf.PCM_24.

I would like to support both.

When we decide to support combined format IDs (which I hope!), __new__() is not the right place to check for a wrong numeric ID.
We're actually not interested if format, subtype and endian hold valid IDs, as long as we check the combined format format | subtype | endian with sf_format_check().
This way we can give the users more liberty in choosing their arguments and still be safe.

@bastibe
Copy link
Owner

bastibe commented Mar 31, 2014

We could just check every combination of the three masks with the format. This would provide safety from invalid values. I guess that this safety would be desirable.

I wonder if we are overengineering this though. Maybe straight-up integers would really be sufficient after all. In fact, I can see users being confused by our ints-that-print.

Alternatively, maybe it would be preferable to use normal classes that contain the int and overload the | operator instead of monkey-patching printing into an int. Even that might really be overkill though.

What do you think about this?

@bastibe
Copy link
Owner

bastibe commented Mar 31, 2014

Regarding DRY vs. explicit, I understand DRY as a reminder to not copy and paste code all over the place. This avoids problems when one copy of the code is changed, but not another. It also avoids problems with two almost identical but subtly different pieces of code. Note that in our case, both instances of the code in question would be very close to one another, with little consequence if one was changed but not the other and no danger of mistaking one for the other.

Being explicit is a core quality of code in general. "explicit is better than implicit" is a guard against too much magic (at least that is how I understand it). It is meant to prevent things from having non-obvious meaning, such as declaring a global variable and thereby altering the behavior of some class. Again, it would be a rather benign case in our code and errors would merely lead to a wrong print somewhere.

Still, code is above all written to be read, and only incidentally for the computer to execute. I would therefore value explicitness over DRY.

However, both approaches obscure the meaning of our constants somewhat. The reader will wonder in what way that _Format thing is behaving differently from the numeric constant, and what we are trying to express by introducing the _Format instead of the number.

Also, while we're on the topic of slinging acronyms, I would like to invoke KISS, as an argument for plain old ints. YAGNI, also, as a guard against overengineering ;-)

@mgeier
Copy link
Contributor Author

mgeier commented Apr 5, 2014

We could just check every combination of the three masks with the format. This would provide safety from invalid values. I guess that this safety would be desirable.

Where exactly would you check the combination with the masks?

In this PR I suggest the following (in __init__()), would you do anything more/less/different?

format = format | subtype | endian

if not format:
    raise ValueError("No format specified")
if not format & _TYPEMASK:
    raise ValueError("Invalid format")
if not format & _SUBMASK:
    raise ValueError("Invalid subtype")
if not format & _ENDMASK and endian != FILE:
    raise ValueError("Invalid endian-ness")
if not format_check(format):
    raise ValueError(
        "Invalid combination of format, subtype and endian")
self._info.format = format

I wonder if we are overengineering this though. Maybe straight-up integers would really be sufficient after all. In fact, I can see users being confused by our ints-that-print.

We might be over-engineering it ...

But naive users don't even need to know that the constants are actually ints and the actual numeric value is of no relevance to them.

I think plain ints are definitely an option, but I would argue that it's even more confusing to print a number instead of the symbolic name of the constant.

But that's an important question we should agree on:

If we type sf.WAV, do we want to see WAV or 65536 or something entirely different?

Users who know that those are actually numeric values, can easily get the numbers with int(sf.WAV) or the actual hex representation used in sndfile.h with hex(sf.WAV).

When I came up with the _Format class, I wanted to create an object that's useful for both the human and the computer.
The user only gets exposed to the symbolic representation, the computer only uses the numbers.

Alternatively, maybe it would be preferable to use normal classes that contain the int and overload the | operator instead of monkey-patching printing into an int. Even that might really be overkill though.

I know the "rule" to prefer composition over inheritance, but in this case, inheritance is a much better fit. A format constant is a int, I wouldn't say it has a int.

I think overloading the | operator is even more obscure than providing a custom __repr__() (which would be needed anyway when using composition). I believe if we go that way, the class implementation will get much longer and probably more complicated, too. We would also have to provide a way to get the number back out to be used in sf_open().
BTW, it's not monkey-patching (which is evil), an interested user can easily see that the type is actually _Format and not a monkey-patched int (which isn't possible anyway), no black magic happening there.
I guess you meant that figuratively anyway, but I really see nothing bad in deriving from a built-in type.

Regarding DRY, I also consider repetition in close vicinity a violation of the rule.
It just looks wrong to me if the same string is repeated for no real reason.
It doesn't have anything to do with wanting to type less, it's very much about reading the code (which we do a lot more, as you mentioned).

In case of the _Mode object I think the repetition is good, because there are two possible string literals for each mode, (e.g. 'RDWR' and 'rw') and here it would really be confusing to get one explicitly and the other one from the variable name.

However, both approaches obscure the meaning of our constants somewhat.

I was hoping it would make the meaning of the constants clearer!

What is more obscure: WAV or 65536?

The reader will wonder in what way that _Format thing is behaving differently from the numeric constant, and what we are trying to express by introducing the _Format instead of the number.

I think it's reasonable to assume some knowledge of Python from a reader of the source code.
And I think it's quite easy to understand:
It's derived from int, so it should basically behave like one, except the overridden behavior, namely __new__ and __repr__, meaning that it is created differently and represented differently.
And that's exactly what we're trying to express.

Also, while we're on the topic of slinging acronyms, I would like to invoke KISS, as an argument for plain old ints. YAGNI, also, as a guard against overengineering ;-)

Yes, plain old ints are definitely an option.
But I think we would loose some convenience for users:

Specify formats as strings:

f = sf.open("myfile.wav", sf.WRITE, 44100, 2, 'PCM_24')

Error checking on positional arguments:

f = sf.open("myfile.wav", sf.WRITE, 44100, sf.PCM_24)

In this case no error would be raised and myfile.wav would be a 3-channel WAV file of type sf.PCM_16!

sf.available_formats() would show only the numbers and users would actually have to use plain numbers further on.

That's not a serious suggestion, but we could follow KISS even further and not provide the named constants at all.

I think we should try to find the right balance between convenience for the user and clean-ness of the implementation.

@bastibe
Copy link
Owner

bastibe commented Apr 9, 2014

I have been thinking a lot about this. On the one hand, we need numeric constants for libsndfile. On the other hand, we would like to have some human-readable representation of them.

If numeric constants are hard to use, let's not use them. Let's use string constants. Their meaning is clear. They can be easily translated to ints using a dict in the SoundFile constructor. The only trouble is, we can't or them together. Instead, we can create a class like this:

class Format:
    def __init__(self, format, subtype=None, endian=FILE)
        if subtype is None:
            subtype = _default_subtypes[type]
        self.type = format
        self.subtype = subtype
        self.endian = endian

I think this is a much cleaner solution, and it sidesteps most of our previous discussion. The Format class also alleviates all the problems with bit masks and oring.

Better yet, Format could implement a method that returns the numeric constant. This would keep all the error-checking and translation dicts out of SoundFile.

What do you think about this proposal?

@bastibe
Copy link
Owner

bastibe commented Apr 9, 2014

As for using ints vs. using classes, I think Python itself mostly sticks to ints (see os.SEEK_SET) or strings (see open()).

If we were to use a class, we should at least output something like <Format: WAV> instead of WAV to avoid confusion with strings. Also, I still very much dislike search globals(), since it gives variable names meaning. This is a huge code smell. In effect, it uses globals() as a dictionary to store some mapping data.

@bastibe
Copy link
Owner

bastibe commented Apr 9, 2014

Anyway, if this discussion continues, we should probably split it off into its own issue, and merge the rest.

@mgeier
Copy link
Contributor Author

mgeier commented Apr 10, 2014

Using string literals is definitely one of our options, I don't understand the proposed Format class, though.

Shall it be exposed to the user?
Could you please give a hypothetical usage example?

@bastibe
Copy link
Owner

bastibe commented Apr 10, 2014

I propose that we use strings for all the constants. When we want to open a SoundFile, we provide a Format, like f = SoundFile('file.wav', mode='w', format=Format('WAV', 'PCM_24')) or f = SoundFile('file.wav', mode=WRITE, format=Format(WAV, PCM_24)). Of course, since the format can be deduced from the file extension, it can be omitted. Also, it might be a nice touch to allow format='WAV' as well.

Thus, format=Format('WAV', 'PCM_24') is effectively the same thing as format=WAV|PCM_24. I think that it is much more readable and easier to understand though. In particular, the fallback to default subtypes and endianness is probably much easier to understand in the Format constructor than after applying bit masks to the format integer.

@bastibe
Copy link
Owner

bastibe commented Apr 10, 2014

Also, Format could provide a method that returns and validates the numeric sndfile format:

# part of class Format:
def snd_format(self):
    types = {
        'wav': 0x010000,
        'aiff': 0x020000,
         ...
        'rf64': 0x220000
    }
    if not self.format.lower() in types:
        raise ValueError('invalid type')
    format = _TYPEMASK & types[self.format.lower()]
    subtypes = {
        'pcm_s8': 0x0001,
        'pcm_16': 0x0002,
         ...
        'vorbis': 0x0060 
    }
    if not self.subtype.lower() in subtypes:
        raise ValueError('invalid subtype')
    format |= _SUBMASK & subtypes[self.subtype.lower()]
    endians = {
        'file': 0x00000000,
        'little': 0x10000000,
        'big': 0x20000000,
        'cpu': 0x30000000 
    }
    if not self.endian.lower() in endians:
        raise ValueError('invalid endian')
    format |= _ENDMASK & endians[self.endian.lower()]
    if not format_check(format):
        raise ValueError('Invalid combination of format, subtype and endian')
    return format

@mgeier
Copy link
Contributor Author

mgeier commented Apr 12, 2014

I'm slowly getting used to the idea of using string literals, but I wouldn't expose the Format class to the user (and probably the class is not needed at all).

Having to instantiate the class is unnecessarily complicated. I would stay with the three separate arguments format, subtype and endian.
The ORing together (e.g. WAVEX | PCM_24) was a nice feature for named numeric constants, but I think we should completely drop it if we change to string literals.

Here are some calls which I would like to support:

f = SoundFile('file.wav', 'w', 44100, 2)
f = SoundFile('file.wav', 'w', 44100, 2, 'PCM_24')
f = SoundFile('file.wav', 'w', 44100, 2, format='WAVEX', subtype='PCM_24')
f = SoundFile('file.wav', 'w', 44100, 2, format='WAVEX', subtype='PCM_24', endian='FILE')
f = SoundFile('file.wav', 'w', 44100, 2, format='WAVEX')

The implementation of this should be straightforward and would indeed sidestep many of the problems we discussed in this issue.

When changing to string, another question comes up: should we use uppercase or lowercase as default representation?
I would support both for input, but the strings returned from functions and the strings used in the documentation should be consistently either all upper or all lower.

Another question: should we also kick out the numeric constants for the open mode?
I think it would make sense to use the same method for open mode and file formats.
If yes, should we allow only 'r'/'w'/'rw' or also 'READ'/'WRITE'/'RDWR' or something entirely different?

@bastibe
Copy link
Owner

bastibe commented Apr 14, 2014

My idea with Format was to expose the notion of the format/subtype/endian-triplet to the user. But you are correct, what with default subtypes and default endianness, we can just wholesale skip that concept. Another step towards pythonicity!

Regarding capitalization, I would go for UPPER_CASE, because that is the way libsndfile named the constants. Should we provide variables for the strings as well? We probably should, if only to have a place for documenting them in the code. On the other hand, there will be a dictionary for translating strings to numbers anyway, so this is a bit of duplication there.

As for open mode, I would choose 'r'/'w'/'rw', simply because it is the easiest to type and the way open does it.

@mgeier
Copy link
Contributor Author

mgeier commented Apr 18, 2014

Should we provide variables for the strings as well?

Do you mean 'artist', 'copyright' etc.?

If yes, I think a dictionary should suffice. I created named variables for each string in my original proposal only for symmetry reasons. But with changing to format strings this is not necessary anymore.

As for open mode, I would choose 'r'/'w'/'rw', simply because it is the easiest to type and the way open does it.

I also think it's the easiest, Python's open() doesn't have 'rw', though.
But I guess with proper documentation this shouldn't cause too much confusion ...

@bastibe
Copy link
Owner

bastibe commented Apr 19, 2014

Alright, let's do it!

@mgeier mgeier mentioned this pull request Apr 19, 2014
@mgeier
Copy link
Contributor Author

mgeier commented Apr 19, 2014

The commits in this PR are quite confusing, so I started a new PR with part of the changes: #30.

If that's OK, I'll continue with further PRs for the rest of the discussed changes.

@bastibe bastibe closed this Apr 19, 2014
@bastibe
Copy link
Owner

bastibe commented Apr 19, 2014

Agreed!

bastibe added a commit that referenced this pull request Apr 20, 2014
A lot of smaller improvements based on #18, #22, and #30.
mgeier added a commit that referenced this pull request Apr 20, 2014
This commit holds several changes which were discussed in #22 (which was
itself based on #18).

Change handling of file formats:
File formats are handled as three simple strings: format, subtype and endian.

Add "which" argument to seek().
This is needed because the combination with logical or (e.g. SEEK_SET | READ)
isn't possible with string formats.

Update frames counter on write().

Hide all non-API names in module namespace (by prefixing _).
This is relevant for tools that do introspection, e.g. for
auto-completion in IDEs.

Support sf_open_fd(), remove obsolete argument virtual_io
The first constructor argument can now be:
- a string (path to a file)
- an int (a file descriptor)
- a file-like object
The argument virtual_io is not needed, because this can be detected
automatically.
Closes #19

Get file extension from 'name' attribute of a file-like object.

Change public attributes of SoundFile class to properties.

Add 'name' property.

Add properties 'format_info' and 'subtype_info'.

Proper handling of dtype's in read() and write().

Add function default_subtype().
@mgeier mgeier deleted the many-changes branch June 18, 2014 08:12
mgeier added a commit that referenced this pull request Dec 17, 2014
This commit holds several changes which were discussed in #22 (which was
itself based on #18).

Change handling of file formats:
File formats are handled as three simple strings: format, subtype and endian.

Add "which" argument to seek().
This is needed because the combination with logical or (e.g. SEEK_SET | READ)
isn't possible with string formats.

Update frames counter on write().

Hide all non-API names in module namespace (by prefixing _).
This is relevant for tools that do introspection, e.g. for
auto-completion in IDEs.

Support sf_open_fd(), remove obsolete argument virtual_io
The first constructor argument can now be:
- a string (path to a file)
- an int (a file descriptor)
- a file-like object
The argument virtual_io is not needed, because this can be detected
automatically.
Closes #19

Get file extension from 'name' attribute of a file-like object.

Change public attributes of SoundFile class to properties.

Add 'name' property.

Add properties 'format_info' and 'subtype_info'.

Proper handling of dtype's in read() and write().

Add function default_subtype().
mgeier added a commit that referenced this pull request Dec 17, 2014
Change argument name fObj to file.

Return dict directly instead of creating a local variable.

This was part of #22.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants