Skip to content

Add multivalue genre tag support#2503

Closed
antlarr wants to merge 31 commits intobeetbox:masterfrom
antlarr:multivalue-genre-tag
Closed

Add multivalue genre tag support#2503
antlarr wants to merge 31 commits intobeetbox:masterfrom
antlarr:multivalue-genre-tag

Conversation

@antlarr
Copy link
Copy Markdown
Contributor

@antlarr antlarr commented Apr 4, 2017

These changes allow beets to store multiple genre tags for multivalues, with the intent of fixing #505 . Note that this is not ready to be merged yet, but it's a beginning of discussion on how to properly fix the issue.

Note that these commits store lists in album.genre and item.genre properties, while storing them as text strings in the database using a separator. Also, I checked that if a user has something like this as path definition:

paths:
    default: $albumartist/$genre/$album%aunique/$genre/$track - $title

then $genre is formatted as a string correctly . In my test case, creating a directory called "Rock;Blues" .

This can break some complex configurations. For example, I had this as album_fields definition to add a album_genre_for_path tag that would contain the most used genre in an album:

album_fields:
    album_genre_for_path: |
        if 'album_genre' in globals():
            return album_genre.title()
        if hasattr(items[0], 'album_genre'):
            return items[0].album_genre.title()
        genres = {}
        cnt, candidate = 0, ''
        for item in items:
            genres[item.genre] = genres.get(item.genre, 0) + 1
            if genres[item.genre] > cnt:
                cnt, candidate = genres[item.genre], item.genre.title()
        return candidate

When evaluating this tag, beet broke sine item.genre is no longer hashable. I think this is a step better than silently breaking the directory structure. I fixed it easily by replacing the code above with:

    album_genre_for_path: |
        if 'album_genre' in globals():
            return album_genre.title()
        if hasattr(items[0], 'album_genre'):
            return items[0].album_genre.title()
        genres = {}
        cnt, candidate = 0, ''
        for item in items:
            if isinstance(item.genre, list):
                values = item.genre
            else:
                values = [item.genre]
            for genre in values:
                genres[genre] = genres.get(genre, 0) + 1
                if genres[genre] > cnt:
                    cnt, candidate = genres[genre], genre.title()
        return candidate

Then I also tested what happens when reading a file written by another application with multiple genre tags:

Python 3.6.1 (default, Mar 23 2017, 13:04:44) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import beets.mediafile
>>> mf=beets.mediafile.MediaFile('01 - Big Brother & the Holding Company -  Bye, Bye Baby.flac')
>>> mf.genre
['Rock', 'Album Rock']

Which is what I would have expected.

@mention-bot
Copy link
Copy Markdown

@antlarr, thanks for your PR! By analyzing the history of the files in this pull request, we identified @sampsyo, @Kraymer and @brunal to be potential reviewers.

@antlarr
Copy link
Copy Markdown
Contributor Author

antlarr commented Apr 4, 2017

I tested this with flac and mp3 files (that is, with APE tags, where multiple genre tags are written and ID3 tags, where \0 bytes are used to separate genres in a single ID3 tag). Everything works as expected due to ListMediaField (btw, congratulations to whoever implemented it and the *ListStorageStyle classes, they are really nicely done)

The remaining questions are:

  • Should a global parameter be added to the configuration? If yes, should it be named multivalue_separator or ... ? Also should it default to ';' ? '; ' ? ', ' ? This should be also set as the default value for tmpl_first, so it behaves correctly when handling the genre tag.
  • What to use for queries in the StringList type? I think using SubstringQuery is probably "good enough", but that can be discussed as it could return some unexpected matches . I'm thinking on someone querying for "ck;Bl" for example, but on the other side, maybe someone would like to query for "Rock;Blues" to get items with both values (although that wouldn't match genres stored as "Blues;Rock" which should be considered ).
  • Should a config option be added (in the mood of id3v23), like ... use_single_tags_with_separator (defaulting to False) just in case any user prefers to use the current behavior in case (s)he is using the music files with another application that doesn't support multivalue genre tags yet?

Copy link
Copy Markdown
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! This is diff is surprisingly small—I expected this to be a far more disruptive change.

A couple of ideas come to mind for improving backwards compatibility, neither of which we necessarily have to use:

  • We could keep a field called genre that just gets the joined string, and add a new genres field for the list. That's what MediaFile previously did.
  • We could somehow allow a string to be assigned into the genre field, which would then be treated as a single-element list.

The only thing that makes me somewhat uncomfortable here is the age-old slippery slope question. How many other fields would it make sense to transform into lists? Should artist be a list? albumartist? If we make too many changes like this, will we disrupt simpler code that just cares about a single value?

Anyway, genre is clearly the best use case for this, so maybe it's not worth worrying too much about other tags.

@antlarr
Copy link
Copy Markdown
Contributor Author

antlarr commented Apr 5, 2017

Btw, I guess you'll doubt that getall('TIPL') can be replaced with get('TIPL') (mainly because I doubted about it too :) ). So here is the test I did with mutagen to check that it doesn't allow two TIPL frames:

>>> import mutagen
>>> m=mutagen.File('full.mp3')
>>> m.tags.get('TIPL')
>>> frame=mutagen.id3.Frames['TIPL'](encoding=mutagen.id3.Encoding.UTF8, people=[['aa','bb'],['cc','dd']])
>>> m.tags.add(frame)
>>> m.save()

Here, I checked with okteta (an hex editor) that the file included the added TIPL frame.

>>> frame2=mutagen.id3.Frames['TIPL'](encoding=mutagen.id3.Encoding.UTF8, people=[['ee','ff'],['gg','hh']])
>>> frame
TIPL(encoding=<Encoding.UTF8: 3>, people=[['aa', 'bb'], ['cc', 'dd']])
>>> frame2
TIPL(encoding=<Encoding.UTF8: 3>, people=[['ee', 'ff'], ['gg', 'hh']])
>>> m.tags.add(frame2)
>>> m.tags.getall('TIPL')
[TIPL(encoding=<Encoding.UTF8: 3>, people=[['ee', 'ff'], ['gg', 'hh']])]
>>> m.save()

and here, I checked again with an hex editor that the frame wasn't just added, but it replaced the previous TIPL frame.

Copy link
Copy Markdown
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things are coming along nicely! Here are a few more comments on the latest.

if value is not None:
if value:
formatted[key] = u'; '.join(value)
elif value is not None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, for what it's worth, is one of the things that worries me about making arbitrary fields into lists: now client code will frequently need to have boilerplate for the form if isinstance(value, list):. This sort of thing puts additional weight on the side of having an explicit way to access lists, or some other foolproof conversion to produce a scalar value.

Copy link
Copy Markdown
Contributor Author

@antlarr antlarr Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you mean. Note that list fields (genre, arranger ... ) now are always lists, the problem comes with code like that which tries to work with all fields the same way. Would you prefer that I create a class that inherit from lists and reimplements __str__ so it returns '; '.join(self) ? That way code like that could be changed to:

if value:
    formatted[key] = str(value)

But still, that's not perfect neither.

this style.
"""
self.set_list(mutagen_file, [value])
if value is None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's odd—I'm a little surprised that ListStorageStyle is allowed to return None from get. That seems like it makes things more complicated for clients in general. Maybe that's worth fixing?

Copy link
Copy Markdown
Contributor Author

@antlarr antlarr Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I'd say it's ok. get is implemented as:

          try:
              return self.get_list(mutagen_file)[0]
          except IndexError:
              return None

So None is returned when get_list returns an empty lists, that is, when the tag doesn't exist in the file. This allows to identify when a tag doesn't exist and allows to remove a tag from a file. Do you think that it would be better to return an empty string and always interpret an empty value as a non-existant tag? (so setting ... say artist to '' would delete the tag entry)

@antlarr
Copy link
Copy Markdown
Contributor Author

antlarr commented Apr 5, 2017

I just noticed I had this written in a browser tab but forgot to click comment. I'll post it for the records, even if it's probably old now.

Very cool! This is diff is surprisingly small—I expected this to be a far more disruptive change.

Yes, I also was surprised, but most of the work was already implemented in ListMediaField, the *ListStorageStyle classes and mutagen.

A couple of ideas come to mind for improving backwards compatibility, neither of which we necessarily have to use:

We could keep a field called genre that just gets the joined string, and add a new genres field for the list. That's what MediaFile previously did.

Yes, I saw that, but the genres tag is (afaik) not used by any other application, but genre is.

We could somehow allow a string to be assigned into the genre field, which would then be treated as a single-element list.

Have a look at the test_write_genre_list_from_string test I added in test_mediafile.py to test fad5809 :). I think that's even better since it allows assigning a string but automatically converts it to a list.

The only thing that makes me somewhat uncomfortable here is the age-old slippery slope question. How many other fields would it make sense to transform into lists? Should artist be a list? albumartist? If we make too many changes like this, will we disrupt simpler code that just cares about a single value?

Hmm, I don't think that would be too much work. I can give it a try if you want. (This is now done)

def negated_clause(self):
clause, subvals, table = self.col_clause()
_table = self.model_cls._table
primary_key = '%s.id' % (_table)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampsyo Do you think it's worth it to add a get_primary_key_field to Model so instead of hardcoding id I call self.model_cls.get_primary_key_field() which would iterate over its fields and search which has a PRIMARY_ID type? I think that would be more elegant, but I'm not sure if it's worth the effort.

query = query or TrueQuery() # A null query.
sort = sort or NullSort() # Unsorted.
where, subvals = query.clause()
where, subvals, new_tables = query.clause()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampsyo should I still allow Query objects to maybe return 2-tuples just in case any external plugin still uses the old API? It's easy to check for unpack errors but it's a bit ugly, so I preferred to ask your opinion.

@antlarr antlarr force-pushed the multivalue-genre-tag branch 3 times, most recently from dcf1c93 to 72ab5c9 Compare April 12, 2017 16:17
antlarr added 17 commits May 30, 2017 18:37
A StringList type is a list of strings that can be formatted as a string
(using a separator character/string) to be stored in the database or used
as path component.

By default, I used ';' as separator, but I guess that should be made
configurable through a global config parameter (though I guess most people
would use ';' or ', ').

Also, probably query should be something different from
query.SubstringQuery, but that can be changed later.
genres is removed, but that's not standard, nor needed anymore.
This removes the need for a separator parameter, which now should
be made global.
…s separator

- Define a null value as []
- Parse '' as the null value (and not as ['']).
- Use the same separator ('; ') as tmpl_first does by default, so tmpl_first
keeps working as usual.
Allow to set item.genre = u'Rock; Blues' and automatically convert that
to a string, so we keep better backward compatibility. Also added a test
for that
Make to_sql and from_sql return/accept None values
Convert it to a string before adding the genres list to a string
ListStorageStyle.get can return None, so set should accept it
MP3ListStorageStyle.fetch can return [], so store should accept it.
MP3PeopleStorageStyle now gets/sets lists of people.

Note that MP3PeopleStorageStyle should only be used with TIPL tags
so there's no need to use getall, since it'll always return only one
element (or None).
Also remove an unneeded piece of code I added in a previous commit.
antlarr added 13 commits May 30, 2017 18:42
…more

Use the `multivalue_separator` config option whenever splitting or
joining multivalue fields
This adds a very simple TupleMediaField class that inherits
ListMediaField and just converts the result of __get__ to a tuple
since ListMediaField is also used for ImageListField, and I didn't
know if that should be changed to tuples too.

Changed all tests to use tuples accordingly.
The StringList type now uses json to store the list of strings in the database.

Also, since sampsyo expressed he tried to avoid using database scheme
versioning, I added a migration method that I think should be quick enough
to be run on every start (although I think I still would prefer to use
versioning, which would allow for this to be run once). This iterates on
every entry in a StringList field that sqlite recognizes as "not valid json"
 and converts it to json. After the database is migrated once, this should
return 0 rows so the migration would take only the time it takes sqlite
to test JSON_VALID() on StringList types.
There may be cases where genre contains values like `"Symphonic Rock"`
since `JSON_VALID('"Symphonic Rock"')` is True for sqlite, we have
to check too that the json value is actually an array.
The new JSonSubstringListQuery class implements a SubstringQuery over a
StringList type using json sqlite functions in the database. For this,
I had to change the clause() method return tuple to add a new
element which is a list of tables to add to the FROM in the final
sql expression. So now, JSonSubstringListQuery returns something like:

```
("json_genre1.value like ? escape '\\'",
 ('Rock',),
 ('json_each(genre) as json_genre1',)
)
```

Since json_each is a table that has a column named id, This would
collide with the id column from items/albums, so I changed
Database.fetch from `SELECT * ...` to
`SELECT DISTINCT items.id, items.title, items.artist, items.genre ...`
I added a DISTINCT since using multiple tables can repeat results.

Now it's possible to do things like:
`beet ls genre:Classic`
which returns Classical music as well as 'Classic Rock' music
`beet ls genre:Classic genre:Rock`
which returns music tagged as Classic AND Rock (or Classic Rock)
`beet ls genre:Classic , genre:Rock`
which returns all Classical music and all Rock music

All of that, using json data, so:
beet ls 'genre:["Rock'
doesn't return any result.
Negating json sql queries is not as simple as prefixing the where clause
with NOT, so now NotQuery first checks if the subquery has a
negated_clause function, and uses it if possible.

I implemented JSonSubstringListQuery.negated_clause to return
expressions like:

```
items.id NOT IN (SELECT items.id
                   FROM items,
                        JSON_EACH(items.genre) AS json_genre1
                  WHERE json_genre1.value LIKE 'Rock')
```

This allows to run commands like:

`beet ls genre:Blues ^genre:Rock`
Add a new JSonRegexpListQuery class that implements regular expressions
queries on StringList fields.

To use it without hardcoding any class names, I had to add more
flexibility to the query `prefixes` dictionary. Now, a type defines not
only what query class is used with it, but what query class is used for
regular expressions and the `prefixes` dictionary not only associates
each prefix with a Query class, but can associate a prefix with a
function that gets a field name and returns a Query class to use.
Sqlite includes the json extension since version 3.9.0 and there aren't
many distributions that don't have it, but there are still a few
important distributions that have sqlite < 3.9.0 (Debian stable,
centOS, openSUSE Leap and Ubuntu Trusty), so in those cases, fallback
to use equivalent python code, which is slower, but at least we don't
leave any distribution behind.

Btw, Arch, Debian stable backports|testing|unstable|experimental, Fedora
24|25|26|Rawhide, FreeBSD, Gentoo, Mageia, Manjaro, openBSD, openSUSE
Tumbleweed and Ubuntu 16.04 LTS|16.10|17.04 already contain at least 3.11.0.
`multivalue_separator` is no longer used to store multivalue fields
in the database, so remove the sentence that says that.
If a new field is added to the model, the migration happens before
it's added, so it can try to recover the values of a non-existing column.
In that case, just ignore the error since there's obviously nothing
to migrate there.
@antlarr antlarr force-pushed the multivalue-genre-tag branch from 7b0b7fc to 06f7c04 Compare May 30, 2017 16:42
@arcresu
Copy link
Copy Markdown
Member

arcresu commented May 19, 2019

@antlarr thanks for your work on this PR. It's been a while since it's had any attention, but it seems that multi-valued tags are still a feature that people would like to see!

Just a heads up that we're planning on splitting MediaFile out into a standalone Python module & git repository during the current beets release cycle, and this is clearly going to affect this PR. Your branch is anyway conflicting with master already, but after the split it'll mean that the changes will also need to be split into two separate sets. Are you still interested in working on this? If so it might be sensible to start with the MediaFile changes at beetbox/mediafile.

@jtpavlock
Copy link
Copy Markdown
Contributor

Closing as no response from the original author. There's still a lot of good info to be gained from this, but any further implementation ideas should be directed to #505. Also, I think it'd be ideal to first implement the basis for multi-value tags, and then add genres as an additional PR after.

@jtpavlock jtpavlock closed this Jul 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants