Use MB lookup to resolve ambiguous artist names #3862
|
We should be careful with this.
What is the exact issue you are trying to solve here? |
|
Just had another new edge case where a user has the artist "Above & Beyond presents OceanLab". We don't have "presents" in FEATURING_SPLITTERS, so the parsing failed. Whilst I have improved things by comparing the number of MBIDs to the number of parsed artists, it is still fragile. If the number of MBIDs doesn't equal the number of parsed artists, we still pull the incorrect artists into the database and log a warning, which is better but not ideal. I just feel that if we have the MBIDs we could guarantee getting the artist names right and also resolve any naming, spelling, language, or diacritics ambiguities. I agree that this will increase the number of calls, but only for new additions to people's libraries and only once, when the track is first added to the database. A further mitigation is that the MBID lookup is cached for 30 days, so a user with 50 Beatles tracks does 1 lookup, not 50. |
|
Here is a good example from a classical album I have: Artist: Pyotr Ilyich Tchaikovsky |
|
We should prevent doing a lookup if the MusicBrainz tags are already present.
And this is exactly what worries me. Local libraries are potentially very large, so this may result in tens of thousands of calls when scanning an initial library. That is a lot of stress for a free service. What we can potentially do is skip the lookup entirely if the artists tag is already present and its count matches the number of MBIDs. |
abf7d6c to 140ea9b
|
Fair. I have switched it as you suggested, so the lookup now only runs on a mismatch between the number of artist MBIDs and the number of parsed artist names. There is still the problem of poor artist name tagging, where the first, potentially incorrect, name is persisted when additional tracks are added. I thought we could do the name lookup when the user runs UPDATE METADATA or REFRESH ITEM on an artist. That gives the user a built-in path to fix this, and the existing 30-day MB cache means repeat clicks won't make repeated API calls. Thoughts on this idea? |
Two parser improvements for multi-artist resolution:
1. Add " presents " to FEATURING_SPLITTERS so single ARTIST tag strings
like "Above & Beyond presents OceanLab" split correctly instead of
silently being mis-split on the inner ampersand.
2. When the parsed artist count doesn't match the MusicBrainz Artist ID
count, the filesystem_local resolver looks up canonical names via
the new MusicbrainzProvider.resolve_artists_from_mbids method. Failed
individual lookups are dropped rather than mapped back to a tag name
by position (unsafe when counts already disagree); if every lookup
fails, the resolver falls back to the tag-parsed names so the track
still gets stored. When counts already match, no lookup runs.
The mismatch warnings move out of tags.py into the resolver, where
they can report what actually happened.
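To see why the splitter addition in item 1 is needed on its own, here is a toy illustration. It is not the project's real parser: `split_artists`, the contents of both tuples, and the first-match strategy are assumptions made purely for this example. It shows that without " presents " the string splits on the inner ampersand into two names, so a count comparison against two MBIDs would pass and no lookup would run.

```python
# Hypothetical splitter lists; the real FEATURING_SPLITTERS contents may differ.
FEATURING_SPLITTERS_BEFORE = (" feat. ", " ft. ", " featuring ")
FEATURING_SPLITTERS_AFTER = (*FEATURING_SPLITTERS_BEFORE, " presents ")
GENERIC_SPLITTERS = (" & ",)  # assumed lower-priority fallback


def split_artists(raw: str, featuring_splitters: tuple[str, ...]) -> list[str]:
    """Split on the first splitter found, trying featuring splitters first."""
    for splitter in (*featuring_splitters, *GENERIC_SPLITTERS):
        if splitter in raw:
            left, _, right = raw.partition(splitter)
            return [left.strip(), right.strip()]
    return [raw.strip()]


tag = "Above & Beyond presents OceanLab"
before = split_artists(tag, FEATURING_SPLITTERS_BEFORE)
after = split_artists(tag, FEATURING_SPLITTERS_AFTER)

print(before)  # ['Above', 'Beyond presents OceanLab']
print(after)   # ['Above & Beyond', 'OceanLab']
# Both results contain two names, so a count check alone cannot tell them apart.
```

Because both splits yield two names, only the splitter change fixes this case; the MBID lookup never gets a chance to correct it.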
Out of scope for this PR:
- First-write-wins persistence of misspellings ("Tchaikovsky" vs
"Pyotr Ilyich Tchaikovsky"). The count-match short-circuit means the
mismatch trigger doesn't help here; this needs a separate user-
triggered "refresh canonical names" action so the MB load is opt-in.
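The count-match short-circuit and the mismatch fallback described in the commit message above can be sketched roughly as follows. This is a simplified, synchronous illustration rather than the code in this PR: `resolve_artist_names` and the `lookup_name` callable are hypothetical stand-ins for the resolver and for a cached MusicBrainz lookup, and the MBIDs in the usage example are fake placeholders.

```python
from __future__ import annotations

import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)


def resolve_artist_names(
    parsed_names: list[str],
    mbids: list[str],
    lookup_name: Callable[[str], str | None],
) -> list[str]:
    """Pick the artist names to store for one track.

    lookup_name is a hypothetical callable returning the canonical name
    for an MBID, or None when the lookup fails.
    """
    # No MBIDs, or counts agree: trust the tag-parsed names, no lookups.
    if not mbids or len(mbids) == len(parsed_names):
        return parsed_names

    # Counts disagree: resolve each MBID and drop failures, since mapping
    # a tag name back by position is unsafe in this situation.
    resolved = [name for mbid in mbids if (name := lookup_name(mbid)) is not None]

    if not resolved:
        # Every lookup failed: keep the tag-parsed names so the track is stored.
        logger.warning("All MBID lookups failed, using parsed names %s", parsed_names)
        return parsed_names
    if len(resolved) < len(mbids):
        logger.warning("Resolved %d of %d MBIDs", len(resolved), len(mbids))
    return resolved


# Usage with fake MBIDs and an in-memory dict standing in for MusicBrainz.
canonical = {"mbid-1": "Above & Beyond", "mbid-2": "OceanLab"}
names = resolve_artist_names(
    ["Above & Beyond presents OceanLab"], ["mbid-1", "mbid-2"], canonical.get
)
print(names)  # ['Above & Beyond', 'OceanLab']
```

Passing the lookup in as a callable keeps the sketch testable without network access; a real implementation would presumably await the provider and rely on its 30-day cache.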
140ea9b to 09007cd
Resolving multiple artist names has been a perennial problem. In my most recent adjustment to the logic I moved to using the MBID count to identify when the heuristic split did not match the expected number of artists. That didn't solve the problem, but at least made it visible. This PR takes the next step: when that count mismatch is detected, use the MBIDs to look up canonical names from MusicBrainz instead of just logging a warning and going with the wrong split.
So this PR adds:
- When MusicBrainz Artist IDs / Album Artist IDs are present in tags and the parsed artist count does not match the MBID count, resolve canonical names from the MusicBrainz API and use those instead of the heuristic split. This applies to both track artists and album artists.
- When counts already match, the parsed names are trusted as before, so cleanly tagged libraries make no network calls.
- MusicbrainzProvider.get_artist_details is cached for 30 days, so repeat MBIDs across tracks are effectively free.
- Failed individual MBID lookups are dropped rather than substituted from the tag-parsed names, because matching by position is unsafe when the counts already disagree. If every lookup in a track fails, fall back to the tag-parsed names so the track still gets stored with something.
- " presents " is added to FEATURING_SPLITTERS to handle "Above & Beyond presents OceanLab" and similar. The current heuristic produces the correct count (2) on that string but splits on the wrong boundary, so the count-mismatch check would not catch it and the MB lookup would never fire. The splitter addition is therefore needed independently.