Issues with German umlaut (when matching The 100 episodes by title) since latest release (4.9.2)


Post by IfThenERROR »

Dear all,

Since the latest update I have been having problems whenever a title contains a German umlaut. Up until the previous release, the format expression '.ascii()' replaced umlauts with the corresponding plain vowels (ä to a, ö to o, etc.). The internal matching process apparently used the same logic when fetching data from TheTVDB.

Now the umlauts are replaced by two-letter transliterations (ä to ae, …), not only in the format expression, but in the matching as well. This is orthographically correct, but differs from the default behaviour on Linux and in other programs such as tvheadend. As a result, version 4.9.2 more often than not fails to find the correct result where 4.9.1 found the match. Is there a way to change this behaviour in the settings, or do I have to roll back to 4.9.1 until this is fixed?
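
To make the two conventions concrete, here is a minimal Python sketch. It is not FileBot's implementation of String.ascii(); the function names and the mapping table are made up for illustration only.

```python
import unicodedata

# Hypothetical table for the German two-letter convention (ä -> ae, ...).
GERMAN_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def fold_strip_marks(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks: ä -> a."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold_german(text: str) -> str:
    """Replace each umlaut with its two-letter transliteration: ä -> ae."""
    return "".join(GERMAN_DIGRAPHS.get(ch, ch) for ch in text)

title = "Zwei in einem Körper"
print(fold_strip_marks(title))  # Zwei in einem Korper
print(fold_german(title))       # Zwei in einem Koerper
```

A filename produced with the first convention will not be an exact match for a title folded with the second one, which is the mismatch described above.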

Best regards!

Post by rednoah »

Option A:
Consider not using String.ascii() unless you have a specific reason to. Unicode characters in filenames shouldn't be a problem. Why not leave the German umlauts in? That'll make it easier to exactly match the filename to the online database entry.


Option B:
You could just customize your format and replace characters according to your specific preferences and requirements:

Code: Select all

n.replace('ä', 'a')

Option C:
You may also consider contacting the tvheadend developers and asking them to improve support for German umlauts and/or Unicode characters in filenames.


:idea: Note that there is no such thing as a default Linux behaviour that would replace specific code points or diacritic marks in file paths, because on Linux file paths are just a sequence of bytes and not characters, and so the file system itself is completely unaware of what character encoding may have been used to encode a given byte sequence. This can lead to incompatibilities if you have different user-space processes configured with different file system encodings: https://stackoverflow.com/a/38951058/1514467 (TL;DR if you configure everything with LC_ALL=en_US.UTF-8 then things will just work)
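
The "bytes, not characters" point can be illustrated with a short Python sketch. The filename is just an example; nothing here is specific to FileBot or tvheadend.

```python
name = "Körper.mkv"

# The same characters become different byte sequences under different encodings.
utf8_bytes = name.encode("utf-8")      # ö is stored as the two bytes C3 B6
latin1_bytes = name.encode("latin-1")  # ö is stored as the single byte F6
print(utf8_bytes)
print(latin1_bytes)

# A process that assumes Latin-1 while reading UTF-8 bytes sees mojibake,
# because the file system only hands over bytes and does not track the encoding.
misread = utf8_bytes.decode("latin-1")
print(misread)
```

On Linux, the two byte sequences are simply two different file names as far as the file system is concerned.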

Post by IfThenERROR »

Hi Rednoah,

thanks for the quick response!

Option B is what I already did for the format expression, so the generated filenames are the same as they used to be. But this doesn't solve the main issue of finding no matches. Is there a way to tell FileBot how to process the filenames for searching?

Post by rednoah »

Well, if FileBot is having trouble identifying a given file name, then I would need that file name please, and logs if you're using the CLI.


:idea: In any case, there's no way to configure the ä->ae thing. Note that String.ascii() is a format thing and thus generally unrelated to identification. Might be related. Might not be related. Can't say without a specific test case for debug purposes. Lots of alternate spellings work just because somebody added an alias already.


:idea: Please read How to Request Help.

Post by IfThenERROR »

In this case I'm using the GUI.

A sample filename is "The 100 - Zwei in einem Korper.final.mkv". FileBot correctly suggests the series from TheTVDB, but can't find the right episode. The matched episode isn't even remotely similar. Another episode in the same directory whose name has no umlaut is matched correctly.

Debug log is at https://pastebin.com/EKGx4gMy

As I reported, it's the same issue whenever a title contains an umlaut that has been replaced by a plain vowel. The same format worked until v4.9.1.

Post by rednoah »

I see. So we're talking specifically about "matching by episode title in German" (that was not clear at all, so please always include screenshots 🙏). That's always a tricky one. There are always corner cases that don't work well. It's very common for the title to be off by a few characters. That doesn't necessarily break matching, but it's not an exact match, and sometimes things just don't work out.


:idea: In this case, I guess it's just matching the most recent episode for lack of a better option. If you're using the GUI, you can always do Double-Click -> Edit Match -> Select correct Episode match to fix the match manually if necessary.


:!: We could revert the changes, but that would just break things for files that are spelled the other way around, so I'd rather make sure we get the correct spelling right. Additionally, your filenames saying Korper instead of Körper (why not keep the ö in the filename?) seem to be something specific to your setup, and I'm not sure how many other users would benefit if we were to prioritize your use case over others.



EDIT:

:!: :!: :!: Note that the issue seems to be rather specific to The 100, because it's the 100 in the series name that causes things to go wrong rather than the o/ö bit, although the latter can sometimes come into play and save things if the former goes wrong:

Code: Select all

filebot -rename *.mkv --q "The 100" --db TheTVDB -non-strict --lang German --action TEST --log INFO
[TEST] from [Zwei in einem Korper.final.mkv] to [The 100 - 6x07 - Zwei in einem Körper.mkv]


EDIT 2:

:lol: :lol: :lol: Turns out that The 100 - 7x16 - The Last War is Episode 100 so that kinda matches the 100 number in the filename. Kind of a generic bug, but also won't really affect many shows other than The 100.

Fixed with FileBot r8104.

Post by IfThenERROR »

rednoah wrote: 24 Oct 2020, 04:50 :idea: In this case, I guess it's just matching the most recent episode for the lack of a better option. If you're using the GUI, then you can always do Double-Click -> Edit Match -> Select correct Episode match to fix the match manually if necessary
Wow, that is a great feature! :D I can't remember having read about it in any documentation. It will definitely save me quite some time in the future. Maybe you should highlight this option more prominently.

Mismatching „The 100“ with the 100th episode is a really funny coincidence, but obviously my random choice was a bad example. There are many other cases where files with umlaut are mismatched, while others without match perfectly.

Another example:

„Criminal Minds - Alles nur fur Dich .ts“ is matched with „Criminal Minds“ S15E05 „Alles für meinen Bruder“.
The file „Criminal Minds - Am Ende des Traums .ts“ is matched correctly with „Criminal Minds“ S06E16 „Am Ende des Traums“.
https://pastebin.com/nvSY4ViP

Maybe the program's heuristics could be improved so that a candidate that differs by only one or two letters is preferred over one that matches only about 50%?
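
As a rough illustration of that idea, here is a Python sketch that scores the two candidate titles by plain string similarity. It uses difflib from the standard library purely as a stand-in; it is not how FileBot scores matches, and the helper name is made up.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

filename_title = "Alles nur fur Dich"
candidates = ["Alles nur für Dich", "Alles für meinen Bruder"]

# Print a score for each candidate episode title.
for candidate in candidates:
    print(candidate, round(similarity(filename_title, candidate), 2))

# Pick the candidate with the highest score.
best = max(candidates, key=lambda c: similarity(filename_title, c))
print(best)
```

With title similarity as the only criterion, the candidate that is off by a single letter wins clearly.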

Post by rednoah »

There are many heuristics at play. One of these heuristics takes timestamps into account, so if you match brand new files against very old episode information, then that heuristic usually nudges things in the right direction, except in your case it seems to nudge things in the wrong direction and over the edge.


Unfortunately, I don't have a solution for you if you're using the GUI. The GUI must target the general case. If I were to make the changes you suggest, that would likely make things work less well for other users with more common use cases. You'll have to use Edit Match to fix the occasional mismatch, or configure your upstream software to output äöü instead of aou.


However, if you were to use the CLI, then we would be able to use --mapper expressions to fix up the Episode information we have to better match your files before matching:

Code: Select all

--mapper "episode.derive(0, 0).title(t.replace('ä':'a', 'ö':'o', 'ü':'u'))"

Code: Select all

filebot -rename *.mkv --db TheTVDB -non-strict --lang German --mapper "episode.derive(0, 0).title(t.replace('ä':'a', 'ö':'o', 'ü':'u'))" --action TEST --log INFO
[TEST] from [TV Shows/Criminal Minds - Alles nur fur Dich.mkv] to [TV Shows/Criminal Minds - 6x15 - Alles nur für Dich.mkv]
[TEST] from [TV Shows/The 100 - Zwei in einem Korper.final.mkv] to [TV Shows/The 100 - 6x07 - Zwei in einem Körper.mkv]
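
The idea behind the --mapper expression above, sketched in Python: fold the umlauts in the database titles first, then compare against the filename. This only illustrates the concept and is not FileBot code; the episode list is limited to the two titles mentioned in this thread, and all names are hypothetical.

```python
# Same replacements as in the mapper expression: ä -> a, ö -> o, ü -> u.
UMLAUT_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

def fold(title: str) -> str:
    """Rewrite a database title the way the upstream software writes filenames."""
    return title.translate(UMLAUT_FOLD)

# (episode number, database title) pairs taken from the examples in this thread.
episodes = [("6x07", "Zwei in einem Körper"), ("7x16", "The Last War")]

# Index episodes by their folded title, keeping the original spelling for renaming.
by_folded_title = {fold(title): (number, title) for number, title in episodes}

match = by_folded_title.get("Zwei in einem Korper")
print(match)  # ('6x07', 'Zwei in einem Körper')
```

The lookup succeeds on the folded title while the rename can still use the original title with the ö.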