I'm experimenting with using jaroWinklerDistance (https://en.wikipedia.org/wiki/Jaro%E2%8 ... r_distance) from Apache Commons (https://commons.apache.org/proper/commo ... mmary.html) to determine the "highest" matching entries in TheTVDB/AniDB.
First I query for the Series (TheTVDB.search/AniDB.Search), then fetch the Aliases for each Series returned (TheTVDB.getSeriesInfo/AniDB.getSeriesInfo), and then run jaroWinklerDistance on those against the Series name from the file. The idea is to use the score as a "hint" for resolving the file with higher accuracy than a basic AMC strict/non-strict pass, which for some current shows is very much hit or miss because of the silly filenames some groups release under.
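To make the scoring step concrete, here is a minimal sketch in Python (for brevity; the same logic ports to Groovy). It is a plain textbook Jaro-Winkler similarity where higher means closer, not the Apache Commons implementation, and `normalize`, `best_match` and the sample candidates are hypothetical names and data made up for illustration.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]; 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def normalize(name):
    """Hypothetical helper: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def best_match(query, candidates):
    """candidates maps a series name to its aliases; returns (score, series)."""
    scored = []
    for series, aliases in candidates.items():
        names = [series] + list(aliases)
        score = max(jaro_winkler(normalize(query), normalize(n)) for n in names)
        scored.append((score, series))
    return max(scored)


# Illustrative data only, not real search results.
candidates = {
    "Re:Zero kara Hajimeru Isekai Seikatsu": ["Re:ZERO -Starting Life in Another World-"],
    "Sword Art Online": ["SAO"],
}
score, series = best_match("Re Zero kara Hajimeru Isekai Seikatsu", candidates)
```

Each series is scored by its best-scoring name, so a single good alias is enough to pull the right series to the top.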
If I can figure out how to download/parse the XML AniDB Title database, I'll add that into the mix for AniDB. That should greatly increase the accuracy of the AniDB search, since I could look for the highest match across all Anime Titles/Synonyms. It would also help with looking up info when switching between AniDB and TheTVDB as an "if the first one doesn't succeed, try the other one" process.
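A rough sketch of the parsing half, in Python with only the standard library. The element and attribute names (`animetitles`, `anime`, `aid`, `title`, `type`, `xml:lang`) are assumptions about the layout of the title dump and should be checked against the real file. The sample entries are invented, and `load_titles` and `flatten` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample in the ASSUMED layout of the title dump.
SAMPLE = """<animetitles>
  <anime aid="1">
    <title xml:lang="x-jat" type="main">Example Show</title>
    <title xml:lang="en" type="official">Example Show: The Series</title>
    <title xml:lang="x-jat" type="syn">Example Synonym</title>
  </anime>
  <anime aid="2">
    <title xml:lang="x-jat" type="main">Another Example</title>
    <title xml:lang="x-jat" type="syn">Another_Example 2nd Season</title>
  </anime>
</animetitles>"""


def load_titles(xml_text):
    """Return {aid: [(type, lang, title), ...]} for every anime element."""
    root = ET.fromstring(xml_text)
    index = {}
    for anime in root.iter("anime"):
        aid = int(anime.get("aid"))
        for title in anime.findall("title"):
            entry = (title.get("type"), title.get(XML_LANG), title.text)
            index.setdefault(aid, []).append(entry)
    return index


def flatten(index):
    """Flatten to (title, aid) pairs, ready to feed a similarity function."""
    return [(text, aid) for aid, entries in index.items() for _, _, text in entries]


index = load_titles(SAMPLE)
pairs = flatten(index)
```

With the flat list of pairs, every main title, official title and synonym can be scored against the name from the file, and the winning pair gives the AniDB id directly.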
I am doing this because I do NOT want to manually babysit the Anime sorting at this part of the process (this is the first sorting of the raw inbound files), but the base accuracy of AMC is not as high as I want in non-strict mode. Non-strict is unfortunately required for a lot of anime, although I DO run a strict pass FIRST before resorting to it. So far this season, a number of the release groups for Re Zero, the latest Sword Art Online and, surprisingly, Black Clover have filenames that just confuse FileBot (with good reason for some of them).
So right now, a number of release groups look to be purposely trying to make it hard to get the series right (much less the episode), with names like the ones listed below. These are from the last few days, and none of them matched using strict AMC for me. This is not an exhaustive list of files that strict has issues with, and non-strict doesn't always work right either. On top of that, when AMC doesn't group files together "accurately", non-strict can produce some wildly off matches (reason #2 I don't like to use non-strict with AMC).
(CBB) Re-Zero kara Hajimeru Isekai Seikatsu - 35 (1080p)(HEVC)(10bit-AAC).mkv
[FFA] Re_Zero kara Hajimeru Isekai Seikatsu 2nd Season - 10 [1080p][HEVC][AAC].mkv - This does actually work correctly if using non-strict (but I don't like using non-strict unless all other options are exhausted)
[EMBER] Re Zero kara Hajimeru Isekai Seikatsu S02 - 10.mkv
[shadow.jp.net] Re Zero kara Hajimeru Isekai Seikatsu - 35 [720p] [MULTi-SUB].mkv - This does actually work correctly if using non-strict (but I don't like using non-strict unless all other options are exhausted)
[Edge] ReZero kara Hajimeru Isekai Seikatsu (Season 2) - 35- I Know Hell [1080p].mkv
[HorribleSubs] Re Zero kara Hajimeru Isekai Seikatsu - 35 [720p].mkv
[AkihitoSubs] Sword Art Online Alicization - War of Underworld - S02E09 (21).mkv
[BakedFish] Sword Art Online_ Alicization - War of Underworld 2nd Season - 09 [720p][AAC].mp4
[HR] Black Clover 142 - S03E40.mkv
[Edge] Black Clover (TV) - 142 [1080p][Multiple Subtitle][10Bit][x265].mkv
[mal lu zen] Black Clover - 142 [720p].mkv
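For names like these, one way to get a cleaner series name to score is to strip the release-group tags and the episode tail first. The following is a heuristic sketch in Python, not how FileBot or AMC parses filenames: the function name `series_name_from_filename` and the regexes are my own assumptions, and titles that contain two-to-four digit numbers would be cut short.

```python
import re

# Heuristic patterns, tuned only against the filenames listed above.
EXTENSION = re.compile(r"\.(mkv|mp4|avi)$", re.IGNORECASE)
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)")  # [group], (1080p), (TV), ...
EPISODE_TAIL = re.compile(r"\s+(?:-\s+)?(?:S\d+E\d+|\d{2,4})\b.*$", re.IGNORECASE)


def series_name_from_filename(filename):
    """Best-effort series name: drop extension, tags and everything from the episode on."""
    name = EXTENSION.sub("", filename)
    name = BRACKETED.sub(" ", name)        # remove every [..] and (..) chunk
    name = name.replace("_", " ")          # Re_Zero -> Re Zero
    name = EPISODE_TAIL.sub("", name)      # cut at " - 35", " 142", " - S02E09"
    return " ".join(name.split())          # collapse leftover whitespace


names = [
    series_name_from_filename(f)
    for f in (
        "(CBB) Re-Zero kara Hajimeru Isekai Seikatsu - 35 (1080p)(HEVC)(10bit-AAC).mkv",
        "[BakedFish] Sword Art Online_ Alicization - War of Underworld 2nd Season - 09 [720p][AAC].mp4",
        "[HR] Black Clover 142 - S03E40.mkv",
    )
]
```

Note that this deliberately throws away hints such as "(Season 2)", so it only helps with picking the series, not the season or episode.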
I've given up on a generic programmatic method of dealing with Sword Art Online Alicization War of Underworld, and just hard code any "season 2" stuff to Sword Art Online War of Underworld (2020) on AniDB (non-strict), so at least I get the right season. Same deal with Black Clover: there is only a SINGLE season on TheTVDB, so the episodes rarely match, but at least it's the right series. Since the episodes rarely match, the time of year (aka 'anime season') usually doesn't match either, but it's the best that I can do with that filename. I'm also at a bit of a loss as to why some of the simple Black Clover filenames frequently don't match the episodes either.
I don't rename the files to the episode(s) at this point, nor add xattr, because even a 20-30% mismatch is still thousands of files for me at this stage. I collect A LOT of fansub Anime (I also purchase the titles I like), so accuracy is very important to me (I'm an avid AniDB user), and I also have a lot of stuff that isn't in TheTVDB.
I have been adding Synonyms like Sword Art Online_ Alicization - War of Underworld 2nd Season and Re_Zero kara Hajimeru Isekai Seikatsu 2nd Season to AniDB, both to improve the accuracy of my FileBot script and to help others match against some of the silly names release groups are using.
If it's possible to get the AniDB synonym info from FileBot, that would be my preferred method, whether it's a flag passed to getSeriesInfo or a dedicated method. It *seems* like FileBot will match against the synonym when in non-strict, but it could also just be getting "lucky".
Does this help explain what I'm looking for, and why I'm asking for a method to retrieve the synonym info using FileBot (in a Groovy script)?