Strict mode processing files with bad matches

All your suggestions, requests and ideas for future development
kim
Power User
Posts: 1251
Joined: 15 May 2014, 16:17

Strict mode processing files with bad matches

Post by kim »

I have seen many times that if a movie is renamed with e.g. a different year than on themoviedb, it will fail or, worse, match the wrong movie, even if the year is off by only +-1 year AND in strict mode.
this should be "easy" to fix, but...

if a word in the filename is misspelled, e.g. "traitors" when it should be "traitor", then it will fail to match correctly...

SO, can you make something like "fuzzy logic" on the matching so if xx % of filename match... to give a better chance to get the correct movie

you can test with e.g
Rekrut.67.Petersen.1953.mkv (Rekrut.67.Petersen.1952 is the correct year)
in strict mode, it will match "The Recruit (2003)"
INFO: if you remove the (bad) year from filename it will match the correct movie.
the weird thing is, in NON-strict mode "Rekrut.67.Petersen.1953.mkv" match the correct movie.

and test with e.g.
Starship.Troopers.Traitors.of.Mars.2017.mkv (Starship.Troopers.Traitor.of.Mars.2017 is the correct title)
in strict mode, it will match "Starship Troopers (1997)"
INFO: if you remove the misspelled part from filename to "Starship.Troopers.of.Mars.2017.mkv" or just "Starship.Troopers.2017.mkv" it will match the correct movie.


at the very least, stop it from renaming the files in strict mode
rednoah
The Source
Posts: 23449
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: Bug - themoviedb matching problem

Post by rednoah »

Agreed. Strict mode should refuse to process these files. I'll have a look at what's going on.
:idea: Please read the FAQ and How to Request Help.

Re: Strict mode processing files with bad matches

Post by rednoah »

I can confirm that the CLI wasn't verifying strict matches in the same very strict way that the GUI does, leading to less strict behaviour. Fixed with r5209.

Re: Strict mode processing files with bad matches

Post by kim »

what about the "fuzzy logic" stuff, or just comparing all combos in the built-in database by removing 1 word/year at a time ?

because this will never match correct movie
Rank [Starship Troopers Traitors of Mars 2017] => [Starship Troopers (1997), Starship (2011)]
and test with e.g.
Starship.Troopers.Traitors.of.Mars.2017.mkv (Starship.Troopers.Traitor.of.Mars.2017 is the correct title)
...
INFO: if you remove the misspelled part from filename to "Starship.Troopers.of.Mars.2017.mkv" or just "Starship.Troopers.2017.mkv" it will match the correct movie.
in strict mode, it will now fail
in NON-strict mode, it will match "Starship Troopers (1997)" AND rename it to wrong movie
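
The "remove 1 word/year at a time" idea can be sketched against a purely local index, so no extra online queries are involved. This is only an illustration: `LOCAL_INDEX`, `tokenize`, `leave_one_out` and `lookup` are hypothetical names, the three index entries are taken from the examples in this thread, and none of this is FileBot's actual code.

```python
import re

VIDEO_EXTENSIONS = {"mkv", "mp4", "avi"}

# Hypothetical stand-in for a local movie index: (title, year) pairs.
LOCAL_INDEX = [
    ("Starship Troopers", 1997),
    ("Starship Troopers: Traitor of Mars", 2017),
    ("Rekrut 67 Petersen", 1952),
]

def tokenize(name):
    """Split a name into lower-case alphanumeric tokens, dropping a video extension."""
    tokens = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", name) if t]
    if tokens and tokens[-1] in VIDEO_EXTENSIONS:
        tokens.pop()
    return tokens

def leave_one_out(tokens):
    """Yield the full token list first, then every variant with one token removed."""
    yield list(tokens)
    for i in range(len(tokens)):
        yield tokens[:i] + tokens[i + 1:]

def lookup(filename, index=LOCAL_INDEX):
    """Return the entries that contain every token of the first variant that hits."""
    for variant in leave_one_out(tokenize(filename)):
        if not variant:
            continue  # an empty query would match everything
        wanted = set(variant)
        hits = [
            f"{title} ({year})"
            for title, year in index
            if wanted <= set(tokenize(title)) | {str(year)}
        ]
        if hits:
            return hits
    return []
```

With this toy index, dropping the misspelled "Traitors" or the wrong year "1953" is what lets the remaining tokens line up with an entry. A real index with a million aliases would need something smarter than a linear scan per variant.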

Re: Strict mode processing files with bad matches

Post by rednoah »

That's to be expected in non-strict mode. Bad luck. If a query for "Starship Troopers Traitors of Mars" yields no results then it's effectively the same as checking for "ajkdsflfdskjfdlkdsjaldskfj" and getting no results. It's impossible to guess that the extra "s" somehow causes the correct match to be missing.

Fortunately, FileBot has its own movie index and search engine that works around search limitations of TheMovieDB, so it'll work eventually once the movie becomes popular enough to be included in the FileBot index.

Re: Strict mode processing files with bad matches

Post by kim »

but it IS in the filebot database
6977240 0460790 2017 Starship Troopers: Traitor of Mars
I think it was added to TMDB in March or so.
I do not believe it ever will "become popular enough" whatever that means (the TMDB popular score) ?

Why does it not lookup "Starship Troopers"... then compare the matches to the filename/path if any of the other words or year are present. Then + xx% for every match so the one with the highest xx% "wins".

I think there is something wrong when it puts only "Starship Troopers (1997)" and "Starship (2011)" on the possible matches, BUT not any of the other "Starship Troopers" movies.
It's like the year has little to no value in this lookup case ?

Re: Strict mode processing files with bad matches

Post by rednoah »

Removing words until the query works, works only in this very specific case. Bombarding the API with 3-4 times as many search requests just because there's a handful of movies where that actually gives you a good result is not ideal.

I'll look at it, but if it's already in the index and it still doesn't work then it's probably a bad luck corner case that needs to be processed semi-manually.

In this case, adding an additional alternative title makes the most sense, unless you find a few more examples that would elevate this singular corner case into a class of corner cases:
https://www.themoviedb.org/movie/460790 ... ive_titles

Re: Strict mode processing files with bad matches

Post by kim »

rednoah wrote: 23 Sep 2017, 03:29 Removing words until the query works, works only in this very specific case. Bombarding the API with 3-4 times as many search requests just because there's a handful of movies where that actually gives you a good result is not ideal.
that's why I wrote "what about the "fuzzy logic" stuff or just compare all combos in the built-in database, by removing 1 word/year at a time ?"
anyone can misspell a word or use a different year (because IMDB and TMDB do not have 100% the same info), but what are the odds that multiple words are wrong ?
rednoah wrote: In this case, adding an additional alternative title makes the most sense, unless you find a few more examples that would elevate this singular corner case into a class of corner cases:
https://www.themoviedb.org/movie/460790 ... ive_titles
This is a temporary and bad way of "fixing" it; anyone can remove it at any time, because it's a wrong title.

"a few more examples"... remember this one ?
viewtopic.php?f=6&t=2816
it looks like a pattern to me ;)

Re: Strict mode processing files with bad matches

Post by rednoah »

I see the pattern of misspelled words, especially singular vs plural, and I'll see what can be done about that using the internal movie index.

Spamming TheMovieDB with lots of extra queries is not the solution though, and it also wouldn't account for bad spelling in the first and second word which are needed for any kind of half-way meaningful query. ;)

Re: Strict mode processing files with bad matches

Post by rednoah »

I've looked into doing fuzzy search on the movie index similar to what I'm doing with the series index. Fuzzy search is very expensive, but for series names it's absolutely necessary to make things work reasonably well.

The movie index works differently in that everything is broken into terms and then only these terms are compared (i.e. word by word instead of character by character) to make things reasonably fast when cross-checking bits and pieces of the file name against a million movie alias names.

Interesting performance numbers:
* Word-Based Search: ~25ms (current solution)
* Fuzzy Search: ~2500ms

Word-Based Search can't cope with misspelled words, but it's 100x faster. In this case, I'm inclined to choose speed over accuracy. A single file would require multiple lookups, and then we're looking at ~10s per file on a high-end device.

I'll keep looking into better options for indexing text that allows for fuzzy lookup though.
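
To make the difference between the two approaches concrete, here is a minimal sketch. It assumes a word-based check that asks whether every word of an alias appears verbatim in the query, and it uses Python's `difflib.SequenceMatcher` as a stand-in for a character-level fuzzy comparison. The alias list and function names are hypothetical, and FileBot's real index and algorithms are not shown in this thread.

```python
import re
from difflib import SequenceMatcher

# Hypothetical alias list; the real index holds on the order of a million names.
ALIASES = ["Starship Troopers", "Starship Troopers: Traitor of Mars", "Starship"]

def words(text):
    """Lower-case alphanumeric words of a string, in order."""
    return re.findall(r"[0-9a-z]+", text.lower())

def word_match(query, alias):
    """Word-based: every word of the alias must appear verbatim in the query."""
    return set(words(alias)) <= set(words(query))

def fuzzy_ratio(query, alias):
    """Character-based: similarity in [0, 1] between the normalized strings."""
    return SequenceMatcher(None, " ".join(words(query)), " ".join(words(alias))).ratio()

query = "Starship Troopers Traitors of Mars"
word_hits = [a for a in ALIASES if word_match(query, a)]
best_fuzzy = max(ALIASES, key=lambda a: fuzzy_ratio(query, a))
```

Here `word_hits` only contains "Starship Troopers" and "Starship", because "traitors" is not the same word as "traitor", while `best_fuzzy` picks the Traitor of Mars alias. The word check can be answered with set or hash lookups, whereas the fuzzy comparison has to walk the characters of every alias, which is consistent with the large speed gap quoted above.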


EDIT:

I'm doing a partial fuzzy search locally now while waiting for online search queries which usually take a few hundred milliseconds anyway. You can try r5210 and see if it works well for you and check if there's a noticeable performance regression.

Re: Strict mode processing files with bad matches

Post by kim »

There is a smile on my face :D
looks like you did good...

I tested with "Paul.Blart.Mall.Cops.2008.mkv" (wrong title/word AND wrong year)
and it passed the test.

I did not notice any "noticeable performance regression"...
and what are a few ms compared to a file renamed to the wrong movie (which can take maybe hours to fix)?

btw: I will keep on testing ;)

Re: Strict mode processing files with bad matches

Post by kim »

To ALL users:
it's easy to test in the GUI: if the movie shows up on the "best match" list, and even better at the top, it's working.

if not on the "best match" list... please write here so we can make it better

Re: Strict mode processing files with bad matches

Post by kim »

I tested "Rekrut.67.Petersen.1953.mkv" (wrong year) again...

In strict mode, I can see it finds the correct movie, but wrong year so fail...

Can you make the output/log say something that indicates "a potential match found... the file may be incorrectly named... [best match] "

Re: Strict mode processing files with bad matches

Post by kim »

I tested these versions of "Starship Troopers 2 - Hero of the Federation 2004"

Starship Troopers 2 Hero of the Federations 2003 = match (no popup list) = (all OK)
Starship Troopers 2 Hero of the Federations 2004 = match (no popup list) = (all OK)
Starship Troopers 2 Heros of the Federation 2003 = # 2 on the list = ***
Starship Troopers 2 Heros of the Federation 2004 = # 1 on the list = (OK)
Starship Troopers 2 Heros of the Federations 2003 = # 2 on the list = ***
Starship Troopers 2 Heros of the Federations 2004 = # 1 on the list = (OK)

*** Maybe it can get to # 1 on the list, without doing more harm than good ?

Re: Strict mode processing files with bad matches

Post by rednoah »

The current implementation doesn't do fuzzy search on movies where the year doesn't match, greatly reducing the number of movies that are being checked.

If the year in the filename is wrong, is it usually the year before the actual release year?

Re: Strict mode processing files with bad matches

Post by kim »

IMDB often uses some "Film Festival" date as the year, but TMDB uses the first date that the general public can see the movie, so TMDB = IMDB + 1 year. If the search goes the other way, it is -1 year, you see ?

= lookup year +- 1 (I think is the best)

Re: Strict mode processing files with bad matches

Post by rednoah »

I'll go with only -1 for now. If +1 is an actual thing, then we can add it later. That's for fuzzy search. The ranking is a completely unrelated matter, but I can play with that for a bit and see if I can get overall better results.

Please try r5212 and give it some thorough testing because changes in the ranking are usually a bit of a gamble.
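
A year window used as a pre-filter before any fuzzy comparison could look roughly like the sketch below. The direction of the offset is an assumption: here an offset is defined as movie year minus file name year, so the default `(0, -1)` accepts the Rekrut example from this thread (file says 1953, movie is 1952). The function name and the sample list are hypothetical.

```python
# Hypothetical sample of (title, year) pairs taken from examples in this thread.
MOVIES = [("Rekrut 67 Petersen", 1952), ("The Recruit", 2003)]

def prefilter_by_year(movies, file_year, offsets=(0, -1)):
    """Keep only movies whose year equals file_year plus one of the allowed offsets.

    Assumption: offset = movie year - file name year.
    """
    accepted = {file_year + offset for offset in offsets}
    return [movie for movie in movies if movie[1] in accepted]

exact_only = prefilter_by_year(MOVIES, 1953, offsets=(0,))
minus_one = prefilter_by_year(MOVIES, 1953)
plus_minus_one = prefilter_by_year(MOVIES, 1953, offsets=(-1, 0, 1))
```

Widening the window from `(0, -1)` to `(-1, 0, 1)` is a one-line change, but it also increases the number of candidates that reach the expensive fuzzy step.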

Re: Strict mode processing files with bad matches

Post by kim »

found a problem:
tested with "Paul Blart Mall Cop 2011" and "Paul Blart Mall Cops 2011"

strict mode:
"Paul 2011" (BAD)

and in NON-strict mode:
"Paul Blart Mall Cop 2 2015" (not sure, no list... "Paul Blart Mall Cop 2009" should be on the list, I think)

it's something to do with the year because:
"Paul Blart Mall Cop" (NO YEAR)

strict mode:
no match (OK)

NON-strict:
"Paul Blart Mall Cop 2009" as #1 = OK

Re: Strict mode processing files with bad matches

Post by rednoah »

Yes, "Paul" matches and "2011" matches so it'll match Paul (2011). That's all it takes to get a strict match. FileBot cannot assume that all your files are already using the Name (Year) standard. Might as well be Paul: Super Great Extended Edition 2011.

In this case it's just bad luck that there is a movie called "Paul" from the year "2011" that seems like a better match. If the year were equally wrong but 2010 or 2012, then it would work. But there will always be very specific corner cases that just don't work. ;)

I'm not sure if the year consistently being wrong and misleading is a general thing, or just happens to be the case in your collection. :lol:

I could easily put more emphasis on a fuzzy name match, but the name being messed up is more likely than the year being blatantly wrong. I'm not inclined to make any changes here with just this specific corner case as reason.
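
The reasoning about "Paul (2011)" can be illustrated with a deliberately simplified check: a candidate passes if all of its title words and its year appear somewhere in the file name. This is an assumption made for illustration only; the actual strict-mode verification in FileBot is not shown in this thread, and `is_strict_match` is a hypothetical name.

```python
import re

def tokens(text):
    """Lower-case alphanumeric tokens of a string."""
    return re.findall(r"[0-9a-z]+", text.lower())

def is_strict_match(filename, title, year):
    """Simplified check: every title word and the year occur in the file name."""
    file_tokens = set(tokens(filename))
    return set(tokens(title)) <= file_tokens and str(year) in file_tokens

# "Paul" and "2011" are both present, so the short title passes.
paul_2011 = is_strict_match("Paul Blart Mall Cops 2011", "Paul", 2011)
# "Cop" is not the same word as "Cops", and 2009 is not in the file name.
blart_2009 = is_strict_match("Paul Blart Mall Cops 2011", "Paul Blart Mall Cop", 2009)
```

Under this simplified rule a file named "Paul Super Great Extended Edition 2011" also passes for "Paul" (2011), which is the case the check is meant to allow, and a file with 2010 or 2012 in the name no longer passes for it.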

Re: Strict mode processing files with bad matches

Post by kim »

I don't know how you do the (fuzzy) match search ?

I think, it should be something like this:
filename "Paul Blart Mall Cops 2011"

"Paul Blart Mall Cops 2011"
"Paul Blart Mall Cops"

if no match, then find all movies containing any combo of the words Paul, Blart, Mall or Cop
the more words match = higher score (top of list)
Paul Blart Mall Cop (3/4 match)
Paul Blart Mall Cop 2 (3/4 match)
Mall Cops: Mall of America (2/4 match)
5150 Mall Cop (2/5 match)
Paul (1/4 match)

filter result with year
Paul Blart Mall Cop 2009 (3/5 match)
Paul Blart Mall Cop 2 2015 (3/5 match)
Mall Cops: Mall of America 2009 (2/5 match)
Paul 2011 (2/5 match) + xx % because year match
5150 Mall Cop 2005 (2/5 match)

Paul Blart Mall Cop 2009 (3/5 match) 60% score
Paul Blart Mall Cop 2 2015 (3/5 match) 60% score - score because no "2" in filename = 50% score

# on the list:
1 Paul Blart Mall Cop 2009
2 Paul Blart Mall Cop 2 2015
3 Mall Cops: Mall of America 2009
4 Paul 2011
5 5150 Mall Cop 2005
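
The scoring idea above can be sketched as follows. This is one possible reading of the proposal, not a reproduction of the exact percentages or of the full ordering in the list: the score is the share of file name tokens that are explained by the candidate's title words or year, minus a small penalty for each title word that is missing from the file name (such as the "2"). The penalty value and all names are assumptions.

```python
import re

# Candidates as (title, year), taken from the example list in this post.
CANDIDATES = [
    ("Paul", 2011),
    ("5150 Mall Cop", 2005),
    ("Paul Blart Mall Cop 2", 2015),
    ("Mall Cops: Mall of America", 2009),
    ("Paul Blart Mall Cop", 2009),
]

def tokens(text):
    """Lower-case alphanumeric tokens of a string."""
    return re.findall(r"[0-9a-z]+", text.lower())

def score(filename, title, year, extra_word_penalty=0.05):
    """Share of file tokens matched by title words or year, minus a penalty
    for every title word that does not occur in the file name."""
    file_tokens = tokens(filename)
    if not file_tokens:
        return 0.0
    file_set = set(file_tokens)
    title_words = set(tokens(title))
    hits = len(title_words & file_set)
    if str(year) in file_set:
        hits += 1  # a matching year counts like one more matching word
    extras = len(title_words - file_set)
    return hits / len(file_tokens) - extra_word_penalty * extras

def rank(filename, candidates):
    """Candidates sorted from best to worst score."""
    return sorted(candidates, key=lambda c: score(filename, c[0], c[1]), reverse=True)

ranked = rank("Paul Blart Mall Cops 2011", CANDIDATES)
```

With these inputs the two Paul Blart titles come out first and second, ahead of "Paul" (2011), because three matching words outweigh one word plus the year. Since the comparison is exact per word, "Cops" still does not match "Cop"; a stemming or fuzzy step would be needed for that.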

Re: Strict mode processing files with bad matches

Post by rednoah »

Do you have a second or third example of a file that is not Paul Blart that fails in a similar way?

Re: Strict mode processing files with bad matches

Post by kim »

try this (but only fail in NON-strict)
"The Hitmans Bodyguard 2011"

#1 Bodyguard 2011
#2 the guard 2011
#3 The Hitman's Bodyguard 2017

looks to me like the year has too much value
many movies can have same year, but the more words in title should give better match (higher score)

try this (NON-strict)
"The Bodyguard 2002"
#1 The Backyard 2002 (WTF?)
#2 the Coast Guard 2002 (why?)
#3 2002 2001 (this should be way down the list, when better words to match are found)
#4 The Bodyguard 2004

Re: Strict mode processing files with bad matches

Post by rednoah »

Are these made up examples or do you actually have files where the year is wrong and off by 5 years?

It's like with "Firefly 1x01 Serenity", the 1x01 takes precedence and the episode gets matched accordingly. There are many real world examples where the year is the only common denominator (e.g. Spanish filename ➔ English match or LOTR1 2001 for the first Lord of the Rings movie).

"The Bodyguard" is an interesting example and shows how fuzzy search + year match can result in regression issues. I'll revert some of the previously made changes to make this one work properly again.

Re: Strict mode processing files with bad matches

Post by kim »

most of them are "made up" to test it, but also to show the effect the year has.

here are some more realistic ones:
"center vagten 2009" (the correct danish name AND year for "Paul Blart Mall Cop 2009")
same with "centervagten 2009" (alt. title)

#1 Paul Blart Mall Cop 2 2015
#2 Paul Blart Mall Cop 2009

"center vagten 2010"
#1 2010 1984
#2 Paul Blart Mall Cop 2 2015
#3 Paul Blart Mall Cop 2009


"Firefly 1x01 Serenity" in my head this should be matched to "Firefly 0x01 Serenity"
not "Firefly 1x01 The Train Job" just because the episode number match
FYI: it's "Firefly 1x11 Serenity" on TMDB
(again the same as the movie year has too much value)
also, episode number over title (what if correctly named, but in dvd order or other way?)

how does e.g. Google match, and is it possible to use the same method ?
(I'm not saying the current matching is bulls***, but maybe it can get better?)... use AI ;)

btw: my focus right now is on the movie side

Re: Strict mode processing files with bad matches

Post by rednoah »

1.
FileBot has many, many blind spots that you could exploit by handcrafting tricky test cases. However, putting emphasis on the release year (+-1) is generally a good rule that keeps things stable and easy to understand.


2.
You're correct. Matching "Firefly 1x01 Serenity" to "Firefly 1x01 The Train Job" is incorrect, but it's an error that a user will be able to understand, and it's easy for users to come up with a solution: just fix the numbers to match.

:idea: Fun fact: Older versions of FileBot gave SxE matches and title matches the same weight, which meant that in tricky cases it was fairly impossible to predict what match would bubble up on top.

Fuzzy matches based on text (which may contain different wording or spelling errors) will be much more mysterious for non-tech savvy users. Plus movie/series name matches may be based on alias titles which are not immediately visible to end users.

It's not a perfect solution, but it works very well in practice.


3.
I'll debug a little bit with the examples you posted. Maybe there's a way to tweak things in the right direction without breaking the general use case.

"centervagten 2009" is a good example that should work but broke with yesterday's changes. I'll keep it as a test case. Also, with the latest revision, I think we're back to how it worked last week now. :lol:


4.
AI would require vast amounts of example data and correct matches. It would require a humongous effort in terms of testing and development, and it would require me to collect vast amounts of analytics data from existing users (which I refuse to do) to feed into the machine learning algorithms.

:idea: If I ever decide to do a PhD, I might look into using AI to identify movies though. I have tried and failed before though. :lol:
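
The point in 2. about SxE matches and title matches having the same weight can be shown with a toy ranking. The two episodes and their numbering follow what was mentioned in this thread for TMDB; the feature functions and both ranking strategies are hypothetical and only meant to show why a fixed precedence is easier to predict than equal weights.

```python
# Illustrative episode list as (season, episode, title).
EPISODES = [(1, 1, "The Train Job"), (1, 11, "Serenity")]

def features(parsed, episode):
    """Return (sxe_match, title_match) as 0/1 flags for one candidate episode."""
    season, number, title = episode
    sxe = int((season, number) == (parsed["season"], parsed["episode"]))
    name = int(title.lower() == parsed["title"].lower())
    return sxe, name

def rank_equal_weight(parsed, episodes):
    """Both signals count the same, so one hit each produces a tie."""
    return sorted(episodes, key=lambda e: sum(features(parsed, e)), reverse=True)

def rank_sxe_first(parsed, episodes):
    """SxE match is compared first; the title only breaks ties."""
    return sorted(episodes, key=lambda e: features(parsed, e), reverse=True)

parsed = {"season": 1, "episode": 1, "title": "Serenity"}  # "Firefly 1x01 Serenity"
```

With equal weights both episodes score the same, so the winner depends on nothing more than the order of the input list. With SxE compared first, "The Train Job" always wins for this file name, which is wrong here but predictable, and the user can fix it by correcting the numbers.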