subtitle language not detected

Ztrust
Posts: 69
Joined: 21 Dec 2013, 17:04

Post by Ztrust »

For some time, FileBot hasn't been able to detect subtitle language. I did some testing, and it seems to start with FileBot 4.5.3.
FileBot 4.5 works. Is this a bug, or did something change?


ztrust
rednoah
The Source
Posts: 18979
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: subtitle language not detected

Post by rednoah »

Language detection is no longer supported. It was quite hit-and-miss for most languages anyway.
:idea: Please read the FAQ and How to Request Help.

Post by Ztrust »

Thank you, but sorry to hear that.

Post by rednoah »

It's *VERY* easy to do your own "language detection" if you already know what subtitle languages you have.

Most people used language detection to add ".eng" in which case you could just add ".eng" in the format.

Code: Select all

{if (ext == 'srt') '.eng'}
If you already know that you only have ".eng" or ".deu" subtitles, you can easily differentiate the two by counting occurrences of "the", accented characters, or whatnot.

Code: Select all

{if (ext == 'srt' && file.text.matchAll('the|a|an').size() > 100) '.eng' else '.deu'}
English Subtitles? Then occurrences of "the" will be very numerous.
German Subtitles? Then occurrences of "the" will be virtually zero.
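
A side note on the counting approach: a pattern like 'the|a|an' also matches those letters inside longer words, so German text can clear the threshold too. Below is a minimal sketch in plain Groovy of counting whole words with \b instead. It uses Groovy's built-in findAll rather than any FileBot binding, and the helper name `countEnglishWords`, the word list and the sample sentences are made up for illustration.

```groovy
// Hypothetical helper: count whole-word, case-insensitive matches of a few common English words.
// Groovy's findAll returns an empty list when nothing matches.
int countEnglishWords(String text) {
    return text.findAll(/(?i)\b(the|you|and)\b/).size()
}

def english = 'The cat sat on the mat. And you saw the cat.'
def german  = 'Das Thema war dann allen klar. Danke an alle.'

assert countEnglishWords(english) == 5   // The, the, And, you, the
assert countEnglishWords(german) == 0    // "Thema" and "dann" contain no whole-word match

// Without word boundaries, single letters inside German words count as hits
assert german.findAll(/the|a|an/).size() > 0
```

Any threshold still needs tuning against real subtitle files; the numbers above only show the difference between substring and whole-word matching.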
kim
Power User
Posts: 1027
Joined: 15 May 2014, 16:17

Re: subtitle language not detected

Post by kim »

This did not work for me:

Code: Select all

{if (ext == 'srt' && file.text.matchAll('the|a|an').size() > 100) '.eng' else '.deu'}
But I made this work:

Code: Select all

{fn.replaceFirst(/(?i).da$|.dan$|.dk$|.en$|.eng$|.english$|.danish$/, '')}{ext == 'srt' ? (file.text.matchAll(/\bthe\b|\byou\b|\ba\b/).size() > 20 ? '.eng' : '') : ''}{ext == 'srt' ? (file.text.matchAll('æ|ø|å').size() > 10 ? '.dan' : '') : ''}
Maybe it can be made better/shorter, but I have had no luck with that.
I get "Pattern not found" if I try with æ|ø|å only, and this

Code: Select all

{if (ext == 'srt' && file.text.matchAll('the|a|an').size() > 100) '.eng' else '.dan'}
fails to give the correct result.

Post by rednoah »

There might be encoding issues. Not sure what getText() uses by default to decode characters. Probably UTF-8 so if the subtitles use a different encoding then it won't find the characters.
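
To illustrate the encoding point: the same three Danish letters saved as ISO-8859-1 are not valid UTF-8, so a UTF-8 decode never yields æ, ø or å. Here is a rough sketch using only JDK classes that tries strict UTF-8 first and falls back to ISO-8859-1. The helper name `decodeSubtitleBytes` is hypothetical, and this says nothing about what FileBot's getText() does internally.

```groovy
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.StandardCharsets

// Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback.
String decodeSubtitleBytes(byte[] bytes) {
    try {
        // A fresh decoder reports malformed input by throwing instead of inserting U+FFFD
        return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString()
    } catch (CharacterCodingException e) {
        return new String(bytes, StandardCharsets.ISO_8859_1)
    }
}

def danish = '\u00e6\u00f8\u00e5'   // the letters æ ø å, escaped so the script encoding does not matter
byte[] latin1 = danish.getBytes(StandardCharsets.ISO_8859_1)
byte[] utf8 = danish.getBytes(StandardCharsets.UTF_8)

// Decoding the ISO-8859-1 bytes leniently as UTF-8 loses the letters
def lenient = new String(latin1, StandardCharsets.UTF_8)
assert !lenient.contains('\u00e6')

// The fallback helper recovers them from either encoding
assert decodeSubtitleBytes(latin1) == danish
assert decodeSubtitleBytes(utf8) == danish
```

The fallback is only a heuristic: ISO-8859-1 accepts any byte sequence, so text in some other legacy encoding would be decoded without an error but with the wrong characters.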

Post by kim »

In CLI:

How do you "fix" the problem with "Exception: Pattern not found"?

Try with an empty file:
when 0 matches are found you get "Exception: Pattern not found", NOT the else branch, e.g.
else '.deu'

Code: Select all

{if (ext == 'srt' && file.text.matchAll('the|a|an').size() > 100) '.eng' else '.deu'}
By the way, I still don't know why some subs "work" and some don't, when Notepad++ says they have the same encoding, e.g. "ANSI".

Code: Select all

langsubFile = (it.getText('ISO8859_1').matchAll(/æ|ø|å/).size() > 10 ? '.dan' : (it.getText('utf8').matchAll(/æ|ø|å/).size() > 10 ? '.dan' : '.NOTdan'))

Post by rednoah »

ASCII is a subset of both ISO8859_1 and UTF-8, and "ANSI" in Notepad++ refers to the system's Windows code page rather than one fixed encoding. Guessing text encodings correctly can be tricky.

The latest revision supports language detection again, but this time instead of using OpenSubtitles API it's local. Not sure how it'll deal with encodings though! :D

Post by kim »

I read that, but I get "No such property: lang for class"... How do I use it in the CLI?
(As in OFFLINE, with a custom Groovy file)

So there is no way around the "Exception: Pattern not found" part?
(the true/false check is NOT working when "Pattern not found" is thrown)

Using:
FileBot 4.6.1 (r3541) / Java(TM) SE Runtime Environment 1.8.0_73

Post by rednoah »

1.
kim wrote:No such property: lang for class
This error doesn't make sense to me. When does it appear? Screenshots?


2.
You can get around the Unwind-on-Undefined behaviour by using any{expr}{expr}{etc}

@see viewtopic.php?f=5&t=1895

Post by kim »

1. You say "supports language detection again"...
I guess that is:
is it {lang} in the GUI?
is it --lang (as an argument)?
But what I need is to use "lang" in a CLEAN Groovy file to rename subs OFFLINE, so I don't need to use
e.g. "{if (ext == 'srt' && file.text.matchAll('the|a|an').size() > 100) '.eng' else '.deu'}"
but just e.g. {it.lang}?

In htpc I see "lang == it.language", but it doesn't work...
Do I need to import something, or use a full name like net.filebot.WebServices.language?

2. thx

Post by rednoah »

1.
Just using {lang} should work with the latest builds. If language can't be detected based on the filename language suffix, then language detection will be used on the subtitle content.

Post by kim »

Well, I can't get it to work :(

_RenameSubs.groovy:

Code: Select all

rename(map: args.getFiles()
	.findAll{ it.name =~ '(?i:.srt$)' }
		.each{println "FOUND: $it.name"}
			.collectEntries{
				[it, it.lang]
				println it
			}
)
LOG:
MissingPropertyException: No such property: lang for class: java.io.File
Possible solutions: path, class, name
groovy.lang.MissingPropertyException: No such property: lang for class: java.io.File
Possible solutions: path, class, name
at Script1$_run_closure3.doCall(Script1.groovy:26)
at Script1.run(Script1.groovy:22)
at net.filebot.cli.ScriptShell.evaluate(ScriptShell.java:61)
at net.filebot.cli.ScriptShell.runScript(ScriptShell.java:82)
at net.filebot.cli.ArgumentProcessor.process(ArgumentProcessor.java:114)
at net.filebot.Main.main(Main.java:170)
Failure (°_°)
OR if [it, lang]
MissingPropertyException: No such property: lang for class: Script1
Possible solutions: log, class, _args
groovy.lang.MissingPropertyException: No such property: lang for class: Script1
Possible solutions: log, class, _args

Post by rednoah »

Whatever you're doing is completely unrelated to the {lang} binding that you can use in the format. You're writing a script now, not a format.


You might be able to do something like this though:

Code: Select all

getMediaInfo(file: it, format: '{fn}.{lang}.{ext}')

Post by kim »

I got this to work for me, BUT there is some "bug" with "{lang}".

If the file already has a lang code (e.g. swe) before you rename, then:
subFile.swe.srt --> subFile.swe.swe.srt

or (doesn't work)
subFile.se.srt --> subFile.se.srt

And to make it worse, it's a Danish test subfile.

If it could do what my script does, with every lang code, it would work 100%, I think:
".replaceFirst(/(?i:\.da$|\.dan$|\.dk$|\.en$|\.eng$|\.english$|\.danish$)/, '')"
... but I can see this doing more harm?
So maybe just some built-in unique check?
{fn}{'.'+lang.unique()}{'.'+ext}

Code: Select all

args.getFiles{ it.isSubtitle()}
	.each{
		def filePATH = "$it.parentFile\\"
		def subFile = it.nameWithoutExtension
		def cleansubFile = subFile.replaceFirst(/(?i:\.da$|\.dan$|\.dk$|\.en$|\.eng$|\.english$|\.danish$)/, '')
		if (it.extension == 'srt') { langsubFile = getMediaInfo(file: it, format: '{"."+lang}')}
		
		if (subFile != cleansubFile){
			newFilename = "$filePATH$cleansubFile$langsubFile.$it.extension"
			println "Rename: $cleansubFile.$it.extension --> $newFilename"
			it.renameTo(new File(newFilename))
		}
		else if (subFile == cleansubFile){
			newFilename = "$filePATH$cleansubFile$langsubFile.$it.extension"
			println "Rename: $cleansubFile --> $newFilename"
			it.renameTo(new File(newFilename))
		}
	}

Post by rednoah »

1.
se is not a Swedish language code.

Code: Select all

ISO 639-1	sv
ISO 639-2	swe

2.
So in this case it can't detect the language based on the suffix, but has to use real language detection instead, so if it gets detected as Danish then I guess that Swedish and Danish are very similar languages. Please give me a link to that subtitle file, and tell me what language it should be.

Post by rednoah »

This code will run purely on statistical language detection:

Code: Select all

// print probabilities
def f = '/path/to/swedish/subtitles.srt' as File
def s = f.getText('UTF-8')
println net.filebot.subtitle.SubtitleUtilities.createLanguageDetector().getProbabilities(s)

// filebot subtitle language detection (with encoding auto-detection)
println net.filebot.subtitle.SubtitleUtilities.detectSubtitleLanguage(f)

Post by kim »

OK, so it works like this:
if a lang code is already in the filename, use it, else run language detection?

1. "(doesn't work) subFile.se.srt --> subFile.se.srt"
Is the problem here that it doesn't do the language detection part, OR that it sees ".se" as a lang code?
It's the same with an empty file.
Same for subFile.uk.srt --> subFile.uk.srt, and so on...

2. Unique?
subFile.da.srt --> subFile.da.dan.srt (works, but kind of the same?)
subFile.dan.srt --> subFile.dan.dan.srt (works, but 2x lang code, not unique?)
subFile.eng.srt --> subFile.eng.eng.srt (works, but 2x lang code, not unique?)


PS: for all of the roughly 200 *.srt files I tested, language detection was 100% correct (if the "lang code" is removed beforehand).
Easy to test in GUI "F2" mode with {fn}{'.'+lang}{'.'+ext}

Post by rednoah »

1.
FileBot will check the ".lang" part before the ".extension" part. It will not check the entire filename for any language codes or names. If the ".lang" part is not a valid language code (e.g. se) then it'll be ignored.

hello.eng.srt => .eng
how.are.you.srt => no language code

SE and UK are country codes, NOT language codes. Country codes are not used for language detection.


2.
{fn} is the filename (i.e. name without extension). {lang} is the language code. If the filename already contains the language code, then it'll appear twice of course.

e.g. subFile.da.dan.srt
* subFile.da => original filename
* dan => language code
* srt => original extension

Works exactly as per specification.

Post by kim »

1.
With a Danish subfile: how.are.you.srt --> how.are.you.dan.srt (not "how.are.you.srt => no language code")

Ignoring country codes is OK (as in, don't use them as "lang"),
but NOT OK if it then doesn't detect and add the "lang" afterwards.

subFile.dk.srt --> subFile.dk.srt (ignore = wrong)
subFile.dk.srt --> subFile.dk.dan.srt (detect and add "lang" = right way)
E.g. I have a file where the "group" is UK --> filename-UK.srt (ignore = wrong)

2.
E.g. if you download a Danish sub with Kodi, it will be named filename.da.srt.
Now let's say you want to re-scrape: filename.da.srt --> filename.da.dan.srt (you see the problem?)

OR
let's say the "lang" in filename.dan.srt is correct:
re-scrape = filename.dan.srt --> filename.dan.dan.srt (you see the problem?)

OR
let's say the "lang" in filename.eng.srt is for some reason wrong:
re-scrape = filename.eng.srt --> filename.eng.eng.srt (you see the problem?)
even though it is in fact e.g. a Danish sub.


My way of "fixing" it was to first remove the different kinds of "lang" suffix:
".replaceFirst(/(?i:\.da$|\.dan$|\.dk$|\.en$|\.eng$|\.english$|\.danish$)/, '')"

There aren't many files where a country code gets ignored (blabla.CountryCode.srt), so I can live with that.

Not sure about the different lang code styles (from e.g. Kodi)... "da" is not detected and renamed to "dan", or "en" to "eng".

But the 2x "lang" must be just one {lang}... it must do e.g. the remove-first step like I do.

Post by rednoah »

1.
I don't know what you're trying to say. Again, DK (Denmark, a country) is not a language code; DA (Danish, a language) is a language code. Since there is no language code suffix, it'll perform language detection.

If the language code is .eng but the subtitle is actually in another language, then that's just bad naming. FileBot will take the language code at face value.


2.
Yes, but it's your fault for using {fn} in the first place. Standard formats like {ny}{'.'+lang} are not affected for obvious reasons.

If you're using {fn} then you need to figure out which part of {fn} is it that you want. Probably something like {fn.before(/\.\w+$/)}.


3.
I don't know what Kodi uses. Keep in mind that {lang} is NOT a String, it's a Language object. So you can do {lang.ISO2} or {lang.ISO3} or {lang.ISO3B} or {lang.name} or {lang.locale.getDisplayName(Locale.CHINESE)} or whatever works for Kodi.
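
On point 2, one caveat with a generic pattern like /\.\w+$/ is that it matches any trailing ".word", so a name such as how.are.you would lose ".you". Here is a small sketch in plain Groovy that removes the suffix only when it is a known code. The helper `stripLanguageSuffix` and the hand-picked code list are hypothetical and not part of the FileBot API.

```groovy
// Hypothetical helper: drop a trailing ".xx" / ".xxx" part only if it is a known language code.
String stripLanguageSuffix(String name, Collection<String> knownCodes) {
    def m = name =~ /(?i)(.+)\.([a-z]{2,3})/
    // matches() requires the whole name to fit "<base>.<2-3 letters>"
    if (m.matches() && knownCodes.contains(m.group(2).toLowerCase())) {
        return m.group(1)
    }
    return name
}

// Assumed list for illustration; extend it with whatever codes your files use
def codes = ['da', 'dan', 'en', 'eng', 'sv', 'swe'] as Set

assert stripLanguageSuffix('subFile.da', codes) == 'subFile'
assert stripLanguageSuffix('subFile.dan', codes) == 'subFile'
assert stripLanguageSuffix('how.are.you', codes) == 'how.are.you'

// Appending a detected code afterwards no longer duplicates it
assert stripLanguageSuffix('subFile.dan', codes) + '.dan' + '.srt' == 'subFile.dan.srt'
```

A suffix that is not in the list, such as .dk or .se, is left in place, which mirrors the "taken at face value" behaviour described above rather than guessing.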