Support for multiple subtitles in the same language

Post by **Pat** » 09 Nov 2014, 09:30

I use {'.'+lang}, which works pretty well for detecting my subtitle files (and ignoring anything else), but it is limited to one "unique" language per video.

Sometimes there are multiple subtitles for the same language, e.g. for the hearing impaired, for short parts in a foreign language, or for localized versions for different countries.

These files are mostly named one of
<title>.eng.srt
<title>.<type>.eng.srt
<title>.eng.<type>.srt

Where type could be "shd", "forced", "usa/us", "gbr/gb". I imagine it could even be "directors commentary" or almost anything else.

Now my wish/solution

- There is a standard for localization, that is [en], [en-us], [en-gb], etc., so I would like to use two-letter language codes and keep the four-letter codes (language plus country) if provided. What about a new binding "ln" for a two/four-letter code?

- shd and the rest should be considered "descriptions", and a binding would be helpful, such as sdesc, or maybe spredesc/spostdesc for whatever string comes immediately before or after the language code, between the "." separators.

And while I am at it, is there any method() that converts from three- to two-letter codes? I also noticed that many codes are missing. (As a temporary workaround I tried mapping my shd/forced/etc. to uncommon language codes such as Esperanto, Yiddish, etc.)

Post by **rednoah** » 09 Nov 2014, 11:15

There is no standard, and there is certainly no way a program could reliably guess any of these things from the subtitle data.

However, any logic that you could come up with that would work in your case, you can certainly do yourself via the format.

Post by **Pat** » 09 Nov 2014, 14:40

I guess you are addressing the "description" issue with that answer. As long as the language issue is solved, a user regex would be trivial if you do not see a viable way to provide a reliable binding. So let me go back to the same-language, different-country subtitles.

I just found you had said a while ago:

"the {lang} binding will detect any language code and always force ISO 639-3 for your convenience"

That is wrong. When my ISO 639-1 language codes are "en-gb" and "en-us", FileBot converts them BOTH to "eng" "for my convenience". That downgrades a 4-letter code into a 3-letter code, losing information, losing a feature, and creating a name collision between two different files! At least you should give the user the choice.

[edit: actually "en-gb" was detected as language and translated into "spa" as the last item in the srt set, see attachment]

As far as I see, "eng-gbr" or "eng-gb" style is not part of the ISO 639-3 standard

Post by **Pat** » 09 Nov 2014, 15:36

In addition I may add that I also include the language if the audio and text streams have it encoded.
That results in funny things like:
<filename> [2ch AAC.en.it] [mp4s.en.it.sp].mp4
<filename> [2ch AAC.en.it] [mp4s.en.it.sp].eng.srt

Yes, internally it is always two letter codes, and externally it is forcing 3 letters.

Post by **rednoah** » 09 Nov 2014, 15:47

en-gb is not a language code, it's a locale, consisting of ISO 639-1 language code and ISO 3166-1 alpha-2 country code.

There is no internal/external. When you take the language code from the container file, then it's doing just that, but it really could be anything, and just happens to be 2-letter codes. Not sure if there's a standard. {lang} is predictable in that everything will be converted to standard 3-letter codes (as best as possible, but no guarantees). Language is detected by a standard suffix (.en, .eng, .English, .Inglise, etc.), but otherwise it'll fall back to OpenSubtitles language detection (which has a good chance of being wrong; machine language detection is tricky after all). Bad language codes like en-gb fall into the latter category.
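
To illustrate the language/country split described above, here is a minimal sketch in plain Java using java.util.Locale. It is not FileBot code; the class name LocaleTagDemo and the method describe are made up for this example.

```java
import java.util.Locale;

// Hypothetical demo class, not part of FileBot.
final class LocaleTagDemo {

    // Splits a tag such as "en-gb" into its ISO 639 language part and its
    // ISO 3166 country part, and reports the three-letter form of each.
    static String describe(String tag) {
        Locale locale = Locale.forLanguageTag(tag);
        return locale.getLanguage() + "/" + locale.getCountry()
                + " -> " + locale.getISO3Language() + "/" + locale.getISO3Country();
    }

    public static void main(String[] args) {
        System.out.println(describe("en-gb")); // en/GB -> eng/GBR
        System.out.println(describe("en-us")); // en/US -> eng/USA
        System.out.println(describe("en"));    // en/ -> eng/
    }
}
```

Both "en-gb" and "en-us" share the language part "eng"; only the country part tells them apart, which is why a language-only binding collapses them.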

PS:
The format gives you ample room for writing your own logic. If {lang} doesn't work for you, don't use {lang}.

Post by **Pat** » 09 Nov 2014, 16:01

{lang} works; pity you force users into 3-letter codes instead of giving them the choice.
The libraries are limited. I tried using "eng".someJavaMethods(), but none of the methods I tried was defined.

Without access to the libraries that the lang binding uses, I cannot match language codes embedded in the file name.

Post by **rednoah** » 09 Nov 2014, 16:18

Choice 1:

Code: Select all

{lang}

Code: Select all

{lang.ISO3}

Choice 2:

Code: Select all

{lang.ISO2}

Choice 3:

Code: Select all

{lang.name}

Choices 4..infinite:

Code: Select all

{lang.name.upper()}

Code: Select all

{lang.ISO2.reverse()}

Code: Select all

{lang.name.transliterate('any-katakana')}

But just {lang} doesn't fix your bad "language codes" like en-gb and the like. But since you already know the "language code" you want, why not take it from the filename?

Choices infinite..2*infinite:

Code: Select all

{fn.match(/\.\w{2,3}(?:-\w{2})?$/)}
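
For readers who want to see what that pattern accepts, here is a small sketch in plain Java. It only exercises the regular expression itself; it does not reproduce FileBot's match() method, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of FileBot.
final class LanguageSuffixDemo {

    // The pattern from the post: a dot, 2-3 word characters, an optional
    // "-xx" part, anchored at the end of the name.
    private static final Pattern SUFFIX = Pattern.compile("\\.\\w{2,3}(?:-\\w{2})?$");

    // Returns the matched suffix including the leading dot, or null if none.
    static String suffix(String nameWithoutExtension) {
        Matcher matcher = SUFFIX.matcher(nameWithoutExtension);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(suffix("Movie.en-gb"));     // .en-gb
        System.out.println(suffix("Movie.eng"));       // .eng
        System.out.println(suffix("Movie.en.forced")); // null
    }
}
```

Note that \w also matches digits, so a name ending in ".720" would match as well; the pattern may need tightening for such cases.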

Post by **Pat** » 09 Nov 2014, 21:51

Well, this is great!
I have tried so many java methods to convert it...
Now I see lang is an object with several properties holding all the needed info.
That is the choice I wanted, but how could I have known without you telling me, or without thinking of testing the properties of lang? Is it documented somewhere? Where can I look?

And what about audio.language for instance? What are the options?
My audio.language returns "en" but audio.language.name will of course not work.
Is there a list of available methods somewhere?
Is all this supported? http://groovy.codehaus.org/groovy-jdk/index-all.html
And where is transliterate() documented? I cannot find it anywhere (apart from in this short list http://www.filebot.net/naming.html)

Post by **rednoah** » 10 Nov 2014, 05:14

Most of it is Groovy, so it's documented by Groovy. There are some minor FileBot additions, the important ones of which are documented, while all the rest is documented through the plethora of examples here in the forums, and of course the source.

In particular, match is highly documented, and part of almost any example.

{lang} is a Language object. Use toString() to get a Java String.
If you extract data via the generic MediaInfo access bindings you'll always get String objects. You can use standard Java Locale to try to convert that to the language name.

Also I highly discourage using anything but the proper standard ISO 639-3 language codes, and support for the 2-letter codes is unofficial and subject to change/removal, thus no official docs.

Post by **Pat** » 10 Nov 2014, 12:04

Thanks, I will evaluate whether to convert all my (mostly) two-letter codes taken from within containers into ISO3 (they follow http://digitalcinemanamingconvention.com/ and are not always ISO compliant, e.g. LAS for Latin American Spanish).

I also found out that en-gb is not only a "locale" but also a "language" tag as per IETF: the primary language subtag (two or three letters) may be followed by several further subtags separated by "-", one of them being the region (http://en.wikipedia.org/wiki/IETF_language_tag). For the moment I'll skip this.

But I will keep the "description" since it is supported by several players and media centers (and I need to name these distinct files somehow), in the format of <title>.<lang>.<desc>.srt where desc could be "forced", "SHD", "Latin American Spanish", "English for the hearing impaired", "Director's comment", etc.

As far as I see, lang.code is the matched original string from the file name.
But how can I use "lang.code" (or other expanded strings) in a regular expression in match()?

I would need to detect the full string after lang.code+'.' and before '.'+ext

This works, but it's hard coded
fn.match(/(?<=\.en\.).*[^\.srt$]/)

and this same thing does not work
fn.match(/(?<=\.{lang.code}\.).*[^\.{ext}$]/)
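
Two details of the expressions above are worth spelling out. Inside a regex, {lang.code} is read as literal pattern text rather than replaced by the binding's value, and [^\.srt$] is a character class (one character that is none of ".", "s", "r", "t", "$"), not "anything before .srt". The following plain-Java sketch shows one way to build such a pattern from a variable. It is an illustration only, not FileBot code; the names are hypothetical, and it assumes the name is passed without its extension.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of FileBot.
final class AfterLanguageDemo {

    // Returns the text that follows ".<code>." up to the end of the name,
    // or null if the name does not contain ".<code>." followed by more text.
    static String descriptionAfter(String nameWithoutExtension, String code) {
        // Pattern.quote makes the code a literal, whatever characters it holds.
        Pattern pattern = Pattern.compile("(?<=\\." + Pattern.quote(code) + "\\.).+$");
        Matcher matcher = pattern.matcher(nameWithoutExtension);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(descriptionAfter("Movie.en.forced", "en")); // forced
        System.out.println(descriptionAfter("Movie.eng.hi", "en"));    // null
    }
}
```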

Post by **rednoah** » 10 Nov 2014, 12:18

I don't think using match with variables makes a lot of sense. It's too restrictive and lang isn't perfectly reliable so things will fail randomly.

Also fn is the file name without extension. For your special requirements I'd make a simple regex that'll directly match all the patterns you have. You can hardcode that easily.

This one should be pretty restricted already:
{fn.match(/\.\w{2,3}(?:-\w{2})?$/)}

But you can fine tune it to your needs.

Post by **Pat** » 11 Nov 2014, 08:44

I have the following terrible file names
"Back to the Future.1985 ID13 - el-Greek.srt"
"Back to the Future.1985 ID14 - el-Greek.srt"
"Back to the Future.1985.eng hi.srt"

Request 1) For further manipulation of fn, it would be very useful to have a binding holding the base name, ie fn.bname="Back to the Future.1985". I guess FileBot picks it up from the video file (fn.vname?)

Request 2) When FileBot detects the language (always in these examples), it builds the lang object with properties such as ISO2, ISO3, etc. But it is impossible to know exactly which string was matched. For further manipulation of fn, it would be very useful to also have a property with the matched string, ie lang.matched="el-Greek" (or was it "el"?)

In both cases FileBot knows the info with 100% certainty, while the user can only try to guess!

Less important and not sure how easy to implement:

Request 3) lang detection mostly works if the language is found at the end of the file name, and mostly fails if found somewhere else like in:
Title.en.forced.srt

In such cases, the user is in a better position to find the language string.
Therefore it would be helpful to have string methods analogous to the lang properties, e.g. "en".ISO3(), "eng".ISO2(), etc. A null value would tell that a string is not a language, e.g. "ID14".ISO3().

Post by **rednoah** » 11 Nov 2014, 09:12

1.
FileBot does not know the "base name", though for your particular cases you have all the information you need via {n} and {y} so the "base name" would be {"$n $y"}

2.
This would only be useful for your exact special case. {lang} already covers the default case. Plus your special locale strings are not supported anyway.

3.
language tag at the end of the file name => use it
otherwise => ask OpenSubtitles (probably doing some language detection on some fuzzy character heuristics, so likely to get it wrong)

No intention of adding mostly-useless methods to the String type.

But I'm sure it can be done with Java Locale quite easily: new Locale(code).getISO3Country()

@see https://docs.oracle.com/javase/8/docs/a ... ocale.html
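
As a rough sketch of the Locale route in plain Java: getISO3Country() concerns the country part of a locale, while the language counterpart is getISO3Language(), which is what the sketch below uses. The JDK has no direct three-to-two-letter lookup, so the sketch builds a reverse table from Locale.getISOLanguages(). The class and method names are hypothetical, and Locale.forLanguageTag(code) is used in place of new Locale(code) only because the constructor is deprecated in recent JDK releases.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical helper, not part of FileBot.
final class IsoLanguageCodes {

    private static final Map<String, String> ISO2_TO_ISO3 = new HashMap<>();
    private static final Map<String, String> ISO3_TO_ISO2 = new HashMap<>();

    static {
        // Locale.getISOLanguages() lists the two-letter codes known to the JDK.
        for (String iso2 : Locale.getISOLanguages()) {
            try {
                String iso3 = Locale.forLanguageTag(iso2).getISO3Language();
                ISO2_TO_ISO3.put(iso2, iso3);
                ISO3_TO_ISO2.put(iso3, iso2);
            } catch (MissingResourceException e) {
                // No three-letter form known to this JDK: skip the code.
            }
        }
    }

    // "en" -> "eng"; null if the input is not a known two-letter code.
    static String toIso3(String code) {
        return ISO2_TO_ISO3.get(code.toLowerCase(Locale.ROOT));
    }

    // "eng" -> "en"; null if the input is not a known three-letter code.
    static String toIso2(String code) {
        return ISO3_TO_ISO2.get(code.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(toIso3("en"));   // eng
        System.out.println(toIso2("eng"));  // en
        System.out.println(toIso3("ID14")); // null
    }
}
```

A null result doubles as the "is this string a language code at all?" check asked for in Request 3, limited to the codes the JDK knows.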

Names like movie.en.forced.srt or movie.en.HI.srt are quite common I think though. If you give me a complete list of standard subtitle modifiers like ".forced" or ".HI" I can add logic for that.

Post by **Pat** » 11 Nov 2014, 14:07

rednoah wrote:1.FileBot does not know the "base name", though for your particular cases you have all the information you need via {n} and {y} so the "base name" would be {"$n $y"}

Particular cases are simple, but we want generic solutions.
Before renaming, our "just acquired" files do not necessarily follow any strict rules (hence the renaming).
When renaming an srt, FileBot lets me use the original video codecs, streams and other video file internals, but it does not know the trivial original file name? Am I missing something?

If I want to use "lang" (I want), and get rid of the matched strings (but keeping the rest, ie, an unknown lang subtag) I have no way to know.
ie
"Back to the future I ID14 en hi"
Is the title "Back to the Future", "Back to the Future I", "Back to the future ID14"?
It gets worse when/if you support lang subtags or modifiers like "forced"

But I must be missing something. How is it possible that for building the subtitle file name I can use video codecs, audio codecs and many other esoteric infos from the original video file
but I cannot use the more trivial original filename.ext?

rednoah wrote:2.This would only be useful for your exact special case. {lang} already covers the default case. Plus your special locale strings are not supported anyway.

After you detect a lang I have no way to verify what is "the rest" or validate it.
It's not a special case. Was it two letters? Three? A locale? Is there a HI, forced or SHD there? Any other word lying around that may or may not be part of the original title or language modifier? You give options to convert that string to ISO2, ISO3, locale and more, but you cannot show what you are converting? Trivial for FileBot to tell, but incredibly difficult or impossible for the user to discover and/or automate. lang is very useful, but fuzzy and unpredictable. To help it and tune it, it is imperative to know the original form before conversion.

rednoah wrote: 3.it can be done with Java Locale quite easily: new Locale(code).getISO3Country()
@see https://docs.oracle.com/javase/8/docs/a ... ocale.html

Yes, that would be the way to go

rednoah wrote: Names like movie.en.forced.srt or movie.en.HI.srt are quite common I think though. If you give me a complete list of standard subtitle modifiers like ".forced" or ".HI" I can add logic for that.

XBMC uses:
http://kodi.wiki/view/Subtitles#Externa ... _Subtitles
Movie Name (2006).English.Forced.srt
Movie Name (2006).en.forced.srt
Movie Name (2006).German.Forced.srt
Movie Name (2006)-Swedish-Forced.srt

You will also find HI,hi, SHD (this one used by the film industry)
I have seen often (though it is not the same)
Movie Name (2006).English Directors commentary
Movie Name (2006).en.Director's commentary
Movie Name (2006).pt-brazil
Movie Name (2006).pt-portugal

The best would be
<movie base name>[.-]<lang as supported by the libs>[.-]<any modifier(s)/extension[s]>
with the language name always at the beginning. That follows the IETF standard:

http://en.wikipedia.org/wiki/IETF_langu ... guage_tags
https://tools.ietf.org/html/bcp47

In the basic form, these ordered hyphen separated subtags
primary lang
optional extended lang
optional script
optional region
optional variant(s)
optional extension(s)
optional private-use
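
As an aside, Java's own Locale class parses tags in this subtag order. The sketch below is plain Java with invented names, and it assumes that a modifier such as "forced" would be carried as a private-use subtag ("x-forced"), which is only one possible reading of the standard and not something FileBot is known to do.

```java
import java.util.Locale;

// Hypothetical demo class, not part of FileBot.
final class SubtagDemo {

    // Returns "language|script|region|private-use" for a BCP 47 style tag;
    // parts that are absent come back as empty strings.
    static String describe(String tag) {
        Locale locale = Locale.forLanguageTag(tag);
        String privateUse = locale.getExtension(Locale.PRIVATE_USE_EXTENSION);
        return String.join("|",
                locale.getLanguage(),
                locale.getScript(),
                locale.getCountry(),
                privateUse == null ? "" : privateUse);
    }

    public static void main(String[] args) {
        System.out.println(describe("zh-Hant-TW"));  // zh|Hant|TW|
        System.out.println(describe("pt-BR"));       // pt||BR|
        System.out.println(describe("en-x-forced")); // en|||forced
    }
}
```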

Or simpler
<movie base name>[.-]<lang><-subtag1><-subtag2>...<-subtagN>

Post by **rednoah** » 11 Nov 2014, 15:47

1.
What's not working with {fn} {folder} {file.name} {file.path} etc? Of course you have access to the current filename, that's how fn.match() allows you to keep parts of the filename. You also have access to the original filename after renaming via xattr metadata if enabled.

2.
Since video+subtitle pairs are expected to be renamed to the same base name, all MediaInfo bindings (and some others) on subtitles files (which would always fail) are transparently evaluated against the actual video file.

There's links to the source on the website. Check out getInferredMediaFile() for details.

3.
Something like this would be the easy solution that works for most people here that care about these things:
{'.'+lang}{fn.match(/\.(forced|hi)/)}

Assuming {lang} works correctly. If .forced interferes with that I might have to fix that. But {lang} still won't include "forced" or "hi". That's something you have to retain by matching it from the existing filename.

4.
It's not easy because I care about false positives a lot more than false negatives. dot/slash/underscore are used interchangeably, the language tag could be anywhere, there may not be a year number.

a.en.hi.srt => movie a, with en.hi subtitles? or movie a en with hebrew subtitles? I don't mind if {lang} doesn't always work, as long as it doesn't accidentally work when it shouldn't.

Post by **rednoah** » 11 Nov 2014, 18:02

You can play with this:

Code: Select all

{fn.after(media.FileName)}

media => video filename
fn => subtitles filename

And the difference will give you the current "language suffix" whatever it may be.
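
The idea of taking the difference between the two names can be sketched in plain Java as follows. This is not how FileBot implements after(); it is a literal prefix comparison with hypothetical names, shown only to make the "difference" concrete.

```java
// Hypothetical demo class, not part of FileBot.
final class SubtitleSuffixDemo {

    // Returns whatever the subtitle name carries beyond the video name,
    // or null if the subtitle name does not start with the video name.
    static String suffixAfter(String subtitleName, String videoName) {
        if (!subtitleName.startsWith(videoName)) {
            return null;
        }
        return subtitleName.substring(videoName.length());
    }

    public static void main(String[] args) {
        String video = "Back to the Future (1985)";
        System.out.println(suffixAfter(video + ".eng.forced", video)); // .eng.forced
        System.out.println(suffixAfter("Other Movie.eng", video));     // null
    }
}
```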

Post by **Pat** » 11 Nov 2014, 22:12

Oh, I missed your second post. It came hours later, when I was already composing my [deleted] reply.

Yes, the media object solves many things, I did not know it existed.
It exposes a lot of useful things, media.FileName above all!

I will edit my reply to reflect this new insight and post again.

Post by **Pat** » 11 Nov 2014, 23:29

We are getting closer, {fn.after(media.FileName)} is a great step forward

Now if only I knew your matched language string without needing to reverse engineer your heuristics...

I could then cleanly find all compliant language modifiers after it with
{fn.after(media.FileName).after(lang.matchedStr)}

With lang.matchedStr I would not need the following object, but it would be IETF compliant (you had asked about standards)

lang.subtags[] (an array with all hyphen separated strings that come after lang.matchedStr)
It would handle automatically "SHD", "forced", "Director's commentary", language regional info, additions to ISO 639-3 languages or any free/private use text.

rednoah wrote: Assuming {lang} works correctly. If .forced interferes with that I might have to fix that

Yes, it interferes.
In "eng.forced.srt" lang is not detected
Have not tested "eng-forced.srt"
But it was working in "eng hi.srt"

Post by **rednoah** » 12 Nov 2014, 10:19

Pat wrote: It would handle automatically "SHD", "forced", "Director's commentary"

This is fixed in the latest release, not the generic locale/language extensions though. At the very least it's fully compatible with XBMC/Plex specs now.

How about this one?

Code: Select all

{fn.after(media.FileName).after([lang.ISO3, lang.ISO2, lang.name].join('|'))}

Post by **Pat** » 12 Nov 2014, 11:13

rednoah wrote: How about this one?
Code: Select all
{fn.after(media.FileName).after([lang.ISO3, lang.ISO2, lang.name].join('|'))}

Well, yes. But why test against many possible conversions to see if I succeed in matching the string when we already have it? That is my heuristics on top of yours. FileBot could simply tell me with 100% certainty.

But granted, media.FileName was the most important data here; it gives a precise back boundary. From there I can hack my way through.

rednoah wrote:It would handle automatically "SHD", "forced", "Director's commentary"
This is fixed in the latest release

FileBot_4.5.2_B1?
I was testing with FileBot 4.5.2 (r2678)
I'll give it a try.

Post by **rednoah** » 12 Nov 2014, 11:24

No release for that yet. It's in the latest revision jar though:
viewtopic.php?f=7&t=1609

Post by **Pat** » 13 Nov 2014, 04:56

There seems to be an escaping problem.
The original files had no [] in the file names, and all worked as expected.
But when I apply the same conversion to the new file names (now with brackets), it fails.

This is the testing code:

Code: Select all

{'This is fn: '+fn+' XXX This is media.FileName: '+media.FileName+' XXX This is fn.after(media.FileName): '+fn.after(media.FileName)}

This is one example of a failing file:
Back to the Future (1985) [720p x264] [6ch AAC].ar-Arabic-ID3.srt

This is the outcome:
This is fn: Back to the Future (1985) [720p x264] [6ch AAC].ar-Arabic-ID3 XXX This is media.FileName: Back to the Future (1985) [720p x264] [6ch AAC] XXX This is fn.after(media.FileName): Back to the Future (1985) [720p x264] [6ch AAC].ar-Arabic-ID3

I presume .after() has trouble parsing the brackets.

In addition (but less critical):

lang IS properly detected in
Back to the Future.1985 ID3 - ar-Arabic.srt

lang IS NOT detected in
Back to the Future (1985) [720p x264] [6ch AAC].ar-Arabic-ID3.srt

lang USED TO be detected but is NOT ANY MORE in
Back to the Future.1985.eng hi.srt

Testing with HEAD
FileBot 4.5.2 (r2702) / OpenJDK Runtime Environment 1.8.0_40

BTW, I am testing this in the GUI *without modifying my files*, just seeing how files would be modified.
But there I cannot copy & paste. How can I do it from the command line and see the text output in the console?

Post by **rednoah** » 13 Nov 2014, 05:33

1.
*-Arabic.srt => detect as "Arabic" language

2.
*-ID3.srt => not detectable

I could add "-ID3" to the same list as "-forced" but in this case I'll call it bad data / won't be fixed.

3.
.eng.hi => ok
-eng-hi => ok
.eng hi => not ok, just a case of non-standard bad naming that is not supported (though you can of course fix it yourself easily in the format or pre-renaming of files, at your own risk)

4.
I've removed OpenSubtitles language detection. If it doesn't work reliably I might as well not even try, wasting OpenSubtitles resources needlessly.

5.
-rename --action test

6.
String.after(regex) takes a regular expression as pattern. Not a String literal. You can do a \Q..\E to quote things. I'd use String.replace(String, String) instead. Replace FileName with '' empty string.
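
The regex-versus-literal behaviour from point 6 can be reproduced in plain Java. The sketch below does not call FileBot's String.after(); it only shows, with hypothetical helper names, why a file name containing parentheses and brackets stops matching itself once it is interpreted as a pattern, and how quoting or a literal replace avoids that.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of FileBot.
final class LiteralVersusRegexDemo {

    // Interprets the video name as a regular expression: "(1985)" becomes a
    // group and "[720p x264]" a character class, so the name no longer
    // matches its own literal text.
    static boolean foundAsRegex(String text, String videoName) {
        return Pattern.compile(videoName).matcher(text).find();
    }

    // Quotes the video name first (the \Q..\E approach), so every character
    // is matched literally.
    static boolean foundAsLiteral(String text, String videoName) {
        return Pattern.compile(Pattern.quote(videoName)).matcher(text).find();
    }

    // String.replace(String, String) works on literal text, no quoting needed.
    static String stripLiteral(String text, String videoName) {
        return text.replace(videoName, "");
    }

    public static void main(String[] args) {
        String video = "Back to the Future (1985) [720p x264] [6ch AAC]";
        String subtitle = video + ".ar-Arabic-ID3";
        System.out.println(foundAsRegex(subtitle, video));   // false
        System.out.println(foundAsLiteral(subtitle, video)); // true
        System.out.println(stripLiteral(subtitle, video));   // .ar-Arabic-ID3
    }
}
```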

@see http://sourceforge.net/p/filebot/code/H ... .java#l177

@see viewtopic.php?f=8&t=1558

Post by **Pat** » 14 Nov 2014, 00:19

OK, I am done with my script. It seems to be working very well, and testing from the CLI sped things up.

The script became large, so I wrote it in many lines and added comments.
1. Would it be a good idea to have a button next to the GUI format field to expand it into many lines?

2. Is there a standard way to read it from a file?

I am using

Code: Select all

--format "`grep -v '//' myscript |tr -d '\n'`"

It first removes the lines containing comments and then removes the newlines (a single line does not work with double-slash comments).

The script also runs with comments and multilines if I take care to backslash certain lines to avoid ambiguous expressions.

This is my script, it should be easy to read.

Code: Select all

{
  // Basename format: Title info [Video stream info] [Audio streams info-langs] [Sub streams info-langs]
  // External sub format: Basename.lang-extensions

  // Title
  // For series: Name - Season#Episode# - Title
  // For movies: Name (Year) 
  // For Music: TODO

  any { if (series) n + ' - ' + s00e00.lower() + ' - ' + t } 
      { if (movie) n + ' (' + y + ')' } 
      { if (music) "TODO" } 

}{

  // Video Stream (VS)
  // Format [VFormat VCodec]

  ' ['+ vf + ' ' + vc + ']'

}{

  // Audio streams (AS)
  // Format [AFormat ACodec.(lang1-lang2-langN or xN)] 

  // For many streams: hyphen separated languages from AS properties. Fallback to number of AS
  // For one stream: lang from AS properties. Fallback to lang from VS properties, then empty

  // Are there any streams?
  if (audio) 
  { 
    ' [' + af + ' ' + ac 
    + any {'.' + audios.language.join("-")} 
          {(audios.size()>1) ? ".x" + audios.size() : '.' + video.language} 
    + ']'
  } 

}{

  // Subtitle streams (SS)
  // Format [SFormat.(lang1-lang2-langN or format.xN)]

  // For many streams: hyphen separated languages from SS properties. Fallback to number of SS
  // For one stream: lang from SS properties. Fallback to lang from AS, then from VS, then number of AS

  // Are there any streams?
  if (text.format)
  {
    // Clean up the format, and get lang from sub streams, audio, video, or show number of sub streams
    ' [' + text.format.replaceAll(/\W/, '') + '.' 
    + any {texts.language.join("-")} 
          {(texts.size()>1) ? "x" + texts.size() : any {audio.language} 
                                                       {video.language} 
                                                       {"x1"} 
          } + "]"
  }

}{

  // Subtitle Files (SF) 
  // Format (.ISOlang-postLangString-preLangString or fullString)

  // If language not detected: Keep original text, trimmed.
  // If language detected: Append after lang, hyphen separated, the trimmed strings found after and then before lang.

  // Is there any text?
  if (fn.replace(media.FileName, ""))
  {
    any {'.' + lang.ISO2 
         + '-'  + fn.replace(media.FileName, "").after([lang.ISO3, lang.ISO2, lang.name, lang].join('|')).match(/\w.*\w/) 
         + '-'  + fn.replace(media.FileName, "").before([lang.ISO3, lang.ISO2, lang.name, lang].join('|')).match(/\w.*\w/) 
        } 
        {'.' + fn.replace(media.FileName, "").match(/\w.*\w/).replaceAll(/[ ]+-[ ]+/,"-")} 
  }
}

As you can see, I am fully trusting FileBot for language detection, then rescuing any extra string (SHD, forced or anything) that may be next to it. If detection fails, I keep the original subtitle strings as they are, which I can later edit by hand on a case-by-case basis if I find it necessary. This will spare me from file name collisions too.

3. Any best practices or important concepts I am deviating from due to my lack of groovy and FileBot knowledge?
I have been using both for only two days, so any particular or general feedback is more than welcome.

Post by **rednoah** » 14 Nov 2014, 04:33

1. & 2.
Not planned. Complicated format scripts like yours are not the primary target, but possible if you really really really need to, but most wouldn't be able to, not even close. First time I see --format `format-from-command` here in the forums.

3.
You're at the frontier. No best practices for that yet.
