Japanese Transliteration

Post by devster » 14 Apr 2019, 20:04

I've been trying for a while to do transliteration properly, especially for Japanese movies.
The fundamental issue stems from ICU. Because of Han unification (https://en.wikipedia.org/wiki/Han_unification), these characters are shared between Chinese, Japanese, and Korean. The library uses Chinese Pinyin by default, which produces awful results for Japanese text.
Luckily there's kuroshiro (https://github.com/hexenq/kuroshiro), whose author kindly provides an API that handles romanization fairly well.
This is a format snippet for Japanese romanization.

Code: Select all

{
  // the format throws an error without these imports (in the GUI at least)
  import groovy.json.JsonSlurper
  import groovy.json.JsonOutput

  def translJap = {
    // the API is rate limited (100 requests per day, I believe), so please be careful
    def url = new URL('https://api.kuroshiro.org/convert')
    def requestHeaders = [:]
    // request body: the text to convert plus the romanization options
    def postBody = [:]
    postBody.str = it
    postBody.to = "romaji"
    postBody.mode = "spaced"
    postBody.romajiSystem = "hepburn"
    def postResponse = url.post(JsonOutput.toJson(postBody).getBytes('UTF-8'), 'application/json', requestHeaders)
    def json = new JsonSlurper().parseText(postResponse.text)
    return json.result
  }
}

Fairly simple, yet effective.
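
Since the API is rate limited, it may be worth caching results so that the same title is only sent once per run. The following is only a rough sketch of that idea using Groovy's built-in Closure.memoize(). The names buildBody, parseResult, fakePost and translJapCached are made up for illustration, and fakePost is a hypothetical stand-in for the real HTTP call so that the example runs offline. The only thing taken from the snippet above is the request fields and the "result" key of the response.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Build the JSON request body with the same fields as the snippet above.
def buildBody = { String text ->
  JsonOutput.toJson([str: text, to: 'romaji', mode: 'spaced', romajiSystem: 'hepburn'])
}

// Pull the romanized string out of a response body shaped like {"result": "..."}.
def parseResult = { String responseText ->
  new JsonSlurper().parseText(responseText).result
}

// HYPOTHETICAL stand-in for the HTTP POST: counts how often it is called
// and echoes the input back with a prefix instead of contacting the API.
def calls = 0
def fakePost = { String body ->
  calls++
  def text = new JsonSlurper().parseText(body).str
  JsonOutput.toJson([result: 'romaji:' + text])
}

// memoize() caches the result per argument, so a repeated title
// does not trigger a second request.
def translJapCached = { String text ->
  parseResult(fakePost(buildBody(text)))
}.memoize()

println translJapCached('abc')
println translJapCached('abc')
println "requests sent: ${calls}"
```

In a real format, fakePost would be replaced by the url.post(...) call from the snippet above. The cache only lives as long as the closure instance does, so it would not help across separate FileBot runs.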
My advice is a simple if/else to choose based on the primary language:

Code: Select all

  def transl = {
    // Japanese goes through the API, everything else through ICU
    languages.first().iso_639_2B == 'jpn' ? translJap(it) : it.transliterate("Any-Latin; NFD; NFC; Title")
  }

Using languages here is suboptimal. TheMovieDB offers original_language as part of its API, but unfortunately TVDB does not, so this is the easiest common ground and usually correct.
Exceptions exist: The Passion of the Christ, for example, in which case the binding would return Hebrew.
However, primaryTitle should return the original title for TheMovieDB, which corrects the situation, and the English title for TVDB (forced setting).
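
One possible refinement, which is my own assumption and not part of the snippets above: because the API can fail or hit its rate limit, the Japanese branch could fall back to the ICU path instead of breaking the whole format. In the sketch below, translJap and translIcu are hypothetical stand-ins (the first always fails, the second just upper-cases) so that it runs outside FileBot, and the language code is passed in as a plain string rather than read from the languages binding.

```groovy
// HYPOTHETICAL stand-ins so the sketch runs outside FileBot.
// translJap simulates a failed API request; translIcu is a placeholder
// for it.transliterate("Any-Latin; NFD; NFC; Title").
def translJap = { String s -> throw new IOException('simulated API failure') }
def translIcu = { String s -> s.toUpperCase() }

// Pick the transliteration path by language code, and fall back to the
// ICU path if the Japanese API call throws.
def transl = { String title, String lang ->
  if (lang != 'jpn') {
    return translIcu(title)
  }
  try {
    return translJap(title)
  } catch (IOException e) {
    return translIcu(title)
  }
}

println transl('abc', 'eng')
println transl('abc', 'jpn')
```

Whether a silent fallback is desirable is debatable, since the ICU result for Japanese is exactly what this thread is trying to avoid. Failing loudly may be the better choice for some setups.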

A slightly better approach might be to localize the title first and then transliterate it, but I'm not sure if this still works: https://www.filebot.net/forums/viewtopic.php?t=3736#p20820

P.S. seems URL bbcode is OFF.
Last edited by devster on 15 Apr 2019, 23:35, edited 1 time in total.
I only work in black and sometimes very, very dark grey. (Batman)

Re: Transliteration

Post by rednoah » 15 Apr 2019, 05:29

1.
You can use this code to print all IDs via Groovy Console:

Code: Select all

com.ibm.icu.text.Transliterator.getAvailableIDs().each{ println it }

However, it does seem that ICU really has no way to transliterate Kanji to Kana / Latin.


2.
You may be able to use the {localize} binding though, assuming that the database has Japanese language entries:
https://www.filebot.net/forums/viewtopic.php?f=5&t=3761


3.
I've disabled a few BBCode tags to discourage spammers. Let's see if it helps.
:idea: Please read the FAQ and How to Request Help.

Re: Transliteration

Post by devster » 15 Apr 2019, 08:58

1. yes, it's weird that they do. Apparently it could be forced using Han-Latin and SetLocale to Japanese something, but I couldn't figure it out.
I used this http://demo.icu-project.org/icu-bin/translit for testing.

2. I tried https://www.filebot.net/forums/viewtopic.php?t=3736#p20820 to dynamically set language, but it threw an error in the console and I kind of gave up. Will post if I can replicate.

Re: Transliteration

Post by rednoah » 15 Apr 2019, 11:34

1.
Had a look for half an hour. Couldn't figure out a way, and figured you can't reasonably transliterate Kanji character by character because there are tons of readings for each. Haven't read anything about this SetLocale thing though.


2.
{localize} should work:

Code: Select all

$ filebot -list --q "One Piece" --format "{localize.ja.n}" --filter "absolute == 1"
Apply filter [absolute == 1] on [920] items
Include [One Piece - 1x01 - I'm Luffy! The Man Who's Gonna Be King of the Pirates!]
ワンピース

Re: Transliteration

Post by devster » 15 Apr 2019, 12:12

2. Your snippet is perfect, but this threw an error last time I tried:

Code: Select all

{localize[languages.first()].name}

Found it in one of the past examples.

Re: Transliteration

Post by rednoah » 15 Apr 2019, 12:15

Ah... Not sure, maybe early versions of {localize} just return a raw object (hence the .name), rather than a bindings accessor object (which makes it easier, because you can just use the bindings you already know: n, t, etc).

This should work with newer versions:

Code: Select all

{localize[languages[0]].n}