Japanese Transliteration

Posted: 14 Apr 2019, 20:04
by devster
I've been trying for a while to get transliteration right, especially for Japanese movies.
The fundamental issue stems from ICU. Because of Han unification (https://en.wikipedia.org/wiki/Han_unification), these characters are shared between Chinese, Japanese, and Korean. The library uses Chinese Pinyin by default, which produces awful results for Japanese text.
Luckily there's https://github.com/hexenq/kuroshiro, which kindly provides an API that does romanization fairly well.
This is a format snippet for Japanese romanization:

Code: Select all

{
  // without these imports the format throws an error (in the GUI at least)
  import groovy.json.JsonSlurper
  import groovy.json.JsonOutput

  def translJap = {
    // rate limited to 100 requests per day I believe, please be careful
    def url = new URL('https://api.kuroshiro.org/convert')
    def requestHeaders = [:]
    // request body: text to convert, target script, output mode, romaji system
    def postBody = [str: it, to: 'romaji', mode: 'spaced', romajiSystem: 'hepburn']
    def postResponse = url.post(JsonOutput.toJson(postBody).getBytes('UTF-8'), 'application/json', requestHeaders)
    def json = new JsonSlurper().parseText(postResponse.text)
    return json.result
  }
}
Fairly simple, yet effective.
My advice is a simple if/else to choose based on the primary language:
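Since the API is rate limited, it may be worth guarding the call so that a failed request does not break the whole format. The following is only a sketch: withFallback is a hypothetical helper name, and the flaky / working / identity closures are stand-ins for illustration (in a real format the first argument would be the romanization closure).

```groovy
// Hypothetical helper: wrap a romanization closure so that any failure
// (network error, rate limit, empty reply) falls back to another closure.
def withFallback = { Closure primary, Closure fallback ->
  return { String text ->
    try {
      def result = primary(text)
      // null or empty replies are treated as failures
      return result ? result : fallback(text)
    } catch (Exception e) {
      return fallback(text)
    }
  }
}

// Stand-ins for illustration only
def flaky = { String s -> throw new IOException('rate limit reached') }
def working = { String s -> 'romaji: ' + s }
def identity = { String s -> s }

def safe = withFallback(flaky, identity)
def ok = withFallback(working, identity)

println safe('千と千尋の神隠し')   // the request failed, so the input comes back unchanged
println ok('abc')                  // the primary closure succeeded
```

The fallback could just as well be the ICU transliterate call instead of the identity closure; that is a design choice, not something the API requires.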

Code: Select all

  def transl = {
    languages.first().iso_639_2B == 'jpn' ? translJap(it) : it.transliterate("Any-Latin; NFD; NFC; Title")
  }
Using languages here is suboptimal: TheMovieDB offers original_language as part of its API, but unfortunately TVDB does not, so this is the easiest common ground and it is usually correct.
Exceptions exist, for example The Passion of the Christ, in which case the binding would return Hebrew.
However, primaryTitle should return whatever the original title is in TheMovieDB, which corrects the situation, and the English title for TVDB (forced setting).

Slightly better might be to localize the title first and then transliterate it, but I'm not sure if this still works: https://www.filebot.net/forums/viewtopic.php?t=3736#p20820

P.S. It seems URL BBCode is off.

Re: Transliteration

Posted: 15 Apr 2019, 05:29
by rednoah
1.
You can use this code to print all IDs via Groovy Console:

Code: Select all

com.ibm.icu.text.Transliterator.getAvailableIDs().each{ println it }
However, it does seem that ICU really has no way to transliterate Kanji to Kana / Latin.
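To narrow that list down, something like the following might help. It is a rough sketch that assumes it runs where ICU4J is already on the classpath (for example the FileBot Groovy Console); the regular expression and the sample strings are my own choices for illustration.

```groovy
import com.ibm.icu.text.Transliterator

// Keep only the transliterator IDs that mention CJK-related scripts
def ids = Transliterator.getAvailableIDs().toList()
def cjk = ids.findAll { it =~ /Han|Hiragana|Katakana/ }
cjk.each { println it }

// Kana is phonetic, so Katakana-Latin can map it to Latin directly
println Transliterator.getInstance('Katakana-Latin').transliterate('ワンピース')

// Kanji go through Han-Latin, which is where the Chinese readings come from
println Transliterator.getInstance('Han-Latin').transliterate('千と千尋の神隠し')
```

Comparing the two printed lines should make it clear which part of a title ICU handles well and which part it does not.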


2.
You may be able to use the {localize} binding though, assuming that the database has Japanese language entries:
https://www.filebot.net/forums/viewtopic.php?f=5&t=3761


3.
I've disabled a few BBCode tags to discourage spammers. Let's see if it helps.

Re: Transliteration

Posted: 15 Apr 2019, 08:58
by devster
1. Yes, it's weird that they do. Apparently it could be forced by using Han-Latin and setting the locale to Japanese (SetLocale or something similar), but I couldn't figure it out.
I used this http://demo.icu-project.org/icu-bin/translit for testing.

2. I tried https://www.filebot.net/forums/viewtopic.php?t=3736#p20820 to set the language dynamically, but it threw an error in the console and I kind of gave up. Will post if I can reproduce it.

Re: Transliteration

Posted: 15 Apr 2019, 11:34
by rednoah
1.
Had a look for half an hour. Couldn't figure out a way, and concluded that you can't reasonably transliterate Kanji character by character, because there are tons of readings for each. Haven't read anything about this SetLocale thing though.
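To illustrate the point about readings, here is a small sketch in plain Groovy. The lookup table is made up for the example (one arbitrarily chosen reading per kanji); it is not how ICU or kuroshiro work internally.

```groovy
// One fixed reading per kanji (picked arbitrarily for illustration)
def table = ['日': 'nichi', '本': 'hon', '今': 'kon', '毎': 'mai', '曜': 'yō']

// Naive approach: look up each character on its own and concatenate
def naive = { String word -> word.collect { table[it] ?: it }.join() }

// Usual dictionary readings of the same words
def expected = ['日本': 'nihon', '今日': 'kyō', '毎日': 'mainichi', '日曜日': 'nichiyōbi']

expected.each { word, reading ->
  def guess = naive(word)
  println "${word}: naive=${guess} actual=${reading} ${guess == reading ? 'ok' : 'MISMATCH'}"
}
```

Only 毎日 comes out right with this table; 日 alone needs a different reading in each of the other words, which is why a dictionary-based tool is needed rather than a per-character mapping.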


2.
{localize} should work:

Code: Select all

$ filebot -list --q "One Piece" --format "{localize.ja.n}" --filter "absolute == 1"
Apply filter [absolute == 1] on [920] items
Include [One Piece - 1x01 - I'm Luffy! The Man Who's Gonna Be King of the Pirates!]
ワンピース

Re: Transliteration

Posted: 15 Apr 2019, 12:12
by devster
2. Your snippet is perfect; however, this threw an error last time I tried:

Code: Select all

{localize[languages.first()].name}
Found it in one of the past examples.

Re: Transliteration

Posted: 15 Apr 2019, 12:15
by rednoah
Ah... Not sure, maybe early versions of {localize} just returned a raw object (hence the .name) rather than a bindings accessor object (which makes it easier, because you can just use the bindings you already know: n, t, etc.).

This should work with newer versions:

Code: Select all

{localize[languages[0]].n}