Japanese Transliteration
Posted: 14 Apr 2019, 20:04
I've been trying for a while to do transliteration properly, especially for Japanese movies.
A note on the language check used in the second snippet below: relying on the languages binding is suboptimal. TheMovieDB offers original_language as part of its API, but unfortunately TVDB does not, so languages is the easiest common ground and is usually correct.
The fundamental issue stems from ICU. Because of Han unification (https://en.wikipedia.org/wiki/Han_unification), Han characters share the same code points across Chinese, Japanese, and Korean. The library defaults to Chinese Pinyin readings, which produces awful results for Japanese text.
Luckily, there is kuroshiro (https://github.com/hexenq/kuroshiro), whose author kindly provides an API that does romanization fairly well.
This is a format snippet for Japanese romanization. It is fairly simple yet effective.
Code:
{
    // Without these imports the format throws an error (in the GUI at least)
    import groovy.json.JsonSlurper
    import groovy.json.JsonOutput

    // Romanize Japanese text through the kuroshiro web API
    def translJap = {
        // Rate limited to 100 requests per day, I believe; please be careful
        def url = new URL('https://api.kuroshiro.org/convert')
        def requestHeaders = [:]
        // JSON request body: the text to convert plus the romanization options
        def postBody = [:]
        postBody.str = it
        postBody.to = "romaji"
        postBody.mode = "spaced"
        postBody.romajiSystem = "hepburn"
        def postResponse = url.post(JsonOutput.toJson(postBody).getBytes('UTF-8'), 'application/json', requestHeaders)
        // The romanized text comes back in the "result" field
        def json = new JsonSlurper().parseText(postResponse.text)
        return json.result
    }
}
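To make the request and response shapes concrete, here is a minimal standalone Groovy sketch of the JSON body that the closure above builds and of how the result field is read back. It does not call the service. The sample input, the sample response text, and the helper names buildBody and readResult are made up for illustration; only the field names come from the snippet above. Falling back to the original text when result is missing is my own assumption about sensible error handling, not documented API behaviour.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Hypothetical helper: serialize the same request body that translJap sends.
def buildBody = { String text ->
    JsonOutput.toJson([str: text, to: 'romaji', mode: 'spaced', romajiSystem: 'hepburn'])
}

// Hypothetical helper: pull the romanized text out of a response body.
// Falls back to the given text if 'result' is absent or empty (an assumption).
def readResult = { String responseText, String fallback ->
    def json = new JsonSlurper().parseText(responseText)
    json.result ?: fallback
}

println buildBody('abc')
// prints {"str":"abc","to":"romaji","mode":"spaced","romajiSystem":"hepburn"}

// Made-up response, shaped after the json.result access in the snippet above.
println readResult('{"result": "some romaji"}', 'abc')   // prints some romaji
println readResult('{}', 'abc')                          // prints abc
```

With a fallback like this, a failed or rate-limited lookup would leave the title untransliterated instead of producing an empty name.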
My advice is a simple if/else that picks the transliterator based on the primary language:
Code:
def transl = {
    // Use kuroshiro for Japanese titles and ICU transliteration for everything else
    languages.first().iso_639_2B == 'jpn' ? translJap(it) : it.transliterate("Any-Latin; NFD; NFC; Title")
}
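The selection logic can be tried outside FileBot with stand-ins. In this sketch, romanizeJapanese and genericTransliterate are hypothetical stubs for translJap and the ICU transliterate call, and languageCodes is a plain list of ISO 639-2/B codes standing in for the languages binding. It also guards against an empty language list, where first() would throw; whether that case actually occurs in FileBot is an assumption on my part.

```groovy
// Hypothetical stubs: they only tag the input so the chosen branch is visible.
def romanizeJapanese = { String s -> "kuroshiro(${s})".toString() }
def genericTransliterate = { String s -> "icu(${s})".toString() }

// Pick the romanizer from the primary (first) language code.
def pick = { List<String> languageCodes, String title ->
    // An empty list is falsy in Groovy, so primary becomes null instead of throwing.
    def primary = languageCodes ? languageCodes[0] : null
    primary == 'jpn' ? romanizeJapanese(title) : genericTransliterate(title)
}

println pick(['jpn', 'eng'], 'title')   // prints kuroshiro(title)
println pick(['eng', 'jpn'], 'title')   // prints icu(title)
println pick([], 'title')               // prints icu(title)
```

Only the first language decides the branch, so a title whose primary language is listed as something other than Japanese still goes through the generic path.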
Exceptions exist: The Passion of the Christ, for example, in which case the binding would return Hebrew.
However, primaryTitle should return the original title from TheMovieDB, which corrects the situation, and the English title for TVDB (forced setting).
A slightly better approach might be to localize the title first and then transliterate it, but I'm not sure whether this still works (see viewtopic.php?t=3736#p20820).
P.S. seems URL bbcode is OFF.