
DIN5007-2 Transliterator (aka converting umlauts)

Posted: 08 Nov 2017, 16:00
by thielj
The old issue: converting umlauts like ä, ö and ü to ae, oe and ue. This can be handled very elegantly in a script with a custom Transliterator. The following line takes any input, transliterates it to Latin script, then handles the umlauts before reducing everything else to ASCII characters.

Code:

// DIN5007 applies to German words only. Don't use this for foreign words, e.g. Motörhead should become Motorhead, not Motoerhead
{n.transliterate("Any-Latin; DIN5007_2; Latin-ASCII")}
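The order of the stages matters here, and a small plain-Java sketch may make that easier to see. It does not use ICU at all: `stripMarks` is only a rough stand-in for the Latin-ASCII stage, `expandUmlauts` is a lowercase-only stand-in for the umlaut stage, and both names (like the class name) are made up for this illustration.

```java
import java.text.Normalizer;

class StageOrderSketch {
    // Rough stand-in for the Latin-ASCII stage: decompose, then drop combining marks.
    public static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{Mn}+", "");
    }

    // Rough stand-in for the umlaut stage (lowercase letters only, for brevity).
    public static String expandUmlauts(String s) {
        return s.replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00df", "ss");  // sharp s
    }

    public static void main(String[] args) {
        String word = "K\u00e4se";
        // Umlauts first, then the ASCII reduction: the expansion survives.
        System.out.println(stripMarks(expandUmlauts(word))); // Kaese
        // ASCII reduction first: the umlaut is already gone, nothing left to expand.
        System.out.println(expandUmlauts(stripMarks(word))); // Kase
    }
}
```

So the umlaut stage has to sit before Latin-ASCII in the chain, otherwise it never sees an umlaut.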
To register the Transliterator in Java, something like the following has to run only once:

Code:

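// Transliterator is com.ibm.icu.text.Transliterator (ICU4J); rules is a String holding the rule set
Transliterator.registerInstance(
	Transliterator.createFromRules("DIN5007_2", rules, Transliterator.FORWARD));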
The rules are as follows. The context rules make sure that an uppercase umlaut followed by a lowercase letter comes out in title case, e.g. Äffin becomes Aeffin rather than AEffin:

Code:

$beforeLower = [[:Mn:][:Me:]]* [:Lowercase:] ;

ä → ae;
ö → oe;
ü → ue;
ß → ss;

Ä } $beforeLower → Ae;
Ö } $beforeLower → Oe;
Ü } $beforeLower → Ue;

Ä → AE;
Ö → OE;
Ü → UE;
ẞ → SS;
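For anyone who wants to see what these rules do without pulling in ICU, here is a minimal plain-Java sketch of the same logic. It is an emulation, not the ICU rule engine: the class and method names are made up for illustration, it only handles precomposed umlauts, and `Character.isLowerCase` is used as an approximation of `[:Lowercase:]`.

```java
class Din5007Sketch {
    // Emulates the rules above: lowercase umlauts expand directly, uppercase
    // umlauts become title case when followed by a lowercase letter.
    public static String transliterate(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u00e4': out.append("ae"); break; // a-umlaut
                case '\u00f6': out.append("oe"); break; // o-umlaut
                case '\u00fc': out.append("ue"); break; // u-umlaut
                case '\u00df': out.append("ss"); break; // sharp s
                case '\u00c4': out.append(beforeLower(input, i + 1) ? "Ae" : "AE"); break;
                case '\u00d6': out.append(beforeLower(input, i + 1) ? "Oe" : "OE"); break;
                case '\u00dc': out.append(beforeLower(input, i + 1) ? "Ue" : "UE"); break;
                case '\u1e9e': out.append("SS"); break; // capital sharp s
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Mirrors $beforeLower: skip any combining marks (Mn, Me),
    // then require a lowercase character.
    public static boolean beforeLower(String s, int from) {
        int i = from;
        while (i < s.length()) {
            int type = Character.getType(s.charAt(i));
            if (type != Character.NON_SPACING_MARK && type != Character.ENCLOSING_MARK) {
                break;
            }
            i++;
        }
        return i < s.length() && Character.isLowerCase(s.charAt(i));
    }

    public static void main(String[] args) {
        System.out.println(transliterate("\u00c4ffin"));  // Aeffin
        System.out.println(transliterate("\u00c4RGER"));  // AERGER
        System.out.println(transliterate("Stra\u00dfe")); // Strasse
    }
}
```

The lookahead is checked against the original input, which is how the context after `}` behaves in the rules as well: it is matched but not consumed.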

Re: DIN5007-2 Transliterator (aka converting umlauts)

Posted: 09 Nov 2017, 12:56
by thielj
I've just noticed that the latest ICU (60) includes a "de-ascii" transliteration with similar rules. Give me a few days and I'll look into it...

Re: DIN5007-2 Transliterator (aka converting umlauts)

Posted: 09 Nov 2017, 17:50
by rednoah
Keep me posted. I'll make sure to include the latest ICU dependencies next time I update FileBot.

Re: DIN5007-2 Transliterator (aka converting umlauts)

Posted: 18 May 2020, 18:39
by rednoah
FileBot r7589 updates the built-in String.ascii() method with the DE-ASCII transliterator:

Code:

Any-Latin;[äöüÄÖÜß]DE-ASCII;Latin-ASCII
e.g.

Code:

"Motörhead".ascii() // Motoerhead