DIN5007-2 Transliterator (aka converting umlauts)

Post by thielj »

The old issue: converting umlauts like ä, ö and ü to ae, oe and ue. This could be handled very elegantly in a script with a custom Transliterator. The following line takes any input, transliterates it to Latin script, then handles umlauts before reducing everything else to ASCII characters.

Code:

// DIN 5007-2 applies to German words only. Don't use this for foreign words, e.g. Motörhead should become Motorhead, not Motoerhead.
{n.transliterate("Any-Latin; DIN5007_2; Latin-ASCII")}

To register the custom Transliterator in Java, something like the following needs to run only once:

Code:

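// Needs com.ibm.icu.text.Transliterator (ICU4J); 'rules' is a String holding the rule set shown further down.
Transliterator.registerInstance(
	Transliterator.createFromRules("DIN5007_2", rules, Transliterator.FORWARD));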

The rules are as follows, with an additional lookahead check so that a capital umlaut followed by a lowercase letter becomes title case, e.g. Äffin becomes Aeffin rather than AEffin:

Code:

$beforeLower = [[:Mn:][:Me:]]* [:Lowercase:] ;

ä → ae;
ö → oe;
ü → ue;
ß → ss;

Ä } $beforeLower → Ae;
Ö } $beforeLower → Oe;
Ü } $beforeLower → Ue;

Ä → AE;
Ö → OE;
Ü → UE;
ẞ → SS;
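
For anyone who wants to see what these rules do without pulling in ICU, here is a rough sketch in plain Java. It is not the ICU implementation: the class and method names (Din5007Sketch, umlautsToDigraphs, toAscii) are made up for illustration, the input is NFC-normalized first as an assumption so that decomposed umlauts are caught too, and the Normalizer-based stripping at the end is only a crude stand-in for Latin-ASCII (there is no Any-Latin step at all).

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of ICU or FileBot.
final class Din5007Sketch {

    // Mirrors $beforeLower: skip combining marks (Mn, Me), then require a lowercase letter.
    private static boolean beforeLower(String s, int from) {
        int i = from;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK && type != Character.ENCLOSING_MARK) {
                return Character.isLowerCase(cp);
            }
            i += Character.charCount(cp);
        }
        return false; // end of input: nothing lowercase follows
    }

    // Applies the umlaut rules only; all other characters pass through unchanged.
    static String umlautsToDigraphs(String input) {
        // Assumption: compose first so that "A" + U+0308 is seen as the single character U+00C4.
        String s = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean title = beforeLower(s, i + 1);
            switch (c) {
                case '\u00e4': out.append("ae"); break;                 // a-umlaut
                case '\u00f6': out.append("oe"); break;                 // o-umlaut
                case '\u00fc': out.append("ue"); break;                 // u-umlaut
                case '\u00df': out.append("ss"); break;                 // sharp s
                case '\u00c4': out.append(title ? "Ae" : "AE"); break;  // A-umlaut
                case '\u00d6': out.append(title ? "Oe" : "OE"); break;  // O-umlaut
                case '\u00dc': out.append(title ? "Ue" : "UE"); break;  // U-umlaut
                case '\u1e9e': out.append("SS"); break;                 // capital sharp s
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Umlaut rules first, then drop any remaining diacritics (rough stand-in for Latin-ASCII).
    static String toAscii(String input) {
        String s = umlautsToDigraphs(input);
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(umlautsToDigraphs("\u00c4ffin"));   // Aeffin
        System.out.println(umlautsToDigraphs("\u00dcBER"));    // UEBER
        System.out.println(toAscii("Stra\u00dfe"));            // Strasse
    }
}
```

The order matters in the same way as in the transliterator chain above: the umlauts have to be replaced before the generic diacritic stripping runs, otherwise ö would simply collapse to o.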

Re: DIN5007-2 Transliterator (aka converting umlauts)

Post by thielj »

I've just noticed that the latest ICU (60) includes a "de-ascii" transliteration with similar rules. Give me a few days and I'll look into it...

Re: DIN5007-2 Transliterator (aka converting umlauts)

Post by rednoah »

Keep me posted. I'll make sure to include the latest ICU dependencies next time I update FileBot.

Re: DIN5007-2 Transliterator (aka converting umlauts)

Post by rednoah »

FileBot r7589 updates the built-in String.ascii() method to use the following transliterator chain, with DE-ASCII applied only to the German umlauts and ß via the leading filter:

Code:

Any-Latin;[äöüÄÖÜß]DE-ASCII;Latin-ASCII

For example:

Code:

"Motörhead".ascii() // Motoerhead