Issue Handling Filename with Unicode Combining Diacritical Marks

samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

I have the below filename:

Code: Select all

G:\auto-media\La meglio gioventù 2003 1080p BluRay DD5.1 x264-EA\La meglio gioventù 2003 1080p BluRay DD5.1 x264 P1-EA.mkv
When this filename is passed to Filebot, this error is thrown:

Code: Select all

File does not exist: G:\auto-media\La meglio gioventu` 2003 1080p BluRay DD5.1 x264-EA\La meglio gioventu` 2003 1080p BluRay DD5.1 x264 P1-EA.mkv
^ You can see that the grave accent has been shifted.

(I have confirmed that all "chcp" unicode settings are correctly configured within Powershell, and I've tried "en_US.UTF-8" for "LANG" and "LC_ALL").

Note that the input filename contains the "symbol" ù <- that "symbol" is different than the character of ù

The first "symbol" is actually two characters:

1) https://www.codetable.net/decimal/117
and
2) https://www.codetable.net/decimal/768

^ a.k.a. https://en.wikipedia.org/wiki/Combining ... ical_Marks

Whereas the second character is the single character:

https://www.codetable.net/decimal/249
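The difference between the two forms can be checked directly. Below is a minimal Python sketch using only the standard `unicodedata` module; the variable names are illustrative and not taken from the thread.

```python
import unicodedata

decomposed = "u\u0300"   # 'u' (decimal 117) followed by the combining grave accent (decimal 768)
composed = "\u00f9"      # the single precomposed character (decimal 249)

# The two strings render alike but are different code point sequences.
print([ord(c) for c in decomposed])  # [117, 768]
print([ord(c) for c in composed])    # [249]
print(decomposed == composed)        # False

# Normalization converts one form into the other.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

The first form is the NFD (decomposed) spelling and the second is the NFC (composed) spelling of the same letter.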

Can anyone advise how Filebot can be configured to handle the former "symbol?"

Thank you
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

:?: Are you using CMD or PowerShell?


:?: What happens if you select the file via a glob pattern?

Code: Select all

filebot -script fn:sysenv G:\auto-media\*giovent*
:idea: Please read the FAQ and How to Request Help.
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

rednoah wrote: 30 Sep 2022, 02:50 :?: Are you using CMD or PowerShell?


:?: What happens if you select the file via a glob pattern?

Code: Select all

filebot -script fn:sysenv G:\auto-media\*giovent*
Powershell is being used.

Below is the output from sysenv script. If I manually rename the file to use the single unicode character instead of the "dual diacritical symbol," then it also works.

Code: Select all

# Local Time #
Thu Sep 29 20:10:13 PDT 2022

# Process Tree #
C:\Windows\explorer.exe
?? C:\Windows\System32\WindowsPowerShell\v1.0\powershell_ise.exe
   ?? C:\Program Files\FileBot\filebot.exe
      ?? C:\Program Files\FileBot\jre\bin\java.exe

# Environment Variables #
=::: ::\
ALLUSERSPROFILE: C:\ProgramData
APPDATA: C:\Users\sam\AppData\Roaming
COMPUTERNAME: BEELINK
ComSpec: C:\WINDOWS\system32\cmd.exe
CommonProgramFiles: C:\Program Files\Common Files
CommonProgramFiles(x86): C:\Program Files (x86)\Common Files
CommonProgramW6432: C:\Program Files\Common Files
DriverData: C:\Windows\System32\Drivers\DriverData
HOMEDRIVE: C:
HOMEPATH: \Users\sam
LANG: en_US.UTF-8
LC_ALL: en_US.UTF-8
LOCALAPPDATA: C:\Users\sam\AppData\Local
LOGONSERVER: \\BEELINK
NUMBER_OF_PROCESSORS: 8
OS: Windows_NT
OneDrive: C:\Users\sam\OneDrive
OneDriveConsumer: C:\Users\sam\OneDrive
PATHEXT: .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.CPL
PROCESSOR_ARCHITECTURE: AMD64
PROCESSOR_IDENTIFIER: AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
PROCESSOR_LEVEL: 23
PROCESSOR_REVISION: 1801
PSModulePath: C:\Users\sam\Documents\WindowsPowerShell\Modules;C:\Program Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules
PUBLIC: C:\Users\Public
Path: C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\FileBot\;C:\Users\sam\AppData\Local\Microsoft\WindowsApps;C:\Users\sam\AppData\Local\Programs\Microsoft VS Code\bin;C:\Program Files\FileBot\jre\bin
ProgramData: C:\ProgramData
ProgramFiles: C:\Program Files
ProgramFiles(x86): C:\Program Files (x86)
ProgramW6432: C:\Program Files
SystemDrive: C:
SystemRoot: C:\WINDOWS
TEMP: C:\Users\sam\AppData\Local\Temp
TMP: C:\Users\sam\AppData\Local\Temp
USERDOMAIN: beelink
USERDOMAIN_ROAMINGPROFILE: beelink
USERNAME: sam
USERPROFILE: C:\Users\sam
windir: C:\WINDOWS

# Java System Properties #
application.deployment: msi
application.dir: C:\Users\sam\AppData\Roaming\FileBot
file.encoding: Cp1252
file.separator: \
grape.root: C:\Users\sam\AppData\Roaming\FileBot\grape
groovy.antlr4: false
http.agent: FileBot/4.9.6
java.class.path: C:\Program Files\FileBot\jar\filebot.jar
java.class.version: 61.0
java.home: C:\Program Files\FileBot\jre
java.io.tmpdir: C:\Users\sam\AppData\Roaming\FileBot\tmp
java.library.path: C:\Program Files\FileBot\lib
java.net.useSystemProxies: true
java.runtime.name: OpenJDK Runtime Environment
java.runtime.version: 17.0.2+8
java.specification.name: Java Platform API Specification
java.specification.vendor: Oracle Corporation
java.specification.version: 17
java.vendor: Eclipse Adoptium
java.vendor.url: https://adoptium.net/
java.vendor.url.bug: https://github.com/adoptium/adoptium-support/issues
java.vendor.version: Temurin-17.0.2+8
java.version: 17.0.2
java.version.date: 2022-01-18
java.vm.compressedOopsMode: Zero based
java.vm.info: mixed mode, sharing
java.vm.name: OpenJDK 64-Bit Server VM
java.vm.specification.name: Java Virtual Machine Specification
java.vm.specification.vendor: Oracle Corporation
java.vm.specification.version: 17
java.vm.vendor: Eclipse Adoptium
java.vm.version: 17.0.2+8
jdk.debug: release
jdk.logger.packages: net.filebot.Log
jna.boot.library.path: C:\Program Files\FileBot\lib
jna.library.path: C:\Program Files\FileBot\lib
jna.nosys: true
jna.nounpack: true
line.separator: 

native.encoding: Cp1252
net.filebot.AcoustID.fpcalc: C:\Program Files\FileBot\lib\fpcalc.exe
net.filebot.UserFiles.fileChooser: COM
org.apache.commons.logging.Log: org.apache.commons.logging.impl.NoOpLog
os.arch: amd64
os.name: Windows 11
os.version: 10.0
path.separator: ;
prism.order: sw
sun.arch.data.model: 64
sun.boot.library.path: C:\Program Files\FileBot\jre\bin
sun.cpu.endian: little
sun.cpu.isalist: amd64
sun.io.unicode.encoding: UnicodeLittle
sun.java.command: C:\Program Files\FileBot\jar\filebot.jar -script fn:sysenv G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\*giovent*
sun.java.launcher: SUN_STANDARD
sun.java2d.d3d: false
sun.jnu.encoding: Cp1252
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
sun.net.client.defaultConnectTimeout: 10000
sun.net.client.defaultReadTimeout: 60000
sun.os.patch.level: 
swing.crossplatformlaf: javax.swing.plaf.nimbus.NimbusLookAndFeel
unixfs: false
useCreationDate: false
useExtendedFileAttributes: true
useNativeShell: false
user.country: US
user.dir: C:\WINDOWS\system32
user.home: C:\Users\sam
user.language: en
user.name: sam
user.script: 
user.timezone: America/Los_Angeles
user.variant: 

# Arguments #
args[0] = -script
args[1] = fn:sysenv
args[2] = G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio gioventu? 2003 1080p BluRay DD5.1 x264-EA
args[3] = G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio giovent� 2003 1080p BluRay DD5.1 x264-EA
Done ?(?????)?
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

:?: PowerShell expands G:\auto-media\*giovent* to 2 arguments? The output suggests that you have the same folder twice, perhaps once for each diacritic encoding? That would be most unusual:

Code: Select all

args[2] = G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio gioventu? 2003 1080p BluRay DD5.1 x264-EA
args[3] = G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio giovent� 2003 1080p BluRay DD5.1 x264-EA

:idea: Note that the output gives us ? and � so that's different from ` in the previous test case. This might be a clue.


:?: The PoolPart.* folder suggests that you're using a 3rd party file system. This might be a clue. Have you tried processing files with combining diacritical marks that are placed on your Desktop or normal folder on a normal NTFS file system, e.g. your Desktop folder?


:idea: Note that this is likely a PowerShell issue / DrivePool issue and not a FileBot issue per se. FileBot itself just processes the file paths that are passed along from PowerShell as-is. Notably, NTFS itself stores anything as Normalization Form Canonical Composition (NFC) internally and accepts any file path of unicode equivalence to refer to that file, i.e. both é and e◌́ should work regardless of how you typed the file name in the first place.
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

rednoah wrote: 30 Sep 2022, 03:31 :?: PowerShell expands G:\auto-media\*giovent* to 2 arguments? The output suggests that you have the same folder twice, perhaps once for each diacritic encoding?

Code: Select all

args[2] = G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio gioventu? 2003 1080p BluRay DD5.1 x264-EA
args[3] = G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio giovent� 2003 1080p BluRay DD5.1 x264-EA

:idea: Note that the output gives us ? and � so that's different from ` in the previous test case. This might be a clue.


:?: The PoolPart.* folder suggests that you're using a 3rd party file system. This might be a clue. Have you tried processing files with combining diacritical marks that are placed on your Desktop or normal folder on a normal NTFS file system, e.g. your Desktop folder?
Correct, the two arguments you're seeing are the remnants of what was mentioned about 'If I manually rename the file to use the single unicode character instead of the "dual diacritical symbol," then it also works.'

The clue you've spotted may be partially explained by the fact that the previous ` was copied from the file under --log-file, whereas the ? and � were copied from the powershell ise console. Although I don't know the root cause that would explain this specifically.

G:\ is a normal NTFS hard drive mount. The PoolPart.* directory is just a folder that is organized by Drivepool. I don't think this is affecting anything here because we are accessing the PoolPart.* directory directly, i.e. bypassing the virtual mount at H:\, which is to say that afaik I could completely uninstall Drivepool from my system without having any effect on the path being accessible in our example. This was previously discussed here, a while back.

If you simply create a dummy file with the diacritical file path on your system, does Filebot handle it correctly on your end?
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

rednoah wrote: 30 Sep 2022, 03:31 :idea: Note that this is likely a PowerShell issue / DrivePool issue and not a FileBot issue per se. FileBot itself just processes the file paths that are passed along from PowerShell as-is. Notably, NTFS itself stores anything as Normalization Form Canonical Composition (NFC) internally and accepts any file path of unicode equivalence to refer to that file, i.e. both é and e◌́ should work regardless of how you typed the file name in the first place.
I hear that. My main suspicion is that the Filebot rename works perfectly with the unicode character, but fails to locate the file with the diacritical symbol. The other thing I did to test if Powershell was mishandling the encoding was to make a simple bat script that does "echo %1"

Both the dual-diacritical and the unicode character are output exactly the same when calling "test.bat gioventù" and "test.bat gioventù"
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

2 arguments means that PowerShell (not FileBot) sees two files or folders. If Windows Explorer / DrivePool show you 1 folder but PowerShell somehow sees 2 folders, then that would be most disturbing.


:?: What do you see when you list the parent folder with PowerShell?
https://linuxhint.com/list-files-directory-powershell/


:idea: --log-file will give you the most accurate picture, since FileBot itself will directly write UTF-8 text to the log file. What you see in CMD or PowerShell might have been mangled by the Console Host GUI (e.g. replacing unknown character codes with the � symbol) already.



EDIT:

You can also try passing the argument via an UTF-8 encoded external text file, thus bypassing PowerShell argument handling:
viewtopic.php?t=3244
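A rough sketch of that approach in Python. It assumes that FileBot's `@file` syntax reads one argument per line from a UTF-8 text file, as described in the linked thread; the helper name `write_args_file` and the file name `args.txt` are made up for illustration, and the actual `filebot` call is left as a comment.

```python
import os
import tempfile

def write_args_file(path, args):
    """Write one argument per line, UTF-8 encoded, so that the file path
    never passes through the console or the process command line."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for arg in args:
            f.write(arg + "\n")

nfd_name = "La meglio gioventu\u0300 2003"   # 'u' + combining grave accent
args_file = os.path.join(tempfile.mkdtemp(), "args.txt")
write_args_file(args_file, ["-script", "fn:sysenv", nfd_name])

# Assumption: FileBot is then called with the file instead of the raw arguments, e.g.
#   subprocess.run(["filebot", "@" + args_file])
with open(args_file, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written == ["-script", "fn:sysenv", nfd_name])
```

The point of the design is that the combining character is only ever stored as UTF-8 bytes on disk, so no command-line conversion can touch it.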
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

rednoah wrote: 30 Sep 2022, 04:11 2 arguments means that PowerShell (not FileBot) sees two files or folders. If Windows Explorer / DrivePool show you 1 folder but PowerShell somehow sees 2 folders, then that would be most disturbing.


:?: What do you see when you list the parent folder with PowerShell?
https://linuxhint.com/list-files-directory-powershell/


:idea: --log-file will give you the most accurate picture, since FileBot itself will directly write UTF-8 text to the log file. What you see in CMD or PowerShell might have been mangled by the Console Host GUI (e.g. replacing unknown character codes with the � symbol) already.



EDIT:

You can also try passing the argument via an UTF-8 encoded external text file, thus bypassing PowerShell argument handling:
viewtopic.php?t=3244
There are 2 directory+file paths that I've made while debugging this: 1 with the dual-diacritical and 1 with the single-unicode. Powershell+Filebot fails with the former but succeeds with the latter:

Code: Select all

PS C:\WINDOWS\system32> Get-ChildItem "G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\" | Select-String "La meglio"
La meglio gioventù 2003 1080p BluRay DD5.1 x264-EA
La meglio gioventù 2003 1080p BluRay DD5.1 x264-EA
Using the external file did work, so you're right that the problem lies within how Powershell and Java/Filebot are handling encoding of stdin/stdout.

One thing that strikes me is that this same Powershell script is able to successfully interact with another external exe, which occurs earlier in my media pipeline, using this problematic, diacritical filename. The filename is correctly piped to the exe and subsequently parsed from the stdout. For reference, this is the dpcmd.exe referenced in our prior thread.

Do you know if Java/Filebot is using the File I/O API or the Console I/O API? I quote this stackoverflow discussion on piping unicode between Powershell and Java:
To read/write Unicode to a console, an application (or its C runtime library) should be smart enough to use not File-I/O API, but Console-I/O API. (For an example, see how Python does it.)
I noticed that the previously run "filebot -script fn:sysenv G:\auto-media\*giovent*" has setting of "file.encoding: Cp1252" -- I have seen some suggestions that setting this to UTF-8 will allow Java to better handle UTF-8 piped in from Powershell.

But after setting $env:JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8" then running the sysenv script simply throws this error:

Code: Select all

PS C:\WINDOWS\system32> filebot -script fn:sysenv
filebot : Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
At line:1 char:1
+ filebot -script fn:sysenv
+ ~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Picked up JAVA_....encoding=UTF-8:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
Can you advise how I can run Filebot with "-Dfile.encoding=UTF-8" ?

Thanks again for sharing your expertise on this matter.
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

:idea: Keep in mind that you have a unicode equivalence problem, and not a character encoding problem. Both PowerShell and Java/FileBot are using unicode. You can confirm this by passing 你好.mp4 or any other unicode file path where NFD / NFC doesn't come into play.
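The distinction between equivalence and encoding can be illustrated outside of FileBot. This is a small Python sketch with illustrative variable names, not code from the thread:

```python
import unicodedata

nfc = "\u00e9"     # é as a single code point
nfd = "e\u0301"    # e followed by the combining acute accent

# Both forms survive a UTF-8 round trip unchanged, so the encoding step is not lossy.
round_trips = [s.encode("utf-8").decode("utf-8") == s for s in (nfc, nfd)]
print(round_trips)                                 # [True, True]

# Yet they remain different strings unless one of them is normalized.
print(nfc == nfd)                                  # False
print(unicodedata.normalize("NFC", nfd) == nfc)    # True

# The characters of 你好 have no decomposition, so NFC and NFD are identical for them.
hello = "\u4f60\u597d"
same = unicodedata.normalize("NFC", hello) == unicodedata.normalize("NFD", hello)
print(same)                                        # True
```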


:idea: dpcmd.exe is a tool made by StableBit DrivePool for StableBit DrivePool. This might be a clue.




:?: Windows Explorer allows you to create the "same" (i.e. unicode equivalent) folder twice? Really? :shock: :shock: :shock: :shock:


:arrow: Please open Groovy Pad and run this code so we can confirm how Windows handles file access for different NFs that are unicode equivalent:

Code: Select all

def nfc = '\u00e9'
def nfd = '\u0065\u0301'

println "$nfc == $nfd [${nfc == nfd}]"

def a = ApplicationFolder.TemporaryFiles.resolve(nfc + '.txt')
def b = ApplicationFolder.TemporaryFiles.resolve(nfd + '.txt')

println "$a == $b [${a == b}]"

println "TIME = $now".saveAs(a)
println a.text
println b.text
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

Strangely, this threw an error at the end.

Code: Select all

� == e? [false]
C:\Users\sam\AppData\Roaming\FileBot\tmp\�.txt == C:\Users\sam\AppData\Roaming\FileBot\tmp\e?.txt [false]
C:\Users\sam\AppData\Roaming\FileBot\tmp\�.txt
TIME = Sat Oct 01 10:50:59 PDT 2022
java.io.FileNotFoundException: C:\Users\sam\AppData\Roaming\FileBot\tmp\é.txt (The system cannot find the file specified)
	at Script1.run(Script1.groovy:13)
	at net.filebot.cli.ScriptShell.evaluate(Unknown Source)
	at net.filebot.cli.GroovyPad$Runner.eval(Unknown Source)
	at net.filebot.cli.GroovyPad$Runner.lambda$new$0(Unknown Source)
confirmation of directories:

Image
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

I am too unfamiliar with Java and the internals of Filebot, but I have run some tests between Powershell ISE and Python.

This hopefully establishes a proof-of-concept that demonstrates the ways in which Windows handles (and fails to handle by default) unicode on stdin and/or stdout.

This is the simple "print_me.py" script used to test stdin/stdout unicode:

Code: Select all

import sys

def main():
  arg = sys.argv[1]
  print(arg)

if __name__ == '__main__':
  main()
Let's use it via "python .\print_me.py hello"

Running the above will simply write hello to stdout

Let's test if we execute this script with our infamous diacritical, (from within Powershell ISE):

Code: Select all

$diacritic_py_out = python .\print_me.py gioventù
You should see the below error thrown by Python:

Code: Select all

python : Traceback (most recent call last):
At line:1 char:21
+ ... ritic_py_out = python C:\Users\sam\Desktop\test\print_me.py gioventù
+                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
  File "C:\Users\sam\Desktop\test\print_me.py", line 8, in <module>
    main()
  File "C:\Users\sam\Desktop\test\print_me.py", line 5, in main
    print(arg)
  File "C:\Users\sam\.pyenv\pyenv-win\versions\3.10.7\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0300' in position 8: character maps to <undefined>
This points at what I believe is the root cause of our problem and also indicates that, on Windows, programs default to using the CP1252 codec.
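The traceback can be reproduced without PowerShell. A minimal sketch (variable names are illustrative) showing that CP1252 can encode the precomposed character but has no mapping for the combining accent:

```python
composed = "giovent\u00f9"      # precomposed ù (U+00F9)
decomposed = "gioventu\u0300"   # u followed by the combining grave accent (U+0300)

# The precomposed form fits into CP1252 as the single byte 0xF9.
print(composed.encode("cp1252"))   # b'giovent\xf9'

# The combining accent has no CP1252 byte, so encoding fails at index 8.
failed_at = None
try:
    decomposed.encode("cp1252")
except UnicodeEncodeError as err:
    failed_at = err.start
    print(err.reason, "at position", failed_at)
```

This only shows why `print` failed when stdout was using CP1252; it does not by itself show what happens to command-line arguments.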

As this stackoverflow answer explains, CP1252 is a single-byte encoding that covers only a small subset of Unicode characters, so it cannot represent every Unicode string:
CP1252 and UTF-8 are the same for all characters < 128. They differ above that. So if you stick to English and stay away from diacritical marks these will be the same.
This can be simply resolved in Python; our previous command just needs to tell Python to use unicode:

Code: Select all

$diacritic_py_out = python -X utf8 .\print_me.py gioventù
We can see that the stdout is equal to the original input:

Code: Select all

$diacritic_py_out -eq "gioventù"
The suspicion this hints at is that Java/Filebot is not using unicode but is also defaulting to CP1252, and the previously mentioned "filebot -script fn:sysenv" setting of "file.encoding: Cp1252" is pretty solid evidence of this imo.

I was able to run Filebot with "-Dfile.encoding=UTF-8" by setting $ErrorActionPreference="continue" to ignore the "Picked up JAVA_TOOL_OPTIONS" write to stderr, but this did not have any effect on the diacritical problem as Filebot still threw the "File does not exist: ... gioventu`" error.

Would love to hear your thoughts on this proof-of-concept and see if there's something that can be toggled in Java/Filebot similar to Python's "-X utf8"

edit: to be more specific with a Filebot example...

The below shows that some part of the encoding is getting lost in translation between Powershell and Java/Filebot:

Code: Select all

$filebot_out = filebot -script fn:sysenv gioventù
$filebot_out -contains "gioventù" 
The last command evaluates to False, and the output instead contains "args[2] = gioventu`"

I feel like there must be a way to resolve this, similarly to Python's "-X utf8" -- but I just haven't been able to derive the relevant Java/Filebot setting.
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

file.encoding is for reading / writing text files (i.e. file content) and not the "file system path encoding", which is always Unicode on Windows (UTF-16) and macOS (UTF-8).

:idea: On Linux, the "file system path encoding" can be set via sun.jnu.encoding (this property has no effect on Windows) because on Linux a file name is just a sequence of bytes and can in theory be anything, though is always UTF-8 in practice because anything else would mindfck.



:idea: If java was limited to ASCII when decoding arguments, it would crash on startup, just like your python test case. You have now confirmed that both java and python -X allow you to pass unicode input arguments.


:arrow: We now want to extend our test case to show us the unicode code points so that we can see if what is passed in is actually what our code is working with. We want to find out if we can pass both NFD unicode sequences and NFC unicode sequence, or if perhaps one is internally normalised to the other at some point.


e.g. Please run this test script so we can see the code points that are passed in, and if those code points can be used to refer to the file:

Code: Select all

filebot -script "C:/test.groovy" G:\auto-media\*giovent*

Code: Select all

args.each{ a ->
	println a
	println a.chars().collect{ c -> String.format('%04x', c) }

	def f = new File(a)
	println f.exists()

	println "---------"
}

Feel free to rewrite the code in python and see if you get different results. That could be a clue.
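A possible Python counterpart of the test script, as a sketch only. The function names `code_points` and `describe` are invented here. Note that the Groovy version lists UTF-16 code units while this lists code points; the two coincide for the characters discussed in this thread.

```python
import os
import sys

def code_points(text):
    """Return each code point of text as a hex string padded to at least 4 digits."""
    return [format(ord(ch), "04x") for ch in text]

def describe(arg):
    """Print the argument, its code points, and whether it names an existing path."""
    print(arg)
    print(code_points(arg))
    print(os.path.exists(arg))
    print("---------")

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        describe(arg)
```

On Windows this would presumably need to be run with `python -X utf8`, as in the earlier test, so that printing the argument does not fail on a CP1252 console.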
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

I can confirm that Windows / NTFS indeed (very unexpected!!!) allows you to have multiple unicode-equivalent (but canonically different) file paths. Even macOS / APFS doesn't allow that.


I have further narrowed down the issue to the Console Host GUI which seems to mangle e` (NFD) at some point, and so I simply couldn't enter the unicode sequence that refers to that file path. It notably does work when PowerShell expands arguments internally for the filebot call.

Image

You can double check my findings like so:
* execute *.py script file that calls filebot (this should work)
* execute *.ps1 script file that calls filebot (this should work)


:?: Are you manually typing filebot commands on the command-line? Does your use case require you to manually type filebot commands on the command-line? Because the issue seems to be specific to the interactive Console Host GUI but shouldn't be an issue if filebot is called by other programs.
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

rednoah wrote: 02 Oct 2022, 03:21 e.g. Please run this test script so we can see the code points that are passed in, and if those code points can be used to refer to the file:

Code: Select all

filebot -script "C:/test.groovy" G:\auto-media\*giovent*

Code: Select all

args.each{ a ->
	println a
	println a.chars().collect{ c -> String.format('%04x', c) }

	def f = new File(a)
	println f.exists()

	println "---------"
}

Feel free to rewrite the code in python and see if you get different results. That could be a clue.
Please note these couple of items for this post:

1) all debug commands below are being run from PowerShell ISE
2) $test_path = "G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio gioventù 2003 1080p BluRay DD5.1 x264-EA\La meglio gioventù 2003 1080p BluRay DD5.1 x264 P1-EA.mkv"

Something unexpected is going on with Groovy/Java/Filebot that is not obvious to me.

It seems like each "a" arg is being cast as a File object.

Running the script as you posted simply errors with "No signature of method:"

Code: Select all

PS C:\WINDOWS\system32> filebot -script $script_path $test_path
G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio gioventu` 2003 1080p BluRay DD5.1 x264-EA\La meglio gioventu` 2003 1080p BluRay DD5.1 x264 P1-EA.mkv
filebot : No signature of method: java.io.File.chars() is applicable for argument types: () values: []
At line:1 char:1
+ filebot -script $script_path $test_path
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (No signature of...: () values: []:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
 
Possible solutions: hash(java.lang.String), check(groovy.lang.Closure), trash(), wait(), mkdirs(), every()
groovy.lang.MissingMethodException: No signature of method: java.io.File.chars() is applicable for argument types: () values: []
Possible solutions: hash(java.lang.String), check(groovy.lang.Closure), trash(), wait(), mkdirs(), every()
	at Script1$_run_closure1.doCall(Script1.groovy:4)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at Script1.run(Script1.groovy:1)
	at net.filebot.cli.ScriptShell.evaluate(Unknown Source)
	at net.filebot.cli.ScriptShell.runScript(Unknown Source)
	at net.filebot.cli.ArgumentProcessor.runScript(Unknown Source)
	at net.filebot.cli.ArgumentProcessor.run(Unknown Source)
	at net.filebot.Main.main(Unknown Source)
Error (o_O)
I modified the script to be:

Code: Select all

args.each{ a ->
	println a
	println a.exists()
	println "---------"
}
... and running "filebot -script $script_path hello $test_path" outputs:

Code: Select all

C:\WINDOWS\system32\hello
false
---------
G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio gioventu` 2003 1080p BluRay DD5.1 x264-EA\La meglio gioventu` 2003 1080p BluRay DD5.1 x264 P1-EA.mkv
false
---------
Done ?(?????)?
^ notice that the input arg of "hello" is cast as a File with path of "C:\WINDOWS\system32\" (this is the pwd)

^ also notice that the diacritical accent has been erroneously shifted as we've previously seen in this thread

If I duplicate this groovy test script with this python script:

Code: Select all

import os
import sys

def main():
  arg = sys.argv[1]
  file_exists = os.path.isfile(arg)
  print(f"file '{arg}' exists == {file_exists}")

if __name__ == '__main__':
  main()
... and running "python -X utf8 C:\Users\sam\Desktop\test\print_me.py $test_path" outputs:

Code: Select all

file 'G:\PoolPart.2a73f37a-d6ff-4f7c-a0de-69a72e1d9a63\torrents\auto-media\La meglio gioventù 2003 1080p BluRay DD5.1 x264-EA\La meglio gioventù 2003 1080p BluRay DD5.1 x264 P1-EA.mkv' exists == True
rednoah wrote: 02 Oct 2022, 05:31 :?: Are you manually typing filebot commands on the command-line? Does your use case require you to manually type filebot commands on the command-line? Because the issue seems to be specific to the interactive Console Host GUI but shouldn't be an issue if filebot is called by other programs.
The entirety of this "media pipeline" occurs within a single Powershell script that is sourced from this.
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

Sorry, my bad. I was testing with sample String values in the IDE:

Code: Select all

def nfc = '\u00e9'
def nfd = '\u0065\u0301'

def args = [nfc, nfd]

args.each{ a ->
	println a
	println a.chars().collect{ c -> String.format('%04x', c) }

	def f = new File(a)
	println f.exists()
	println f.length()

	println "---------"
}

But since args is a List of File objects when executed by filebot, the correct way to mirror and list code points of the file paths should be this:

Code: Select all

args.each{ a ->
	println a
	println a.path.chars().collect{ c -> String.format('%04x', c) }

	println a.exists()
	println a.length()

	println "---------"
}


:?: Based on this test case. Can you confirm that different code points are getting passed depending on how PowerShell is calling filebot? Can you reproduce this behaviour with your python test case?

Image

:!: Note how U+0301 Combining Acute Accent (correct) is somehow at some point translated to U+00B4 Acute Accent, a standalone spacing character (incorrect), and so that's just a different file path entirely.
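For reference, the official Unicode names of the combining accents and their standalone look-alikes can be printed with Python's `unicodedata` module. This is only a lookup sketch; it does not explain where the substitution happens.

```python
import unicodedata

# Combining marks attach to the preceding letter; the spacing accents are separate characters.
rows = []
for ch in ("\u0300", "\u0060", "\u0301", "\u00b4"):
    rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch)))

for code, name, combining_class in rows:
    print(code, name, "combining class", combining_class)
```

A combining class of 0 marks an ordinary spacing character, while the two combining accents report 230. A path in which the combining mark has been swapped for its spacing look-alike is therefore a different string that no normalization form maps back.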
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

After a good day of lots of random trial and error, and lots of random Googling, I'm now ready to give up. I'm pretty sure that there's nothing (i.e. changes in either your PowerShell code or my Java code) we can do to make it work.


:!: :!: :!: I did however learn that cmd / PowerShell really do not expand arguments. :shock: :shock: :shock: :shock: :shock: python *.txt literally just passes *.txt since python.exe does not expand arguments. :shock: :shock: :shock: :shock: :shock: However, this "feature" can be added at compile time by linking wsetargv.obj into the executable, according to Expanding wildcard arguments. Some executables do that. Some executables do not do that. And thus behave differently. :shock: :shock: :shock: :shock: :shock:
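A program that receives `*.txt` literally can expand the pattern itself. Here is a small Python sketch of that idea; `expand_args` is a hypothetical helper, and the throwaway directory is only there to make the example self-contained.

```python
import glob
import os
import tempfile

def expand_args(args):
    """Expand wildcard arguments ourselves, since the shell may pass them through literally.
    Arguments without wildcards, or without any match, are kept as-is."""
    expanded = []
    for arg in args:
        has_wildcard = any(ch in arg for ch in "*?[")
        matches = sorted(glob.glob(arg)) if has_wildcard else []
        expanded.extend(matches or [arg])
    return expanded

# Small demonstration in a throwaway directory.
demo = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.log"):
    open(os.path.join(demo, name), "w").close()

result = expand_args([os.path.join(demo, "*.txt"), "hello"])
print(result)
```

Expanding inside the program means the matched file names come straight from the file system API and are never retyped or re-encoded on the command line.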


So it does seem like java.exe is indeed mangling the argument value before any FileBot code is even executed. Not sure if that is specific to java.exe or generic to all executables that link against wsetargv.obj. Either way, there's nothing we can do about it...



EDIT:

As a workaround, you could use Plain File Mode to normalize all file paths to NFC unicode sequences first, i.e. rename / move files that use NFD unicode character sequences.

e.g. rewrite file paths with NFC and delete left-behind empty folders:

Code: Select all

filebot -rename -r /input --db file --format "{ java.text.Normalizer.normalize(f.path, java.text.Normalizer.Form.NFC) }" --apply prune


EDIT 2:

ICU script transliteration can also get the job done and looks a bit more pretty since it's a FileBot built-in function:

Code: Select all

filebot -rename -r /input --db file --format "{ f.path.transliterate('NFC') }" --apply prune
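The same normalization idea can be sketched outside FileBot. The Python below is an illustration under stated assumptions, not what FileBot does internally: `plan_nfc_renames` and `rename_to_nfc` are hypothetical helpers, and a name is deliberately skipped when its NFC twin already exists in the same folder (the situation with the two look-alike folders earlier in this thread).

```python
import os
import unicodedata

def plan_nfc_renames(names):
    """Return (old, new) pairs for names that are not already in NFC.
    A name is skipped when its NFC form is already taken in the same listing,
    because renaming would then collide with the existing entry."""
    taken = set(names)
    plan = []
    for name in names:
        target = unicodedata.normalize("NFC", name)
        if target != name and target not in taken:
            plan.append((name, target))
            taken.add(target)
    return plan

def rename_to_nfc(root):
    """Walk root bottom-up and apply the plan to the entries of each folder."""
    for folder, dirs, files in os.walk(root, topdown=False):
        for old, new in plan_nfc_renames(dirs + files):
            os.rename(os.path.join(folder, old), os.path.join(folder, new))

plan = plan_nfc_renames(["gioventu\u0300.mkv", "plain.mkv"])
print(plan == [("gioventu\u0300.mkv", "giovent\u00f9.mkv")])   # True
```

Walking bottom-up means the contents of a folder are handled before the folder itself is renamed, so paths built during the walk stay valid.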
samwell
Posts: 15
Joined: 20 Jun 2022, 06:37

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by samwell »

Thanks again for all the effort you've devoted to helping debug this.

I've settled on your suggestion to pass all arguments to Filebot via the external file, and this is working in my pipeline now =)
rednoah
The Source
Posts: 21350
Joined: 16 Nov 2011, 08:59

Re: Issue Handling Filename with Unicode Combining Diacritical Marks

Post by rednoah »

Bug ID: JDK-8294884 has been submitted and accepted. Let's see if the JDK developers figure out this mystery.