I am not familiar enough with Java or the internals of Filebot to dig into them directly, but I have run some tests between PowerShell ISE and Python.
This hopefully establishes a proof-of-concept that demonstrates the ways in which Windows handles (and fails to handle by default) unicode on stdin and/or stdout.
This is the simple "print_me.py" script used to test stdin/stdout unicode:
import sys

def main():
    arg = sys.argv[1]
    print(arg)

if __name__ == '__main__':
    main()
Let's use it via "python .\print_me.py hello"
Running the above will simply write hello to stdout.
Now let's see what happens when we run the script with our infamous diacritic (from within PowerShell ISE):
$diacritic_py_out = python .\print_me.py gioventù
You should see the below error thrown by Python:
python : Traceback (most recent call last):
At line:1 char:21
+ ... ritic_py_out = python C:\Users\sam\Desktop\test\print_me.py gioventù
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
File "C:\Users\sam\Desktop\test\print_me.py", line 8, in <module>
main()
File "C:\Users\sam\Desktop\test\print_me.py", line 5, in main
print(arg)
File "C:\Users\sam\.pyenv\pyenv-win\versions\3.10.7\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0300' in position 8: character maps to <undefined>
This points at what I believe is the root cause of our problem. It also indicates that, on Windows, programs like Python default to the legacy CP1252 code page for stdout, at least when the output is captured as it is here.
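The failure can be reproduced without PowerShell at all. Below is a minimal sketch, assuming the argument arrived in decomposed form (a plain "u" followed by U+0300), which is what the '\u0300' at position 8 in the traceback suggests. The helper name first_unencodable is made up for this illustration.

```python
import unicodedata

# "gioventù" in decomposed (NFD) form: a plain "u" followed by U+0300.
# Assumption: this is the form PowerShell handed to Python, inferred from
# the '\u0300' at position 8 in the traceback.
decomposed = "gioventu\u0300"

def first_unencodable(text, encoding="cp1252"):
    """Return (index, character name) of the first character that
    `encoding` cannot represent, or None if the text encodes cleanly."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return err.start, unicodedata.name(text[err.start])
    return None

print(first_unencodable(decomposed))  # (8, 'COMBINING GRAVE ACCENT')
print(first_unencodable("hello"))     # None
```

So the character CP1252 chokes on is the combining accent itself, not the letter "u".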
As this Stack Overflow answer explains, CP1252 is a single-byte encoding that covers only a small subset of the Unicode character set:
CP1252 and UTF-8 are the same for all characters < 128. They differ above that. So if you stick to English and stay away from diacritical marks these will be the same.
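The quoted claim is easy to check from a Python prompt. A quick sketch (the compare helper is just for illustration) encodes the same text with both codecs and looks at the resulting bytes:

```python
def compare(text):
    """Encode `text` with both codecs and return the two byte strings."""
    return text.encode("cp1252"), text.encode("utf-8")

# Pure ASCII: both codecs produce identical bytes.
ascii_cp1252, ascii_utf8 = compare("hello")
print(ascii_cp1252 == ascii_utf8)   # True

# Precomposed ù (U+00F9): one byte in CP1252, two bytes in UTF-8.
accent_cp1252, accent_utf8 = compare("\u00f9")
print(accent_cp1252, accent_utf8)   # b'\xf9' b'\xc3\xb9'
```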
This can be resolved simply in Python; our previous command just needs to tell Python to use UTF-8:
$diacritic_py_out = python -X utf8 .\print_me.py gioventù
We can see that the stdout is now equal to the original input.
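To illustrate why switching the codec is enough, here is a rough sketch that emulates a captured stdout with an in-memory stream. The function emulate_print is hypothetical, and the argument is again assumed to be in decomposed form.

```python
import io

def emulate_print(text, encoding):
    """Print `text` to an in-memory stand-in for a redirected stdout that
    uses `encoding`, and return the bytes that were written."""
    raw = io.BytesIO()
    # newline="\n" keeps the output identical across platforms
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()

arg = "gioventu\u0300"  # assumed decomposed form of the test argument

# With UTF-8 the combining accent is written as the two bytes CC 80.
print(emulate_print(arg, "utf-8"))   # b'gioventu\xcc\x80\n'

# With CP1252 the same print() call fails, as in the traceback.
try:
    emulate_print(arg, "cp1252")
except UnicodeEncodeError as err:
    print("cp1252 failed:", err.reason)
```

In a real script, setting the environment variable PYTHONUTF8=1 or calling sys.stdout.reconfigure(encoding="utf-8") (Python 3.7+) should have a comparable effect on stdout to passing "-X utf8".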
This suggests that Java/Filebot is likewise not using Unicode but defaulting to CP1252, and the "file.encoding: Cp1252" entry reported by the previously mentioned "filebot -script fn:sysenv" is pretty solid evidence of this imo.
I was able to run Filebot with "-Dfile.encoding=UTF-8" by setting $ErrorActionPreference="continue" to ignore the "Picked up JAVA_TOOL_OPTIONS" message written to stderr, but this did not have any effect on the diacritic problem: Filebot still threw the "File does not exist: ... gioventu`" error.
Would love to hear your thoughts on this proof-of-concept and see if there's something that can be toggled in Java/Filebot similar to Python's "-X utf8".
edit: to be more specific with a Filebot example...
The below shows that some part of the encoding is getting lost in translation between Powershell and Java/Filebot:
$filebot_out = filebot -script fn:sysenv gioventù
$filebot_out -contains "gioventù"
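The last command evaluates to False. Note that -contains only checks whether some output line is exactly equal to the string, not whether it appears as a substring, so the more telling evidence is that the output instead contains "args[2] = gioventu`"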
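One untested hypothesis: the backtick in "gioventu`" would be consistent with the combining accent (U+0300) being swapped for a look-alike somewhere along a CP1252 conversion. If so, the precomposed form of the name might fare better, since CP1252 does have a slot for ù. The sketch below only checks that idea on the Python side; the helper survives is made up, and nothing here says anything about how Filebot itself behaves.

```python
import unicodedata

def survives(text, encoding="cp1252"):
    """True if `text` comes back unchanged after an encode/decode round trip."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False

decomposed = "gioventu\u0300"                        # "u" + combining accent
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00F9

print(survives(decomposed), survives(composed))  # False True
print(len(decomposed), len(composed))            # 9 8
```

Even if this holds, passing a normalized argument would only help when the file name on disk uses the same precomposed form, so the file may need to be renamed first.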
I feel like there must be a way to resolve this, similarly to Python's "-X utf8" -- but I just haven't been able to derive the relevant Java/Filebot setting.