Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European languages)

Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European languages)

Post by Stule »

As stated above, I think there is a HUGE bug in the new version. I have had the TRANSCODE SUBTITLES option selected for a long time and it worked flawlessly. But since the new version it scrambles the subtitles and somehow deletes the UTF-8 BOM; don't ask me how. The format is OK (SubRip), but the letters ć, č, ž get messed up. The bigger problem is that in the last few days (after the update to the new version) I sorted 100-150 movies and 30-50 shows with hundreds of episodes, and did not realise there was a problem because I had not watched anything. Now I tried one and saw this problem. Please help, is there a way to do something about this?
rednoah
The Source
Posts: 22987
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: Huge problem i think with new version only

Post by rednoah »

:?: Are you using the GUI or the CLI? Which OS? What software are you using to play videos / subtitles?

:?: Can you provide example files that were not transcoded correctly? Both original file and transcoded file would be ideal.



:idea: The transcode subtitle files feature will transcode any subtitle format with any charset encoding (with or without BOM) to SubRip / UTF-8 without BOM. This has always been the case and has never changed. You will want to configure all your software to use UTF-8 by default, if that is not the default already.
:idea: Please read the FAQ and How to Request Help.
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

All options are the same as before, but now the output is messed up and I don't know how. Where can I send you the user data? Will that help you?
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

I don't understand: GUI or CLI? How do I know?
Here is the link with the original subtitle and the one transcoded with FileBot: https://file.io/q1ZkFhf6WMAb
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

I'm using the same software to play files as before, that is, Media Player Classic and the Jellyfin internal player. I did not have any problems with FileBot in the past year; everything was transcoded OK. This is the only new problem as far as I can see for now.
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

If by GUI or CLI you mean desktop or command line, I use the desktop. I have checked now: since the update to the new version, all subtitles are messed up. I looked into a few folders processed with the old version on 23.02.2024 and everything is fine, but everything from 29.02.2024 onwards is messed up. I did not change any options in between.
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

Code: Select all

FileBot 5.1.3 (r10185)
JNA Native: 6.1.6
MediaInfo: 23.10
7-Zip-JBinding: 16.02
Tools: fpcalc/1.5.0
Extended Attributes: OK
Unicode Filesystem: OK
Script Bundle: 2024-03-04 (r954)
Groovy: 4.0.15
JRE: OpenJDK Runtime Environment 17.0.8
JVM: OpenJDK 64-Bit Server VM
CPU/MEM: 8 Core / 4.3 GB Max Memory / 240 MB Used Memory
OS: Windows 11 (amd64)
STORAGE: NTFS [Pictures,Albums,Videos Private] @ 48 GB | NTFS [Games and Stuff SSD] @ 365 GB | NTFS [Mimy YouTube] @ 17 GB | NTFS [Backup] @ 30 GB | NTFS [Clean] @ 889 GB | NTFS [Downloads, Sorting] @ 266 GB | NTFS [Storage Space] @ 2.8 TB
DATA: C:\Users\Stule PC\AppData\Roaming\FileBot
Package: MSI
License: FileBot License P4964++++ (Valid-Until: 2024-05-14)

I have removed only the last 4 digits of the FileBot license.
rednoah
The Source
Posts: 22987
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by rednoah »

I have indeed isolated a change in FileBot 5.1.3 where we now assume UTF-8 for files (incorrectly in this case) where the encoding cannot be detected reliably.


ICU4J seems to not have enough heuristics to reliably detect the text encoding for the file at hand:

Code: Select all

windows-1250 | cs | 35
windows-1252 | it | 34
windows-1254 | tr | 27
:idea: Note that the original file (Serbian? also incorrectly detected as Czech) is likely windows-1250 encoded, and has no BOM because BOM is a UTF thing.


:idea: Previous versions would have just used windows-1250 (the best guess; even though the level of confidence is only 35) while newer versions require a confidence level of >=50 and assume UTF-8 otherwise.


:!: The problem is that the 5.1.3 behaviour works better if the file is UTF-8 encoded but gets incorrectly detected-with-low-confidence as Windows-1252, hence the change. Evidently this now poses a problem the other way around, where Windows-1250 is correctly detected-with-low-confidence. Since we have now confirmed issues with both approaches, I'm also not sure what the preferable "better than the alternative" behaviour should be.
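
The two behaviours described above can be sketched roughly as follows. This is only an illustration in Python, not FileBot's actual code: the function names, the tuple layout and the `fallback` parameter are hypothetical, and only the three sample matches and the threshold of 50 come from this thread.

```python
from typing import List, Tuple

# (charset, language, confidence) rows, copied from the detector output quoted above.
MATCHES: List[Tuple[str, str, int]] = [
    ("windows-1250", "cs", 35),
    ("windows-1252", "it", 34),
    ("windows-1254", "tr", 27),
]


def pick_best_guess(matches: List[Tuple[str, str, int]], fallback: str = "UTF-8") -> str:
    """Older behaviour as described above: always trust the top-ranked match."""
    if not matches:
        return fallback
    return max(matches, key=lambda match: match[2])[0]


def pick_with_threshold(
    matches: List[Tuple[str, str, int]],
    min_confidence: int = 50,
    fallback: str = "UTF-8",
) -> str:
    """5.1.3 behaviour as described above: require a minimum confidence, else assume UTF-8."""
    if not matches:
        return fallback
    charset, _language, confidence = max(matches, key=lambda match: match[2])
    return charset if confidence >= min_confidence else fallback


print(pick_best_guess(MATCHES))      # windows-1250
print(pick_with_threshold(MATCHES))  # UTF-8
```

With a top confidence of only 35, the threshold variant falls back to UTF-8, which is the wrong answer for this particular file.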
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

I don't understand half of it. How can we repair this problem? For my part, that is the only thing that concerns me.
Last edited by Stule on 05 Mar 2024, 10:02, edited 1 time in total.
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

It's a huge problem for me. It messed up a ton of stuff and made my job even bigger because I need to locate every file and convert it again. But will I be able to do that automatically? Or do I need to download new subtitles, correct the timings on all of them again, and only then put them into FileBot? If the latter is the case, that is an abnormal amount of work.
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by Stule »

I did not have any problems with older versions; until now every subtitle was transcoded fine for me.
rednoah
The Source
Posts: 22987
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: Huge problem i think with new version only (it scramble the subtitles and delete the UTF-8 BOM)

Post by rednoah »

:idea: FileBot r10193 improves charset detection using the windows-1250 encoded text files provided above as additional sample data.


:idea: If you have the original windows-1250 encoded files, then you can use previous FileBot versions, or revision FileBot r10193 or higher, and try again. If you don't have the original windows-1250 encoded files anymore, then you are unfortunately out of luck.


:!: Note that auto-detection is statistics and probabilities, and not an exact science. It may work. It may even work most of the time. But there's always a small chance that it'll unexpectedly not work for one specific file with some specific unlucky character composition. If you need perfect reliability, and if you can assume that all your files are always windows-1250 encoded, then you may want to look into a custom solution for your specific use case to convert windows-1250 specifically to UTF-8.
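
A custom solution of that kind could look roughly like the sketch below. It is a minimal Python example and not part of FileBot: the function name and the source / target arguments are made up for illustration, and it assumes every input that is not already valid UTF-8 really is windows-1250 (Python calls this codec `cp1250`).

```python
from pathlib import Path


def convert_cp1250_to_utf8(source: Path, target: Path) -> bool:
    """Write a UTF-8 copy of `source` to `target`.

    Returns True if the bytes were converted from windows-1250,
    False if they were already valid UTF-8 and were copied unchanged.
    """
    raw = source.read_bytes()

    # Guard against converting twice: decoding UTF-8 bytes as cp1250 would garble them.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    else:
        target.write_bytes(raw)
        return False

    # Strict decoding: raises UnicodeDecodeError on bytes that cp1250 leaves undefined.
    text = raw.decode("cp1250")
    target.write_bytes(text.encode("utf-8"))
    return True
```

Writing to a separate target path leaves the original file untouched, so a wrong assumption about the encoding stays recoverable.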




EDIT:

FileBot r10194 will additionally check transcoded subtitles for � replacement characters and fail-fast. (NOTE: FileBot will now equally fail on subtitles that are decoded correctly and intentionally contain � characters for some reason; time will tell if this is actually a real-world problem or perhaps even a feature)
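
The fail-fast idea can be illustrated with a few lines of Python. This is only a sketch of the general technique, not the actual r10194 implementation, and `decode_or_fail` is a hypothetical name.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # the � character


def decode_or_fail(raw: bytes, charset: str) -> str:
    """Decode `raw` with `charset` and refuse the result if it contains U+FFFD."""
    text = raw.decode(charset, errors="replace")
    if REPLACEMENT_CHARACTER in text:
        raise ValueError(f"decoding as {charset} produced replacement characters")
    return text


sample = "čaj".encode("cp1250")
print(decode_or_fail(sample, "cp1250"))  # čaj

try:
    decode_or_fail(sample, "utf-8")
except ValueError as error:
    print("rejected:", error)
```

As noted above, the check cannot tell a bad decode from a file that really contains �, so such a file would be rejected as well.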
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European langua

Post by Stule »

How should I have the original files when I renamed them using FileBot? I can't possibly understand why you would mess with a working solution and make it not work anymore. All subtitles worked like a charm and now they don't. How do I download r10194? I don't see any workable solution in your response. So I don't see any more value in this program if it can't do what it could until now.
rednoah
The Source
Posts: 22987
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European langua

Post by rednoah »

Stule wrote: 05 Mar 2024, 13:38 I can't possibly understand why you would mess with a working solution and make it not work anymore
It was never designed for your use case, nor tested with it. It just happened to work. We have never tested Windows-1250 encoded Serbian (?) Czech (?) subtitles. We have automated tests for Korean EUC-KR encoded subtitles because a Korean user requested the feature and provided the test data.

Changes to encoding auto-detection were made to fix an issue where UTF-8 encoded files were incorrectly decoded with Windows-1252 (1) and that fix evidently had unexpected side-effects when it comes to Windows-1250 encoded files. However, all our test cases continued to pass, and so nobody noticed. Unfortunately, we didn't have test cases for Windows-1250 encoded files, and there probably aren't many Serbian (?) Czech (?) beta testers that use this specific feature, so nobody noticed until you noticed. That's how these things happen. Now that we know about the Windows-1250 issue we can fix the Windows-1250 issue with upcoming revisions / releases.

If you don't have a backup of the original subtitle files in Windows-1250 encoding then there is unfortunately no solution. There's nothing you can do. I'm sorry for your loss. Backups are always a good idea.
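
Why the damage cannot be undone can be shown with a small Python sketch. It assumes the wrongly transcoded files contain � where the accented letters used to be (which is what the replacement character check mentioned earlier in this thread suggests); the function name is hypothetical.

```python
def transcode_with_wrong_charset(text: str) -> str:
    """Simulate the bug: windows-1250 bytes are decoded as if they were UTF-8."""
    raw = text.encode("cp1250")
    return raw.decode("utf-8", errors="replace")


first = transcode_with_wrong_charset("čaj")
second = transcode_with_wrong_charset("žaj")

# Both accented letters collapse into the same U+FFFD replacement character,
# so the output no longer says which letter was there originally.
print(first == second)                      # True
print(first.encode("unicode_escape"))       # b'\\ufffdaj'
```

Since different letters end up as the same �, no script can reliably map the transcoded text back to the original; only the original bytes contain that information.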
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European langua

Post by Stule »

No, I don't have them anymore. OK, I understand that part about the messed-up files. But is there a way to download or downgrade to a version that worked for me? Which version is that, and how do I install it? Should I remove the current one first? Because at this moment the program is of no value to me.
rednoah
The Source
Posts: 22987
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European langua

Post by rednoah »

:arrow: You can use the latest revision which already includes the fix mentioned above:
https://get.filebot.net/filebot/BETA/

:arrow: Alternatively, you can also use an older release:
https://get.filebot.net/filebot/
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European langua

Post by Stule »

When I install this beta, do I need to select, type or correct anything anywhere, or will it work automatically?
rednoah
The Source
Posts: 22987
Joined: 16 Nov 2011, 08:59
Location: Taipei

Re: Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European langua

Post by rednoah »

It'll work by default. Please run tests and confirm. Note that only a portable ZIP package is available for testing, thus there is nothing to install, just something to extract & run.
Stule
Posts: 14
Joined: 04 Mar 2024, 21:32

Re: Charset auto-detection does not work reliably for Windows-1250 encoded text files (Czech and Eastern European langua

Post by Stule »

Now it works. So for the already renamed subtitles there is no help: I need to find all of them again, then correct the timings for all of them, and only then rename them in FileBot portable. That is a huge amount of work.