Why is FileBot so slow at calculating CRC values?

Post by **Mute** » 03 Nov 2014, 22:13

I like to rename files to include a CRC value. I do this so that if I ever suspect there has been data corruption or data loss in my disk array, I can just run all the files through a hash checker to find corrupted or partial files (I started employing this practice after a RAID5 rebuild went wrong).

I can't figure out why FileBot takes so long to calculate these values when renaming. FileBot (Java) barely uses any resources while it's calculating values. I wish it would! I've got 24 logical cores in this machine, but it barely uses a fraction of one core.

I've experienced the same thing on a machine running an E3-1235 v3 and on the machine from the screenshots, running dual Xeon X5650s. It's definitely not a CPU performance issue.

For comparison, this Hash Calculator will calculate SHA-1, MD5, and CRC32 simultaneously and do it faster than FileBot does CRC alone. It utilizes a lot more resources, which in this scenario is a good thing.

Is there currently any way to make FileBot use more resources? If not, would you consider making this an option, or the default behaviour?

Post by **rednoah** » 04 Nov 2014, 07:13

FileBot has never been tested nor optimized for RAID5 / 24 CPU systems.

For most people (1 HDD), I/O would be the main bottleneck. For this reason, and for simplicity, all hash calculation is single-threaded, so it'll only use 1 core.

Also, in the format, {crc32} only calculates the hash as a last resort. It'll first check if it can copy it from the filename itself, or from an sfv file somewhere in the filesystem.
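
The lookup order described above can be sketched roughly as follows. This is only an illustration, not FileBot's actual code: the class `CrcLookup`, its method names, and the assumption that the CRC appears in the filename as 8 hex digits in square brackets are all hypothetical, and the sfv lookup step is left out for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.CRC32;

// Hypothetical helper, not part of FileBot.
class CrcLookup {
    // Assumed naming convention: "Name [1A2B3C4D].ext"
    private static final Pattern EMBEDDED_CRC = Pattern.compile("\\[(\\p{XDigit}{8})\\]");

    // Cheap path: take the CRC from the filename if one is embedded.
    static Optional<String> crcFromName(String fileName) {
        Matcher m = EMBEDDED_CRC.matcher(fileName);
        return m.find() ? Optional.of(m.group(1).toUpperCase(Locale.ROOT)) : Optional.empty();
    }

    // Expensive path: read the whole file on the calling thread.
    static String computeCrc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        }
        return String.format("%08X", crc.getValue());
    }

    // Filename first, calculation only as a last resort.
    static String crcFor(Path file) throws IOException {
        Optional<String> fromName = crcFromName(file.getFileName().toString());
        return fromName.isPresent() ? fromName.get() : computeCrc32(file);
    }
}
```

Note that with this order a value embedded in the filename is trusted as-is and the file content is never read.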

Having a different optimized tool maintain sfv files might be a good idea in your case.

Post by **Mute** » 04 Nov 2014, 15:46

Fair enough. Thanks for the reply.

Still, it would be awesome if we could toggle FileBot into multi-threaded mode.

Multicore processors are the standard now, and many are "hyperthreaded." It's not uncommon to see 4-8 logical core machines in the consumer space, and data integrity concerns aren't limited to large RAID arrays.

rednoah wrote: "Also in the format {crc32} only calculates the hash as a last resort. It'll first check if it can copy it from the filename itself, or from an sfv file somewhere in the filesystem."

On a different note...
Concerning my use case, I actually don't want FileBot to ever use a CRC value from an SFV file or the filename for a rename operation. I want FileBot to calculate a value on its own and then check it against the value it may find in the original filename to verify the integrity of the file before renaming. The way it stands, if a file is downloaded improperly, FileBot will rename it with a false value if it uses the one attached to the filename.
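
To make the request concrete, here is a rough sketch of the check I have in mind. It is not something FileBot does today; the class `CrcVerifier`, the `Result` enum, and the bracketed 8-hex-digit filename convention are my own assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper, not part of FileBot.
class CrcVerifier {
    enum Result { MATCH, MISMATCH, NO_EMBEDDED_CRC }

    // Assumed naming convention: "Name [1A2B3C4D].ext"
    private static final Pattern EMBEDDED_CRC = Pattern.compile("\\[(\\p{XDigit}{8})\\]");

    // Always read the file; never trust the value in the name.
    static long computeCrc32(Path file) throws IOException {
        try (CheckedInputStream in =
                 new CheckedInputStream(Files.newInputStream(file), new CRC32())) {
            byte[] buffer = new byte[1 << 16];
            while (in.read(buffer) != -1) {
                // Draining the stream updates the checksum as a side effect.
            }
            return in.getChecksum().getValue();
        }
    }

    // Compare the freshly computed value with the one embedded in the filename.
    static Result verify(Path file) throws IOException {
        Matcher m = EMBEDDED_CRC.matcher(file.getFileName().toString());
        if (!m.find()) {
            return Result.NO_EMBEDDED_CRC;
        }
        long expected = Long.parseLong(m.group(1), 16);
        return computeCrc32(file) == expected ? Result.MATCH : Result.MISMATCH;
    }
}
```

A rename would then only go ahead on `MATCH` (or on `NO_EMBEDDED_CRC`, using the computed value), and a `MISMATCH` would flag the file as a bad transfer.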

I understand that this is a fringe case and not worth your development resources, but I'll mention it anyway, because more customization for power users would be awesome. In my case, I download files to a remote server first. The remote client does a hash check upon completion, but then there is still the process of transferring the file from the remote server to my local array.

Like I said, no expectations of accommodations for that last part, but for your information in case you ever decide to create some more granularity in the way FileBot operates.

Post by **rednoah** » 04 Nov 2014, 16:36

These hash algorithms cannot necessarily be multi-threaded. I suppose it somehow works for CRC32 but I'm not sure if that's even possible for MD5 or SHA. Besides, I'd have to implement CRC32 myself, so that's not gonna happen.

Using {crc32} doesn't really make sense to me, because you still don't know if the file is or was corrupt at some point before renaming. Best thing you can do is just make an sfv/md5/sha file for each folder, or file, on the original machine, and then as long as you have those files you can always check integrity.

It's just a few lines of Groovy code to write CRC32 to NTFS Extended Attributes, in parallel (parallel is easy as opposed to multi-threaded) on many files if you want, and as long as those get synced properly as well you always have your checksum as part of the file metadata.

Post by **rednoah** » 04 Nov 2014, 18:53

New Script: Calc/Check CRC32 (and store in xattr)

http://pastebin.com/s3REmpU7

I mean doing something like this. As long as you have a file for each core you can easily process everything in parallel. Also, using a 64-bit Oracle JVM should give you better performance.
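
For readers who don't want to open the pastebin link, the "one file per core" idea can be sketched in plain Java roughly like this. It is not the linked Groovy script: the class `ParallelCrc` and its method names are made up for illustration, and the xattr step is left out. Each file is still hashed by a single thread; the parallelism comes only from hashing several files at once.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.zip.CRC32;

// Hypothetical helper, not the script from the link above.
class ParallelCrc {
    // Hash one file sequentially on whichever worker thread picks it up.
    static long crc32Of(Path file) {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return crc.getValue();
    }

    // Spread the files across the common fork-join pool.
    // Assumes the list contains no duplicate paths.
    static Map<Path, Long> crc32OfAll(List<Path> files) {
        return files.parallelStream()
                .collect(Collectors.toConcurrentMap(f -> f, ParallelCrc::crc32Of));
    }
}
```

Whether this actually helps depends on the storage: on a single HDD the extra threads mostly compete for the same disk, while an array that can serve several reads at once has more to gain.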

Post by **Mute** » 04 Nov 2014, 18:56

That's really cool. I haven't played with Groovy at all. I'll check that out. Thanks!

Post by **rednoah** » 05 Nov 2014, 10:44

Added the script to my repository:
viewtopic.php?f=4&t=5#p12396
