Read a compressed file (GZ) without unpacking it?

kim
Power User
Posts: 1251
Joined: 15 May 2014, 16:17

Post by kim »

Is there a way to read a compressed file (GZ) without extracting / unpacking it?
(I'm working on: download gz file -> read it -> write / save it in FileBot's cache. I believe the cache only works with strings.)
  • Must work on Linux and Windows
  • I'd prefer to use functions built into FileBot
rednoah
The Source
Posts: 22923
Joined: 16 Nov 2011, 08:59
Location: Taipei

Post by rednoah »

You mean you want to preemptively fill the internal FileBot cache files with some information so that FileBot won't request it at runtime? Newer revisions, starting some time last month, always use gzipped byte[] as the cache value, to reduce excessive RAM usage caused by very large XML / JSON files.
:idea: Please read the FAQ and How to Request Help.

Post by kim »

I want to update NFO files with ratings from the IMDb Datasets:
https://www.imdb.com/interfaces/

and skip the steps: save gz file -> unpack file -> read file -> delete files.

So no more clear text in the cache; now it is bytes, like this?

Code: Select all

byte[] content = 'text'.getBytes()

Post by rednoah »

curl, gunzip, and rm would make this easy if you were to use bash. You can of course do it with Groovy / Java, but the code will probably require a fair bit of plumbing for the java.util.zip.GZIPInputStream part. I'd use bash to download / extract all the files you need, so that you can then just read the local plain text file from your Groovy code.
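For reference, the java.util.zip.GZIPInputStream plumbing mentioned above can be sketched in plain Java (which can also be called from Groovy). This is only an illustration, not FileBot code: the class and method names are made up for the example, and `readAllBytes()` assumes Java 9 or newer. The gzip data is decompressed straight from a stream into a String, so no unpacked file is ever written to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, named for this example only.
class GzipInMemory {

    // Compress text to gzip bytes (stands in for a downloaded .gz file).
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompress any gzip stream into a String; no temp file is involved.
    static String gunzipToString(InputStream compressed) {
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip("first line\nsecond line\n");
        System.out.print(gunzipToString(new ByteArrayInputStream(gz)));
    }
}
```

The same `gunzipToString` call should work on a stream obtained from a URL (`openStream()`) or from a local file, since it only relies on the `InputStream` interface.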

Post by kim »

I was thinking I could do what you do with e.g. exclude-blacklist.txt.xz -> cache/data_x.data
(or any packed file going into the cache).

Do you use a temp dir and then delete the file, or something else?

Post by rednoah »

FileBot itself will internally download, decompress, and parse the data into memory at the same time. A gzipped copy of the downloaded content is added to the in-memory cache and flushed to disk as necessary by the ehcache library. The internal workings there are very much subject to change without notice. You could certainly prepare the cache to make FileBot work as if it had just downloaded and cached this or that file, but you'll probably have to update your code with every other release.
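To illustrate the idea of keeping a gzipped copy as the cache value, here is a rough sketch in plain Java. It is not FileBot's implementation (which relies on ehcache and is internal): `GzipValueCache` and its methods are hypothetical names, a `ConcurrentHashMap` stands in for the real cache, and `String.repeat` assumes Java 11 or newer. Values are compressed on `put` and decompressed in memory on `get`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a cache that stores gzipped byte[] values.
class GzipValueCache {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    // Compress the text and keep only the gzipped bytes.
    void put(String key, String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        store.put(key, buffer.toByteArray());
    }

    // Decompress in memory on every read; returns null for unknown keys.
    String get(String key) {
        byte[] gz = store.get(key);
        if (gz == null) {
            return null;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size of the stored (compressed) value, or -1 if the key is absent.
    int storedBytes(String key) {
        byte[] gz = store.get(key);
        return gz == null ? -1 : gz.length;
    }

    public static void main(String[] args) {
        GzipValueCache cache = new GzipValueCache();
        String json = "{\"id\":1}\n".repeat(1000);
        cache.put("export.json", json);
        System.out.println(json.length() + " chars stored as " + cache.storedBytes("export.json") + " bytes");
        System.out.println("round trip ok: " + json.equals(cache.get("export.json")));
    }
}
```

The trade-off is the one described above: less memory per cached value, at the cost of a decompression step on every read.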

Post by kim »

Yes, that sounds like what I was looking for, but I don't know how to do it.
I don't mind if I need to change it again and again.

Post by rednoah »

As for preparing the cache, I have some FileBot / Groovy Script test code, where I do this:

Code: Select all

def data = args[0] as File
Cache.getCache('Manami', CacheType.Persistent).put('anime-offline-database.json', data.resolve('anime-offline-database.json.gz').bytes)

Post by kim »

Thanks, but I fail to see how I can use this. It looks like it just puts the same compressed gz file into the cache?

Where is the "download, decompress, and parse the data into memory at the same time" part?
Decompressing in memory is my main problem.
I need to be able to read the clear text, or I can't use it.

By the way, are these two the same?

Code: Select all

Cache.getCache('Manami', CacheType.Persistent).put('anime-offline-database.json', data.resolve('anime-offline-database.json.gz').bytes)

Code: Select all

Cache.getCache('Manami', CacheType.Persistent).computeIfAbsent('anime-offline-database.json') {
	data.resolve('anime-offline-database.json.gz').bytes	
}
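On the put versus computeIfAbsent question: FileBot's Cache class is internal, so the following only shows how the two methods of the same name differ on a standard java.util.Map. Whether Cache behaves the same way is an assumption. On a Map, `put` always stores the new value, while `computeIfAbsent` runs its loader only when the key is missing and otherwise keeps the existing value. The class and method names below are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; a HashMap stands in for the cache.
class PutVsComputeIfAbsent {

    // Counts how often the loader runs for repeated lookups of one key.
    static int loaderRuns(int calls) {
        Map<String, byte[]> cache = new HashMap<>();
        AtomicInteger runs = new AtomicInteger();
        for (int i = 0; i < calls; i++) {
            cache.computeIfAbsent("anime-offline-database.json", key -> {
                runs.incrementAndGet();
                return new byte[] {1, 2, 3}; // stand-in for the file bytes
            });
        }
        return runs.get();
    }

    // put() stores unconditionally, so the second value replaces the first.
    static String valueAfterTwoPuts() {
        Map<String, String> cache = new HashMap<>();
        cache.put("key", "first");
        cache.put("key", "second");
        return cache.get("key");
    }

    // computeIfAbsent() leaves an existing value untouched.
    static String valueAfterPutThenCompute() {
        Map<String, String> cache = new HashMap<>();
        cache.put("key", "first");
        cache.computeIfAbsent("key", key -> "second");
        return cache.get("key");
    }

    public static void main(String[] args) {
        System.out.println("loader runs for 3 lookups: " + loaderRuns(3));
        System.out.println("after two puts: " + valueAfterTwoPuts());
        System.out.println("after put then computeIfAbsent: " + valueAfterPutThenCompute());
    }
}
```

So on a plain Map the two snippets are only equivalent when the key is not yet present; with an existing entry, `put` overwrites it and `computeIfAbsent` does not touch it.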

Post by rednoah »

kim wrote: 03 Nov 2021, 15:47 thx but I fail to see how I can use this, it looks like it just puts the same compressed gz file in the cache?
Yep, the code does exactly that and nothing else. Everything else is internal Java code.

kim wrote: 03 Nov 2021, 15:47 where is the "download, decompress, and parse the data into memory at the same time"
Sorry, I don't have a copy & paste Groovy solution for that at hand. The internal FileBot code doesn't lend itself well to external Groovy scripts, so I don't have any example code for that either. I'd use standard Groovy / Java code to fetch and gunzip the file contents if you must do things from within your Groovy script.



EDIT:

This should work:

Code: Select all

def url = 'https://files.tmdb.org/p/exports/tv_series_ids_11_01_2021.json.gz'
def bytes = Cache.getCache('exports', CacheType.Persistent).bytes(url, URL.&new, java.util.zip.GZIPInputStream.&new).get()
def text = new String(bytes, 'UTF-8')
println text

Post by kim »

Nice, thanks :D (you probably saved me a lot of time... EDIT: I would never have figured it out)

Can you link to something that explains how this works?
bytes() must be FileBot-only; getBytes() does not work.

Code: Select all

bytes(url, URL.&new, GZIPInputStream.&new)

Code: Select all

bytes(url, { new URL(it) }, { new GZIPInputStream(it) } )
The first form (URL.&new) does not work in older FileBot, but I remembered this:
viewtopic.php?p=44418#p44418

This works in older FileBot:

Code: Select all

{
import net.filebot.Cache
import net.filebot.CacheType
import java.util.zip.GZIPInputStream
def url = 'https://files.tmdb.org/p/exports/tv_series_ids_11_01_2021.json.gz'
def bytes = Cache.getCache('exports', CacheType.Persistent).bytes(url, { new URL(it) }, { new GZIPInputStream(it) } ).get()
def text = new String(bytes, 'UTF-8')
println text
}

Post by kim »

Is it possible to use...

Code: Select all

Cache.getCache('exports', CacheType.Persistent).bytes(url, { new URL(it) }, { new GZIPInputStream(it) } ).get()
like ...

Code: Select all

Cache.getCache('Manami', CacheType.Persistent).computeIfAbsent('anime-offline-database.json') {
	data.resolve('anime-offline-database.json.gz').bytes	
}
I would like to have the option to change things before they are saved to the cache.

Post by rednoah »

kim wrote: 03 Nov 2021, 19:48 Nice thx :D (you probably saved me a lot of time... EDIT: I would never have figured it out )
Don't worry. It's undocumented internal API / internal plumbing that constantly changes. I didn't even know that Groovy can automatically cast method handles to interfaces.

Here's what the method signature looks like:

Code: Select all

public <T> CachedResource<T, byte[]> bytes(T key, Transform<T, URL> resource, Transform<InputStream, InputStream> decompressor)
Your decompressor code could in theory read and decompress the original input stream, and then change stuff, and then generate a byte array input stream to pass back. That's going to be a lot of plumbing.
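As a rough idea of what such a decompressor would involve, here is a sketch in plain Java. It does not use FileBot's Transform interface (which is internal); `EditingDecompressor`, `decompressAndEdit`, and the `edit` parameter are hypothetical names for this example, and `readAllBytes()` assumes Java 9 or newer. The method reads the gzip stream, applies a change to the text, and hands back a byte array input stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical example class, not part of FileBot.
class EditingDecompressor {

    // Gunzip the input, apply the edit to the text, return a new stream.
    static InputStream decompressAndEdit(InputStream compressed, UnaryOperator<String> edit) {
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            String edited = edit.apply(text);
            return new ByteArrayInputStream(edited.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: run the decompressor on gzip bytes, return text.
    static String editedText(byte[] gz, UnaryOperator<String> edit) {
        try (InputStream result = decompressAndEdit(new ByteArrayInputStream(gz), edit)) {
            return new String(result.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compress text to gzip bytes (stands in for the downloaded file).
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] gz = gzip("{\"id\":1}\n{\"id\":2}\n");
        // Example edit: replace the field name before the data is stored.
        System.out.print(editedText(gz, text -> text.replace("\"id\"", "\"tmdb_id\"")));
    }
}
```

Note that this holds the whole decompressed text in memory twice (once as a String, once as bytes), which may matter for very large exports.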

It's probably easier to cache multiple things. The thing you download is one thing. And the thing you generated from the downloaded thing is another:

Code: Select all

def value = Cache.getCache('string', CacheType.Persistent).computeIfAbsent(url) {
    def bytes = Cache.getCache('raw', CacheType.Persistent).bytes(url, { new URL(it) }, { new GZIPInputStream(it) } ).get()
    return new String(bytes, 'UTF-8')
}

Post by kim »

Running this in the GUI creates the 2 files raw_0.data and string_0.data, BUT both contain plain text, and FileBot stops responding.
Tried with 4.9.4 and an older version.

Code: Select all

{
	import net.filebot.Cache
	import net.filebot.CacheType
	import java.util.zip.GZIPInputStream

	def url = 'https://files.tmdb.org/p/exports/tv_series_ids_11_01_2021.json.gz'
	def value = Cache.getCache('string', CacheType.Persistent).computeIfAbsent(url) {
		def bytes = Cache.getCache('raw', CacheType.Persistent).bytes(url, { new URL(it) }, { new GZIPInputStream(it) } ).get()
		return new String(bytes, 'UTF-8')
	}
}

Post by kim »

Is it possible to do this with a local file?

Code: Select all

{
import net.filebot.Cache
import net.filebot.CacheType
import java.util.zip.GZIPInputStream
def url = 'https://files.tmdb.org/p/exports/tv_series_ids_11_01_2021.json.gz'
def bytes = Cache.getCache('exports', CacheType.Persistent).bytes(url, { new URL(it) }, { new GZIPInputStream(it) } ).get()
def text = new String(bytes, 'UTF-8')
println text
}

Post by rednoah »

You'll probably want to read things with your own code, and then do things for each line with your own code, because the internal caching APIs will just get in the way if you don't actually have the particular use case that those APIs are designed for.


e.g. decompress and read gzipped text file:

Code: Select all

def f = '/path/to/tv_series_ids_11_01_2021.json.gz' as File

def stream = new java.util.zip.GZIPInputStream(f.newInputStream())
def lines = stream.readLines('UTF-8')

println lines.size()

:idea: Groovy Pad may not perform well if you print 100000+ lines, especially if the memory limit is less than a gigabyte, so when testing, it's best to print just a few lines and the line count instead of every single line.
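Following the advice above, a plain Java sketch that prints only the first few lines plus the line count could look like this. The class name, method names, and the temp-file sample are assumptions made for the example; in practice the path would point at the real .gz file. Lines are streamed through a BufferedReader, so the file is never unpacked to disk and never held in memory as a whole.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical example class for reading a local gzipped text file.
class GzipLineReader {

    // Open the gzip file as a UTF-8 line reader.
    static BufferedReader open(Path gzFile) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8));
    }

    // Count lines without keeping them in memory.
    static long countLines(Path gzFile) {
        try (BufferedReader reader = open(gzFile)) {
            return reader.lines().count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read at most n lines from the start of the file.
    static List<String> firstLines(Path gzFile, int n) {
        try (BufferedReader reader = open(gzFile)) {
            return reader.lines().limit(n).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write a small gzipped sample to a temp file (demo data only).
    static Path writeSample(String text) {
        try {
            Path file = Files.createTempFile("sample", ".json.gz");
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample("{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n");
        firstLines(file, 2).forEach(System.out::println);
        System.out.println("lines: " + countLines(file));
    }
}
```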