
Let's say there is one huge db.sql.gz, 100 GB in size, available at https://example.com/db/backups/db.sql.gz, and the server supports range requests.

So instead of downloading the entire file, I downloaded y bytes (say 1024 bytes) at an offset of x bytes (say 1000 bytes), like the following.

curl -r 1000-2024 https://example.com/db/backups/db.sql.gz -o dbrange.sql.gz

With the above command I was able to download partial content of the gzipped file. Now my question is: how can I read that partial content?

I tried gunzip -c dbrange.sql.gz | dd ibs=1024 skip=0 count=1 > o.sql but this gives an error:

gzip: dbrange.sql.gz: not in gzip format

The error is understandable, since I guess there are header blocks at the top of the file which describe the encoding.
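
That guess can be checked directly: every gzip stream begins with the magic bytes 1f 8b followed by header fields, so a range that starts mid-file hands gunzip data with no recognizable header. A quick illustrative check against the same URL:

$ curl -s -r 0-9 https://example.com/db/backups/db.sql.gz | xxd
# the first two bytes should read 1f 8b; a mid-file range shows arbitrary compressed data instead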


I noticed that if I download the file without an offset, I'm able to read it using gunzip and piping.

curl -r 0-2024 https://example.com/db/backups/db.sql.gz
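
For reference, a prefix of a gzip stream decompresses as far as it goes: gunzip writes out whatever it can inflate and then complains about the truncation. A minimal sketch of the piping (file name illustrative):

$ curl -r 0-2024 https://example.com/db/backups/db.sql.gz | gunzip -c > partial.sql
# gunzip exits with "unexpected end of file", but partial.sql
# contains everything recoverable from the first 2025 bytes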

2 Answers


Just FWIW, gzip can be accessed randomly, if an index file has been created beforehand...

I've developed a command line tool that can quickly and (almost-)randomly access a gzip if an index is provided (if it is not provided, it is automatically created):

https://github.com/circulosmeos/gztool

gztool can be used to access chunks of the original gzip file, provided those chunks are retrieved starting at the specific byte points the index marks (minus 1 byte to be safe, because gzip is a stream of bits, not bytes), or better, after them.

For example, if an index point starts at compressed byte 1508611 of the gzip file (gztool -ll index.gzi provides this data) and we want 1M of compressed bytes after that:

$ curl -r 1508610-2508611 https://example.com/db/backups/db.sql.gz > chunk.gz
  • Note that chunk.gz will occupy only the chunk's size on disk!
  • Also note that it is not a valid gzip file, as it is incomplete.
  • Also take into account that we retrieve from the desired index point position minus 1 byte.

Now the complete index must also be retrieved. It is created only once beforehand: for example with gztool -i *.gz to create indexes for all your already-gzipped files, or gztool -c * to both compress and create indexes. Note that indexes are ~0.3% of the gzip size (or much smaller if gztool compresses the data itself).
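
For instance, the index for this example could have been built once on the machine hosting the backups (a sketch; the path is illustrative):

$ gztool -i db.sql.gz    # creates db.sql.gzi alongside the gzip file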

$ curl https://example.com/db/backups/db.sql.gzi -o chunk.gzi

And now the extraction can be done with gztool. The uncompressed byte corresponding to compressed byte 1508610 (or a byte past it) must be known; the index can show this info with gztool -ll. See examples here. Let's suppose it is byte 9009009. Or rather, the uncompressed byte we want is just past the first index point contained in chunk.gz; let's suppose again that this byte is 9009009 in this case.
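
To see that mapping for the index downloaded above (chunk.gzi):

$ gztool -ll chunk.gzi    # lists each index point's compressed byte and its uncompressed counterpart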

$ gztool -n 1508610 -b 9009009 chunk.gz > extracted_chunk.sql

gztool will stop extracting data when the chunk.gz file ends.

It may be tricky, but it works without changing the compression method or the already-compressed files; indexes just need to be created for them.


NOTES: Another way to do the extraction without the -n parameter is to fill the gzip file with sparse zeroes: in this example that is done with a dd command before the first curl that retrieves chunk.gz, so:

$ dd if=/dev/zero of=chunk.gz seek=1508609 bs=1 count=0
$ curl -r 1508610-2508611 https://example.com/db/backups/db.sql.gz >> chunk.gz
$ curl https://example.com/db/backups/db.sql.gzi -o chunk.gzi

This way, the first 1508609 bytes of the file are zeroes, but they don't occupy space on disk. Without seek in the dd command the zeroes would all be written to disk, which would also be valid for gzip, but this way we don't use unnecessary disk space. The gztool command then doesn't need the -n parameter. The zeroed data is never needed because, as the index exists, gztool uses it to jump to the index point just before uncompressed byte position 9009009, so all previous data is simply ignored:

$ gztool -b 9009009 chunk.gz > extracted_chunk.sql
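
To verify that the zero-filled prefix really is sparse, compare the apparent file size with the disk blocks actually allocated (sizes illustrative):

$ ls -l chunk.gz    # apparent size: offset + chunk, ~2.5 MB here
$ du -h chunk.gz    # allocated size: roughly just the ~1 MB downloaded chunk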

gzip doesn’t produce block-compressed files (see the RFC for the gory details), so it’s not suitable for random access on its own. You can start reading from a stream and stop whenever you want, which is why your curl -r 0-2024 example works, but you can’t pick up a stream in the middle, unless you have a complementary file to provide the missing data (such as the index files created by gztool).

To achieve what you’re trying to do, you need to use block compression of some sort; e.g. bgzip (which produces files that plain gzip can decompress) or bzip2, and do some work on the receiving end to determine where the block boundaries lie. Peter Cock has written a few interesting posts on the subject: BGZF - Blocked, Bigger & Better GZIP!, Random access to BZIP2?
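
As a minimal sketch of the bgzip route, assuming htslib’s bgzip is installed (file names and offsets illustrative):

$ bgzip -i db.sql                                      # compress to db.sql.gz and write a db.sql.gz.gzi index
$ bgzip -b 9009009 -s 1048576 db.sql.gz > slice.sql    # emit 1 MiB of uncompressed data
                                                       # starting at uncompressed byte 9009009

Because BGZF blocks are self-contained gzip members, the same range-request trick from the question applies, as long as the .gzi index is fetched so block boundaries can be looked up.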

  • Wow, that’s interesting! Somewhat pedantically, it’s the “index + .gz” combination which can be accessed randomly, not the .gz ;-). Does gztool -c create the index during compression? Commented Sep 15, 2019 at 18:31
  • Sorry @stephen-kitt, I turned the comment into a proper answer. gztool does not compress: it decompresses gzip files (from any byte position) and creates indexes while doing so (it can also just create the index; both take the same time). But it can, for example, supervise (-S) a gzip file that's being created by another process, to create the index in the background. There are also tailing options, even for a growing gzip file (-T). Take a look at the examples!: github.com/circulosmeos/gztool Commented Sep 16, 2019 at 16:23
  • Your own usage docs mention the -c option which compresses the given input file to stdout; that’s what I’m asking about. I know that gztool’s primary function is to create indexes; but I thought it would be useful to create indexes while compressing... Otherwise gztool isn’t all that useful as an answer to the question: if the index file isn’t provided alongside the download, then curl -r 1000-2024 https://example.com/db/backups/db.sql.gz can never be made to work on its own. Commented Sep 16, 2019 at 16:27
  • Oh, sorry, yes: but -c and -d are "utilities" to compress/decompress raw zlib streams, not gzip files... it can be useful to decompress raw chunks of zlib data... Thanks for the comment: I'll clarify this... I created gztool primarily to handle gzipped log files created by other processes, so I hadn't planned for compression. The -S option can cover that use case, including the creation of an index while the gzip file is being created by another tool (rsyslog, gzip...) Commented Sep 16, 2019 at 16:33
  • Yes, you're right: the index is needed. I tried to state that clearly throughout the answer. Commented Sep 16, 2019 at 16:38
