
I'm using np.savez_compressed() to compress and save a single large 4D NumPy array, but it uses only one CPU core. Is there an alternative that can use many cores? Preferably something simple, without having to code the array splitting and the compression of its pieces across multiple processes myself.

  • numpy.org/doc/stable/reference/generated/… says the .npz format is a ZIPed archive, which implies the compression algorithm is DEFLATE, aka LZ77, the same one gzip / zlib implement. It is possible to split the data into blocks you compress separately with only a small loss of compression ratio, and still get a valid stream that gunzip can decompress (that's what pigz does; zlib.net/pigz), so in theory it should be possible without breaking file-format compatibility (see the first sketch after these comments). Commented Oct 10, 2023 at 20:28
  • But if you're looking for speed, something based on zstd (en.wikipedia.org/wiki/Zstd) compresses about 10x faster than zlib (per core) with a similar compression ratio for most data, and the standard implementation was designed with threading in mind, chunking data so separate threads can work on separate chunks. On an 8-core machine, a good implementation using it might optimistically be 80x faster than the current np.savez_compressed, rather than just 8x (see the zstd sketch after these comments). Commented Oct 10, 2023 at 20:34
  • (I have no idea if anyone's already written anything faster, but yes it should be very possible to do better than using a single thread to generate a ZIP file, with similar compression ratio.) Commented Oct 10, 2023 at 20:47
  • @PeterCordes Hmm, I had already tried zstd in their previous question (here), and with random data of 98% 1-bytes and 2% 2-bytes (similar to what they described), zstd was disappointing. At the default level (3) its compression was ~20% worse, and at a level high enough (13) to match the compression, it was about equally fast with two threads. Commented Oct 11, 2023 at 5:24
  • @PeterCordes Yes, I've used zstd in the past for real-world data and it was very good. If their data is still roughly 98% 1-bytes and 2% 2-bytes, maybe a simple custom preprocessing step in NumPy (like representing each streak of 1-bytes as a single byte giving its length) would be both very fast and compress very well (see the run-length sketch after these comments). I'd need some code from them to generate realistic data, though. Commented Oct 11, 2023 at 6:01
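
A minimal sketch of the block-wise DEFLATE idea from the first comment. It is not pigz itself: each chunk is written as its own complete gzip member, and a concatenation of gzip members is still something gunzip and Python's gzip module can decompress. The chunk size, file name and helper names are placeholders, and note that this stores only the raw bytes, not the array's shape or dtype:

```python
import gzip
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _gzip_member(chunk):
    # Each chunk becomes a complete gzip member; concatenated members
    # are still a valid stream for gunzip / gzip.decompress.
    return gzip.compress(chunk, compresslevel=6)

def save_gzip_parallel(path, arr, chunk_bytes=16 * 2**20, workers=None):
    # Sketch only: tobytes() copies the whole array into memory first.
    data = np.ascontiguousarray(arr).tobytes()
    chunks = [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]
    with ProcessPoolExecutor(max_workers=workers) as pool, open(path, "wb") as f:
        for member in pool.map(_gzip_member, chunks):
            f.write(member)

if __name__ == "__main__":  # guard needed for ProcessPoolExecutor's spawn start method
    arr = np.ones((64, 64, 64, 64), dtype=np.uint8)
    save_gzip_parallel("array.raw.gz", arr)
```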
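A sketch of the zstd route from the second comment, assuming the third-party `zstandard` package (pip install zstandard). `threads=-1` asks the library for one worker per logical CPU, and going through `np.lib.format.write_array` / `read_array` keeps the array's shape and dtype inside the compressed stream; the function names and compression level are placeholders:

```python
import numpy as np
import zstandard  # pip install zstandard

def save_zstd(path, arr, level=3, threads=-1):
    # Multi-threaded zstd compression of a standard .npy byte stream.
    cctx = zstandard.ZstdCompressor(level=level, threads=threads)
    with open(path, "wb") as f, cctx.stream_writer(f) as compressor:
        np.lib.format.write_array(compressor, arr)

def load_zstd(path):
    dctx = zstandard.ZstdDecompressor()
    with open(path, "rb") as f, dctx.stream_reader(f) as decompressor:
        return np.lib.format.read_array(decompressor)
```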
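Finally, a sketch of the run-length preprocessing idea from the last comment, in plain NumPy. The test data is only a guess at the distribution described earlier (98% 1-bytes, 2% 2-bytes); for real use the (values, lengths) pair would still be compressed and the original shape stored alongside:

```python
import numpy as np

def rle_encode(a):
    # Vectorized run-length encoding: long streaks of identical bytes
    # collapse into one value plus one run length.
    a = np.ascontiguousarray(a).ravel()
    starts = np.concatenate(([0], np.flatnonzero(a[1:] != a[:-1]) + 1))
    lengths = np.diff(np.concatenate((starts, [a.size]))).astype(np.uint32)
    return a[starts], lengths

def rle_decode(values, lengths):
    return np.repeat(values, lengths)

# Hypothetical test data matching the distribution described in the comments.
rng = np.random.default_rng(0)
data = rng.choice(np.array([1, 2], dtype=np.uint8), size=10_000_000, p=[0.98, 0.02])
values, lengths = rle_encode(data)
assert np.array_equal(rle_decode(values, lengths), data)
```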
