
I recently created a script that parses several web proxy logs into a tidy sqlite3 db file, and it is working great for me... with one snag: the file size. I have been pressed to use this format (a sqlite3 db), and Python handles it natively like a champ, so my question is this: what is the best form of string compression that I can use for db entries when file size is the sole concern? zlib? base-n? Klingon?

Any advice would help me loads. Again, this is just string compression for characters that are valid in URLs.
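To make the question concrete, this is the kind of per-entry compression I have in mind; a minimal sketch, with the table and column names invented for illustration:

    import sqlite3
    import zlib

    con = sqlite3.connect("proxy_logs.db")
    con.execute("create table if not exists entries(url blob)")

    url = "http://example.com/some/long/path?with=query&strings=here"

    # zlib.compress returns bytes, which sqlite3 stores as a BLOB.
    con.execute("insert into entries values (?)",
                (zlib.compress(url.encode("utf-8")),))
    con.commit()

    # Decompress on the way back out.
    for (blob,) in con.execute("select url from entries"):
        print(zlib.decompress(blob).decode("utf-8"))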

  • Sounds like there is some confusion about what you are looking for: A) compress the sqlite db file, or B) compress the values inserted into the sqlite db. Most seem to think you want A, but I suspect you want B. Commented Dec 1, 2009 at 23:35

3 Answers


Instead of inserting compression/decompression code into your program, you could store the table itself on a compressed drive.



Here is a page with an SQLite extension that provides compression.

This extension provides a function that can be called on individual fields.

Here is some of the example text from the page.

Create a test table:

sqlite> create table test(name varchar(20),surname varchar(20));

Insert some text into the test table by compressing it first; you can also compress binary content and insert it into a blob field:

sqlite> insert into test values(mycompress('This is a sample text'),mycompress('This is a sample text'));

This shows nothing readable, because the data is stored compressed in binary format:

sqlite> select * from test;

The following works; it uncompresses the data:

sqlite> select myuncompress(name),myuncompress(surname) from test;
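If building the C extension is not an option, the same pattern can be reproduced from Python with sqlite3.create_function and zlib; a sketch, reusing the extension's function names but with plain zlib underneath:

    import sqlite3
    import zlib

    con = sqlite3.connect(":memory:")

    # Register SQL functions so the statements above work unchanged from Python.
    con.create_function("mycompress", 1,
                        lambda s: zlib.compress(s.encode("utf-8")))
    con.create_function("myuncompress", 1,
                        lambda b: zlib.decompress(b).decode("utf-8"))

    con.execute("create table test(name blob, surname blob)")
    con.execute("insert into test values(mycompress('This is a sample text'),"
                "mycompress('This is a sample text'))")

    # Prints the original strings.
    for row in con.execute("select myuncompress(name), myuncompress(surname) from test"):
        print(row)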


What sort of parsing do you do before you put it in the database? I get the impression that it is fairly simple, with a single table holding each entry; if not, then my apologies.

Compression is all about removing duplication, and in a log file most of the duplication is between entries rather than within each entry, so compressing each entry individually is not going to be a huge win.

This is off the top of my head, so feel free to shoot it down in flames, but I would consider breaking the table into a set of smaller tables holding the individual parts of each entry. A log entry would then mostly consist of a timestamp (as a DATE type rather than a string) plus a set of indexes into other tables (e.g. requesting IP, request type, requested URL, browser type, etc.).

This would have a trade-off of course, since it would make the database a lot more complex to maintain, but on the other hand it would enable meaningful queries such as "show me all the unique IPs that requested page X in the last week".
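As a rough illustration of the idea (the schema is entirely invented; adjust the fields to whatever your parser actually extracts):

    import sqlite3
    import time

    con = sqlite3.connect("proxy_logs.db")

    # Lookup tables hold each distinct value once; log rows store small integer keys.
    con.executescript("""
    create table if not exists ips(id integer primary key, ip text unique);
    create table if not exists urls(id integer primary key, url text unique);
    create table if not exists entries(
        ts integer,                        -- timestamp as a number, not a string
        ip_id integer references ips(id),
        url_id integer references urls(id)
    );
    """)

    def lookup_id(table, column, value):
        # Insert the value if it is new, then return its id.
        con.execute(f"insert or ignore into {table}({column}) values (?)", (value,))
        row = con.execute(f"select id from {table} where {column} = ?", (value,))
        return row.fetchone()[0]

    con.execute("insert into entries values (?, ?, ?)",
                (int(time.time()),
                 lookup_id("ips", "ip", "10.0.0.42"),
                 lookup_id("urls", "url", "http://example.com/page-x")))
    con.commit()

    # "All unique IPs that requested page X in the last week":
    week_ago = int(time.time()) - 7 * 86400
    query = ("select distinct ips.ip from entries "
             "join ips on ips.id = entries.ip_id "
             "join urls on urls.id = entries.url_id "
             "where urls.url = ? and entries.ts >= ?")
    for (ip,) in con.execute(query, ("http://example.com/page-x", week_ago)):
        print(ip)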
