
I have a lot of spectra that I want to store in a database. A spectrum is basically an array of integers with, in my case, a variable length of typically 512 or 1024 values. How best to store these spectra? Along with each spectrum I want to store some additional data like a time and a label, which will be simple fields in my database. The spectra will not be retrieved often, and when I need them, I need them as a whole.

For storing the spectra I can think of 2 possible solutions:

  • Storing them as a string, like "1,7,9,3,..."
  • Storing the spectra in a separate table, with each value in a separate row, containing fields like spectrum_id, index and value

Any suggestions on which one to use? Other solutions are much appreciated of course!

  • I guess a stored spectrum is well represented by a chart with wavelength on the x axis and power, or photon count, or something similar on the y axis. Does every spectrum you store share the same x-axis values? Do you need to store and retrieve information describing the x-axis for individual spectra? Are you going to want to look things up with queries like "find the spectra with measurements > 250 at 550 nm"? Commented Aug 4, 2015 at 13:27

2 Answers


Your first solution is a common mistake when people transition from the procedural/OO programming mindset to the database mindset: it's all about efficiency, fetching the fewest records, and so on. The database world requires a different paradigm for storing and retrieving data.

Here's how I'd do it: make 2 tables:

spectra
---------
spectra_id (primary key)
label
time

spectra_detail
---------
spectra_id
index
value

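For reference, a minimal DDL sketch of these two tables (assuming MySQL; the column types are only illustrative, and I use idx instead of index because INDEX is a reserved word):

CREATE TABLE spectra (
    spectra_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    label       VARCHAR(255),
    time        DATETIME
);

CREATE TABLE spectra_detail (
    spectra_id  INT UNSIGNED NOT NULL,
    idx         SMALLINT UNSIGNED NOT NULL,  -- position within the spectrum (0..1023)
    value       INT NOT NULL,                -- measured value at that position
    PRIMARY KEY (spectra_id, idx),
    FOREIGN KEY (spectra_id) REFERENCES spectra (spectra_id)
);
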
To retrieve them:

SELECT     *
FROM       spectra s
INNER JOIN spectra_detail sd ON s.spectra_id = sd.spectra_id
WHERE      s.spectra_id = 42

2 Comments

Thanks. Your solution is basically what I had in mind with my 2nd option. Although this option is more complex at first glance, I agree that it is probably more efficient. I was surprised, however, to see how often the serialize function is used in PHP code to convert arrays to strings before writing them to a database.
The serialized approach is better if you don't need to query by that data.

If you have a small dataset (hundreds of MB), there is no problem in using an SQL DBMS with any of the alternatives.

As Maciej suggested, serialization is an improvement over the other alternative, since it lets you group each spectrum sweep into a single tuple (a row in a table), reducing the overhead of keys and other bookkeeping information.

For the serialization, you may consider using objects such as linestring or multipoint so that you can process the data with SQL functions. This requires some scaling of the values, but it allows querying the data, and if you use WKB you may also achieve a relevant gain in storage with little loss in performance.
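A minimal sketch of that idea, assuming MySQL 5.7+ spatial functions (the table, names and values are made up; the channel index goes on X and the scaled measurement on Y):

CREATE TABLE spectra_geo (
    spectra_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    label       VARCHAR(255),
    time        DATETIME,
    sweep       LINESTRING NOT NULL
);

-- Store one whole sweep as a single linestring value
INSERT INTO spectra_geo (label, time, sweep)
VALUES ('sample A', NOW(),
        ST_GeomFromText('LINESTRING(0 1, 1 7, 2 9, 3 3)'));

-- Read a single point without shipping the whole array to the client;
-- ST_PointN is 1-based, so this returns the value at channel index 2
SELECT ST_Y(ST_PointN(sweep, 3)) AS value
FROM   spectra_geo
WHERE  spectra_id = 1;

ST_AsBinary(sweep) would give you the WKB form mentioned above if you need the raw binary representation.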

The problem is that spectrum data tends to accumulate, and storage usage can become an issue that the serialization trick will not easily solve. You should consider this carefully in your project.

Working on a similar problem, I came to the conclusion that it is a bad idea to use any SQL DBMS (MySQL, SQL Server, PostgreSQL, and the like) to manage large numerical matrix data, such as spectrum sweep measurements. It's a bit like trying to build an image library CMS by storing images pixel by pixel in a database.

The following table compares a few formats from my experiment. It may help illustrate the problem of using an SQL DBMS to store numerical data matrices.

    Format            Layout                                                    Size (bytes)
    MySQL table       Table with key INT(10) and value DECIMAL(4,1)            1 157 627 904
    TXT               CSV, DECIMAL(4,1), equivalent to 14 bit                     276 895 606
    BIN (original)    Matrix, 1 byte x 51200 columns x 773 rows + metadata         40 038 580
    HDF5              Matrix, 3 bytes x 51200 columns x 773 rows + metadata        35 192 973
    TXT + Zip         CSV, DECIMAL(4,1) + standard zip compression                 34 175 971
    PNG RGBA          Matrix, 4 bytes x 51200 columns x 773 rows                   33 997 095
    ZIP (BIN)         Original BIN file compressed with standard zip               26 028 780
    PNG 8-bit indexed Matrix, 1 byte x 51200 columns x 773 rows + color scale      25 947 324

The MySQL example didn't use any serialization. I didn't try it, but one may expect the occupied storage to drop to roughly half by using WKT linestrings or similar features. Even so, the storage used would be almost double that of the corresponding CSV and more than 20 times the size of an 8-bit PNG with the same data.

These numbers are to be expected once you consider how much extra data an SQL DBMS stores for keys and search optimization.

As a closing remark, I would suggest using PNG, TIFF, HDF5, or any other digital format that better suits your front end to store the spectrum data (or any other large matrix), and perhaps an SQL DBMS for the dimensions around this core data: who measured, when, with which equipment, to what end, etc. In short, keep the files as BLOBs within the database or outside it, whichever better suits your system architecture.
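A sketch of that split, with the bulk data kept as a BLOB or as a path to an external file (the column names and types here are assumptions, not a prescription):

CREATE TABLE spectra_files (
    spectra_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    label       VARCHAR(255),
    time        DATETIME,
    operator    VARCHAR(255),   -- who measured
    equipment   VARCHAR(255),   -- with which equipment
    payload     MEDIUMBLOB,     -- the HDF5/PNG/raw binary file itself, or NULL ...
    file_path   VARCHAR(1024)   -- ... with a path to external storage used instead
);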

Alternatively, it is worthwhile to consider a big data solution built around a digital format such as HDF5. Each tool to its end.

