I'm working on an IoT application where a bunch of devices send readings roughly every minute to an AWS Aurora MySQL 5.6 (InnoDB) instance, a db.t2.medium (2 vCPUs, 4 GB RAM). SELECT queries that fetch by device id and sensor type have been taking longer and longer, and I'm guessing it's because we're outgrowing the instance size.
The table we query has about 60 million rows, and because of the way our graphing feature displays the data, we fetch all historical data rather than paginating; I suspect this is also part of the problem. An example query looks like this, returning about 500K rows in around 5-8 seconds:
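SELECT *
FROM readings
WHERE device_id = 1234
  AND sensor_type = 'pressure'
  AND time >= 1644092837
  AND time <= 1646684837;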
The readings table has four columns: device_id, sensor_type, time (a Unix timestamp stored as an int), and value. There is a composite index (the primary key) on device_id, sensor_type, and time.
My main question is: how have people handled returning a large number of rows from an already large table? This table is only going to keep growing given how frequently the sensors send data. I've considered having a readings table per device (sketched below), but I'm not comfortable with potentially thousands of tables, especially if we ever have to add or edit a column.
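Roughly what the per-device idea would look like, with a hypothetical table name derived from the current schema:

CREATE TABLE readings_device_1234 (
  -- one table per device, so device_id is dropped from the key
  sensor_type char(5) CHARACTER SET ascii NOT NULL DEFAULT '',
  time int(11) unsigned NOT NULL,
  value float NOT NULL,
  PRIMARY KEY (sensor_type, time)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;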
I'm also wondering how people have handled scaling up a database in an IoT use case, because I'm concerned our AWS bill is going to get very expensive if we just keep increasing the instance size / RAM.
Table definition (added from a comment):
CREATE TABLE readings (
  device_id int(11) unsigned NOT NULL AUTO_INCREMENT,
  sensor_type char(5) CHARACTER SET ascii NOT NULL DEFAULT '',
  time int(11) unsigned NOT NULL,
  value float NOT NULL,
  PRIMARY KEY (device_id,sensor_type,time)
) ENGINE=InnoDB AUTO_INCREMENT=48025983 DEFAULT CHARSET=latin1