
In my program I need to query through metadata.

I read data into a numpy record array A from a CSV-like text file **without duplicate rows**:

var1|var2|var3|var4|var5|var6
'a1'|'b1'|'c1'|1.2|2.2|3.4
'a1'|'b1'|'c4'|3.2|6.2|3.2
'a2'|''|'c1'|1.4|5.7|3.8
'a2'|'b1'|'c2'|1.2|2.2|3.4
'a3'|''|'c2'|1.2|2.2|3.4
'a1'|'b2'|'c4'|7.2|6.2|3.2
...

There are millions of rows, and the query runs in nested loops up to a billion times (mostly matching the first 3 columns), so efficiency becomes critical.

There are 3 types of queries and the first one is the most frequent.
  • Get rows matching one or more of the first 3 columns with given strings, e.g.,

    • To match a record where var1='a2' and var2='b1',

       ind = np.logical_and(A['var1']=='a2', A['var2']=='b1')
      
    • To match a record where var1='a2', var2='b1' and var3='c1',

       ind = np.logical_and(np.logical_and(A['var1']=='a2', A['var2']=='b1'), A['var3']=='c1')
      

As one can see, each query compares all elements of the columns with the given strings.

I thought a mapping could be a more efficient way of indexing, so I converted the recarray A to a dict D = {'var1_var2_var3': [var4, var5, var6], ...} and searched through the keys with fnmatch(keys, pat). I'm not sure it's a better way.

Or I can make a hierarchical dict {'var1': {'var2': {'var3': [], ...}, ...}, ...} or an in-memory HDF5 layout /var1/var2/var3 and just try to get the item if it exists. Is this the fastest way?
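
For concreteness, here is a rough sketch of the flat-key dict I describe (assuming A is the recarray above and each label triple occurs only once):

    from fnmatch import fnmatch

    # flat string key per row, numeric columns as the value
    D = {'_'.join((row['var1'], row['var2'], row['var3'])):
             (row['var4'], row['var5'], row['var6'])
         for row in A}

    D.get('a2_b1_c2')                                          # exact match: one dict lookup
    hits = [v for k, v in D.items() if fnmatch(k, 'a2_b1_*')]  # partial match still scans every key

The exact lookup is fast, but the fnmatch-style partial match still walks every key, which is probably why it did not feel faster; the hierarchical dict would avoid that by descending one level per given column.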

The latter two types of queries are not very frequent, and for them I can accept the plain numpy recarray comparison.

  • Get all rows where the numeric values in the latter columns fall within a specific range, e.g.,

    • get rows where 1 < var4 < 3 and 0 < var5 < 3,

      ind = np.logical_and(1<A['var4']<3, 0<A['var5']<3)
      
  • A combination of the above two, e.g.,

    • get rows where var2='b1', 1 < var4 < 3 and 0 < var5 < 3,

      ind = np.logical_and(np.logical_and(A['var2']=='b1', 1<A['var4']<3), 0<A['var5']<3)
      

SQL could be a good way, but it looks too heavy to use a database for this small task, and I don't have the authority to install database support everywhere.

Any suggestions for a data structure for fast in-memory queries? (If a simple custom implementation is hard, sqlite and pandas.DataFrame seem to be possible solutions, as suggested.)

Comments
  • Have you looked at pandas.pydata.org/pandas-docs/stable/generated/… ?
  • Explain your numpy efficiency worries more? As for a database solution, sqlite3 can be used in-memory.
  • @hpaulj every query requires comparing all of the millions of items with the given values, which looks like a big waste. I have tried a dict, but the regexp matching does not seem faster. If a simple custom Python implementation is hard, sqlite and pandas.DataFrame seem to be possible solutions.
  • In numpy the bilateral tests have to be written as (1<A['var4'])&(A['var4']<3)

2 Answers


With your file sample (b prefix for Python 3):

In [51]: txt=b"""var1|var2|var3|var4|var5|var6
    ...: 'a1'|'b1'|'c1'|1.2|2.2|3.4
    ...: 'a1'|'b1'|'c4'|3.2|6.2|3.2
    ...: 'a2'|''|'c1'|1.4|5.7|3.8
    ...: 'a2'|'b1'|'c2'|1.2|2.2|3.4
    ...: 'a3'|''|'c2'|1.2|2.2|3.4
    ...: 'a1'|'b2'|'c4'|7.2|6.2|3.2"""

A simple read leaves me with the double layer of quoting

data = np.genfromtxt(txt.splitlines(), names=True, delimiter='|', dtype=None)

array([(b"'a1'", b"'b1'", b"'c1'", 1.2, 2.2, 3.4), ...
    dtype=[('var1', 'S4'), ('var2', 'S4'), ('var3', 'S4'), ('var4', '<f8'), ('var5', '<f8'), ('var6', '<f8')])

So I'll define a converter to strip those (a csv reader might do it as well):

def foo(astr):
    return eval(astr)

In [55]: A = np.genfromtxt(txt.splitlines(), names=True, delimiter='|', dtype='U3,U3,U3,f8,f8,f8', converters={0:foo,1:foo,2:foo})
In [56]: A
Out[56]: 
array([('a1', 'b1', 'c1', 1.2, 2.2, 3.4),
       ('a1', 'b1', 'c4', 3.2, 6.2, 3.2), 
       ('a2', '', 'c1', 1.4, 5.7, 3.8),
       ('a2', 'b1', 'c2', 1.2, 2.2, 3.4), 
       ('a3', '', 'c2', 1.2, 2.2, 3.4),
       ('a1', 'b2', 'c4', 7.2, 6.2, 3.2)], 
      dtype=[('var1', '<U3'), ('var2', '<U3'), ('var3', '<U3'), ('var4', '<f8'), ('var5', '<f8'), ('var6', '<f8')])

and I can write tests like

In [57]: (A['var1']=='a2')&(A['var2']=='b1')
Out[57]: array([False, False, False,  True, False, False], dtype=bool)
In [58]: (1<A['var4'])&(A['var4']<3)
Out[58]: array([ True, False,  True,  True,  True, False], dtype=bool)

The tests over all records of A are being done in compiled numpy code, so they shouldn't be that slow.

This data could also be viewed as 2 multicolumn fields

In [59]: dt = np.dtype([('labels', '<U3', (3,)), ('data', '<f8', (3,))])
In [60]: A1 = A.view(dt)
In [61]: A1
Out[61]: 
array([(['a1', 'b1', 'c1'], [1.2, 2.2, 3.4]),
       (['a1', 'b1', 'c4'], [3.2, 6.2, 3.2]),
       (['a2', '', 'c1'], [1.4, 5.7, 3.8]),
       (['a2', 'b1', 'c2'], [1.2, 2.2, 3.4]),
       (['a3', '', 'c2'], [1.2, 2.2, 3.4]),
       (['a1', 'b2', 'c4'], [7.2, 6.2, 3.2])], 
      dtype=[('labels', '<U3', (3,)), ('data', '<f8', (3,))])

Or loaded directly with

A = np.genfromtxt(txt.splitlines(), skip_header=1, delimiter='|', dtype='(3)U3,(3)f8', converters={0:foo,1:foo,2:foo})

Then tests could be written as:

In [64]: (A1['labels'][:,0]=='a1') & (A1['labels'][:,1]=='b2') & ((A1['data']<6).any(axis=1))
Out[64]: array([False, False, False, False, False,  True], dtype=bool)

In [65]: (A1['labels'][:,[0,1]]==['a1','b2']).all(axis=1)
Out[65]: array([False, False, False, False, False,  True], dtype=bool)

Sometimes it might be clearer to give individual columns their own id:

var1 = A1['labels'][:,0]   # or A['var1']
....
(var1=='a1')&(var2=='b1')&...

The masks from repeated queries, or combinations of them, could be saved and reused.
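
A minimal sketch of that, with the A built above (the mask names are mine):

# compute each mask once, then reuse and combine as needed
mask_labels = (A['var1']=='a2') & (A['var2']=='b1')
mask_range  = (1<A['var4']) & (A['var4']<3)

rows = A[mask_labels & mask_range]       # combined query
idx  = np.flatnonzero(mask_labels)       # or keep just the row indices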

I believe pandas stores its series in numpy arrays, with a different dtype for each column (and object dtype if types vary within a column). But I haven't seen much discussion of pandas speed and speed tricks. I don't expect much speed improvement unless it provides for some sort of indexing.
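
For what it's worth, pandas can index on the label columns; a rough sketch (mine, not timed against this data), using the A from above:

import pandas as pd

df = pd.DataFrame(A)                                       # columns var1..var6
dfi = df.set_index(['var1', 'var2', 'var3']).sort_index()

dfi.loc[('a2', 'b1')]          # all rows with var1='a2', var2='b1'
dfi.loc[('a2', 'b1', 'c2')]    # fully specified labels

Whether a sorted MultiIndex lookup actually beats the boolean masks on millions of rows is something only timing on the real data can settle.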

I can imagine writing this data to a database. sqlite3 is builtin and has a memory mode so you don't need file access. But I'm sufficiently out of practice with that code that I'll pass on demonstrating it. Nor do I have a sense of how easy or fast it is to do these sorts of queries.

https://mail.scipy.org/pipermail/scipy-user/2007-August/013350.html has some code that can save a structured array to a sqlite3 database. It includes a function that converts a dtype into a table creation statement.

====================

I've got that pipermail example working with python3. The test example has 11 fields. With 5000 records,

data[np.where(data['id']=='id2000')]

is 6x faster than a corresponding sqlite3 query (with an existing cursor):

cursor.execute('select * from data where id=?',('id2000',))
cursor.fetchone()
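
For reference, a bare-bones sketch (mine, not the pipermail code) of the in-memory route for the question's 6-column table; the index on the label columns is what would keep repeated lookups cheap:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (var1 TEXT, var2 TEXT, var3 TEXT, '
             'var4 REAL, var5 REAL, var6 REAL)')
conn.execute('CREATE INDEX idx_labels ON data (var1, var2, var3)')
conn.executemany('INSERT INTO data VALUES (?,?,?,?,?,?)', A.tolist())
conn.commit()

cur = conn.execute('SELECT * FROM data WHERE var1=? AND var2=?', ('a2', 'b1'))
cur.fetchall()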

Use Pandas; it is built for tasks like this:

# Import
import pandas as pd

# Read the pipe-delimited file (single-quoted strings)
df = pd.read_csv('/path/to/file.csv', sep='|', quotechar="'")

# Selection criteria
# using `.query` method:
df.query('(var1 == "a2") & (var3 == "c1")')
df.query('(var2 == "b1") & (1 < var4 < 3) & (0 < var5 < 3)')
# using indexing:
df[(df['var1'] == 'a2') & (df['var3'] == 'c1')]
df[(df['var2'] == 'b1') & df['var4'].between(1,3) & df['var5'].between(0,3)]
# using `.where` method:
df.where((df['var1'] == 'a2') & (df['var3'] == 'c1'))
df.where((df['var2'] == 'b1') & df['var4'].between(1,3) & df['var5'].between(0,3))

More indexing and selecting information

Comments

I guess the query in pandas.DataFrame is the same as the numpy recarray comparison under the hood? Or is it optimized with some tricks?
pandas tends to use numpy arrays for columns, though the dtype might not be structured (recarray). It does have faster code for tasks like loading a CSV, but I doubt if there's much of a speed difference in tasks like this. But you just might have to do your own speed tests.
@Lee, Pandas parses queries using numexpr, which is allegedly quite efficient at these things. Like @hpaulj says, you will have to do your own tests. The great thing though is that you can create masks out of your conditions. So if you have a fixed combination that you want to index, you can simply create a mask, and reuse that in your program. Ex: mask = (df['var1'] == 'a2') & (df['var3'] == 'c1') is a mask that you can use to index into your data frame like so: df[mask]. Potentially, you make comparisons once, and reuse the results.
Further, @Lee, you may not even need masking and indexing in pandas. There is the groupby method, for instance, which can group your data on different criteria and apply any number of functions to the members of each group.
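
To make the last suggestion concrete, a tiny sketch (using the df read above and the question's column names):

grouped = df.groupby(['var1', 'var2', 'var3'])   # group on the label columns
grouped['var4'].mean()                           # e.g. mean of var4 per group
grouped.get_group(('a2', 'b1', 'c2'))            # or pull one group's rows directly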
