In my program I need to query through metadata.
I read the data into a numpy record array `A` from a CSV-like text file **without duplicate rows**:
```
var1|var2|var3|var4|var5|var6
'a1'|'b1'|'c1'|1.2|2.2|3.4
'a1'|'b1'|'c4'|3.2|6.2|3.2
'a2'|''|'c1'|1.4|5.7|3.8
'a2'|'b1'|'c2'|1.2|2.2|3.4
'a3'|''|'c2'|1.2|2.2|3.4
'a1'|'b2'|'c4'|7.2|6.2|3.2
...
```
There are millions of rows, and the query sits inside nested loops that may run up to a billion times (mostly matching on the first 3 columns), so efficiency is critical.
There are 3 types of queries, and the first one is the most frequent: get rows matching one or more of the first 3 columns against given strings, e.g.,
To match a record where `var1='a2'` and `var2='b1'`:

```python
ind = np.logical_and(A['var1'] == 'a2', A['var2'] == 'b1')
```

To match a record where `var1='a2'`, `var2='b1'` and `var3='c1'`:

```python
ind = np.logical_and(np.logical_and(A['var1'] == 'a2', A['var2'] == 'b1'),
                     A['var3'] == 'c1')
```
As one can see, each query compares all elements of the columns against the given strings.
I thought a mapping could be a more efficient way of indexing, so I converted the recarray `A` to a dict `D = {'var1|var2|var3': [var4, var5, var6], ...}` and searched through the keys with `fnmatch(keys, pat)`. I'm not sure it's a better way.
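A minimal sketch of that flat-dict idea, assuming the string fields are read as `str` (a `'U'` dtype, not bytes) and using `'|'` as the key separator (both my choices, not fixed):

```python
import fnmatch

# Build the flat index once: 'var1|var2|var3' -> the numeric columns.
# Assumes the rows are unique in the first three columns.
D = {'|'.join((r['var1'], r['var2'], r['var3'])): (r['var4'], r['var5'], r['var6'])
     for r in A}

# Exact match on all three columns is a single O(1) hash lookup:
row = D.get('a2|b1|c1')

# Matching only some columns needs a pattern scan over all keys,
# which is O(n) again:
keys = fnmatch.filter(D, 'a2|b1|*')
rows = [D[k] for k in keys]
```

So the dict only pays off for exact matches on all three columns; any wildcard query still walks every key.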
Or I can build a hierarchical dict `{'var1': {'var2': {'var3': [], ...}, ...}, ...}`, or an in-memory HDF5 hierarchy `/var1/var2/var3`, and just try to get the item if it exists. Would this be the fastest way?
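A minimal sketch of the nested-dict variant, under the same assumptions (string keys, unique `(var1, var2, var3)` triples):

```python
# Build the hierarchical index once.
H = {}
for r in A:
    H.setdefault(r['var1'], {}).setdefault(r['var2'], {})[r['var3']] = (
        r['var4'], r['var5'], r['var6'])

# Full match on all three columns: three O(1) lookups.
try:
    row = H['a2']['b1']['c1']
except KeyError:
    row = None

# Prefix match (var1 and var2 given): returns the small sub-dict directly.
rows = H.get('a2', {}).get('b1', {})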
Note the nesting only helps when the given columns form a prefix of the nesting order; matching `var2` alone still means scanning the whole first level.

The latter two types of queries are not very frequent, and for them I can accept the numpy recarray comparison approach.
Get all rows where the numeric values in the latter columns fall in a specific range, e.g., get rows where `1 < var4 < 3` and `0 < var5 < 3`:

```python
ind = np.logical_and((1 < A['var4']) & (A['var4'] < 3),
                     (0 < A['var5']) & (A['var5'] < 3))
```
A combination of the above two, e.g., get rows where `var2='b1'`, `1 < var4 < 3` and `0 < var5 < 3`:

```python
ind = np.logical_and(np.logical_and(A['var2'] == 'b1',
                                    (1 < A['var4']) & (A['var4'] < 3)),
                     (0 < A['var5']) & (A['var5'] < 3))
```
SQL could be a good way, but using a database looks too heavy for this small task, and I don't have the authority to install database support everywhere.
Any suggestions for a data structure that supports fast in-memory queries? (If a simple custom implementation is hard, `sqlite3` and `pandas.DataFrame` seem to be possible solutions, as suggested.)
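If `pandas` turns out to be acceptable, a rough sketch of what it could look like, assuming the file is named `data.txt` (my placeholder) and a sorted `MultiIndex` over the first three columns:

```python
import pandas as pd

# Read the file and index by the three string columns; sorting the index
# lets .loc use binary search instead of a full scan.
df = pd.read_csv('data.txt', sep='|', quotechar="'")
df = df.set_index(['var1', 'var2', 'var3']).sort_index()

# Frequent query: match on a prefix of the index levels.
sub = df.loc[('a2', 'b1')]          # all rows with var1='a2', var2='b1'

# Mixed query: string match plus numeric ranges.
sub = df.loc[('a2', 'b1')]
sub = sub[(sub['var4'] > 1) & (sub['var4'] < 3)]
```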
Edit: as noted in the comments, `sqlite3` can be used in-memory, so a database solution need not be heavy. With the dict approach, regexp matching over the keys does not seem to be faster. Also, with numpy the bilateral tests have to be written as `(1 < A['var4']) & (A['var4'] < 3)`.
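For completeness, a minimal in-memory `sqlite3` sketch; the table and index names are my own placeholders, and `sqlite3` ships with the Python standard library, so nothing extra needs installing:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE meta (var1 TEXT, var2 TEXT, var3 TEXT, '
            'var4 REAL, var5 REAL, var6 REAL)')
# One composite index serves exact and prefix matches on the first 3 columns.
con.execute('CREATE INDEX idx_meta ON meta (var1, var2, var3)')
# A.tolist() yields one plain tuple per record of the structured array.
con.executemany('INSERT INTO meta VALUES (?, ?, ?, ?, ?, ?)', A.tolist())

# Frequent query: prefix match on the string columns uses the index.
rows = con.execute('SELECT * FROM meta WHERE var1 = ? AND var2 = ?',
                   ('a2', 'b1')).fetchall()

# Mixed query: string equality plus numeric ranges.
rows = con.execute('SELECT * FROM meta WHERE var2 = ? '
                   'AND var4 > ? AND var4 < ? AND var5 > ? AND var5 < ?',
                   ('b1', 1, 3, 0, 3)).fetchall()
```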