Is there an implementation where I can join two arrays based on their keys? And is the canonical way to store the keys in one of the array's columns (NumPy arrays don't have an 'id' or 'rownames' attribute)?
2 Answers
If you want to use only NumPy, you can use structured arrays and the numpy.lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A small example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas (see its documentation). I have no experience with it myself, but it provides more powerful data structures and functionality than plain NumPy, "to make working with 'relational' or 'labeled' data both easy and intuitive". It certainly has joining and merging functions (see for example http://pandas.sourceforge.net/merging.html#joining-on-a-key).
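For comparison, here is a minimal sketch of the same inner join done with pandas (assuming pandas is available; the DataFrame construction from the structured arrays and the merge call below are standard pandas usage, not taken from the answer above):

import numpy as np
import pandas as pd

a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])

# Structured arrays convert directly to DataFrames; the field names become columns.
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)

# Inner join on the shared 'id' column, analogous to the join_by call above.
print(pd.merge(df_a, df_b, on='id', how='inner'))
#    id     A      B
# 0   2  20.0  200.0
# 1   3  30.0  300.0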
If you have any duplicates in the joined key fields, you should use pandas.merge instead of recfunctions. Per the docs (as mentioned by @joris, http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html):
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
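A minimal sketch of the difference (the data here is made up purely for illustration): with a duplicated id, pandas.merge simply produces one output row per matching pair, whereas join_by's documentation gives no such guarantee.

import pandas as pd

# 'id' 2 appears twice on the left; pandas.merge keeps one row per matching pair.
left = pd.DataFrame({'id': [1, 2, 2], 'A': [10., 20., 21.]})
right = pd.DataFrame({'id': [2, 3], 'B': [200., 300.]})

print(pd.merge(left, right, on='id', how='inner'))
#    id     A      B
# 0   2  20.0  200.0
# 1   2  21.0  200.0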
In my case, I absolutely want duplicate keys. I'm comparing the rows of each column with the rows of all the other columns, inclusive (or, thinking like a database person, an inner join without an ON or WHERE clause, i.e. a cross join). Translated into a loop, it looks something like this:
for i in a:
    for j in a:
        print(i, j, i * j)
Such procedures are frequent in data mining operations.
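One way to express that Cartesian product with pandas is to merge a frame with itself on a throwaway constant key (a sketch using a made-up single-column frame; recent pandas versions also accept merge(..., how='cross') directly):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Merging on a constant dummy key pairs every row with every row,
# i.e. a cross join / Cartesian product.
df['_key'] = 1
pairs = pd.merge(df, df, on='_key', suffixes=('_i', '_j')).drop(columns='_key')
pairs['product'] = pairs['a_i'] * pairs['a_j']
print(pairs)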