Is there an implementation where I can join two arrays based on their keys? And is the canonical way to store the keys in one of the array's columns (NumPy arrays don't have an 'id' or 'rownames' attribute)?
2 Answers
If you want to use only NumPy, you can use structured arrays and the numpy.lib.recfunctions.join_by function (see http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html). A small example:
In [1]: import numpy as np
...: import numpy.lib.recfunctions as rfn
...: a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
...: b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])
In [2]: rfn.join_by('id', a, b, jointype='inner', usemask=False)
Out[2]:
array([(2, 20.0, 200.0), (3, 30.0, 300.0)],
dtype=[('id', '<i4'), ('A', '<f8'), ('B', '<f8')])
Another option is to use pandas (see its documentation). I have no experience with it myself, but it provides more powerful data structures and functionality than plain NumPy, "to make working with 'relational' or 'labeled' data both easy and intuitive". It certainly has joining and merging functions (see for example http://pandas.sourceforge.net/merging.html#joining-on-a-key).
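For comparison, here is a minimal sketch of the same inner join done with pandas (assuming pandas is available; the DataFrame construction from the structured arrays and the merge call below are standard pandas usage, not taken from the answer above):

import numpy as np
import pandas as pd

a = np.array([(1, 10.), (2, 20.), (3, 30.)], dtype=[('id', int), ('A', float)])
b = np.array([(2, 200.), (3, 300.), (4, 400.)], dtype=[('id', int), ('B', float)])

# Structured arrays convert directly to DataFrames; the field names become columns.
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)

# Inner join on the shared 'id' column, analogous to the join_by call above.
print(pd.merge(df_a, df_b, on='id', how='inner'))
#    id     A      B
# 0   2  20.0  200.0
# 1   3  30.0  300.0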
If you have any duplicates in the joined key fields, you should use pandas.merge instead of recfunctions. Per the docs (as mentioned by @joris, http://pyopengl.sourceforge.net/pydoc/numpy.lib.recfunctions.html):
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
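A minimal sketch of the difference (the data here is made up purely for illustration): with a duplicated id, pandas.merge simply produces one output row per matching pair, whereas join_by's documentation gives no such guarantee.

import pandas as pd

# 'id' 2 appears twice on the left; pandas.merge keeps one row per matching pair.
left = pd.DataFrame({'id': [1, 2, 2], 'A': [10., 20., 21.]})
right = pd.DataFrame({'id': [2, 3], 'B': [200., 300.]})

print(pd.merge(left, right, on='id', how='inner'))
#    id     A      B
# 0   2  20.0  200.0
# 1   2  21.0  200.0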
In my case, I absolutely want duplicate keys. I'm comparing the rows of each column with the rows of all the other columns, inclusive (or, thinking like a database person, an inner join without an ON or WHERE clause, i.e. a cross join). Translated into a loop, it looks something like this:
for i in a:
    for j in a:
        print(i, j, i * j)
Such procedures are frequent in data mining operations.
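One way to express that Cartesian product with pandas is to merge a frame with itself on a throwaway constant key (a sketch using a made-up single-column frame; recent pandas versions also accept merge(..., how='cross') directly):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Merging on a constant dummy key pairs every row with every row,
# i.e. a cross join / Cartesian product.
df['_key'] = 1
pairs = pd.merge(df, df, on='_key', suffixes=('_i', '_j')).drop(columns='_key')
pairs['product'] = pairs['a_i'] * pairs['a_j']
print(pairs)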