Split a pandas dataframe by a list of values from another data frame

Question

I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...

I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower frequency demarcation points, call this B. I would like to append a column to A that would display 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.

As said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.

score 2 · Accepted Answer · 2016-11-12 10:04:17Z

2

Here is a quick and dirty approach using a list comprehension.

>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2)})

>>> A = df.A.values.tolist()
A: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.5, 2.6, 2.8]

>>> B = np.arange(0, 3, 1).tolist()
B: [0, 1, 2]

>>> BA = [k for k in range(0, len(B)-1) for a in A if (B[k]<=a) & (B[k+1]>a) or (a>max(B))]
BA: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

edited Nov 12, 2016 at 10:04

answered Nov 4, 2016 at 3:08

user4322543

Sign up to request clarification or add additional context in comments.

Comments

unutbu · Accepted Answer · 2016-11-04 02:19:20Z

Use searchsorted:

A['group'] = B['timestamp'].searchsorted(A['timestamp'])

For each value in A['timestamp'], an index value is returned. That index indicates where amongst the sorted values in B['timestamp'] that value from A would be inserted into B in order to maintain sorted order.

For example,

import numpy as np
import pandas as pd
np.random.seed(2016)

N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
#    timestamp
# 0   1.739869
# 1   2.467790
# 2   2.863659
# 3   3.295505
# 4   5.106419
# 5   6.872791
# 6   7.080834
# 7   9.909320
# 8  11.027117
# 9  12.383085

A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)

yields

   timestamp  group
0   0.896705      0
1   1.626945      0
2   2.410220      1
3   3.151872      3
4   3.613962      4
5   4.256528      4
6   4.481392      4
7   5.189938      5
8   5.937064      5
9   6.562172      5

Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).

You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use

B['timestamp'].searchsorted(A['timestamp'], side='left')

if you want searchsorted to return i when B['timestamp'][i] <= A['timestamp'][i] <= B['timestamp'][i+1]. Use

B['timestamp'].searchsorted(A['timestamp'], side='right')

if you want searchsorted to return i+1 in that situation. If you don't specify side, then side='left' is used by default.

Collectives™ on Stack Overflow

Split a pandas dataframe by a list of values from another data frame

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related