1

I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...

I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower frequency demarcation points, call this B. I would like to append a column to A that would display 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.

As said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.

2 Answers 2

2

Here is a quick and dirty approach using a list comprehension.

>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2)})

>>> A = df.A.values.tolist()
A: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.5, 2.6, 2.8]

>>> B = np.arange(0, 3, 1).tolist()
B: [0, 1, 2]

>>> BA = [k for k in range(0, len(B)-1) for a in A if (B[k]<=a) & (B[k+1]>a) or (a>max(B))]
BA: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Sign up to request clarification or add additional context in comments.

Comments

2

Use searchsorted:

A['group'] = B['timestamp'].searchsorted(A['timestamp'])

For each value in A['timestamp'], an index value is returned. That index indicates where amongst the sorted values in B['timestamp'] that value from A would be inserted into B in order to maintain sorted order.

For example,

import numpy as np
import pandas as pd
np.random.seed(2016)

N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
#    timestamp
# 0   1.739869
# 1   2.467790
# 2   2.863659
# 3   3.295505
# 4   5.106419
# 5   6.872791
# 6   7.080834
# 7   9.909320
# 8  11.027117
# 9  12.383085

A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)

yields

   timestamp  group
0   0.896705      0
1   1.626945      0
2   2.410220      1
3   3.151872      3
4   3.613962      4
5   4.256528      4
6   4.481392      4
7   5.189938      5
8   5.937064      5
9   6.562172      5

Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).


You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use

B['timestamp'].searchsorted(A['timestamp'], side='left')

if you want searchsorted to return i when B['timestamp'][i] <= A['timestamp'][i] <= B['timestamp'][i+1]. Use

B['timestamp'].searchsorted(A['timestamp'], side='right')

if you want searchsorted to return i+1 in that situation. If you don't specify side, then side='left' is used by default.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.