Given the following data frame:

  date        type       price         
20150101       X           0.8
20150102       X           0.9
20150103       X           1.0
20150104       X           0.9
20150105       abc         12.3
20150106       abc         12.4
20150107       abc         12.4
20150108       X           0.7
20150109       X           0.6
20150110       X           0.9
20150111       abc         12.3
20150112       abc         12.4
20150113       X           0.5
20150114       X           0.6
20150115       abc         12.3
20150116       abc         12.4

The data consists of alternating clusters of 'X' prices and 'abc' prices. I want to calculate a new column (call it 'position') based on the entries in 'type' and 'price', with the following rules:

1. 'position' = 0  if 'type'=='X'
2. 'position' = 1  if 'type'=='abc' and max of price of X in the 'previous section' is >=1
3. 'position' = -1  if 'type'=='abc' and min of price of X in the 'previous section' is <=0.5
4. 'position' = 0  if 'type'=='abc' and neither of the two conditions above holds

Note: the "previous section" is the cluster of consecutive 'X' prices that immediately precedes a section of 'abc' prices. For example:

  for 20150105-20150107  previous section is 20150101-20150104
  for 20150111-20150112  previous section is 20150108-20150110
  for 20150115-20150116  previous section is 20150113-20150114

so that I can create the following data frame:

  date        type       price     position   
20150101       X           0.8         0
20150102       X           0.9         0
20150103       X           1.0         0
20150104       X           0.9         0
20150105       abc         12.3        1
20150106       abc         12.4        1
20150107       abc         12.4        1
20150108       X           0.7         0
20150109       X           0.6         0
20150110       X           0.9         0
20150111       abc         12.3        0
20150112       abc         12.4        0
20150113       X           0.5         0
20150114       X           0.6         0
20150115       abc         12.3       -1
20150116       abc         12.4       -1

The difficulty for me is that I don't know how to define 'previous section'. I tried to use pivot_table, which seems easier to operate on, and I want to generate the same 'position' column as follows:

  date        X             abc    position
20150101      0.8           nan        0
20150102      0.9           nan        0
20150103      1.0           nan        0
20150104      0.9           nan        0
20150105      nan          12.3        1
20150106      nan          12.4        1
20150107      nan          12.4        1
20150108      0.7          nan         0
20150109      0.6          nan         0
20150110      0.9          nan         0
20150111      nan          12.3        0
20150112      nan          12.4        0
20150113      0.5          nan         0
20150114      0.6          nan         0
20150115      nan          12.3       -1
20150116      nan          12.4       -1

but I still don't know how to define 'previous section' to calculate max, min or any other value of each section of prices of X. Help!!!

  • It doesn't look to me like your conditions are exclusive: if the type is abc, and the previous X section ranges from 0.2 to 1.2, should the position be -1 or 1? Commented Apr 22, 2015 at 1:25
  • you are right. It is a simple version, and always true for my actual data set. Commented Apr 22, 2015 at 2:15

1 Answer

The general form of your problem is finding occurrences of repeating values. The Pandas instinct should be to reach for groupby, but a simple groupby on the actual series value won't work here, because it will combine non-consecutive stretches of like-values. Instead, I like using Series.diff and Series.cumsum for this.

import pandas as pd
series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

You can't use Series.diff on strings, so first create a mapping from string to int. The values need only be unique.

mapping = {v: k for k, v in enumerate(set(series))}  # e.g. {"abc": 0, "x": 1}; set order is arbitrary
int_series = series.map(mapping)  # e.g. pd.Series([0, 0, 1, 1, 1, 0, 0])

Now you can use Series.diff. Series.diff gives you series[n] - series[n - 1]. The starting value doesn't have a previous row, so it's always NaN.

int_series.diff()  # [NaN, 0, 1, 0, 0, -1, 0]

With Series.diff, we can find the start of every group by testing != 0.

starts = int_series.diff() != 0  # [True, False, True, False, False, True, False]

Compare this with your original values to see how we've found the start of each group:

starts  # [True, False, True, False, False, True, False]
series  # ["abc", "abc", "x", "x", "x", "abc", "abc"]

We don't want to just know the start of each group though -- we want to know what group each row is in. Easy-peasy -- Series.cumsum adds each row to the previous one. Conveniently, if you try to add bools in Python, they get forced to ints, their superclass.

True + True  # 2
True + False  # 1
groups = starts.cumsum()  # [1, 1, 2, 2, 2, 3, 3]
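
As an aside, the same run labels can be produced without the string-to-int mapping by comparing each value with its predecessor via Series.shift. This is a minimal sketch of an alternative, not the approach described above; it assumes only that consecutive equal values should share a label.

```python
import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

# A run starts wherever a value differs from the one before it.
# The first row is compared against a missing value, so it counts as a start.
starts = series != series.shift()

# The running total of starts gives every row a 1-based run id.
groups = starts.cumsum()

print(groups.tolist())  # [1, 1, 2, 2, 2, 3, 3]
```

Because the comparison works directly on the strings, there is no dependence on the arbitrary ordering of set(series).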

Now, you can use groupby(groups) to act on each group independently.

for _, sequence in series.groupby(groups):
    print(sequence.tolist())
# ['abc', 'abc']
# ['x', 'x', 'x']
# ['abc', 'abc']

In your particular case:

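# prices is the 'price' column; groups is built from the 'type' column as above
group_mins = prices.groupby(groups).min()
# Look up each row's previous group; the first group has none, so it compares as False
previous_group_below_min = (groups - 1).map(group_mins) < SOME_CONSTANT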
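
Putting the pieces together for the data frame in the question, one possible end-to-end sketch looks like this. It assumes the 'type' column only alternates between 'X' and 'abc', so the run before an 'abc' run is always an 'X' run. It uses a Series.shift comparison to label runs and numpy.select to apply the rules; both are choices made for this illustration. Where both thresholds are met, rule 2 is checked first and wins, which is also an assumption, since the question leaves that case open.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=16).strftime("%Y%m%d"),
    "type": ["X"] * 4 + ["abc"] * 3 + ["X"] * 3 + ["abc"] * 2
            + ["X"] * 2 + ["abc"] * 2,
    "price": [0.8, 0.9, 1.0, 0.9, 12.3, 12.4, 12.4, 0.7, 0.6, 0.9,
              12.3, 12.4, 0.5, 0.6, 12.3, 12.4],
})

# Label consecutive runs of the same type: 1, 1, 1, 1, 2, 2, 2, 3, ...
groups = (df["type"] != df["type"].shift()).cumsum()

# Per-run max and min of price, indexed by run id.
group_max = df["price"].groupby(groups).max()
group_min = df["price"].groupby(groups).min()

# Look up the stats of the run just before each row's run.
# Run 0 does not exist, so rows in the first run get NaN.
prev_max = (groups - 1).map(group_max)
prev_min = (groups - 1).map(group_min)

is_abc = df["type"] == "abc"
# Conditions are tried in order; rows matching none get the default 0.
df["position"] = np.select(
    [is_abc & (prev_max >= 1), is_abc & (prev_min <= 0.5)],
    [1, -1],
    default=0,
)

print(df)
```

Comparisons against NaN evaluate to False, so an 'abc' run at the very start of the data would simply get position 0.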