I have the following dataframe:

+-------+------------+
| index |    keep    |
+-------+------------+
|     0 | not useful |
|     1 | start_1    |
|     2 | useful     |
|     3 | end_1      |
|     4 | not useful |
|     5 | start_2    |
|     6 | useful     |
|     7 | useful     |
|     8 | end_2      |
+-------+------------+

Two pairs of marker strings (start_1/end_1 and start_2/end_2) indicate that only the rows between each pair are relevant. For the dataframe above, the output should therefore contain only the rows at index 2, 6 and 7 (2 lies between start_1 and end_1; 6 and 7 lie between start_2 and end_2).

import pandas as pd

d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)

What is the most Pythonic/Pandas approach to this problem? Thanks

1 Answer

Here's one way to do that (in a couple of steps, for clarity). There might be others:

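# Mark section boundaries: +1 on a start row, -1 on an end row, 0 elsewhere.
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
# The running total is 1 from each start row up to the row before its end row.
df["in_section"] = df.sections.cumsum()
# Keep the rows inside a section, excluding the start rows themselves.
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]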

Output:

     keep  sections  in_section
2  useful         0           1
6  useful         0           1
7  useful         0           1
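
If you would rather not add helper columns to df, the same idea can be sketched with standalone Series. This is only a variant of the steps above, not a different technique. The names is_start, is_end and depth are made up for illustration, and it assumes every start marker is followed by a matching end marker with no nesting:

```python
import pandas as pd

df = pd.DataFrame({"keep": ["not useful", "start_1", "useful", "end_1",
                            "not useful", "start_2", "useful", "useful", "end_2"]})

# Boolean masks for the marker rows (illustrative names, not from the question).
is_start = df["keep"].str.startswith("start")
is_end = df["keep"].str.startswith("end")

# +1 at each start marker, -1 at each end marker. The running total is 1
# from a start row up to (but not including) its end row, and 0 elsewhere.
depth = (is_start.astype(int) - is_end.astype(int)).cumsum()

# Rows inside a section, minus the start markers themselves.
res = df[(depth == 1) & ~is_start]
print(res.index.tolist())  # [2, 6, 7]
```

Here df keeps its single keep column, so res contains only the original data.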

1 Comment

True. The result of a copy-paste. Will change.
