Finding matching string in Pandas Dataframe, starting at specific indices

Question

I have a pandas DataFrame in which 5 matching strings, call them "xyz", each occur some number of rows after an initial matching string, call them "initial string1" and "initial string2".

    index   col0        col3
     500    data   "initial string1"
      ..     ..           ..
     600    data        "xyz"
      ..     ..           ..
     1343   data   "initial string1"
      ..     ..           ..
     1443   data        "xyz"
      ..     ..           ..
     2432   data   "initial string2"
      ..     ..           ..
     2453   data        "xyz"
      ..     ..           ..
     2467   data   "initial string2"
      ..     ..           ..
     2487   data        "xyz"

I want to iterate through the DataFrame starting at the indices of the initial strings, find the first occurrence of "xyz" after each one, and write the rows where these "xyz" occur to a new DataFrame, and then to Excel, depending on which initial string was encountered. That is, store all "xyz" rows corresponding to "initial string1" in one DataFrame, and all "xyz" rows corresponding to "initial string2" in another.

I am not sure how to combine iterrows and df["column"].str.match("matching string") to carry out these iterations. Appreciate the help!

mrp · Accepted Answer · 2018-09-18 15:39:33Z

1

Why can't you just search for the xyz strings?

import pandas as pd

df = pd.DataFrame({"col1": ['data', 'data', 'data', 'data', 'data', 'data', 'data'], 
                   'col3': ['initial string', 'something', 'xyz', 
                            'initial string', 'xyz', 'nothing', 'xyz']})

# Index labels of every row whose col3 starts with 'xyz'
df[df.col3.str.match('xyz')].index

If you have multiple, different strings, just use the .isin method:

df[df.col3.isin(['something', 'xyz'])].index
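One difference between the two calls is worth keeping in mind: `str.match` treats its argument as a regular expression anchored at the start of each string, while `isin` tests for exact equality. A small sketch with made-up values (the `xyz_extra` entry is hypothetical, added only to show the difference):

```python
import pandas as pd

df = pd.DataFrame({"col3": ["initial string", "xyz", "xyz_extra", "nothing"]})

# str.match: regex anchored at the start, so "xyz_extra" also matches
match_idx = df[df.col3.str.match("xyz")].index.tolist()

# isin: exact equality against the listed values
isin_idx = df[df.col3.isin(["xyz"])].index.tolist()

print(match_idx)  # [1, 2]
print(isin_idx)   # [1]
```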

edited Sep 18, 2018 at 15:39

answered Sep 18, 2018 at 15:37

mrp


5 Comments

KRB Over a year ago

The xyz strings occur under two different conditions: one matching string yields one set of xyz, and another matching string yields another set. I did a raw string search for all xyz, and that gave me a mixture. I want to separate the xyz by their respective matching strings, whose indices I know.

mrp Over a year ago

I'm sorry but I don't understand what that means.

mrp Over a year ago

@KarthikBalakrishnan could you elaborate, perhaps in your question, why this doesn't work?

mrp Over a year ago

@KRB do you only want the first occurrence of xyz for each initial string, or all occurrences?

KRB Over a year ago

first occurrence

Chris Adams · Accepted Answer · 2018-09-18 16:08:28Z

0

What about something like this:

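indices_initial = [500, 1343, 2432, 5433, 7533]
indices_xyz = []

# Pair each initial index with the next one and search the rows in between
for i, j in zip(indices_initial, indices_initial[1:]):
    # idxmax returns the label of the first True, i.e. the first 'xyz' in the slice
    indices_xyz.append(df.loc[i:j, 'col3'].eq('xyz').idxmax())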

df.loc[indices_xyz]

[out]

        col0    col3
index       
600     data    xyz
1443    data    xyz
2453    data    xyz
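To separate the hits by which initial string preceded them, as the question asks, the same slicing idea can be paired with a lookup of the label at each start index. This is only a sketch under assumptions: the toy frame mimics the question's layout with far fewer rows, and `first_xyz_by_initial` is a hypothetical helper name, not part of pandas. It also guards against a slice that contains no "xyz", because `idxmax` on an all-False mask returns the first label of the slice rather than signalling a miss.

```python
import pandas as pd

# Toy data shaped like the question (index labels are assumptions)
df = pd.DataFrame(
    {"col0": ["data"] * 8,
     "col3": ["initial string1", "xyz", "initial string1", "xyz",
              "initial string2", "xyz", "initial string2", "xyz"]},
    index=[500, 600, 1343, 1443, 2432, 2453, 2467, 2487],
)

def first_xyz_by_initial(df, starts, target="xyz", col="col3"):
    """Map each initial string to the rows holding the first target after it."""
    hits = {}
    # Pair every start with the next start; None leaves the last slice open-ended
    for start, stop in zip(starts, list(starts[1:]) + [None]):
        mask = df.loc[start:stop, col].eq(target)
        if mask.any():  # idxmax alone would return the slice's first label on a miss
            hits.setdefault(df.at[start, col], []).append(mask.idxmax())
    return {label: df.loc[idx] for label, idx in hits.items()}

groups = first_xyz_by_initial(df, [500, 1343, 2432, 2467])
print(groups["initial string1"].index.tolist())  # [600, 1443]
print(groups["initial string2"].index.tolist())  # [2453, 2487]
```

Each frame in `groups` could then be written out with `DataFrame.to_excel`, for example one file or sheet per initial string.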

edited Sep 18, 2018 at 16:08

answered Sep 18, 2018 at 16:02

Chris Adams


Sergey · Accepted Answer · 2018-09-18 17:57:42Z

0

import numpy as np
import pandas as pd

# Setting up input data
df = pd.DataFrame(np.random.rand(12500, 2), columns=['col0', 'col1'])
df['col1'] = df['col1'].astype(object)  # allow strings alongside the random floats
for i in [0, 500, 1343, 2432, 5433, 7533]:
    df.loc[i, 'col1'] = 'init string'
for i in range(1, 12000, 100):
    df.loc[i, 'col1'] = 'xyz'

# Hopefully a solution to your question
init_index = df[df.col1 == 'init string'].index
first_hits = []
for start, stop in zip(init_index, init_index[1:]):
    # First "xyz" strictly between two consecutive init strings
    first_hits.append(df.query('index > @start & index < @stop & col1 == "xyz"').head(1))
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate
search_results = pd.concat(first_hits)
search_results
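The same kind of result can be sketched without an explicit loop by labelling each row with the most recent init string and taking the first "xyz" per label. This is an alternative technique, not what the answer above does, and it assumes a small hand-made frame; `block` is a made-up name used only for illustration. Unlike the loop, it would also report the first "xyz" after the last init string if one existed.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["init string", "a", "xyz", "xyz",
                            "init string", "xyz", "b",
                            "init string", "c"]})

# Running count of init strings seen so far; rows before the first one get 0
block = df["col1"].eq("init string").cumsum().rename("block")

# Keep only "xyz" rows that follow some init string, then take the first per block
xyz_rows = df[df["col1"].eq("xyz") & block.gt(0)]
first_xyz = xyz_rows.groupby(block[xyz_rows.index]).head(1)

print(first_xyz.index.tolist())  # [2, 5]
```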

edited Sep 18, 2018 at 17:57

answered Sep 18, 2018 at 16:44

Sergey


KRB · Accepted Answer · 2018-09-21 04:17:43Z

0

I was able to solve this by using next() on an iterator to find and break out at the first occurrence of the string of interest, after slicing the list into the regions where I want to search for the strings.
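The answer does not include code, so the following is only a guess at what such an approach might look like: slice the column into regions that start at each known index and use the built-in `next` on a generator to stop at the first match. The helper name `first_match` and the toy data are assumptions.

```python
import pandas as pd

col = pd.Series(
    ["initial string1", "xyz", "initial string2", "other", "xyz", "xyz"],
    index=[500, 600, 2432, 2440, 2453, 2487],
)
starts = [500, 2432]  # known indices of the initial strings

def first_match(region, target="xyz"):
    """Return the index label of the first target in region, or None."""
    # next() stops consuming the generator at the first hit
    return next((idx for idx, val in region.items() if val == target), None)

# Each region runs from one start up to (not including) the next start
bounds = starts + [None]
results = {}
for start, stop in zip(bounds, bounds[1:]):
    region = col.loc[start:] if stop is None else col.loc[start:stop].iloc[:-1]
    results[col.loc[start]] = first_match(region)

print(results)  # {'initial string1': 600, 'initial string2': 2453}
```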

answered Sep 21, 2018 at 4:17

KRB
