0

I have a pandas dataframe where 5 matching strings, let's call them "xyz", each occur x lines after an initial matching string; let's call the initial strings "initial string1" and "initial string2".

    index   col0         col3
     500    data   "initial string1"
      ..     ..           ..
     600    data        "xyz"
     ...    ...          ...
    1343    data   "initial string1"
      ..     ..           ..
    1443    data        "xyz"
     ...    ...          ...
    2432    data   "initial string2"
      ..     ..           ..
    2453    data        "xyz"
      ..     ..           ..
    2467    data   "initial string2"
      ..     ..           ..
    2487    data        "xyz"

I want to iterate through the dataframe starting at these indices, find the first occurrence of "xyz" after each one, and write the rows where these "xyz" occur to a new dataframe, and then to Excel, based on which initial string preceded them. I.e., store all xyz rows corresponding to initial string1 in one dataframe, and all xyz rows corresponding to initial string2 in another dataframe.

I am not sure how to combine iterrows and df["column"].str.match("matching string") to carry out these iterations. Appreciate the help!

4 Answers

1

Why can't you just search for the xyz strings?

import pandas as pd

df = pd.DataFrame({"col1": ['data', 'data', 'data', 'data', 'data', 'data', 'data'], 
                   'col3': ['initial string', 'something', 'xyz', 
                            'initial string', 'xyz', 'nothing', 'xyz']})

df[df.col3.str.match('xyz')].index

If you have multiple, different strings, just use the .isin method:

df[df.col3.isin(['something', 'xyz'])].index
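A plain search like the ones above returns every "xyz" row mixed together. As a minimal sketch of one way to keep them apart, each row can be labelled with the most recent initial string above it and then grouped. The sample data, the `markers` list, and the helper columns `marker` and `block` are assumptions made for illustration, not part of the question's dataframe.

```python
import pandas as pd

# Small stand-in for the real data (assumed values).
df = pd.DataFrame({
    "col0": ["data"] * 8,
    "col3": ["initial string1", "other", "xyz", "xyz",
             "initial string2", "xyz", "initial string1", "xyz"],
})

markers = ["initial string1", "initial string2"]
is_marker = df["col3"].isin(markers)

# Label every row with the most recent marker seen at or above it.
df["marker"] = df["col3"].where(is_marker).ffill()
# Number the blocks so repeated markers stay separate from each other.
df["block"] = is_marker.cumsum()

# Keep only the first "xyz" row inside each block.
xyz_rows = df[df["col3"] == "xyz"]
first_xyz = xyz_rows.groupby("block").head(1)

# One DataFrame per initial string, without the helper columns.
by_marker = {
    name: group.drop(columns=["marker", "block"])
    for name, group in first_xyz.groupby("marker")
}
```

Each entry of `by_marker` is an ordinary DataFrame, so it could then be written out with `DataFrame.to_excel` if an Excel writer engine is installed.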

5 Comments

The xyz strings occur under two different conditions: one matching string yields one set of xyz, and another matching string yields another set of xyz. I did a raw string search for all xyz, and that gave me a mixture. I want to be able to separate the xyz by their respective matching strings, whose indices I know.
I'm sorry but I don't understand what that means.
@KarthikBalakrishnan could you elaborate, perhaps in your question, why this doesn't work?
@KRB do you only want the first occurrence of xyz for each initial string, or all occurrences?
first occurrence
0

What about something like this:

indices_initial = [500, 1343, 2432, 5433, 7533]
indices_xyz = []


for i, j in zip(indices_initial, indices_initial[1:]):
    indices_xyz.append(df.loc[i:j, 'col3'].eq('xyz').idxmax())

df.loc[indices_xyz]

[out]

        col0    col3
index       
600     data    xyz
1443    data    xyz
2453    data    xyz
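One caveat with `idxmax` on a boolean Series: when no value is True it returns the first label, so a block without any "xyz" would silently report its starting row. The sketch below is a guarded variant of the same idea that also searches the stretch after the last initial index. The sample frame and its index values are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the real data (assumed values).
df = pd.DataFrame(
    {"col0": "data",
     "col3": ["initial string1", "a", "xyz", "initial string2", "b",
              "initial string1", "xyz", "xyz"]},
    index=[500, 550, 600, 1343, 1400, 2432, 2453, 2487],
)

indices_initial = [500, 1343, 2432]
# Pair each start with the next start; None leaves the last slice open-ended.
bounds = zip(indices_initial, indices_initial[1:] + [None])

indices_xyz = []
for start, stop in bounds:
    block = df.loc[start:stop, "col3"]
    if stop is not None:
        # .loc slices include the end label, so drop the next marker row.
        block = block.iloc[:-1]
    hits = block[block.eq("xyz")].index
    if len(hits) > 0:  # skip blocks that contain no "xyz"
        indices_xyz.append(hits[0])

result = df.loc[indices_xyz]
```

Here the block starting at 1343 has no "xyz" and is skipped instead of contributing a wrong row.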


0
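import numpy as np
import pandas as pd

# Setting up input data
df = pd.DataFrame(np.random.rand(12500, 2), columns=['col0', 'col1'])
df['col1'] = df['col1'].astype(object)  # let the column hold strings next to the random floats
for i in [0, 500, 1343, 2432, 5433, 7533]:
    df.loc[i, 'col1'] = 'init string'
for i in range(1, 12000, 100):
    df.loc[i, 'col1'] = 'xyz'

# Hopefully a solution to your question:
# for each pair of consecutive 'init string' rows, keep the first 'xyz' row between them
init_indices = df[df.col1 == 'init string'].index
matches = []
for init_index, next_init_index in zip(init_indices, init_indices[1:]):
    matches.append(df.query('index > @init_index & index < @next_init_index'
                            ' & col1 == "xyz"').head(1))
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
search_results = pd.concat(matches)
search_results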



0

I was able to solve this by using next to search for and break out at the first occurrence of the string of interest, and by slicing the list into the regions where I want to search for the strings.
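This answer comes without code. A minimal sketch of one way it might look is shown here, assuming the column is converted to a plain list and Python's built-in `next` is given a default so that regions without a match are skipped. The sample data and the names `starts`, `regions`, and `first_hits` are hypothetical.

```python
import pandas as pd

# Small stand-in for the real data (assumed values).
df = pd.DataFrame({
    "col0": "data",
    "col3": ["initial string1", "xyz", "xyz", "initial string2",
             "other", "initial string2", "xyz"],
})

values = df["col3"].tolist()

# Positions of the initial strings; each one opens a search region.
starts = [pos for pos, value in enumerate(values)
          if value.startswith("initial string")]
# Each region runs from one start up to (not including) the next start.
regions = zip(starts, starts[1:] + [len(values)])

first_hits = []
for lo, hi in regions:
    # next() stops at the first match; None is returned if the region has none.
    pos = next((p for p in range(lo + 1, hi) if values[p] == "xyz"), None)
    if pos is not None:
        first_hits.append(pos)

result = df.iloc[first_hits]
```

Because `next` consumes the generator lazily, the scan of each region stops as soon as the first "xyz" is found.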
