How to match strings across multiple data frames and return row indexes with AND and OR options

Question

This is the data frame that I want to search, getting back the matching row numbers. Matching must be exact: 'A' and 'AB' are completely different things.

df2 = pd.DataFrame(np.array(['A','B','AC','AD','NAN','XX','BC','SLK','AC','AD','NAN','XU','BB','FG','XZ','XY','AD','NAN','NF','XY','AB','AC','AD','NAN','XY','LK','AC','AC','AD','NAN','KH','BC','GF','BC','AD']).reshape(5,7),columns=['a','b','c','d','e','f','g'])


    a   b   c   d   e   f   g
0   A   B   AC  AD  NAN XX  BC
1   SLK AC  AD  NAN XU  BB  FG
2   XZ  XY  AD  NAN NF  XY  AB
3   AC  AD  NAN XY  LK  AC  AC
4   AD  NAN KH  BC  GF  BC  AD

The strings to search for come from this smaller data frame. The two strings in each row must be matched as an AND condition: a row of df2 matches only if it contains both strings, and I want the indexes of the matching df2 rows.

df = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','NAN','BB','BC','AD']).reshape(6,2),columns=['a1','b1'])


    a1  b1
0   A   B   # both present in df2 row 0
1   C   D   # not present in any row of df2
2   AA  AB  # not present together in any row of df2
3   AC  AD  # both present in df2 rows 0, 1 and 3
4   NAN BB  # both present in df2 row 1
5   BC  AD  # both present in df2 rows 0 and 4

AND part

Desired output [0,1,3,4]

import pandas as pd
import numpy as np


index1 = df.index # Finds the number of row in df
terms=[]
React=[]
for i in range(len(index1)): #for loop to search each row of df dataframe
  terms=df.iloc[i] # Get i row
  terms[i]=terms.values.tolist() # converts to a list
  print(terms[i]) # to check
    # each row
  for term in terms[i]: # to search for each string in the 
    print(term)
    results = pd.DataFrame()
    if results.empty:
      results = df2.isin( [ term ] )
    else:
      results |= df2.isin( [ term ] ) 
  results['count'] = results.sum(axis=1)
  print(results['count'])
  print(results[results['count']==len(terms[i])].index.tolist())
  React=results[results['count']==len(terms[i])].index.tolist()
  React

I get TypeError: unhashable type: 'list' on the line results = df2.isin( [ term ] )
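
The message suggests that term is a list rather than a single string by the time it reaches isin; writing the list back into the row with terms[i] = ... is the likely source, though the exact behaviour may depend on the pandas version. Separately, results is reset to an empty frame inside the inner loop, so the |= branch never runs. A minimal sketch of the same loop idea that keeps each term a plain string and ANDs one mask per term, assuming exact whole-cell matches are wanted (matches and and_rows are illustrative names, not from the original post):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    np.array(['A','B','AC','AD','NAN','XX','BC','SLK','AC','AD','NAN','XU','BB','FG',
              'XZ','XY','AD','NAN','NF','XY','AB','AC','AD','NAN','XY','LK','AC',
              'AC','AD','NAN','KH','BC','GF','BC','AD']).reshape(5, 7),
    columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df = pd.DataFrame(
    np.array(['A','B','C','D','AA','AB','AC','AD','NAN','BB','BC','AD']).reshape(6, 2),
    columns=['a1', 'b1'])

matches = []  # one list of df2 row labels per row of df
for _, row in df.iterrows():
    # Start with every df2 row as a candidate, then AND in one mask per term.
    mask = pd.Series(True, index=df2.index)
    for term in row.tolist():  # term is a plain string here, never a list
        mask &= df2.isin([term]).any(axis=1)
    matches.append(mask[mask].index.tolist())

# Sorted union of the per-row matches gives the overall AND result.
and_rows = sorted(set().union(*matches))

print(matches)   # [[0], [], [], [0, 1, 3], [1], [0, 4]]
print(and_rows)  # [0, 1, 3, 4]
```

Because each term gets its own any(axis=1) mask, a string that occurs several times in one df2 row still counts once, which a plain cell count would not guarantee.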

The OR part should be easy, but it has to exclude the rows already accounted for by the AND part in the first section:

React2=df2.isin([X]).any(1).index.tolist()
React2
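
Note that .any(1).index returns every row label of df2, because the boolean mask is never applied before the index is taken. A hedged sketch of the OR part, assuming that OR means "a df2 row contains at least one of the strings found anywhere in df" and that the rows to exclude are the desired AND output [0, 1, 3, 4] stated above (all_terms, or_mask and or_only are illustrative names):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    np.array(['A','B','AC','AD','NAN','XX','BC','SLK','AC','AD','NAN','XU','BB','FG',
              'XZ','XY','AD','NAN','NF','XY','AB','AC','AD','NAN','XY','LK','AC',
              'AC','AD','NAN','KH','BC','GF','BC','AD']).reshape(5, 7),
    columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df = pd.DataFrame(
    np.array(['A','B','C','D','AA','AB','AC','AD','NAN','BB','BC','AD']).reshape(6, 2),
    columns=['a1', 'b1'])

# Assumption: OR is taken over every string in df, not per df row.
all_terms = df.to_numpy().ravel().tolist()

# True for each df2 row that contains at least one of the terms.
or_mask = df2.isin(all_terms).any(axis=1)

# Assumption: the AND rows are the desired output given in the question.
and_rows = [0, 1, 3, 4]

# Apply the mask first, then drop the rows already covered by AND.
or_only = or_mask[or_mask].index.difference(and_rows).tolist()

print(or_only)  # [2]
```

With this sample data every df2 row contains 'AD', so all five rows satisfy OR and only row 2 is left after the AND rows are excluded.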

@r-beginners Thank you so much for your comment. I added the desired output after your comment.
– Protima Rani Paul, Commented Aug 10, 2020 at 2:35

r-beginners · Accepted Answer · 2020-08-10 04:25:36Z

1

This is not exactly the output you asked for, but I computed the indexes for the AND condition. The resulting list holds, for each row of df, the df2 row indexes in which both of its strings appear. Does this meet the intent of your question?

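output = []
for i in range(len(df)):          # each search row of df
    tmp = []
    for k in range(len(df2)):     # each candidate row of df2
        # Per-cell masks: does the df2 cell equal this row's a1 / b1 string?
        d = df2.loc[k].isin(df.loc[i,['a1']])
        f = df2.loc[k].isin(df.loc[i,['b1']])
        d = d.tolist()
        f = f.tolist()
        # AND condition: both strings occur at least once in the df2 row
        if sum(d) >= 1 and sum(f) >=1:
            tmp.append(k)
    output.append(tmp)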

output
[[0], [], [], [0, 1, 3], [1], [0, 4]]
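
The nested loops above can also be written with Python sets. This is only a sketch, assuming exact whole-cell matches are wanted; it rebuilds the sample frames from the question so that it runs on its own, and row_sets and flat are illustrative names. Flattening the per-row result yields the [0, 1, 3, 4] requested in the question.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    np.array(['A','B','AC','AD','NAN','XX','BC','SLK','AC','AD','NAN','XU','BB','FG',
              'XZ','XY','AD','NAN','NF','XY','AB','AC','AD','NAN','XY','LK','AC',
              'AC','AD','NAN','KH','BC','GF','BC','AD']).reshape(5, 7),
    columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df = pd.DataFrame(
    np.array(['A','B','C','D','AA','AB','AC','AD','NAN','BB','BC','AD']).reshape(6, 2),
    columns=['a1', 'b1'])

# One set of cell values per df2 row, keyed by the row label.
row_sets = {label: set(cells)
            for label, cells in zip(df2.index, df2.to_numpy().tolist())}

# A df2 row matches when the set of search terms is a subset of its cells.
output = [
    [label for label, cells in row_sets.items() if set(terms) <= cells]
    for terms in df.to_numpy().tolist()
]

# Flatten and de-duplicate to get every df2 row matched by any df row.
flat = sorted({label for hits in output for label in hits})

print(output)  # [[0], [], [], [0, 1, 3], [1], [0, 4]]
print(flat)    # [0, 1, 3, 4]
```

Building the sets once avoids re-scanning each df2 row for every search term, which may matter when df has many rows.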


1 Comment

Protima Rani Paul Over a year ago

Perfect, this is working, but I need some time to test it with my original data. Please allow me some time; 12 hours should be enough for me to test this. Thank you so much.
