0

Would really appreciate some help with the following problem. I'm intending on using Pandas library to solve this problem, so would appreciate if you could explain how this can be done using Pandas if possible.

I want to take the following excel file:

Before

and:

  • 1)convert the 'before' file into a pandas data frame
  • 2)look for the text in 'Site' column. Where this text appears within the string in the 'Domain' column, return the value in 'Owner' Column under 'Output'.
  • 3)the result should look like the 'After' file. I would like to convert this back into CSV format.

After

So essentially this is similar to an excel vlookup exercise, except its not an exact match we're looking for between the 'Site' and 'Domain' column.

I have already attempted this in Excel but im looking at over 100,000 rows, and comparing them against over 1000 sites, which crashes excel.

I have attempted to store the lookup list in the same file as the list of domains we want to classify with the 'Owner'. If there's a much better way to do this eg storing the lookup list in a separate data frame altogether, then that's fine.

Thanks in advance for any help, i really appreciate it.

Colin

2

1 Answer 1

0

I think the OP's question differs somewhat from the solutions linked in the comments which either deal with exact lookups (map) or lookups between dataframes. Here there is a single dataframe and a partial match to find.

import pandas as pd
import numpy as np

df = pd.ExcelFile('data.xlsx').parse(0)
df = df.astype(str)
df['Test'] = df.apply(lambda x: x['Site'] in x['Domain'],axis=1)
df['Output'] = np.where(df['Test']==True, df['Owner'], '')
df

The lambda allows reiteration of the in test to be applied across the axis, to return a boolean in Test. This then acts as a rule for looking up Owner and placing in Output.

Sign up to request clarification or add additional context in comments.

2 Comments

thanks joshi123. I've tried the above and the 'Output' column doesnt populate in my bigger data set. I could be wrong, but i think the above only looks at a single row at a time, searching for the 'Site' string within the 'Domain' column (on the same row), and if TRUE, it returns the string in column 'Owner'. However, what i want to do is look at 'domain' string in the first row, then scan down ALL strings in the 'Site' column and look for a partial match. Can you amend the above to achieve this? Thanks
i should add that in my dummy excerpt of the data, i have perhaps oversimplified the problem by having the 'site' which partially matches the 'domain' appearing on the same row.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.