How to check if string is in a dictionary dataframe

Question

1. Summarize the problem

I have a text file and a dictionary of words stored in a dataframe. The text file contains sentences (strings), one per line.

Only one column of the dictionary is relevant to me; it contains the keywords I want to match against my text. I then want to print the best match (by best I mean the longest one) in a dataframe.

2. Describe what you’ve tried

I created two dataframes: one for the output and one to import the CSV dictionary:

Output = pd.DataFrame(columns=['stuff','Bestmatch'])
MyDictionary = pd.read_csv('mydic.csv', sep=r'\t', engine='python', encoding='utf-8')

3. Show some code

Then I tried to code the main function:

def fetchword():
    with open("mytext.txt", "r") as f:
        lines = f.readlines()
        for value in MyDictionary["substance_name"].values:

Here, I am not sure what I can do to finish the loop.

f.close()

PS: if there are several matches in the MyDictionary column, I want to choose the longest one and print it into a new dataframe.

Example for the csv dictionary file MyDictionary:

substance_name  Quantity
Acetaminophen   3
ibuprofen   4
Levothyroxin    5
Metformin   7

My text file for instance:

Acetaminophen 3x/d for one week
ibuprofen 1/d for 3 days

I have added an answer. I need one more clarification: what do you mean by the longest one, the size of the name or the number of matches?
– Grayrigel, Commented Sep 17, 2020 at 13:30
The longest one. For instance, "Acetaminophen category A section B" is a better match than "Acetaminophen category A". Apologies, I may have forgotten to tell you that the "substance_name" string can consist of multiple words.
– marou95thebest, Commented Sep 17, 2020 at 13:43
Thanks. I accepted the answer and upvoted. The upvote is just not displayed because I don't have enough reputation points, but you should still have received the +1.
– marou95thebest, Commented Sep 17, 2020 at 15:58

Grayrigel · Accepted Answer · 2020-09-17 13:59:29Z

Try this:

import pandas as pd

# Load the tab-separated dictionary file.
MyDictionary = pd.read_csv('test.csv', delimiter='\t', encoding='utf-8')

def fetchword(df):
    # Read the whole text file as one string; the with-block closes the file.
    with open("test.txt", "r") as f:
        text = f.read()
    # For each name, record how often it occurs in the text and its length.
    data = [[value, text.count(value), len(value)]
            for value in df["substance_name"].values]
    return pd.DataFrame(data, columns=['Word', 'Count', 'Length'])

out = fetchword(MyDictionary)
print(out)

Output:

            Word  Count  Length
0  Acetaminophen      1      13
1      ibuprofen      1       9
2   Levothyroxin      0      12
3      Metformin      0       9

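Then you can filter out as you like, for example to keep the longest name that was actually found. Using idxmax() returns the row label, so the lookup with .loc stays correct even when the first rows have a count of 0:

print(out.loc[[out.loc[out['Count'] > 0, 'Length'].idxmax()]])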

Output:

            Word  Count  Length
0  Acetaminophen      1      13
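
If you need the best match for each line of the text rather than for the file as a whole, something along these lines might work. It is only a minimal sketch: the helper name best_matches and the sample lists are made up for illustration (they stand in for mydic.csv and mytext.txt), the column names 'stuff' and 'Bestmatch' are taken from the Output dataframe in the question, and matching is a plain case-sensitive substring test.

```python
import pandas as pd

def best_matches(lines, names):
    """Return one row per non-empty line with the longest name it contains."""
    rows = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        # Keep every dictionary name that occurs as a substring of this line.
        hits = [name for name in names if name in text]
        # max(..., key=len) picks the longest hit; None marks "no match".
        best = max(hits, key=len) if hits else None
        rows.append([text, best])
    return pd.DataFrame(rows, columns=["stuff", "Bestmatch"])

# Hypothetical sample data, not taken from the real files.
names = ["Acetaminophen", "Acetaminophen category A", "ibuprofen", "Metformin"]
lines = [
    "Acetaminophen category A 3x/d for one week\n",
    "ibuprofen 1/d for 3 days\n",
    "no known substance here\n",
]

result = best_matches(lines, names)
print(result)
```

With real data you could pass MyDictionary["substance_name"] as names and the result of f.readlines() as lines. If two names of equal length match the same line, max() keeps whichever comes first in names.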

edited Sep 17, 2020 at 13:59

answered Sep 17, 2020 at 13:24
