1. Summarize the problem

I have a text file and a dictionary of words stored in a DataFrame. The text file contains sentences (strings), one per line.

Only one column of the dictionary is relevant to me: it contains the keywords that I want to match against my text. I then want to store the best match (by best I mean the longest one) in a DataFrame.

2. Describe what you’ve tried

I created two DataFrames: one for the output and one to hold the imported CSV dictionary:

Output = pd.DataFrame(columns=['stuff','Bestmatch'])
MyDictionary = pd.read_csv('mydic.csv', sep=r'\t', engine='python', encoding='utf-8')

3. Show some code

Then I tried to write the main function:

def fetchword():
    with open("mytext.txt", "r") as f:
        lines = f.readlines()
        for value in MyDictionary["substance_name"].values:

Here, I am not sure what I can do to finish the loop.

f.close()

PS: if several entries in the MyDictionary column match, I want to choose the longest one and write it to a new DataFrame.

Example for the csv dictionary file MyDictionary:

substance_name    Quantity
Acetaminophen     3
ibuprofen         4
Levothyroxin      5
Metformin         7

My text file for instance:

Acetaminophen 3x/d for one week
ibuprofen 1/d for 3 days
  • Can you share a sample of the CSV file? Commented Sep 17, 2020 at 10:20
  • I edited my post with a CSV example Commented Sep 17, 2020 at 12:05
  • I have added an answer. I need one more clarification: what exactly do you mean by the longest one, the length of the name or the number of matches? Commented Sep 17, 2020 at 13:30
  • The longest one. For instance, "Acetaminophen category A section B" is a better match than "Acetaminophen category A". Apologies, I may have forgotten to tell you that the "substance_name" string can consist of multiple words. Commented Sep 17, 2020 at 13:43
  • Thanks. I accepted the answer and upvoted. It is just not displayed because I don't have enough reputation points, but normally you should still have received the +1. Commented Sep 17, 2020 at 15:58

1 Answer

Try this:

import pandas as pd

# The dictionary file is tab-separated.
MyDictionary = pd.read_csv('test.csv', delimiter='\t', encoding='utf-8')

def fetchword(df):
    # Read the whole text file as one string; the with block closes the file.
    with open("test.txt", "r") as f:
        text = f.read()
    # For each keyword, record how often it occurs in the text and its length.
    data = [[value, text.count(value), len(value)]
            for value in df["substance_name"].values]
    return pd.DataFrame(data, columns=['Word', 'Count', 'Length'])

out = fetchword(MyDictionary)
print(out)

Output:

            Word  Count  Length
0  Acetaminophen      1      13
1      ibuprofen      1       9
2   Levothyroxin      0      12
3      Metformin      0       9

Then you can filter the result as you like, for example to keep the longest word that occurs at least once:

print(out.loc[[out.loc[out['Count'] > 0, 'Length'].idxmax()]])

Output:

            Word  Count  Length
0  Acetaminophen      1      13
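
If the best match is needed for each line of the text rather than for the file as a whole, the same idea can be applied line by line. The following is only a minimal sketch under some assumptions: best_matches is a hypothetical helper name, matching is a plain case-sensitive substring test (as with count above), and the keywords are passed in as a plain list such as MyDictionary["substance_name"].tolist().

```python
def best_matches(lines, keywords):
    """Return (line, longest keyword found in that line) pairs.

    Assumption: plain case-sensitive substring matching.
    """
    rows = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        # Collect every keyword that occurs somewhere in this line.
        found = [kw for kw in keywords if kw in text]
        # max(..., key=len) picks the longest keyword; None if nothing matched.
        best = max(found, key=len) if found else None
        rows.append((text, best))
    return rows


# Sample data taken from the question.
keywords = ["Acetaminophen", "ibuprofen", "Levothyroxin", "Metformin"]
lines = ["Acetaminophen 3x/d for one week\n", "ibuprofen 1/d for 3 days\n"]
rows = best_matches(lines, keywords)
```

The resulting rows can then be turned into the output DataFrame from the question with pd.DataFrame(rows, columns=['stuff', 'Bestmatch']). Because the longest keyword wins, a multi-word entry such as "Acetaminophen category A section B" is preferred over "Acetaminophen category A" when both occur in the same line.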