
I have a DataFrame df that looks like this:

index       Posts                    clean_text
  0     Hi I am fine.              [Hi, I, am, fine]
  1     You are a piece of shit.   [You, are, a, piece, of, shit]
.
.
.

I have a list named corpus that has 3000 foul words.

I want to go through the column clean_text and add a new column result to df by checking a condition for every row. The condition is: if any word in a row's clean_text list is present in corpus, result should contain the string Irrelevant; otherwise it should contain Relevant.

Example: if any word of the list [Hi, I, am, fine] is present in corpus, result will be Irrelevant, otherwise Relevant. Since this list does not contain any foul words, the output should be Relevant.

The desired output is:

index       Posts                    clean_text                       result
  0     Hi I am fine.              [Hi, I, am, fine]                  Relevant
  1     You are a piece of shit.   [You, are, a, piece, of, shit]     Irrelevant
.
.
.

I want to do this using a lambda function. This is what I have so far:

df['result'] = df['clean_text'].map(lambda x: ["Relevant" for w in x if w not in corpus])

Firstly, I am unable to write the else part here, and secondly it produces undesirable output like this:

index       Posts                    clean_text                       result
  0     Hi I am fine.              [Hi, I, am, fine]                  [Relevant, Relevant, Relevant, Relevant]
  1     You are a piece of shit.   [You, are, a, piece, of, shit]     [Relevant, Relevant, Relevant,...]
.
.
.

I also tried writing a for loop like this, but it takes a lot of time:

for i in range(df.shape[0]):
    for word in df.loc[i]['clean_text']:
      if word in corpus:
        df['result'] = "Irrelevant"
        #break
      else:
        #continue
        df['result'] = "Relevant"

Kindly help me get the desired output using a lambda function.

  • Why does it have to be a lambda expression? Why not a regular function definition? Commented Feb 5, 2021 at 8:31
  • Probably the biggest problem is that you are using a list for corpus. You should use a set. Commented Feb 5, 2021 at 8:32
  • @juanpa.arrivillaga, how will a set help? Commented Feb 5, 2021 at 8:37
  • Because membership testing in a set is constant time, whereas in a list it's linear time. Commented Feb 5, 2021 at 8:38
  • @juanpa.arrivillaga I can change that to a set. Commented Feb 5, 2021 at 8:42

1 Answer

Use corpus = set(corpus).
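As a rough illustration of that conversion, here is a minimal sketch. The three words are a hypothetical stand-in for the 3000-word list from the question, and the name corpus_list is made up for the example:

```python
# Hypothetical stand-in for the asker's 3000-word list.
corpus_list = ["shit", "damn", "crap"]

# Convert once, up front. Membership tests on a set take constant time
# on average, whereas `in` on a list scans the elements one by one.
corpus = set(corpus_list)

print("shit" in corpus)  # True
print("fine" in corpus)  # False
```

The conversion only needs to happen once, before the column is mapped.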

Then you can use something like

df['result'] = df['clean_text'].map(lambda l: "Irrelevant" if any(x in corpus for x in l) else "Relevant")

Note, the fact that you are using a lambda is really not relevant. You could have done something like:

def search_corpus(tokens):
    # A single foul word is enough to mark the post as Irrelevant.
    if any(token in corpus for token in tokens):
        return "Irrelevant"
    return "Relevant"

And do:

df['result'] = df['clean_text'].map(search_corpus)

This won't affect performance. A lambda expression creates an ordinary function object, nothing special, and you never have to use one.
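To check the labelling rule from the question (any foul word present means Irrelevant), here is a small sketch in plain Python, without pandas, run on the two sample rows. The two-word corpus and the names label_tokens and foul_words are assumptions made for the example:

```python
# Hypothetical two-word corpus; the real one has about 3000 entries.
corpus = {"shit", "damn"}


def label_tokens(tokens, foul_words=corpus):
    """Return "Irrelevant" if any token is a foul word, else "Relevant"."""
    if any(token in foul_words for token in tokens):
        return "Irrelevant"
    return "Relevant"


# The two sample rows of clean_text from the question.
rows = [
    ["Hi", "I", "am", "fine"],
    ["You", "are", "a", "piece", "of", "shit"],
]

labels = [label_tokens(tokens) for tokens in rows]
print(labels)  # ['Relevant', 'Irrelevant']
```

With the DataFrame from the question, the same function could be passed to df['clean_text'].map and the result assigned to df['result']. Note that the comparison is case-sensitive, so this sketch assumes the tokens and the corpus use the same casing.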


2 Comments

Thanks for the solution. Also, is there a way to check which words match the corpus and add those in a separate column?
@dipanjana why not map(lambda l: [x for x in l if x in corpus])?
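A minimal sketch of that suggestion in plain Python, again using a hypothetical two-word corpus; the function name matched_words is made up for the example:

```python
# Hypothetical two-word corpus standing in for the real list.
corpus = {"shit", "damn"}


def matched_words(tokens, foul_words=corpus):
    """Return the tokens that appear in the corpus, in their original order."""
    return [token for token in tokens if token in foul_words]


rows = [
    ["Hi", "I", "am", "fine"],
    ["You", "are", "a", "piece", "of", "shit"],
]

matches = [matched_words(tokens) for tokens in rows]
print(matches)  # [[], ['shit']]
```

Passing the same function to df['clean_text'].map would give a column of lists, empty for rows with no match.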
