I'm trying to write two separate tokenize functions in Python. The first one takes a string and returns a list of tokens such that (1) all tokens are lowercase and (2) all punctuation is kept as separate tokens.

The second one does the same thing as the first, with one difference: whenever the token 'not' appears, the two subsequent tokens get the prefix 'not_'. See the example below.

I was able to construct the first one. Below is my code for the first tokenize function:

import re

def token(text):
    x=re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())
    return x

output:

token("Hi! How's it going??? an_underscore is not *really* punctuation.")
['hi','!','how',"'",'s','it','going','?','?','?','an_underscore','is','not','*','really','*','punctuation','.']
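
One thing to note about the pattern above: it only keeps the punctuation characters listed in the character class, so anything else (for example `-`, `:` or `"`) is silently dropped. Below is a minimal sketch of a broader variant, assuming "punctuation" means any character that is neither a word character nor whitespace. The name `token_any_punct` is hypothetical and not part of the original code.

```python
import re

def token_any_punct(text):
    # \w+ matches a run of letters, digits and underscores;
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace (assumed definition of punctuation).
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(token_any_punct('Well-known: "yes"'))
```

Whether this is preferable depends on the input: the explicit list in the original gives tighter control over which symbols survive.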

Expected output for 2nd tokenize function:

tokenize_with_not("This movie is not good. In fact, it is not even really a movie.")
['this','movie','is','not','not_good','not_.','in','fact',',','it','is','not','not_even','not_really','a','movie','.']

Can somebody help me complete the second tokenize function? Any help is appreciated.

2 Answers

Try:

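import re

def token(text):
    # Lowercase first, then match either a run of word characters
    # or a single punctuation character from the listed set.
    x=re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())
    return x

def tokenize_with_not(text):
    result = []
    c = 0  # number of upcoming tokens that still need the 'not_' prefix
    for t in token(text):
        if t == 'not':
            # Keep 'not' itself unchanged and (re)start the two-token window.
            c = 2
            result.append(t)
        elif c > 0:
            result.append('not_' + t)
            c -= 1
        else:
            result.append(t)

    return result

print(tokenize_with_not("This movie is not good. In fact, it is not even really a movie."))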
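
If the number of prefixed tokens might change later, the same counter idea can be parameterized. This is only a sketch: the `window` parameter is an assumption added for illustration, and with its default of 2 the function reproduces the expected output from the question.

```python
import re

def token(text):
    # Same tokenizer as in the question.
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

def tokenize_with_not(text, window=2):
    # window: how many tokens after each 'not' get the 'not_' prefix
    # (hypothetical parameter, not in the original code).
    result = []
    remaining = 0
    for t in token(text):
        if t == 'not':
            remaining = window  # restart the window at every 'not'
            result.append(t)
        elif remaining > 0:
            result.append('not_' + t)
            remaining -= 1
        else:
            result.append(t)
    return result

expected = ['this', 'movie', 'is', 'not', 'not_good', 'not_.', 'in', 'fact', ',',
            'it', 'is', 'not', 'not_even', 'not_really', 'a', 'movie', '.']
sentence = "This movie is not good. In fact, it is not even really a movie."
assert tokenize_with_not(sentence) == expected
```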

You can try this:

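def token_with(text, t):
    # Relies on the token() function from the question.
    ret = token(text)
    for i in range(len(ret)):
        if ret[i] == t:
            try:
                # Prefix the next two tokens in place.
                ret[i+1] = '{}_{}'.format(t, ret[i+1])
                ret[i+2] = '{}_{}'.format(t, ret[i+2])
            except IndexError:
                # Fewer than two tokens follow t; prefix what exists.
                pass
    return ret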

How to use:

token_with("This movie is not good. In fact, it is not even really a movie.", "not")
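
The counter-based and in-place approaches agree on the example sentence, but they can differ when two 'not' tokens are adjacent. The question does not say which result is wanted in that case. The sketch below puts both side by side; the names `with_counter` and `in_place` are hypothetical labels for the two approaches.

```python
import re

def token(text):
    # Same tokenizer as in the question.
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

def with_counter(text):
    # Counter approach: every 'not' stays as-is and restarts the window.
    result, c = [], 0
    for t in token(text):
        if t == 'not':
            c = 2
            result.append(t)
        elif c > 0:
            result.append('not_' + t)
            c -= 1
        else:
            result.append(t)
    return result

def in_place(text, t='not'):
    # In-place approach: a second 'not' is rewritten to 'not_not'
    # before the loop reaches it, so it no longer triggers prefixing.
    ret = token(text)
    for i in range(len(ret)):
        if ret[i] == t:
            try:
                ret[i+1] = '{}_{}'.format(t, ret[i+1])
                ret[i+2] = '{}_{}'.format(t, ret[i+2])
            except IndexError:
                pass
    return ret

print(with_counter("It is not not good at all"))
# ['it', 'is', 'not', 'not', 'not_good', 'not_at', 'all']
print(in_place("It is not not good at all"))
# ['it', 'is', 'not', 'not_not', 'not_good', 'at', 'all']
```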
