I'm trying to write two separate tokenize functions in Python. The first one takes in a string and returns a list of tokens such that (1) all tokens are lowercase, and (2) all punctuation is kept as separate tokens.
The second one does the same thing, with one difference: whenever the token 'not' appears, the two subsequent tokens should get the prefix 'not_'. See the expected output below.
I was able to construct the first one; here is my code for it:
import re

def token(text):
    # lowercase the text, then split it into word tokens and individual punctuation marks
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())
output:
token("Hi! How's it going??? an_underscore is not *really* punctuation.")
['hi','!','how',"'",'s','it','going','?','?','?','an_underscore','is','not','*','really','*','punctuation','.']
Expected output for the second tokenize function:
tokenize_with_not("This movie is not good. In fact, it is not even really a movie.")
['this','movie','is','not','not_good','not_.','in','fact',',','it','is','not','not_even','not_really','a','movie','.']
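Here is a rough, untested sketch of what I was thinking of trying. It reuses the token function above and keeps a counter for how many upcoming tokens still need the 'not_' prefix; how to handle another 'not' falling inside that two-token window is my own assumption (this version just restarts the counter):

def tokenize_with_not(text):
    # tokenize as before, then prefix the two tokens that follow each 'not'
    tokens = token(text)   # reuses the first tokenizer defined above
    result = []
    remaining = 0          # how many upcoming tokens still need the 'not_' prefix
    for tok in tokens:
        if tok == 'not':
            result.append(tok)
            remaining = 2  # the next two tokens get the prefix
        elif remaining > 0:
            result.append('not_' + tok)
            remaining -= 1
        else:
            result.append(tok)
    return result

For the example sentence above this seems to produce the expected list, but I haven't checked the edge cases.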
Can somebody help me complete the second tokenize function? Any help is appreciated.