I'm trying to write two separate tokenize functions in Python. The first one takes in a string and returns a list of tokens such that (1) all tokens are lowercase, and (2) all punctuation is kept as separate tokens.
The second one does the same thing, with one difference: whenever the token 'not' appears, the two subsequent tokens should get the prefix 'not_'. See the expected output below.
I was able to construct the first one; here is my code for it:
import re

def token(text):
    # lowercase the text, then split it into word tokens and individual punctuation marks
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())
output:
token("Hi! How's it going??? an_underscore is not *really* punctuation.")
['hi','!','how',"'",'s','it','going','?','?','?','an_underscore','is','not','*','really','*','punctuation','.']
Expected output for the second tokenize function:
tokenize_with_not("This movie is not good. In fact, it is not even really a movie.")
['this','movie','is','not','not_good','not_.','in','fact',',','it','is','not','not_even','not_really','a','movie','.']
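Here is a rough, untested sketch of what I was thinking of trying. It reuses the token function above and keeps a counter for how many upcoming tokens still need the 'not_' prefix; how to handle another 'not' falling inside that two-token window is my own assumption (this version just restarts the counter):

def tokenize_with_not(text):
    # tokenize as before, then prefix the two tokens that follow each 'not'
    tokens = token(text)   # reuses the first tokenizer defined above
    result = []
    remaining = 0          # how many upcoming tokens still need the 'not_' prefix
    for tok in tokens:
        if tok == 'not':
            result.append(tok)
            remaining = 2  # the next two tokens get the prefix
        elif remaining > 0:
            result.append('not_' + tok)
            remaining -= 1
        else:
            result.append(tok)
    return result

For the example sentence above this seems to produce the expected list, but I haven't checked the edge cases.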
Can somebody help me complete the second tokenize function? Any help is appreciated.