I'm trying to use regular expressions to remove the key codes that are attached to each genre name in my dataset. What I have so far removes most of the key codes but leaves some letters behind, and I'm not sure why. On inspection, the problem mostly occurs where a 0 is followed by letters; for example, "/m/0lxr" leaves behind "lxr".
If anyone knows how I could fix this, please let me know!
This is the code I have so far:
import re

def prepare(self, word):
    word = re.sub(r'//', "", word)
    word = re.sub(r'/\u[0-9][a-z]', "", word)
    word = re.sub(r'/.', "", word)
    word = re.sub(r'/,', "", word)
    word = re.sub(r'/!', "", word)
    word = re.sub(r'/?', "", word)
    word = re.sub(r'/{', "", word)
    word = re.sub(r"'", "", word)
    word = re.sub(r"//m//[0-9][a-z]+", "", word)
    word = re.sub(r'[0-9][a-z]+', "", word)
    word = re.sub(r'[a-z][0-9]+', "", word)
    return word
(?<=:")[^"]*(?=")?
re.sub(r'(?<=:")[^"]*(?=")', "", word)
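The leftover letters most likely come from the r'/.' substitution. In a regular expression an unescaped . matches any character, so '/.' removes "/m" and then "/0" from "/m/0lxr", and none of the later patterns match the bare "lxr" that remains. (If the slashes were meant to escape punctuation, note that the escape character in a regex is a backslash, e.g. r'\.'.) Below is a minimal sketch of one possible fix, not a definitive one. It assumes the key codes always look like "/m/" followed by lowercase letters, digits, or underscores, and that the remaining punctuation can simply be dropped; strip_key_codes and the sample strings are hypothetical, for illustration only.

```python
import re

# Assumption: a key code is "/m/" followed by lowercase letters, digits, or underscores.
KEY_CODE = re.compile(r"/m/[0-9a-z_]+")
# Inside a character class these punctuation characters are literal, so no escaping is needed.
PUNCTUATION = re.compile(r"[.,!?{'/]")

def strip_key_codes(word):
    """Hypothetical helper: remove key codes first, then leftover punctuation."""
    word = KEY_CODE.sub("", word)      # remove the whole code in a single match
    word = PUNCTUATION.sub("", word)   # then strip stray punctuation characters
    return word.strip()

print(strip_key_codes("/m/0lxr"))         # prints an empty line
print(strip_key_codes("/m/0lxr Drama!"))  # Drama
```

Matching the whole key code in one pattern, before any other substitution runs, avoids the ordering problem where a broader pattern removes part of the code and leaves the rest behind.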