I need to find version numbers within text and replace them with a generic placeholder, e.g. '*'.

The problem is writing a regex that captures the version numbers.


Some examples:

Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.1 (KHTML, like Gecko) Ubuntu/11.04 Chromium/14.0.825.0 Chrome/14.0.825.0 Safari/535.1

Mozilla/5.0(iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B314 Safari/531.21.10gin_lib.cc

Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7 (.NET CLR 3.5.30729)

Version numbers contain:

  • alphanumeric characters
  • special characters, i.e. '.-_:'

A simple regex might be r'[0-9._:-]+', but this does not work: a version number needs at least one alphanumeric character, and special characters may only appear between alphanumeric characters.
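To illustrate the failure, here is a quick check on the third example above. It is only a sketch; the names `ua` and `naive` are chosen just for this illustration.

```python
import re

# Third example string from the list above
ua = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.2.7) "
      "Gecko/20100713 Firefox/3.6.7 (.NET CLR 3.5.30729)")

# The naive character class also matches lone separators,
# e.g. the '-' in "pt-PT" and the '.' in ".NET"
naive = re.sub(r'[0-9._:-]+', '*', ua)
print(naive)
```

The hyphen in pt-PT, the colon after rv and the dot in .NET are all replaced, which is not what I want.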


Any ideas?

1 Answer

Use the sub function from the re module. It returns a string in which every match of the regex is replaced, either by a fixed string or by the return value of a function. The harder part is deciding which version numbers in each string you want to replace; I'm assuming that you want all of them replaced.

import re

# Sample user-agent strings from the question
data = ["Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.1 (KHTML, like Gecko) Ubuntu/11.04 Chromium/14.0.825.0 Chrome/14.0.825.0 Safari/535.1",
"Mozilla/5.0(iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B314 Safari/531.21.10gin_lib.cc",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7 (.NET CLR 3.5.30729)"]
output = []
for ua in data:  # 'ua' rather than 'str', which would shadow the built-in
    # Replace each run that starts with a digit and continues with
    # alphanumerics or one of . _ : -
    output.append(re.sub(r'\d[0-9a-zA-Z._:-]+', '*', ua))
print(output)

gives these results:

['Mozilla/* (X*; Linux i*) AppleWebKit/* (KHTML, like Gecko) Ubuntu/* Chromium/* Chrome/* Safari/*', 
'Mozilla/*(iPad; U; CPU iPhone OS * like Mac OS X; en-us) AppleWebKit/* (KHTML, like Gecko) Version/* Mobile/* Safari/*',
'Mozilla/* (Windows; U; Windows NT *; pt-PT; rv:*) Gecko/* Firefox/* (.NET CLR *)']
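As noted above, sub also accepts a function in place of the replacement string. The following is a minimal sketch of that form, using the same pattern; the helper `mask` and the list `found` are hypothetical names introduced only for illustration.

```python
import re

pattern = r'\d[0-9a-zA-Z._:-]+'
text = "Gecko/20100713 Firefox/3.6.7"

found = []  # collects every version string that gets masked

def mask(match):
    # re.sub calls this once per match; the return value is the replacement
    found.append(match.group(0))
    return '*'

masked = re.sub(pattern, mask, text)
print(masked)
print(found)
```

The callable form is handy when you want to keep a record of what was replaced, or decide per match whether to replace at all.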

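The regex isn't very good: it matches any run that starts with a digit, so it also rewrites the digits inside tokens such as X11 and i686. What I wanted was a repeating set of alphanumerics followed by a delimiter, something like ([0-9a-zA-Z]+[._:-])+, but I couldn't get it to work.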
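One possible way to express the "alphanumerics separated by delimiters" idea is sketched below. It is an assumption about what counts as a version number, not a definitive pattern; `STRICT`, `STRICT_TOKEN` and `mask_versions` are hypothetical names.

```python
import re

# Hypothetical stricter pattern: a digit, then alphanumeric runs joined by
# single separators, so a separator can never end the match.
STRICT = re.compile(r'\d[0-9A-Za-z]*(?:[._:-][0-9A-Za-z]+)*')

# Optional variant (also an assumption): the lookbehind skips digits that
# sit inside a word, such as the "11" in "X11".
STRICT_TOKEN = re.compile(
    r'(?<![0-9A-Za-z])\d[0-9A-Za-z]*(?:[._:-][0-9A-Za-z]+)*'
)

def mask_versions(text, pattern=STRICT):
    """Replace every match of `pattern` in `text` with '*'."""
    return pattern.sub('*', text)

# The trailing full stop is kept, unlike with r'\d[0-9a-zA-Z._:-]+'
print(mask_versions("Upgrade to Chrome/14.0.825.0."))
# X11 and i686 are left alone; the lone digit is masked
print(mask_versions("(X11; Linux i686) Windows NT 6", STRICT_TOKEN))
```

Compared with the pattern used above, this version also matches a single-digit version and does not swallow trailing punctuation. Whether tokens like X11 should be masked depends on your data, hence the two variants.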

1 Comment

Thanks a lot for your reply and effort :) Very useful!
