I have a list of items stored in a variable as shown below:

listitems = ['<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n', '<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n']

I am trying to extract each person's name; in my example the names are "Adams" and "Donovan Smith". However, I need help getting the special characters into the pattern. Usually you would escape each one with a backslash, but I am wondering whether there is a way to accept multiple special characters at once without inserting multiple backslashes.

I also want to wildcard (ignore) the unique number and name in the link, for example 23138 and 'donovan-smith'.

My current pattern looks as follows:

pattern1 = re.compile('<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">(.*?)<\/a>\n')

Any help would be much appreciated.

  • Why are there backslashes before the forward slashes? Commented Jul 31, 2013 at 23:33
  • You probably want raw strings for those regexes. Commented Jul 31, 2013 at 23:35
  • @Blender My data under "listitems" is automatically imported in that format. I have no control over that data. Commented Aug 1, 2013 at 0:28
  • @user2357112 How would I go about doing that? I've read that entire page, but nothing seems to help. Could you give me an example using my example data? Commented Aug 1, 2013 at 0:29
  • @Hyflex: Stick an r on the front of the string, before the opening quote, and use a single backslash instead of two whenever you want a backslash in your regex. (I suspect you weren't aware of the need to use two backslashes without raw strings.) Commented Aug 1, 2013 at 1:25
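
The raw-string advice in these comments can be sketched roughly as follows. This is only an illustration, assuming the data really contains a literal backslash before each forward slash (which is what the listitems literal in the question produces). The \d+ and [^"]* wildcards for the number and the name are assumptions about what those URL parts look like, not something stated in the question.

```python
import re

# Same strings as the question's listitems, written as raw strings so that
# every backslash is literal; the trailing newline is added separately.
listitems = [
    r'<a href="\/other\/end\/f1\/738638\/adams">Adams<\/a>' + '\n',
    r'<a href="\/other\/end\/f1\/23138\/donovan-smith">Donovan Smith<\/a>' + '\n',
]

# Inside a raw string, \\ is the regex for one literal backslash.
# \d+ stands in for the numeric id and [^"]* for the name slug
# (both are assumed shapes for those URL parts).
pattern1 = re.compile(r'<a href="\\/other\\/end\\/f1\\/\d+\\/[^"]*">(.*?)<\\/a>')

names = [pattern1.search(item).group(1) for item in listitems]
print(names)
# ['Adams', 'Donovan Smith']
```

Without the r prefix, each \\ in the pattern would have to be written as four backslashes, which is where most of the confusion comes from.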

1 Answer

If what you are doing is parsing HTML, why not try BeautifulSoup, mechanize or lxml.html?

For instance,

import lxml.html

listitems = ['<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n', '<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n']

string = ' '.join(listitems)

page = lxml.html.fromstring(string)

a_tags = page.cssselect('a')

names = []
for tag in a_tags:
  names.append(tag.text_content().strip())

print(names)
# ['Adams', 'Donovan Smith']

This gives you what you want. Plus, you can fine-tune which tags you select using XPath expressions, CSS selectors, etc.

But if you really want to write the regex yourself, why don't you start with something simpler, e.g.

PATTERN = re.compile(r'<a.*?">(.*?)<\\/a>')

So:

import re

listitems = ['<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n', '<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n']

PATTERN = re.compile(r'<a.*?">(.*?)<\\/a>')

names = []
for item in listitems:
  # PATTERN is already compiled, so call its search method directly
  n = PATTERN.search(item).group(1)
  names.append(n)

print(names)
# ['Adams', 'Donovan Smith']
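
On the original question of avoiding hand-written backslashes: one option is to let re.escape quote the fixed parts of the markup and write only the wildcard parts by hand. Here is a rough sketch, assuming the same listitems data as in the question. The names prefix and suffix, and the \d+ and [^"]* wildcards, are illustrative assumptions rather than anything from the original post.

```python
import re

# Same strings as the question's listitems, written as raw strings so that
# every backslash is literal; the trailing newline is added separately.
listitems = [
    r'<a href="\/other\/end\/f1\/738638\/adams">Adams<\/a>' + '\n',
    r'<a href="\/other\/end\/f1\/23138\/donovan-smith">Donovan Smith<\/a>' + '\n',
]

# Literal pieces of the markup, copied exactly as they appear in the data.
prefix = r'<a href="\/other\/end\/f1\/'
suffix = r'<\/a>'

# re.escape adds whatever backslashes the regex engine needs for a literal.
pattern = re.compile(
    re.escape(prefix)       # fixed start of the link
    + r'\d+'                # the unique number (assumed to be digits)
    + re.escape(r'\/')      # literal backslash + slash separator
    + r'[^"]*'              # the name slug, up to the closing quote
    + re.escape('">')       # end of the opening tag
    + r'(.*?)'              # the visible name we want to capture
    + re.escape(suffix)     # the (backslashed) closing tag
)

names = [pattern.search(item).group(1) for item in listitems]
print(names)
# ['Adams', 'Donovan Smith']
```

The trade-off is a longer pattern definition in exchange for never having to count backslashes by hand.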

4 Comments

Thanks for your reply. I've looked into Beautiful Soup, but it never seems to work for me. I am using mechanize to get the data from a JSON file; visiting the JSON directly, it comes out in the same format as my "listitems", with all of the raw \n formatting and HTML visible.
Edited the answer and added a regex there. Consider including a flag like re.DOTALL so that the dot matches newlines. But you can always transform your listitems to a string and pass it to lxml.html (or BS) -- in fact 'yourfile.html' in the code I showed is a string.
The only way I've found at the moment is: pattern1 = re.compile(r'\<a href\=\\\"\\/other\\/end\\/f1\\/.*?\\/.*?\\\"\>(.*?)\<\\\/a\>\\n') Which is not only hard to create, but really hard to read and messy :S
@Hyflex -- unless your whole data looks different, my answer works, as shown.
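
Since the comments mention that the data comes from a JSON file, the backslashes are most likely JSON string escapes (\/, \" and \n are all valid escapes inside a JSON string). Here is a hedged sketch, assuming the raw response is a JSON array of strings; raw_json is a made-up stand-in for that response, not the asker's actual data. Decoding with the standard json module removes the escapes, after which a much simpler pattern is enough.

```python
import json
import re

# Hypothetical raw response body: a JSON array of two strings, with the
# escapes exactly as they would appear in the file (assumption).
raw_json = r'''["<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n",
 "<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n"]'''

# json.loads turns \/ into /, \" into " and \n into a real newline.
listitems = json.loads(raw_json)

# With the escapes gone, the pattern needs no backslashed slashes at all.
# re.DOTALL lets the dot match newlines, as suggested in the comments.
pattern = re.compile(r'<a href="/other/end/f1/\d+/[^"]*">(.*?)</a>', re.DOTALL)

names = [pattern.search(item).group(1) for item in listitems]
print(names)
# ['Adams', 'Donovan Smith']
```

The decoded strings are also ordinary HTML at that point, so they could be handed to an HTML parser instead of a regex.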
