Is accepting special characters for a range within a regex pattern possible?

Question

I have a list of items stored in a variable as shown below:

listitems = ['<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n', '<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n']

I am trying to extract each person's name; in my example the names are "Adams" and "Donovan Smith". I need help getting the special characters into the pattern. Usually you would escape each one with a backslash, but I am wondering whether there is a way to accept multiple special characters at once without inserting multiple backslashes.

I also want to wildcard (ignore) the unique number and name in the link, for example 23138 and 'donovan-smith'.

My current pattern looks as follows:

pattern1 = re.compile('<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">(.*?)<\/a>\n')

Any help would be much appreciated.

@Blender my data under "listitems" is automatically imported in that format. I have no control over that data.
– Ryflex, Commented Aug 1, 2013 at 0:28
@user2357112 How would I go about doing that? I've had a look and read that entire page, but nothing seems to help. Are you able to give me an example using my example data?
– Ryflex, Commented Aug 1, 2013 at 0:29
@Hyflex: Stick an r on the front of the string, before the opening quote, and use a single backslash instead of two whenever you want a backslash in your regex. (I suspect you weren't aware of the need to use two backslashes without raw strings.)
– user2357112, Commented Aug 1, 2013 at 1:25
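
As a rough illustration of the raw-string advice in the comment above, here is a minimal sketch. It assumes the data really contains a literal backslash before each slash (which is what the `listitems` literal in the question produces); the names `item`, `plain` and `raw` are only illustrative.

```python
import re

# One item as Python stores it: the text contains a real backslash
# before each slash (written as a raw string so nothing is hidden).
item = r'<a href="\/other\/end\/f1\/23138\/donovan-smith">Donovan Smith<\/a>'

# To match one literal backslash, the regex engine must see two.
# In a normal string literal that takes four characters in the source.
plain = re.compile('<\\\\/a>')

# In a raw string, two characters are enough.
raw = re.compile(r'<\\/a>')

# Both spellings produce exactly the same pattern text.
same_pattern = plain.pattern == raw.pattern
found = raw.search(item) is not None
print(same_pattern, found)
```

Raw strings do not remove the need to escape regex metacharacters; they only stop Python's own string escaping from being applied first.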

djas · Accepted Answer · 2013-08-01 13:13:55Z

2

If what you are doing is parsing HTML, why not try BeautifulSoup, mechanize or lxml.html?

For instance,

import lxml.html

listitems = ['<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n', '<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n']

string = ' '.join(listitems)

page = lxml.html.fromstring(string)

a_tags = page.cssselect('a')

names = []
for tag in a_tags:
  names.append(tag.text_content().strip())

print(names)
['Adams', 'Donovan Smith']

This gives you what you want. Plus, you can fine-tune which tags you select using XPath expressions, CSS selectors, etc.

But if you really want to write the regex yourself, why don't you start with something simpler, e.g.

PATTERN = re.compile(r'<a.*?">(.*?)<\\/a>')

So:

import re

listitems = ['<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n', '<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n']

PATTERN = re.compile(r'<a.*?">(.*?)<\\/a>')

names = []
for item in listitems:
  # PATTERN is already compiled, so call its search method directly
  n = PATTERN.search(item).group(1)
  names.append(n)

print(names)
['Adams', 'Donovan Smith']
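
For the two specific things the question asks about (escaping many special characters at once, and ignoring the number and name in the link), a possible sketch follows. It is not part of the original answer: it assumes every link shares the fixed prefix shown in the question, uses `re.escape` to escape that prefix in one go, and writes the sample data as raw strings for readability.

```python
import re

# Same data as in the question, written as raw strings plus a newline.
listitems = [
    r'<a href="\/other\/end\/f1\/738638\/adams">Adams<\/a>' + '\n',
    r'<a href="\/other\/end\/f1\/23138\/donovan-smith">Donovan Smith<\/a>' + '\n',
]

# re.escape backslash-escapes every special character in the fixed part
# of the link, so none of them has to be escaped by hand.
prefix = re.escape(r'<a href="\/other\/end\/f1\/')

# \d+ stands in for the numeric id, [^"]+ for the name in the link,
# and the group captures the link text up to the literal <\/a>.
pattern = re.compile(prefix + r'\d+\\/[^"]+">(.*?)<\\/a>')

names = [pattern.search(item).group(1) for item in listitems]
print(names)
```

This is stricter than the simple pattern above, since it only matches links under that one prefix; whether that is desirable depends on what else appears in the real data.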

edited Aug 1, 2013 at 13:13

answered Aug 1, 2013 at 0:29

djas


4 Comments

Ryflex Over a year ago

Thanks for your reply. I've looked into Beautiful Soup, but it never seems to work for me. I am using mechanize to get the data from a JSON file; when I visit the JSON directly, it comes out in the same format as my "listitems", with all of the raw \n formatting and HTML visible.
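
If the text really is the raw body of a JSON response, as this comment suggests, one option worth trying is to let the standard `json` module undo the escaping before any matching. The sketch below is only an assumption about the data: `raw_json` is a hypothetical payload shaped as a JSON array of strings, and the real response may be structured differently.

```python
import json
import re

# Hypothetical raw JSON text. Inside JSON strings, \" \/ and \n are
# escape sequences for a quote, a slash and a newline.
raw_json = r'["<a href=\"\/other\/end\/f1\/738638\/adams\">Adams<\/a>\n", "<a href=\"\/other\/end\/f1\/23138\/donovan-smith\">Donovan Smith<\/a>\n"]'

# json.loads decodes the escapes, leaving ordinary HTML fragments.
decoded = json.loads(raw_json)

# With the backslashes gone, the pattern needs no backslash escaping.
pattern = re.compile(r'<a [^>]*>(.*?)</a>')
names = [pattern.search(item).group(1) for item in decoded]
print(names)
```

After decoding, the fragments are plain HTML, so they can also be passed to an HTML parser instead of a regex.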

djas Over a year ago

Edited the answer and added a regex there. Consider including a flag like re.DOTALL so that the dot matches newlines. But you can always transform your listitems to a string and pass it to lxml.html (or BS) -- in fact 'yourfile.html' in the code I showed is a string.
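
A small sketch of what the `re.DOTALL` suggestion could look like, assuming the items are first joined into one string as described. The `wrapped` sample is made up purely to show the effect of the flag.

```python
import re

listitems = [
    r'<a href="\/other\/end\/f1\/738638\/adams">Adams<\/a>' + '\n',
    r'<a href="\/other\/end\/f1\/23138\/donovan-smith">Donovan Smith<\/a>' + '\n',
]

# Join the items into one string and collect every match at once.
text = ' '.join(listitems)
pattern = re.compile(r'<a.*?">(.*?)<\\/a>', re.DOTALL)
names = pattern.findall(text)
print(names)

# Made-up sample where the link text is split across two lines.
wrapped = r'<a href="x">Donovan' + '\n' + r'Smith<\/a>'

# Without re.DOTALL the dot stops at the newline, so nothing matches.
no_flag = re.findall(r'<a.*?">(.*?)<\\/a>', wrapped)
with_flag = pattern.findall(wrapped)
print(no_flag, with_flag)
```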

Ryflex Over a year ago

The only way I've found at the moment is: pattern1 = re.compile(r'\<a href\=\\\"\\/other\\/end\\/f1\\/.*?\\/.*?\\\"\>(.*?)\<\\\/a\>\\n') Which is not only hard to create, but really hard to read and messy :S

djas Over a year ago

@Hyflex -- unless your whole data looks different, my answer works, as shown.
