Python regex match but not include characters beautiful soup

Question

I am using Beautiful Soup and requests to pull down information from a webpage. I am trying to get a list of book titles that contains just the titles and does not include the text title= in front of each title.

Example text = 'a bunch of junk title=book1 more junk text title=book2'

what I am getting is titleList = ['title=book1', 'title=book2']

I want titleList = ['book1', 'book2']

I have tried matching groups and that does break the words title= and book1 apart but I am not sure how to append just group(2) to the list.

titleList = []

def getTitle(productUrl):

  res = requests.get(productUrl, headers=headers)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'lxml')
  title = re.compile(r'title=[A-Za-z0-9]+')
  findTitle = title.findall(res.text.strip())
  titleList.append(findTitle)

Can you post an example of the HTML that you are working with? – Stats4224, Commented Dec 12, 2016 at 14:35
Is this really a BeautifulSoup question? You don't actually use the soup object. – alecxe, Commented Dec 12, 2016 at 14:41

DeepSpace · Accepted Answer · 2016-12-12 14:37:14Z

4

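Your regex has no capture groups, so findall returns the whole match, including title=. When the pattern contains a single capture group, findall returns only the text captured by that group. Note also that findall returns a list, so you should use extend instead of append (unless you want titleList to be a list of lists).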

title = re.compile(r'title=([A-Za-z0-9]+)')   # note the parentheses
findTitle = title.findall(res.text.strip())
titleList.extend(findTitle)   # using extend and not append

A stand-alone example:

import re

titleList = []
text = 'a bunch of junk title=book1 more junk text title=book2'

title = re.compile(r'title=([A-Za-z0-9]+)') 
findTitle = title.findall(text.strip())
titleList.extend(findTitle) 
print(titleList)   # prints ['book1', 'book2']
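
The question also asked how to append just one group to the list. As a minimal sketch using the same sample text, the following compares append, extend, and a finditer loop that appends group(1) per match. The names nested, flat, and by_group are made up for illustration. Since the pattern has only one capture group, the captured title is group(1), not group(2).

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'
pattern = re.compile(r'title=([A-Za-z0-9]+)')

# append adds the whole findall result as a single nested element
nested = []
nested.append(pattern.findall(text))

# extend adds each captured title individually
flat = []
flat.extend(pattern.findall(text))

# finditer yields match objects, so group(1) can be appended one at a time
by_group = []
for match in pattern.finditer(text):
    by_group.append(match.group(1))

print(nested)    # [['book1', 'book2']]
print(flat)      # ['book1', 'book2']
print(by_group)  # ['book1', 'book2']
```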


1 Comment

turtle02 Over a year ago

Thanks so much. In all of my searching I did not find the extend option, or the fix of just adding the capture group. I just needed a second pair of eyes.

Jeremy Jones · Accepted Answer · 2016-12-12 14:55:44Z

2

Using re.findall with a capture group will do it:

>>> import re
>>> text = 'a bunch of junk title=book1 more junk text title=book2'
>>> re.findall(r'title=(\S+)', text)
['book1', 'book2']
>>>
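
One difference from the character-class pattern is worth noting: \S+ captures every non-whitespace character, so punctuation that follows a title is included. The sketch below uses a made-up input string (not from the question) to show how the two patterns diverge. Which one fits depends on what the real page text looks like.

```python
import re

# Hypothetical input: titles followed directly by punctuation
text = 'junk title=book-1, more junk title=book2.'

# The character class stops at the first non-alphanumeric character
strict = re.findall(r'title=([A-Za-z0-9]+)', text)

# \S+ keeps going until the next whitespace or the end of the string
greedy = re.findall(r'title=(\S+)', text)

print(strict)  # ['book', 'book2']
print(greedy)  # ['book-1,', 'book2.']
```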
