
I am using Beautiful Soup and requests to pull down information from a webpage. I am trying to get a list of book titles that contains just the titles, without the text title= in front of each title.

Example text = 'a bunch of junk title=book1 more junk text title=book2'

What I am getting is titleList = ['title=book1', 'title=book2']

I want titleList = ['book1', 'book2']

I have tried matching groups, and that does break title= and book1 apart, but I am not sure how to append just group(2) to the list.

titleList = []

def getTitle(productUrl):

  res = requests.get(productUrl, headers=headers)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'lxml')
  title = re.compile(r'title=[A-Za-z0-9]+')
  findTitle = title.findall(res.text.strip())
  titleList.append(findTitle)
  • Can you post an example of the HTML that you are working with? Commented Dec 12, 2016 at 14:35
  • Is this really a BeautifulSoup question? You don't actually use soup object. Commented Dec 12, 2016 at 14:41
  • The question is: why are you using BeautifulSoup? Commented Dec 12, 2016 at 15:13

2 Answers


Your regex has no capture groups, so findall returns each whole match, including the title= prefix. Note also that findall returns a list, so you should use extend instead of append (unless you want titleList to be a list of lists).

title = re.compile(r'title=([A-Za-z0-9]+)')   # note the parentheses
findTitle = title.findall(res.text.strip())
titleList.extend(findTitle)   # using extend and not append

A stand-alone example:

import re

titleList = []
text = 'a bunch of junk title=book1 more junk text title=book2'

title = re.compile(r'title=([A-Za-z0-9]+)') 
findTitle = title.findall(text.strip())
titleList.extend(findTitle) 
print(titleList)
>> ['book1', 'book2']
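
To make the append/extend difference concrete, here is a small sketch using the same sample text. The names appended and extended are made up for this illustration only:

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'
matches = re.findall(r'title=([A-Za-z0-9]+)', text)

appended = []
appended.append(matches)   # adds the whole list as a single element

extended = []
extended.extend(matches)   # adds each match as its own element

print(appended)   # [['book1', 'book2']]
print(extended)   # ['book1', 'book2']
```

If getTitle is called once per product page, extend keeps titleList flat across calls, whereas append would give you one nested list per page.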

1 Comment

Thanks so much. In all of my searching I did not find the extend option, or that I only needed to add the capture group. I just needed a second pair of eyes.

Using re.findall with a capture group will do it:

>>> import re
>>> text = 'a bunch of junk title=book1 more junk text title=book2'
>>> re.findall(r'title=(\S+)', text)
['book1', 'book2']
>>>
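
If you would rather keep two groups, as the question describes with group(2), something along these lines should also work. This is only a sketch: the two-group pattern is my guess at what was tried, since the question does not show it. With more than one group, findall returns tuples, so finditer plus group(2) is one way to pick out just the title:

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

# Assumed two-group pattern: group(1) is the prefix, group(2) is the title.
pattern = re.compile(r'(title=)([A-Za-z0-9]+)')

# findall with two groups yields one tuple per match.
pairs = pattern.findall(text)
print(pairs)       # [('title=', 'book1'), ('title=', 'book2')]

# finditer yields match objects, so group(2) can be selected directly.
titleList = [m.group(2) for m in pattern.finditer(text)]
print(titleList)   # ['book1', 'book2']
```

Note that \S+ grabs everything up to the next whitespace, so it would also pick up trailing punctuation or quotes, while [A-Za-z0-9]+ stops at the first non-alphanumeric character. Which one fits depends on what the real page looks like.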
