Extracting text using BeautifulSoup and Regex

Question

I want to extract some numbers from text files. A typical line looks like 074 N00AA00 623938, and I need to extract the number 623938. I'm using the code below, but it returns nothing:

url = 'https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_74s = soup.find_all(r'^(074\s[n|N].*\s)(\d*)*$')

I would appreciate your thoughts on the best way to extract the numbers.

The right regex string is ^(074\s[n|N].*\s)(\d*)$, without the last *.
– Frederick, Commented Aug 10, 2021 at 16:21
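
A note on why the original call likely returns nothing: find_all() treats a plain string argument as a tag name to match, not as a regex over the page text, so the pattern has to be applied with the re module instead. The following is a minimal sketch of the pattern from the comment, run against the single sample line from the question (the variable names are only illustrative):

```python
import re

# Sample line taken from the question.
line = "074 N00AA00 623938"

# Pattern from the comment: the trailing "*" after the digit group is dropped.
pattern = re.compile(r"^(074\s[n|N].*\s)(\d*)$")

match = pattern.match(line)
# Group 2 holds the trailing run of digits.
number = match.group(2) if match else None
print(number)  # 623938
```

This only checks one line in isolation. For a whole file, the pattern also needs the re.M flag so that ^ and $ match at every line boundary.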

Andrej Kesely · Accepted Answer · 2021-08-10 16:20:01Z


To get a correct response from the server, set the User-Agent HTTP header first.

Then select the text inside the <TEXT> tag from the soup and apply the regex to it:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt"
# Send a User-Agent header so the server returns the real document.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
# re.M makes ^ and $ match at each line; [nN] avoids also matching a literal "|".
all_74s = re.findall(
    r"^074\s+[nN].*?\s+(\d+)$", soup.find("text").text, flags=re.M
)
print(all_74s)

Prints:

['623938']
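
To try the pattern without hitting the server, here is a small offline sketch. The sample text is made up for illustration (only the 074 N00AA00 line comes from the question), and extract_074_n_values is a hypothetical helper name, not part of any library:

```python
import re

# Hypothetical sample; only the "074 N00AA00 623938" values come from the question.
SAMPLE = """\
073 C00AA00   0.0000
074 A00AA00        0
074 N00AA00   623938
075 A00AA00        0
"""

# re.M lets ^ and $ anchor at every line; \d+ requires at least one digit.
PATTERN = re.compile(r"^074\s+[nN].*?\s+(\d+)$", flags=re.M)


def extract_074_n_values(text):
    """Return the trailing number of every line that starts with '074 N...'."""
    return PATTERN.findall(text)


values = extract_074_n_values(SAMPLE)
print(values)  # ['623938']
```

Using \d+ rather than \d* means a line with no trailing number is skipped instead of yielding an empty string.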

