I want to extract some numbers from text files. A typical line looks like 074 N00AA00 623938, and I need to extract the number 623938. I'm using the code below, but it returns nothing:

url = 'https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_74s = soup.find_all(r'^(074\s[n|N].*\s)(\d*)*$')

I would appreciate your thoughts on the best way to extract the numbers.

    The right regex string is ^(074\s[n|N].*\s)(\d*)$ without the last * Commented Aug 10, 2021 at 16:21

1 Answer

To get a correct response from the server, set the User-Agent HTTP header first.

Then, from the soup select text from the <TEXT> tag and apply regex on it:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1000249/0001236835-11-000143.txt"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
# re.M lets ^ and $ match at each line; [nN] replaces [n|N], which would also match a literal "|"
all_74s = re.findall(
    r"^074\s+[nN].*?\s+(\d+)$", soup.find("text").text, flags=re.M
)
print(all_74s)

Prints:

['623938']
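
As for why the original code returned nothing: the first argument of find_all is matched against tag names, so a plain string is looked up as a tag name rather than applied as a regex to the document text. The pattern itself can be checked offline, without a network request. The snippet below is a minimal sketch: the sample text is made up for illustration (only the 074 N00AA00 623938 line comes from the question), extract_74n is a hypothetical helper name, and [nN] is used in place of [n|N] because inside a character class | is a literal character.

```python
import re

# Hypothetical sample text; only the "074 N00AA00 623938" line is from the question.
sample = "\n".join([
    "073 C00AA00 0.0000",
    "074 N00AA00 623938",
    "075 A00AA00 0",
])

# re.M makes ^ and $ match at the start and end of every line,
# so the pattern is tested against each line of the text.
PATTERN = re.compile(r"^074\s+[nN].*?\s+(\d+)$", flags=re.M)


def extract_74n(text):
    """Return the trailing number of every line that starts with '074 N' or '074 n'."""
    return PATTERN.findall(text)


print(extract_74n(sample))  # ['623938']
```

Because the pattern has a single capturing group, re.findall returns only the captured digits, not the whole line.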