I am using BeautifulSoup in Python for web scraping. The names in the text on the website are written in a red font, and I need to capture that color information. I am using the website's text as training data for NER (proper names only). How can I get the color code using BeautifulSoup? At the moment my code looks like this:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.islamweb.net/ar/library/index.php?page=bookcontents&idfrom=1&idto=272&bk_no=86&ID=2')
soup = BeautifulSoup(req.text, 'html.parser')

print(soup.get_text())

  • Can you share the URL? Commented Aug 7, 2021 at 22:00
  • Just added the URL. The website is in Arabic. Commented Aug 7, 2021 at 22:01
  • The text is in Arabic. Do you need to extract the text that is in red? Commented Aug 7, 2021 at 22:03
  • I need to extract the main text of the page, meaning everything in those couple of paragraphs. I just need the red segments to carry some sort of tag so that I can then manually turn the whole text into my training data. Commented Aug 7, 2021 at 22:09

1 Answer

I hope I've understood your question correctly. This script extracts all text from the main body and strips the markup; only the red text portions are wrapped in <TAG>...</TAG>:

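import requests
from bs4 import BeautifulSoup, NavigableString

url = "https://www.islamweb.net/ar/library/index.php?page=bookcontents&idfrom=1&idto=272&bk_no=86&ID=2"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The main text lives in the element with id="pagebody".
body = soup.select_one("#pagebody")

# Remove spans whose inline style contains "none" (e.g. display:none).
for tag in body.find_all(
    lambda tag: tag.name == "span" and "none" in tag.get("style", "")
):
    tag.extract()

# Unwrap every tag except those with class "names" (the red text),
# so that only text nodes and the .names elements remain.
for tag in body.select(":not(.names)"):
    tag.unwrap()

out = []
for c in body.contents:
    if isinstance(c, NavigableString):
        # Plain text: keep it if it is not just whitespace.
        c = c.strip()
        if c:
            out.append(c)
    else:
        # A remaining .names element: wrap its text in <TAG>.
        out.append("<TAG>{}</TAG>".format(c.get_text(strip=True)))

print(" ".join(out))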

Prints:

1 [ ص: 7 ] بسم الله الرحمن الرحيم وصلى الله على سيدنا محمد وآله وصحبه وسلم أخبرنا الإمام الحافظ <TAG>أبو القاسم سليمان بن أحمد بن أيوب اللخمي الطبراني</TAG> - رحمه الله - قال : هذا أول كتاب فوائد مشائخي الذين كتبت عنهم بالأمصار ، خرجت عن كل واحد منهم حديثا واحدا وجعلت أسماءهم على حروف المعجم . باب الألف من اسمه أحمد حدثنا <TAG>أحمد بن عبد الوهاب بن نجدة الحوطي أبو عبد الله</TAG> بمدينة جبلة سنة تسع وسبعين ومائتين ، حدثنا جنادة بن مروان الأزدي [ ص: 8 ] الحمصي ، حدثنا <TAG>مبارك بن فضالة</TAG> ، عن الحسن ، عن <TAG>أنس بن مالك</TAG> - رضي الله عنه - قال : قال رسول الله - صلى الله عليه وآله وسلم - : " سألت ربي - عز وجل - ثلاث خصال فأعطاني اثنتين ومنعني واحدة ، سألته أن لا يسلط على أمتي عدوا من غيرهم فأعطانيها ، وسألته أن لا يقتل أمتي بالسنة فأعطانيها ، وسألته أن لا يلبسهم شيعا فأبى علي " . لم يروه عن <TAG>مبارك بن فضالة</TAG> إلا جنادة .
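
To try the same idea without hitting the network, here is a minimal sketch that runs on an inline HTML string. The sample markup, the function name tag_marked_text, and its parameters are made up for illustration; only the #pagebody id and the names class come from the script above, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup standing in for the real page. Only the
# "pagebody" id and the "names" class are taken from the answer above.
SAMPLE_HTML = """
<div id="pagebody">
  Narrated by <span class="names" style="color: #ff0000">Anas ibn Malik</span>,
  who said: <b>text</b> <span style="display:none">hidden</span> end.
</div>
"""


def tag_marked_text(html, marker_class="names", label="TAG"):
    """Return the body text with marker_class elements wrapped in <label>."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select_one("#pagebody")

    # Drop spans hidden through an inline style such as display:none.
    for hidden in body.find_all(
        lambda t: t.name == "span" and "none" in t.get("style", "")
    ):
        hidden.extract()

    # Unwrap every other tag so only text and marked elements remain.
    for tag in body.select(f":not(.{marker_class})"):
        tag.unwrap()

    parts = []
    for node in body.contents:
        if isinstance(node, NavigableString):
            # Collapse runs of whitespace and skip empty text nodes.
            text = " ".join(node.split())
            if text:
                parts.append(text)
        else:
            parts.append(f"<{label}>{node.get_text(strip=True)}</{label}>")
    return " ".join(parts)


print(tag_marked_text(SAMPLE_HTML))
```

Note that BeautifulSoup only parses markup and does not evaluate CSS, so a color that is set in a stylesheet is not available as a value on the tag. Selecting the elements by their class, as done here, is one way around that.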
1 Comment

Great. This is the exact thing I was looking for!