I am writing a piece of code that requests a web page, parses it, and extracts certain information from it.

Everything works perfectly except for this part (edited):

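import re
import time

import requests

start_time = time.time()
r = requests.get(item_url)  # item_url: URL of the page to fetch
print(time.time() - start_time)  # cumulative: request
formatted_html = r.text.replace('=\r\n', '')
print(time.time() - start_time)  # cumulative: request + r.text + str.replace
formatted_html = re.sub('=\r\n', '', r.text)
print(time.time() - start_time)  # cumulative: all of the above + r.text + re.sub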

Output (seconds elapsed since start_time)

0.9731616973876953
1.9460444450378418
3.0275654792785645
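
One way to check whether the replacement itself is the bottleneck is to time it in isolation on a string of similar size. The sketch below is only an illustration: the synthetic string stands in for the real page, and strip_soft_breaks is a hypothetical helper name. Note that in the snippet above each timed step also evaluates r.text, which requests decodes again on every access, so the measured second may include decoding (and possibly encoding detection) rather than the replace alone.

```python
import time


def strip_soft_breaks(text: str) -> str:
    """Remove quoted-printable soft line breaks ('=' followed by CRLF)."""
    return text.replace('=\r\n', '')


# Synthetic stand-in for the downloaded page (assumption: the real page
# is of a similar size, 100k+ characters, with a soft break on each line).
sample = ('x' * 75 + '=\r\n') * 2000  # 156,000 characters

start = time.perf_counter()
cleaned = strip_soft_breaks(sample)
elapsed = time.perf_counter() - start

print(f'replace on {len(sample)} chars took {elapsed:.6f} s')
```

If this runs far faster than one second, the time is probably being spent elsewhere. Storing r.text in a variable once and timing that assignment separately would show whether decoding is the slow step.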

The text.replace call takes over 1 second to complete; it is used to fix a "quoted-printable" HTML string. There are also many web pages to cover (split across threads), so I'm trying to speed up this "fix". I also couldn't find a way to request the web page without it being quoted-printable.

Any ideas?

Edit: It takes a lot of time because the string is huge (over 100k characters).

Edit: I have tried re.sub, and str.split() followed by join; neither is significantly faster.
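
For reference, a possible alternative to removing the '=\r\n' sequences by hand is Python's standard quopri module, which decodes quoted-printable bytes in one pass. This is a minimal sketch, not a measured improvement: decode_quoted_printable is a hypothetical helper, the sample bytes are made up, and UTF-8 is assumed as the page encoding. With requests, the raw bytes would come from r.content instead of r.text.

```python
import quopri


def decode_quoted_printable(raw: bytes, encoding: str = 'utf-8') -> str:
    """Decode quoted-printable bytes, then decode the result to text.

    Assumption: the payload is UTF-8; pass another encoding if it is not.
    """
    return quopri.decodestring(raw).decode(encoding, errors='replace')


# Made-up sample: '=3D' is an escaped '=', '=C3=A9' is UTF-8 for 'é',
# and '=\r\n' is a soft line break inside the word "line".
raw = b'<p class=3D"intro">caf=C3=A9 and a long li=\r\nne</p>'
html = decode_quoted_printable(raw)
print(html)
```

Unlike the plain replace, this also turns escapes such as =3D back into the characters they encode, so the output differs from what replace('=\r\n', '') produces on the same input.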

  • Are you sure that .replace() takes 1+ sec? You can type print(r.elapsed) to see how much time the actual request took. Commented Jun 10, 2020 at 10:36
  • You can use multithreading if the threads are not dependent on each other. Commented Jun 10, 2020 at 10:38
  • @HarrisonSeow Since you are parsing web pages, which must involve a HUGE amount of text, the best solution for that is regex. Perhaps you might want to look at this question: stackoverflow.com/questions/4893506/… Commented Jun 10, 2020 at 10:39
  • @IvanVinogradov I've added the way I measured the time for the function (with only 1 thread). Commented Jun 10, 2020 at 10:40
  • @RushabhSudame I thought threading is multithreading already; is there any difference? t = threading.Thread(target=processMHT, kwargs=item) is what I'm using. Commented Jun 10, 2020 at 10:51
