Use a regular expression to extract text from HTML code in Python

Question

I have a body of HTML scraped from a website using BeautifulSoup. I want to use a regular expression in Python to extract one URL path from it. Here is a portion of the HTML:

<link rel="stylesheet" type="text/css" href="/include/xbrlViewerStyle.css">
<style type="text/css">li.octave {border-top: 1px solid black;}</style>
<!--[if lt IE 8]>
<style type="text/css">
li.accordion a {display:inline-block;}
li.accordion a {display:block;}
</style>
<![endif]-->
<script type="text/javascript" language="javascript">
var InstanceReportXslt = "/include/InstanceReport.xslt";
var reports = new Array(161);
reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
reports[1+1] = "/Archives/edgar/data/49196/000004919618000008/R2.htm";
reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";
reports[5+1] = "/Archives/edgar/data/49196/000004919618000008/R6.htm";
reports[6+1] = "/Archives/edgar/data/49196/000004919618000008/R7.htm";
reports[7+1] = "/Archives/edgar/data/49196/000004919618000008/R8.htm";
reports[8+1] = "/Archives/edgar/data/49196/000004919618000008/R9.htm";
reports[9+1] = "/Archives/edgar/data/49196/000004919618000008/R10.htm";
reports[10+1] = "/Archives/edgar/data/49196/000004919618000008/R11.htm"

I want a regular expression that identifies "R4" and extracts the full path "/Archives/edgar/data/49196/000004919618000008/R4.htm".

Just R4 lines?

– Mauro Baraldi, Commented Dec 12, 2018 at 15:26
And what did you try?

– Jongware, Commented Dec 12, 2018 at 16:25

Dmitry · Accepted Answer · 2018-12-12 15:19:48Z

You can use this expression:

>>> import re
>>> s = '''reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
... reports[1+1] = "/Archives/edgar/data/49196/000004919618000008/R2.htm";
... reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
... reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
... reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";
... reports[5+1] = "/Archives/edgar/data/49196/000004919618000008/R6.htm";
... reports[6+1] = "/Archives/edgar/data/49196/000004919618000008/R7.htm";
... reports[7+1] = "/Archives/edgar/data/49196/000004919618000008/R8.htm";'''
>>> for i in re.findall(r'([\w./]+R4[\w./]+)', s):
...     print(i)
... 
/Archives/edgar/data/49196/000004919618000008/R4.htm


7 Comments

Mauro Baraldi Over a year ago

If you want to match the URLs on all lines, replace R4 with R[0-9].
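A minimal sketch of that substitution, assuming the same pattern as in the answer and three sample lines copied from the question (the variable names `s` and `all_reports` are only illustrative):

```python
import re

# Sample lines copied from the question's script block.
s = '''reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
reports[1+1] = "/Archives/edgar/data/49196/000004919618000008/R2.htm";
reports[9+1] = "/Archives/edgar/data/49196/000004919618000008/R10.htm";'''

# R[0-9] accepts any single digit after "R"; the trailing [\w./]+
# then consumes any further digits and the ".htm" suffix.
all_reports = re.findall(r'([\w./]+R[0-9][\w./]+)', s)
for url in all_reports:
    print(url)
```

Because the trailing class also accepts digits, two-digit report numbers such as R10 are picked up as well.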

A. Ryan Over a year ago

That works perfectly. Can you briefly explain the regular expression?

Dmitry Over a year ago

@A.Ryan this regular expression consists of three parts: the literal R4 in the middle and [\w./]+ on either side. R4 must match exactly. [\w./]+ matches one or more (+) characters, each of which is a word character (for ASCII text, [a-zA-Z0-9_]), a dot, or a slash.
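The same three parts can be spelled out with `re.VERBOSE`, which lets the pattern carry inline comments. This is only a sketch for illustration; the names `pattern`, `line`, and `match` are not from the answer:

```python
import re

pattern = re.compile(r'''
    (              # capture group, returned by group(1)
      [\w./]+      # one or more word characters, dots, or slashes
      R4           # the literal text R4
      [\w./]+      # one or more of the same characters (here: .htm)
    )
''', re.VERBOSE)

# One line copied from the question's HTML.
line = 'reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";'

match = pattern.search(line)
print(match.group(1))
```

The surrounding quotes and the semicolon are left out of the match because they are not in the character class.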

Dmitry Over a year ago

@A.Ryan for the second question: you can explicitly define which character may follow R4 and modify the regular expression accordingly, for example: r'([\w./]+R4[a-zA-Z_./][\w./]*)'. Here the character right after R4 cannot be a digit.
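To see what the stricter pattern changes, here is a small comparison. The R40 line is an assumption added for illustration (it is not in the excerpt shown in the question), and the names `loose` and `strict` are hypothetical:

```python
import re

# First line is from the question; the R40 line is a made-up example
# in the same format, used only to show the difference.
s = '''reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[39+1] = "/Archives/edgar/data/49196/000004919618000008/R40.htm";'''

# Original pattern: any allowed character, including a digit, may follow R4.
loose = re.findall(r'([\w./]+R4[\w./]+)', s)

# Modified pattern: the character right after R4 must not be a digit.
strict = re.findall(r'([\w./]+R4[a-zA-Z_./][\w./]*)', s)

print(len(loose))   # both lines match
print(strict)       # only the R4.htm path remains
```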

A. Ryan Over a year ago

That makes sense. Is there a way to specify that I want to capture the entire line? I.e., rather than capturing just "/Archives/edgar/data/49196/000004919618000008/R4.htm", it would capture "reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm"; "?
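One possible approach, offered as a sketch and not as the answerer's method: anchor the pattern to line boundaries with `re.MULTILINE` so the whole line containing R4.htm is returned. The pattern and the name `whole_lines` are assumptions made for this example:

```python
import re

# Three lines copied from the question's HTML.
s = '''reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";'''

# With re.MULTILINE, ^ and $ match at the start and end of each line.
# "." does not match a newline, so ".*" stays within a single line.
whole_lines = re.findall(r'^.*/R4\.htm.*$', s, re.MULTILINE)
for line in whole_lines:
    print(line)
```

Since the pattern has no capture group, `re.findall` returns each full match, which here is the complete line.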
