I have HTML scraped from a website using BeautifulSoup. I want to use regular expressions in Python to extract a URL from that HTML. Here is a portion of it:

<link rel="stylesheet" type="text/css" href="/include/xbrlViewerStyle.css">
<style type="text/css">li.octave {border-top: 1px solid black;}</style>
<!--[if lt IE 8]>
<style type="text/css">
li.accordion a {display:inline-block;}
li.accordion a {display:block;}
</style>
<![endif]-->
<script type="text/javascript" language="javascript">
var InstanceReportXslt = "/include/InstanceReport.xslt";
var reports = new Array(161);
reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
reports[1+1] = "/Archives/edgar/data/49196/000004919618000008/R2.htm";
reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";
reports[5+1] = "/Archives/edgar/data/49196/000004919618000008/R6.htm";
reports[6+1] = "/Archives/edgar/data/49196/000004919618000008/R7.htm";
reports[7+1] = "/Archives/edgar/data/49196/000004919618000008/R8.htm";
reports[8+1] = "/Archives/edgar/data/49196/000004919618000008/R9.htm";
reports[9+1] = "/Archives/edgar/data/49196/000004919618000008/R10.htm";
reports[10+1] = "/Archives/edgar/data/49196/000004919618000008/R11.htm"

I want a regular expression that identifies "R4" and extracts the full path "/Archives/edgar/data/49196/000004919618000008/R4.htm".

  • Just R4 lines? Commented Dec 12, 2018 at 15:26
  • And what did you try? Commented Dec 12, 2018 at 16:25

1 Answer

You can use this expression:

>>> import re
>>> s = '''reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
... reports[1+1] = "/Archives/edgar/data/49196/000004919618000008/R2.htm";
... reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
... reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
... reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";
... reports[5+1] = "/Archives/edgar/data/49196/000004919618000008/R6.htm";
... reports[6+1] = "/Archives/edgar/data/49196/000004919618000008/R7.htm";
... reports[7+1] = "/Archives/edgar/data/49196/000004919618000008/R8.htm";'''
>>> for i in re.findall(r'([\w./]+R4[\w./]+)', s):
...     print(i)
... 
/Archives/edgar/data/49196/000004919618000008/R4.htm

7 Comments

If you want to match the URLs on all lines, replace R4 with R[0-9].
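A minimal sketch of that substitution, assuming the script text is held in a string named `s` as in the answer; the two-line sample below is abbreviated for illustration:

```python
import re

# Abbreviated sample of the scraped script text (two lines only).
s = '''reports[0+1] = "/Archives/edgar/data/49196/000004919618000008/R1.htm";
reports[9+1] = "/Archives/edgar/data/49196/000004919618000008/R10.htm";'''

# R[0-9] matches "R" followed by one digit; the trailing [\w./]+ absorbs
# any further digits and the ".htm" extension, so R10 is matched as well.
urls = re.findall(r'([\w./]+R[0-9][\w./]+)', s)
for url in urls:
    print(url)
```

Because `\w` includes digits, a single `[0-9]` after `R` is enough here; multi-digit report numbers are picked up by the class that follows.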
That works perfectly. Can you briefly explain the regular expression?
@A.Ryan The expression has three parts: a literal R4 in the middle, with [\w./]+ on either side. R4 matches exactly those two characters. [\w./]+ matches one or more (+) characters, each of which is a word character (\w, equivalent to [a-zA-Z0-9_] for ASCII text), a dot, or a slash.
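@A.Ryan For the second question: you can explicitly restrict which character may follow R4 by modifying the regular expression, for example: r'([\w./]+R4[a-zA-Z_./][\w./]*)'. Here the character right after R4 must be a letter, underscore, dot, or slash, so a digit is not allowed.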
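To illustrate the difference between the two patterns, here is a small sketch. The R40 line is a hypothetical addition that follows the same format as the question's data; it does not appear in the excerpt shown there.

```python
import re

# The second line (R40) is a made-up sample in the same format as the question's data.
s = '''reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[39+1] = "/Archives/edgar/data/49196/000004919618000008/R40.htm";'''

# Original pattern: [\w./]+ after R4 accepts digits, so R40.htm also matches.
loose = re.findall(r'([\w./]+R4[\w./]+)', s)

# Modified pattern: the character right after R4 may not be a digit.
strict = re.findall(r'([\w./]+R4[a-zA-Z_./][\w./]*)', s)

print(loose)
print(strict)
```

With this input, `loose` contains both paths while `strict` contains only the R4.htm path.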
That makes sense. Is there a way to capture the entire line? That is, rather than capturing only "/Archives/edgar/data/49196/000004919618000008/R4.htm", it would capture the whole statement: reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
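One possible way to capture the whole line, offered as a sketch and not as the answerer's method: anchor the pattern to line boundaries with `re.MULTILINE`. The pattern `/R4\.htm` is an assumption about what identifies the target line, and the sample string is abbreviated.

```python
import re

# Abbreviated sample of the scraped script text.
s = '''reports[2+1] = "/Archives/edgar/data/49196/000004919618000008/R3.htm";
reports[3+1] = "/Archives/edgar/data/49196/000004919618000008/R4.htm";
reports[4+1] = "/Archives/edgar/data/49196/000004919618000008/R5.htm";'''

# With re.MULTILINE, ^ and $ match at the start and end of each line.
# "." does not match a newline by default, so each match stays on one line.
lines = re.findall(r'^.*/R4\.htm.*$', s, flags=re.MULTILINE)
for line in lines:
    print(line)
```

Since the pattern has no capturing group, `re.findall` returns the full text of each matching line.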