Retrieving a string using REGEX in Python 2.7.2

Question

I have the following code snippet from page source:

var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder");

The string 'PDFObject(' is unique on the page. I want to retrieve the url value using a regex. In this case I need to get

http://www.site.com/doc55.pdf

Please help.

Regex should work pretty well for this.

– Smandoli, commented Jul 4, 2013 at 20:32

perreal · Accepted Answer · 2013-07-04 21:29:06Z

3

Here is an alternative for solving your problem without using regex:

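url, in_object = None, False
with open('input') as f:
    for line in f:
        # Ignore everything until the unique 'PDFObject(' marker shows up.
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            # The URL is the first double-quoted chunk on that line.
            url = line.split('"')[1]
            break
print(url)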

edited Jul 4, 2013 at 21:29

answered Jul 4, 2013 at 20:48

perreal


3 Comments

Stephan Over a year ago

Why on earth can't OPs be helped with the tools they want to use? There is always someone to tell them "Hey dude! That's not HOW to think about it..."

perreal Over a year ago

You picked this one out from among all those regex answers?

l4mpi Over a year ago

Good answer, I agree that regex is not the best tool for this. But you should probably break the loop after finding the url (or just put the code into a function and return), otherwise you could have false positives if other lines contain "url:".
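
The function-and-return variant suggested in the comment above might look like the sketch below. The name find_pdf_url and the sample lines are made up for illustration; any iterable of lines, such as an open file, should work.

```python
def find_pdf_url(lines):
    """Return the first url: value found after 'PDFObject(', or None."""
    in_object = False
    for line in lines:
        # Start paying attention once the unique marker has been seen.
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            # The URL is the first double-quoted chunk on the line.
            return line.split('"')[1]
    return None


# Hypothetical page: the first line has a url: that must be ignored.
page = [
    'var other = { url: "http://www.site.com/ignored.pdf" };',
    'var myPDF = new PDFObject({',
    'url: "http://www.site.com/doc55.pdf",',
    '  id: "pdfObjectContainer",',
]
print(find_pdf_url(page))
```

Returning from the function stops the scan at the first hit, so later lines containing "url:" cannot overwrite the result.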

Floris · Accepted Answer · 2013-07-04 21:18:26Z

0

To find something that appears on a line after something else, the pattern has to be able to match across newlines. By default the dot does not match a newline; the DOTALL modifier, a flag passed when the pattern is compiled, changes that.

Thus the following code works:

import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''

print(r.findall(s))

Explanation:

r = re.compile(         compile regular expression
    r'                  raw string literal, so backslashes reach the regex engine unchanged
    (?<=PDFObject)      the match I want happens right after PDFObject
    .*?                 then there may be some other characters...
    url:                followed by the string url:
    .*?                 then match as little as possible (`?` : non-greedy match) until the first instance of
    (http.*?)"          a string starting with http, captured up to (but not including) the first "
    ',                  end of regex string, but there's more...
    re.DOTALL)          set the DOTALL flag - this means the dot matches all characters
                        including newlines. This allows the match to continue from one line
                        to the next in the .*? right after the lookbehind
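
To see what the flag changes, the same pattern can be run with and without re.DOTALL. This is a minimal sketch; the shortened sample string and the variable names are assumptions made for the example.

```python
import re

s = 'new PDFObject({\nurl: "http://www.site.com/doc55.pdf",\n  id: "x"\n})'
pattern = r'(?<=PDFObject).*?url:.*?(http.*?)"'

# Without DOTALL the dot stops at the newline after "PDFObject({", so
# the pattern cannot reach the url: line and nothing is found.
without_flag = re.findall(pattern, s)

# With DOTALL the dot also matches newlines, so the match can span lines.
with_flag = re.findall(pattern, s, re.DOTALL)

print(without_flag)
print(with_flag)
```

The first call should come back empty and the second should contain the URL, which is the whole reason the flag is needed here.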

edited Jul 4, 2013 at 21:18

answered Jul 4, 2013 at 21:02

Floris


2 Comments

Ash Over a year ago

Thanks a lot Floris. Your code is the shortest and it works just fine:)

Floris Over a year ago

Glad it worked for you. Was an opportunity for me to figure out the re.DOTALL thing... I knew it existed, had not used it, this was my chance to learn about it. So we both came out ahead.

iruvar · Accepted Answer · 2013-07-04 20:41:20Z

0

Using a combination of look-behind and look-ahead assertions (s holds the page source):

>>> import re
>>> re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'

answered Jul 4, 2013 at 20:41

iruvar



dawg · Accepted Answer · 2013-07-04 21:07:51Z

0

This works:

import re

src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''   

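# Take everything after 'PDFObject({', then collect each line starting with url: (case-insensitive).
body = re.search(r'PDFObject\(\{(.*)', src, re.M | re.S | re.I).group(1)
print([m.group(1).strip('"') for m in
       re.finditer(r'^url:\s*(.*)[\W]$', body, re.M | re.I)])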

prints:

['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']

edited Jul 4, 2013 at 21:07

answered Jul 4, 2013 at 20:48

dawg



Community · Accepted Answer · 2020-06-20 09:12:55Z

0

Regex

new\s+PDFObject\(\{\s*url:\s*"[^"]+"
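
The pattern above matches the whole prefix from new through the closing quote. To get just the URL in Python, one option is to add a capturing group around the quoted part. The group and the variable names below are additions for illustration, not part of the original pattern.

```python
import re

page = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

# Same pattern, with parentheses added around the quoted text.
match = re.search(r'new\s+PDFObject\(\{\s*url:\s*"([^"]+)"', page)
url = match.group(1) if match else None
print(url)
```

Because \s also matches newlines, no DOTALL flag is needed for this form, but it does assume that url: is the first key inside the object.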


edited Jun 20, 2020 at 9:12

CommunityBot


answered Jul 4, 2013 at 21:04

Stephan


1 Comment

Floris Over a year ago

This doesn't address the "after PDFObject" part. There will be other instances of url: "http:.*" on the page - OP wants a specific one.

noirbizarre · Accepted Answer · 2013-07-04 21:35:22Z

If 'PDFObject(' is the unique identifier in the page, you only have to match the first quoted string that follows it.

Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write:

import re

snippet = '''                                    
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder");
'''

# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)

# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)

# Flags are fixed at compile time; search()'s second argument is a start position, not flags.
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')
# result for both: 'http://www.site.com/doc55.pdf'

If you don't want to compile your regex because it's only used once, simply use this syntax:

re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')

Four choices, one of which should match your needs and taste!
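
One detail worth knowing when using compiled patterns: the second positional argument of a compiled pattern's search() is a start position, not a flags value, so flags belong in re.compile(). The sketch below, with a made-up page string and the hypothetical name QUOTED, uses that start position to begin scanning at the PDFObject( marker.

```python
import re

# Hypothetical page with an unrelated quoted URL before the marker.
page = ('var a = {url: "http://www.site.com/other.pdf"};\n'
        'var myPDF = new PDFObject({\n'
        'url: "http://www.site.com/doc55.pdf"});')

QUOTED = re.compile(r'"(.*?)"')

# Find the unique marker first, then search for the next quoted string
# starting from that offset (pos), skipping anything earlier on the page.
start = page.find('PDFObject(')
match = QUOTED.search(page, start) if start != -1 else None
url = match.group(1) if match else None
print(url)
```

This keeps the regex itself trivial and leaves the "after PDFObject(" requirement to a plain string search.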

luke · Accepted Answer · 2013-07-05 01:40:08Z

0

Although the other answers may appear to work, most do not take into account that the only unique thing on the page is 'PDFObject('. A much better regular expression would be the following:

PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",

It takes into account that 'PDFObject(' is unique and contains some basic URL verification.

Below is an example of how this regex could be used in Python:

>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
...   id: "pdfObjectContainer",
...   width: "100%",
...   height: "700px",
...   pdfOpenParams: {
...     navpanes: 0,
...     statusbar: 1,
...     toolbar: 1,
...     view: "FitH"
...   }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'

A pure Python (no regex) alternative would be:

>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'

No-regex one-liner:

>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'
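
Note that str.find returns -1 when the marker is missing, in which case the slicing above quietly returns the wrong text. A more defensive sketch using str.partition is shown below. The helper name url_after_marker is invented for this example, and it simply takes the first double-quoted string after the marker rather than requiring the exact 'PDFObject({\nurl: "' layout.

```python
def url_after_marker(text, marker='PDFObject('):
    """Return the first double-quoted string after marker, or None."""
    # partition returns (before, sep, after); sep is '' if marker is absent.
    _, found, rest = text.partition(marker)
    if not found:
        return None
    parts = rest.split('"', 2)
    # parts[1] is the text between the first pair of quotes, if there is one.
    return parts[1] if len(parts) == 3 else None


sample = 'var myPDF = new PDFObject({\nurl: "http://www.site.com/doc55.pdf",\n  id: "x"\n});'
print(url_after_marker(sample))
```

Returning None for a missing marker or an unclosed quote makes the failure visible to the caller instead of producing a misleading substring.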

edited Jul 5, 2013 at 1:40

answered Jul 4, 2013 at 20:46

luke


4 Comments

Floris Over a year ago

Not sure the link validation is needed, but I appreciate that matching http: in my example was going one character too far, as it would skip any https: links - I have modified my answer, and thanks. Does your regex permit all legal links (even ones with URL encoded queries attached)? It's a bit hard to be sure...

luke Over a year ago

@Floris yes this regex accepts all links, even ones with URL encoded queries, given their protocol is either http or https.

Floris Over a year ago

That's cool - I will keep a copy, might come in handy. Of your own making, or did you find it somewhere?

luke Over a year ago

I think I found it somewhere, can't remember where though. I used it in one of my projects a while back, just copied it out of there for this.
