This is the code I am using from Christopher Reeves's tutorial on stock scraping; it's his third video on the subject on YouTube.

import urllib
import re

symbolslist = ["aapl","spy","goog","nflx"]

i=0
while i<len(symbolslist):
    url = "http://finance.yahoo.com/q?s=" +symbolslist[i] +"&q1=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_'+symbolslist[i] +'">(.?+)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "The price of", symbolslist[i]," is", price
    i+=1

I get the following error when I run this code in Python 2.7.5:

Traceback (most recent call last):
  File "fundamentalism)stocks.py", line 12, in <module>
    pattern = re.compile(regex)
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

I don't know if the problem is with the way my library is installed, my version of Python, or something else. I appreciate your help.

2 Answers

The problem is in using two repeat characters in a row: ? followed by +, as in (.?+).

Non-greedy matching was probably meant instead: (.+?). As the Python re documentation puts it:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
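A small sketch of the difference, reusing the '<H1>title</H1>' string from the excerpt above. The span snippet and the price in the second half are made-up sample data standing in for the downloaded page, not real Yahoo output.

```python
import re

text = '<H1>title</H1>'

# Greedy: .* runs on to the last '>' in the string.
greedy = re.findall(r'<.*>', text)

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.findall(r'<.*?>', text)

print(greedy)
print(lazy)

# The same idea applied to the question's pattern, on a hypothetical snippet.
sample_html = '<span id="yfs_l84_aapl">431.77</span><span>other</span>'
prices = re.findall(r'<span id="yfs_l84_aapl">(.+?)</span>', sample_html)
print(prices)
```

With a greedy (.+) in the last pattern, the capture would run on to the final </span> in the sample and pick up the markup in between.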

2 Comments

Thanks for the help. That makes sense. Where could I find out more about that?
Well, here, here, here etc :)

Others have answered about the greedy match, but on an unrelated note you'll want to write that more like:

for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&q1=1" % symbol
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_%s">(.+?)</span>' % symbol
    price = re.findall(regex, htmltext)[0]
    print "The price of", symbol," is", price
  • The standard Python idiom is to iterate across all the values in a list, not to pick them out by index.
  • "String interpolation" is a lot easier to manage than string concatenation, especially if you're adding several values into the mix (like maybe you want to specify the value of q1 in a later version).
  • re.findall accepts a pattern string directly as its first argument. Explicitly compiling a pattern and then throwing it away on the next loop iteration doesn't get you anything.
  • re.findall returns a list, and you only want the first element from it.
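The points above can be tried without a network connection. This is only a sketch: extract_price is a hypothetical helper, the pages dictionary holds invented HTML fragments in place of the real download, and the re.escape call and the empty-result check are additions that the loop above does not have.

```python
import re

def extract_price(symbol, htmltext):
    """Return the first captured price for symbol, or None if nothing matched."""
    # String interpolation builds the pattern; re.escape guards against
    # symbols that contain regex metacharacters.
    regex = '<span id="yfs_l84_%s">(.+?)</span>' % re.escape(symbol)
    matches = re.findall(regex, htmltext)
    # findall returns a list; indexing [0] on an empty list would raise IndexError.
    return matches[0] if matches else None

# Hypothetical page fragments standing in for the downloaded HTML.
pages = {
    "aapl": '<span id="yfs_l84_aapl">431.77</span>',
    "spy": '<div>no quote here</div>',
}

for symbol in ["aapl", "spy"]:
    print("The price of %s is %s" % (symbol, extract_price(symbol, pages[symbol])))
```

Returning None for a missing match is one possible design choice; it keeps the loop running when a page's markup does not contain the expected span.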
