
Please take a look at the code of my parser. It collects statistics from web pages, accessing them in a loop, and stores selected records in an SQLite3 database.

Everything works until line 87 (the SQL statement), where the process consumes all CPU resources and effectively hangs.

File "./parser.py", line 86, in while (j < i):

The database file is created at the beginning of the code with the correct structure, so the problem is in the loops. The inner block of the main loop, for season in season_list:, works just fine. Here is the whole script:

#!/usr/bin/env python
from bs4 import BeautifulStoneSoup
from urllib2 import urlopen
import re
import sqlite3
from time import gmtime, strftime

# Print start time 
print "We started at ", strftime("%Y-%m-%d %H:%M:%S", gmtime())

# Create DB
print "Trying to create DB"
con = sqlite3.connect('england.db')
cur = con.cursor()
sql = """\
CREATE TABLE english_premier_league (
    id_match INTEGER PRIMARY KEY AUTOINCREMENT,
    season TEXT,
    tour INTEGER,
    date TEXT,
    home TEXT,
    visitor TEXT,
    home_score INTEGER,
    visitor_score INTEGER
    );
"""
try:
    cur.executescript(sql)
except sqlite3.DatabaseError as err:
    print "Error creating database: ", err
else:
    print "Successfully created your database..."
    con.commit()
    cur.close()
    con.close()

# list of variables
postfix = 2011
threshold = 1999
season_list = []
while postfix >= threshold:
    end = (postfix + 1) % 2000
    if (end >= 10):
        season = str(postfix) + str(end)
    else:
        season = str(postfix) + str(0) + str(end)
    season_list.append(season)
    postfix -= 1
print season_list

# main loop
for season in season_list:
    href = 'http://www.stat-football.com/en/a/eng.php?b=10&d='+season+'&c=51'
    print href
    xml = urlopen(href).read()
    xmlSoup = BeautifulStoneSoup(xml)
    tablet = xmlSoup.find(attrs={"class" : "bd5"})

    #Access DB      
    con = sqlite3.connect('england.db')
    cur = con.cursor()

    #Parse site
    tour = tablet.findAll(attrs = { "class" : re.compile(r"^(s3|cc s3)$") })
    date = tablet.findAll(text = re.compile(r"(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[012])\.(19|20)\d\d"))
    home = tablet.findAll(attrs = {"class" : "nw"})
    guest = tablet.findAll(attrs = {"class" : "s1"})
    score = tablet.findAll(attrs = {"class" : "nw pr15"})

    #
    def parse_string(sequence):
        result=[]
        for unit in sequence:
            text = ''.join(unit.findAll(text=True))
            result.append(text.strip())
        return result

    tour_list=parse_string(tour)
    home_list=parse_string(home)
    guest_list=parse_string(guest)
    score_list=parse_string(score)

    #Loop over found records to put them into sqlite3 DB
    i = len(tour_list)
    j = 0
    while (j < i):
        sql_add = 'INSERT INTO english_premier_league (season, tour, date, home, visitor, home_score, visitor_score) VALUES (?, ?, ?, ?, ?, ?, ?)'
        match = (season, int(tour_list[j]), date[j], home_list[j], guest_list[j], int(score_list[j][0:1]), int(score_list[j][2:3]))
        try:
            cur.executemany(sql_add, match)
        except sqlite3.DatabaseError as err:
            print "Error matching the record: ", err
        else:
            con.commit()
        part = float(j)/float(i)*100
        if (part%10 == 0):
            print (int(part)), "%"
        j += 1
    cur.close()
    con.close()

Also it may be useful to look at the end of strace output:

getcwd("/home/vitaly/football_forecast/epl", 512) = 35
stat("/home/vitaly/football_forecast/epl/england.db", {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
open("/home/vitaly/football_forecast/epl/england.db", O_RDWR|O_CREAT, 0644) = 3
fcntl(3, F_GETFD) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
lseek(3, 0, SEEK_SET) = 0
read(3, "SQLite format 3\0\4\0\1\1\0@ \0\0\1~\0\0\0\30"..., 100) = 100

I'm running Python 2.7 on Ubuntu 12.04. Thanks a lot.

  • executemany should probably just be execute. Commented Aug 13, 2013 at 18:35
  • Unfortunately, the script does not reach this point. It gets blocked at the beginning of the while loop (lines 86-87). Commented Aug 13, 2013 at 18:48
  • Why do you think so? The script works fine for me with the replacement. Commented Aug 13, 2013 at 19:24
  • Thanks a lot, I fixed the bug you noticed and ran the script with python2.7 (previously I somehow missed that and used python2). Commented Aug 13, 2013 at 19:43

1 Answer


Replace cur.executemany(sql_add, match) with cur.execute(sql_add, match). executemany() runs the same statement once for each item in an iterable of parameter sequences, whereas match in your loop is a single tuple holding one row. For example, if you had this:

match = [ (season1, tour1, date1, home1, visitor1, home_score1, visitor_score1),
          (season2, tour2, date2, home2, visitor2, home_score2, visitor_score2),
          (season3, tour3, date3, home3, visitor3, home_score3, visitor_score3) ]

cur.executemany(sql_add, match)

... it would be appropriate, since the cursor could iterate over the tuples in match and perform the insert operation on each of them.
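
To make the difference concrete, here is a minimal sketch that should run on both Python 2.7 and Python 3. It uses an in-memory database, a shortened hypothetical table named matches, and made-up team names and scores, none of which come from the question. It shows execute() with one row, executemany() with a list of rows, and what I would expect to happen when a single row is handed to executemany() by mistake.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "CREATE TABLE matches ("
    "season TEXT, home TEXT, visitor TEXT, "
    "home_score INTEGER, visitor_score INTEGER)"
)

sql_add = (
    "INSERT INTO matches (season, home, visitor, home_score, visitor_score) "
    "VALUES (?, ?, ?, ?, ?)"
)

# Made-up sample rows, for illustration only.
one_match = ("201112", "Team A", "Team B", 2, 1)
more_matches = [
    ("201112", "Team C", "Team D", 0, 0),
    ("201112", "Team E", "Team F", 3, 2),
]

# One row: execute() binds the tuple's items to the placeholders.
cur.execute(sql_add, one_match)

# Several rows: executemany() runs the statement once per tuple in the list.
cur.executemany(sql_add, more_matches)

# Passing a single row to executemany() makes it treat each item of the
# tuple as a full parameter set. The first item is the 6-character string
# "201112", which does not match the 5 placeholders, so sqlite3 raises.
error = None
try:
    cur.executemany(sql_add, one_match)
except sqlite3.Error as err:
    error = err

con.commit()
row_count = cur.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
con.close()
```

Only the three valid rows are stored, and error holds the exception raised by the mistaken call. In the question's script, one option would be to collect all rows of a season in a list and call executemany() once, followed by a single commit, instead of committing inside the loop.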
