
I'd like to be able to retrieve specific rows from a large dataset (9M lines, 1.4 GB) given two or more parameters, using Python.

For example, from this dataset:

ID1 2   10  2   2   1   2   2   2   2   2   1
ID2 10  12  2   2   2   2   2   2   2   1   2
ID3 2   22  0   1   0   0   0   0   0   1   2
ID4 14  45  0   0   0   0   1   0   0   1   1
ID5 2   8   1   1   1   1   1   1   1   1   2

Given these example parameters:

  • the second column must be equal to 2, and
  • the third column must be within the range 4 to 15

I should obtain:

ID1 2   10  2   2   1   2   2   2   2   2   1
ID5 2   8   1   1   1   1   1   1   1   1   2

The problem is that I don't know how to perform these operations efficiently on a two-dimensional array in Python.

This is what I tried:

line_list = []

# Load the whole file into memory
for line in file:
    line_list.append(line)

# Set the conditions
i = 2
start_range = 4
end_range = 15

# Iterate through the loaded list and split each line into columns
for line in line_list:
    data = line.strip().split()
    # Test whether the current line matches the conditions
    if int(data[1]) == i and start_range <= int(data[2]) <= end_range:
        print(data)

I'd like to perform this process many times, and the way I'm doing it is really slow, even with the data file loaded into memory.

I was thinking about using numpy arrays, but I don't know how to retrieve rows given these conditions.
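Something along these lines is what I had in mind (untested; 'data.txt' is a placeholder for my file, and the 11 numeric columns match the sample above):

import numpy as np

# Load the IDs and the numeric columns separately
ids = np.loadtxt('data.txt', usecols=(0,), dtype=str)
values = np.loadtxt('data.txt', usecols=range(1, 12), dtype=int)

# Boolean mask: second column of the file equal to 2,
# third column of the file between 4 and 15
mask = (values[:, 0] == 2) & (values[:, 1] >= 4) & (values[:, 1] <= 15)

# Select the matching rows
print(ids[mask])
print(values[mask])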

Thanks for your help!

UPDATE:

As suggested, I used a relational database system. I chose SQLite, as it is easy to use and quick to deploy.

My file was loaded through SQLite's import function in roughly 4 minutes.

I created an index on the second and third columns to speed up retrieval.

The query was done from Python with the sqlite3 module.

That is way, way faster!
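For reference, the query side looks roughly like this (the table and column names dataset, col2 and col3 are placeholders for my actual schema):

import sqlite3

# Connect to the database created with SQLite's import command
conn = sqlite3.connect('dataset.db')

# Index on the two filtered columns speeds up the lookup
conn.execute('CREATE INDEX IF NOT EXISTS idx_col2_col3 ON dataset (col2, col3)')

# Parameterized query for the example conditions
cursor = conn.execute(
    'SELECT * FROM dataset WHERE col2 = ? AND col3 BETWEEN ? AND ?',
    (2, 4, 15))
for row in cursor:
    print(row)

conn.close()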

  • Have you considered using a database? Commented Feb 1, 2013 at 0:59
  • This looks like a job for a relational database. Commented Feb 1, 2013 at 1:00
  • Good point, I didn't think about that. I can still split the files to process as I work on a cluster. I just want to be sure that it can't be done efficiently in pure Python. Commented Feb 1, 2013 at 1:02

1 Answer


I'd go for almost what you've got (untested):

with open('somefile') as fin:
    rows = (line.split() for line in fin)
    take = (row for row in rows if int(row[1]) == 2 and 4 <= int(row[2]) <= 15)
    # data = list(take)
    for row in take:
        pass  # do something

7 Comments

  • This will minimize memory overhead but not necessarily speed. The fundamental reason for the slowness is the IO and string parsing, not the row searching. You should concentrate your efforts on speeding up IO by using mmap (roughly sketched after these comments), a file format that is faster to parse (binary or fixed record length), or putting your dataset into some kind of database.
  • @FrancisAvila I won't argue that something like this should be in a DB with appropriate indices. However, for pure disk IO (short of mmap'ing, which I've never seen help for simple sequential access of a file), this is probably as good as it gets.
  • That was more a note to the OP than a criticism of your answer, which is quite on-target. I'm warning him not to expect miracles unless he can improve his IO/parsing story.
  • @FrancisAvila my comment was also more towards the OP and not really to yourself. I do agree, it's an IO problem... And it definitely would not hurt to use an RDBMS's load-file facility or similar to push it into a simple table, index it, then query it from Python if need be.
  • Okay, so the IO is the bottleneck. Thanks for your replies!
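For completeness, the mmap approach mentioned above would look roughly like this (untested; 'somefile' is a placeholder, and the string-parsing cost remains):

import mmap

with open('somefile', 'rb') as fin:
    # Map the whole file read-only; readline() then works on the mapped bytes
    mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(mm.readline, b''):
        cols = line.split()
        if int(cols[1]) == 2 and 4 <= int(cols[2]) <= 15:
            pass  # do something with the matching row
    mm.close()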