
I'd like to be able to retrieve specific rows from a large dataset (9M lines, 1.4 GB) given two or more parameters, using Python.

For example, from this dataset:

ID1 2   10  2   2   1   2   2   2   2   2   1
ID2 10  12  2   2   2   2   2   2   2   1   2
ID3 2   22  0   1   0   0   0   0   0   1   2
ID4 14  45  0   0   0   0   1   0   0   1   1
ID5 2   8   1   1   1   1   1   1   1   1   2

Given these example parameters:

  • the second column must be equal to 2, and
  • the third column must be within the range 4 to 15

I should obtain:

ID1 2   10  2   2   1   2   2   2   2   2   1
ID5 2   8   1   1   1   1   1   1   1   1   2

The problem is that I don't know how to perform these operations efficiently on a two-dimensional array in Python.

This is what I tried:

line_list = []

# Load the whole file into memory
for line in file:
    line_list.append(line)

# Set the conditions
i = 2
start_range = 4
end_range = 15

# Iterate through the loaded list and split each line into columns
for line in line_list:
    data = line.strip().split()
    # Test whether the current line matches the conditions
    if int(data[1]) == i and start_range <= int(data[2]) <= end_range:
        print(data)

I'd like to perform this process many times, and the way I'm doing it is really slow, even with the data file loaded into memory.

I was thinking about using numpy arrays, but I don't know how to retrieve rows given these conditions.
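Something along these lines is what I had in mind (untested; 'data.txt' is a placeholder for my file, and the 11 numeric columns match the sample above):

import numpy as np

# Load the IDs and the numeric columns separately
ids = np.loadtxt('data.txt', usecols=(0,), dtype=str)
values = np.loadtxt('data.txt', usecols=range(1, 12), dtype=int)

# Boolean mask: second column of the file equal to 2,
# third column of the file between 4 and 15
mask = (values[:, 0] == 2) & (values[:, 1] >= 4) & (values[:, 1] <= 15)

# Select the matching rows
print(ids[mask])
print(values[mask])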

Thanks for your help!

UPDATE:

As suggested, I used a relational database system. I chose SQLite, as it is easy to use and quick to deploy.

My file was loaded through SQLite's import function in roughly 4 minutes.

I created an index on the second and third columns to speed up retrieval.

The query was done from Python with the sqlite3 module.

That is way, way faster!
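For reference, the query side looks roughly like this (the table and column names dataset, col2 and col3 are placeholders for my actual schema):

import sqlite3

# Connect to the database created with SQLite's import command
conn = sqlite3.connect('dataset.db')

# Index on the two filtered columns speeds up the lookup
conn.execute('CREATE INDEX IF NOT EXISTS idx_col2_col3 ON dataset (col2, col3)')

# Parameterized query for the example conditions
cursor = conn.execute(
    'SELECT * FROM dataset WHERE col2 = ? AND col3 BETWEEN ? AND ?',
    (2, 4, 15))
for row in cursor:
    print(row)

conn.close()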

  • Have you considered using a database? Commented Feb 1, 2013 at 0:59
  • This looks like a job for a relational database. Commented Feb 1, 2013 at 1:00
  • Good point, I didn't think about that. I can still split the files to process as I work on a cluster. I just want to be sure that it can't be done efficiently in pure Python. Commented Feb 1, 2013 at 1:02

1 Answer


I'd go for almost what you've got (untested):

with open('somefile') as fin:
    rows = (line.split() for line in fin)
    take = (row for row in rows if int(row[1]) == 2 and 4 <= int(row[2]) <= 15)
    # data = list(take)
    for row in take:
        pass  # do something

7 Comments

  • This will minimize memory overhead but not necessarily speed. The fundamental reason for the slowness is the IO and string parsing, not the row searching. You should concentrate your efforts on speeding up IO by using mmap (roughly sketched after these comments), a file format that is faster to parse (binary or fixed record length), or putting your dataset into some kind of database.
  • @FrancisAvila I won't argue that something like this should be in a DB with appropriate indices. However, for pure disk IO (short of mmap'ing, which I've never seen help for simple sequential access of a file), this is probably as good as it gets.
  • That was more a note to the OP than a criticism of your answer, which is quite on-target. I'm warning him not to expect miracles unless he can improve his IO/parsing story.
  • @FrancisAvila my comment was also more towards the OP and not really to yourself. I do agree, it's an IO problem... And it definitely would not hurt to use an RDBMS's load-file facility or similar to push it into a simple table, index it, then query it from Python if need be.
  • Okay, so the IO is the bottleneck. Thanks for your replies!
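For completeness, the mmap approach mentioned above would look roughly like this (untested; 'somefile' is a placeholder, and the string-parsing cost remains):

import mmap

with open('somefile', 'rb') as fin:
    # Map the whole file read-only; readline() then works on the mapped bytes
    mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(mm.readline, b''):
        cols = line.split()
        if int(cols[1]) == 2 and 4 <= int(cols[2]) <= 15:
            pass  # do something with the matching row
    mm.close()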