
I have a .dat file with no delimiters that I am trying to read into an array. Say each line represents one person, and the variables in each line are defined by a fixed number of characters, e.g. the first variable "year" is the first four characters and the second variable "age" is the next two characters (no delimiters within the line):

201219\n
201220\n
201256\n

Here is what I am doing right now:

data_file = 'filename.dat'

year = []
age = []

with open(data_file, 'r') as file:
    for line in file:
        year.append(line[0:4])
        age.append(line[4:6])  # slice just the two-character field so the newline isn't included

This works fine for a small number of lines and variables, but when I try loading the full data file (500 MB, with 10 million lines and 20 variables) I get a MemoryError. Is there a more efficient way to load this type of data into arrays?

  • What kind of processing are you doing? Is it possible to process the data file(s) a chunk at a time? Is it all numbers? Do you want them eventually in numeric (as opposed to their current string) form? Commented Apr 15, 2014 at 0:27
  • These are survey data. I would like to have each variable in a separate float array, since I would need to query by multiple conditions (e.g. all people of a particular race, age in a certain year, and be able to evaluate distributions of other variables given these conditions - e.g. mean income). So it would be nice to have all data loaded simultaneously. Commented Apr 15, 2014 at 0:39
  • If the data is numeric, using numpy can be an efficient way. Commented Apr 15, 2014 at 1:05

3 Answers


First off, you're probably better off with a list of class instances than a bunch of parallel lists, from a software engineering standpoint. If you try this, you probably should look into __slots__ to decrease the memory overhead.
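
For example, a minimal sketch of such a class (the field names and column widths here are just illustrative, following the layout in the question):

class Person:
    # __slots__ avoids the per-instance __dict__, which saves a lot of
    # memory when you have millions of instances.
    __slots__ = ('year', 'age', 'income')

    def __init__(self, line):
        self.year = int(line[0:4])
        self.age = int(line[4:6])
        self.income = float(line[6:12])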

You could also try pypy - it has some memory optimizations for homogeneous lists.

I'd probably use gdbm or bsddb rather than sqlite, if you want an on-disk solution. gdbm and bsddb look like dicts, except the keys are strings and the values are strings (or bytes) too. So your class (the one I mentioned above) would have a method that serializes an instance to a string (pickle works) for storage in the table, and your constructor (or a classmethod) would reverse the process to rebuild an instance.
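
A rough sketch of that idea using Python's dbm module plus pickle (the filename, key scheme, and field widths are just placeholders):

import dbm
import pickle

# Write: one pickled record per person, keyed by line number.
with dbm.open('people.db', 'c') as db:
    with open('filename.dat', 'r') as f:
        for i, line in enumerate(f):
            record = (int(line[0:4]), int(line[4:6]))  # year, age, ...
            db[str(i)] = pickle.dumps(record)

# Read a record back by key.
with dbm.open('people.db', 'r') as db:
    year, age = pickle.loads(db['42'])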

If you ever get to such large data that a gdbm or bsddb is too slow, you could try just writing to a flat file - that'll not be as nice for jumping around obviously, but it eliminates a lot of seek()'ing which can be very advantageous sometimes.

HTH




The problem here doesn't appear to be reading the file so much as fitting it into memory. When you're talking about 200 million of anything in memory, you're going to have some issues.

Try storing the raw lines as a list of strings and parsing fields on demand (i.e. trade CPU time for memory), or, if you can, don't store it at all.

Another option to try is dumping it into a sqlite database. If you use an in-memory db you might end up with the same issue, but maybe not.

If you go for the string style, do something like this:

def get_age(person):
    # Age is the two-character field starting at column 4.
    return int(person[4:6])


with open('filename.dat', 'r') as f:
    people = f.readlines()  # Wait a while....

for person in people:
    print(get_age(person) * 2)  # Or something else

Here's an example of getting mean income for a particular age in a particular year:

# get_income and get_year are assumed to be fixed-width slicing helpers defined like get_age above.
def get_mean_income_by_age_and_year(people, target_age, target_year):
    count = 0
    total = 0.0
    for person in people:
        income, age, year = get_income(person), get_age(person), get_year(person)
        if age == target_age and year == target_year:
            total += income
            count += 1
    if count:
        return total/count
    else:
        return 0.0

Really, though, this basically does what storing it in a sqlite database would do for you. If there are only a couple of very specific things you want to do, then going this way is probably reasonable. But it sounds like there are probably several things you want to be doing with this info - if so a sqlite database is probably what you want.
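
For instance, a rough sketch of the sqlite route with Python's sqlite3 module (the table name, column widths, and query values here are made up for illustration):

import sqlite3

conn = sqlite3.connect('people.db')  # or ':memory:' for an in-memory database
conn.execute('CREATE TABLE IF NOT EXISTS people (year INTEGER, age INTEGER, income REAL)')

with open('filename.dat', 'r') as f:
    conn.executemany(
        'INSERT INTO people VALUES (?, ?, ?)',
        ((int(line[0:4]), int(line[4:6]), float(line[6:12])) for line in f))
conn.commit()

# Mean income for a given age and year is then a single query.
cur = conn.execute('SELECT AVG(income) FROM people WHERE age = ? AND year = ?', (20, 2012))
print(cur.fetchone()[0])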

4 Comments

I would like to have these variables in arrays with different types (int for year and age, float for, say, income) and be able to manipulate these data as arrays (multiply, divide, etc.), e.g. create a function to evaluate mean income for a particular age in a particular year.
It sounds like you really want to stick this data in a database. You could still do what you're looking for... (adding it to my answer)
Yes, that would be ideal, but since I am VERY new to Python, I am not sure how to reconcile a .dat file without delimiters with an array. If I had a delimited file, I could use a split() method?
I will try sqlite, this looks like a more efficient way of data storage. Thanks

A more efficient data structure for lots of uniform numeric data is the standard library's array.array. Depending on how much memory you have, using arrays may work.

import array

year = array.array('i')    # signed int
age = array.array('i')     # signed int
income = array.array('f')  # single-precision float ('d' for double precision)

with open('data.txt', 'r') as f:
    for line in f:
        # Slice each fixed-width field and convert it as it is read.
        year.append(int(line[0:4]))
        age.append(int(line[4:6]))
        income.append(float(line[6:12]))
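
Once loaded, the arrays can be scanned in parallel with zip to answer the kind of query mentioned in the comments, e.g. mean income for a given age and year (a sketch; the target values are arbitrary):

incomes = [inc for y, a, inc in zip(year, age, income) if y == 2012 and a == 20]
mean_income = sum(incomes) / len(incomes) if incomes else 0.0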

