Python: push item vs creating empty list (efficiency)

Question

I have written two algorithms and want to check which one is more 'efficient' and uses less memory. The first creates a numpy array and fills it in place. The second creates an empty Python list and appends (pushes) values to it. Which is better? First program:

f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
lines = f.readlines()
f.close()
zeros = np.zeros((60343,4917))

for l in lines:
    row = l.split(",")
    for element in row:
        zeros[lines.index(l), row.index(element)] = element

X = zeros[1,:]
Y = zeros[:,0]
one_hot = np.ones((counter, 2))

The second one:

f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
lines = f.readlines()
f.close()
X = []
Y = []

for l in lines:
    row = l.split(",")
    X.append([float(elem) for elem in row[1:]])
    Y.append(float(row[0]))

X = np.array(X)
Y = np.array(Y)
one_hot = np.ones((counter, 2))

My theory is that the first one is slower but uses less memory and is more 'stable' when working with large files. The second one is faster but uses a lot of memory and is not as stable when working with large files (543MB, 70,000 lines).

Thanks!

The reason for the memory inefficiency is that you're using file.readlines(), which loads all the lines of the file into memory. You should iterate over the file object directly.
– Ashwini Chaudhary, Commented Jul 31, 2013 at 12:59
They are both not very elegant, and you should use numpy.loadtxt() instead.
– Giorgio Gilestro, Commented Jul 31, 2013 at 13:02

Ashwini Chaudhary · Accepted Answer · 2013-07-31 13:13:13Z

1

The problem with both versions is that you load the whole file into memory first with file.readlines(). You should iterate over the file object directly to get one line at a time.

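import numpy as np

# Generator function: yields (features, label) for one line at a time
def func():
    with open('filename.txt') as f:
        for line in f:
            row = [float(value) for value in line.split(",")]
            yield row[1:], row[0]

# zip works on both Python 2 and 3 (izip exists only in Python 2);
# zip(*...) transposes the (features, label) pairs into two sequences
X, Y = zip(*func())
X = np.array(X)
Y = np.array(Y)
...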

I am sure a pure numpy solution is going to be faster than this.
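
As a rough illustration of the pure numpy route (and of the numpy.loadtxt() suggestion from the comments), here is a minimal sketch. The in-memory sample is a hypothetical stand-in for the training file, and it assumes the label sits in the first column. Whether this actually beats the generator on the 543MB file would need to be measured.

```python
import io

import numpy as np

# Hypothetical stand-in for the training file: label first, then features.
sample = io.StringIO("1,0.5,0.25\n0,1.5,2.5\n")

# loadtxt parses every comma-separated line into one 2-D float array.
data = np.loadtxt(sample, delimiter=",")

Y = data[:, 0]   # first column: labels
X = data[:, 1:]  # remaining columns: features
```

With a real file, the path can be passed to np.loadtxt in place of the StringIO object.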

answered Jul 31, 2013 at 13:13

Ashwini Chaudhary


1 Comment

Marc Ortiz Over a year ago

There's a prettier solution with numpy, as you said, but it's very helpful! Thanks

Marc Ortiz · Accepted Answer · 2013-07-31 14:15:41Z

1

Well, I finally made some changes thanks to the answers. My two programs:

import timeit
import numpy as np

f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
zeros = np.zeros((60343,4917))
counter = 0

start = timeit.default_timer()
for l in f:
    row = l.split(",")
    counter2 = 0
    for element in row:
        # Fill the preallocated array one element at a time
        zeros[counter, counter2] = element
        counter2 += 1
    counter = counter + 1
stop = timeit.default_timer()
print(stop - start)
f.close()

Time of the first program: 122.243036032 seconds

Second program:

import timeit
import numpy as np

f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
zeros = np.zeros((60343,4917))
counter = 0

start = timeit.default_timer()
for l in f:
    row = l.split(",")
    # Assign the whole row at once instead of element by element
    zeros[counter, :] = [i for i in row]
    counter = counter + 1
stop = timeit.default_timer()
print(stop - start)
f.close()

Time of the second program: 102.208696127 seconds! Thanks.

edited Jul 31, 2013 at 14:15

answered Jul 31, 2013 at 13:55

Marc Ortiz


1 Comment

Ashwini Chaudhary Over a year ago

I'd use enumerate; a manual counter is unpythonic. And what's the point of [i for i in row]? Simply assign row to it.
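
A minimal sketch of that suggestion, assuming the same comma-separated layout. The two-line in-memory sample and the small array shape are hypothetical stand-ins for the real file and the (60343, 4917) array. The values are converted with float() explicitly here rather than relying on numpy's implicit string conversion.

```python
import io

import numpy as np

# Hypothetical stand-in for the open training file (2 rows, 3 columns).
f = io.StringIO("1,0.5,0.25\n0,1.5,2.5\n")
zeros = np.zeros((2, 3))

# enumerate supplies the row index, so no manual counter is needed.
for i, line in enumerate(f):
    # Convert the fields and assign the whole row in one step.
    zeros[i, :] = [float(value) for value in line.split(",")]
```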

Oleg Eterevsky · Accepted Answer · 2013-07-31 13:13:42Z

0

Python has a useful profiler in its standard library. It's really easy to use: just wrap your code in a function and call cProfile.run as follows:

import cProfile
cProfile.run('my_function()')
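
For a slightly fuller picture, a sketch along these lines collects the statistics with cProfile.Profile and prints the most expensive calls via pstats. The body of my_function here is a hypothetical placeholder workload, not the asker's parsing code.

```python
import cProfile
import io
import pstats


def my_function():
    # Hypothetical workload standing in for the file-parsing loop.
    return [float(v) for v in "1,0.5,0.25".split(",") * 1000]


profiler = cProfile.Profile()
profiler.enable()
result = my_function()
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Sorting by cumulative time tends to surface the outer loop first, which makes it easier to see which of the two programs spends more time per line.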

One piece of advice for both cases: you really do not need to read all the lines into a list. Instead, if you just iterate over the file, you'll get the lines without storing them all in memory:

with open('some_file.txt') as f:
    for line in f:
        pass  # Do something with the line

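In terms of memory usage, a numpy array is significantly more compact than a list, because it stores the raw values contiguously instead of references to separate Python objects.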
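
A rough way to see the difference on CPython is sketched below. The size n is an arbitrary choice for illustration, and the exact list figure depends on the interpreter build.

```python
import sys

import numpy as np

n = 100000
values = [float(i) for i in range(n)]
array = np.array(values)

# A list stores pointers to separate float objects, so count both.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# A float64 array stores the raw 8-byte values contiguously.
array_bytes = array.nbytes

print(list_bytes, array_bytes)
```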

answered Jul 31, 2013 at 13:13

Oleg Eterevsky
