I have written two versions of a loading routine and I want to check which one is more 'efficient' and uses less memory. The first one creates a numpy array up front and fills it in. The second one starts with empty Python lists and appends values to them. Which one is better? First program:

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    lines = f.readlines()
    f.close()
    zeros = np.zeros((60343,4917))

    for l in lines:
        row = l.split(",")
        for element in row:
            zeros[lines.index(l), row.index(element)] = element

    X = zeros[1,:]
    Y = zeros[:,0]
    one_hot = np.ones((counter, 2))

The second one:

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    lines = f.readlines()
    f.close()
    X = []
    Y = []

    for l in lines:
        row = l.split(",")
        X.append([float(elem) for elem in row[1:]])
        Y.append(float(row[0]))

    X = np.array(X)
    Y = np.array(Y)
    one_hot = np.ones((counter, 2))

My theory is that the first one is slower but uses less memory and is more 'stable' when working with large files. The second one is faster but uses a lot of memory and is not as stable when working with large files (543MB, 70,000 lines).

Thanks!

2 Comments

  • The reason for the memory inefficiency is that you're using file.readlines(), which loads all the lines of the file into memory. You should iterate over the file object directly. Commented Jul 31, 2013 at 12:59
  • Neither of them is very elegant; you should use numpy.loadtxt() instead. Commented Jul 31, 2013 at 13:02

3 Answers

The problem with both versions is that you load the whole file into memory first with file.readlines(). You should iterate over the file object directly to get one line at a time.

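import numpy as np

# Generator function: yields (features, label) for one line at a time
def func():
    with open('filename.txt') as f:
        for line in f:
            row = [float(v) for v in line.split(",")]
            yield row[1:], row[0]

# zip(*...) transposes the (features, label) pairs into two tuples
X, Y = zip(*func())
X = np.array(X)
Y = np.array(Y)
...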

I am sure a pure numpy solution is going to be faster than this.
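
For reference, a pure numpy version along the lines of the numpy.loadtxt() suggestion in the comments might look like the sketch below. It assumes the file is purely numeric and comma-separated, with the label in the first column as in the question; the helper name load_xy is made up for illustration. Whether it is actually faster on a 543MB file would have to be measured.

```python
import numpy as np

def load_xy(path):
    """Load a comma-separated numeric file; column 0 is the label (assumption)."""
    # ndmin=2 keeps the result two-dimensional even for a one-line file.
    data = np.loadtxt(path, delimiter=",", ndmin=2)
    X = data[:, 1:]  # every column except the first
    Y = data[:, 0]   # the first column
    return X, Y
```

Called as X, Y = load_xy('filename.txt'), this replaces the explicit loop and the two np.array() conversions.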

1 Comment

There's a prettier solution with numpy, as you said, but this is very helpful! Thanks

Well, I finally made some changes thanks to the answers. My two programs:

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    zeros = np.zeros((60343,4917))
    counter = 0

    start = timeit.default_timer()
    for l in f:
        row = l.split(",")
        counter2 = 0
        for element in row:
            zeros[counter, counter2] = element
            counter2 += 1
        counter = counter + 1
    stop = timeit.default_timer()
    print stop - start
    f.close()

Time of the first program: 122.243036032 seconds

Second program:

    f = open('/Users/marcortiz/Documents/vLex/pylearn2/mlearning/classify/files/models/model_training.txt')
    zeros = np.zeros((60343,4917))
    counter = 0

    start = timeit.default_timer()
    for l in f:
        row = l.split(",")
        counter2 = 0
        zeros[counter, :] = [i for i in row]
        counter = counter + 1
    stop = timeit.default_timer()
    print stop - start
    f.close()

Time of the second program: 102.208696127 seconds! Thanks.

1 Comment

I'd use enumerate; a manual counter is unpythonic. And what's the point of [i for i in row]? Simply assign row to it.
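
A minimal sketch of what that comment suggests is shown below. The function name fill_array and its parameters are hypothetical. It converts each value with float() explicitly rather than relying on numpy to cast strings on assignment, which is a deliberate deviation from "simply assign row".

```python
import numpy as np

def fill_array(path, n_rows, n_cols):
    """Fill a preallocated array row by row from a comma-separated file."""
    out = np.zeros((n_rows, n_cols))
    with open(path) as f:
        # enumerate supplies the row index, replacing the manual counter.
        for i, line in enumerate(f):
            # Assign the whole row at once instead of element by element.
            out[i, :] = [float(v) for v in line.split(",")]
    return out
```

For the file in the question this would be called as fill_array(path, 60343, 4917).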

Python has a useful profiler in its standard library. It's really easy to use: just wrap your code in a function and call cProfile.run like this:

import cProfile
cProfile.run('my_function()')
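
If the default report is too long, the results can be sorted and trimmed with pstats, which is also in the standard library. The sketch below is only an illustration: my_function here is a placeholder workload, not the file-loading code from the question.

```python
import cProfile
import io
import pstats

def my_function():
    # Placeholder workload standing in for the file-loading code.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
result = my_function()
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

Note that cProfile reports time per function, not memory, so it only answers the speed half of the question.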

One piece of advice for both cases: you really do not need to read all the lines into a list. If you just iterate over the file instead, you get the lines one at a time without storing them all in memory:

# The with block closes the file automatically
with open('some_file.txt') as f:
    for line in f:
        pass  # Do something with line

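In terms of memory usage, a numpy array is significantly better than a list: a float64 array stores raw 8-byte values contiguously, while a list stores a pointer to a separate Python float object for every element.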
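
A rough way to see the difference is to compare sizes directly, as in the sketch below. The element count n is arbitrary, and sys.getsizeof only gives an approximation of the list's footprint, with exact numbers depending on the interpreter.

```python
import sys
import numpy as np

n = 100000
values = [float(i) for i in range(n)]
arr = np.array(values)

# A list holds one pointer per element plus a separate float object for each value.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# A float64 array holds the raw 8-byte values contiguously.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
```

For scale, the 60343 x 4917 float64 array from the question holds 296,706,531 values, which is about 2.4 GB on its own.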
