Smart way to read big input file with multiple unmarked variables (assorted in columns) in Python

Question

I have the following code that runs for over a million lines. But this takes a lot of time. Is there a better way to read in such files? The current code looks like this:

for line in lines:
    line = line.strip()             #Strips extra characters from lines
    columns = line.split()          #Splits lines into individual 'strings'
    x = columns[0]                  #Reads in x position
    x = float(x)                    #Converts the strings to float
    y = columns[1]                  #Reads in y 
    y = float(y)                    #Converts the strings to float
    z = columns[2]                  #Reads in z 
    z = float(z)                    #Converts the strings to float

The file data looks like this:

  347.528218024     354.824474847   223.554247185   -47.3141937738  -18.7595743981   
  317.843928028     652.710791858   795.452586986   -177.876355361  7.77755408015   
  789.419369714     557.566066378   338.090799912   -238.803813301  -209.784710166   
  449.259334688     639.283337249   304.600907059   26.9716202117   -167.461497735  
  739.302109761     532.139588049   635.08307865    -24.5716064556  -91.5271790951

I want to extract each number from the different columns. Every element in a column belongs to the same variable. How do I do that? For example, I want a list, say l, to store the floats of the first column.

colcarroll · Accepted Answer · 2013-11-22 16:38:50Z

4

It would be helpful to know what you are planning on doing with the data, but you might try:

data = [list(map(float, line.split())) for line in lines]  # list() is needed on Python 3, where map() returns an iterator

This will give you a list of lists with your data.
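
To get one list per column out of such a list of rows (the list l asked for in the question), the rows can be transposed with zip. This is only a minimal sketch: the two sample lines are copied from the question and stand in for the real file, and the names rows, columns and l are illustrative.

```python
# Two sample lines from the question stand in for the real file's lines.
lines = [
    "  347.528218024     354.824474847   223.554247185   -47.3141937738  -18.7595743981   ",
    "  317.843928028     652.710791858   795.452586986   -177.876355361  7.77755408015   ",
]

# One list of floats per row; split() already drops leading/trailing whitespace.
rows = [[float(value) for value in line.split()] for line in lines]

# zip(*rows) transposes rows into columns; each column becomes its own list.
columns = [list(column) for column in zip(*rows)]

l = columns[0]  # floats of the first column
print(l)
```

For a file with over a million lines this holds every value in memory as a Python float, so it is a reasonable starting point but not necessarily the fastest option.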

edited Nov 22, 2013 at 16:38

answered Nov 22, 2013 at 4:48

colcarroll


4 Comments

shad0w_wa1k3r Over a year ago

would be slightly faster too, being a list comprehension.

DSM Over a year ago

No need for .strip() -- .split() will automatically drop whitespace on the ends.

Abhinav Kumar Over a year ago

I want to extract each number from the different columns. Every element in a column belongs to the same variable. How do I do that? For example, I want a list, say l, to store the floats of the first column.

colcarroll Over a year ago

[j[0] for j in data] extracts the first column. pandas or numpy/scipy is probably a better idea if you'll be doing a lot of manipulation. With numpy, you can use numpy.array(data)[:,0] to get the first column.
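
As a rough sketch of the numpy route mentioned in this comment, numpy.loadtxt can parse whitespace-separated text straight into an array, without building the intermediate list of lists. The in-memory sample below is an assumption made to keep the example self-contained; with a real file, its name could be passed to loadtxt instead.

```python
import io

import numpy as np

# Two lines from the question, held in memory in place of the real file.
sample = io.StringIO(
    "  347.528218024     354.824474847   223.554247185   -47.3141937738  -18.7595743981   \n"
    "  317.843928028     652.710791858   795.452586986   -177.876355361  7.77755408015   \n"
)

arr = np.loadtxt(sample)       # 2-D float array, one row per input line
first_column = arr[:, 0]       # every value of the first column
l = first_column.tolist()      # plain Python list, if one is really needed
print(arr.shape, l)
```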

Matthew Adams · Accepted Answer · 2013-11-22 06:07:32Z

1

Pandas is built for this (among many other things)!

It uses numpy, which uses C under the hood and is very fast. (Actually, depending on what you're doing with the data, you may want to use numpy directly instead of pandas. However, I'd only do that after you've tried pandas; numpy is lower level and pandas will make your life easier.)

Here's how you could read in your data:

import pandas as pd

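with open('testfile', 'r') as f:
    # sep=r'\s+' splits on runs of whitespace (it replaces the deprecated delim_whitespace=True)
    d = pd.read_csv(f, sep=r'\s+', header=None)

# Some pandas versions interpret the leading spaces as an empty first column;
# drop any all-NaN column, then name the five that remain.
d = d.dropna(axis=1, how='all')
d.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
print(d)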

This outputs:

         col1        col2        col3        col4        col5
0  347.528218  354.824475  223.554247  -47.314194  -18.759574
1  317.843928  652.710792  795.452587 -177.876355    7.777554
2  789.419370  557.566066  338.090800 -238.803813 -209.784710
3  449.259335  639.283337  304.600907   26.971620 -167.461498
4  739.302110  532.139588  635.083079  -24.571606  -91.527179

The result d in this case is a powerful data structure called a dataframe that gives you a lot of options for manipulating the data very quickly.

As a simple example, this adds the first two columns and takes the mean of the result:

(d['col1'] + d['col2']).mean() # 1075.97544372

Pandas also handles missing data very nicely: missing values in the data file are read in as NaN, and entries that fail to parse as numbers can be converted to NaN afterwards with pd.to_numeric(..., errors='coerce').

Anyway, for fast, easy data analysis, I highly recommend this library.
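
To pull a single column out of a dataframe as a plain list, and to see how a missing value behaves, something like the following sketch should work. The in-memory sample, the NA marker in it and the two column names are assumptions made for this illustration only; they are not part of the original data file.

```python
import io

import pandas as pd

# Three short rows stand in for the real file; "NA" marks a missing value.
sample = io.StringIO(
    "347.528218024   354.824474847\n"
    "317.843928028   NA\n"
    "789.419369714   557.566066378\n"
)

d = pd.read_csv(sample, sep=r"\s+", header=None, names=["col1", "col2"])

l = d["col1"].tolist()                     # first column as a plain Python list
n_missing = int(d["col2"].isna().sum())    # "NA" was read in as NaN
col2_mean = d["col2"].mean()               # mean() skips NaN by default
print(l, n_missing, col2_mean)
```

Keeping the column as d["col1"] rather than converting it is usually the faster choice; tolist() is only worth calling when other code specifically needs a list.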

edited Nov 22, 2013 at 6:07

answered Nov 22, 2013 at 5:40

Matthew Adams


4 Comments

Abhinav Kumar Over a year ago

I want to extract each number from the different columns. Every element in a column belongs to the same variable. How do I do that? For example, I want a list, say l, to store the floats of the first column.

Matthew Adams Over a year ago

I'm not sure I understand the question...? You can certainly send things to lists, but it is much faster and easier to keep everything in pandas data structures (dataframes and series).

Matthew Adams Over a year ago

d['col1'] is essentially a list of all the values in the first column. It is just a super fast list with extra methods to do calculations on it.

Abhinav Kumar Over a year ago

Thanks, I now understand what you said.
