0

I have the following code that runs over a file with more than a million lines, but it takes a lot of time. Is there a better way to read in such files? The current code looks like this:

for line in lines:
    line = line.strip()             #Strips extra characters from lines
    columns = line.split()          #Splits lines into individual 'strings'
    x = columns[0]                  #Reads in x position
    x = float(x)                    #Converts the strings to float
    y = columns[1]                  #Reads in y 
    y = float(y)                    #Converts the strings to float
    z = columns[2]                  #Reads in z 
    z = float(z)                    #Converts the strings to float

The file data looks like this:

  347.528218024     354.824474847   223.554247185   -47.3141937738  -18.7595743981   
  317.843928028     652.710791858   795.452586986   -177.876355361  7.77755408015   
  789.419369714     557.566066378   338.090799912   -238.803813301  -209.784710166   
  449.259334688     639.283337249   304.600907059   26.9716202117   -167.461497735  
  739.302109761     532.139588049   635.08307865    -24.5716064556  -91.5271790951  

I want to extract each number from different columns. Every element in a column is the same variable. How do I do that? For example I want a list, l, say to store the floats of first column.

2 Answers

4

It would be helpful to know what you are planning on doing with the data, but you might try:

data = [list(map(float, line.split())) for line in lines]  # list() is needed on Python 3, where map() is lazy

This will give you a list of lists with your data.
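To go from that list of rows to one list per column, the rows can be transposed with zip. This is only a minimal sketch, assuming every line has the same number of fields; the sample lines and the names rows, columns, xs, ys and zs are illustrative and not from the question.

```python
# Sample input standing in for the lines of the real file (illustrative only).
lines = [
    "  347.528218024     354.824474847   223.554247185   -47.3141937738  -18.7595743981   ",
    "  317.843928028     652.710791858   795.452586986   -177.876355361  7.77755408015   ",
]

# One list of floats per line. split() with no arguments ignores leading
# and trailing whitespace, so a separate strip() is not needed.
rows = [[float(value) for value in line.split()] for line in lines]

# zip(*rows) transposes rows into columns; each column becomes a list.
columns = [list(col) for col in zip(*rows)]

xs, ys, zs = columns[0], columns[1], columns[2]
print(xs)  # first column as a list of floats
```

With a real file, lines would simply be the open file object, which yields one line at a time without loading the whole file first.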


4 Comments

This would be slightly faster too, being a list comprehension.
No need for .strip() -- .split() will automatically drop whitespace on the ends.
I want to extract each number from different columns. Every element in a column is the same variable. How do I do that? For example I want a list, l, say to store the floats of first column.
[j[0] for j in data] extracts the first column. pandas or numpy/scipy is probably a better idea if you'll be doing a lot of manipulation. For numpy, you can use numpy.array(data)[:,0] to get the first column.
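As a rough sketch of the numpy route mentioned in that comment, numpy.loadtxt can parse whitespace-separated floats directly, so the intermediate list of lists is not needed. The inline text and the name first_column are made up for the illustration; with a real file you would pass its path instead of the StringIO object.

```python
import io

import numpy as np

# Two sample rows standing in for the real file (illustrative only).
text = """\
  347.528218024     354.824474847   223.554247185   -47.3141937738  -18.7595743981
  317.843928028     652.710791858   795.452586986   -177.876355361  7.77755408015
"""

# loadtxt splits on whitespace by default and converts every field to float,
# returning a 2-D array with one row per line.
data = np.loadtxt(io.StringIO(text))

first_column = data[:, 0]   # all rows, column 0
l = first_column.tolist()   # plain Python list, if one is really needed
print(data.shape)
print(l)
```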
1

Pandas is built for this (among many other things)!

It uses numpy, which uses C under the hood and is very fast. (Actually, depending on what you're doing with the data, you may want to use numpy directly instead of pandas. However, I'd only do that after you've tried pandas; numpy is lower level and pandas will make your life easier.)

Here's how you could read in your data:

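import pandas as pd

with open('testfile', 'r') as f:
    # sep=r'\s+' splits on runs of whitespace (current spelling of delim_whitespace=True)
    d = pd.read_csv(f, sep=r'\s+', header=None)

# When I ran this, the leading spaces were read as an empty first column.
# Dropping all-NaN columns removes it, and changes nothing if your pandas
# version skips leading whitespace instead.
d = d.dropna(axis=1, how='all')
d.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
print(d)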

This outputs:

         col1        col2        col3        col4        col5
0  347.528218  354.824475  223.554247  -47.314194  -18.759574
1  317.843928  652.710792  795.452587 -177.876355    7.777554
2  789.419370  557.566066  338.090800 -238.803813 -209.784710
3  449.259335  639.283337  304.600907   26.971620 -167.461498
4  739.302110  532.139588  635.083079  -24.571606  -91.527179

The result d in this case is a powerful data structure called a dataframe that gives you a lot of options for manipulating the data very quickly.

As a simple example, this adds the first two columns and takes the mean of the result:

(d['col1'] + d['col2']).mean() # 1075.97544372

Pandas also handles missing data very nicely; if values are missing from the data file, pandas will simply replace them with NaN when it reads them in, and other markers for bad data can be declared through the na_values argument of read_csv.
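Here is a small sketch of that behaviour, which also shows how to pull a column out as a plain list. The three-row input is invented for the example (it is not the data from the question), and it relies on the literal NA being one of the markers read_csv treats as missing by default.

```python
import io

import pandas as pd

# Invented sample with one missing value, marked as NA (illustrative only).
text = "1.0 2.0 3.0\n4.0 NA 6.0\n7.0 8.0 9.0\n"

d = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None,
                names=["col1", "col2", "col3"])

# The NA field is read in as NaN, and the column stays numeric.
missing = d["col2"].isna().sum()

# Aggregations such as mean() skip NaN values by default.
col2_mean = d["col2"].mean()

# A column (a pandas Series) can be converted to a plain Python list.
first = d["col1"].tolist()

print(missing, col2_mean, first)
```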

Anyway, for fast, easy data analysis, I highly recommend this library.

4 Comments

I want to extract each number from different columns. Every element in a column is the same variable. How do I do that? For example I want a list, l, say to store the floats of first column.
I'm not sure I understand the question...? You can certainly send things to lists, but it is much faster and easier to keep everything in pandas data structures (dataframes and series).
d['col1'] is essentially a list of all the values in the first column. It is just a super fast list with extra methods to do calculations on it.
Thanks, I now understand what you said.
