
I've looked at this response to try to get numpy to print the full array rather than a summarized view, but it doesn't seem to be working.

I have a CSV with named headers. Here are the first five rows:

v0  v1  v2  v3  v4
1001    5529    24  56663   16445
1002    4809    30.125  49853   28069
1003    407 20  28462   8491
1005    605 19.55   75423   4798
1007    1607    20.26   79076   12962

I'd like to read in the data and be able to view it fully. I tried doing this:

import numpy as np
np.set_printoptions(threshold=np.inf)

main_df2=np.genfromtxt('file location', delimiter=",")
main_df2[0:3,:]

However this still returns the truncated array, and the performance seems greatly slowed. What am I doing wrong?

What does that last line show? That's only 3 rows and 5 columns, if the genfromtxt is right. Commented Feb 24, 2017 at 15:57

3 Answers


OK, in a regular Python session (I usually use IPython instead), I set the print options and made a large array:

>>> np.set_printoptions(threshold=np.inf, suppress=True)
>>> x=np.random.rand(25000,5)

When I execute the next line, it spends about 21 seconds formatting the array, and then writes the resulting string to the screen (with more lines than fit in the terminal's window buffer).

>>> x

This is the same as

>>> print(repr(x))

The internal storage for x is a buffer of floats (which you can 'see' with x.tostring()). To print x it has to be formatted: numpy creates a multiline string containing a print representation of each number, all 125000 of them. The result of repr(x) is a string about 1850000 characters long, spanning 25000 lines. That is what takes 21 seconds. Displaying it on the screen is limited only by the terminal's scroll speed.
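You can time that formatting step yourself; a small sketch (timings will vary, and recent numpy versions format noticeably faster than the 21 seconds above):

```python
import time
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
x = np.random.rand(25000, 5)

t0 = time.time()
s = repr(x)  # this is the formatting step, one line per row
elapsed = time.time() - t0

print(len(s), s.count('\n') + 1, elapsed)
```

The length of the string and its line count dominate; once the string exists, printing it is just terminal I/O.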

I haven't looked at the details, but I think the numpy formatting is mostly written in Python, not compiled code. It's designed more for flexibility than for speed. It's normal to want to see 10-100 lines of an array; 25000 lines is an unusual case.

Somewhat curiously, writing this array as a csv is fast, with a minimal delay:

>>> np.savetxt('test.txt', x, fmt='%10f', delimiter=',')

And I know what savetxt does: it iterates over the rows, and for each row does a file write

f.write(fmt % tuple(row))
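In outline, that loop looks roughly like this (a simplified sketch, not savetxt's actual source):

```python
import numpy as np

x = np.random.rand(5, 3)
# build one fixed-width format string per row, matching fmt='%10f'
fmt = ','.join(['%10f'] * x.shape[1]) + '\n'

with open('test.txt', 'w') as f:
    for row in x:
        f.write(fmt % tuple(row))

# round-trip check: %10f keeps 6 decimals, so compare with a loose tolerance
back = np.loadtxt('test.txt', delimiter=',')
print(np.allclose(x, back, atol=1e-6))  # True
```

Each row is formatted with a single, known `%`-style format, which is much cheaper than the general-purpose repr machinery.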

Evidently all the bells and whistles of the regular repr are expensive. It can summarize, it can handle many dimensions, it can handle complicated dtypes, etc. Simply formatting each row with a known, fixed format is not the time-consuming step.

Actually that savetxt route might be more useful, as well as fast. You can control the display format, and you can view the resulting text file in an editor or terminal window at your leisure. You won't be limited by the scroll buffer of your terminal window. But how will this savetxt file be different from the original csv?


1 Comment

Thank you, I was trying to keep it simple, but maybe your idea might work just as well.

I'm surprised you get an array at all, since your example does not use ',' as the delimiter. But maybe you forgot to include commas in your example file.

I would use the DataFrame functionality of pandas when working with CSV data. It uses numpy under the hood, so all numpy operations work on pandas DataFrames.

Pandas has many tricks for operating on table-like data.

import pandas as pd

df = pd.read_csv('nothing.txt')
#==============================================================================
# the next line removes blanks from the column names
#==============================================================================
df.columns = [name.strip(' ') for name in df.columns]

pd.set_option('display.height', 1000)  # note: 'display.height' is deprecated in newer pandas
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

print(df)
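If you only need the full view occasionally, pandas also offers `pd.option_context`, which lifts the limits for a single `with` block and restores them afterwards (a small sketch, not part of the original answer):

```python
import pandas as pd

# a frame long enough to trigger the default truncation
df = pd.DataFrame({'v%d' % i: list(range(1000)) for i in range(5)})

# default repr truncates to head/tail rows plus an ellipsis
short = repr(df)

# temporarily lift the row/column limits for just this block
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    full = repr(df)

print(len(short.splitlines()), len(full.splitlines()))
```

This avoids leaving global display options changed for the rest of the session.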

2 Comments

When I copied and pasted the data here, it was open in Excel, but the file is a CSV.
I see. Excel did the nice formatting. Does the approach with pandas work?

When I copied and pasted the data here, it was open in Excel, but the file is a CSV.

I'm doing a class exercise and we have to use numpy. One thing I noticed was that the results were quite illegible thanks to the scientific notation, so I did the following and things are much smoother:

np.set_printoptions(threshold=100000, suppress=True)

The suppress option saved me a lot of formatting. The performance does suffer a lot when I change the threshold to something like np.nan or np.inf, and I'm not sure why.
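To see what suppress actually changes, here is a minimal sketch using `np.array2string`, whose `suppress_small` argument mirrors the `suppress` print option:

```python
import numpy as np

a = np.array([1.234e-06, 123456.789])

# default: a wide dynamic range forces scientific notation
s_default = np.array2string(a)

# suppress_small=True forces fixed-point display instead
s_fixed = np.array2string(a, suppress_small=True)

print(s_default)
print(s_fixed)
```

Very small values print as (rounded) fixed-point numbers instead of `1.234e-06`-style exponents.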

4 Comments

How big is this file? Pages and pages of rows?
25,000 rows, so I wouldn't expect it to be slow in Python? Or is that typical in Python? My other programming experience is in R.
I can't imagine trying to print (write to the screen) 25000 rows of anything! I might pipe it to less/more and scroll through looking at selected rows. But the whole thing?
Sure, I can agree with that. I guess I should just slice a few rows? Is there a command to randomly select some rows?
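As an aside on that last question: one common approach (sketched here, not taken from the thread) is to draw random row indices with `np.random.choice` and slice with them:

```python
import numpy as np

arr = np.random.rand(25000, 5)

# pick 10 distinct row indices at random, then fancy-index the rows
idx = np.random.choice(arr.shape[0], size=10, replace=False)
sample = arr[idx]

print(sample.shape)  # (10, 5)
```

`replace=False` guarantees the sampled rows are all different.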
