Pandas read_fwf ignoring values

Question

I am running Python 3.5.2 and Pandas 0.19.1. I use read_fwf() to read in a large data file that was originally formatted in FORTRAN. It has columns that look like this:

SiC4+  e-    C2     c-SiC2     1.500e-07 -5.000e-01  0.000e+00 2.00e+00 0.00e+00 logn  8     10    280  3   746 1  1
SiC4+  e-    C      l-SiC3     1.500e-07 -5.000e-01  0.000e+00 2.00e+00 0.00e+00 logn  8     10    280  3   747 1  1
O      e-    O-                1.500e-15  0.000e+00  0.000e+00 2.00e+00 0.00e+00 logn  8     10    280  3   744 1  1
S      e-    S-                5.000e-15  0.000e+00  0.000e+00 2.00e+00 0.00e+00 logn  8     10    280  3   745 1  1

To read this in, I'm using this code:

convert = lambda x: int(species[x]) if x!='' else None
reactions = pd.read_fwf('data.dat',sep='\s+',converters={0:convert,1:convert,2:convert,3:convert})
reactions.fillna(0,inplace=True)

The converters take the first 4 columns' chemical names and replace them with index numbers (from another file), and any missing data is replaced with index number zero. This works fine.
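
A minimal, self-contained sketch of the converter step described above. The `species` table here is hypothetical (the real one comes from the other file), the helper name `to_index` and the explicit `colspecs` are assumptions for illustration, and blank fields are mapped straight to 0 rather than going through `fillna`:

```python
import io

import pandas as pd

# Hypothetical name-to-index table; the real one is loaded from another file.
species = {"SiC4+": 116, "e-": 76, "C2": 7, "c-SiC2": 30, "O": 4, "O-": 74}


def to_index(name):
    """Return the index for a species name; a blank field becomes 0."""
    return species[name] if name != "" else 0


# Two rows from the sample above, trimmed to the four species columns.
sample = (
    "SiC4+  e-    C2     c-SiC2\n"
    "O      e-    O-\n"
)

# Assumed column boundaries (start, end), read off the sample rows.
colspecs = [(0, 7), (7, 13), (13, 20), (20, 31)]

reactions = pd.read_fwf(
    io.StringIO(sample),
    colspecs=colspecs,
    header=None,
    converters={i: to_index for i in range(4)},
)
print(reactions)
```

Returning 0 from the converter is only one option; returning None and calling fillna(0) afterwards, as in the code above, should give the same values.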

What doesn't work is the 6th column and the 15th column.

116      76        7       30    1.500000e-07   0.5    0.0    2.0  0.0  logn   8   10  280     3  46  1  1 
116      76        1       41    1.500000e-07   0.5    0.0    2.0  0.0  logn   8   10  280     3  47  1  1  
  4      76       74        0    1.500000e-15   0.0    0.0    2.0  0.0  logn   8   10  280     3  44  1  1 
  5      76       75        0    5.000000e-15   0.0    0.0    2.0  0.0  logn   8   10  280     3  45  1  1

What is going on here? The 6th column loses its negative sign, and the 15th column is missing its leading '7'. I can't find a reason why this is happening. Other columns in the file that have leading negative signs are left untouched.

Update

The solution below is not incorrect, but getting it to work for me required an important change to the file header. The first 7 columns of my file look like this (with headers):

Input1    Input2   Output1    Output2    alpha      beta       gamma     
NC3       CRP      C2         CN         2.000e+03  0.000e+00  0.000e+00
C2N2      CRP      CN         CN         2.000e+03  0.000e+00  0.000e+00 
NC7       CRP      C6         CN         2.000e+03 -1.000e+00  0.000e+00

read_fwf() read in the headers and the spaces in between, and must have presumed that the column marked beta began 2 characters after the end of the column marked alpha, which cut off the negative sign on some of the values in beta.

I changed the header position for all columns that this could be a problem for, and the problem was fixed.

Input1    Input2   Output1    Output2    alpha     beta       gamma     
NC3       CRP      C2         CN         2.000e+03  0.000e+00  0.000e+00
C2N2      CRP      CN         CN         2.000e+03  0.000e+00  0.000e+00 
NC7       CRP      C6         CN         2.000e+03 -1.000e+00  0.000e+00

Notice that the headers for beta and gamma are pulled one space to the left. This starts each column early enough for read_fwf() to include the negative sign.
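
An alternative that leaves the header untouched is to give read_fwf() the column boundaries explicitly instead of letting it infer them. The sketch below is an illustration under assumptions: the offsets in `colspecs` were counted by hand from the sample above and would need to be adjusted for the full file.

```python
import io

import pandas as pd

# Same layout as the original (unshifted) header shown above.
sample = (
    "Input1    Input2   Output1    Output2    alpha      beta       gamma\n"
    "NC3       CRP      C2         CN         2.000e+03  0.000e+00  0.000e+00\n"
    "NC7       CRP      C6         CN         2.000e+03 -1.000e+00  0.000e+00\n"
)

# Assumed (start, end) character offsets, half-open. The beta field starts at
# offset 51, one character before its header text, so the sign is included.
colspecs = [(0, 10), (10, 19), (19, 30), (30, 41), (41, 51), (51, 63), (63, 72)]

df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs)
print(df["beta"].tolist())
```

Because the boundaries are fixed up front, the position of the header text no longer decides where a column starts.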

MaxU's answer is good, but just a quick comment: with sep= you are giving a separator, but the point of read_fwf is that you have a column-organized file, not a separator-organized file. So I don't think you ever want to combine read_fwf with the sep= argument. If you want to use a separator, just use read_csv.
– JohnE, Commented Nov 17, 2016 at 21:53
It never occurred to me that sep= would have been the problem. I had thought it benign since it was included in the docs for read_fwf().
– SteelAngel, Commented Nov 18, 2016 at 2:46
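
To illustrate the distinction drawn in the comments, here is a rough sketch using two rows from the question, cut down to six fields. The variable names are made up for the example. Splitting on whitespace only works when no field is blank, which is why this particular file needs column positions rather than a separator:

```python
import io

import pandas as pd

complete = "SiC4+  e-    C2     c-SiC2     1.500e-07 -5.000e-01"
blank_4th = "O      e-    O-                1.500e-15  0.000e+00"

# Whitespace splitting cannot see the empty fourth field, so the two rows
# yield different field counts.
n_complete = len(complete.split())
n_blank = len(blank_4th.split())
print(n_complete, n_blank)

# When every field is present, a separator-based read with read_csv is enough.
df = pd.read_csv(
    io.StringIO(complete + "\n" + complete + "\n"),
    sep=r"\s+",
    header=None,
)
print(df[5].tolist())
```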

MaxU - stand with Ukraine · Accepted Answer · 2016-11-18 16:56:33Z

UPDATE: solution for the updated question:

Assuming you have the following file:

Input1    Input2   Output1    Output2    alpha      beta       gamma     
NC3       CRP      C2         CN         2.000e+03  0.000e+00  0.000e+00
C2N2      CRP                 CN         2.000e+03  0.000e+00  0.000e+00 
NC7                C6         CN         2.000e+03 -1.000e+00  0.000e+00

Solution (fn is the full path to the file):

In [164]: df = pd.read_fwf(fn, header=None, skiprows=1)

In [165]: df.columns = pd.read_csv(fn, delim_whitespace=True, nrows=1).columns

In [166]: df
Out[166]:
  Input1 Input2 Output1 Output2   alpha  beta  gamma
0    NC3    CRP      C2      CN  2000.0   0.0    0.0
1   C2N2    CRP     NaN      CN  2000.0   0.0    0.0
2    NC7    NaN      C6      CN  2000.0  -1.0    0.0

OLD answer:

Try this:

In [63]: fn = r'D:\temp\.data\1.fwf'

In [64]: df = pd.read_fwf(fn, header=None)

In [65]: df
Out[65]:
      0   1   2       3             4    5    6    7    8     9   10  11   12  13   14  15  16
0  SiC4+  e-  C2  c-SiC2  1.500000e-07 -0.5  0.0  2.0  0.0  logn   8  10  280   3  746   1   1
1  SiC4+  e-   C  l-SiC3  1.500000e-07 -0.5  0.0  2.0  0.0  logn   8  10  280   3  747   1   1
2      O  e-  O-     NaN  1.500000e-15  0.0  0.0  2.0  0.0  logn   8  10  280   3  744   1   1
3      S  e-  S-     NaN  5.000000e-15  0.0  0.0  2.0  0.0  logn   8  10  280   3  745   1   1

edited Nov 18, 2016 at 16:56

answered Nov 17, 2016 at 19:37

@SteelAngel, please see UPDATE
– MaxU - stand with Ukraine, Commented over a year ago
