.dat file import in pandas

Question

I want to import this publicly available file using pandas. Simply as csv (I have renamed simply .dat to .csv):

clinton = pd.read_csv("C:/Users/Mateusz/Downloads/ML_DS-20180523T193457Z-001/ML_DS/clinton1.csv")

However in some cases country name is composed of two words, not just one. In those cases shifts my data frame to the right. This looks like (name hot springs is in two columns): How to fix it for the entire dataset at once?

divibisan · Accepted Answer · 2018-05-31 17:07:24Z

No need to rename the .dat to .csv. Instead you can use a regex that matches two or more spaces as a column separator.

Try use sep parameter:

pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep='\s\s+', engine='python')

Output:

            0      1     2      3      4     5      6      7     8     9    10
0  Autauga, AL  30.92  31.7  57623  15768  15.2  10.74  51.41  60.4  2.36  457
1  Baldwin, AL  26.24  35.5  84935  16954  13.6   9.73  51.34  66.5  5.40  282
2  Barbour, AL  46.36  32.8  83656  15532  25.0   8.82  53.03  28.8  7.02   47
3   Blount, AL  32.92  34.5  61249  14820  15.0   9.67  51.15  62.4  2.36  185
4  Bullock, AL  67.67  31.7  75725  11120  33.0   7.08  50.76  17.6  2.91  141

If you want your state as a seperate column you can use this sep='\s\s+|,' which means seperate columns on two spaces or more OR a comma.

pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep='\s\s+|,', engine='python')

Output:

        0    1      2     3      4        5     6      7      8     9     10     11
0  Autauga   AL  30.92  31.7  57623  15768.0  15.2  10.74  51.41  60.4  2.36  457.0
1  Baldwin   AL  26.24  35.5  84935  16954.0  13.6   9.73  51.34  66.5  5.40  282.0
2  Barbour   AL  46.36  32.8  83656  15532.0  25.0   8.82  53.03  28.8  7.02   47.0
3   Blount   AL  32.92  34.5  61249  14820.0  15.0   9.67  51.15  62.4  2.36  185.0
4  Bullock   AL  67.67  31.7  75725  11120.0  33.0   7.08  50.76  17.6  2.91  141.0

Tim Nixon · Accepted Answer · 2018-05-31 17:18:47Z

0

You can use a regular expression as a separator. In your specific case, all the delimiters are more than one space whereas the spaces in the names are just single spaces.

import pandas as pd

clinton = pd.read_csv("clinton1.csv", sep='\s{2,}', header=None, engine='python')

edited May 31, 2018 at 17:18

answered May 31, 2018 at 16:44

Tim Nixon

458 bronze badges

Collectives™ on Stack Overflow

.dat file import in pandas

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related