Importing file containing text and numerical data using Python

Question

I have a .txt file that contains both text and numerical data. The first two rows of the file hold essential information as text, and the first column (I am referring to the zeroth column as the first column) also holds essential data as text. Everywhere else in the file, the data is numerical. I wish to analyze the numerical data using Python libraries, preferably numpy, pandas, or a combination of both (analysis such as regression, correlation, scikit-learn, etc.). I reiterate that all of the data in the file is essential for my analysis. The following snapshot (taken from Excel) shows a truncated version of the format my data is in:

The data shown in this snapshot can be found here.

In particular, I want to import all the numerical data from this file using Python (numpy or pandas), and to refer to specific rows and columns of this data using the text in the first two rows (Type, Tag) and the first column (object number). In my actual data file, I have hundreds of thousands of rows (object types) and scores of columns.

I have already tried numpy.loadtxt(...) and pandas.read_csv(...) to open this file, but I have either run into errors or loaded the data in clumsy formats. I would be really thankful for some direction on how to import the file in Python so that I get the functionality I want.

sacuL · Accepted Answer · 2018-09-26 01:24:13Z

4

If I were you, I would use pandas, and import it using something like this:

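import pandas as pd

# Tab-separated file: first two rows form the column header, first column is the row index
df = pd.read_csv('dum.txt', sep='\t', header=[0, 1], index_col=0)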

This gives you the dataframe:

>>> df
Type      T1   T2   T3   T4   T5
Tag     Good Good Good Good Good
object1  1.1  2.1  3.1  4.1  5.1
object2  1.2  2.2  3.2  4.2  5.2
object3  1.3  2.3  3.3  4.3  5.3
object4  1.4  2.4  3.4  4.4  5.4
object5  1.5  2.5  3.5  4.5  5.5
object6  1.6  2.6  3.6  4.6  5.6
object7  1.7  2.7  3.7  4.7  5.7
object8  1.8  2.8  3.8  4.8  5.8

And all of your columns are floats:

>>> df.dtypes
Type  Tag 
T1    Good    float64
T2    Good    float64
T3    Good    float64
T4    Good    float64
T5    Good    float64
dtype: object

It contains a multi-indexed column header:

>>> df.columns
MultiIndex(levels=[['T1', 'T2', 'T3', 'T4', 'T5'], ['Good']],
           labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
           names=['Type', 'Tag'])

And a regular index containing the object labels from the first column:

>>> df.index
Index(['object1', 'object2', 'object3', 'object4', 'object5', 'object6',
       'object7', 'object8'],
      dtype='object')
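With that layout, rows can be addressed by object label and columns by their (Type, Tag) pair. The following is only a minimal sketch: the inline `raw` string is a hypothetical stand-in for the file, and the "Bad" tag is invented here purely to make the tag-based selection visible.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated file (not the asker's real data).
raw = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tBad\tGood\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
    "object3\t1.3\t2.3\t3.3\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t", header=[0, 1], index_col=0)

# One column, addressed by its full (Type, Tag) pair: a Series indexed by object.
t2_bad = df[("T2", "Bad")]

# One row, addressed by its object label: a Series indexed by (Type, Tag).
obj3 = df.loc["object3"]

# A single cell: row label first, then the (Type, Tag) column key.
cell = df.loc["object2", ("T3", "Good")]

# Every column whose Tag level equals "Good", keeping both header levels.
good_only = df.xs("Good", axis=1, level="Tag", drop_level=False)
```

Passing `drop_level=False` keeps the Tag level in the result, so the selected columns can still be addressed by their (Type, Tag) pairs afterwards.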

Furthermore, you can convert your values to a numpy array of floats simply by using:

>>> df.values
array([[1.1, 2.1, 3.1, 4.1, 5.1],
       [1.2, 2.2, 3.2, 4.2, 5.2],
       [1.3, 2.3, 3.3, 4.3, 5.3],
       [1.4, 2.4, 3.4, 4.4, 5.4],
       [1.5, 2.5, 3.5, 4.5, 5.5],
       [1.6, 2.6, 3.6, 4.6, 5.6],
       [1.7, 2.7, 3.7, 4.7, 5.7],
       [1.8, 2.8, 3.8, 4.8, 5.8]])
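As a rough illustration of the kind of analysis mentioned in the question, the array can be passed to numpy directly, or the correlation can be computed on the dataframe so the labels are kept. This is a sketch under assumptions: the numbers in `raw` are made up for illustration, and `to_numpy()` (available in pandas 0.24 and later) is used in place of `.values`.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical values chosen only so that the correlations are not all identical.
raw = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tGood\tGood\n"
    "object1\t1.0\t2.1\t5.0\n"
    "object2\t2.0\t3.9\t3.0\n"
    "object3\t3.0\t6.2\t4.0\n"
    "object4\t4.0\t7.8\t1.0\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t", header=[0, 1], index_col=0)

# Plain float64 array: rows are objects, columns are the (Type, Tag) pairs.
values = df.to_numpy()

# Pearson correlation between columns, computed on the bare array.
corr_np = np.corrcoef(values, rowvar=False)

# Same correlation via pandas; the result keeps the (Type, Tag) labels on both axes.
corr_df = df.corr()
```

The pandas version is usually easier to read back, since an entry such as `corr_df.loc[("T1", "Good"), ("T2", "Good")]` names the two columns being compared.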

answered Sep 26, 2018 at 1:24


2 Comments

Ling Guo Over a year ago

Sacul: Thank you so much, this is really helpful =)

sacuL Over a year ago

Glad I could help!

U13-Forward · Accepted Answer · 2018-09-26 01:31:23Z

3

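Use sep with the regex \s to split on any whitespace character, not only tabs, and pass engine='python' to avoid the warning that a regex separator triggers with the default parser engine:

df = pd.read_csv('dum.txt', engine='python', sep=r'\s')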
print(df)

Output:

      Type    T1    T2    T3    T4    T5
0      Tag  Good  Good  Good  Good  Good
1  object1   1.1   2.1   3.1   4.1   5.1
2  object2   1.2   2.2   3.2   4.2   5.2
3  object3   1.3   2.3   3.3   4.3   5.3
4  object4   1.4   2.4   3.4   4.4   5.4
5  object5   1.5   2.5   3.5   4.5   5.5
6  object6   1.6   2.6   3.6   4.6   5.6
7  object7   1.7   2.7   3.7   4.7   5.7
8  object8   1.8   2.8   3.8   4.8   5.8

Or, if you want a two-row column header (I would not recommend it, because the result is harder to use):

df = pd.read_csv('dum.txt', engine='python', sep=r'\s', header=[0, 1])
print(df)

Output:

      Type   T1   T2   T3   T4   T5
       Tag Good Good Good Good Good
0  object1  1.1  2.1  3.1  4.1  5.1
1  object2  1.2  2.2  3.2  4.2  5.2
2  object3  1.3  2.3  3.3  4.3  5.3
3  object4  1.4  2.4  3.4  4.4  5.4
4  object5  1.5  2.5  3.5  4.5  5.5
5  object6  1.6  2.6  3.6  4.6  5.6
6  object7  1.7  2.7  3.7  4.7  5.7
7  object8  1.8  2.8  3.8  4.8  5.8
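If the two-row header turns out to be awkward to work with, one possible workaround is to collapse each (Type, Tag) pair into a single string after loading. This is a sketch only: the `raw` string is a hypothetical stand-in for the file, the underscore separator is an arbitrary choice, and `index_col=0` is assumed so that the object labels become the row index.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated file.
raw = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tGood\tGood\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t", header=[0, 1], index_col=0)

# Each MultiIndex column label is a (Type, Tag) tuple; join it into one string.
flat = df.copy()
flat.columns = ["_".join(col) for col in df.columns]

# Columns are now ordinary strings such as "T1_Good".
first_value = flat.loc["object1", "T1_Good"]
```

This trades the ability to select by Tag level for simpler column names, so it fits best when the tag is only needed as a label.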

Otherwise, a plain read_csv call with the default comma separator (like pd.read_csv('dum.txt')) leaves every line as a single column with the tabs still in it:

            Type\tT1\tT2\tT3\tT4\tT5
0  Tag\tGood\tGood\tGood\tGood\tGood
1   object1\t1.1\t2.1\t3.1\t4.1\t5.1
2   object2\t1.2\t2.2\t3.2\t4.2\t5.2
3   object3\t1.3\t2.3\t3.3\t4.3\t5.3
4   object4\t1.4\t2.4\t3.4\t4.4\t5.4
5   object5\t1.5\t2.5\t3.5\t4.5\t5.5
6   object6\t1.6\t2.6\t3.6\t4.6\t5.6
7   object7\t1.7\t2.7\t3.7\t4.7\t5.7
8   object8\t1.8\t2.8\t3.8\t4.8\t5.8

edited Sep 26, 2018 at 1:31

answered Sep 26, 2018 at 1:26


6 Comments

Ling Guo Over a year ago

U9-Forward: Thank you very much, your comment is very helpful. =)

Ling Guo Over a year ago

U9-Forward: I have to wait for one more minute to accept any answer. I see this message You can accept an answer in 1 more minute. ;)

U13-Forward Over a year ago

@LingGuo Then wait a minute :-)

Ling Guo Over a year ago

U9-Forward: I upvoted your answer, and all your (and sacul's) comments. There hasn't been any downvote from my side. You and sacul have been most helpful. Did you see any downvote?

U13-Forward Over a year ago

@LingGuo I did, but have no idea why? :-) Thanks tho :-)
