
I have a .txt file that contains both text and numerical data. The first two rows hold essential information as text, and so does the first column (by "first column" I mean column zero). Everywhere else the data is numerical. I want to analyze the numerical data with Python libraries, preferably numpy or pandas or a combination of both (regression, correlation, scikit-learn, etc.). To reiterate, all of the data in the file is essential for my analysis. The following snapshot (taken from Excel) shows a truncated version of the format my data is in: [image: snapshot of the data file]

The data shown in this snapshot can be found here.

In particular, I want to import all the numerical data from this file using Python (numpy or pandas) and be able to refer to specific parts of this data using the text in the first two rows (Type, Tag) and the first column (object number). My actual data file has hundreds of thousands of rows (objects) and scores of columns.

I have already tried numpy.loadtxt(...) and pandas.read_csv(...) to open this file, but I have either run into errors or ended up with the data in clumsy formats. I would really appreciate some direction on how to import the file into Python so that I get the functionality I am after.

2 Answers

4

If I were you, I would use pandas, and import it using something like this:

import pandas as pd
df = pd.read_csv('dum.txt', sep='\t', header=[0, 1], index_col=0)

This gives you the dataframe:

>>> df
Type      T1   T2   T3   T4   T5
Tag     Good Good Good Good Good
object1  1.1  2.1  3.1  4.1  5.1
object2  1.2  2.2  3.2  4.2  5.2
object3  1.3  2.3  3.3  4.3  5.3
object4  1.4  2.4  3.4  4.4  5.4
object5  1.5  2.5  3.5  4.5  5.5
object6  1.6  2.6  3.6  4.6  5.6
object7  1.7  2.7  3.7  4.7  5.7
object8  1.8  2.8  3.8  4.8  5.8

And all of your columns are floats:

>>> df.dtypes
Type  Tag 
T1    Good    float64
T2    Good    float64
T3    Good    float64
T4    Good    float64
T5    Good    float64
dtype: object

It contains a multi-indexed column header:

>>> df.columns
MultiIndex(levels=[['T1', 'T2', 'T3', 'T4', 'T5'], ['Good']],
           labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
           names=['Type', 'Tag'])

And a regular index containing the object labels from the first column:

>>> df.index
Index(['object1', 'object2', 'object3', 'object4', 'object5', 'object6',
       'object7', 'object8'],
      dtype='object')
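
Since the question asks how to refer to the data by Type, Tag and object label, here is a minimal sketch of label-based selection on a frame read this way. The in-memory `sample` string is only a stand-in for 'dum.txt' (an assumption so the snippet runs on its own), and the variable names are illustrative:

```python
import io
import pandas as pd

# Hypothetical stand-in for 'dum.txt': tab-separated, two header rows,
# object labels in column zero.
sample = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tGood\tGood\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
    "object3\t1.3\t2.3\t3.3\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", header=[0, 1], index_col=0)

# One row by its object label; the result is a Series keyed by (Type, Tag).
row = df.loc["object2"]

# One column by its full (Type, Tag) key.
col = df[("T2", "Good")]

# A single cell: row label first, then the column tuple.
cell = df.loc["object3", ("T1", "Good")]

# Every column of a given Type, whatever its Tag.
t3 = df.xs("T3", axis=1, level="Type")
```

The tuple keys follow the order of the two header rows, so the Type comes first and the Tag second.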

Furthermore, you can convert your values to a numpy array of floats simply by using:

>>> df.values
array([[1.1, 2.1, 3.1, 4.1, 5.1],
       [1.2, 2.2, 3.2, 4.2, 5.2],
       [1.3, 2.3, 3.3, 4.3, 5.3],
       [1.4, 2.4, 3.4, 4.4, 5.4],
       [1.5, 2.5, 3.5, 4.5, 5.5],
       [1.6, 2.6, 3.6, 4.6, 5.6],
       [1.7, 2.7, 3.7, 4.7, 5.7],
       [1.8, 2.8, 3.8, 4.8, 5.8]])
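
Once you have the plain float array, the kind of analysis mentioned in the question can start from it. The following is only a rough sketch: the `sample` string again stands in for 'dum.txt', and `np.corrcoef` and `np.polyfit` are just two possible choices for correlation and a straight-line fit, not the only way to do either:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for 'dum.txt' (same layout as in the question).
sample = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tGood\tGood\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
    "object3\t1.3\t2.3\t3.3\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t", header=[0, 1], index_col=0)

# 2-D float array with one row per object and one column per (Type, Tag).
X = df.to_numpy()

# Pairwise correlation between columns; rowvar=False treats columns as variables.
corr = np.corrcoef(X, rowvar=False)

# Least-squares straight line predicting the second column from the first.
slope, intercept = np.polyfit(X[:, 0], X[:, 1], deg=1)
```

The same array can be passed to scikit-learn estimators, which generally accept 2-D numeric arrays of this shape.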

2 Comments

Sacul: Thank you so much, this is really helpful =)
Glad I could help!
3

Use sep=r'\s' to split on any whitespace character, not only tabs, and pass engine='python' so pandas does not warn about falling back to the Python parser:

df = pd.read_csv('dum.txt', engine='python', sep=r'\s')
print(df)

Output:

      Type    T1    T2    T3    T4    T5
0      Tag  Good  Good  Good  Good  Good
1  object1   1.1   2.1   3.1   4.1   5.1
2  object2   1.2   2.2   3.2   4.2   5.2
3  object3   1.3   2.3   3.3   4.3   5.3
4  object4   1.4   2.4   3.4   4.4   5.4
5  object5   1.5   2.5   3.5   4.5   5.5
6  object6   1.6   2.6   3.6   4.6   5.6
7  object7   1.7   2.7   3.7   4.7   5.7
8  object8   1.8   2.8   3.8   4.8   5.8
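
One thing to watch with this single-header read: the Tag row ends up as ordinary data, so the columns hold strings rather than floats. Below is a minimal sketch of one way to get numbers back, assuming you are willing to drop the Tag row. The `sample` string stands in for 'dum.txt', and `r"\s+"` (which collapses runs of whitespace) is used here as a variation on the separator:

```python
import io
import pandas as pd

# Hypothetical stand-in for 'dum.txt'.
sample = (
    "Type\tT1\tT2\n"
    "Tag\tGood\tGood\n"
    "object1\t1.1\t2.1\n"
    "object2\t1.2\t2.2\n"
)

flat = pd.read_csv(io.StringIO(sample), sep=r"\s+", engine="python")

# 'Good' and '1.1' share a column, so no column is numeric yet.
print(flat.dtypes)

# Drop the Tag row, use the object labels as the index, then convert to float.
numeric = flat.iloc[1:].set_index("Type").astype(float)
print(numeric.dtypes)
```

This loses the Tag information, so it only makes sense when the Tag row is not needed for selection.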

Or, if you want two header rows (I would not recommend this, because the result is harder to work with):

df = pd.read_csv('dum.txt', engine='python', sep=r'\s', header=[0, 1])
print(df)

Output:

      Type   T1   T2   T3   T4   T5
       Tag Good Good Good Good Good
0  object1  1.1  2.1  3.1  4.1  5.1
1  object2  1.2  2.2  3.2  4.2  5.2
2  object3  1.3  2.3  3.3  4.3  5.3
3  object4  1.4  2.4  3.4  4.4  5.4
4  object5  1.5  2.5  3.5  4.5  5.5
5  object6  1.6  2.6  3.6  4.6  5.6
6  object7  1.7  2.7  3.7  4.7  5.7

Otherwise, a plain read_csv call with the default comma separator (pd.read_csv('dum.txt')) returns a single column with the tabs left in:

            Type\tT1\tT2\tT3\tT4\tT5
0  Tag\tGood\tGood\tGood\tGood\tGood
1   object1\t1.1\t2.1\t3.1\t4.1\t5.1
2   object2\t1.2\t2.2\t3.2\t4.2\t5.2
3   object3\t1.3\t2.3\t3.3\t4.3\t5.3
4   object4\t1.4\t2.4\t3.4\t4.4\t5.4
5   object5\t1.5\t2.5\t3.5\t4.5\t5.5
6   object6\t1.6\t2.6\t3.6\t4.6\t5.6
7   object7\t1.7\t2.7\t3.7\t4.7\t5.7
8   object8\t1.8\t2.8\t3.8\t4.8\t5.8

6 Comments

U9-Forward: Thank you very much, your comment is very helpful. =)
U9-Forward: I have to wait for one more minute to accept any answer. I see this message You can accept an answer in 1 more minute. ;)
@LingGuo Then wait a minute :-)
U9-Forward: I upvoted your answer, and all your (and sacul's) comments. There hasn't been any downvote from my side. You and sacul have been most helpful. Did you see any downvote?
@LingGuo I did, but have no idea why? :-) Thanks tho :-)