build matrix from multiple files using pandas

Question

Have multiple files(20) in a directory with 2 columns, for eg

transcript_id value
ENMUST001     2
ENMUST003     3
ENMUST004     5

number of rows differ in each file what I would like to do is merge all the 20 files in one huge matrix in this way

transcript_id value_file1 value_file2....value_file20
ENMUST001     2  3 
ENMUST003     3  4
ENMUST004     5  0

Collect all the ids from transcript_id column and corresponding values from each file(filename as column name) and if there is no value use 0.

I tried to do this using pandas,

import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df.set_index('transcript_id')
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0) 

Error:
ValueError: No objects to concatenate

Wondering if non pandas way is a better approach? any pseudo code is appreciated.

edit

output found

df.set_index('transcript_id')
print (df.shape)

    (921, 1)
    (1414, 1)
    (659, 1)
    (696, 1)
    (313, 1)
print (df.is_unique)
    (921, 1)
False
(1414, 1)
False
(659, 1)
False
(696, 1)
False
(313, 1)
False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)

New error
ValueError: All objects passed were None

duplicate print

before:  (921, 1)
after:  (914, 1)
before:  (1414, 1)
after:  (1410, 1)
before:  (659, 1)
after:  (658, 1)
before:  (696, 1)
after:  (694, 1)
before:  (313, 1)
after:  (312, 1)

vahndi · Accepted Answer · 2017-08-05 17:52:58Z

3

The default behaviour for set_index is inplace=False. Try replacing df.set_index('transcript_id') with df = df.set_index('transcript_id'). Also you can remove duplicate values in the index using df = df[~df.index.duplicated(keep='first')].

import os
import glob
import pandas as pd

path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df = df.set_index('transcript_id') # set index
    df = df[~df.index.duplicated(keep='first')] # remove duplicates
    df.columns = [os.path.split(file)[-1]] # set column name to filename
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)

edited Aug 5, 2017 at 17:52

answered Aug 5, 2017 at 16:54

vahndi

1,0458 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

12 Comments

sid Over a year ago

hi, thank you for reply i found this error ValueError: Shape of passed values is (5, 3063), indices imply (5, 3056) when i change it, i also try df_combine = pd.concat(df_files, axis=0).fillna(0) i get only two column as output

vahndi Over a year ago

Can you add a line print(df.shape) after df = df.set_index('transcript_id') and tell me what you get?

vahndi Over a year ago

Sounds like you might have duplicate indexes. Try adding print(df.index.is_unique). If it is False for any df then you can do df.drop_duplicates(inplace=True).

vahndi Over a year ago

No worries - I might have led you astray there. Try df = df[~df.index.duplicated(keep='first')] to remove duplicate index entries (from stackoverflow.com/questions/13035764/…)

vahndi Over a year ago

added in answer above

|

Collectives™ on Stack Overflow

build matrix from multiple files using pandas

edit

duplicate print

1 Answer 1

12 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

edit

duplicate print

1 Answer 1

12 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related