3

Have multiple files(20) in a directory with 2 columns, for eg

transcript_id value
ENMUST001     2
ENMUST003     3
ENMUST004     5

number of rows differ in each file what I would like to do is merge all the 20 files in one huge matrix in this way

transcript_id value_file1 value_file2....value_file20
ENMUST001     2  3 
ENMUST003     3  4
ENMUST004     5  0

Collect all the ids from transcript_id column and corresponding values from each file(filename as column name) and if there is no value use 0.

I tried to do this using pandas,

import os
import glob
import pandas as pd
path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df.set_index('transcript_id')
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0) 

Error:
ValueError: No objects to concatenate

Wondering if non pandas way is a better approach? any pseudo code is appreciated.

edit

output found

df.set_index('transcript_id')
print (df.shape)

    (921, 1)
    (1414, 1)
    (659, 1)
    (696, 1)
    (313, 1)
print (df.is_unique)
    (921, 1)
False
(1414, 1)
False
(659, 1)
False
(696, 1)
False
(313, 1)
False
df = df.drop_duplicates(inplace=True)
df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0)

New error
ValueError: All objects passed were None

duplicate print

before:  (921, 1)
after:  (914, 1)
before:  (1414, 1)
after:  (1410, 1)
before:  (659, 1)
after:  (658, 1)
before:  (696, 1)
after:  (694, 1)
before:  (313, 1)
after:  (312, 1)
0

1 Answer 1

3

The default behaviour for set_index is inplace=False. Try replacing df.set_index('transcript_id') with df = df.set_index('transcript_id'). Also you can remove duplicate values in the index using df = df[~df.index.duplicated(keep='first')].

import os
import glob
import pandas as pd

path = 'pathtofiles'
transFiles = glob.glob(path + "*.tsv")
df_files = []
for file in transFiles:
    df = pd.read_csv(file, sep='\t')
    df = df.set_index('transcript_id') # set index
    df = df[~df.index.duplicated(keep='first')] # remove duplicates
    df.columns = [os.path.split(file)[-1]] # set column name to filename
    df_files.append(df)
df_combine = pd.concat(df_files, axis=1).fillna(0) 
Sign up to request clarification or add additional context in comments.

12 Comments

hi, thank you for reply i found this error ValueError: Shape of passed values is (5, 3063), indices imply (5, 3056) when i change it, i also try df_combine = pd.concat(df_files, axis=0).fillna(0) i get only two column as output
Can you add a line print(df.shape) after df = df.set_index('transcript_id') and tell me what you get?
Sounds like you might have duplicate indexes. Try adding print(df.index.is_unique). If it is False for any df then you can do df.drop_duplicates(inplace=True).
No worries - I might have led you astray there. Try df = df[~df.index.duplicated(keep='first')] to remove duplicate index entries (from stackoverflow.com/questions/13035764/…)
added in answer above
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.