Create two dataframes using Pandas from a text file Python

Question

I need to create two dataframes to work with my data, and I thought about doing it with pandas.

This is the provided data:

class([1,0,0,0],"Small-molecule metabolism ").
class([1,1,0,0],"Degradation ").
class([1,1,1,0],"Carbon compounds ").
function(tb186,[1,1,1,0],'bglS',"beta-glucosidase").
function(tb2202,[1,1,1,0],'cbhK',"carbohydrate kinase").
function(tb727,[1,1,1,0],'fucA',"L-fuculose phosphate aldolase").
function(tb1731,[1,1,1,0],'gabD1',"succinate-semialdehyde dehydrogenase").
function(tb234,[1,1,1,0],'gabD2',"succinate-semialdehyde dehydrogenase").
function(tb501,[1,1,1,0],'galE1',"UDP-glucose 4-epimerase").
function(tb536,[1,1,1,0],'galE2',"UDP-glucose 4-epimerase").
function(tb620,[1,1,1,0],'galK',"galactokinase").
function(tb619,[1,1,1,0],'galT',"galactose-1-phosphate uridylyltransferase C-term").
function(tb618,[1,1,1,0],'galT',"null").
function(tb993,[1,1,1,0],'galU',"UTP-glucose-1-phosphate uridylyltransferase").
function(tb3696,[1,1,1,0],'glpK',"ATP:glycerol 3-phosphotransferase").
function(tb3255,[1,1,1,0],'manA',"mannose-6-phosphate isomerase").
function(tb3441,[1,1,1,0],'mrsA',"phosphoglucomutase or phosphomannomutase").
function(tb118,[1,1,1,0],'oxcA',"oxalyl-CoA decarboxylase").
function(tb3068,[1,1,1,0],'pgmA',"phosphoglucomutase").
function(tb3257,[1,1,1,0],'pmmA',"phosphomannomutase").
function(tb3308,[1,1,1,0],'pmmB',"phosphomannomutase").
function(tb2702,[1,1,1,0],'ppgK',"polyphosphate glucokinase").
function(tb408,[1,1,1,0],'pta',"phosphate acetyltransferase").
function(tb729,[1,1,1,0],'xylB',"xylulose kinase").
function(tb1096,[1,1,1,0],'null',"null").
class([1,1,2,0],"Amino acids and amines ").
function(tb1905,[1,1,2,0],'aao',"D-amino acid oxidase").
function(tb2531,[1,1,2,0],'adi',"ornithine/arginine decarboxylase").
function(tb2780,[1,1,2,0],'ald',"L-alanine dehydrogenase").
function(tb1538,[1,1,2,0],'ansA',"L-asparaginase").
function(tb1001,[1,1,2,0],'arcA',"arginine deiminase").
function(tb753,[1,1,2,0],'mmsA',"methylmalmonate semialdehyde dehydrogenase").
function(tb751,[1,1,2,0],'mmsB',"methylmalmonate semialdehyde oxidoreductase").

And I would like to have something like:

Is it possible with Pandas? Thanks in advance.

Is the data in a text file?

– goalie1998, Commented Jan 12, 2021 at 18:36
Yes sir @goalie1998

– Thony, Commented Jan 12, 2021 at 18:49

edinho · Accepted Answer · 2021-01-12 19:00:38Z

1

Yes, it is possible. Below is an example.
There are many ways of doing it (some are already in other answers). In this example I tried to make the steps clear in the code.

import io
import pandas as pd

with open("file.txt") as f:
    lines = f.readlines()  # reads your file line by line and returns a list

### sample:
# ['class([1,0,0,0],"Small-molecule metabolism ").\n',
#  'class([1,1,0,0],"Degradation ").\n',
#  'class([1,1,1,0],"Carbon compounds ").\n',
#  'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").\n', ... ]

df1 = []
df2 = []

for line in lines:
    # this transformation will be common to all lines
    line = line.strip(').\n').replace("[", '"[').replace("]", ']"')

    # here we will separate the lines, perform the specific transformation and append them to their specific variable
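    if line.startswith("class("):
        # slice off the prefix; str.strip("class(") would remove a set of characters instead
        line = line[len("class("):]  # specific transform for "class" line
        df1.append(line)
    elif line.startswith("function("):
        # slicing keeps the leading "t" of ids like tb186, which strip("function(") would remove
        line = line[len("function("):]  # specific transform for "function" line
        df2.append(line)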

# in this final block we prepare the variable to be read with pandas and read
df1 = "\n".join(df1)  # prepare
df1 = pd.read_csv(
    io.StringIO(df1),  # as pandas expects a file handler, we use io.StringIO
    header=None,  # no headers, they are given "manually"
    names=['id', 'name'],  # headers
)

# the same as before
df2 = "\n".join(df2)
df2 = pd.read_csv(
    io.StringIO(df2),
    header=None,
    names=['orf', 'class', 'genName', 'desc']
)
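
When removing the class( and function( prefixes, it helps to remember that str.strip treats its argument as a set of characters rather than as a literal prefix. A minimal sketch of the difference, using one line from the question (the variable names are illustrative and not part of the answer):

```python
raw = "function(tb186,[1,1,1,0],'bglS',\"beta-glucosidase\")."

# strip() removes any leading/trailing characters found in the given set,
# so the "t" that starts the id is removed along with the prefix
stripped = raw.strip("function(")

# slicing by the prefix length removes exactly the prefix and nothing else
prefix = "function("
sliced = raw[len(prefix):] if raw.startswith(prefix) else raw

print(stripped[:5])  # b186,
print(sliced[:6])    # tb186,
```

On Python 3.9 or later, raw.removeprefix("function(") should give the same result as the slice.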


5 Comments

Thony Over a year ago

Hi, I just realized that the df.dtypes are all object. Is there any way to make them strings?

edinho Over a year ago

In this context, an object is a synonym of a string. Take a look here: pandas.pydata.org/pandas-docs/stable/reference/api/…
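
To see what that means in practice, here is a small sketch. The Series is built by hand for illustration (only the gene names come from the question), and it assumes pandas 1.0 or later, where the dedicated "string" dtype is available:

```python
import pandas as pd

# a column of Python strings stored with the generic object dtype
names = pd.Series(["bglS", "cbhK", "fucA"], dtype=object)
print(names.dtype)  # object

# every element is still an ordinary Python str
print(all(isinstance(v, str) for v in names))  # True

# opt in to the dedicated string dtype if it should show up in dtypes
as_string = names.astype("string")
print(as_string.dtype)
```

String methods such as .str.strip() work on both versions of the column, so the conversion is optional.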

Thony Over a year ago

So if I want to get the IDClass where ClassDesc equals "Respiration", should this work? result2 = df1.loc(df1['ClassDesc'].equals("Respiration"), 'IDClass')

Thony Over a year ago

All right, I just realized that "class" lines end with ' ").', so it would find "Respiration " instead of "Respiration", @edinho. Should I replace line = line.strip(').\n').replace("[", '"[').replace("]", ']"') with line = line.strip(').\n').replace("[", '"[').replace("]", ']"').replace(' ").', '").') in order to fix it?

edinho Over a year ago

You can do that. But I would rather fix this data in the dataframe later. Something like df["name"] = df["name"].str.strip().
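
A minimal sketch of the lookup discussed in these comments, assuming the id and name columns from the answer. The rows are the three class lines shown in the question, so "Degradation" stands in for "Respiration", which is not in the sample. Two details differ from the attempt in the comment: .loc is indexed with square brackets, and == compares element by element, whereas Series.equals compares the whole column to a single object and returns one boolean.

```python
import pandas as pd

# hand-built stand-in for df1, using the class lines from the question
df1 = pd.DataFrame(
    {
        "id": ["[1,0,0,0]", "[1,1,0,0]", "[1,1,1,0]"],
        "name": ["Small-molecule metabolism ", "Degradation ", "Carbon compounds "],
    }
)

# remove the trailing space that the source file leaves in every class name
df1["name"] = df1["name"].str.strip()

# boolean mask on the rows, column label for the value to return
result = df1.loc[df1["name"] == "Degradation", "id"]
print(result.tolist())  # ['[1,1,0,0]']
```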

r.burak · Accepted Answer · 2021-01-12 18:47:59Z

0

I made a file with your text, and here's the code. You can repeat it for df_func. Enjoy.

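import pandas as pd

cols = ['x', 'y']
# split each line at "(" into the record type (x) and the rest of the line (y)
df = pd.read_csv('1.txt', sep='(', names=cols, header=None)
df.head()
# .copy() avoids SettingWithCopyWarning when columns are added below
df_class = df[df['x'] == 'class'].copy()
df_func = df[df['x'] == 'function'].copy()
# split y into the id list and the description; n is keyword-only in pandas 2.0+
df_class[['y', 'z']] = df_class['y'].str.split(',"', n=1, expand=True)
df_class['z'] = df_class['z'].str[:-4]  # drop the trailing ' ").'
df_class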
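
The answer leaves df_func as an exercise. One possible way to split its y column is sketched below with Series.str.extract and a regular expression, which is a different technique from the split-and-slice used for df_class. The pattern is an assumption based on the lines shown in the question, and the column names are borrowed from the other answer:

```python
import pandas as pd

# the part after "function(" for two lines from the question
y = pd.Series([
    "tb186,[1,1,1,0],'bglS',\"beta-glucosidase\").",
    "tb2202,[1,1,1,0],'cbhK',\"carbohydrate kinase\").",
])

# one capture group per field: orf, class id, gene name, description
pattern = r"""^(\w+),(\[[^\]]*\]),'([^']*)',"(.*)"\)\.$"""
df_func = y.str.extract(pattern)
df_func.columns = ["orf", "class", "genName", "desc"]

print(df_func.loc[0].tolist())
# ['tb186', '[1,1,1,0]', 'bglS', 'beta-glucosidase']
```

Lines that do not match the pattern come back as NaN in every column, which makes malformed records easy to spot.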


frab · Accepted Answer · 2021-01-12 19:06:03Z

0

Read each line of the file, then, for each line:

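import ast
import re

# line contains the line of the file for this iteration
if line.startswith("class"):
    # str.replace returns a new string, so the result has to be assigned
    line = line.replace("class(", "[").replace(").", "]")
    line = ast.literal_eval(line)  # parses literals only, safer than eval
    # class type stuff
elif line.startswith("function"):
    line = line.replace("function(", "[").replace(").", "]")
    # quote the bare first field (e.g. tb186) so the list is a valid literal
    line = re.sub(r"^\[(\w+),", r"['\1',", line)
    line = ast.literal_eval(line)
    # function type stuff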

The resulting line variable will be the list of the elements for the line. Then you can do whatever you need with it.

Example: first line = [[1,0,0,0],"Small-molecule metabolism "]

edited Jan 12, 2021 at 19:06

answered Jan 12, 2021 at 18:50
