3

I have a csv file like:

"B/G/213","B/C/208","WW_cis",,
"B/U/215","B/A/206","WW_cis",,
"B/C/214","B/G/207","WW_cis",,
"B/G/217","B/C/204","WW_cis",,
"B/A/216","B/U/205","WW_cis",,
"B/C/219","B/G/202","WW_cis",,
"B/U/218","B/A/203","WW_cis",,
"B/G/201","B/C/220","WW_cis",,
"B/A/203","B/U/218","WW_cis",,

and I want to read it into something like an array or dataframe, so that I can compare elements from one column to selected elements from other columns. At first I read it straight into an array using numpy.genfromtxt, but I got strings like '"B/A/203"' with extra quotes " everywhere. I read somewhere that pandas can strip the extra " from strings, so I tried:

class StructureReader(object):
    def __init__(self, filename):
        self.filename=filename
    def read(self):
        self.data=pd.read_csv(StringIO(str("RNA/"+self.filename)), header=None, sep = ",")
        self.data

but I get something like so:

<class 'pandas.core.frame.DataFrame'>
              0
0  RNA/4v6p.csv

How can I get my CSV file into some kind of a data type that would allow me to search through columns and rows?

2
  • No such thing as a stupid question. couldn't resist, lol. Commented Mar 20, 2016 at 16:28
  • My comment now seems mean... the original question had that in the narrative. I meant it to encourage the OP's quest for knowledge. Commented Mar 20, 2016 at 16:50

3 Answers

3

Data Insert

You are putting the filename string itself into your DataFrame, i.e. RNA/4v6p.csv is your data at row 0, col 0. You need to read the file and store its contents. This can be done by removing StringIO(str(...)) from your class:

class StructureReader(object):
    def __init__(self, filename):
        self.filename = filename
    def read(self):
        self.data = pd.read_csv("RNA/" + self.filename, header=None, sep=",")
        return self.data
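
To see why the original call produced a one-cell DataFrame, here is a minimal sketch. It assumes a two-line inline sample in place of the real file (the `sample` string is only for illustration): a path wrapped in StringIO is parsed as the CSV text itself, while StringIO around actual CSV text parses as expected.

```python
import io
import pandas as pd

# A path wrapped in StringIO is treated as the CSV *content*, not as a location.
wrong = pd.read_csv(io.StringIO("RNA/4v6p.csv"), header=None)
print(wrong.shape)       # (1, 1)
print(wrong.iloc[0, 0])  # RNA/4v6p.csv

# StringIO is only appropriate when the CSV text is already in memory.
# `sample` is a hypothetical two-row excerpt of the data in the question.
sample = '"B/G/213","B/C/208","WW_cis",,\n"B/U/215","B/A/206","WW_cis",,\n'
right = pd.read_csv(io.StringIO(sample), header=None)
print(right.shape)       # (2, 5)
```

Note that pandas also removes the surrounding double quotes on its own, so no extra stripping step is needed.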

Code structure critique

I would also recommend not hardcoding the parent directory inside read(). You can avoid that by:

  1. Always passing in a full file path

    class StructureReader(object):
        def __init__(self, filepath):
            self.filepath = filepath
        def read(self):
            self.data = pd.read_csv(self.filepath, header=None, sep=",")
            return self.data
    
  2. Making the directory an __init__() argument

    class StructureReader(object):
        def __init__(self, directory, filename):
            self.directory = directory
            self.filename = filename
        def read(self):
            self.data = pd.read_csv(self.directory + "/" + self.filename, header=None, sep=",")
            # or import os and self.data = pd.read_csv(os.path.join(self.directory, self.filename), header=None, sep=",")
            return self.data
    
  3. Making the directory a constant attribute

    class StructureReader(object):
        def __init__(self, filename):
            self.directory = "RNA"
            self.filename = filename
        def read(self):
            self.data = pd.read_csv(self.directory + "/" + self.filename, header=None, sep=",")
            # or import os and self.data = pd.read_csv(os.path.join(self.directory, self.filename), header=None, sep=",")
            return self.data
    

This has nothing to do with reading your data; it is just best-practice commentary on structuring your code (just my $0.02).
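
As a rough sketch of how options 2 and 3 could be combined, the directory can be an `__init__()` argument with "RNA" as its default. The default value, the `path` attribute, and the throwaway `demo.csv` file are my own assumptions for illustration, not something from the question.

```python
import os
import tempfile
import pandas as pd

class StructureReader(object):
    def __init__(self, filename, directory="RNA"):
        # The default "RNA" mirrors the question; callers can override it.
        self.path = os.path.join(directory, filename)

    def read(self):
        # Store the parsed frame on the instance and also return it.
        self.data = pd.read_csv(self.path, header=None, sep=",")
        return self.data

# Demo with a temporary file so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.csv"), "w") as fh:
        fh.write('"B/G/213","B/C/208","WW_cis",,\n')
    frame = StructureReader("demo.csv", directory=tmp).read()

print(frame.shape)  # (1, 5)
```

With this layout, StructureReader("4v6p.csv").read() would look in RNA/, and tests or other callers can point it at a different directory.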

2

IIUC, you can just read it with:

df = pd.read_csv('yourfile.csv', header=None)

that for me returns:

         0        1       2   3   4
0  B/G/213  B/C/208  WW_cis NaN NaN
1  B/U/215  B/A/206  WW_cis NaN NaN
2  B/C/214  B/G/207  WW_cis NaN NaN
3  B/G/217  B/C/204  WW_cis NaN NaN
4  B/A/216  B/U/205  WW_cis NaN NaN
5  B/C/219  B/G/202  WW_cis NaN NaN
6  B/U/218  B/A/203  WW_cis NaN NaN
7  B/G/201  B/C/220  WW_cis NaN NaN
8  B/A/203  B/U/218  WW_cis NaN NaN

you can then select only the columns you want with:

df = df[[0,1,2]]

and operate as usual with dataframes.
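
For the column comparisons mentioned in the question, a minimal sketch could look like the following. It assumes a three-row inline excerpt of the data (the `sample` string) rather than the real file, and the three lookups are just examples of what "operate as usual" might mean here.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt of the file from the question.
sample = '''"B/U/218","B/A/203","WW_cis",,
"B/G/201","B/C/220","WW_cis",,
"B/A/203","B/U/218","WW_cis",,
'''
df = pd.read_csv(io.StringIO(sample), header=None)[[0, 1, 2]]

# Rows whose first column equals a given value.
hits = df[df[0] == "B/A/203"]

# Rows whose first-column value also appears somewhere in the second column.
shared = df[df[0].isin(df[1])]

# Rows whose (column 0, column 1) pair also appears reversed in the frame.
pairs = set(zip(df[0], df[1]))
mask = [(b, a) in pairs for a, b in zip(df[0], df[1])]
mirrored = df.loc[mask]

print(hits.index.tolist())      # [2]
print(shared.index.tolist())    # [0, 2]
print(mirrored.index.tolist())  # [0, 2]
```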

1

I think you've mixed up StringIO with the file name. You either have your data as a string and then you use StringIO or you simply specify a file name (not using StringIO):

In [189]: data="""\
   .....: "B/G/213","B/C/208","WW_cis",,
   .....: "B/U/215","B/A/206","WW_cis",,
   .....: "B/C/214","B/G/207","WW_cis",,
   .....: "B/G/217","B/C/204","WW_cis",,
   .....: "B/A/216","B/U/205","WW_cis",,
   .....: "B/C/219","B/G/202","WW_cis",,
   .....: "B/U/218","B/A/203","WW_cis",,
   .....: "B/G/201","B/C/220","WW_cis",,
   .....: "B/A/203","B/U/218","WW_cis",,
   .....: """

In [190]: df = pd.read_csv(io.StringIO(data), sep=',', header=None, usecols=[0,1,2])

In [191]: df
Out[191]:
         0        1       2
0  B/G/213  B/C/208  WW_cis
1  B/U/215  B/A/206  WW_cis
2  B/C/214  B/G/207  WW_cis
3  B/G/217  B/C/204  WW_cis
4  B/A/216  B/U/205  WW_cis
5  B/C/219  B/G/202  WW_cis
6  B/U/218  B/A/203  WW_cis
7  B/G/201  B/C/220  WW_cis
8  B/A/203  B/U/218  WW_cis

PS: you can choose which columns to parse (i.e. which ones end up in your data frame) with the usecols parameter.

Or, using a file name:

import os

df = pd.read_csv(os.path.join('RNA', self.filename), sep=',', header=None, usecols=[0,1,2])
