Calling data from a csv file with its header Python

Question

I've been using the following code to call columns based on their headers.

def GetValuesFromColumn(title):

  values = []
  rownum = 0
  with open(file, 'r') as f:
    reader = csv.reader(f)
    for row in reader:
      if rownum == 0:
        index = row.index(title)
        rownum = 1
      else:
        values.append(row[index])

  return values

It's working fine. But I'm now working with files where more than one column can have the same header, and my script returns only the first matching column. Instead, I'd like to select the column by checking whether it contains a particular word. For instance, suppose there are three columns named 'data'. The first data column has info about tissue, the second about cell, and the third about organism, like below:

data,data,data
ab tissue, cell: b cells, organism: human
bc  gf tissue, cell: d cells, organism: human
bc  gf tissue, cell: e cells, organism: human

Then I'd like to be able to call 'tissue' and get the data from the first data column in this format: ab,bc gf. How can I do that?

What is the rule that your code is supposed to use to know that the first data column has info about tissue? If the only answer is "It should read the data, understand them the way a human does, and guess", that's going to be pretty hard to code.
– abarnert, Commented Nov 20, 2014 at 22:48
@abarnert All it needs to do is, if a column has data about tissue, i.e., has the word 'tissue', then extract it, and assign it to a variable 'tissue' or anything else. But the variable 'tissue' shouldn't have the actual word 'tissue', but have the tissue name (i.e., ab, bc etc. in the above example).
– abn, Commented Nov 20, 2014 at 22:50
What about reading the lines and getting the first column either as you did or by using split(',')? Then check whether the word 'tissue' is in the string at index 0, and if it is, print the string after splitting it again on spaces. Say value = 'ab tissue'; then you can get the name as value.split(' ')[0].
– T90, Commented Nov 20, 2014 at 22:55
@T90 That works for this case, but if the name is two or more words long, it doesn't. Also, if it is mentioned the way the cell name is mentioned, it wouldn't work then either.
– abn, Commented Nov 20, 2014 at 22:59
'if it is mentioned the way the cell name is mentioned, it wouldn't work then also': sorry, I didn't get that. About more than one word, you could loop over the words in the list formed by split() and print them if they are not the header: for w in value.split(' '): if 'tissue' not in w: print w
– T90, Commented Nov 20, 2014 at 23:03

Stuart · Accepted Answer · 2014-11-21 00:00:56Z

It depends on exactly what set of possible ways of identifying the title/keyword you want to allow. But for example you could do the following to identify cases in the forms 'value keyword' and 'keyword: value' (regardless of the title at the top of the column).

import csv

def get_values_flexibly(file, keyword):
    values = []
    with open(file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            for cell in row:
                # Form 'value keyword', e.g. 'ab tissue'
                if cell.endswith(' ' + keyword):
                    values.append(cell[:-len(keyword) - 1])
                # Form 'keyword: value', e.g. ' organism: human'
                elif cell.split(':')[0].strip() == keyword:
                    values.append(cell.split(':')[1].strip())
    return values

# `file` is the path to the CSV file shown in the question
print(get_values_flexibly(file, 'tissue'))    # ['ab', 'bc  gf', 'bc  gf']
print(get_values_flexibly(file, 'organism'))  # ['human', 'human', 'human']

Alternatively, if you know a particular type of value will always be in the same column, you can write a function that first checks the first row of the data for a matching header, then checks the second row for a matching keyword in the format 'value keyword' or 'keyword: value':

def get_values_flexibly(file, keyword):
    def process(func):
        # Apply func to the matching cell of the second row, then to the
        # same column in every remaining row
        return [func(cell)] + [func(row[index]) for row in reader]

    with open(file, 'r') as f:
        reader = csv.reader(f)
        # next(reader) works in Python 2.6+ and Python 3
        first_row = next(reader)
        if keyword in first_row:
            return [row[first_row.index(keyword)] for row in reader]
        # No matching header: look for the keyword in the second row
        for index, cell in enumerate(next(reader)):
            if cell.endswith(' ' + keyword):
                return process(lambda cell: cell[:-len(keyword) - 1])
            elif cell.split(':')[0].strip() == keyword:
                return process(lambda cell: cell.split(':')[1].strip())
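
Both versions recognise the same two cell layouts, so one option is to pull that check into a small helper. This is only a sketch under a few assumptions: the name extract_value is hypothetical and not part of the answer above, it strips surrounding whitespace before matching, and it uses str.partition instead of split(':'), so a value that itself contains a colon is kept whole.

```python
def extract_value(cell, keyword):
    """Return the value from 'value keyword' or 'keyword: value', else None.

    Hypothetical helper for illustration only.
    """
    cell = cell.strip()
    suffix = ' ' + keyword
    # Layout 'value keyword', e.g. 'ab tissue'
    if cell.endswith(suffix):
        return cell[:-len(suffix)].strip()
    # Layout 'keyword: value', e.g. 'organism: human'
    head, sep, tail = cell.partition(':')
    if sep and head.strip() == keyword:
        return tail.strip()
    # The cell does not mention the keyword in either layout
    return None


tissue = extract_value('bc  gf tissue', 'tissue')
organism = extract_value(' organism: human', 'organism')
unrelated = extract_value(' cell: b cells', 'tissue')
```

Returning None for a non-matching cell lets the caller decide whether to skip the cell or treat it as an error.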

abarnert · Accepted Answer · 2014-11-20 23:59:32Z

What you've asked for is:

All it needs to do is, if a column has data about tissue, i.e., has the word 'tissue', then extract it, and assign it to a variable 'tissue' or anything else.

OK, let's forget the last part; you don't want to assign it to a variable whose variable name has anything to do with your data. You just want to append it to the values list that you return.

Anyway, this rule is pretty simple. It doesn't seem like a very good rule to me—it's going to give you 'ab ' with a trailing space for 'tissue', and it'll be even worse for 'cell', giving you ': d cells'. But it's the rule you've come up with, so let's implement it.

First, we need to detect that the caller is asking for a special "data" column. We'll know this is the case because title isn't in the header. If we see that, let's just punt on the rest of the normal logic, and call a different function for the special "data" column logic:

# ...
if rownum == 0:
    try:
        index = row.index(title)
    except ValueError:
        # No column is named `title`: collect the positions of all 'data' columns
        indices = [i for i, col in enumerate(row) if col == 'data']
        return GetValuesFromDataColumn(title, indices, reader)
    rownum = 1
# ...

Now, for each row, just go through all the data columns (which we have the indices of), check for the word, and, if found, extract it and stash the rest of the string.

The simplest way to do that "check for the word" is the str.find method. It'll return either -1, if it's not there, or the index of the start of the word, if it is.
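
As a quick illustration of that return value (the cell text is taken from the question's sample data, and the variable names are just for this sketch):

```python
cell = 'ab tissue'

# The word is present: find returns the index where it starts
found_at = cell.find('tissue')

# The word is absent: find returns -1 instead of raising an exception
not_found = cell.find('organism')
```

Here found_at is 3 and not_found is -1, which is why the check is written as a comparison against -1 rather than a truth test (a match at position 0 would be falsy).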

To extract the word and stash the rest, we just slice the column before and after the word. So:

def GetValuesFromDataColumn(title, indices, reader):
    values = []
    for row in reader:
        for index in indices:
            pos = row[index].find(title)
            if pos != -1:
                value = row[index][:pos] + row[index][pos+len(title):]
                values.append(value)
                break
    return values
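
Putting the two pieces together, a self-contained version might look roughly like the sketch below. It is written for Python 3 and makes some assumptions that are not in the answer: the function names are snake_case renamings, the file is passed in as an open file object instead of a global path, the 'data' column positions are found by enumerating the header row, and the sample data is fed in through io.StringIO.

```python
import csv
import io


def get_values_from_data_columns(title, indices, reader):
    """Search the 'data' columns of each row for `title` and cut it out."""
    values = []
    for row in reader:
        for index in indices:
            cell = row[index]
            pos = cell.find(title)
            if pos != -1:
                # Keep everything before and after the matched word
                values.append(cell[:pos] + cell[pos + len(title):])
                break
    return values


def get_values_from_column(f, title):
    """Return the column named `title`, or fall back to the 'data' columns."""
    reader = csv.reader(f)
    header = next(reader)
    try:
        index = header.index(title)
    except ValueError:
        # No such header: assume the value lives in one of the 'data' columns
        indices = [i for i, col in enumerate(header) if col == 'data']
        return get_values_from_data_columns(title, indices, reader)
    return [row[index] for row in reader]


# Sample data copied from the question
SAMPLE = (
    "data,data,data\n"
    "ab tissue, cell: b cells, organism: human\n"
    "bc  gf tissue, cell: d cells, organism: human\n"
    "bc  gf tissue, cell: e cells, organism: human\n"
)

tissues = get_values_from_column(io.StringIO(SAMPLE), 'tissue')
```

With this sample, tissues keeps the trailing space that the answer warns about ('ab ' rather than 'ab'), so callers would probably want to strip the results or use a stricter rule.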

Benjamin James Drury · Accepted Answer · 2014-11-20 22:59:29Z

You could try reading each line with f.readline(), then using the string split method to get a list of the fields. When you run out of lines, stop reading the file. So:

def GetValuesFromColumn(title):
    values = list()
    with open(file, 'r') as f:
        line = f.readline()
        # readline() returns '' only at the end of the file
        while line != '':
            # Drop the trailing newline, then split the line into its fields
            values.append(line.rstrip('\n').split(','))
            line = f.readline()
    return values

At this point if you just looked through your list of lists, you should be able to find your tissue data. However, it is possible that I have completely misunderstood your question, so forgive me if that is the case.
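
For instance, looking through the list of lists could be as simple as the sketch below. The helper name cells_containing is hypothetical, and the rows are typed in by hand to mirror what splitting the question's sample file on commas would produce.

```python
# Rows as they would look after splitting each line of the sample file on ','
rows = [
    ['data', 'data', 'data'],
    ['ab tissue', ' cell: b cells', ' organism: human'],
    ['bc  gf tissue', ' cell: d cells', ' organism: human'],
    ['bc  gf tissue', ' cell: e cells', ' organism: human'],
]


def cells_containing(rows, word):
    """Return every cell (whitespace-stripped) that mentions `word`."""
    return [cell.strip() for row in rows for cell in row if word in cell]


tissue_cells = cells_containing(rows, 'tissue')
```

This only locates the cells; removing the word 'tissue' itself from each result would still need a rule like the ones discussed in the other answers.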

