How to iterate csv with Python and get only certain columns?

Question

I have a CSV file structured like this:

Last Name;First Name;Start Date;End Date;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Example;Eva;;;1.1.2021;15.6.2021;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Here is some random information.;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-------;Header;------- ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Index;Date;Time;Reading
0;10.4.2021;19:12:10;0,1;;;;;;
1;10.4.2021;19:07:14;;;;;;;;
2;10.4.2021;19:05:34;0,1;;;;;;
3;10.4.2021;19:05:32;0,1;;;;;;
4;10.4.2021;19:05:32;0,1;;;;;;
5;10.4.2021;19:05:31;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-------;Header;------- ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Index;Date;Time;Reading
6;12.4.2021;19:12:10;0,1;;;;;;
7;12.4.2021;19:07:14;;;;;;;;
8;12.4.2021;19:05:34;0,1;;;;;;
9;12.4.2021;19:05:32;0,1;;;;;;
10;12.4.2021;19:05:32;0,1;;;;;;
11;12.4.2021;19:05:31;;;;;;;;

My goal is to get a clean dict out of the file with only the information I need. Let's say I want only name, dates and readings structured like this:

{
'last_name': 'Example', 
'first_name': 'Eva', 
'measurements': [{'date': 'some_date', 'reading': 'some_reading'}]
}

How can I iterate through the columns and only get the ones I need? There are a lot of rows where the Reading field is NaN. In my example this happens in the second and the last row of both sections.

This is what I have tried so far to get the data I want from the csv:

df = (pd.read_csv(file,
                  sep='\s+',
                  skiprows=6,
                  index_col=0,
                  dtype='unicode'
                  )
       )
df = pd.DataFrame(df, columns=['Reading', 'Date', 'Time'])
print(df.keys())
print(len(df))
test_list = df.values.tolist()
print(test_list)
print(len(test_list))

This gives me the result:

Index(['Reading', 'Date', 'Time'], dtype='object')
13520
[[nan, nan, nan], [nan, nan, nan], ...]
13520

So the list where I want to have the values is just a list of lists of NaNs.

Since you have several (?) CSV files appended into one, this is not trivial using pandas CSV reading. I suggest you keep each CSV as a separate file, i.e. one for names, one for readings, etc.
– Jonas Byström, Commented Sep 1, 2021 at 8:35
That is the one CSV file I get; I can't affect its structure. It has multiple sections where the column names (Index;Date;Time;Reading) are the same, and I need to get them all from the file.
– lr_optim, Commented Sep 1, 2021 at 8:51
@JonasByström Sorry, there was a mistake in the indexes in my example. They don't start over in different sections. The error is now fixed in my original post.
– lr_optim, Commented Sep 1, 2021 at 9:07
Does this answer your question? How to read only certain rows and cells from csv with Python pandas?
– mozway, Commented Sep 1, 2021 at 9:13
@mozway Maybe it does, but as a beginner with handling CSV files I wasn't sure how to handle the data. Thank you for trying to explain the solution to me; this time I understood it more clearly with the example by Serge Ballesta.
– lr_optim, Commented Sep 1, 2021 at 10:01

mozway · Accepted Answer · 2021-09-01 09:13:38Z

This really looks like your previous question, so just adapt my previous answer ;)

header

import pandas as pd
pd.read_csv('filename.csv', sep=';+', engine='python', nrows=1).dropna(axis=1).loc[0].to_dict()

output:

{'Last Name': 'Example',
 'First Name': 'Eva',
 'Start Date': '1.1.2021',
 'End Date': '15.6.2021'}

data

df = (pd.read_csv('filename.csv',
                  sep=';',
                  skiprows=6,
                  index_col=0,
                  usecols=range(4),
                 )
        .drop(['Index', '-------', float('nan')], # get rid of extra headers
              errors='ignore')
     )

output:

            Date      Time Reading
Index                             
0      10.4.2021  19:12:10     0,1
1      10.4.2021  19:07:14     NaN
2      10.4.2021  19:05:34     0,1
3      10.4.2021  19:05:32     0,1
4      10.4.2021  19:05:32     0,1
5      10.4.2021  19:05:31     NaN
6      12.4.2021  19:12:10     0,1
7      12.4.2021  19:07:14     NaN
8      12.4.2021  19:05:34     0,1
9      12.4.2021  19:05:32     0,1
10     12.4.2021  19:05:32     0,1
11     12.4.2021  19:05:31     NaN
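
To get from these two results to the dict layout asked for in the question, something along these lines might work. It is only a sketch: `header` and `df` below are small hand-built stand-ins for the outputs of the two snippets above, and `to_record` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Stand-ins for the results of the "header" and "data" snippets
# (assumption: same keys and columns, fewer rows).
header = {'Last Name': 'Example', 'First Name': 'Eva',
          'Start Date': '1.1.2021', 'End Date': '15.6.2021'}
df = pd.DataFrame(
    {'Date': ['10.4.2021', '10.4.2021', '12.4.2021'],
     'Time': ['19:12:10', '19:07:14', '19:12:10'],
     'Reading': ['0,1', float('nan'), '0,1']},
    index=pd.Index(['0', '1', '6'], name='Index'),
)


def to_record(header, df):
    """Combine the name fields and the non-empty readings into one dict."""
    rows = df.dropna(subset=['Reading'])  # skip rows without a reading
    measurements = [
        {'date': date, 'reading': reading}
        for date, reading in zip(rows['Date'], rows['Reading'])
    ]
    return {'last_name': header['Last Name'],
            'first_name': header['First Name'],
            'measurements': measurements}


result = to_record(header, df)
print(result)
```

Dropping the NaN rows before building the list keeps the empty readings out of `measurements`; leave out the `dropna` call if you would rather keep them.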

Serge Ballesta · Accepted Answer · 2021-09-01 09:33:06Z

This is not a correctly formatted single CSV file, but a text file containing multiple CSV fragments. It will be easier to use the lower-level csv module than the high-level pandas one:

import csv

with open(file, newline='') as fd:
    rd = csv.reader(fd, delimiter=';')
    state = 0     # wait for the Last Name;First Name line
    for row in rd:
        if state == 0:
            if row[0].startswith('Last'):
                state = 1    # next line will contain last and first names
            continue
        if state == 1:
            # prepare the result dictionary
            measurements = []
            d = {'last_name': row[0], 'first_name': row[1],
                 'measurements': measurements}
            state = 2    # wait for a header line
            continue
        if state == 2:
            if row[0] == 'Index':
                state = 3   # read data lines
            continue
        #  state is 3
        if row[0] == '':
            state = 2  # wait for next data block
        else:
            # process a data line
            measurements.append({'date': row[1], 'reading': row[3]})

This code only stores the date and reading fields as text, but you can easily add more complex processing.

With your sample data it gives:

{'first_name': 'Eva',
 'last_name': 'Example',
 'measurements': [{'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': ''},
                  {'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': ''},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': ''},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': ''}]}
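
As an illustration of the kind of extra processing you could add, here is a rough sketch that converts the text fields to typed values and drops the empty readings. It assumes that dates are written day.month.year and that the comma in a reading is a decimal separator; `convert_measurement` is a made-up helper name.

```python
from datetime import datetime


def convert_measurement(measurement):
    """Return a typed copy of one {'date': ..., 'reading': ...} dict."""
    raw = measurement['reading'].strip()
    return {
        # assumption: dates look like 10.4.2021 (day.month.year)
        'date': datetime.strptime(measurement['date'], '%d.%m.%Y').date(),
        # assumption: '0,1' means 0.1; an empty field becomes None
        'reading': float(raw.replace(',', '.')) if raw else None,
    }


# Two entries in the same shape as the output above.
measurements = [
    {'date': '10.4.2021', 'reading': '0,1'},
    {'date': '10.4.2021', 'reading': ''},
]

typed = [convert_measurement(m) for m in measurements]
with_readings = [m for m in typed if m['reading'] is not None]
print(with_readings)
# [{'date': datetime.date(2021, 4, 10), 'reading': 0.1}]
```

The same conversion could also be done directly where the data line is appended, instead of in a second pass.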
