
I have a CSV file structured like this:

Last Name;First Name;Start Date;End Date;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Example;Eva;;;1.1.2021;15.6.2021;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Here is some random information.;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-------;Header;------- ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Index;Date;Time;Reading
0;10.4.2021;19:12:10;0,1;;;;;;
1;10.4.2021;19:07:14;;;;;;;;
2;10.4.2021;19:05:34;0,1;;;;;;
3;10.4.2021;19:05:32;0,1;;;;;;
4;10.4.2021;19:05:32;0,1;;;;;;
5;10.4.2021;19:05:31;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-------;Header;------- ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Index;Date;Time;Reading
6;12.4.2021;19:12:10;0,1;;;;;;
7;12.4.2021;19:07:14;;;;;;;;
8;12.4.2021;19:05:34;0,1;;;;;;
9;12.4.2021;19:05:32;0,1;;;;;;
10;12.4.2021;19:05:32;0,1;;;;;;
11;12.4.2021;19:05:31;;;;;;;;

My goal is to get a clean dict out of the file with only the information I need. Let's say I want only name, dates and readings structured like this:

{
'last_name': 'Example', 
'first_name': 'Eva', 
'measurements': [{'date': 'some_date', 'reading': 'some_reading'}]
}

How can I iterate through the rows and keep only the columns I need? In many rows the Reading field is NaN. In my example this happens in the second and the last row of both sections.

This is what I have tried so far to get the data I want from the csv:

df = (pd.read_csv(file,
                  sep='\s+',
                  skiprows=6,
                  index_col=0,
                  dtype='unicode'
                  )
       )
df = pd.DataFrame(df, columns=['Reading', 'Date', 'Time'])
print(df.keys())
print(len(df))
test_list = df.values.tolist()
print(test_list)
print(len(test_list))

This gives me result:

Index(['Reading', 'Date', 'Time'], dtype='object')
13520
[[nan, nan, nan], [nan, nan, nan], ...]
13520

So the list that should hold the values is just a list of lists of NaNs.

  • Since you have several (?) CSV files appended into one, this is not trivial using pandas CSV reading. I suggest you keep each CSV as a separate file, i.e. one for names, one for readings, etc. Commented Sep 1, 2021 at 8:35
  • That is the one CSV file I get; I can't affect its structure. It has multiple sections where the column names (Index;Date;Time;Reading) are the same, and I need to get them all from the file. Commented Sep 1, 2021 at 8:51
  • @JonasByström Sorry, there was a mistake in the indexes in my example. They don't start over in different sections. The error is now fixed in my original post. Commented Sep 1, 2021 at 9:07
  • Does this answer your question? How to read only certain rows and cells from csv with Python pandas? Commented Sep 1, 2021 at 9:13
  • @mozway Maybe it does, but as a beginner with CSV files I wasn't sure how to handle the data. Thank you for trying to explain the solution to me; this time I understood it more clearly with the example by Serge Ballesta. Commented Sep 1, 2021 at 10:01

2 Answers

This really looks like your previous question, so just adapt my previous answer ;)

header

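import pandas as pd

# sep=';+' is a regular expression, so the Python parsing engine is requested explicitly
pd.read_csv('filename.csv', sep=';+', nrows=1, engine='python').dropna(axis=1).loc[0].to_dict()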

output:

{'Last Name': 'Example',
 'First Name': 'Eva',
 'Start Date': '1.1.2021',
 'End Date': '15.6.2021'}

data

df = (pd.read_csv('filename.csv',
                  sep=';',
                  skiprows=6,
                  index_col=0,
                  usecols=range(4),
                 )
        .drop(['Index', '-------', float('nan')], # get rid of extra headers
              errors='ignore')
     )

output:

            Date      Time Reading
Index                             
0      10.4.2021  19:12:10     0,1
1      10.4.2021  19:07:14     NaN
2      10.4.2021  19:05:34     0,1
3      10.4.2021  19:05:32     0,1
4      10.4.2021  19:05:32     0,1
5      10.4.2021  19:05:31     NaN
6      12.4.2021  19:12:10     0,1
7      12.4.2021  19:07:14     NaN
8      12.4.2021  19:05:34     0,1
9      12.4.2021  19:05:32     0,1
10     12.4.2021  19:05:32     0,1
11     12.4.2021  19:05:31     NaN
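
To get from these two pieces to the dict asked for in the question, one option is to drop the rows without a reading and export the rest as records. This is only a sketch: the small `df` and `header` below are hypothetical stand-ins for the results of the two `read_csv` calls above, and `to_measurements` is a helper name made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `df` produced by the "data" step above.
df = pd.DataFrame(
    {
        "Date": ["10.4.2021", "10.4.2021", "12.4.2021"],
        "Time": ["19:12:10", "19:07:14", "19:12:10"],
        "Reading": ["0,1", None, "0,1"],
    },
    index=pd.Index(["0", "1", "6"], name="Index"),
)

# Hypothetical stand-in for the dict produced by the "header" step above.
header = {"Last Name": "Example", "First Name": "Eva"}


def to_measurements(frame):
    """Return one {'date', 'reading'} dict per row that has a reading."""
    kept = frame.dropna(subset=["Reading"])  # discard rows where Reading is NaN
    renamed = kept[["Date", "Reading"]].rename(
        columns={"Date": "date", "Reading": "reading"}
    )
    return renamed.to_dict("records")


result = {
    "last_name": header["Last Name"],
    "first_name": header["First Name"],
    "measurements": to_measurements(df),
}
print(result)
```

The readings stay as text here (for example '0,1'); converting them to numbers would be a separate step.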

This is not a correctly formatted single CSV file, but a text file containing multiple CSV fragments. It will be easier to use the lower-level csv module than the high-level pandas one:

import csv

# `file` is the path to the CSV file, as in the question
with open(file, newline='') as fd:
    rd = csv.reader(fd, delimiter=';')
    state = 0     # wait for the Last Name;First Name line
    for row in rd:
        if not row:
            continue    # skip completely empty lines
        if state == 0:
            if row[0].startswith('Last'):
                state = 1    # next line will contain last and first names
            continue
        if state == 1:
            # prepare the result dictionary
            measurements = []
            d = {'last_name': row[0], 'first_name': row[1],
                 'measurements': measurements}
            state = 2    # wait for a header line
            continue
        if state == 2:
            if row[0] == 'Index':
                state = 3   # read data lines
            continue
        # state is 3
        if row[0] == '':
            state = 2  # wait for the next data block
        else:
            # process a data line
            measurements.append({'date': row[1], 'reading': row[3]})

This code only stores the date and reading fields as text, but you can easily add more complex processing.

With your sample data it gives:

{'first_name': 'Eva',
 'last_name': 'Example',
 'measurements': [{'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': ''},
                  {'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': '0,1'},
                  {'date': '10.4.2021', 'reading': ''},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': ''},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': '0,1'},
                  {'date': '12.4.2021', 'reading': ''}]}
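
As a sketch of the "more complex processing" mentioned above, a data row could be converted to typed values and skipped when the reading is empty. This assumes the dates are in day.month.year order and that the comma in the reading is a decimal separator; `parse_measurement` is a hypothetical helper, not part of the csv module.

```python
from datetime import datetime


def parse_measurement(row):
    """Convert one data row (list of strings) to a typed dict.

    Returns None when the Reading field is empty.
    """
    reading = row[3].strip()
    if not reading:
        return None
    return {
        # assumed format: day.month.year, e.g. '10.4.2021'
        "date": datetime.strptime(row[1], "%d.%m.%Y").date(),
        # assumed decimal comma, e.g. '0,1' -> 0.1
        "reading": float(reading.replace(",", ".")),
    }


rows = [
    ["0", "10.4.2021", "19:12:10", "0,1"],
    ["1", "10.4.2021", "19:07:14", ""],
]
parsed = [m for m in map(parse_measurement, rows) if m is not None]
print(parsed)
```

In the loop above, the data-line branch could then append the result of `parse_measurement(row)` only when it is not None.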
