3

I have been trying to use panda to parse date as string from my csv.

What I am trying to achieve here is to extract a row (from the csv) which contains two type of data, string and integer.

From here, I want to use docx-mailmerge to produce Microsoft Word reports.

For context, I was referring to the solutions here when figuring out: datetime dtypes in pandas read_csv

Thank you.

Here is error I am getting:

ValueError: invalid literal for int() with base 10: 'Num'

Here is my code:

import pandas as pd   

headers = ['AA', 'Num']
dtypes = {'AA': 'str', 'Num': 'int'}
pdate = ['AA']

df = pd.read_csv('test1.csv', 
header=None,
names = headers,
dtype = dtypes,
parse_dates = pdate, 
date_parser=None) 

top = df.head()
print(top)

Here is my csv file:

AA,Num
1/1/2020,5
2/1/2020,10
3/1/2020,15
4/1/2020,20
5/1/2020,25
6/1/2020,30
7/1/2020,35
8/1/2020,40
9/1/2020,45
10/1/2020,50
11/1/2020,55

2 Answers 2

1

I think problem is parameter header=None, it is used if no headers in csv. So remove it. Also parameter names = headers should be omit, if need parse all columns:

#here is set new columns names, so original headers are shift to first row of data
df = pd.read_csv('test1.csv',
header=None,
names = headers,
#dtype = dtypes,
#parse_dates = pdate, 
date_parser=None) 

top = df.head()
print(top)
         AA  Num
0        AA  Num
1  1/1/2020    5
2  2/1/2020   10
3  3/1/2020   15
4  4/1/2020   20

Correct parsing:

df = pd.read_csv('test1.csv',
#header=None,
#names = headers,
dtype = dtypes,
parse_dates = pdate, 
date_parser=None) 

top = df.head()
print(top)
          AA  Num
0 2020-01-01    5
1 2020-02-01   10
2 2020-03-01   15
3 2020-04-01   20
4 2020-05-01   25

If need filter only some columns instead names = headers use usecols=headers parameter.

Sign up to request clarification or add additional context in comments.

12 Comments

Thank you for this! & I dont think it is allowable to use 'use_cols'.
@SyahmiShaarani - Sorry, my typo. It is usecols and used if many columns and need parse only few.
Yes it is working! But now I am getting "2020-01-01 00:00:00" when I print the exact cell (row and column) but in the dataframe it shows only the date. How do I remove the timestamp?
@SyahmiShaarani - In pandas if no times are not displayed 00:00:00, you can check it print (df['AA'].tolist()). If want no times, convert column AA to dates like df['AA'] = df['AA'].dt.date
Another question, what if I am to use Excel directly? The prob is the date is not considered as 'str' then how do I go about my code? I am receiving this error " dateparse = lambda x: pd.datetime.strptime(x, '%d/%m/%Y') TypeError: strptime() argument 1 must be str, not datetime.datetime" when I used excel as the date column is not parsed by pandas.
|
0

Here is the working code:

headers = ['AA', 'Num']
dtypes = {'AA': 'str', 'Num': 'int'}
pdate = ['AA']

df = pd.read_csv('test1.csv', 
dtype = dtypes,
parse_dates = pdate, 
date_parser=None)

top = df.head()
print(top)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.