I work with pandas. I have the following data:

useradClick.head(n=5)
Out[291]: 
             timestamp  userId   adCategory  adCount
0  2016-05-26 15:13:22     611  electronics        1
1  2016-05-26 15:17:24    1874       movies        1
2  2016-05-26 15:22:52    2139    computers        1
3  2016-05-26 15:22:57     212      fashion        1
4  2016-05-26 15:22:58    1027     clothing        1

I want to truncate 2016-05-26 15:13:22 to 2016-05-26 15 (keeping only the date and the hour). After that I want to do a group by.

I tried

useradClickv1 = useradClick.select(pd.to_datetime('timestamp',format='%d%m%Y'))

But I get the error

Traceback (most recent call last):

  File "<ipython-input-292-9d5a6a59d577>", line 1, in <module>
    useradClickv1 = useradClick.select(pd.to_datetime('timestamp',format='%d%m%Y'))

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
    return func(*args, **kwargs)

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 287, in to_datetime
    unit=unit, infer_datetime_format=infer_datetime_format)

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 416, in _to_datetime
    return _convert_listlike(np.array([arg]), box, format)[0]

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 402, in _convert_listlike
    raise e

  File "/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/tseries/tools.py", line 365, in _convert_listlike
    arg, format, exact=exact, errors=errors)

  File "pandas/tslib.pyx", line 3183, in pandas.tslib.array_strptime (pandas/tslib.c:55388)

**ValueError: time data 'timestamp' does not match format '%d%m%Y' (match)**

How can I do this conversion using pandas?

EDITED 2016/07/07

I tried your answer and I get this error:
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv')

adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())

adclicksDF['adCount'] = 1

useradClick = adclicksDF[['timestamp','userId','adCategory','adCount']]

seradClick.timestamp = pd.to_datetime(useradClick.timestamp)
Traceback (most recent call last):

  File "<ipython-input-31-ff9d4c4432ef>", line 1, in <module>
    seradClick.timestamp = pd.to_datetime(useradClick.timestamp)

NameError: name 'seradClick' is not defined


useradClick.timestamp = pd.to_datetime(useradClick.timestamp)
/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2698: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

EDITED

I work with anaconda pandas 0.18.0

import pandas as pd

from pyspark.mllib.clustering import KMeans, KMeansModel

from numpy import array

from pyspark import SparkConf, SparkContext

from pyspark.sql import SQLContext

import sys

conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))

sc          = SparkContext(conf = conf)


sqlContext  = SQLContext(sc)

adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv')

adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())

adclicksDF['adCount'] = 1 

useradClick = adclicksDF[['timestamp','userId','adCategory','adCount']]

useradClick.ix[:,'timestamp'] = p.to_datetime(useradClick.timestamp)
Traceback (most recent call last):

  File "<ipython-input-21-dcc10ed41daa>", line 1, in <module>
    useradClick.ix[:,'timestamp'] = p.to_datetime(useradClick.timestamp)

NameError: name 'p' is not defined


useradClick.ix[:,'timestamp'] = pd.to_datetime(useradClick.timestamp)
/home/cloudera/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:461: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
  • Firstly, pd.to_datetime(useradClick['timestamp']) should just work. Also, what is the purpose of changing the format to 2016-05-26 15? Commented Jul 7, 2016 at 17:43

2 Answers

1

UPDATE:

cols = ['timestamp','userId','adCategory']
adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
                         usecols=cols,
                         parse_dates=['timestamp'],
                         skipinitialspace=True).assign(adCount=1)
#adclicksDF['adCount'] = 1
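
To try the read-time parsing above without the original file, here is a minimal sketch that reads a small in-memory CSV. The sample rows and the padded header are assumptions made up for illustration (the real ad-clicks.csv is not shown in the question); usecols, parse_dates and skipinitialspace are regular pd.read_csv parameters.

```python
import io

import pandas as pd

# Hypothetical sample standing in for ad-clicks.csv.
# Note the space after each comma, which skipinitialspace=True removes.
csv_text = (
    "timestamp, userId, adCategory\n"
    "2016-05-26 15:13:22, 611, electronics\n"
    "2016-05-26 15:17:24, 1874, movies\n"
)

cols = ['timestamp', 'userId', 'adCategory']
adclicksDF = pd.read_csv(io.StringIO(csv_text),
                         usecols=cols,
                         parse_dates=['timestamp'],
                         skipinitialspace=True).assign(adCount=1)

# timestamp should now be a datetime column rather than object (string)
print(adclicksDF.dtypes)
```

Because the frame is built in one step from read_csv, there is no intermediate slice, so assigning to it later should not trigger SettingWithCopyWarning.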

Original answer:

If I guessed correctly, you don't need to convert your datetime into a string as you described.

If you want to group by hour and your timestamp column is of object (string) dtype, convert it to datetime first:

df['timestamp'] = pd.to_datetime(df['timestamp'])

In [15]: df
Out[15]:
            timestamp  userId   adCategory  adCount
0 2016-05-26 15:13:22     611  electronics        1
1 2016-05-26 15:17:24    1874       movies        1
2 2016-05-26 15:22:52    2139    computers        1
3 2016-05-26 15:22:57     212      fashion        1
4 2016-05-26 15:22:58    1027     clothing        1
5 2016-05-26 16:22:57     111      fashion        1
6 2016-05-26 16:22:58     222     clothing        1

In [16]: df.groupby(pd.Grouper(key='timestamp', freq='1H'))['adCount'].agg(['count','sum'])
Out[16]:
                     count  sum
timestamp
2016-05-26 15:00:00      5    5
2016-05-26 16:00:00      2    2
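
Since the question asked for values like 2016-05-26 15, another option is to build that hour label explicitly and group on it. This is only a sketch and not part of the answer above: the hour column name and the sample rows are assumptions for illustration. Series.dt.strftime is a standard pandas accessor.

```python
import pandas as pd

# Made-up sample rows in the same shape as the question's data
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2016-05-26 15:13:22',
        '2016-05-26 15:17:24',
        '2016-05-26 16:22:57',
    ]),
    'userId': [611, 1874, 111],
    'adCategory': ['electronics', 'movies', 'fashion'],
    'adCount': [1, 1, 1],
})

# 'hour' is a hypothetical helper column holding strings like '2016-05-26 15'
df['hour'] = df['timestamp'].dt.strftime('%Y-%m-%d %H')

# Total ad count per hour label
per_hour = df.groupby('hour')['adCount'].sum()
print(per_hour)
```

The Grouper approach keeps a real datetime index, which is usually more convenient for further time-based work; the string label is mainly useful when the exact text form is needed.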

12 Comments

I checked the field types: they are <class 'str'>, <class 'numpy.int64'>, <class 'str'>, <class 'numpy.int64'>, so timestamp is a str. Does groupby work with a timestamp of type str?
I tried your answer. You can see the error I get in my question, which I have edited.
I understand the error, but I don't know how to fix it. The line that raises it is useradClick.timestamp = pd.to_datetime(useradClick.timestamp)
No, you've misspelled the DF name: seradClick instead of useradClick. BTW, what is your pandas version?
You can try df.ix[: , 'timestamp'] = pd.to_datetime(df.timestamp) to get rid of the warning. But the code from my answer should also work properly with pandas v. 0.18.1; I've checked it.
0

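The format '%d%m%Y' tells pandas to expect day, month and year with no separators. Your values look like 2016-05-26 15:13:22, which corresponds to '%Y-%m-%d %H:%M:%S'. Also, pd.to_datetime needs the column itself, not the string 'timestamp'; that is why the error says the time data 'timestamp' does not match. Try

useradClickv1 = pd.to_datetime(useradClick['timestamp'], format='%Y-%m-%d %H:%M:%S')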
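
As a quick check, here is a minimal sketch that parses a column with an explicit format and then reads the hour back. The two-row DataFrame is made up for illustration and assumes the strings look like the sample in the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's useradClick frame
useradClick = pd.DataFrame({
    'timestamp': ['2016-05-26 15:13:22', '2016-05-26 15:17:24'],
})

# Format codes are case-sensitive:
# %Y = 4-digit year, %m = month, %d = day, %H = 24-hour hour, %M = minute, %S = second
parsed = pd.to_datetime(useradClick['timestamp'], format='%Y-%m-%d %H:%M:%S')

# Once parsed, date parts are available through the .dt accessor
print(parsed.dt.hour.tolist())
```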

2 Comments

I executed useradClickv1 = useradClick.select(pd.to_datetime('timestamp',format='%y-%m-%d')) but I get the error ValueError: time data 'timestamp' does not match format '%y-%m-%d' (match)
I did not see that the time was part of the timestamp. I edited my answer. Try that.
