1

I am importing data into a pandas dataframe from Google BigQuery and I'd like to sort the results by date. My code is as follows:

import sys, getopt
import pandas as pd
from datetime import datetime

# set your BigQuery service account private private key
pkey ='#REMOVED#'
destination_table = 'test.test_table_2'
project_id = '#REMOVED#'

# write your query
query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
    """

data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey, parse_dates=True, index_col='date')
date = data.sort_index()

data.info()
data.describe()

print(data.head())

My output is shown below, as you can see dates are not sorted.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
date      3 non-null object
Visits    3 non-null int32
dtypes: int32(1), object(1)
memory usage: 116.0+ bytes
       date  Visits
0  20180312  207440
1  20180310  178155
2  20180311  207452

I have read several questions and so far tried the below, which resulted in no change to my output:

  • Removing index_col='date' and adding date = data.sort_values(by='date')
  • Setting the date column as the index, then sorting the index (shown above).
  • Setting headers (headers = ['Date', 'Visits']) and dypes (dtypes = [datetime, int]) to my read_gbq line (parse_dates=True, names=headers)

What am I missing?

0

3 Answers 3

2

I managed to solve this by transforming my date field into a datetime object, I assumed this would be done automatically by parse_date=True but it seems that will only parse a existing datetime object.

I added the following after my query to create a new datetime column from my date string, then I was able to use data.sort_index() and it worked as expected:

time_format = '%Y-%m-%d'
data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey)

data['n_date'] = pd.to_datetime(data['date'], format=time_format)  

data.index = data['n_date']

del data['date']
del data['n_date']

data.index.names = ['Date']

data = data.sort_index()
Sign up to request clarification or add additional context in comments.

Comments

1

As most of the work is done on the Google BigQuery side, I'd do sorting there as well:

query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
ORDER BY Date
"""

2 Comments

Thanks @MaxU a super lightweight suggestion which did work as an excellent alternative.
@BenP, glad I could help :)
1

This should work:

data.sort_values('date', inplace=True)

1 Comment

This may not sort appropriately if your dates are like 01/01/2010, it will sort on the first value it sees. In this case, it will sort by the month (01) instead of the year... probably not the sorting this guy is after.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.