How to transfer SQL Group by results to a Pandas dataframe

Question

I have sql query that return a data set in following format

user_id, type_id, avg
1, 3, 2.5
1, 2, 3.0
1, 5, 4.6
1, 11, 3.4
2, 2, 4.5
2, 3, 3.0
2, 11, 3.1

data above comes from following query, and it get executed is a very large table.

select u.user_id, t.type_id, sum(u.preference)/count(u.preference) 
from user_preference u, item_type_pairs t 
where t.item_id = u.item_id group by u.user_id, t.type_id;

Query takes 10min and returns 2 plus million records. My end goal is to put this in data frame where rows are user_id and columns representing type_id and each cell representing the avg value for an item by type_id.

   type_id_1, type_id_2, type_id_3
u1|             3.0        2.5
u2|             4.5        3.0

What would be the best way to go about on this. I am also still figuring out? Should I be reading row by row and somehow populate the data frame?

Can you fill in the DataFrame completely? It's not clear (to me) exactly what you are looking for. — unutbu
– unutbu, Commented Oct 6, 2014 at 20:23
Is your problem that you can't read the sql output into python? Or that you don't know how to do the group by in pandas? Please edit your question to clarify. — Wilduck
– Wilduck, Commented Oct 6, 2014 at 20:39
@unutbu Sorry I made a mistake in original data frame representation. now its updated. Does it make more sense? — add-semi-colons
– add-semi-colons, Commented Oct 6, 2014 at 20:48
@Wilduck Assume that I have connected to the database and has this results set using mysql cursor. I want to populate the pandas dataframe. — add-semi-colons
– add-semi-colons, Commented Oct 6, 2014 at 21:14

Wilduck · Accepted Answer · 2014-10-06 21:27:39Z

1

I'm going to assume that you are able to create a MySql connection object, using something like:

import MySQLdb as mdb

con = mdb.connect('localhost', 'testuser', 'test623', 'testdb')

Then, getting your data into python is as simple as:

with con:
    cur = con.cursor(mdb.cursors.DictCursor)
    cur.execute(
        "select u.user_id, t.type_id, sum(u.preference)/count(u.preference)"
        "from user_preference u, item_type_pairs t"
        "where t.item_id = u.item_id group by u.user_id, t.type_id;"
    )
    rows = cur.fetchall()

At this point rows will look something like:

[{'user_id': 1, 'type_id': 2, 'avg': 2.5},
 {'user_id': 1, 'type_id': 2, 'avg': 3.0},
 ...]

From this step, creating a pandas dataframe from this data is extremely simple:

import pandas as pd
import numpy as np

my_df = pd.DataFrame(rows)

Then, you can use the pivot_table function to transform it into your desired output:

final_df = pd.pivot_table(
    df,
    index='user_id',
    columns='type_id',
    values='avg',
    agg_func=np.average
)

answered Oct 6, 2014 at 21:27

Wilduck

14.2k13 gold badges63 silver badges91 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

add-semi-colons Over a year ago

Is it ok to use fetchall with 2 million records?

Collectives™ on Stack Overflow

How to transfer SQL Group by results to a Pandas dataframe

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related