I have a SQL query that returns a data set in the following format:

user_id, type_id, avg
1, 3, 2.5
1, 2, 3.0
1, 5, 4.6
1, 11, 3.4
2, 2, 4.5
2, 3, 3.0
2, 11, 3.1

The data above comes from the following query, which is executed against a very large table.

select u.user_id, t.type_id, sum(u.preference)/count(u.preference) 
from user_preference u, item_type_pairs t 
where t.item_id = u.item_id group by u.user_id, t.type_id;

The query takes 10 minutes and returns more than 2 million records. My end goal is to put this in a data frame where the rows are user_id, the columns are type_id, and each cell holds the avg value for that user and type_id.

   type_id_1, type_id_2, type_id_3
u1|             3.0        2.5
u2|             4.5        3.0 

What would be the best way to go about this? I am still figuring it out. Should I read the result row by row and somehow populate the data frame?

  • Can you fill in the DataFrame completely? It's not clear (to me) exactly what you are looking for. Commented Oct 6, 2014 at 20:23
  • Is your problem that you can't read the sql output into python? Or that you don't know how to do the group by in pandas? Please edit your question to clarify. Commented Oct 6, 2014 at 20:39
  • @unutbu Sorry I made a mistake in original data frame representation. now its updated. Does it make more sense? Commented Oct 6, 2014 at 20:48
  • @Wilduck Assume that I have connected to the database and has this results set using mysql cursor. I want to populate the pandas dataframe. Commented Oct 6, 2014 at 21:14

1 Answer

I'm going to assume that you are able to create a MySQL connection object, using something like:

import MySQLdb as mdb

con = mdb.connect('localhost', 'testuser', 'test623', 'testdb')

Then, getting your data into python is as simple as:

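with con:
    cur = con.cursor(mdb.cursors.DictCursor)
    # Adjacent string literals are concatenated as-is, so each piece ends with a space.
    # The alias "avg" gives the computed column a usable key in each row dict.
    cur.execute(
        "select u.user_id, t.type_id, sum(u.preference)/count(u.preference) as avg "
        "from user_preference u, item_type_pairs t "
        "where t.item_id = u.item_id group by u.user_id, t.type_id;"
    )
    rows = cur.fetchall()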

At this point rows will look something like:

[{'user_id': 1, 'type_id': 3, 'avg': 2.5},
 {'user_id': 1, 'type_id': 2, 'avg': 3.0},
 ...]

From this step, creating a pandas dataframe from this data is extremely simple:

import pandas as pd
import numpy as np

my_df = pd.DataFrame(rows)

Then, you can use the pivot_table function to transform it into your desired output:

final_df = pd.pivot_table(
    my_df,                # the DataFrame built from rows above
    index='user_id',
    columns='type_id',
    values='avg',
    aggfunc=np.average    # the keyword is aggfunc, not agg_func
)
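
To make the reshaping step concrete without a database, here is a minimal sketch that runs the same pivot on the seven sample rows from the question. The names sample_rows, sample_df and wide_df are made up for this illustration, and the rows are typed in by hand in the shape a DictCursor would return them.

```python
import pandas as pd

# Sample rows copied from the question, shaped like DictCursor output.
sample_rows = [
    {'user_id': 1, 'type_id': 3, 'avg': 2.5},
    {'user_id': 1, 'type_id': 2, 'avg': 3.0},
    {'user_id': 1, 'type_id': 5, 'avg': 4.6},
    {'user_id': 1, 'type_id': 11, 'avg': 3.4},
    {'user_id': 2, 'type_id': 2, 'avg': 4.5},
    {'user_id': 2, 'type_id': 3, 'avg': 3.0},
    {'user_id': 2, 'type_id': 11, 'avg': 3.1},
]

sample_df = pd.DataFrame(sample_rows)

# One row per user_id, one column per type_id. Each (user_id, type_id)
# pair occurs only once here, so the mean just passes the value through.
wide_df = pd.pivot_table(
    sample_df,
    index='user_id',
    columns='type_id',
    values='avg',
    aggfunc='mean',
)

print(wide_df)
```

Combinations that never occur in the input, such as user 2 with type_id 5, come out as NaN in the wide frame.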

1 Comment

Is it ok to use fetchall with 2 million records?
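
If holding the whole result in one fetchall call is a concern, one option is to pull the rows in batches with the DB-API fetchmany method and build the frame from the pieces. The sketch below is only an illustration under assumptions: it uses an in-memory sqlite3 table as a stand-in for the MySQL connection so that it can run anywhere, and fetch_in_chunks, avg_pref and the chunk size are hypothetical names and values, not part of the answer above.

```python
import sqlite3

import pandas as pd


def fetch_in_chunks(cursor, chunk_size=50000):
    """Yield lists of rows from a DB-API cursor until it is exhausted."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in database (assumption): sqlite3 replaces MySQL for this sketch.
con = sqlite3.connect(':memory:')
con.execute(
    "create table avg_pref (user_id integer, type_id integer, avg_value real)"
)
con.executemany(
    "insert into avg_pref values (?, ?, ?)",
    [(1, 3, 2.5), (1, 2, 3.0), (1, 5, 4.6), (1, 11, 3.4),
     (2, 2, 4.5), (2, 3, 3.0), (2, 11, 3.1)],
)

cur = con.execute("select user_id, type_id, avg_value from avg_pref")

# Build one small DataFrame per batch, then concatenate them.
frames = [
    pd.DataFrame(chunk, columns=['user_id', 'type_id', 'avg'])
    for chunk in fetch_in_chunks(cur, chunk_size=3)
]
chunked_df = pd.concat(frames, ignore_index=True)

print(len(frames), len(chunked_df))
```

With MySQLdb, note that the default cursor classes buffer the full result set on the client when the query executes, so batching only reduces memory if a server-side cursor such as MySQLdb.cursors.SSCursor is used.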
