0

I have these kind of data (sample) in MySQL:

Table1:

ID     ITEM    CNT
--------------------
0001    AAB     5
0001    BBA     3
0001    BBB     8
0001    AAC     10
0002    BBA     2
0002    BBC     7
0003    FFG     2
0003    JPO     4
0003    PUI     22
..........

And i would like to find a way to import these data in Python in the form of count vectors, e.g.:

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

where the elements represent the counts for all the items for each id in this form: [AAB,AAC,BBA,BBB,BBC,FFG,JPO,PUI]

So I would like to ask, what is the best and most efficient way of implementing this? Is it better to do it from python or mysql and how?

Thank you

2 Answers 2

1

It is usually more efficient -- when possible -- to manipulate the data in SQL rather than in Python.

With this setup:

import config
import MySQLdb
conn = MySQLdb.connect(
    host=config.HOST, user=config.USER,
    passwd=config.PASS, db='test')
cursor = conn.cursor()

sql = '''\
DROP TABLE IF EXISTS foo 
'''
cursor.execute(sql)

sql = '''\
CREATE TABLE foo (
    ID varchar(4),
    ITEM varchar(3),
    CNT int)
'''

cursor.execute(sql)

sql = '''\
INSERT INTO foo VALUES (%s,%s,%s)
'''

cursor.executemany(sql, [['0001', 'AAB', 5],
                         ['0001', 'BBA', 3],
                         ['0001', 'BBB', 8],
                         ['0002', 'BBA', 2]])

You could form the desired SQL with:

items = 'AAB AAC BBA BBB BBC FFG JPO PUI'.split()
fields = ', '.join('COALESCE({}.CNT, 0)'.format(item) for item in items)
joins = '\n'.join('''\
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = '{i}') as {i}
    ON T.ID = {i}.ID'''.format(i=item) for item in items)
sql = '''\
SELECT T.ID, {f}
FROM (SELECT DISTINCT ID from foo) as T
{j}
'''.format(f=fields, j=joins)

print(sql)

And use it like this:

result = dict()
cursor.execute(sql)
for row in cursor:
    result[row[0]] = row[1:]
print(result)    

The SQL query used is:

SELECT T.ID, COALESCE(AAB.CNT, 0), COALESCE(AAC.CNT, 0), COALESCE(BBA.CNT, 0), COALESCE(BBB.CNT, 0), COALESCE(BBC.CNT, 0), COALESCE(FFG.CNT, 0), COALESCE(JPO.CNT, 0), COALESCE(PUI.CNT, 0)
FROM (SELECT DISTINCT ID from foo) as T
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAB') as AAB
    ON T.ID = AAB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAC') as AAC
    ON T.ID = AAC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBA') as BBA
    ON T.ID = BBA.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBB') as BBB
    ON T.ID = BBB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBC') as BBC
    ON T.ID = BBC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'FFG') as FFG
    ON T.ID = FFG.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'JPO') as JPO
    ON T.ID = JPO.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'PUI') as PUI
    ON T.ID = PUI.ID

and the resultant dict looks like:

{'0001': (5L, 0L, 3L, 8L, 0L, 0L, 0L, 0L), '0002': (0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L)}

I know you asked for

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

but there are at least two problems with this. First, 0001 is not a valid Python variable name. Variable names can not start with a digit. Second, you do not want dynamically defined variable names because it is hard to program with a bare variable name which is not known until run-time.

Instead, use the would-be variable names as keys in a dict, result. Then you can refer to the "variable" 0001 with result['0001'].

Sign up to request clarification or add additional context in comments.

5 Comments

thank you for your answer...I'm going to try this out, but is there another way instead of explicitly writing each item, because I have thousands of items and millions of rows!
As you read through my post again, you will see (in the middle) how I generate the SQL programmatically.
hi thanks..I have tried it but it gives me this error: _mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'CNT, 0), COALESCE(01.CNT, 0), COALESCE(0113.CNT, 0), COALESCE(011Z.CNT, 0), COAL' at line 1")
I have another error though: 'Too many tables; MySQL can only use 61 tables in a join'
Well, I can imagine a hackish workaround -- forming a series of SQL queries that use <= 61 joins, and then using Python to "join the joins" into a final result. But that certainly does not feel like a great solution. Perhaps it would be better to open a new question on how to deal with a query that seems to require more than 61 joins. I don't know the best answer to that.
0

You can do it on access via a crosstab query where the row header will be the id and the column header will be the item and cnt the value to be aggregated. You can then loop through each column of each row to get your vectors. See here for help on crosstab queries: http://allenbrowne.com/ser-67.html

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.