
I have two tables: trainSearchStream and SearchInfo. My code is supposed to merge the two tables on a common column named SearchID. The problem is that the code runs for some time and then terminates without reporting any error in my Eclipse console. The number of rows in each table is: trainSearchStream: 392,356,948 and SearchInfo: 112,159,462. The size of my database (which includes other tables) is 40GB. Update: when I ran the code in a terminal I got a message saying: 'killed: 9'.

import pandas as pd
import sqlite3


# Create connection.
con = sqlite3.connect(path + 'database.sqlite')

# Read into a data frame.
df = pd.read_sql_query("SELECT * FROM trainSearchStream, SearchInfo WHERE trainSearchStream.SearchID = SearchInfo.SearchID;", con)

# Save to file.
df.to_csv(path + 'traindata.csv', index=False, encoding='utf-8')
  • I suppose it terminates during the read_sql_query call and not the to_csv one? Do you know what the size of the tables you want to query would be? And how much memory do you have? 40GB is not something that fits in a typical laptop's memory. Commented Jul 7, 2015 at 11:39
  • BTW, if your only goal of above is to export the data to csv, there are probably better ways to do this than a roundtrip to python. See eg stackoverflow.com/questions/6076984/… Commented Jul 7, 2015 at 11:41
  • A few questions: what is the size of the result of your query? (You can execute it elsewhere than in Python.) Do you know if your Python is 32- or 64-bit? I guess the result of your query loads into RAM, and you are unlikely to have 40GB of RAM. Maybe it's over 2GB and you are on a 32-bit version, and that causes the problem. Commented Jul 7, 2015 at 11:41
  • I am not a Python guy so I don't know exactly what your code does, but I think pd.read_sql_query will load the result of your query into memory. In this case that's not a good idea because of the size of the result. I would stream the query and then write the CSV line by line. For small results that way would be slower. Commented Jul 7, 2015 at 11:48
  • @joris yes, it terminates before to_csv. I have a laptop with 16GB memory and 500GB flash storage. I also have the tables in .tsv format; in that format they have sizes of 9.47GB and 11.02GB. My goal is to join several tables on a common column and then export them to a CSV file. I tried pandas' merge function, but my code would terminate. When I ran the pandas code in a terminal I got a message saying: 'killed: 9'. Commented Jul 7, 2015 at 11:52
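The streaming idea from the comments can be sketched with pandas' `chunksize` parameter, which turns `read_sql_query` into an iterator of DataFrames so only one chunk is in memory at a time. This is a minimal sketch using tiny in-memory stand-in tables (the `AdID` and `Query` columns are made up for the demo; the real code would open the 40GB database from disk and use a much larger chunk size):

```python
import sqlite3
import pandas as pd

# Demo setup: a tiny in-memory stand-in for the real database.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE trainSearchStream (SearchID INT, AdID INT)")
con.execute("CREATE TABLE SearchInfo (SearchID INT, Query TEXT)")
con.executemany("INSERT INTO trainSearchStream VALUES (?, ?)",
                [(i, i * 10) for i in range(1000)])
con.executemany("INSERT INTO SearchInfo VALUES (?, ?)",
                [(i, 'q%d' % i) for i in range(1000)])

query = ("SELECT * FROM trainSearchStream, SearchInfo "
         "WHERE trainSearchStream.SearchID = SearchInfo.SearchID;")

# chunksize makes read_sql_query return an iterator of DataFrames,
# so the full join result is never held in memory at once.
first = True
for chunk in pd.read_sql_query(query, con, chunksize=100):
    # Append each chunk to the CSV; write the header only once.
    chunk.to_csv('traindata.csv', mode='w' if first else 'a',
                 header=first, index=False, encoding='utf-8')
    first = False
```

Selecting explicit columns instead of `SELECT *` would also avoid duplicating the SearchID column in the output.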

1 Answer


As mentioned in the comments, the best way, I think, is to use sqlite directly and not Python.

So the solution is, I guess, this:

sqlite> .mode csv
sqlite> .output test.csv
sqlite> SELECT * FROM trainSearchStream, SearchInfo WHERE trainSearchStream.SearchID = SearchInfo.SearchID; /* your query here */
sqlite> .output stdout
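If staying in Python is preferred, the same "write line by line" idea works with a plain `sqlite3` cursor and the standard-library `csv` module, since iterating over a cursor fetches rows lazily. A sketch with made-up demo tables (column names `AdID` and `Query` are assumptions, not the real schema):

```python
import csv
import sqlite3

# Demo setup: tiny stand-in tables (the real DB would be opened from disk).
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE trainSearchStream (SearchID INT, AdID INT)")
con.execute("CREATE TABLE SearchInfo (SearchID INT, Query TEXT)")
con.executemany("INSERT INTO trainSearchStream VALUES (?, ?)",
                [(i, i) for i in range(500)])
con.executemany("INSERT INTO SearchInfo VALUES (?, ?)",
                [(i, 'query') for i in range(500)])

cur = con.execute(
    "SELECT * FROM trainSearchStream, SearchInfo "
    "WHERE trainSearchStream.SearchID = SearchInfo.SearchID;")

with open('joined.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Column names come from the cursor's description attribute.
    writer.writerow([col[0] for col in cur.description])
    for row in cur:  # cursor iteration streams rows one at a time
        writer.writerow(row)
```

This keeps memory usage constant regardless of the size of the join result, which is exactly why the sqlite shell approach above works where the all-at-once pandas call gets killed.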

Sorry, that is not quite an answer, but posting the code in a comment would have destroyed the indentation, and we are using Python...

Could you run this and tell us what gets printed?

import pandas as pd
import sqlite3


# Create connection.
con = sqlite3.connect(path + 'database.sqlite')

try:
    df = pd.read_sql_query("SELECT * FROM trainSearchStream, SearchInfo WHERE trainSearchStream.SearchID = SearchInfo.SearchID;", con)
    print("read_sql_query went well")
except Exception as e:
    print("read_sql_query failed: " + str(e))

try:
    df.to_csv(path + 'traindata.csv', index=False, encoding='utf-8')
    print("to_csv went well")
except Exception as e:
    print("to_csv failed: " + str(e))

1 Comment

The code terminates without leaving any message. @Pholochtairze
