
I have a pandas DataFrame which I have preallocated with 400,000 rows and two columns: a timestamp of type datetime.datetime and a float. When I try to insert (overwrite) a row in the table it seems rather slow; depending on the size of the table I get something like 0.044 seconds per assignment. I have created an integer index and I am using it to access the row. Here is how I am using it:

maxsize = 400000
data = pd.DataFrame({'ts' : date_list, 'val' : zeros}, index=range(maxsize))
# later on, the next statement is "slow"
data.iloc[0] = pd.Series({'ts' : datetime.datetime.now(), 'val': val})

From my investigation, the last statement takes about 0.044 seconds on my machine (i7-4650U), which seems quite slow. Is there something I am doing fundamentally wrong? Could I use something like an HDFStore to improve write speed while still keeping read speed high?

Thanks.

2 Answers


I think your problem is more a process one than a programming one. Why use Python as the data-storage layer if you are worried about performance? Essentially, Python should act like a client that works with data pulled from an external source, namely a dedicated database such as MySQL or SQLite (via ODBC/OLEDB).

So why not structure your dataset (append rows, update records, select columns) in an indexed, relational SQL database beforehand, and then import it into a pandas DataFrame for analytical and graphical purposes? Examples include:

DATABASE CONNECTION

conn = sqlite3.connect("databasename.sqlite")
df = pd.read_sql("SELECT [field1], [field2] from datatable", conn)
df

APPEND ROWS

conn = sqlite3.connect('databasename.sqlite')
cur = conn.cursor()
# build the INSERT statement from an items sequence holding the new row's values
sql = "INSERT INTO datatable (field1, field2, field3) VALUES ('{0}','{1}','{2}');".format(items[0], items[1], items[2])

cur.execute(sql)
conn.commit()
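
As a side note, sqlite3 also supports parameterized queries, which sidestep quoting problems and SQL injection. A minimal sketch, reusing the same placeholder table and column names (the sample items tuple below is made up):

import sqlite3

conn = sqlite3.connect('databasename.sqlite')
cur = conn.cursor()
# placeholder values standing in for the items sequence above
items = ('2015-02-24 06:03:58', 1.23, 'some label')
# '?' placeholders let the driver handle quoting and escaping
cur.execute("INSERT INTO datatable (field1, field2, field3) VALUES (?, ?, ?);",
            (items[0], items[1], items[2]))
conn.commit()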

CSV EXPORT/IMPORT

import csv

conn = sqlite3.connect('databasename.sqlite')
cur = conn.cursor()
cur.execute("SELECT [field1], [field2] FROM datatable")

# dump the query result to CSV
with open('datafile.csv', 'w', newline='') as f:
    a = csv.writer(f)
    for row in cur.fetchall():
        a.writerow(row)

# read it back into pandas in chunks and concatenate
filepath = 'datafile.csv'  # written above from the database
tp = pd.read_csv(filepath, sep=',', iterator=True, chunksize=1000, encoding="ISO-8859-1")
finaldf = pd.concat(list(tp), ignore_index=True)
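
If the CSV file is only an intermediate step, the round trip can be skipped entirely. A minimal sketch, reusing the same placeholder names: pandas can read the query result straight from the connection and write the CSV itself if a file on disk is still wanted:

import sqlite3
import pandas as pd

conn = sqlite3.connect('databasename.sqlite')
# read the query result directly into a DataFrame
df = pd.read_sql("SELECT [field1], [field2] FROM datatable", conn)
# optionally export it with pandas instead of the csv module
df.to_csv('datafile.csv', index=False, encoding='ISO-8859-1')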

You are assigning a Series that has object dtype, in other words it is mixed. So when the element assignment happens, the datetime needs to be converted. All of that is cheap; what is expensive is that each column needs to be internally copied to guard against dtype changes. There is also a fair amount of validation in the assignment to handle lots of edge cases.
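
A quick way to see the mixed dtype, using the Series from the question:

import datetime
import pandas as pd

# a dict mixing a datetime and a number produces an object-dtype Series
pd.Series({'ts': datetime.datetime.now(), 'val': 1}).dtype   # dtype('O'), i.e. object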

In [23]: data = pd.DataFrame({'ts' : pd.date_range('20130101',freq='s',periods=maxsize), 'val' : 0}, index=range(maxsize))

In [24]: s = pd.Series({'ts' : datetime.datetime.now(), 'val' : 1 })

In [25]: %timeit data.iloc[-1] = s
100 loops, best of 3: 10.6 ms per loop

You can bypass a lot of this by doing an item-by-item assignment. That is quite fast, but you have to make sure that your dtypes are compatible.

In [26]: def f():
   ....:     data.iat[-1, 0] = s['ts']
   ....:     data.iat[-1, 1] = s['val']
   ....:

In [27]: data.tail()              
Out[27]: 
                               ts  val
399995        2013-01-05 15:06:35    0
399996        2013-01-05 15:06:36    0
399997        2013-01-05 15:06:37    0
399998        2013-01-05 15:06:38    0
399999 2015-02-24 06:03:58.344166    1

In [28]: %timeit f()
10000 loops, best of 3: 35.2 us per loop
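
If that is still too slow, a further option (a sketch of my own, assuming the two-column layout from the question) is to keep the hot write path in plain NumPy arrays and only build the DataFrame when you actually need pandas functionality:

import datetime
import numpy as np
import pandas as pd

maxsize = 400000
# preallocated NumPy buffers; writing into these avoids pandas' assignment machinery
ts_buf = np.empty(maxsize, dtype='datetime64[ns]')
val_buf = np.zeros(maxsize, dtype='float64')

def write(i, ts, val):
    # plain ndarray item assignment, no block copying or dtype validation
    ts_buf[i] = np.datetime64(ts)
    val_buf[i] = val

write(0, datetime.datetime.now(), 3.14)

# assemble a DataFrame only when pandas features (grouping, plotting, ...) are needed
df = pd.DataFrame({'ts': ts_buf, 'val': val_buf})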
