I need to do a batch INSERT in Cassandra using Python. I am using the latest DataStax Python driver.

The INSERTs are batches of columns that will go into the same row. I will have many rows to insert, but chunks of the data belong to the same row.

I can do individual INSERTs in a for loop, as described in this post: Parameterized queries with the Python Cassandra Module. I am using a parameterized query with values, as shown in that example.

This did not help: How to multi insert rows in cassandra

I am not clear on how to assemble a parameterized batch INSERT along these lines:

BEGIN BATCH  
  INSERT(query values1)  
  INSERT(query values2)  
  ...  
APPLY BATCH;  
cursor.execute(batch_query)  

Is this even possible? Will this speed up my INSERTs? I have to do millions, and even thousands take too long. I found some related Java info: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

3 Comments
  • Why do you need a batch for that? Just insert your data using prepared inserts and you'll be fine. Commented Apr 7, 2014 at 23:59
  • Won't the inserts be faster if they are done in a BATCH? There are chunks of INSERTs that go to the same row. Will BATCH make a difference for localized writes, where the INSERTs for the same row happen contiguously instead of Cassandra jumping around between rows? Commented Apr 8, 2014 at 4:39
  • twitter.com/spyced/status/453640194076340224 Commented Apr 8, 2014 at 21:13

2 Answers

Since version 2.0.0 of the driver, there is a BatchStatement construct. If you are using the cqlengine ORM, the BatchQuery class serves the same purpose. For example:

import logging
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

logger = logging.getLogger(__name__)

cluster = Cluster([server_ip])
session = cluster.connect(keyspace)
insert_user = session.prepare('INSERT INTO table_name (id, name) VALUES (?, ?)')
batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)

for i, j in some_value:
    batch.add(insert_user, (i, j))  # queued only; nothing is written yet

try:
    session.execute(batch)  # the whole batch is applied here
    logger.info('Batch inserted into the table')
except Exception as e:
    logger.error('Cassandra error: {}'.format(e))
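
For the cqlengine ORM route mentioned above, here is a minimal sketch; the UserModel class is a hypothetical stand-in for your own table definition.

from cassandra.cqlengine import columns, connection
from cassandra.cqlengine.models import Model
from cassandra.cqlengine.query import BatchQuery

# Hypothetical model standing in for your own table.
class UserModel(Model):
    id = columns.Integer(primary_key=True)
    name = columns.Text()

connection.setup(['127.0.0.1'], 'keyspace_name')

# Every create queued inside the context manager is sent as a single
# batch when the block exits.
with BatchQuery() as b:
    UserModel.batch(b).create(id=1, name='alice')
    UserModel.batch(b).create(id=2, name='bob')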

Intro: Right now the DataStax Python driver doesn't support the CQL native protocol introduced with Cassandra 2.0; that support is a work in progress, and betas should show up soon. At that point you'll be able to build a BATCH statement and add bound prepared statements to it as needed.

Considering the above, the solution you can use for now is the one described in the post you linked: prepare a single statement that wraps a series of INSERTs in a BATCH. The obvious downside is that you have to decide upfront how many INSERTs each batch will contain, and split your input data accordingly.

Example code:

BATCH_SIZE = 10
# Each inner statement needs its own terminating semicolon, and the
# pieces must be joined with whitespace to form valid CQL.
INSERT_STMT = 'INSERT INTO T (id, fld1) VALUES (?, ?); '
BATCH_STMT = 'BEGIN BATCH '

for i in range(BATCH_SIZE):
    BATCH_STMT += INSERT_STMT

BATCH_STMT += 'APPLY BATCH;'
prep_batch = session.prepare(BATCH_STMT)

Then, as you receive data, you can iterate over it and, for every BATCH_SIZE rows, bind those values to prep_batch and execute it.
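
A rough sketch of that loop (the helper name and chunking are illustrative; a final chunk smaller than BATCH_SIZE would need its own, smaller prepared batch):

from itertools import islice

def execute_in_batches(session, prep_batch, rows, batch_size=10):
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if len(chunk) < batch_size:
            break  # bind a short final chunk to a smaller prepared batch
        # The prepared batch expects batch_size * 2 bind parameters,
        # so flatten the (id, fld1) tuples in statement order.
        params = [value for row in chunk for value in row]
        session.execute(prep_batch.bind(params))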

3 Comments

Thanks. I suspected that from the DataStax post. Your answer helps, though.
I believe this answer is now out of date.
