9

I'm using the Python cassandra driver to connect and query our Cassandra cluster.

I want to manipulate my data via Pandas, there is an area in the documentation for the cassandra driver that mentions this exactly: https://datastax.github.io/python-driver/api/cassandra/protocol.html

NumpyProtocolHander: deserializes results directly into NumPy arrays. This facilitates efficient integration with analysis toolkits such as Pandas.

Following the above instructions and doing a SELECT query in Cassandra, I can see the output (via the type() function) as a:

<class 'cassandra.cluster.ResultSet'>

Iterating through the results, this what printing a row comes up as:

{u'reversals_rejected': array([0, 0]), u'revenue': array([ 0, 10]), u'reversals_revenue': array([0, 0]), u'rejected': array([3, 1]), u'impressions_positive': array([3, 3]), u'site_user_id': array([226226, 354608], dtype=int32), u'error': array([0, 0]), u'impressions_negative': array([0, 0]), u'accepted': array([0, 2])}

(I've limited the query results, I'm working with much larger amounts of data - hence wanting to use numpy and pandas).

My knowledge of Pandas is limited, I attempted to run very basic functionalities:

rslt = cassandraSession.execute("SELECT accepted FROM table")

test = rslt[["accepted"]].head(1)

This outputs the following error:

Traceback (most recent call last):
  File "/UserStats.py", line 27, in <module>
    test = rslt[["accepted"]].head(1)
  File "cassandra/cluster.py", line 3380, in cassandra.cluster.ResultSet.__getitem__ (cassandra/cluster.c:63998)
TypeError: list indices must be integers, not list

I understand the error, I just don't know how to "transition" from this supposed numpy array to being able to use Pandas.

1 Answer 1

7

The short answer is:

df = pd.DataFrame(rslt[0])
test = df.head(1)

rslt[0] gives you your data as a Python dict, that can be easily converted to a Pandas dataframe.

For a complete solution:

import pandas as pd
from cassandra.cluster import Cluster
from cassandra.protocol import NumpyProtocolHandler
from cassandra.query import tuple_factory

cluster = Cluster(
    contact_points=['your_ip'],
    )
session = cluster.connect('your_keyspace')
session.row_factory = tuple_factory
session.client_protocol_handler = NumpyProtocolHandler

prepared_stmt = session.prepare ( "SELECT * FROM ... WHERE ...;")
bound_stmt = prepared_stmt.bind([...])
rslt = session.execute(bound_stmt)
df = pd.DataFrame(rslt[0])

Note: The above solution will only get you part of the data if the query is large. So you should do:

df = pd.DataFrame()
for r in rslt:
    df = df.append(r)
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks for your help! Specifying the index for the DataFrame was what I was missing in the end!
What goes in the .bind([...]) brackets?
@yeliabsalohcin It's the variables you want to send to the query. So if you had a query like this in your prepared statement: "SELECT * FROM users WHERE user_id=? AND date=?" they would replace the "?".

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.