What is the proper and fastest way to read Cassandra data into pandas? Right now I use the following code, but it's very slow...
import pandas as pd
import numpy as np
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory

auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
                  auth_provider=auth_provider)
session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory  # return each row as a dict

sql_query = "SELECT * FROM {}.{};".format(CASSANDRA_DB, CASSANDRA_TABLE)

df = pd.DataFrame()
for row in session.execute(sql_query):
    df = df.append(pd.DataFrame(row, index=[0]))  # one DataFrame created and appended per row
df = df.reset_index(drop=True).fillna(np.nan)
Reading 1000 rows takes about a minute, and I have a "bit more" than that... If I run the same query in, e.g., DBeaver, I get the whole result set (~40k rows) within a minute.
Thank you!!!
`session.execute(sql_query)` gives you a list of dicts; I'd try just `df = pd.DataFrame(session.execute(sql_query))`, or run `pd.DataFrame` on some portion of that list. Appending rows to a data frame one by one is inefficient.
Strictly speaking, `session.execute(sql_query)` is a special `<cassandra.cluster.ResultSet at 0x1b4b61d0>` iterable object. Its rows can be tuples, named tuples, or dictionaries.
If nothing else works, collect the rows into a plain list first (`lst = []`, then `for row in session.execute(sql_query): lst.append(row)`) and build the frame in a single call, `df = pd.DataFrame(lst)` (with `dict_factory` the rows are dicts, so `pd.concat` would not accept them directly). This way you avoid the costly ~40k calls to `pd.DataFrame.append`.
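A minimal sketch of that one-shot approach, assuming the same CASSANDRA_* settings and query as in the question (the driver's ResultSet fetches its pages transparently while you iterate, so list() pulls in all ~40k rows):

import pandas as pd
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory

auth_provider = PlainTextAuthProvider(username=CASSANDRA_USER, password=CASSANDRA_PASS)
cluster = Cluster(contact_points=[CASSANDRA_HOST], port=CASSANDRA_PORT,
                  auth_provider=auth_provider)
session = cluster.connect(CASSANDRA_DB)
session.row_factory = dict_factory  # rows come back as plain dicts

sql_query = "SELECT * FROM {}.{};".format(CASSANDRA_DB, CASSANDRA_TABLE)
rows = list(session.execute(sql_query))  # iterating the ResultSet fetches every page
df = pd.DataFrame(rows)                  # one DataFrame construction instead of one per row

This removes the slow part (a per-row DataFrame construction plus append) entirely; the driver still streams the data in pages, so memory use is roughly one list of dicts plus the final frame.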