
I am reading in a CSV using pandas' chunking functionality (`chunksize`). It works, except that I am not able to retain the headers. Is there a way/option to do this? Here is sample code:

import pyspark
import pandas as pd
sc = pyspark.SparkContext(appName="myAppName")
spark_rdd = sc.emptyRDD()

# filename: path to the CSV file
chunks = pd.read_csv(filename, chunksize=10000)
for chunk in chunks:
    spark_rdd += sc.parallelize(chunk.values.tolist())

    #print(chunk.head())
    #print(spark_rdd.toDF().show())
    #break

spark_df = spark_rdd.toDF()
spark_df.show()
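For what it's worth, each pandas chunk does carry the column names; it is the round trip through `chunk.values.tolist()` and the bare `toDF()` that drops them. A minimal pandas-only sketch (using an in-memory CSV in place of `filename`) showing that the names survive chunking:

```python
import io
import pandas as pd

# Stand-in for the CSV file; any CSV with a header row behaves the same.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")

chunks = pd.read_csv(csv_data, chunksize=2)
first_chunk = next(chunks)

# The header line is parsed once and attached to every chunk,
# so the names can be captured from the first chunk.
print(list(first_chunk.columns))  # ['a', 'b']
```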

2 Answers


Try this: read a few rows first to grab the column names, then pass them to `toDF`:

import pyspark
import pandas as pd
sc = pyspark.SparkContext(appName="myAppName")
spark_rdd = sc.emptyRDD()

# Read ten rows just to get the column names
x = pd.read_csv(filename, nrows=10)
mycolumns = list(x.columns)

# filename: path to the CSV file
chunks = pd.read_csv(filename, chunksize=10000)
for chunk in chunks:
    spark_rdd += sc.parallelize(chunk.values.tolist())

spark_df = spark_rdd.map(lambda x: tuple(x)).toDF(mycolumns)
spark_df.show()

2 Comments

For reading the headers, `x = pd.read_csv(filename, nrows=1)` should suffice?
I agree it's arbitrary; practically it won't matter whether you take 1, 5, or 10 rows, as long as you take at least one.
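Taking this one step further, `nrows=0` parses only the header line, so no data rows are read at all. A small sketch (using an in-memory CSV in place of the file on disk):

```python
import io
import pandas as pd

# Stand-in for the CSV file on disk.
csv_data = io.StringIO("name,age\nalice,30\nbob,25\n")

# nrows=0 parses the header row only; the result is an empty DataFrame
# whose .columns attribute holds the names.
mycolumns = list(pd.read_csv(csv_data, nrows=0).columns)
print(mycolumns)  # ['name', 'age']
```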

I ended up using Databricks' spark-csv package:

sc = pyspark.SparkContext()
sql = pyspark.SQLContext(sc)

df = sql.read.load(filename,
                   format='com.databricks.spark.csv',
                   header='true',
                   inferSchema='true')
