'large' Pyspark dataframe write to parquet/convert to Pandas dataframe

Question

I'm trying to join lots of 'small csv'(1000+files, 6 millions rows every). i'm using Pyspark on a fat node (Memory: 128G, CPU: 24 cores). However, when i tried to write this dataframe out to parquet. 'stack over flow occurs'.

sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
bg_f = getfiles('./files')
SName = str(os.path.basename(bg_f[0]).split('.')[0])
schema = StructType([
    StructField('CataID', StringType(), True),
    StructField('Start_Block', IntegerType(), True),
    StructField('End_Block', IntegerType(), True),
    StructField(BName, IntegerType(), True)
])
temp = sqlContext.read.csv(bg_f[0], sep='\t', header=False, schema=schema)
for p in bg_f[1:]:
    SName = str(os.path.basename(p).split('.')[0])
    schema = StructType([
        StructField('CataID', StringType(), True),
        StructField('Start_Block', IntegerType(), True),
        StructField('End_Block', IntegerType(), True),
        StructField(BName, IntegerType(), True)
    ])
    cur = sqlContext.read.csv(p, sep='\t', header=False, schema=schema)
    temp = temp.join(cur,
                     on=['CataID', 'Start_Block', 'End_Block'],
                     how='outer')
temp = temp.drop('CataID', 'Start_Block', 'End_Block')

Jonathan Oheix · Accepted Answer · 2019-09-02 10:23:11Z

1

This happens because of your join instruction which duplicate rows and is memory consuming :

temp.join(cur,
          on=['CataID', 'Start_Block', 'End_Block'],
          how='outer')

If you only keep the column BName, why not select only this one after the read.csv ?

temp = sqlContext.read.csv(bg_f[0], sep='\t', header=False, schema=schema).select(BName)

You could then use :

temp = temp.union(cur)

instead of join, and drop duplicated rows at the end :

temp = temp.distinct()

answered Sep 2, 2019 at 10:23

Jonathan Oheix

1114 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

'large' Pyspark dataframe write to parquet/convert to Pandas dataframe

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related