
I have an array named "extractColumns" and a DataFrame named "raw_data". I want to create a new DataFrame by selecting the columns listed in the array from the DataFrame. If one of those columns does not exist in the DataFrame, the select should still work and that column should come back as NULL.

How can I do this?
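
For concreteness, here is a hypothetical sketch of what I mean; the column names and data below are made up and the real raw_data is different:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input, just for illustration
raw_data = spark.createDataFrame([('rs1', '1')], ['refsnp_id', 'chr_name'])
extractColumns = ['refsnp_id', 'chr_name', 'version']  # 'version' is not in raw_data

# Desired result of the "select":
# +---------+--------+-------+
# |refsnp_id|chr_name|version|
# +---------+--------+-------+
# |      rs1|       1|   null|
# +---------+--------+-------+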

  • Please give some insights about the input and output datasets. Commented Mar 17, 2022 at 13:27
  • I'm still unclear about your use case here. Commented Mar 17, 2022 at 14:06
  • In that case just create a new column version as below (formatted sketch after this list) - from pyspark.sql.functions import * raw_data.select(['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end']).withColumn("version", lit(None)) Commented Mar 17, 2022 at 14:57
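
The suggestion from the last comment, written out as a runnable sketch; it assumes the real raw_data actually contains the four listed columns (the toy DataFrame in the answer below does not):

from pyspark.sql import functions as F

# Assumes raw_data really has these four columns; 'version' is then added as a NULL column
new_df = (
    raw_data
    .select(['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end'])
    .withColumn('version', F.lit(None))
)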

1 Answer

# Toy example standing in for the real raw_data
raw_data = spark.createDataFrame(
    [
        ('1', 20),
        ('2', 34),
        ('3', 12),
    ],
    ['foo', 'bar'],
)

# Columns I want to extract from raw_data
extractColumns = ['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end', 'version']

import pyspark.sql.functions as F

new_raw_data = raw_data

# Add each requested column that is missing from raw_data as a NULL literal
for col in extractColumns:
    if col not in raw_data.columns:
        new_raw_data = new_raw_data.withColumn(col, F.lit(None))

new_raw_data.show()
+---+---+---------+--------+-----------+---------+-------+
|foo|bar|refsnp_id|chr_name|chrom_start|chrom_end|version|
+---+---+---------+--------+-----------+---------+-------+
|  1| 20|     null|    null|       null|     null|   null|
|  2| 34|     null|    null|       null|     null|   null|
|  3| 12|     null|    null|       null|     null|   null|
+---+---+---------+--------+-----------+---------+-------+
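
If the result should contain only the columns in extractColumns (dropping foo and bar), a short follow-up, assuming the same variables as above, is to finish with a select over the list; an equivalent single-pass sketch using a list comprehension is shown as well.

# Keep only the requested columns, in the order given by extractColumns
new_raw_data.select(extractColumns).show()

# Equivalent alternative in one select: keep existing columns, add missing ones as NULL
new_raw_data = raw_data.select(
    [F.col(c) if c in raw_data.columns else F.lit(None).alias(c) for c in extractColumns]
)
new_raw_data.show()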
