
I have an array named "extractColumns" and a DataFrame named "raw_data". I want to create a new DataFrame by selecting the columns listed in the array from the DataFrame. If one of those columns does not exist in the DataFrame, the select should still work and that column should come back as NULL.

How can I do this?
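
For concreteness, here is a hypothetical sketch of what I mean; the column names and data below are made up and the real raw_data is different:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input, just for illustration
raw_data = spark.createDataFrame([('rs1', '1')], ['refsnp_id', 'chr_name'])
extractColumns = ['refsnp_id', 'chr_name', 'version']  # 'version' is not in raw_data

# Desired result of the "select":
# +---------+--------+-------+
# |refsnp_id|chr_name|version|
# +---------+--------+-------+
# |      rs1|       1|   null|
# +---------+--------+-------+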

  • Please give some insights about the input and output datasets. Commented Mar 17, 2022 at 13:27
  • I'm still unclear about your use case here. Commented Mar 17, 2022 at 14:06
  • In that case just create a new column version as below (formatted sketch after this list) - from pyspark.sql.functions import * raw_data.select(['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end']).withColumn("version", lit(None)) Commented Mar 17, 2022 at 14:57
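
The suggestion from the last comment, written out as a runnable sketch; it assumes the real raw_data actually contains the four listed columns (the toy DataFrame in the answer below does not):

from pyspark.sql import functions as F

# Assumes raw_data really has these four columns; 'version' is then added as a NULL column
new_df = (
    raw_data
    .select(['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end'])
    .withColumn('version', F.lit(None))
)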

1 Answer

# Toy example standing in for the real raw_data
raw_data = spark.createDataFrame(
    [
        ('1', 20),
        ('2', 34),
        ('3', 12),
    ],
    ['foo', 'bar'],
)

# Columns I want to extract from raw_data
extractColumns = ['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end', 'version']

import pyspark.sql.functions as F

new_raw_data = raw_data

# Add each requested column that is missing from raw_data as a NULL literal
for col in extractColumns:
    if col not in raw_data.columns:
        new_raw_data = new_raw_data.withColumn(col, F.lit(None))

new_raw_data.show()
+---+---+---------+--------+-----------+---------+-------+
|foo|bar|refsnp_id|chr_name|chrom_start|chrom_end|version|
+---+---+---------+--------+-----------+---------+-------+
|  1| 20|     null|    null|       null|     null|   null|
|  2| 34|     null|    null|       null|     null|   null|
|  3| 12|     null|    null|       null|     null|   null|
+---+---+---------+--------+-----------+---------+-------+
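
If the result should contain only the columns in extractColumns (dropping foo and bar), a short follow-up, assuming the same variables as above, is to finish with a select over the list; an equivalent single-pass sketch using a list comprehension is shown as well.

# Keep only the requested columns, in the order given by extractColumns
new_raw_data.select(extractColumns).show()

# Equivalent alternative in one select: keep existing columns, add missing ones as NULL
new_raw_data = raw_data.select(
    [F.col(c) if c in raw_data.columns else F.lit(None).alias(c) for c in extractColumns]
)
new_raw_data.show()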
