Adding missing columns to a dataframe pyspark

Question

When reading data from a text file with PySpark using the following code,

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.option("sep", "|").option("header", "false").csv('D:\\DATA-2021-12-03.txt')

My data file looks like this:

col1|cpl2|col3|col4
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>

But the output I got was,

col1|cpl2|col3|col4
112 |4344|fn1 | home_a

Is there a way to add those missing columns to the dataframe?

Expected output:

col1|cpl2|col3|col4|col5|col6|col7|col8
112 |4344|fn1 | home_a| extras| applied | <null>| <empty>

I think you have to either modify the CSV file header or insert it before you read the file using a CSV reader.
– k88, Commented Dec 24, 2021 at 18:25
If you don't care about the names of the existing columns, option("header", "true") and passing a custom schema should work, no? (There might be some warnings about the header not matching the schema.)
– njzk2, Commented Dec 24, 2021 at 18:50
Yeah, tried that, but it gives me an error. The dataframe conversion is failing.
– codebot, Commented Dec 24, 2021 at 18:51

Nithish · Accepted Answer · 2021-12-24 18:51:47Z

You can explicitly specify the schema instead of inferring it.


from pyspark.sql.types import StructType, StringType

# Declare all eight columns explicitly as nullable strings.
schema = StructType() \
      .add("col1", StringType(), True) \
      .add("col2", StringType(), True) \
      .add("col3", StringType(), True) \
      .add("col4", StringType(), True) \
      .add("col5", StringType(), True) \
      .add("col6", StringType(), True) \
      .add("col7", StringType(), True) \
      .add("col8", StringType(), True)

# header=true skips the file's own header line; column names come from the schema.
df = spark.read.option("sep", "|").option("header", "true").schema(schema).csv('70475571_data.txt')

Output

+----+----+----+-------+-------+---------+-------+--------+
|col1|col2|col3|   col4|   col5|     col6|   col7|    col8|
+----+----+----+-------+-------+---------+-------+--------+
|112 |4344|fn1 | home_a| extras| applied | <null>| <empty>|
+----+----+----+-------+-------+---------+-------+--------+
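
If the number of columns is not known ahead of time, the explicit schema above could be generated from the widest row instead of being typed by hand. The following is only a minimal sketch and not part of the original answer: `max_field_count`, `column_names`, `read_padded`, and the `colN` naming are hypothetical helpers, and it assumes the file is local and small enough to scan once on the driver before Spark reads it.

```python
def max_field_count(path, sep="|"):
    """Return the largest number of sep-separated fields found on any line."""
    with open(path, encoding="utf-8") as handle:
        return max(
            (len(line.rstrip("\n").split(sep)) for line in handle),
            default=0,
        )


def column_names(count, prefix="col"):
    """Generate names prefix1..prefixN (a hypothetical naming convention)."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def read_padded(spark, path, sep="|"):
    """Read path with an all-string schema sized to the widest row."""
    # Imported here so the two helpers above can be used without Spark installed.
    from pyspark.sql.types import StructField, StructType, StringType

    names = column_names(max_field_count(path, sep))
    schema = StructType([StructField(name, StringType(), True) for name in names])
    # header=true skips the file's first line; names come from the schema instead.
    return (
        spark.read.option("sep", sep)
        .option("header", "true")
        .schema(schema)
        .csv(path)
    )
```

With the sample file from the question, `max_field_count` would return 8, so the generated names run from `col1` to `col8`, the same as the hand-written schema. A call would then look like `df = read_padded(spark, '70475571_data.txt')`.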
