I have 5 CSV files, and only the first file contains the header. I want to read them all into a single DataFrame using Spark. My code below works, but I lose 4 rows of data because header is set to true in the final read, so the first line of each of the other four files is treated as a header and dropped. If I set header to false, I get those 4 rows back, but the actual header from the first file also shows up as a row in my data.
Is there a better way to do this, so that no rows are lost and the header doesn't show up as a row in my dataset?
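# Read only the first file (the one with the header) to infer the schema.
header = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/file-1")
schema = header.schema

# Read every file in the directory with that schema.
# header=true drops the first line of each file, which is where the 4 rows go missing.
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load("path")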
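One possible approach, sketched here rather than offered as the definitive fix: read the file that has the header with header set to true, read the remaining files with header set to false using the same schema, and union the two results. The function name `read_csvs_with_single_header` and the paths `path/file-2` through `path/file-5` in the usage note are hypothetical. The sketch assumes the headerless files can be listed explicitly and that the schema inferred from the first file also fits the others.

```python
def read_csvs_with_single_header(spark, header_file, headerless_files):
    """Combine one CSV file that has a header with CSV files that do not.

    spark            -- an existing SparkSession
    header_file      -- path of the only file whose first line is a header
    headerless_files -- list of paths whose first line is already data
    """
    # The header file gives us both the column names and the inferred types.
    first = (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(header_file)
    )

    # The other files have no header line, so nothing is skipped here.
    # Reusing the schema keeps column names and types aligned with `first`.
    rest = (
        spark.read.format("csv")
        .option("header", "false")
        .schema(first.schema)
        .load(headerless_files)
    )

    # union matches columns by position, which is safe because both
    # DataFrames share the same schema.
    return first.union(rest)
```

With the files from the question this might be called as `df = read_csvs_with_single_header(spark, "path/file-1", ["path/file-2", "path/file-3", "path/file-4", "path/file-5"])`, assuming those are the actual file names. Because the header file is read separately, its header line never reaches the data, and no line is dropped from the other four files.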