0

I am trying to ingest 2 csv files into a single spark dataframe. However, the schema of these 2 datasets is very different, and when I perform the below operation, I get back only the schema of the second csv, as if the first one doesn't exist. How can I solve this? My final goal is to count the total number of words.

paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://[email protected]/csvs/bbc_news.csv"]

df0_spark=spark.read.format("csv").option("header","false").load(paths)
df0_spark.write.mode("overwrite").saveAsTable("ML_reddit2")
df0_spark.show()

I tried to load both of the files into a single spark dataframe, but it only gives me back one of the tables.

1
  • how different are the schemas? Commented Oct 28, 2022 at 6:28

2 Answers 2

1

I have reproduced the above and got the below results.

For sample, I have two csv files in dbfs with different schemas. when I execute the above code, I got the same result.

enter image description here

To get the desired schema enable mergeSchemaand header while reading the files.

Code:

df0_spark=spark.read.format("csv").option("mergeSchema","true").option("header","true").load(paths)
df0_spark.show()

enter image description here

If you want to combine the two files without nulls, we should have a common identity column and we have to read the files individually and use inner join for that.

Sign up to request clarification or add additional context in comments.

Comments

0

The solution that has worked for me the best in such cases was to read all distinct files separately, and then union them after they have been put into DataFrames. So your code could look something like this:

paths = ["abfss://lmne.dfs.core.windows.net/csvs/MachineLearning_reddit.csv", "abfss://[email protected]/csvs/bbc_news.csv"]

# Load all distinct CSV files
df1 = spark.read.option("header", false).csv(paths[0])
df2 = spark.read.option("header", false).csv(paths[1])

# Union DataFrames
combined_df = df1.unionByName(df2, allowMissingColumns=True)

Note: if the names of columns differ between the files, then for all columns from first file that are not present in second one, you will have null values. If the schema should be matching, then you can always rename the columns, before the unionByName step.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.