
I need to read multiple CSV files in one go. The files may have a variable number of columns, in any order, and we only need to read specific columns from them. How do we do that? I have tried defining a custom schema, but then I get the wrong data in the columns.

For example, take a CSV file with this header:

ID, Name, Address

How do I select only the ID and Address columns? If I say select (Id, Address), the Name data ends up in the Address column. I want to select only the ID and Address columns according to the header names while reading.

Thanks, Naveed

1 Answer


You can iterate over the files and build up a final DataFrame like this:

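from pyspark.sql import SparkSession, types as t

spark = SparkSession.builder.getOrCreate()

files = ['path/to/file1.csv', 'path/to/file2.csv', 'path/to/file3.csv', 'path/to/file4.csv']

# Define the output DataFrame's schema; the column names and types
# must match the columns selected below
schema = t.StructType([
    t.StructField("a", t.StringType(), True),
    t.StructField("c", t.StringType(), True),
])

output_df = spark.createDataFrame([], schema)

for file in files:
    # header=True names the columns from each file's own header row,
    # so select() picks them by name regardless of their position
    df = spark.read.csv(file, header=True)
    output_df = output_df.union(df.select('a', 'c'))

output_df.show()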

output_df will contain your desired output.


5 Comments

We are parsing each file as it is, with header=True, while reading the CSV. Try it and then share the result.
Nice solution..!
Is there any way to read only the desired columns from the CSV itself, rather than iterating over the files? We have around 10000 files, so iterating over them would lead to very bad performance.
@ShubhamJain, how do I get a list of all the files in a data lake directory, in the list format you mentioned?
The performance is very bad when reading data from 10000 files iteratively. We need to find a way to read only the selected columns from the CSV files.
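
On the two questions raised in the comments (listing the files, and avoiding one read per file), here is one possible approach. It is a sketch and not something taken from the answer above: read only the header row of each file, group the files by header layout, and then issue one spark.read.csv call per layout instead of one per file. The helper names list_csv_files, group_by_header and read_selected are hypothetical. The sketch also assumes the files sit on a local or mounted filesystem that Python's glob and open can see; a data lake that is not mounted would need its own listing API instead. Opening every file on the driver to read its header still costs something, so whether this helps depends on how many distinct layouts you really have.

```python
import csv
import glob
from collections import defaultdict
from functools import reduce


def list_csv_files(directory):
    """Return a sorted list of the CSV paths directly under a directory.

    Assumption: the directory is local or mounted, so glob can see it.
    """
    return sorted(glob.glob(f"{directory}/*.csv"))


def group_by_header(paths):
    """Map each distinct header (a tuple of column names) to the files that use it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, newline="") as handle:
            # Read only the first row; an empty file yields an empty header.
            header = next(csv.reader(handle), [])
        groups[tuple(header)].append(path)
    return dict(groups)


def read_selected(spark, paths, columns):
    """Read the wanted columns with one Spark read per header layout.

    `spark` is an existing SparkSession. Layouts that lack any of the
    wanted columns are skipped. Raises TypeError if no layout matches.
    """
    frames = [
        spark.read.csv(same_layout, header=True).select(*columns)
        for header, same_layout in group_by_header(paths).items()
        if set(columns) <= set(header)
    ]
    # unionByName matches columns by name rather than by position.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

Usage would look like read_selected(spark, list_csv_files('path/to'), ['a', 'c']). Files that share a header are passed to Spark as a single list of paths, so 10000 files with a handful of layouts become a handful of reads.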
