I am using the following code to read multiple CSV files, convert each one to a pandas DataFrame, concatenate them into a single pandas DataFrame, and finally convert the result back into a Spark DataFrame. I want to skip the pandas conversion and end up with a Spark DataFrame directly.

File Paths

 abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=1/*.csv
 abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=2/*.csv
......

Code

import pandas as pd
from pyspark.sql.utils import AnalysisException

pandas_dfs = []  # renamed from `list` to avoid shadowing the built-in

for month in range(1, 3):
    for day in range(1, 31):
        for hour in range(0, 24):
            file_location = "//xxxxxx/abc/year=2021/month=" + str(month) + "/dayofmonth=" + str(day) + "/hour=" + str(hour) + "/*.csv"
            try:
                spark_df = spark.read.format("csv").option("header", "true").load(file_location)
                pandas_dfs.append(spark_df.toPandas())
            except AnalysisException as e:
                # Raised, for example, when no files match the path
                print(e)

final_pandas_df = pd.concat(pandas_dfs)
df = spark.createDataFrame(final_pandas_df)

1 Answer

You can load all the files and apply a filter on the partitioning columns:

df = spark.read.format("csv").option("header", "true").load("abfss://xxxxxx/abc/").filter(
    'year = 2021 and month between 1 and 2 and dayofmonth between 1 and 30 and hour between 0 and 23'
)
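If the ranges change often, the filter string can be built by a small helper instead of being hard-coded. The following is only a sketch: `partition_filter` and `load_filtered` are hypothetical names, not part of Spark, and the column names (`year`, `month`, `dayofmonth`, `hour`) are assumed to match the folder names shown in the file paths.

```python
def partition_filter(year, months, days, hours):
    """Build a SQL filter string over the partition columns.

    months, days and hours are inclusive (low, high) pairs.
    """
    return (
        f"year = {year}"
        f" and month between {months[0]} and {months[1]}"
        f" and dayofmonth between {days[0]} and {days[1]}"
        f" and hour between {hours[0]} and {hours[1]}"
    )


def load_filtered(spark, base_path, condition):
    # base_path is assumed to be the directory that holds the year=... folders
    return (
        spark.read.format("csv")
        .option("header", "true")
        .load(base_path)
        .filter(condition)
    )


condition = partition_filter(2021, (1, 2), (1, 30), (0, 23))
print(condition)
```

With an active session, `load_filtered(spark, "abfss://xxxxxx/abc/", condition)` would then return the Spark DataFrame without going through pandas.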

6 Comments

The command just keeps running, even when I tried year = 2021 and month = 1 and day = 1 and hour between 0 and 10.
It might take some time if you have a lot of files or partitions. You could try adding year=2021 to the loading path... it might help reduce the number of partitions that need to be scanned.
sorry, that was a typo... you're right, it should be dayofmonth.
Just a small query: if I want to fetch data from Nov 2020 to Feb 2021, will month between 11 and 2 work?
no, you need to specify two conditions, one for 11/2020-12/2020 and one for 1/2021-2/2021, and combine them with or
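The two-condition approach from the last comment could look roughly like this. It is a sketch under assumptions: `month_range_filter` is a hypothetical helper, and the `year` and `month` column names are taken from the folder names in the question.

```python
def month_range_filter(start, end):
    """Build a SQL filter for an inclusive (year, month) range.

    start and end are (year, month) tuples, e.g. (2020, 11) and (2021, 2).
    """
    (y1, m1), (y2, m2) = start, end
    if (y1, m1) > (y2, m2):
        raise ValueError("start must not be after end")
    if y1 == y2:
        # Same year: a single between is enough
        return f"(year = {y1} and month between {m1} and {m2})"
    # Different years: tail of the first year, any full years, head of the last year
    parts = [f"(year = {y1} and month >= {m1})"]
    if y2 - y1 > 1:
        parts.append(f"(year between {y1 + 1} and {y2 - 1})")
    parts.append(f"(year = {y2} and month <= {m2})")
    return " or ".join(parts)


condition = month_range_filter((2020, 11), (2021, 2))
print(condition)
# (year = 2020 and month >= 11) or (year = 2021 and month <= 2)
```

The resulting string can be passed to `.filter(condition)` on the DataFrame loaded from the base path.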