
I am quite new to PySpark (and Spark) and have a concrete task to solve that is currently beyond my knowledge :).

I have a bunch of files of the following structure:

'File_A.dtx':

## Animal
# Header Start
Name, Type, isMammal
# Body Start
Hasi, Rabbit, yes
Birdi, Bird, no
Cathi, Cat, yes
## House
# Header Start
Street, Number
# Body Start
Main Street, 32
Buchengasse, 11

'File_B.dtx':

## Animal
# Header Start
Name, Type, isMammal
# Body Start
Diddi, Dog, yes
Eli, Elephant, yes
## House
# Header Start
Street, Number
# Body Start
Strauchweg, 13
Igelallee, 22

My anticipated result is two dataframes, as follows:

Animals:

| Filename   | Name    | Type     | isMammal    | 
| ---------- | ------- | -------- | ----------- | 
| File_A.dtx | Hasi    | Rabbit   | yes         | 
| File_A.dtx | Birdi   | Bird     | no          | 
| File_A.dtx | Cathi   | Cat      | yes         | 
| File_B.dtx | Diddi   | Dog      | yes         | 
| File_B.dtx | Eli     | Elephant | yes         | 

House:

| Filename   | Street       | Number   | 
| ---------- | ------------ | -------- | 
| File_A.dtx | Main Street  | 32       | 
| File_A.dtx | Buchengasse  | 11       | 
| File_B.dtx | Strauchweg   | 13       | 
| File_B.dtx | Igelallee    | 22       | 

The solution should work in parallel. It can operate per file, since each file is small (around 3 MB), but I have a lot of them.

Thanks so much for hints.

What I currently have is just:

from pyspark.sql.functions import input_file_name
df1 = spark.read.text(filelist).withColumn("Filename", input_file_name())

Now my main problem is: how do I split the dataframe according to the rows ## Animal and ## House and aggregate it again into dataframes to fulfill my task?


1 Answer


Assuming you know the format of the files beforehand, and that no two sections have the same number of columns, you can do the following:

  1. Remove comment lines (those starting with #) from the dataset
  2. Remove the header rows from the dataset
  3. Remove empty lines
  4. Split the remaining lines on , (also consuming the whitespace that follows it)
  5. Create animals_df as the subset of rows from step 4 whose split array has size 3, extracting the array values as columns
  6. Create house_df as the subset of rows from step 4 whose split array has size 2, extracting the array values as columns
from pyspark.sql.functions import element_at, input_file_name, length, col as c, split, size

filelist = ["File_A.dtx", "File_B.dtx"]

df1 = spark.read.text(filelist).withColumn("Filename", input_file_name())

# STEP 1: remove comment lines (those starting with '#')
comment_removed = df1.filter(~(c("value").startswith("#")))

# STEP 2: remove the known header rows
header_removed = comment_removed.filter(~(c("value").isin("Name, Type, isMammal", "Street, Number")))

# STEP 3
remove_empty_lines = header_removed.filter(length("value") > 0)

# STEP 4: split each line on "," (consuming any whitespace after the comma,
# so values carry no leading spaces) and reduce the full input path to the
# bare file name
processed_df = (
    remove_empty_lines
    .withColumn("value", split("value", ",\\s*"))
    .withColumn("Filename", element_at(split("Filename", "/"), -1))
    .cache()
)

# STEP 5: rows with three values belong to the Animal section
animals_df = processed_df.filter(size("value") == 3).selectExpr("Filename", "value[0] as Name", "value[1] as Type", "value[2] as isMammal")
animals_df.show()

"""
+----------+-----+---------+--------+
|  Filename| Name|     Type|isMammal|
+----------+-----+---------+--------+
|File_A.dtx| Hasi|   Rabbit|     yes|
|File_A.dtx|Birdi|     Bird|      no|
|File_A.dtx|Cathi|      Cat|     yes|
|File_B.dtx|Diddi|      Dog|     yes|
|File_B.dtx|  Eli| Elephant|     yes|
+----------+-----+---------+--------+
"""

# STEP 6: rows with two values belong to the House section; cast Number to int
house_df = processed_df.filter(size("value") == 2).selectExpr("Filename", "value[0] as Street", "cast(value[1] as int) as Number")
house_df.show()
"""
+----------+-----------+------+
|  Filename|     Street|Number|
+----------+-----------+------+
|File_A.dtx|Main Street|    32|
|File_A.dtx|Buchengasse|    11|
|File_B.dtx| Strauchweg|    13|
|File_B.dtx|  Igelallee|    22|
+----------+-----------+------+
"""

3 Comments

Thanks heaps... this is an excellent introduction to the topic! Exactly what I needed. For me, having worked only with 'normal' dataframes, getting used to the new syntax and handling is harder than expected!
You have used the nice trick that the number of columns differs between the two types (animals and house)... However, would it also be possible to use the ## Animal and ## House tags directly?
If the files have line numbers, or if you read the files as an RDD and add an index, then ## Animal and ## House can be used to identify the range of rows belonging to each section, as sketched below.
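To illustrate that comment, here is a minimal sketch of the index-based approach (not from the answer above: the window-based forward fill and the names indexed, tagged, body, and Section are assumptions of this sketch). It relies on zipWithIndex numbering lines in the order spark.read.text produced them, so the lines of one file stay contiguous:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

raw = spark.read.text(["File_A.dtx", "File_B.dtx"]) \
    .withColumn("Filename", F.element_at(F.split(F.input_file_name(), "/"), -1))

# Attach a global line index so row order within each file is preserved.
indexed = raw.rdd.zipWithIndex() \
    .map(lambda pair: (*pair[0], pair[1])) \
    .toDF(["value", "Filename", "idx"])

# On '## <Section>' lines extract the section name, otherwise leave null ...
section = F.when(F.col("value").startswith("## "), F.expr("substring(value, 4)"))

# ... and forward-fill the last seen section name within each file.
w = Window.partitionBy("Filename").orderBy("idx") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
tagged = indexed.withColumn("Section", F.last(section, ignorenulls=True).over(w))

# Keep only data rows, then slice per section by tag instead of column count.
body = tagged.filter(~F.col("value").startswith("#")) \
    .filter(F.length("value") > 0) \
    .filter(~F.col("value").isin("Name, Type, isMammal", "Street, Number")) \
    .withColumn("value", F.split("value", ",\\s*"))

animals_df = body.filter(F.col("Section") == "Animal") \
    .selectExpr("Filename", "value[0] as Name", "value[1] as Type", "value[2] as isMammal")
house_df = body.filter(F.col("Section") == "House") \
    .selectExpr("Filename", "value[0] as Street", "cast(value[1] as int) as Number")

Because the window is partitioned by Filename, the forward fill still runs in parallel across files, which matches the per-file parallelism asked for in the question.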
