I am quite new to PySpark (and Spark) and have a concrete task to solve that is currently beyond my knowledge :).
I have a bunch of files with the following structure:
'File_A.dtx':
## Animal
# Header Start
Name, Type, isMammal
# Body Start
Hasi, Rabbit, yes
Birdi, Bird, no
Cathi, Cat, yes
## House
# Header Start
Street, Number
# Body Start
Main Street, 32
Buchengasse, 11
'File_B.dtx':
## Animal
# Header Start
Name, Type, isMammal
# Body Start
Diddi, Dog, yes
Eli, Elephant, yes
## House
# Header Start
Street, Number
# Body Start
Strauchweg, 13
Igelallee, 22
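In plain Python I can already parse a single file with a little state machine like the sketch below (the helper name parse_file and the dict-per-row output are just my own choices, and the code is only lightly tested against the samples above):

def parse_file(filename, content):
    """Yield (section, row_dict) for every body line of one file."""
    section, header, state = None, None, None
    for raw in content.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("## "):              # section marker, e.g. "## Animal"
            section, header, state = line[3:].strip(), None, None
        elif line.startswith("# Header Start"):
            state = "header"                    # the next line holds the column names
        elif line.startswith("# Body Start"):
            state = "body"                      # the following lines are data rows
        elif state == "header":
            header = [c.strip() for c in line.split(",")]
        elif state == "body" and header:
            values = [c.strip() for c in line.split(",")]
            yield section, dict(zip(["Filename"] + header, [filename] + values))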
My anticipated result is two DataFrames, as follows:
Animals:
| Filename | Name | Type | isMammal |
| ---------- | ------- | -------- | ----------- |
| File_A.dtx | Hasi | Rabbit | yes |
| File_A.dtx | Birdi | Bird | no |
| File_A.dtx | Cathi | Cat | yes |
| File_B.dtx | Diddi | Dog | yes |
| File_B.dtx | Eli | Elephant | yes |
House:
| Filename | Street | Number |
| ---------- | ------------ | -------- |
| File_A.dtx | Main Street | 32 |
| File_A.dtx | Buchengasse | 11 |
| File_B.dtx | Strauchweg | 13 |
| File_B.dtx | Igelallee | 22 |
The solution should run in parallel. It can work per file, since each file is small (around 3 MB), but I have a lot of them.
Thanks so much for any hints.
What I currently have is just:
from pyspark.sql.functions import input_file_name

# read every file line by line, tagging each line with the path of its source file
df1 = spark.read.text(filelist).withColumn("Filename", input_file_name())
Now my main problem is: how do I split the DataFrame at the ## Animal and ## House marker rows and aggregate the pieces back into the two DataFrames above?
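In case it helps to see the direction I am imagining: sc.wholeTextFiles would give me one (path, content) record per file, so each file could be parsed independently with the parse_file sketch from above and the results filtered into the two DataFrames. This is an untested sketch; the glob path is a placeholder and the hard-coded column lists simply mirror the tables above:

import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one (path, content) record per file, so the parsing parallelises across files
files = spark.sparkContext.wholeTextFiles("/data/dtx/*.dtx")  # placeholder path

# reuse parse_file from above; keep only the file name, not the full URI
parsed = files.flatMap(lambda kv: parse_file(os.path.basename(kv[0]), kv[1]))
parsed.cache()  # the RDD feeds two DataFrames below

animal_cols = ["Filename", "Name", "Type", "isMammal"]
animals = spark.createDataFrame(
    parsed.filter(lambda t: t[0] == "Animal")
          .map(lambda t: [t[1][c] for c in animal_cols]),
    schema=animal_cols)

house_cols = ["Filename", "Street", "Number"]
houses = spark.createDataFrame(
    parsed.filter(lambda t: t[0] == "House")
          .map(lambda t: [t[1][c] for c in house_cols]),
    schema=house_cols)

Is something along these lines reasonable, or is there a more idiomatic way to do the split starting from the line-based DataFrame above?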