I have a text file that reads like this:
This recipe can be made either with a stand mixer, or by hand with a bowl, a
wooden spoon, and strong arms. If you use salted butter, please omit the
added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour,
and baking powder in a bowl and set aside.
I want to extract the text between the words Ingredients and Method.
I have written a regex (?s)(?<=\bIngredients\b).*?(?=\bMethod\b)
to extract the data and it's working fine.
But when I try to do that in spark-shell as follows, it doesn't give me
anything.
val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)
Please tell me where I am going wrong and what steps I should take to
correct this.
Thanks in advance!
b is an RDD, while regex.findAllIn applies to a String. So you would need to apply the regex to the Strings in the RDD, probably with something like map or flatMap.
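
A rough sketch of that idea follows, as one possible approach rather than the definitive fix. Two further points probably matter here. First, sc.textFile produces one RDD element per line, so a pattern that spans several lines cannot match within any single element; sc.wholeTextFiles, which yields (path, content) pairs, is one way to get the whole file as a single String. Second, inside an ordinary double-quoted Scala string, \b is read as a backspace character rather than a regex word boundary, so the pattern below uses triple quotes. The helper name extract is made up for illustration, and the Spark lines are left as comments because they assume the sc that spark-shell provides.

```scala
// Triple quotes keep \b as a regex word boundary; in a plain "..." literal
// Scala would turn \b into a backspace character and the pattern would never match.
val regex = """(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)""".r

// Hypothetical helper: returns every block of text found between the two
// words, with the surrounding newlines trimmed off.
def extract(content: String): List[String] =
  regex.findAllIn(content).map(_.trim).toList

// In spark-shell (assumes the `sc` SparkContext that the shell creates):
// wholeTextFiles gives (path, fullContent) pairs, so the regex sees the
// entire file instead of one line at a time.
//
// val blocks = sc.wholeTextFiles("/home/akshat/file.txt")
//   .flatMap { case (_, content) => extract(content) }
// blocks.collect().foreach(println)
```

With the file from the question, this should print the ingredient lines only. wholeTextFiles loads each file fully into memory, so it suits many small files better than one very large one.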