I have a text file that reads like this:

This recipe can be made either with a stand mixer, or by hand with a bowl, a
wooden spoon, and strong arms. If you use salted butter, please omit the
added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour,
and baking powder in a bowl and set aside.

I want to extract the text between the words Ingredients and Method.
I have written the regex (?s)(?<=\bIngredients\b).*?(?=\bMethod\b)
to extract it, and it works fine.
But when I try to do the same thing in spark-shell, as follows, it doesn't give me
anything.

val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)

Please tell me where I am going wrong and what steps I should take to
correct this.
Thanks in advance!

  • b is an RDD, whereas regex.findAllIn applies to a String. So you would need to apply the regex to the Strings in the RDD, probably with something like map or flatMap. Commented Jun 3, 2015 at 7:46
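
To illustrate the comment above without Spark, here is a minimal plain-Python sketch. The list `lines` is a hypothetical stand-in for the RDD's elements, since sc.textFile yields one string per line. It applies the regex to each string, flatMap-style, and then to the text as a single string. Because Ingredients and Method sit on different lines, no single line can satisfy both the lookbehind and the lookahead.

```python
import re

# Same pattern as in the question; a raw string keeps \b as a word boundary.
PATTERN = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")

# Hypothetical sample: one string per line, as sc.textFile would produce.
lines = ["Yum", "Ingredients", "1 large egg", "1/4 teaspoon salt", "Method", "1 Preheat the oven"]

# flatMap-style application: run the regex on every element and flatten the matches.
per_line_matches = [m for line in lines for m in PATTERN.findall(line)]

# The same regex applied to the whole text as one string.
whole_text_matches = PATTERN.findall("\n".join(lines))

print(per_line_matches)
print(whole_text_matches)
```

So mapping the regex over the strings fixes the type problem, but the match itself only appears once the regex sees the whole file content in one string.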

1 Answer


What you need to do is:

  1. Read the file using wholeTextFiles, so that the file is not split into lines and its entire contents are read together. Each element is then a (path, content) pair.
  2. Write a function that takes a string and returns the string matched by that regex. It may look like this (in Python):

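import re

def getWhatIneed(s):
    # Return the text between "Ingredients" and "Method", or "" if there is no match
    m = re.search(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)", s)
    return m.group(0) if m else ""

# wholeTextFiles yields (path, content) pairs; values() keeps only the content
b = sc.wholeTextFiles(...)
c = b.values().map(getWhatIneed)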

Now c is also an RDD. You need to collect it before you print it; the output of collect is a normal array/list.

print(c.collect())
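
The extraction step can be tried without a Spark cluster. The following is a minimal sketch, not a definitive implementation: the function name `extract_ingredients`, the file paths, and the sample contents are all hypothetical, and the list of tuples only imitates the (path, content) pairs that wholeTextFiles produces.

```python
import re

# Same pattern as in the question; (?s) lets "." match newlines.
PATTERN = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")

def extract_ingredients(content):
    """Return the stripped text between 'Ingredients' and 'Method', or '' if not found."""
    match = PATTERN.search(content)
    return match.group(0).strip() if match else ""

# Hypothetical (path, content) pairs, imitating the shape produced by wholeTextFiles.
files = [
    ("/tmp/recipe.txt",
     "Yum\nIngredients\n1 large egg\n1 teaspoon vanilla extract\nMethod\n1 Preheat the oven"),
    ("/tmp/notes.txt", "No recipe here"),
]

# Plain-Python counterpart of rdd.values().map(extract_ingredients).collect()
results = [extract_ingredients(content) for _path, content in files]
print(results)
```

Returning an empty string for files without a match keeps the result a list of plain strings. Filtering those entries out afterwards is an equally reasonable choice.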

2 Comments

The problem is that the function should return a string, as you said, but the regex is of type scala.util.matching.Regex, so it gives a type mismatch error. What should my approach be in this case?
That's a Scala regex question, and I am no expert in that by any means. From the Scala API docs, it seems you need to pass your string to the Regex instance and then extract the output using methods like findAllMatchIn.
