
I have a huge DataFrame where a column "categories" holds various attributes of a business, e.g. whether it's a Restaurant, Laundry Service, Discotheque, etc. I need to be able to .filter the DataFrame so that every row containing "Restaurants" is kept. The problem is that "categories" is an array of strings, so a cell may look like: "Restaurants, Food, Nightlife". Any ideas? (Scala 2.10.6, Spark 2.0.1, Hadoop 2.7.2)

I have tried SQL style queries like:

val countResult = sqlContext.sql(
  """SELECT business.neighborhood, business.state, business.stars, business.categories
     FROM business WHERE business.categories == Restaurants GROUP BY business.state"""
).collect()
display(countResult) 

and

dfBusiness.filter($"categories" == "Restaurants").show()

and

dfBusiness.filter($"categories" == ["Restaurants"]).show() 

I think I might need to iterate over each cell but I have no idea on how to do that.

Any ideas?


1 Answer


The functions library can be very helpful for processing columns in a DataFrame. In this case, array_contains should provide what you need:

dfBusiness.filter(array_contains($"categories", "Restaurants"))

This keeps only the rows whose categories array contains a "Restaurants" element, filtering out everything else.
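A minimal, self-contained sketch of the idea (the sample business names and data here are hypothetical, not from your dataset):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array_contains

object FilterByCategory {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-by-category")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: each row has a name and an array of categories
    val dfBusiness = Seq(
      ("Mario's",   Array("Restaurants", "Food", "Nightlife")),
      ("SpinCycle", Array("Laundry Service")),
      ("Club 99",   Array("Nightlife", "Disco"))
    ).toDF("name", "categories")

    // array_contains builds a boolean Column, so it plugs straight into filter
    val restaurants = dfBusiness.filter(array_contains($"categories", "Restaurants"))
    restaurants.show()

    spark.stop()
  }
}
```

Note that array_contains requires the column to actually be of ArrayType. If your categories column turns out to be a single comma-separated string instead, split it first, e.g. `array_contains(split($"categories", ",\\s*"), "Restaurants")`.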

