I have a huge dataframe where a column "categories" has various attributes of a business i.e. whether it's a Restaurant, Laundry Service, Disco Theque etc. What I need is to be able to .filter the dataframe such that each row which contains Restaurant may be seen. The problem here is that "categories" is an array of strings where a cell may be like: "Restaurants, Food, Nightlife". Any ideas? (Scala [2.10.6] Spark [2.0.1] Hadoop [2.7.2])
I have tried SQL style queries like:
val countResult = sqlContext.sql(
"SELECT business.neighborhood, business.state, business.stars, business.categories
FROM business where business.categories == Restaurants group by business.state"
).collect()
display(countResult)
and
dfBusiness.filter($"categories" == "Restaurants").show()
and
dfBusiness.filter($"categories" == ["Restaurants"]).show()
I think I might need to iterate over each cell but I have no idea on how to do that.
Any ideas?