Using map() and filter() in Spark instead of spark.sql

Question

I have two datasets that I want to INNER JOIN to get a new table with the desired data. I used SQL and managed to get it. Now I want to try it with map() and filter(). Is that possible?

This is my code using Spark SQL:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object hello {
      def main(args: Array[String]): Unit = {

        val conf = new SparkConf()
          .setMaster("local")
          .setAppName("quest9")

        val sc = new SparkContext(conf)
        val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()

        val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
        val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")

        census.createOrReplaceTempView("census")
        zip_codes.createOrReplaceTempView("zip")

        //val query = spark.sql("SELECT * FROM census")

        val query = spark.sql("SELECT DISTINCT census.Total_Males AS male, census.Total_Females AS female FROM census INNER JOIN zip ON census.Zip_Code=zip.Zip_Code WHERE zip.City = 'Inglewood' AND zip.County = 'Los Angeles'")

        query.show()

        query.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")

        sc.stop()
      }
    }

It seems more sensible to use dataframe.join() than map or filter. Why would you not use it? See stackoverflow.com/questions/40343625/… or jaceklaskowski.gitbooks.io/mastering-spark-sql/… or stackoverflow.com/questions/36800174/…
– GPI, Commented Jan 3, 2020 at 14:43
@GPI Because I was instructed to do it this way, for some reason.
– João Bernardo, Commented Jan 3, 2020 at 14:50

GPI · Accepted Answer · 2020-01-03 16:02:34Z

1

In general, the only sensible way to do this is to use the join() method of `Dataset`. I would urge you to question the need to use only map/filter for this, as it is not intuitive and will probably confuse any experienced Spark developer (or, simply put, make them roll their eyes). It may also lead to scalability issues should the dataset grow.

That said, in your use case it is pretty simple to avoid using join. Instead, you can issue two separate jobs to Spark:

1. Fetch the zip code(s) that interest you.
2. Filter the census data on that (those) zip code(s).

Step 1: collect the zip codes of interest (I am not sure of the exact syntax, as I do not have a Spark shell at hand, but it should be easy to find the right one).

    // Requires `import spark.implicits._` for the String encoder used by map()
    val codes: Seq[String] = zip_codes
      // keep only the rows for the city of interest
      .filter(row => row.getAs[String]("City") == "Inglewood")
      // keep only the rows for the county of interest
      .filter(row => row.getAs[String]("County") == "Los Angeles")
      // map each remaining row to its zip code as a String
      .map(row => row.getAs[String]("Zip_Code"))
      // collect the (small) result on the driver side
      .collect()
      .toSeq

Then again, writing it this way instead of using select/where looks pretty strange to anyone used to Spark.

Still, this works because we can be sure that the set of zip codes matching a given town and county will be really small, so it is safe to collect the result on the driver side.

Now on to step 2:

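    // keep the census rows for the collected zip codes, then project to the two columns of interest
    census.filter(row => codes.contains(row.getAs[String]("Zip_Code")))
          .map(row => (row.getAs[String]("Total_Males"), row.getAs[String]("Total_Females")))
          .distinct()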
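
As a minimal end-to-end sketch of the two steps, the final map() can project each row to just the two columns the question's SQL selects (`Total_Males` and `Total_Females`). The sketch assumes a few made-up in-memory rows in place of the CSV files; the object name `MapFilterSketch` and the helper `run` are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.SparkSession

object MapFilterSketch {
  // Hypothetical helper: runs both steps on made-up data and returns (male, female) pairs.
  def run(spark: SparkSession): Array[(String, String)] = {
    import spark.implicits._ // encoders for String and tuples, plus toDF

    // Made-up rows standing in for zip.csv and census.csv (all columns are strings,
    // as they would be when reading CSV without schema inference)
    val zipCodes = Seq(
      ("90301", "Inglewood", "Los Angeles"),
      ("90210", "Beverly Hills", "Los Angeles")
    ).toDF("Zip_Code", "City", "County")

    val census = Seq(
      ("90301", "100", "110"),
      ("90210", "200", "210")
    ).toDF("Zip_Code", "Total_Males", "Total_Females")

    // Step 1: collect the zip codes of interest on the driver
    val codes: Set[String] = zipCodes
      .filter(row => row.getAs[String]("City") == "Inglewood")
      .filter(row => row.getAs[String]("County") == "Los Angeles")
      .map(row => row.getAs[String]("Zip_Code"))
      .collect()
      .toSet

    // Step 2: filter the census rows, then keep only the two wanted columns
    census
      .filter(row => codes.contains(row.getAs[String]("Zip_Code")))
      .map(row => (row.getAs[String]("Total_Males"), row.getAs[String]("Total_Females")))
      .distinct()
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("map-filter-sketch").master("local[*]").getOrCreate()
    try run(spark).foreach(println)
    finally spark.stop()
  }
}
```

Calling .collect() directly after filter() returns whole rows, which is why unwanted columns show up; the map() is the step that drops them.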

edited Jan 3, 2020 at 16:02

answered Jan 3, 2020 at 15:55

2 Comments

João Bernardo Over a year ago

Thank you for the help. Since I'm in training, I bet it's just to increase my knowledge. Anyway, you helped me a lot.

João Bernardo Over a year ago

I just didn't quite get the last .map() in step two. I did a .collect() instead, and it gets what I want plus the columns I don't want.

Raphael Roth · Accepted Answer · 2020-01-03 15:56:31Z

0

What you need is a join; your query roughly translates to:

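    import org.apache.spark.sql.functions.broadcast
    import spark.implicits._ // enables the $"column" syntax

    census.as("census")
      .join(
        // the filtered zip table is small, so hint Spark to broadcast it
        broadcast(
          zip_codes
            .where($"City" === "Inglewood")
            .where($"County" === "Los Angeles")
            .as("zip")
        ),
        Seq("Zip_Code"),
        "inner" // "leftsemi" would also be sufficient
      )
      .select(
        $"census.Total_Males".as("male"),
        $"census.Total_Females".as("female")
      )
      .distinct()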
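
The "leftsemi" variant mentioned in the comment can be sketched as follows. A left semi join keeps only the census columns and never duplicates a census row, which is enough here because nothing from the zip table is selected. The data is made up, and the object name `LeftSemiSketch` and the helper `inglewoodCounts` are hypothetical names for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object LeftSemiSketch {
  // Hypothetical helper: distinct (male, female) rows for Inglewood, Los Angeles.
  def inglewoodCounts(spark: SparkSession, census: DataFrame, zipCodes: DataFrame): DataFrame = {
    import spark.implicits._ // enables the $"column" syntax

    // Only the join key is needed from the zip side
    val wanted = zipCodes
      .where($"City" === "Inglewood")
      .where($"County" === "Los Angeles")
      .select("Zip_Code")

    census
      // leftsemi: keep census rows that have at least one match in `wanted`
      .join(broadcast(wanted), Seq("Zip_Code"), "leftsemi")
      .select($"Total_Males".as("male"), $"Total_Females".as("female"))
      .distinct()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("left-semi-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for zip.csv and census.csv
    val zipCodes = Seq(
      ("90301", "Inglewood", "Los Angeles"),
      ("90302", "Inglewood", "Los Angeles"),
      ("90210", "Beverly Hills", "Los Angeles")
    ).toDF("Zip_Code", "City", "County")

    val census = Seq(
      ("90301", "100", "110"),
      ("90302", "100", "110"),
      ("90210", "200", "210")
    ).toDF("Zip_Code", "Total_Males", "Total_Females")

    try inglewoodCounts(spark, census, zipCodes).show()
    finally spark.stop()
  }
}
```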

answered Jan 3, 2020 at 15:56

1 Comment

João Bernardo Over a year ago

Appreciate the help !
