
I am new to this and need help with the following issue.

I have a CSV file like this:

ANI,2974483123 29744423747 293744450542,Twitter,@ani

I need to split the second column ("2974483123 29744423747 293744450542") and create 3 rows, like this:

ANI,2974483123,Twitter,@ani

ANI,29744423747,Twitter,@ani

ANI,293744450542,Twitter,@ani

Can someone help me, please?

3 Answers


flatMap is what you're looking for:

import org.apache.spark.rdd.RDD

val input: RDD[String] = sc.parallelize(Seq("ANI,2974483123 29744423747 293744450542,Twitter,@ani"))
val csv: RDD[Array[String]] = input.map(_.split(','))

// Emit one record per space-separated value in the second field, copying the other fields through.
val result = csv.flatMap { case Array(s1, s2, s3, s4) => s2.split(" ").map(part => (s1, part, s3, s4)) }
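
For reference, collecting the result on the sample row above should print the three rows from the question:

result.collect().foreach(println)
// (ANI,2974483123,Twitter,@ani)
// (ANI,29744423747,Twitter,@ani)
// (ANI,293744450542,Twitter,@ani)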

1 Comment

What about focusing on only three countries (US, CA, MX)? Original record: ["MotelID", "BidDate", "HU", "UK", "NL", "US", "MX", "AU", "CA", "CN", "KR", "BE", "I", "JP", "IN", "HN", "GY", "DE"], [0000002,11-05-08-2016,0.92,1.68,0.81,0.68,1.59,,1.63,1.77,2.06,0.66,1.53,,0.32,0.88,0.83,1.01]. Keep only the three important ones (0000002,11-05-08-2016,1.59,,1.77), then transpose the record and include the related Losa in a separate column:

0000002,11-05-08-2016,US,1.59
0000002,11-05-08-2016,MX,
0000002,11-05-08-2016,CA,1.77

How do I get the above result?
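
A minimal sketch of one way to get that shape, assuming the header names and the wanted countries are known up front (the file path and variable names here are illustrative, not from the original post): zip the header with the split fields, keep only the wanted countries, and flatMap one row out per country.

import org.apache.spark.rdd.RDD

val header = Seq("MotelID", "BidDate", "HU", "UK", "NL", "US", "MX", "AU",
                 "CA", "CN", "KR", "BE", "I", "JP", "IN", "HN", "GY", "DE")
val wanted = Set("US", "MX", "CA")

// Hypothetical input path; header and wanted-country list are assumptions.
val records: RDD[String] = sc.textFile("/path/to/bids.csv")

val transposed = records.flatMap { line =>
  val fields = line.split(",", -1) // limit -1 keeps empty fields such as ",,"
  val (id, date) = (fields(0), fields(1))
  // Pair each header name with its value and keep only the wanted countries.
  header.zip(fields).collect {
    case (country, value) if wanted(country) => (id, date, country, value)
  }
}
// Each input record becomes one (MotelID, BidDate, country, value) row per wanted country.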

Here is a slightly different solution that takes advantage of the built-in SQL functions available in Spark. Ideally these should be used instead of custom UDFs to take advantage of the performance improvements provided by the query optimizer (https://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/).

import org.apache.spark.sql.functions.{split, explode}
import spark.implicits._ // for the $"colName" column syntax

val filename = "/path/to/file.csv"
val columns = Seq("col1", "col2", "col3", "col4")

val df = spark.read.csv(filename).toDF(columns: _*)

// Use the built-in "split" instead of writing your own split UDF,
// and "explode" instead of map followed by flatMap.
df.withColumn("col2", split($"col2", " ")).
  select($"col1", explode($"col2").as("col2"), $"col3", $"col4").take(10)



Pretty similar to Tzach's answer, but in Python 2 (the tuple-unpacking lambda below was removed in Python 3) and being careful about multi-space separators.

import re

rdd = sc.textFile("datasets/test.csv").map(lambda x: x.split(","))

print(rdd.take(1))
# re.split(" +", b) splits on one or more spaces, so runs of spaces don't produce empty values
print(rdd.map(lambda (a, b, c, d): [(a, number, c, d) for number in re.split(" +", b)])
         .flatMap(lambda x: x)
         .take(10))

#[[u'ANI', u'2974481249 2974444747 2974440542', u'Twitter', u'maximotussie']]
#[(u'ANI', u'2974481249', u'Twitter', u'maximotussie'), 
# (u'ANI', u'2974444747', u'Twitter', u'maximotussie'), 
# (u'ANI', u'2974440542', u'Twitter', u'maximotussie')]

