
I am new to this and need help with the following issue.

I have a CSV file like this:

ANI,2974483123 29744423747 293744450542,Twitter,@ani

I need to split the second column ("2974483123 29744423747 293744450542") and create 3 rows, like this:

ANI,2974483123,Twitter,@ani

ANI,29744423747,Twitter,@ani

ANI,293744450542,Twitter,@ani

Can someone help me, please?

3 Answers


flatMap is what you're looking for:

import org.apache.spark.rdd.RDD

val input: RDD[String] = sc.parallelize(Seq("ANI,2974483123 29744423747 293744450542,Twitter,@ani"))
val csv: RDD[Array[String]] = input.map(_.split(','))

// Emit one record per space-separated value in the second field, copying the other fields through.
val result = csv.flatMap { case Array(s1, s2, s3, s4) => s2.split(" ").map(part => (s1, part, s3, s4)) }
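
For reference, collecting the result on the sample row above should print the three rows from the question:

result.collect().foreach(println)
// (ANI,2974483123,Twitter,@ani)
// (ANI,29744423747,Twitter,@ani)
// (ANI,293744450542,Twitter,@ani)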

1 Comment

What about focusing on only three countries (US, CA, MX)? Original record: ["MotelID", "BidDate", "HU", "UK", "NL", "US", "MX", "AU", "CA", "CN", "KR", "BE", "I", "JP", "IN", "HN", "GY", "DE"], [0000002,11-05-08-2016,0.92,1.68,0.81,0.68,1.59,,1.63,1.77,2.06,0.66,1.53,,0.32,0.88,0.83,1.01]. Keep only the three important ones (0000002,11-05-08-2016,1.59,,1.77), then transpose the record and include the related Losa in a separate column:

0000002,11-05-08-2016,US,1.59
0000002,11-05-08-2016,MX,
0000002,11-05-08-2016,CA,1.77

How do I get the above result?
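
A minimal sketch of one way to get that shape, assuming the header names and the wanted countries are known up front (the file path and variable names here are illustrative, not from the original post): zip the header with the split fields, keep only the wanted countries, and flatMap one row out per country.

import org.apache.spark.rdd.RDD

val header = Seq("MotelID", "BidDate", "HU", "UK", "NL", "US", "MX", "AU",
                 "CA", "CN", "KR", "BE", "I", "JP", "IN", "HN", "GY", "DE")
val wanted = Set("US", "MX", "CA")

// Hypothetical input path; header and wanted-country list are assumptions.
val records: RDD[String] = sc.textFile("/path/to/bids.csv")

val transposed = records.flatMap { line =>
  val fields = line.split(",", -1) // limit -1 keeps empty fields such as ",,"
  val (id, date) = (fields(0), fields(1))
  // Pair each header name with its value and keep only the wanted countries.
  header.zip(fields).collect {
    case (country, value) if wanted(country) => (id, date, country, value)
  }
}
// Each input record becomes one (MotelID, BidDate, country, value) row per wanted country.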

Here is a slightly different solution that takes advantage of the built-in SQL functions available in Spark. Ideally these should be used instead of custom UDFs to take advantage of the performance improvements provided by the query optimizer (https://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/).

import org.apache.spark.sql.functions.{split, explode}
import spark.implicits._ // for the $"colName" column syntax

val filename = "/path/to/file.csv"
val columns = Seq("col1", "col2", "col3", "col4")

val df = spark.read.csv(filename).toDF(columns: _*)

// Use the built-in "split" instead of writing your own split UDF,
// and "explode" instead of map followed by flatMap.
df.withColumn("col2", split($"col2", " ")).
  select($"col1", explode($"col2").as("col2"), $"col3", $"col4").take(10)



Pretty similar to Tzach's answer, but in Python 2 (the tuple-unpacking lambda below was removed in Python 3) and being careful about multi-space separators.

import re

rdd = sc.textFile("datasets/test.csv").map(lambda x: x.split(","))

print(rdd.take(1))
# re.split(" +", b) splits on one or more spaces, so runs of spaces don't produce empty values
print(rdd.map(lambda (a, b, c, d): [(a, number, c, d) for number in re.split(" +", b)])
         .flatMap(lambda x: x)
         .take(10))

#[[u'ANI', u'2974481249 2974444747 2974440542', u'Twitter', u'maximotussie']]
#[(u'ANI', u'2974481249', u'Twitter', u'maximotussie'), 
# (u'ANI', u'2974444747', u'Twitter', u'maximotussie'), 
# (u'ANI', u'2974440542', u'Twitter', u'maximotussie')]

