
I have a Spark DataFrame (DF1) with several columns: (user_uuid, url, date_visit)

I want to transform DF1 into a DataFrame 2 (DF2) of the form: (user_uuid, domain, date_visit)

What I wanted to use is a regular expression to detect the domain and apply it to DF1:

val regexpr = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r

Could you please help me compose the code to transform the DataFrames in Scala? I am completely new to Spark and Scala and the syntax is hard. Thanks!

4 Comments
  • "I am completely new to Spark and Scala" This is very much a "give me the code" question currently. What have you tried? How are you planning to learn Spark/Scala? SO is best used by trying something yourself and asking specific questions when you are stuck. As you should know, as you've been here for 4 years! Commented Aug 20, 2015 at 15:21
  • @Paul, this is really an atomic operation, but I was unable to find it on the internet. When you ask about, say, a regex, you don't present your ugly attempts, right? I evaluated the code I had and saw it makes no sense to include it here - it only defines a DF, so how could it have helped? Yes, I have been here for 4 years already, with plenty of other experience; lecturing me may feel pleasant, but for this question I don't find it appropriate. Commented Aug 20, 2015 at 16:57
  • I would be more convinced by evidence that you'd looked through the scaladoc for DataFrame. There aren't really many operations there, and even fewer that return a DataFrame. If your question had been around "how do I use select for this", it would have been clear you had put some work in, not just thrown this over the wall to SO. I'm not (intentionally) lecturing, but I'm trying to do my tiny bit to encourage people to do the work necessary to avoid wasting the time of the many others who might read a question. Got two comment upvotes too, so it looks like I'm not completely isolated on this. Commented Aug 20, 2015 at 17:03
  • @TheArchetypalPaul Over time, I see the answer to this question has helped at least 9 people who upvoted it. That proves it does not matter whom I should "convince" with my clumsy attempts before asking, as the question is helpful to the community anyway. Commented Aug 16, 2017 at 6:58

1 Answer


Spark >= 1.5:

You can use the regexp_extract function:

import org.apache.spark.sql.functions.regexp_extract

val pattern: String = ???
val groupIdx: Int = ???

df.withColumn("domain", regexp_extract($"url", pattern, groupIdx))
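
For example, with the regex from the question the bare domain is captured by the fifth group, so a concrete call could look roughly like this (df1, the column names and the group index are assumptions taken from the question; this is a sketch, not tested code):

// Sketch only: the pattern is the regex from the question; group 5 is the
// ([\w-\.]+) part that captures the bare domain.
val pattern = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)"""
val groupIdx = 5

val df2 = df1
  .withColumn("domain", regexp_extract(df1("url"), pattern, groupIdx))
  .select("user_uuid", "domain", "date_visit")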

Spark < 1.5.0:

Define a UDF:

import org.apache.spark.sql.functions.udf

val pattern: scala.util.matching.Regex = ???

// Returns the first match of `pattern` in `url`, or "unknown" if nothing matches
def getFirst(pattern: scala.util.matching.Regex) = udf(
  (url: String) => pattern.findFirstIn(url) match {
    case Some(domain) => domain
    case None => "unknown"
  }
)
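
As a quick sanity check of the matching logic outside Spark (the URL below is made up), note that findFirstIn returns the whole match, not only the bare domain:

// With the regex from the question, the match includes the scheme and the
// www. prefix, not just the domain:
val pattern = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r
pattern.findFirstIn("http://www.example.com/some/page")
// => Some(http://www.example.com)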

Use the defined UDF:

df.select(
  $"user_uuid",
  getFirst(pattern)($"url").alias("domain"),
  $"date_visit"
)

or register a temp table:

df.registerTempTable("df")

sqlContext.sql(s"""
  SELECT user_uuid, regexp_extract(url, '$pattern', $group_idx) AS domain, date_visit
  FROM df""")

Replace pattern with a valid Java regexp and group_idx with the index of the capturing group.
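
For instance, sticking with the regex from the question (as a plain String this time, so it can be interpolated) and group 5, the query could look roughly like this; depending on the SQL dialect, the backslashes inside the quoted pattern may need to be doubled:

// Sketch only: the pattern and group index come from the question's regex;
// group 5 is the ([\w-\.]+) part. Escaping of '\' inside the SQL string
// literal may vary between parsers/versions.
val pattern: String = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)"""
val group_idx: Int = 5

val domains = sqlContext.sql(s"""
  SELECT user_uuid, regexp_extract(url, '$pattern', $group_idx) AS domain, date_visit
  FROM df""")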


1 Comment

It worked, just don't forget to add a ' before the regex and a ' after.
