0

I have a textfile that is delimited with ~, I need to perform some parsing before converting to a dataframe. The code reads in a textfile as RDD[String] does some parsing. Then, it converts to RDD[Row]. Then with the schema creates a dataframe.

So below is the following code I have. It works but the problem is the actual schema is 400 fields long. I was wondering if there is a simpler way than typing attributes(1), attributes(2), attributes(3)... so on.

I am currently on Spark 1.6. CDH 5.2.2

Example Input:

20161481132310 ~     ~"This" is a comma 10

Current Code:

val schema_1 = StructType(Array(
StructField("EXAMPLE_1", StringType, true),
StructField("EXAMPLE_2", StringType, true),
StructField("EXAMPLE_3", StringType, true)))

val rdd = sc.textFile("example.txt")
val rdd_truncate =  rdd.map(_.split("~").map(_.trim).mkString("~"))
val row_final = rdd_truncate
  .map(_.split("~"))
  .map(attributes => Row(attributes(0),
    attributes(1),
    attributes(2)))

val df = sqlContext.createDataFrame(row_final, schema_1)

Based on suggestion I modified for following. It works except on quotes. The "This" in the input will fail. Any suggestions?

val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter","~")
      .schema(schema)
      .load("example.txt")
val df_final = df.select(df.columns.map(c =>trim(col(c)).alias(c)): _*)

1 Answer 1

3

Just use standard CSV reader:

spark.read.schema(schema).option("delimiter", "~").csv("example.txt")

If you want to trim fields just use select:

import org.apache.spark.sql.functions.{col, trim}

df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)

If you use Spark 1.x you can use spark-csv:

sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .option("delimiter", "~")
  .load("example.txt")

If this is for some reason not sufficient you can use Row.fromSeq:

Row.fromSeq(line.split("~").take(3))
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.