
I have been trying to get the Databricks spark-csv library to work. I am trying to read a TSV created by Hive into a Spark DataFrame using the Scala API.

Here is an example that you can run in the spark shell (I made the sample data public so it works for you):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
// Loads with the default comma delimiter, so each tab-separated row ends up in a single column
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")

The documentation says you can specify the delimiter but I am unclear about how to specify that option.

3 Answers


All of the option parameters are passed in the option() function, as below:

val segments = sqlContext.read.format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .load("s3n://michaeldiscenza/data/test_segments")

3 Comments

For the native DataFrameReader with SparkSession the option is called "sep": spark.read.option("sep", "\t").csv("PATH")
I get a long error: "Traceback (most recent call last): File "/tmp/zeppelin_pyspark-1508289913406712111.py", line 367, in <module> Exception: Traceback (most recent call last): File "/tmp/zeppelin_pyspark-1508289913406712111.py", line 360, in <module>.... File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o929.load."
@Michael Discenza I think the answer needs to be updated for the latest version of Spark or the question should include the Spark version.

With Spark 2.0+, use the built-in CSV connector to avoid the third-party dependency and get better performance:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
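
If the Hive-written TSV has no header row, the built-in reader names the columns _c0, _c1, and so on; you can rename them with toDF. A minimal sketch, assuming a three-column file (the column names here are hypothetical):

val named = spark.read
    .option("sep", "\t")
    .csv("/path/to/file")
    .toDF("user_id", "segment", "score")  // hypothetical names; the count must match the file's columns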

1 Comment

This is the correct answer for newer Spark. I'd hardly call Databricks a third party, though, given how much they contribute to open-source Spark, and com.databricks.spark.csv is essentially what became that built-in CSV connector. But fair point generally.

You may also try inferSchema and check the resulting schema:

val df = spark.read.format("csv")
      .option("inferSchema", "true")
      .option("sep", "\t")
      .option("header", "true")
      .load(tmp_loc)

df.printSchema()
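
If inference is slow (it costs an extra pass over the data) or guesses the types wrong, you can supply an explicit schema instead, using the StructType imports from the question. A minimal sketch; the field names and types here are assumptions for illustration:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Hypothetical schema; adjust names and types to match the actual file
val schema = StructType(Seq(
  StructField("user_id", StringType, nullable = true),
  StructField("segment", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true)
))

val typed = spark.read.format("csv")
      .schema(schema)           // skip inference and use the declared types
      .option("sep", "\t")
      .option("header", "true")
      .load(tmp_loc)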
