
I have been trying to get the Databricks spark-csv library to work. I am trying to read a TSV created by Hive into a Spark DataFrame using the Scala API.

Here is an example that you can run in the spark shell (I made the sample data public so it works for you):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
// Loads with the default comma delimiter, so each tab-separated row ends up in a single column
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")

The documentation says you can specify the delimiter but I am unclear about how to specify that option.

3 Answers


All of the option parameters are passed in the option() function, as below:

val segments = sqlContext.read.format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .load("s3n://michaeldiscenza/data/test_segments")

3 Comments

For the native DataFrameReader with SparkSession the option is called "sep": spark.read.option("sep", "\t").csv("PATH")
I get a long error: "Traceback (most recent call last): File "/tmp/zeppelin_pyspark-1508289913406712111.py", line 367, in <module> Exception: Traceback (most recent call last): File "/tmp/zeppelin_pyspark-1508289913406712111.py", line 360, in <module>.... File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o929.load."
@Michael Discenza I think the answer needs to be updated for the latest version of Spark or the question should include the Spark version.

With Spark 2.0+, use the built-in CSV connector to avoid the third-party dependency and get better performance:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
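
If the Hive-written TSV has no header row, the built-in reader names the columns _c0, _c1, and so on; you can rename them with toDF. A minimal sketch, assuming a three-column file (the column names here are hypothetical):

val named = spark.read
    .option("sep", "\t")
    .csv("/path/to/file")
    .toDF("user_id", "segment", "score")  // hypothetical names; the count must match the file's columns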

1 Comment

This is the correct answer for newer Spark. I'd hardly call Databricks a third party, though, given how much they contribute to open-source Spark, and com.databricks.spark.csv is essentially what became that built-in CSV connector. But fair point generally.

You may also try inferSchema and check the resulting schema:

val df = spark.read.format("csv")
      .option("inferSchema", "true")
      .option("sep", "\t")
      .option("header", "true")
      .load(tmp_loc)

df.printSchema()
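
If inference is slow (it costs an extra pass over the data) or guesses the types wrong, you can supply an explicit schema instead, using the StructType imports from the question. A minimal sketch; the field names and types here are assumptions for illustration:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Hypothetical schema; adjust names and types to match the actual file
val schema = StructType(Seq(
  StructField("user_id", StringType, nullable = true),
  StructField("segment", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true)
))

val typed = spark.read.format("csv")
      .schema(schema)           // skip inference and use the declared types
      .option("sep", "\t")
      .option("header", "true")
      .load(tmp_loc)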
