
This is a common error in Spark SQL; I tried all the other answers, but none made a difference! I want to read the following small CSV file from HDFS (or even the local filesystem).

+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
|  id| first_name| last_name|                ssn|      test1| test2| test3| test4| final| grade|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| 4.0|      Dandy|       Jim|        087-75-4321|       47.0|   1.0|  23.0|  36.0|  45.0|    C+|
|13.0|   Elephant|       Ima|        456-71-9012|       45.0|   1.0|  78.0|  88.0|  77.0|    B-|
|14.0|   Franklin|     Benny|        234-56-2890|       50.0|   1.0|  90.0|  80.0|  90.0|    B-|
|15.0|     George|       Boy|        345-67-3901|       40.0|   1.0|  11.0|  -1.0|   4.0|     B|
|16.0|  Heffalump|    Harvey|        632-79-9439|       30.0|   1.0|  20.0|  30.0|  40.0|     C|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+

Here is the code:

List<String> cols = new ArrayList<>();
Collections.addAll(cols, "id, first_name".replaceAll("\\s+", "").split(","));
Dataset<Row> temp = spark.read()
                .format("org.apache.spark.csv")
                .option("header", true)
                .option("inferSchema", true)
                .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
                .csv(path)
                .selectExpr(JavaConverters.asScalaIteratorConverter(cols.iterator()).asScala().toSeq());

but it fails with this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) 'first_name missing from  last_name#14,      test1#16,id#12, test4#19, ssn#15, test3#18, grade#21, test2#17, final#20, first_name#13 in operator 'Project [id#12, 'first_name];;
'Project [id#12, 'first_name]
+- Relation[id#12, first_name#13, last_name#14, ssn#15,      test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv

In some cases it works without error:

  1. If I don't select anything, it successfully returns all the data.
  2. If I select only the column "id".

I even tried using a temp view and the SQL API:

df.createOrReplaceTempView("csvFile");
spark.sql("SELECT id, first_name FROM csvFile").show();

but I got the same error!

I ran the same query on the same data read from a database instead, and it worked without any error.

I use Spark 2.2.1.

2 Comments

  • Please include a sample of the file. Commented Jun 11, 2018 at 13:30
  • The table at the first part of the question is the whole data in the CSV file. Commented Jun 11, 2018 at 15:05

2 Answers


There is no need to convert String[] --> List<String> --> Seq<String>.

Simply pass the array to selectExpr, because selectExpr accepts varargs.

Dataset<Row> temp = spark.read()
                .format("org.apache.spark.csv")
                .option("header", true)
                .option("inferSchema", true)
                .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
                .csv(path)
                .selectExpr("id, first_name".replaceAll("\\s+", "").split(","));

1 Comment

Good point to use String[], but I still got the error Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'first_name' given input columns: [ final, test1, first_name, test3, id, grade, test2, ssn, test4, last_name]; line 1 pos 0; 'Project [id#12, 'first_name] +- Relation[id#12, first_name#13, last_name#14, ssn#15, test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv

It was because of the structure of the CSV file! The header row contained extra whitespace, so the actual column name was " first_name" rather than "first_name" (you can see the stray spaces in names like "     test1" in the error output). I removed the whitespace from the CSV file and now it works!
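If editing the file by hand is not an option, the same fix can be applied after reading: trim every column name and rename the DataFrame with the cleaned names (in Java, something along the lines of `df.toDF(trimColumns(df.columns()))` — that Spark call is a hedged suggestion, not tested here). The class and method names below are made up for illustration; the core cleanup is plain string trimming:

```java
import java.util.Arrays;

public class TrimHeaders {
    // Return a copy of the column names with surrounding whitespace removed.
    static String[] trimColumns(String[] names) {
        return Arrays.stream(names).map(String::trim).toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Header names as Spark inferred them from the malformed CSV.
        String[] raw = {"id", " first_name", " last_name", "     test1"};
        System.out.println(String.join(",", trimColumns(raw)));
        // prints: id,first_name,last_name,test1
    }
}
```

Renaming once, right after the read, means every later select, selectExpr, or SQL query can use the clean names without backtick-quoting.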

