
This is a common error in Spark SQL; I tried all the other answers, but none made a difference! I want to read the following small CSV file from HDFS (or even the local filesystem).

+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
|  id| first_name| last_name|                ssn|      test1| test2| test3| test4| final| grade|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| 4.0|      Dandy|       Jim|        087-75-4321|       47.0|   1.0|  23.0|  36.0|  45.0|    C+|
|13.0|   Elephant|       Ima|        456-71-9012|       45.0|   1.0|  78.0|  88.0|  77.0|    B-|
|14.0|   Franklin|     Benny|        234-56-2890|       50.0|   1.0|  90.0|  80.0|  90.0|    B-|
|15.0|     George|       Boy|        345-67-3901|       40.0|   1.0|  11.0|  -1.0|   4.0|     B|
|16.0|  Heffalump|    Harvey|        632-79-9439|       30.0|   1.0|  20.0|  30.0|  40.0|     C|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+

Here is the code:

List<String> cols = new ArrayList<>();
Collections.addAll(cols, "id, first_name".replaceAll("\\s+", "").split(","));
Dataset<Row> temp = spark.read()
                .format("org.apache.spark.csv")
                .option("header", true)
                .option("inferSchema", true)
                .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
                .csv(path)
                .selectExpr(JavaConverters.asScalaIteratorConverter(cols.iterator()).asScala().toSeq());

but it fails with this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) 'first_name missing from  last_name#14,      test1#16,id#12, test4#19, ssn#15, test3#18, grade#21, test2#17, final#20, first_name#13 in operator 'Project [id#12, 'first_name];;
'Project [id#12, 'first_name]
+- Relation[id#12, first_name#13, last_name#14, ssn#15,      test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv

In some cases it works without error:

  1. If I don't select anything, it successfully returns all the data.
  2. If I select only the column "id".

I even tried using a temp view and the SQL API:

df.createOrReplaceTempView("csvFile");
spark.sql("SELECT id, first_name FROM csvFile").show();

but I got the same error!

I ran the same query on the same data read from a database instead, and it worked without any error.

I use Spark 2.2.1.

2 Comments

  • Please include a sample of the file. Commented Jun 11, 2018 at 13:30
  • The table at the first part of the question is the whole data in the CSV file. Commented Jun 11, 2018 at 15:05

2 Answers


There is no need to convert String[] --> List<String> --> Seq<String>.

Simply pass the array to selectExpr, because selectExpr accepts varargs.

Dataset<Row> temp = spark.read()
                .format("org.apache.spark.csv")
                .option("header", true)
                .option("inferSchema", true)
                .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
                .csv(path)
                .selectExpr("id, first_name".replaceAll("\\s+", "").split(","));

1 Comment

Good point to use String[], but I still got the error Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'first_name' given input columns: [ final, test1, first_name, test3, id, grade, test2, ssn, test4, last_name]; line 1 pos 0; 'Project [id#12, 'first_name] +- Relation[id#12, first_name#13, last_name#14, ssn#15, test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv

It was because of the structure of the CSV file! The header row contained extra whitespace, so the actual column name was " first_name" rather than "first_name" (you can see the stray spaces in names like "     test1" in the error output). I removed the whitespace from the CSV file and now it works!
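If editing the file by hand is not an option, the same fix can be applied after reading: trim every column name and rename the DataFrame with the cleaned names (in Java, something along the lines of `df.toDF(trimColumns(df.columns()))` — that Spark call is a hedged suggestion, not tested here). The class and method names below are made up for illustration; the core cleanup is plain string trimming:

```java
import java.util.Arrays;

public class TrimHeaders {
    // Return a copy of the column names with surrounding whitespace removed.
    static String[] trimColumns(String[] names) {
        return Arrays.stream(names).map(String::trim).toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Header names as Spark inferred them from the malformed CSV.
        String[] raw = {"id", " first_name", " last_name", "     test1"};
        System.out.println(String.join(",", trimColumns(raw)));
        // prints: id,first_name,last_name,test1
    }
}
```

Renaming once, right after the read, means every later select, selectExpr, or SQL query can use the clean names without backtick-quoting.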

