This is a common error in Spark SQL; I have tried the other answers, but none of them made a difference. I want to read the following small CSV file from HDFS (or even from the local filesystem).
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| id| first_name| last_name| ssn| test1| test2| test3| test4| final| grade|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
| 4.0| Dandy| Jim| 087-75-4321| 47.0| 1.0| 23.0| 36.0| 45.0| C+|
|13.0| Elephant| Ima| 456-71-9012| 45.0| 1.0| 78.0| 88.0| 77.0| B-|
|14.0| Franklin| Benny| 234-56-2890| 50.0| 1.0| 90.0| 80.0| 90.0| B-|
|15.0| George| Boy| 345-67-3901| 40.0| 1.0| 11.0| -1.0| 4.0| B|
|16.0| Heffalump| Harvey| 632-79-9439| 30.0| 1.0| 20.0| 30.0| 40.0| C|
+----+-----------+----------+-------------------+-----------+------+------+------+------+------+
Here is the code:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.collection.JavaConverters;

// Build the list of column names to select ("id" and "first_name").
List<String> cols = new ArrayList<>();
Collections.addAll(cols, "id, first_name".replaceAll("\\s+", "").split(","));
// Read the CSV with a header and an inferred schema, then select the two columns.
Dataset<Row> temp = spark.read()
        .format("org.apache.spark.csv")
        .option("header", true)
        .option("inferSchema", true)
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
        .csv(path)
        .selectExpr(JavaConverters.asScalaIteratorConverter(cols.iterator()).asScala().toSeq());
but it fails with this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) 'first_name missing from last_name#14, test1#16,id#12, test4#19, ssn#15, test3#18, grade#21, test2#17, final#20, first_name#13 in operator 'Project [id#12, 'first_name];;
'Project [id#12, 'first_name]
+- Relation[id#12, first_name#13, last_name#14, ssn#15, test1#16, test2#17, test3#18, test4#19, final#20, grade#21] csv
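As an aside, the Scala Seq conversion should not even be necessary, since selectExpr also has a varargs String... overload in the Java API; here is a minimal sketch of the same select written that way (spark and path as above):

// Minimal sketch: the same two-column select via the varargs overload
// of selectExpr, with no Java-to-Scala conversion.
Dataset<Row> temp2 = spark.read()
        .option("header", true)
        .option("inferSchema", true)
        .csv(path)
        .selectExpr("id", "first_name");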
In some cases it works without error:
- If I don't select anything, it successfully returns all the data.
- If I select only the column "id" (sketched right after this list).
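A sketch of those two cases, using the same DataFrame df (the second call presumably fails with the same AnalysisException):

// Sketch of the two cases against the DataFrame "df" read above.
df.select("id").show();               // works: prints the id column
df.select("id", "first_name").show(); // presumably fails like selectExpr above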
I even tried using a temp view and the SQL method:
df.createOrReplaceTempView("csvFile");
spark.sql("SELECT id, first_name FROM csvFile").show();
but I got the same error!
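One diagnostic I can think of is to print the exact column names Spark parsed from the header, in case one of them carries an invisible character (e.g. a BOM); a minimal sketch:

// Sketch: print each parsed header name with its length and code points,
// to rule out invisible characters hiding in the CSV header.
for (String name : df.columns()) {
    System.out.print("'" + name + "' length=" + name.length() + " chars:");
    name.chars().forEach(c -> System.out.printf(" U+%04X", c));
    System.out.println();
}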
I ran the same code against the same data read from a database, and it worked without any error.
I use Spark 2.2.1.