After reading a CSV file into a Dataset, I want to remove spaces from all String-type data using the Java API.
Apache Spark 2.0.0
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row, String>() {
    @Override
    public String call(Row value) throws Exception {
        // This removes spaces from the first column only.
        return value.getString(0).replace(" ", "");
    }
}, Encoders.STRING());
With this MapFunction I can remove spaces from the first column only, not from all columns.
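The per-row logic can be extended to every field by looping over the row instead of reading index 0. Below is a minimal plain-Java sketch with no Spark dependency: the names RowSpaceStripper and stripSpaces are hypothetical, and an Object[] stands in for a Row.

```java
import java.util.Arrays;

class RowSpaceStripper {
    // Hypothetical helper: returns a copy of the row in which every
    // String cell has its spaces removed; other cell types are kept as-is.
    static Object[] stripSpaces(Object[] row) {
        Object[] out = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            Object v = row[i];
            out[i] = (v instanceof String) ? ((String) v).replace(" ", "") : v;
        }
        return out;
    }

    public static void main(String[] args) {
        Object[] row = {"Hello World", "John Smith", "There", 1, 2.3};
        // prints [HelloWorld, JohnSmith, There, 1, 2.3]
        System.out.println(Arrays.toString(stripSpaces(row)));
    }
}
```

Inside MapFunction.call, the same loop could run over value.get(i) for every i below value.length(). The result would then have to be returned as a Row with a matching encoder rather than through Encoders.STRING().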
In Scala, however, the following works in spark-shell and performs the desired operation:
val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)
The Dataset opds contains the data without spaces. I want to achieve the same in Java, but in the Java API the columns() method returns a String[], and I could not find a way to apply the same functional-style mapping over the columns.
Input Data
+----------------+----------+-----+---+---+
|               x|         y|    z|  a|  b|
+----------------+----------+-----+---+---+
|     Hello World|John Smith|There|  1|2.3|
|Welcome to world| Bob Alice|Where|  5|3.6|
+----------------+----------+-----+---+---+
Expected Output Data
+--------------+---------+-----+---+---+
|             x|        y|    z|  a|  b|
+--------------+---------+-----+---+---+
|    HelloWorld|JohnSmith|There|  1|2.3|
|Welcometoworld| BobAlice|Where|  5|3.6|
+--------------+---------+-----+---+---+
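The Scala one-liner could probably be ported to Java along the following lines. This is a sketch, not a tested solution for this exact setup: it assumes Java 8 (for streams and lambdas), and the class name RemoveSpaces and the local[*] master are placeholders chosen for illustration. It builds a Column[] from the String[] returned by columns() and passes it to the varargs select(Column...).

```java
import java.util.Arrays;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

public class RemoveSpaces {
    public static void main(String[] args) {
        // Assumption: a local session for illustration; reuse an existing sparkSession if there is one.
        SparkSession sparkSession = SparkSession.builder()
                .appName("RemoveSpaces")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> dataset = sparkSession.read()
                .format("csv")
                .option("header", "true")
                .load("/pathToCsv/data.csv");

        // One Column expression per column name, mirroring ds.columns.map(...) in Scala.
        Column[] cleaned = Arrays.stream(dataset.columns())
                .map(c -> regexp_replace(col(c), " ", "").alias(c))
                .toArray(Column[]::new);

        // select(Column...) is varargs, so the array plays the role of Scala's `: _*`.
        Dataset<Row> opds = dataset.select(cleaned);
        opds.show();

        sparkSession.stop();
    }
}
```

On Java 7 the stream can be replaced by a plain for loop that fills the Column[] one element at a time. Because the CSV is read without schema inference, every column is a string, so applying regexp_replace to all of them matches what the Scala version does.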