2

After reading a CSV file into a Dataset, I want to remove spaces from the String-type data using the Java API.

Apache Spark 2.0.0

Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row,String>() {

    @Override
    public String call(Row value) throws Exception {

        return value.getString(0).replace(" ", ""); 
        // But this will remove space from only first column
    }
}, Encoders.STRING());

With this MapFunction I can only strip spaces from the first column, not from all columns.

In Scala, however, the following works in spark-shell and performs the desired operation:

val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)

The Dataset opds contains the data without spaces. I want to achieve the same in Java, but in the Java API the columns() method returns a String[], and I cannot apply the same functional-style mapping to the Dataset.

Input Data

+----------------+----------+-----+---+---+
|               x|         y|    z|  a|  b|
+----------------+----------+-----+---+---+
|     Hello World|John Smith|There|  1|2.3|
|Welcome to world| Bob Alice|Where|  5|3.6|
+----------------+----------+-----+---+---+

Expected Output Data

+--------------+---------+-----+---+---+
|             x|        y|    z|  a|  b|
+--------------+---------+-----+---+---+
|    HelloWorld|JohnSmith|There|  1|2.3|
|Welcometoworld| BobAlice|Where|  5|3.6|
+--------------+---------+-----+---+---+

4 Comments

  • At which positions do you want to remove spaces? Post a sample string and the output you expect. You can use the trim() function to remove leading and trailing whitespace. Commented Aug 4, 2016 at 12:41
  • @Ravikumar I want to remove the spaces between the strings. Commented Aug 4, 2016 at 13:01
  • You can use a regex to remove spaces between strings. Just post a sample string and the output you expect after removing the spaces. Commented Aug 4, 2016 at 13:15
  • @Ravikumar Check the edited question. Commented Aug 4, 2016 at 13:25

2 Answers

3

Try:

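// Requires: import static org.apache.spark.sql.functions.regexp_replace;
for (String col : dataset.columns()) {
  // Overwrite each column with a copy that has its spaces removed
  dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}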
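
A variation closer to the Scala one-liner can be sketched by mapping the String[] of column names with Arrays.stream and handing the resulting SQL expressions to Dataset.selectExpr(String...). The block below is only a sketch: the class and method names (SpaceStripExprs, buildExprs) are hypothetical, the column names are taken from the sample data in the question, and it assumes column names contain no backticks. It builds the expression strings in plain Java; the actual Spark call is shown only in a comment.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: builds one SQL expression per column.
class SpaceStripExprs {

    // Turns each column name into "regexp_replace(`c`, ' ', '') AS `c`".
    // Assumption: column names do not themselves contain backticks.
    static String[] buildExprs(String[] columns) {
        return Arrays.stream(columns)
                .map(c -> String.format("regexp_replace(`%s`, ' ', '') AS `%s`", c, c))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // In a Spark job this array would come from dataset.columns().
        String[] columns = {"x", "y", "z", "a", "b"};
        String[] exprs = buildExprs(columns);

        for (String e : exprs) {
            System.out.println(e);
        }
        // With Spark on the classpath, the expressions could then be applied as:
        //   Dataset<Row> opds = dataset.selectExpr(exprs);
    }
}
```

Compared with the withColumn loop, this issues a single projection rather than one per column, at the cost of building SQL strings by hand.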

0

You can try the following regex to remove whitespace between strings:

value.getString(0).replaceAll("\\s+", "");

About \s+: it matches any whitespace character one or more times, as many times as possible. Use replaceAll rather than replace here, because replaceAll interprets its first argument as a regular expression, whereas replace treats it as a literal string.

For more on the two methods, see "Difference between String replace() and replaceAll()".
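
To extend the same regex from the first field to every field, the replacement can be applied in a loop or stream over all values of a row. The sketch below uses a plain String[] as a stand-in for a row, so it runs without Spark; the class and method names (StripWhitespaceDemo, stripAll) are hypothetical. Inside a MapFunction the same idea would presumably iterate from 0 to value.length() and call value.getString(i), which is an assumption not tested here.

```java
import java.util.Arrays;

// Hypothetical stand-alone demo: a String[] stands in for the fields of one row.
class StripWhitespaceDemo {

    // Removes every run of whitespace from each field; null fields stay null.
    static String[] stripAll(String[] fields) {
        return Arrays.stream(fields)
                .map(f -> f == null ? null : f.replaceAll("\\s+", ""))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // First row of the sample input from the question.
        String[] row = {"Hello World", "John Smith", "There", "1", "2.3"};
        System.out.println(String.join("|", stripAll(row)));
        // prints: HelloWorld|JohnSmith|There|1|2.3
    }
}
```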

3 Comments

This will remove spaces from the first column only.
Print value.getString(0) and post it so we can see whether the string is multiline.
@mastersheel007 Try this: value.getString(0).replaceAll("(?is)\\s+", "");
