How to read a text file as one string into a Spark DataFrame with Java

Question

I want to create a DataFrame (DF) from text files in which each row holds the entire contents of one txt file, in a column named text.

I tried the following, but the resulting DF has one row per line of the file.

Dataset<Row> df = spark.read()
            .textFile("resources/textfile.txt")
            .toDF("text");

Instead of a DF with 1 row for 1 file, I get a DF with 70 rows for this file.

werner · Accepted Answer · 2021-08-04 19:27:25Z

You can collect the rows of the DataFrame into a single array and then join that array into one string:

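import static org.apache.spark.sql.functions.*;

// Gather every row of the "text" column into one array, then join the
// array elements with a space so the result is a single-row DataFrame.
df.agg(collect_list("text").alias("text"))
    .withColumn("text", concat_ws(" ", col("text")))
    .show();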
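
One thing to keep in mind with the snippet above is that the separator passed to concat_ws replaces the original line breaks. The following is a plain-Java sketch (no Spark involved) of that joining step, only to illustrate the difference between " " and "\n" as the separator. The class name JoinLinesSketch, the helper joinLines, and the sample lines are hypothetical and not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;

class JoinLinesSketch {

    // Joins the lines of one file into a single string. This mimics, for a
    // single file, the effect of collecting the rows and joining them with
    // a separator; it is an illustration, not Spark code.
    static String joinLines(List<String> lines, String separator) {
        return String.join(separator, lines);
    }

    public static void main(String[] args) {
        // Hypothetical three-line file, already split into one entry per line.
        List<String> lines = Arrays.asList("first line", "second line", "third line");

        // A space separator flattens the file onto one line.
        System.out.println(joinLines(lines, " "));

        // A newline separator keeps the original line structure.
        System.out.println(joinLines(lines, "\n"));
    }
}
```

If the line breaks matter for later processing, passing "\n" instead of " " to concat_ws should keep them in the resulting string.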
