I'm trying to convert a String[][] into a Dataset<Row> column consisting of String[]. I have gone through the docs and the examples available online but could not find anything similar. I don't know whether it's possible, as I'm a complete beginner in Spark.

Sample input:

    String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};

Sample output:

    Dataset<Row> test_df
    test_df.show()
    +-------------+
    |          foo|
    +-------------+
    |      [test1]|
    |[test2,test3]|
    |[test4,test5]|
    +-------------+

I'm probably defining the StructType wrong for String[][]; I've tried several variations. Here's what I'm trying to do:


    String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};
    
    List<String[]> test1 = Arrays.asList(test);
    
    StructType structType = DataTypes.createStructType(
        DataTypes.createStructField(
                   "foo", 
                   DataTypes.createArrayType(DataTypes.StringType), 
                   true));
    
    Dataset<Row> t = spark.createDataFrame(test1, structType);
    t.show();

1 Answer

The problem with your code is that spark.createDataFrame(List<Row>, StructType) expects a list of Row objects, but you are passing it a list of String arrays.

There are several ways to overcome it:

  • Create a Row from each of the arrays, and then apply the method you have been using.
  • Create a dataset of string arrays using a bean encoder and then convert it to a dataset of Row using a row encoder.
  • Create the dataframe using a Java Bean.
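
For the first option, one detail is easy to miss: RowFactory.create takes varargs (Object... values), so a String[] passed directly is spread into one column per element instead of becoming a single array column. The sketch below illustrates this with plain Java only; createRow is a hypothetical stand-in that mimics the varargs signature and is not part of Spark.

```java
import java.util.Arrays;

class VarargsRowDemo {
    // Hypothetical stand-in with the same varargs shape as RowFactory.create(Object... values).
    // It simply returns the "columns" it was given.
    static Object[] createRow(Object... values) {
        return values;
    }

    public static void main(String[] args) {
        String[] arr = {"test2", "test3"};

        // A String[] is itself an Object[], so it is taken as the whole varargs list:
        // one column per string. A bare createRow(arr) behaves the same way.
        Object[] spread = createRow((Object[]) arr);
        System.out.println(spread.length + " columns: " + Arrays.toString(spread));

        // Casting to Object makes the array a single argument: one column holding the array.
        Object[] wrapped = createRow((Object) arr);
        System.out.println(wrapped.length + " column: " + Arrays.toString((String[]) wrapped[0]));
    }
}
```

With Spark, the corresponding call would presumably be RowFactory.create((Object) arr) for each inner array, with the resulting List<Row> passed to the createDataFrame method from the question. That also assumes the schema is built from a StructField array or list, since createStructType does not accept a single StructField.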

I think the last method is the easiest, so here is how you do it. You have to define a small Java bean whose only instance variable is a String array.

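    // Declared static so it can be nested inside another class (e.g. your main class).
    public static class ArrayWrapper {
        // Backing field for the bean property "foo".
        private String[] foo;

        // Lets ArrayWrapper::new be used to wrap each inner array.
        public ArrayWrapper(String[] foo) {
            this.foo = foo;
        }

        // Getter: exposes the property "foo", which becomes the column.
        public String[] getFoo() {
            return foo;
        }

        public void setFoo(String[] foo) {
            this.foo = foo;
        }
    }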

Make sure the Java Bean has a constructor that accepts a String array; the ArrayWrapper::new method reference used below relies on it.

Then, to create the dataframe, you first create a list of ArrayWrapper (your Java Bean) from the array of arrays, and then make a dataframe using the createDataFrame(List<?>,Class<?>) method.

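    // Needs java.util.Arrays, java.util.List, java.util.stream.Collectors,
    // org.apache.spark.sql.Dataset and org.apache.spark.sql.Row; spark is an existing SparkSession.
    String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};
    // Wrap each inner array in a bean so that every bean becomes one row.
    List<ArrayWrapper> list = Arrays.stream(test).map(ArrayWrapper::new).collect(Collectors.toList());
    Dataset<Row> testDF = spark.createDataFrame(list, ArrayWrapper.class);
    testDF.show();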

The name of the column is determined by the bean property name, i.e. the name derived from the getter (getFoo gives foo), which here matches the instance variable.
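
To see which name a bean would contribute, you can inspect it with the standard java.beans API. This is only a sketch: it assumes Spark derives bean columns through the usual JavaBeans introspection (my understanding, not verified here), and RenamedWrapper and readableProperties are hypothetical names made up for the illustration. The field is deliberately called data while the accessors are getFoo/setFoo, to show that the property name follows the getter.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

class BeanPropertyDemo {
    // Hypothetical bean: the field is named "data", but the accessors are getFoo/setFoo.
    public static class RenamedWrapper {
        private String[] data;

        public String[] getFoo() {
            return data;
        }

        public void setFoo(String[] foo) {
            this.data = foo;
        }
    }

    // Returns the names of all readable bean properties.
    // Introspection stops at Object, so the implicit "class" property is not reported.
    static List<String> readableProperties(Class<?> beanClass) {
        List<String> names = new ArrayList<>();
        try {
            for (PropertyDescriptor pd
                    : Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() != null) {
                    names.add(pd.getName());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Could not introspect " + beanClass, e);
        }
        return names;
    }

    public static void main(String[] args) {
        // The property is reported as "foo" (from getFoo), not "data" (the field).
        System.out.println(readableProperties(RenamedWrapper.class));
    }
}
```

For the ArrayWrapper bean above the field and the getter agree, so either way of looking at it gives foo.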
