
I'm trying to use Spark (Java API) to take an in-memory Map (that potentially contains other nested Maps as its values) and convert it into a dataframe. I think I need something along these lines:

Map myMap = getSomehow();
RDD myRDD = sparkContext.makeRDD(myMap); // ???
DataFrame df = sparkContext.read(myRDD); // ???

But I'm having a tough time seeing the forest for the trees here...any ideas? Again, this might be a Map<String,String> or a Map<String,Map>, where there could be several nested layers of maps-inside-of-maps-inside-of-maps, etc.

1 Answer


So I tried something; I'm not sure it's the most efficient way to do it, but I do not see another one right now.

    SparkConf sf = new SparkConf().setAppName("name").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(sf);
    SQLContext sqlCon = new SQLContext(sc);

    // Build the inner map first, then put it into the outer map
    HashMap<String, String> putMap = new HashMap<String, String>();
    putMap.put("1", "test");

    Map<String, Map<String, String>> map = new HashMap<String, Map<String, String>>();
    map.put("test1", putMap);

    List<Tuple2<String, HashMap>> list = new ArrayList<Tuple2<String, HashMap>>();

    Set<String> allKeys = map.keySet();
    for (String key : allKeys) {
        list.add(new Tuple2<String, HashMap>(key, (HashMap) map.get(key)));
    }

    JavaRDD<Tuple2<String, HashMap>> rdd = sc.parallelize(list);

    System.out.println(rdd.first());

    List<StructField> fields = new ArrayList<>();
    StructField field1 = DataTypes.createStructField("String", DataTypes.StringType, true);
    StructField field2 = DataTypes.createStructField("Map",
            DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true);

    fields.add(field1);
    fields.add(field2);

    StructType struct = DataTypes.createStructType(fields);

    JavaRDD<Row> rowRDD = rdd.map(new Function<Tuple2<String, HashMap>, Row>() {

        @Override
        public Row call(Tuple2<String, HashMap> arg0) throws Exception {
            return RowFactory.create(arg0._1, arg0._2);
        }

    });

    DataFrame df = sqlCon.createDataFrame(rowRDD, struct);

    df.show();

In this scenario I assumed that the Map in the DataFrame is of type Map<String, String>. Hope this helps!
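For the deeper nesting the question mentions (maps inside maps to arbitrary depth), one option is to first flatten the nested structure into a single Map<String, String> with dot-separated keys, so the (String, String) schema above still applies. This is a plain-Java sketch; the `flatten` helper and the dot separator are my own choices, not part of any Spark API:

```java
import java.util.HashMap;
import java.util.Map;

public class MapFlattener {

    // Recursively flattens nested maps into a single map with
    // dot-separated keys, e.g. {"a": {"b": "c"}} -> {"a.b": "c"}.
    public static Map<String, String> flatten(Map<String, Object> nested) {
        Map<String, String> flat = new HashMap<String, String>();
        flattenInto("", nested, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> nested,
                                    Map<String, String> flat) {
        for (Map.Entry<String, Object> e : nested.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                // Descend into the nested map, extending the key prefix
                flattenInto(key, (Map<String, Object>) e.getValue(), flat);
            } else {
                flat.put(key, String.valueOf(e.getValue()));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new HashMap<String, Object>();
        inner.put("1", "test");
        Map<String, Object> outer = new HashMap<String, Object>();
        outer.put("test1", inner);
        System.out.println(flatten(outer)); // prints {test1.1=test}
    }
}
```

The flattened map can then be fed into the same Tuple2/RowFactory pipeline shown above.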

Edit: Obviously you can delete all the print statements; I added them for visualization purposes.
