
I want to create a PySpark dataframe in which there is a column with a variable schema. So my dataframe could look something like this:

| Id | Variable_Column                  |
|----|----------------------------------|
| 1  | [{"col1":"val1"}]                |
| 2  | [{"col1":"val2", "col2":"val3"}] |

To achieve this, I started out like this:

schema = StructType([StructField("Id", IntegerType(), True),\
                      StructField("Variable_Column", ArrayType(StructType()), True)\
                               ])
valdict = dict()
valdict["col1"] = "val1"
values = [(1, [valdict])]
df = spark.createDataFrame(values, schema)
display(df)

| Id | Variable_Column |
|----|-----------------|
| 1  | [{}]            |

Doing it this way, I'm creating an empty array. This also does not feel right; I want the types of the internal columns to be preserved as well. Please suggest the right way to insert data. For my variable column I'm using "ArrayType(StructType())"; is that the right column type to use?

  • Could you provide a more detailed example, and tell us exactly what you expect of that variable column? My main question is: what's variable? The length? The types? The structure? Commented Sep 2, 2020 at 8:35
  • Hi. In my use case, it could be any of the above. For example, the first row can have two integer-type key-value pairs, and the second row can have two string-type and two integer-type pairs, etc. Is something like this even possible in a PySpark dataframe? If not, what is the right way to deal with this problem? Commented Sep 2, 2020 at 16:07
  • That's not possible in standard Spark. Columns have a DataType, and all the values in that column must have this type. Variable length is achievable with arrays or maps, but that's all you can do as far as I know. There are workarounds, but nothing in plain PySpark. Let me try to provide a solution. Commented Sep 3, 2020 at 7:58

2 Answers


SOLUTION 1

If you simply want to create a column with a variable number of values, you can use ArrayType of StructType. In your case, you defined an empty StructType, hence the result you get.

You can define a dataframe like this:

df1 = spark.createDataFrame([(1, [('name1', 'val1'), ('name2', 'val2')]),
                             (2, [('name3', 'val3')])],
                            ['Id', 'Variable_Column'])
df1.show(truncate=False)

which corresponds to the example you provided:

+---+----------------------------+
|Id |Variable_Column             |
+---+----------------------------+
|1  |[[name1,val1], [name2,val2]]|
|2  |[[name3,val3]]              |
+---+----------------------------+

Note that you don't need to explicitly define the schema in that case, but if you want to, it would look like this (by the way, you can call df1.schema to print it):

schema = StructType([
             StructField('Id',LongType()),
             StructField('Variable_Column',ArrayType(StructType([
                   StructField('name',StringType()),
                   StructField('value',StringType())
             ])))
         ])
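
For completeness, a minimal sketch of passing that explicit schema to createDataFrame and then unpacking the nested fields; the explode/select below is just one illustrative way to read the structs back, not the only one:

from pyspark.sql import functions as F

# reuse the explicit schema from above when building the dataframe
df1 = spark.createDataFrame([(1, [('name1', 'val1'), ('name2', 'val2')]),
                             (2, [('name3', 'val3')])], schema)

# one row per (name, value) pair, then pull the struct fields out
df1.select('Id', F.explode('Variable_Column').alias('kv')) \
   .select('Id', 'kv.name', 'kv.value') \
   .show()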

SOLUTION 2

Very similarly, you could use MapType like this:

df2 = spark.createDataFrame([(1, dict([('name1', 'val1'), ('name2', 'val2')])),
                             (2, dict([('name3', 'val3')]))],
                            ['Id', 'Variable_Column'])
df2.show(truncate=False)
+---+---------------------------------+
|Id |Variable_Column                  |
+---+---------------------------------+
|1  |Map(name2 -> val2, name1 -> val1)|
|2  |Map(name3 -> val3)               |
+---+---------------------------------+
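
One nice property of the map version is that you can look values up by key directly. A small sketch; the key 'name1' is just taken from the example data:

from pyspark.sql import functions as F

# rows whose map does not contain the key simply get null
df2.select('Id', F.col('Variable_Column').getItem('name1').alias('name1')).show()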

SOLUTION 3

In a comment, you say that you would also want variable types. That's not possible with dataframes. If that's truly what you want, you may not be using the right tool. But if it is just a corner case, you could keep a record of the type of the data in a string, like this:

df3 = spark.createDataFrame([(1, [('name1', 'val1', 'string'),
                                  ('name2', '0.6', 'double')]),
                             (2, [('name3', '3', 'integer')])],
                            ['Id', 'Variable_Column'])
df3.show(truncate=False)
+---+-----------------------------------------+
|Id |Variable_Column                          |
+---+-----------------------------------------+
|1  |[[name1,val1,string], [name2,0.6,double]]|
|2  |[[name3,3,integer]]                      |
+---+-----------------------------------------+
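
Reading such data back then means casting the stored strings yourself. A rough sketch of what that could look like; note that with the plain tuples above Spark names the inner struct fields _1, _2 and _3, and a single column can still only hold one type, so the cast happens after filtering on the recorded type:

from pyspark.sql import functions as F

# flatten the array of structs into one row per (name, value, type) triple
exploded = df3.select('Id', F.explode('Variable_Column').alias('kv')) \
              .select('Id',
                      F.col('kv._1').alias('name'),
                      F.col('kv._2').alias('value'),
                      F.col('kv._3').alias('type'))

# cast back per recorded type; each type ends up in its own dataframe
doubles = exploded.filter(F.col('type') == 'double') \
                  .withColumn('value', F.col('value').cast('double'))
doubles.show()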

1 Comment

Thanks for the solution, that helps. I took the solution 3 idea, made a few changes, and stored the schema in a separate schema file. Although I wish dataframes had that capability by default.

You can define the schema as below:

schema = StructType([StructField("Id", IntegerType(), True),\
                      StructField("Variable_Column", ArrayType(MapType(StringType(),StringType())), True)\
                                ])
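
A minimal sketch of inserting data with that schema, reusing the dictionary from the question (here the row with two keys):

# values drawn from the question's second example row
valdict = {"col1": "val2", "col2": "val3"}
values = [(1, [valdict])]
df = spark.createDataFrame(values, schema)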

This will give output like below:

df.show()
+---+--------------------+
| Id|     Variable_Column|
+---+--------------------+
|  1|[[col2 -> val3, c...|
+---+--------------------+

