
I want to create a PySpark dataframe in which there is a column with a variable schema. So my dataframe could look something like this:

| Id | Variable_Column                  |
|----|----------------------------------|
| 1  | [{"col1":"val1"}]                |
| 2  | [{"col1":"val2", "col2":"val3"}] |

To achieve this, I started out like this:

schema = StructType([StructField("Id", IntegerType(), True),\
                      StructField("Variable_Column", ArrayType(StructType()), True)\
                               ])
valdict = dict()
valdict["col1"] = "val1"
values = [(1, [valdict])]
df = spark.createDataFrame(values, schema)
display(df)

| Id | Variable_Column |
|----|-----------------|
| 1  | [{}]            |

Doing it this way, I'm creating an empty array. This also does not feel right; I want the types of the internal columns to be preserved as well. Please suggest the right way to insert data. For my variable column I'm using "ArrayType(StructType())"; is that the right column type to use?

  • Could you provide a more detailed example, and tell us exactly what you expect of that variable column? My main question is: what's variable? The length? The types? The structure? Commented Sep 2, 2020 at 8:35
  • Hi. In my use case, it could be any of the above. For example, the first row can have two integer-type key-value pairs, and the second row can have two string-type and two integer-type pairs, etc. Is something like this even possible in a PySpark dataframe? If not, what is the right way to deal with this problem? Commented Sep 2, 2020 at 16:07
  • That's not possible in standard Spark. Columns have a DataType, and all the values in that column must have this type. Variable length is achievable with arrays or maps, but that's all you can do as far as I know. There are workarounds, but nothing in plain PySpark. Let me try to provide a solution. Commented Sep 3, 2020 at 7:58

2 Answers


SOLUTION 1

If you simply want to create a column with a variable number of values, you can use ArrayType of StructType. In your case, you defined an empty StructType, hence the result you get.

You can define a dataframe like this:

df1 = spark.createDataFrame([(1, [('name1', 'val1'), ('name2', 'val2')]),
                             (2, [('name3', 'val3')])],
                            ['Id', 'Variable_Column'])
df1.show(truncate=False)

which corresponds to the example you provided:

+---+----------------------------+
|Id |Variable_Column             |
+---+----------------------------+
|1  |[[name1,val1], [name2,val2]]|
|2  |[[name3,val3]]              |
+---+----------------------------+

Note that you don't need to explicitly define the schema in that case, but if you want to, it would look like this (by the way, you can call df1.schema to print it):

schema = StructType([
             StructField('Id',LongType()),
             StructField('Variable_Column',ArrayType(StructType([
                   StructField('name',StringType()),
                   StructField('value',StringType())
             ])))
         ])
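
For completeness, a minimal sketch of passing that explicit schema to createDataFrame and then unpacking the nested fields; the explode/select below is just one illustrative way to read the structs back, not the only one:

from pyspark.sql import functions as F

# reuse the explicit schema from above when building the dataframe
df1 = spark.createDataFrame([(1, [('name1', 'val1'), ('name2', 'val2')]),
                             (2, [('name3', 'val3')])], schema)

# one row per (name, value) pair, then pull the struct fields out
df1.select('Id', F.explode('Variable_Column').alias('kv')) \
   .select('Id', 'kv.name', 'kv.value') \
   .show()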

SOLUTION 2

Very similarly, you could use MapType like this:

df2 = spark.createDataFrame([(1, dict([('name1', 'val1'), ('name2', 'val2')])),
                             (2, dict([('name3', 'val3')]))],
                            ['Id', 'Variable_Column'])
df2.show(truncate=False)
+---+---------------------------------+
|Id |Variable_Column                  |
+---+---------------------------------+
|1  |Map(name2 -> val2, name1 -> val1)|
|2  |Map(name3 -> val3)               |
+---+---------------------------------+
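
One nice property of the map version is that you can look values up by key directly. A small sketch; the key 'name1' is just taken from the example data:

from pyspark.sql import functions as F

# rows whose map does not contain the key simply get null
df2.select('Id', F.col('Variable_Column').getItem('name1').alias('name1')).show()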

SOLUTION 3

In a comment, you say that you would also want variable types. That's not possible with dataframes. If that's truly what you want, you may not be using the right tool. But if it is just a corner case, you could keep a record of the type of the data in a string, like this:

df3 = spark.createDataFrame([(1, [('name1', 'val1', 'string'),
                                  ('name2', '0.6', 'double')]),
                             (2, [('name3', '3', 'integer')])],
                            ['Id', 'Variable_Column'])
df3.show(truncate=False)
+---+-----------------------------------------+
|Id |Variable_Column                          |
+---+-----------------------------------------+
|1  |[[name1,val1,string], [name2,0.6,double]]|
|2  |[[name3,3,integer]]                      |
+---+-----------------------------------------+
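
Reading such data back then means casting the stored strings yourself. A rough sketch of what that could look like; note that with the plain tuples above Spark names the inner struct fields _1, _2 and _3, and a single column can still only hold one type, so the cast happens after filtering on the recorded type:

from pyspark.sql import functions as F

# flatten the array of structs into one row per (name, value, type) triple
exploded = df3.select('Id', F.explode('Variable_Column').alias('kv')) \
              .select('Id',
                      F.col('kv._1').alias('name'),
                      F.col('kv._2').alias('value'),
                      F.col('kv._3').alias('type'))

# cast back per recorded type; each type ends up in its own dataframe
doubles = exploded.filter(F.col('type') == 'double') \
                  .withColumn('value', F.col('value').cast('double'))
doubles.show()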

1 Comment

Thanks for the solution, that helps. I took the solution 3 idea, made a few changes, and stored the schema in a separate schema file. Although I wish dataframes had that capability by default.

You can define the schema as below:

schema = StructType([StructField("Id", IntegerType(), True),\
                      StructField("Variable_Column", ArrayType(MapType(StringType(),StringType())), True)\
                                ])
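
A minimal sketch of inserting data with that schema, reusing the dictionary from the question (here the row with two keys):

# values drawn from the question's second example row
valdict = {"col1": "val2", "col2": "val3"}
values = [(1, [valdict])]
df = spark.createDataFrame(values, schema)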

This will give output like below:

df.show()
+---+--------------------+
| Id|     Variable_Column|
+---+--------------------+
|  1|[[col2 -> val3, c...|
+---+--------------------+

