I have a Pandas dataframe called 'df', structured as follows:
| ID | name | lv1 | lv2 |
|---|---|---|---|
| abb | name1 | 40.34 | 21.56 |
| bab | name2 | 21.30 | 67.45 |
| bba | name3 | 32.45 | 45.44 |
In Pandas, I can use the following code to create a new column that contains a list of the lv1 and lv2 values:
cols = ['lv1', 'lv2']
df['new_col'] = df[cols].values.tolist()
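So, for example, new_col holds [40.34, 21.56] for the first row and [21.30, 67.45] for the second.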
Due to memory issues caused by the size of the data, I am now using Databricks instead (which I have never used before) and need to replicate the above. I've successfully created a Spark dataframe by mounting the location of my data and then loading it:
file_location = 'dbfs:/mnt/<mountname>/filename.csv'
file_type = "csv"
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)
display(df)
This loads the data; however, I'm stuck on how to complete the necessary next step. I've found a function called struct in Spark's Scala API, but I can't seem to find the corresponding function in PySpark. Any suggestions?
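For reference, this is roughly what I'm imagining, assuming pyspark.sql.functions is the right place to look (the array call below is just my guess at the equivalent of the Pandas tolist() approach, not something I've verified):

from pyspark.sql import functions as F

cols = ['lv1', 'lv2']
# Guess: combine lv1 and lv2 into a single array-valued column
df = df.withColumn('new_col', F.array(*cols))
display(df)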