
I'm trying to transform the json output of aws glue get-tables command into a PySpark dataframe.

After reading the json output with this command:

df = spark.read.option("inferSchema", "true") \
    .option("multiline", "true") \
    .json("tmp/my_json.json")

I get the following from printSchema:

root
 |-- TableList: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- CatalogId: string (nullable = true)
 |    |    |-- CreateTime: string (nullable = true)
 |    |    |-- CreatedBy: string (nullable = true)
 |    |    |-- DatabaseName: string (nullable = true)
 |    |    |-- IsRegisteredWithLakeFormation: boolean (nullable = true)
 |    |    |-- LastAccessTime: string (nullable = true)
 |    |    |-- Name: string (nullable = true)
 |    |    |-- Owner: string (nullable = true)
 |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
 |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
 |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
 |    |    |    |-- averageRecordSize: string (nullable = true)
 |    |    |    |-- classification: string (nullable = true)
 |    |    |    |-- compressionType: string (nullable = true)
 |    |    |    |-- objectCount: string (nullable = true)
 |    |    |    |-- recordCount: string (nullable = true)
 |    |    |    |-- sizeKey: string (nullable = true)
 |    |    |    |-- spark.sql.create.version: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
 |    |    |    |-- typeOfData: string (nullable = true)
 |    |    |-- PartitionKeys: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |-- Type: string (nullable = true)
 |    |    |-- Retention: long (nullable = true)
 |    |    |-- StorageDescriptor: struct (nullable = true)
 |    |    |    |-- BucketColumns: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- Columns: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Compressed: boolean (nullable = true)
 |    |    |    |-- InputFormat: string (nullable = true)
 |    |    |    |-- Location: string (nullable = true)
 |    |    |    |-- NumberOfBuckets: long (nullable = true)
 |    |    |    |-- OutputFormat: string (nullable = true)
 |    |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
 |    |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
 |    |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
 |    |    |    |    |-- averageRecordSize: string (nullable = true)
 |    |    |    |    |-- classification: string (nullable = true)
 |    |    |    |    |-- compressionType: string (nullable = true)
 |    |    |    |    |-- objectCount: string (nullable = true)
 |    |    |    |    |-- recordCount: string (nullable = true)
 |    |    |    |    |-- sizeKey: string (nullable = true)
 |    |    |    |    |-- spark.sql.create.version: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
 |    |    |    |    |-- typeOfData: string (nullable = true)
 |    |    |    |-- SerdeInfo: struct (nullable = true)
 |    |    |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |    |    |-- serialization.format: string (nullable = true)
 |    |    |    |    |-- SerializationLibrary: string (nullable = true)
 |    |    |    |-- SortColumns: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- StoredAsSubDirectories: boolean (nullable = true)
 |    |    |-- TableType: string (nullable = true)
 |    |    |-- UpdateTime: string (nullable = true)

But df ends up with just one column that holds the whole JSON:

+--------------------+
|           TableList|
+--------------------+
|[[903342277921, 2...|
+--------------------+

Is there a way to programmatically (and dynamically) create the dataframe with the columns shown by printSchema, without declaring the schema by hand?

Thanks in advance!

1 Answer

Spark reads the file as a single row because the top-level value is one JSON object, so all the tables sit inside the TableList array of that one row. You can use the explode() function to turn the elements of the array into separate rows, then expand the resulting struct into top-level columns:

from pyspark.sql.functions import explode

df = df.select(explode(df['TableList']).alias('col')).select('col.*')
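
If you also want the nested structs (Parameters, StorageDescriptor, SerdeInfo, ...) expanded into flat top-level columns without declaring the schema explicitly, you can walk df.schema and expand struct columns iteratively. Below is a minimal sketch under that assumption; flatten_structs is an illustrative helper name, not a built-in, and array columns are deliberately left alone:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_structs(df):
    # Hypothetical helper: repeatedly expand struct columns into
    # top-level columns until none remain. Array columns (e.g.
    # PartitionKeys) are left as-is; explode them separately if you
    # need one row per element.
    while True:
        struct_cols = [f.name for f in df.schema.fields
                       if isinstance(f.dataType, StructType)]
        if not struct_cols:
            return df
        kept = [col(f"`{c}`") for c in df.columns if c not in struct_cols]
        expanded = [
            # backticks guard field names that contain dots, e.g.
            # Parameters.`spark.sql.create.version`
            col(f"`{sc}`.`{child.name}`").alias(
                f"{sc}_{child.name}".replace(".", "_"))
            for sc in struct_cols
            for child in df.schema[sc].dataType.fields
        ]
        df = df.select(kept + expanded)

df_flat = flatten_structs(df)
df_flat.printSchema()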

3 Comments

Thanks for your answer! This gives me another column named "col" and the same number of rows as there are tables in the list, which is OK, but what I'm looking for is to have the elements printed by printSchema() as columns, without having to declare the schema explicitly
I've updated my solution, forgot to include the last step :)
Many thanks, it worked! I've upvoted your answer.
