
I'm trying to transform the json output of aws glue get-tables command into a PySpark dataframe.

After reading the json output with this command:

df = spark.read.option("inferSchema", "true") \
    .option("multiline", "true") \
    .json("tmp/my_json.json")

I get the following from printSchema:

root
 |-- TableList: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- CatalogId: string (nullable = true)
 |    |    |-- CreateTime: string (nullable = true)
 |    |    |-- CreatedBy: string (nullable = true)
 |    |    |-- DatabaseName: string (nullable = true)
 |    |    |-- IsRegisteredWithLakeFormation: boolean (nullable = true)
 |    |    |-- LastAccessTime: string (nullable = true)
 |    |    |-- Name: string (nullable = true)
 |    |    |-- Owner: string (nullable = true)
 |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
 |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
 |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
 |    |    |    |-- averageRecordSize: string (nullable = true)
 |    |    |    |-- classification: string (nullable = true)
 |    |    |    |-- compressionType: string (nullable = true)
 |    |    |    |-- objectCount: string (nullable = true)
 |    |    |    |-- recordCount: string (nullable = true)
 |    |    |    |-- sizeKey: string (nullable = true)
 |    |    |    |-- spark.sql.create.version: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
 |    |    |    |-- typeOfData: string (nullable = true)
 |    |    |-- PartitionKeys: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |-- Type: string (nullable = true)
 |    |    |-- Retention: long (nullable = true)
 |    |    |-- StorageDescriptor: struct (nullable = true)
 |    |    |    |-- BucketColumns: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- Columns: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Compressed: boolean (nullable = true)
 |    |    |    |-- InputFormat: string (nullable = true)
 |    |    |    |-- Location: string (nullable = true)
 |    |    |    |-- NumberOfBuckets: long (nullable = true)
 |    |    |    |-- OutputFormat: string (nullable = true)
 |    |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
 |    |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
 |    |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
 |    |    |    |    |-- averageRecordSize: string (nullable = true)
 |    |    |    |    |-- classification: string (nullable = true)
 |    |    |    |    |-- compressionType: string (nullable = true)
 |    |    |    |    |-- objectCount: string (nullable = true)
 |    |    |    |    |-- recordCount: string (nullable = true)
 |    |    |    |    |-- sizeKey: string (nullable = true)
 |    |    |    |    |-- spark.sql.create.version: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
 |    |    |    |    |-- typeOfData: string (nullable = true)
 |    |    |    |-- SerdeInfo: struct (nullable = true)
 |    |    |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |    |    |-- serialization.format: string (nullable = true)
 |    |    |    |    |-- SerializationLibrary: string (nullable = true)
 |    |    |    |-- SortColumns: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- StoredAsSubDirectories: boolean (nullable = true)
 |    |    |-- TableType: string (nullable = true)
 |    |    |-- UpdateTime: string (nullable = true)

But df ends up with just one column that holds the whole JSON:

+--------------------+
|           TableList|
+--------------------+
|[[903342277921, 2...|
+--------------------+

Is there a way to programmatically (and dynamically) create the dataframe with the columns shown by printSchema, without declaring the schema by hand?

Thanks in advance!

1 Answer

Spark reads the file as a single row because the top-level value is one JSON object, so all the tables sit inside the TableList array of that one row. You can use the explode() function to turn the elements of the array into separate rows, then expand the resulting struct into top-level columns:

from pyspark.sql.functions import explode

df = df.select(explode(df['TableList']).alias('col')).select('col.*')
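
If you also want the nested structs (Parameters, StorageDescriptor, SerdeInfo, ...) expanded into flat top-level columns without declaring the schema explicitly, you can walk df.schema and expand struct columns iteratively. Below is a minimal sketch under that assumption; flatten_structs is an illustrative helper name, not a built-in, and array columns are deliberately left alone:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_structs(df):
    # Hypothetical helper: repeatedly expand struct columns into
    # top-level columns until none remain. Array columns (e.g.
    # PartitionKeys) are left as-is; explode them separately if you
    # need one row per element.
    while True:
        struct_cols = [f.name for f in df.schema.fields
                       if isinstance(f.dataType, StructType)]
        if not struct_cols:
            return df
        kept = [col(f"`{c}`") for c in df.columns if c not in struct_cols]
        expanded = [
            # backticks guard field names that contain dots, e.g.
            # Parameters.`spark.sql.create.version`
            col(f"`{sc}`.`{child.name}`").alias(
                f"{sc}_{child.name}".replace(".", "_"))
            for sc in struct_cols
            for child in df.schema[sc].dataType.fields
        ]
        df = df.select(kept + expanded)

df_flat = flatten_structs(df)
df_flat.printSchema()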

3 Comments

Thanks for your answer! This gives me another column named "col" and the same number of rows as there are tables in the list, which is OK, but what I'm looking for is to have the elements printed by printSchema() as columns, without having to declare the schema explicitly
I've updated my solution, forgot to include the last step :)
Many thanks, it worked! I've upvoted your answer.
