I have downloaded the AMiner DBLP corpus, Version 11. The corpus is a huge text file (12 GB) in which each line is a self-contained JSON string:
'{"id": "100001334", "title": "Ontologies in HYDRA - Middleware for Ambient Intelligent Devices.", "authors": [{"name": "Peter Kostelnik", "id": "2702511795"}, {"name": "Martin Sarnovsky", "id": "2041014688"}, {"name": "Jan Hreno", "id": "2398560122"}], "venue": {"raw": "AMIF"}, "year": 2009, "n_citation": 2, "page_start": "43", "page_end": "46", "doc_type": "", "publisher": "", "volume": "", "issue": "", "fos": [{"name": "Lernaean Hydra", "w": 0.4178039}, {"name": "Database", "w": 0.4269269}, {"name": "World Wide Web", "w": 0.415332377}, {"name": "Ontology (information science)", "w": 0.459045082}, {"name": "Computer science", "w": 0.399807781}, {"name": "Middleware", "w": 0.5905041}, {"name": "Ambient intelligence", "w": 0.5440575}]}'
The JSON strings are newline-separated.
When I open the file with PySpark's text reader, I get a dataframe with a single column containing the raw JSON strings:
df = spark.read.text(path_to_data)
df.show()
+--------------------+
| value|
+--------------------+
|{"id": "100001334...|
|{"id": "100001888...|
|{"id": "100002270...|
|{"id": "100004108...|
|{"id": "10000571"...|
|{"id": "100007563...|
|{"id": "100008278...|
|{"id": "100008490...|
I need to access JSON fields to build my deep learning model.
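For what it's worth, single lines do parse cleanly with Python's standard json module, which confirms the file really is newline-delimited JSON:

import json

# Sanity check: pull one raw line out of the text read above and parse it
line = df.first().value
record = json.loads(line)
print(record["id"], record["title"], len(record["authors"]))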
My first attempt was to open the file with Spark's JSON reader, as suggested in this question:
df = spark.read.option("wholeFile", True).option("mode", "PERMISSIVE").json(path_to_data)
But all the solutions proposed there took ages to run (more than 3 hours) and produced no result.
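(In hindsight I suspect the wholeFile option, renamed multiLine in newer Spark versions, is counterproductive here: as far as I understand, it tells Spark to treat the input as one multi-line JSON document, whereas the json reader's line-oriented default already matches a newline-delimited corpus. A sketch of what I mean; the samplingRatio value is an arbitrary guess to cap the cost of schema inference:)

# Line-oriented read: JSON Lines is the json reader's default mode.
# samplingRatio only affects schema inference; 0.001 is a guess.
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("samplingRatio", 0.001)
      .json(path_to_data))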
My second attempt was to parse the JSON strings against an explicit schema, to get a dataframe with one column per field:
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, MapType)
from pyspark.sql.functions import from_json, col

df = spark.read.text(path_to_data)
schema = StructType([
    StructField("id", StringType()),
    StructField("title", StringType()),
    StructField("authors", ArrayType(MapType(StringType(), StringType()))),
    StructField("venue", MapType(StringType(), StringType()), True),
    StructField("year", IntegerType(), True),
    StructField("keywords", ArrayType(StringType()), True),
    StructField("references", ArrayType(StringType()), True),
    StructField("n_citation", IntegerType(), True),
    StructField("page_start", StringType(), True),
    StructField("page_end", StringType(), True),
    StructField("doc_type", StringType(), True),
    StructField("lang", StringType(), True),
    StructField("publisher", StringType(), True),
    StructField("volume", StringType(), True),
    StructField("issue", StringType(), True),
    StructField("issn", StringType(), True),
    StructField("isbn", StringType(), True),
    StructField("doi", StringType(), True),
    StructField("pdf", StringType(), True),
    StructField("url", ArrayType(StringType()), True),
    StructField("abstract", StringType(), True),
    StructField("indexed_abstract", StringType(), True),
])
datajson = df.withColumn("jsonData", from_json(col("value"), schema)).select("jsonData.*")
But it raised the exception "cannot resolve column due to data type mismatch", even though the data type of every field in the schema is correct (based on the corpus's official website, here).
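One way to cross-check a hand-written schema, assuming Spark 2.4+ where schema_of_json is available, is to let Spark infer the schema of a single sample line and diff it against mine:

from pyspark.sql.functions import schema_of_json, lit

# Ask Spark what schema it would infer for one sample record
sample = df.first().value
inferred = spark.range(1).select(schema_of_json(lit(sample))).first()[0]
print(inferred)  # DDL-style schema string to compare with the StructType above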
My third attempt was to parse each JSON string into a Map data type:
casted = df.withColumn("value", from_json(df.value, MapType(StringType(), StringType())))
It gave me the following result:
root
|-- value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+--------------------+
| value|
+--------------------+
|{id -> 100001334,...|
|{id -> 1000018889...|
|{id -> 1000022707...|
|{id -> 100004108,...|
|{id -> 10000571, ...|
|{id -> 100007563,...|
|{id -> 100008278,...|
+--------------------+
Now each row holds a parsed map, and individual fields can be accessed by key:
row = casted.first()
row.value['id']
row.value['title']
row.value['authors']
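(Note that under this map schema every value is a string; as far as I can tell, nested values such as authors come back as their raw JSON text, so they need a second from_json pass, e.g.:)

# Second pass: re-parse the stringified authors array
authors = casted.select(
    from_json(col("value")["authors"],
              ArrayType(MapType(StringType(), StringType()))).alias("authors"))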
Now, my question is: how can I convert this single-column dataframe ('value') into a dataframe with the columns mentioned above (id, title, authors, etc.), typed according to the JSON fields?
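In other words, I want something equivalent to the following per-field selection, but covering all fields with their proper types, ideally without enumerating every key by hand (a sketch of the desired shape, not a working solution):

# Desired shape, spelled out for the first three fields only
wanted = casted.select(
    col("value")["id"].alias("id"),
    col("value")["title"].alias("title"),
    col("value")["authors"].alias("authors"))
wanted.printSchema()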