Please note that the following queries take a while to complete on this huge dataset, however they should complete sucessfully after some hours.
We start the arangoimp to import our base dataset:
arangoimp --create-collection true --collection RawSubReddits --type jsonl ./RC_2017-01
We use arangosh to create the collections where our final data is going to live in:
db._create("authors")
db._createEdgeCollection("authorsToSubreddits")
We fill the authors collection by simply ignoring any subsequently occuring duplicate authors;
We will calculate the _key of the author by using the MD5 function,
so it obeys the restrictions for allowed chars in _key, and we can know it later on by calling MD5() again on the author field:
db._query(`
FOR item IN RawSubReddits
INSERT {
_key: MD5(item.author),
author: item.author
} INTO authors
OPTIONS { ignoreErrors: true }`);
After the we have filled the second vertex collection (we will keep the imported collection as the first vertex collection) we have to calculate the edges.
Since each author can have created several subreds, its most probably going to be several edges originating from each author. As previously mentioned,
we can use the MD5()-function again to reference the author previously created:
db._query(`
FOR onesubred IN RawSubReddits
INSERT {
_from: CONCAT('authors/', MD5(onesubred.author)),
_to: CONCAT('RawSubReddits/', onesubred._key)
} INTO authorsToSubreddits")
After the edge collection is filled (which may again take a while - we're talking about 40 million edges herer, right? - we create the graph description:
db._graphs.save({
"_key": "reddits",
"orphanCollections" : [ ],
"edgeDefinitions" : [
{
"collection": "authorsToSubreddits",
"from": ["authors"],
"to": ["RawSubReddits"]
}
]
})
We now can use the UI to browse the graphs, or use AQL queries to browse the graph. Lets pick the sort of random first author from that list:
db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[
{
"_key" : "1cec812d4e44b95e5a11f3cbb15f7980",
"_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_rev" : "_W_Eu-----_",
"author" : "punchyourbuns"
}
]
We identified an author, and now run a graph query for him:
db._query(`FOR vertex, edge, path IN 0..1
OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
GRAPH 'reddits'
RETURN path`).toArray()
One of the resulting paths looks like that:
{
"edges" : [
{
"_key" : "128327199",
"_id" : "authorsToSubreddits/128327199",
"_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_to" : "RawSubReddits/38026350",
"_rev" : "_W_LOxgm--F"
}
],
"vertices" : [
{
"_key" : "1cec812d4e44b95e5a11f3cbb15f7980",
"_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980",
"_rev" : "_W_HAL-y--_",
"author" : "punchyourbuns"
},
{
"_key" : "38026350",
"_id" : "RawSubReddits/38026350",
"_rev" : "_W-JS0na--b",
"distinguished" : null,
"created_utc" : 1484537478,
"id" : "dchfe6e",
"edited" : false,
"parent_id" : "t1_dch51v3",
"body" : "I don't understand tension at all."
"Mine is set to auto."
"I'll replace the needle and rethread. Thanks!",
"stickied" : false,
"gilded" : 0,
"subreddit" : "sewing",
"author" : "punchyourbuns",
"score" : 3,
"link_id" : "t3_5o66d0",
"author_flair_text" : null,
"author_flair_css_class" : null,
"controversiality" : 0,
"retrieved_on" : 1486085797,
"subreddit_id" : "t5_2sczp"
}
]
}