
Background

I have a rocksdb collection that contains three fields: _id, author, subreddit.

Problem

I would like to create an ArangoDB graph connecting these two existing columns, but the examples and the drivers seem to accept only collections as edge definitions.

Issue

The ArangoDB documentation lacks information on how to create a graph using edges and nodes pulled from the same collection.

EDIT:

Solution

This was fixed with a code change in this ArangoDB issue ticket.

4 Answers

3

Here's one way to do it using jq, a JSON-oriented command-line tool.

First, an outline of the steps:

1) Use arangoexport to export your author/subredit collection to a file, say, exported.json;

2) Run the jq script, nodes_and_edges.jq, shown below;

3) Use arangoimp to import the JSON produced in (2) into ArangoDB.

There are several ways the graph can be stored in ArangoDB, so ultimately you might wish to tweak nodes_and_edges.jq accordingly (e.g. to generate the nodes first, and then the edges).

INDEX

If your jq does not have INDEX defined, then use this:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|
      if type != "string" then tojson
      else .
      end] |= $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);

nodes_and_edges.jq

# This module is for generating JSON suitable for importing into ArangoDB.

### Generic Functions

# assign_keys/2
# Adds a generated "_key" to each object in the input array.
# It is defined before nodes/2 because jq requires a function
# to be defined before it is used.
def assign_keys(prefix; start):
  . as $in
  | reduce range(0;length) as $i ([];
    . + [$in[$i] + {"_key": "\(prefix)\(start+$i)"}]);

# nodes/2
# $name must be the name of the ArangoDB collection of nodes corresponding to $key.
# The scheme for generating key names can be altered by changing the first
# argument of assign_keys, e.g. to "" if no prefix is wanted.
def nodes($key; $name):
  map( {($key): .[$key]} ) | assign_keys($name[0:1] + "_"; 1);

# nodes_and_edges facilitates the normalization of an implicit graph
# in an ArangoDB "document" collection of objects having $from and $to keys.
# The input should be an array of JSON objects, as produced 
# by arangoexport for a single collection.
# If $nodesq is truthy, then the JSON for both the nodes and edges is emitted,
# otherwise only the JSON for the edges is emitted.
# 
# The first four arguments should be strings.
# 
# $from and $to should be the key names in . to be used for the from-to edges;
# $name1 and $name2 should be the names of the corresponding collections of nodes.
def nodes_and_edges($from; $to; $name1; $name2; $nodesq ):
  def dict($s): INDEX(.[$s]) | map_values(._key);
  # Emit one node object per dictionary entry, using $k as the attribute name.
  def objects($k): to_entries[] | {($k): .key, "_key": .value};
  (nodes($from; $name1) | dict($from)) as $fdict
  | (nodes($to; $name2) | dict($to)  ) as $tdict
  | (if $nodesq then ($fdict | objects($from)), ($tdict | objects($to))
     else empty end),
    (.[] | {_from: "\($name1)/\($fdict[.[$from]])",
            _to:   "\($name2)/\($tdict[.[$to]])"} )  ;


### Problem-Specific Functions

# If you wish to generate the collections separately,
# then these will come in handy:
def authors: nodes("author"; "authors");
def subredits: nodes("subredit"; "subredits");

def nodes_and_edges:
  nodes_and_edges("author"; "subredit"; "authors"; "subredits"; true);

nodes_and_edges

Invocation

jq -cf nodes_and_edges.jq exported.json

This invocation produces JSON Lines (JSONL) output: first the "authors" nodes, then the "subredits" nodes, and finally the edges.

Example

exported.json
[
  {"_id":"test/115159","_key":"115159","_rev":"_V8JSdTS---","author": "A", "subredit": "S1"},
  {"_id":"test/145120","_key":"145120","_rev":"_V8ONdZa---","author": "B", "subredit": "S2"},
  {"_id":"test/114474","_key":"114474","_rev":"_V8JZJJS---","author": "C", "subredit": "S3"}
]

Output

{"author":"A","_key":"a_1"}
{"author":"B","_key":"a_2"}
{"author":"C","_key":"a_3"}
{"subredit":"S1","_key":"s_1"}
{"subredit":"S2","_key":"s_2"}
{"subredit":"S3","_key":"s_3"}
{"_from":"authors/a_1","_to":"subredits/s_1"}
{"_from":"authors/a_2","_to":"subredits/s_2"}
{"_from":"authors/a_3","_to":"subredits/s_3"}
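
For readers without jq, the same normalization can be sketched in plain JavaScript (Node.js). This is a minimal illustration, not a port of the jq script: the function name normalize and its option names are made up here, it assumes the whole export fits in memory, and it numbers keys by the first appearance of each distinct value.

```javascript
// Hypothetical helper, not part of ArangoDB or jq: split an array of
// {author, subredit} records into two node lists and one edge list.
function normalize(rows, { from, to, fromColl, toColl }) {
  // Map each distinct value to a generated _key, numbered by first appearance.
  const assignKeys = (field, prefix) => {
    const keys = new Map();
    for (const row of rows) {
      const value = row[field];
      if (!keys.has(value)) keys.set(value, `${prefix}_${keys.size + 1}`);
    }
    return keys;
  };
  const fromKeys = assignKeys(from, fromColl[0]);
  const toKeys = assignKeys(to, toColl[0]);

  // Turn a value-to-key map into node documents.
  const asNodes = (keys, field) =>
    [...keys].map(([value, _key]) => ({ [field]: value, _key }));

  return {
    fromNodes: asNodes(fromKeys, from),
    toNodes: asNodes(toKeys, to),
    edges: rows.map((row) => ({
      _from: `${fromColl}/${fromKeys.get(row[from])}`,
      _to: `${toColl}/${toKeys.get(row[to])}`,
    })),
  };
}

// Made-up sample data; author "A" appears twice on purpose.
const exported = [
  { author: "A", subredit: "S1" },
  { author: "B", subredit: "S2" },
  { author: "A", subredit: "S3" },
];
const graph = normalize(exported, {
  from: "author", to: "subredit", fromColl: "authors", toColl: "subredits",
});
// Print one JSON document per line (JSONL).
for (const doc of [...graph.fromNodes, ...graph.toNodes, ...graph.edges]) {
  console.log(JSON.stringify(doc));
}
```

Because keys are assigned per distinct value, the repeated author "A" yields a single node and two edges that start from it.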

Comments

2

Please note that the following queries take a while to complete on this huge dataset; however, they should complete successfully after some hours.

We start by running arangoimp to import our base dataset:

arangoimp --create-collection true  --collection RawSubReddits --type jsonl ./RC_2017-01 

We use arangosh to create the collections in which our final data is going to live:

db._create("authors")
db._createEdgeCollection("authorsToSubreddits")

We fill the authors collection, simply ignoring any subsequently occurring duplicate authors. We calculate the _key of each author with the MD5() function, so that it obeys the restrictions on the characters allowed in _key, and so that we can recompute it later by calling MD5() again on the author field:

db._query(`
  FOR item IN RawSubReddits
    INSERT {
      _key: MD5(item.author),
      author: item.author
      } INTO authors
        OPTIONS { ignoreErrors: true }`);
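
The reason MD5() works as a key is that it is deterministic and always yields 32 hexadecimal characters. The Node.js sketch below mirrors that idea outside the database. The names authorKey and uniqueAuthors are hypothetical, and the sketch assumes that AQL's MD5() returns the same lowercase hex digest as Node's crypto module.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive a document key from an author name.
// Assumption: the hex digest matches what AQL's MD5() would produce.
function authorKey(author) {
  return crypto.createHash("md5").update(author, "utf8").digest("hex");
}

// Keep the first document per key, analogous to inserting with
// ignoreErrors so that later duplicates are skipped.
function uniqueAuthors(rows) {
  const byKey = new Map();
  for (const { author } of rows) {
    const _key = authorKey(author);
    if (!byKey.has(_key)) byKey.set(_key, { _key, author });
  }
  return [...byKey.values()];
}

// Made-up sample data with one repeated author.
const authors = uniqueAuthors([
  { author: "hello" },
  { author: "world" },
  { author: "hello" },
]);
console.log(authors.length); // 2: the repeated author is skipped
```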

After we have filled the second vertex collection (we will keep the imported collection as the first vertex collection), we have to calculate the edges. Since each author can have created several of the imported documents, there will most probably be several edges originating from each author. As previously mentioned, we can use the MD5() function again to reference the author created earlier:

db._query(`
  FOR onesubred IN RawSubReddits
    INSERT {
      _from: CONCAT('authors/', MD5(onesubred.author)),
      _to: CONCAT('RawSubReddits/', onesubred._key)
    } INTO authorsToSubreddits`);
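
Because the author key is recomputable, each edge can be built from the raw document alone, with no lookup in the authors collection. Here is a small Node.js sketch of that construction. The name toEdge and the sample documents are invented for illustration, and the collection names are the ones used in this answer.

```javascript
const crypto = require("crypto");

const md5 = (text) =>
  crypto.createHash("md5").update(text, "utf8").digest("hex");

// Hypothetical helper: build the edge document for one raw document.
// _from is recomputed from the author name; _to reuses the raw _key.
function toEdge(raw) {
  return {
    _from: `authors/${md5(raw.author)}`,
    _to: `RawSubReddits/${raw._key}`,
  };
}

// Made-up sample documents.
const raw = [
  { _key: "1001", author: "alice", subreddit: "sewing" },
  { _key: "1002", author: "alice", subreddit: "knitting" },
  { _key: "1003", author: "bob", subreddit: "sewing" },
];
const edges = raw.map(toEdge);

// Both of alice's documents originate from the same author vertex.
console.log(edges[0]._from === edges[1]._from); // true
console.log(edges[0]._from === edges[2]._from); // false
```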

After the edge collection is filled (which may again take a while; we're talking about 40 million edges here, right?), we create the graph description:

db._graphs.save({
  "_key": "reddits",
  "orphanCollections" : [ ],
  "edgeDefinitions" : [ 
    {
      "collection": "authorsToSubreddits",
      "from": ["authors"],
      "to": ["RawSubReddits"]
    }
  ]
})
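
An edge definition describes which collections the edges connect. As a rough illustration in plain JavaScript (this is not an ArangoDB API, and conformsTo is a hypothetical name), an edge fits the definition above when the collection part of its _from is listed in "from" and the collection part of its _to is listed in "to":

```javascript
const edgeDefinition = {
  collection: "authorsToSubreddits",
  from: ["authors"],
  to: ["RawSubReddits"],
};

// Document ids have the form "<collection>/<key>".
const collectionOf = (id) => id.split("/")[0];

// Hypothetical check, not an ArangoDB function.
function conformsTo(edge, def) {
  return def.from.includes(collectionOf(edge._from)) &&
         def.to.includes(collectionOf(edge._to));
}

// Made-up sample edges.
const good = { _from: "authors/abc", _to: "RawSubReddits/123" };
const reversed = { _from: "RawSubReddits/123", _to: "authors/abc" };
console.log(conformsTo(good, edgeDefinition));     // true
console.log(conformsTo(reversed, edgeDefinition)); // false
```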

We can now use the UI or AQL queries to browse the graph. Let's pick the first author from the list, which is more or less a random one:

db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[ 
  { 
    "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
    "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
    "_rev" : "_W_Eu-----_", 
    "author" : "punchyourbuns" 
  } 
]

We have identified an author and now run a graph query for them:

db._query(`FOR vertex, edge, path IN 0..1
   OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
   GRAPH 'reddits'
   RETURN path`).toArray()
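
The range 0..1 asks for paths of depth 0 (just the start vertex) and depth 1 (the start vertex plus one outgoing edge). The in-memory JavaScript sketch below imitates that for a toy edge list. It is only meant to show the shape of the result; outboundPaths and the sample ids are invented for illustration.

```javascript
// Hypothetical in-memory imitation of a 0..1 OUTBOUND traversal.
function outboundPaths(startId, vertices, edges) {
  const start = vertices.get(startId);
  // Depth 0: the path consisting of the start vertex alone.
  const paths = [{ edges: [], vertices: [start] }];
  // Depth 1: one path per edge leaving the start vertex.
  for (const edge of edges) {
    if (edge._from === startId) {
      paths.push({ edges: [edge], vertices: [start, vertices.get(edge._to)] });
    }
  }
  return paths;
}

// Made-up sample graph: one author with two outgoing edges.
const vertices = new Map([
  ["authors/a1", { _id: "authors/a1", author: "alice" }],
  ["RawSubReddits/1", { _id: "RawSubReddits/1", subreddit: "sewing" }],
  ["RawSubReddits/2", { _id: "RawSubReddits/2", subreddit: "knitting" }],
]);
const edges = [
  { _from: "authors/a1", _to: "RawSubReddits/1" },
  { _from: "authors/a1", _to: "RawSubReddits/2" },
];

const paths = outboundPaths("authors/a1", vertices, edges);
console.log(paths.length); // 3: one depth-0 path and two depth-1 paths
```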

One of the resulting paths looks like this:

{ 
  "edges" : [ 
    { 
      "_key" : "128327199", 
      "_id" : "authorsToSubreddits/128327199", 
      "_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_to" : "RawSubReddits/38026350", 
      "_rev" : "_W_LOxgm--F" 
    } 
  ], 
  "vertices" : [ 
    { 
      "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
      "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_rev" : "_W_HAL-y--_", 
      "author" : "punchyourbuns" 
    }, 
    { 
      "_key" : "38026350", 
      "_id" : "RawSubReddits/38026350", 
      "_rev" : "_W-JS0na--b", 
      "distinguished" : null, 
      "created_utc" : 1484537478, 
      "id" : "dchfe6e", 
      "edited" : false, 
      "parent_id" : "t1_dch51v3", 
      "body" : "I don't understand tension at all. Mine is set to auto. I'll replace the needle and rethread. Thanks!", 
      "stickied" : false, 
      "gilded" : 0, 
      "subreddit" : "sewing", 
      "author" : "punchyourbuns", 
      "score" : 3, 
      "link_id" : "t3_5o66d0", 
      "author_flair_text" : null, 
      "author_flair_css_class" : null, 
      "controversiality" : 0, 
      "retrieved_on" : 1486085797, 
      "subreddit_id" : "t5_2sczp" 
    } 
  ] 
}

Comments

2

For a graph you need an edge collection for the edges and vertex collections for the nodes. You can't create a graph using only one collection.

Maybe this topic in the documentation is helpful for you.

4 Comments

Ok, are you saying that I can't create a graph using two distinct parts of a denormalized collection?
@GabrielFair, graphs in Arango, at their simplest, are like join tables. Vertex is one table. Edges are another table. You can join through the table back into the Vertex. You can't use the Graph nature with a collection that's like a relational table of Employee with a column of Manager that is in the Employee table. You can store that kind of document, but the Graph features won't use it.
Thank you for explaining that. The documentation did mention this but it wasn't this clear. My confusion was from a false idea that I could pick out collection attributes to create a graph without isolating those variables first in their own collection.
For graphs in ArangoDB there has to be at least one document / vertex collection and one edge collection. There are other options for certain data models and use cases, granted that you don't need graph traversal etc., like a simple form of join: docs.arangodb.com/devel/AQL/Tutorial/Join.html or more complex joins: arangodb.com/why-arangodb/sql-aql-comparison (scroll down to JOINS headline). In your case, you probably wanna go for a graph, because that allows you to easily retrieve all follow-up posts on a given initial post (graph traversal with variable depth).
1

Here's an AQL solution, which however presupposes that all the referenced collections already exist, and that UPSERT is not necessary.

FOR v IN testcollection
  LET a = v.author
  LET s = v.subredit
  FILTER a
  FILTER s
  LET fid = (INSERT {author: a}   INTO authors RETURN NEW._id)[0]
  LET tid = (INSERT {subredit: s} INTO subredits RETURN NEW._id)[0]
  INSERT {_from: fid, _to: tid} INTO author_of
  RETURN [fid, tid]

Comments
