I'm trying to generate a network graph that visualises data lineage (cluster graph such as this). Please keep in mind that I'm very new to the NetworkX library and that the code below might be far from optimal.
My data consist of 2 Pandas dataframes:
- df_objs: this df contains a UUID and name of the different items (these will become the nodes).
- df_calls: this df contains a calling and called UUID (these UUIDs are references to the UUIDs of the items in df_objs).
Here's what I do to initialise the directed graph and create the nodes:
import networkx as nx
objs = df_objs.set_index('uuid').to_dict(orient='index')
g = nx.DiGraph()
for obj_id, obj_attrs in objs.items():
    g.add_node(obj_id, attr_dict=obj_attrs)
And to generate the edges:
g.add_edges_from(df_calls.drop_duplicates().to_dict(orient='split')['data'])
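For reference, the node and edge setup above can be sketched on toy data without pandas. The UUIDs, names, and the `objs`/`calls` variables here are hypothetical stand-ins for df_objs and df_calls, and `g.nodes[...]` assumes NetworkX 2.0 or later. `add_nodes_from` accepts (node, attribute dict) pairs, which avoids depending on how a particular version treats the `attr_dict` keyword.

```python
import networkx as nx

# Hypothetical stand-ins for df_objs and df_calls, as plain Python data.
objs = {
    "a": {"name": "source_table"},
    "b": {"name": "staging_view"},
    "c": {"name": "report"},
}
calls = [("a", "b"), ("b", "c"), ("a", "b")]  # the last pair is a duplicate

g = nx.DiGraph()
# Each item is a (node, attribute_dict) pair.
g.add_nodes_from(objs.items())
# A DiGraph keeps one edge per (calling, called) pair, so the duplicate collapses.
g.add_edges_from(calls)

node_name = g.nodes["a"]["name"]
edge_count = g.number_of_edges()
print(node_name, edge_count)
```

Because a DiGraph stores each directed pair only once, the drop_duplicates() call is probably not strictly required for correctness, although it does no harm.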
Next, I want to know the lineage of a single item using its UUID:
g_tree = nx.DiGraph(nx.bfs_edges(g, 'f6e214b1bba34a01bd0c18f232d6aee2', reverse=True))
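To make the effect of reverse=True concrete, here is a small sketch on a made-up graph (the node labels are hypothetical). With reverse=True the search follows predecessors, and the edges come back oriented away from the starting node, so the resulting DiGraph is rooted at that node and only contains its upstream items.

```python
import networkx as nx

# Toy call graph: a and b feed c, c feeds d, d feeds e.
g = nx.DiGraph()
g.add_edges_from([("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])

# Walk upstream from "d"; edges are yielded as (closer to root, further from root).
lineage_edges = list(nx.bfs_edges(g, "d", reverse=True))
g_tree = nx.DiGraph(lineage_edges)

print(lineage_edges)
print(sorted(g_tree.nodes))
```

Note that the new graph is built from bare edge tuples only, so it starts out without any of the node attributes stored on g.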
So far so good. The last step is to generate the JSON graph so that I can feed the resulting JSON file to D3.js in order to perform the visualisation:
# Create the JSON data structure
from networkx.readwrite import json_graph
data = json_graph.tree_data(g_tree, root='f6e214b1bba34a01bd0c18f232d6aee2')
# Write the tree to a JSON file
import json
with open('./tree.json', 'w') as f:
    json.dump(data, f)
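As an illustration of the structure that tree_data() produces, the sketch below runs it on a tiny hand-made tree (hypothetical node labels) and serialises the result to a string instead of a file. It assumes the default key names, "id" and "children".

```python
import json

import networkx as nx
from networkx.readwrite import json_graph

# A small tree rooted at "d", with edges pointing away from the root.
g_tree = nx.DiGraph([("d", "c"), ("c", "a"), ("c", "b")])

# Nested dicts: every node gets an "id", and parents get a "children" list.
data = json_graph.tree_data(g_tree, root="d")
text = json.dumps(data, indent=2)
print(text)
```

Any attributes stored on the nodes of the tree are merged into the corresponding dict, which is why a missing 'name' attribute leaves only the UUID in the output.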
All of the above works. However, instead of the node names, I'm left with the UUIDs in the JSON data, because the node attributes are dropped in the call to nx.bfs_edges().
Example:
Not a problem (at least that's what I thought); I'll just update the nodes in g_tree with the attributes from g.
obj_names = nx.get_node_attributes(g, 'name')
for obj_id, obj_name in obj_names.items():
    try:
        g_tree[obj_id]['name'] = obj_name
    except Exception:
        pass
Note: I can't use set_node_attributes() as g contains more nodes than g_tree, which causes a KeyError.
If I then try to generate the JSON data again:
data = json_graph.tree_data(g_tree, root='f6e214b1bba34a01bd0c18f232d6aee2')
it will throw the error:
TypeError: G is not a tree.
This is because the number of nodes != the number of edges + 1.
Before setting the attributes, the number of nodes was 81 and the number of edges 80. After setting the attributes, the number of edges increased to 120 (number of nodes remained the same).
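A possibly relevant detail, sketched below on a toy tree: subscripting a graph with a node, as in G[n], gives that node's neighbours (the adjacency), whereas the node's attribute dict lives under G.nodes[n]. The sketch assumes NetworkX 2.0 or later and hypothetical node labels; whether this fully explains the extra edges depends on the installed version, so treat it as a hint rather than a confirmed diagnosis.

```python
import networkx as nx

g_tree = nx.DiGraph([("d", "c"), ("c", "a"), ("c", "b")])

# Subscripting the graph returns the neighbours of "d" and their edge data.
neighbours = dict(g_tree["d"])
# The node's own attributes are stored separately.
attributes = dict(g_tree.nodes["d"])

# Writing to the attribute dict leaves the edge count alone.
g_tree.nodes["d"]["name"] = "report"
looks_like_tree = g_tree.number_of_nodes() == g_tree.number_of_edges() + 1

print(neighbours, attributes, looks_like_tree)
```

The final check mirrors the nodes == edges + 1 condition behind the "G is not a tree" error.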
OK, as for my questions:
- Am I taking the long way around and is there a much shorter/better/faster way to generate the same result?
- What is causing the number of edges to increase when I'm only setting the attributes for nodes?
- Is there a way to retain the node attributes when trying to generate the tree?
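On the last question, one possible approach (a sketch, not necessarily the shortest route) is to copy the attribute dicts across only for the nodes that actually ended up in the tree. The data below is made up, and `G.nodes[n]` assumes NetworkX 2.0 or later.

```python
import networkx as nx
from networkx.readwrite import json_graph

# Toy full graph with a 'name' attribute on every node.
g = nx.DiGraph()
g.add_nodes_from([
    ("a", {"name": "source_table"}),
    ("c", {"name": "staging_view"}),
    ("d", {"name": "report"}),
    ("e", {"name": "unrelated"}),
])
g.add_edges_from([("a", "c"), ("c", "d"), ("d", "e")])

# Upstream lineage of "d"; the new graph starts without attributes.
g_tree = nx.DiGraph(list(nx.bfs_edges(g, "d", reverse=True)))

# Iterate over the tree's nodes only, so nodes missing from g_tree never come up.
for node in g_tree.nodes:
    g_tree.nodes[node].update(g.nodes[node])

data = json_graph.tree_data(g_tree, root="d")
print(data)
```

Iterating over g_tree rather than g sidesteps the missing-node problem, and writing through `g_tree.nodes[node]` only touches attributes, so the node and edge counts stay as they were.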
