5

I'm looking for a way to get the coordinates of a cluster point in the dendrogram plot based on its ClusterNode return by to_tree.

Using scipy to build a dendogram from data such as:

X = data
Y = pdist(X)
Z = linkage(Y)
dend = dendrogram(Z)
rootnode, nodesList = to_tree(Z, rd=True)

What I would like to do is build a function get_coords(somClusterNode) that would return the tuple (x, y) specifying the position of the node in the plot.

Thanks to this answer, I managed to figure out how to get the position from the dendrogram return values, such as:

i, d = list(zip(dend['icoord'], dend['dcoord']))[-1]
x = 0.5 * sum(i[1:3])
y = d[1]
plt.plot(x, y, 'ro')

But I can figure out a relation between the nodesList ordering and the icoord/dcoord ordering in order to map one to the other.

Do you have any idea where I could look for ?

Thanks for your help !

6
  • What version of scipy are you using? When I try to run your code I get an error: ValueError: Valid methods when the raw observations are omitted are 'single', 'complete', 'weighted', and 'average'. Are you sure that line 3 should not be Z = linkage(X, method="ward")? Commented Apr 20, 2017 at 10:01
  • I use SciPy v.0.19.0 with python v.3.5.2 Commented Apr 20, 2017 at 10:23
  • It seems that both are compatible : "The input y may be either a 1d compressed distance matrix or a 2d array of observation vectors." in docs.scipy.org/doc/scipy/reference/generated/… Commented Apr 20, 2017 at 10:27
  • Yeah, but if you specify method="ward", my version will only work when providing the original observations. Upgrading my scipy installation to see if the issue persists... Commented Apr 20, 2017 at 10:32
  • ok. Then any method will do fine. I don't think my question depend on the linkage method I've chosen. I edited the post accordingly. Commented Apr 20, 2017 at 10:34

2 Answers 2

3

Each dendrogram maps to only one tree of ClusterNodes, but any tree of ClusterNodes could map to an infinite number of dendrograms. Hence your mapping from node ID to (x,y) positions should probably just be another field in your dendrogram data structure instead of being a function of a ClusterNode. Instead of defining a function get_coords, I hence appends a dictionary to dend that maps node IDs to (x,y) coordinates. You can access the positions with

x,y = dend['node_id_to_coord'][node_id] # node_id is an integer as returned by ClusterNode.id

Code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, to_tree
from scipy.spatial.distance import pdist

# create some random data
X = np.random.rand(10, 3)

# get dendrogram
Z = linkage(pdist(X), method="ward")
dend = dendrogram(Z)

# ----------------------------------------
# get leave coordinates, which are at y == 0

def flatten(l):
    return [item for sublist in l for item in sublist]
X = flatten(dend['icoord'])
Y = flatten(dend['dcoord'])
leave_coords = [(x,y) for x,y in zip(X,Y) if y==0]

# in the dendogram data structure,
# leave ids are listed in ascending order according to their x-coordinate
order = np.argsort([x for x,y in leave_coords])
id_to_coord = dict(zip(dend['leaves'], [leave_coords[idx] for idx in order])) # <- main data structure

# ----------------------------------------
# get coordinates of other nodes

# this should work but doesn't:

# # traverse tree from leaves upwards and populate mapping ID -> (x,y);
# # use linkage matrix to traverse the tree optimally
# # (each row in the linkage matrix corresponds to a row in dend['icoord'] and dend['dcoord'])
# root_node, node_list = to_tree(Z, rd=True)
# for ii, (X, Y) in enumerate(zip(dend['icoord'], dend['dcoord'])):
#     x = (X[1] + X[2]) / 2
#     y = Y[1] # or Y[2]
#     node_id = ii + len(dend['leaves'])
#     id_to_coord[node_id] = (x, y)

# so we need to do it the hard way:

# map endpoint of each link to coordinates of parent node
children_to_parent_coords = dict()
for i, d in zip(dend['icoord'], dend['dcoord']):
    x = (i[1] + i[2]) / 2
    y = d[1] # or d[2]
    parent_coord = (x, y)
    left_coord = (i[0], d[0])
    right_coord = (i[-1], d[-1])
    children_to_parent_coords[(left_coord, right_coord)] = parent_coord

# traverse tree from leaves upwards and populate mapping ID -> (x,y)
root_node, node_list = to_tree(Z, rd=True)
ids_left = range(len(dend['leaves']), len(node_list))

while len(ids_left) > 0:

    for ii, node_id in enumerate(ids_left):
        node = node_list[node_id]
        if (node.left.id in id_to_coord) and (node.right.id in id_to_coord):
            left_coord = id_to_coord[node.left.id]
            right_coord = id_to_coord[node.right.id]
            id_to_coord[node_id] = children_to_parent_coords[(left_coord, right_coord)]

    ids_left = [node_id for node_id in range(len(node_list)) if not node_id in id_to_coord]

# plot result on top of dendrogram
ax = plt.gca()
for node_id, (x, y) in id_to_coord.iteritems():
    if not node_list[node_id].is_leaf():
        ax.plot(x, y, 'ro')
        ax.annotate(str(node_id), (x, y), xytext=(0, -8),
                    textcoords='offset points',
                    va='top', ha='center')

dend['node_id_to_coord'] = id_to_coord

enter image description here

Sign up to request clarification or add additional context in comments.

7 Comments

Thank you so much ! It was days I was trying to solve this ! Ty, ty, ty !
I tried to edit to change .iteritems() to .items() to make it compatible with python3 however, stack doesn't allow edit inferior to 6 characters.
My pleasure. I will make the edit in a second. Also, I just figured out that I could use the linkage matrix to traverse the tree more effectively. Edit incoming.
Ok, this version is much simpler now. I somehow hadn't realised that each link in the linkage matrix corresponds to the dendrogram['icoord'] and dendrogram['dcoords'].
Yeah, you are right, the node IDs sometimes end up in all the wrong places. I will roll back, too.
|
1

There is another way to do it:

The ids of the dendogram appear to be generated by a reversed right to left traversal of the tree. This allows us to construct a translation from node.id to the index of icoord and dcoord as follows:

def rl_traversal(node):
    # skipping leaves
    if not node.is_leaf():
        yield node.id
        yield from rl_traversal(node.right)
        yield from rl_traversal(node.left)

id_map = dict(zip( rl_traversal(root), reversed(range(root.get_count()-1))) ))
# id_map[node_id] = dendogram_id

The coordinates of a node can then be obtained via dendo['icoord'][id_map[node_id]]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.