Sorting term IDs

Working with HPO typically involves items (e.g. patients) annotated with HPO terms. However, the annotations are rarely sorted in any meaningful order, which can make them harder to interpret. HPO toolkit provides logic for sorting HPO terms so that similar terms are placed close to each other.

Let’s illustrate this with an example. Suppose we have a subject annotated with the following terms:

>>> import hpotk
>>> subject = (
...   'HP:0001744',  # Splenomegaly
...   'HP:0020221',  # Clonic seizure
...   'HP:0001238',  # Slender finger
...   'HP:0011153',  # Focal motor seizure
...   'HP:0002240'   # Hepatomegaly
... )
>>> term_ids = tuple(hpotk.TermId.from_curie(curie) for curie in subject)

The order of HPO annotations does not reflect that Splenomegaly is more “similar” to Hepatomegaly than to Clonic seizure. The implementations of hpotk.util.sort.TermIdSorting endeavor to improve on this.

The sorting logic is handled by hpotk.util.sort.TermIdSorting implementations. The algorithm takes a sequence of term IDs or hpotk.model.Identified entities, such as hpotk.model.Term, and returns the indices that would sort the input sequence, just like numpy.argsort() does.
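To make the "indices, not values" contract concrete, here is a minimal pure-Python sketch of argsort semantics. The argsort helper below is a hypothetical stand-in written for illustration; it is not part of hpotk and it sorts plain strings alphabetically rather than by term similarity.

```python
def argsort(values):
    """Return the indices that would sort `values` (illustrative helper, not hpotk API)."""
    return sorted(range(len(values)), key=values.__getitem__)


labels = ('c', 'a', 'b')

# The result holds positions in the input, not the items themselves.
indices = argsort(labels)

# Applying the indices to the input yields the sorted sequence.
ordered = tuple(labels[idx] for idx in indices)

print(indices)  # [1, 2, 0]
print(ordered)  # ('a', 'b', 'c')
```

Returning indices instead of sorted items means that the same indices can be used to reorder any parallel sequence, such as term labels stored next to the term IDs.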

Hierarchical sorting

hpotk.util.sort.HierarchicalEdgeTermIdSorting sorts the term IDs using a combination of hierarchical clustering and graph edge distance. The algorithm iteratively chooses the most similar term ID pairs and places them into adjacent locations.
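The general idea can be sketched in plain Python. The code below is only a rough illustration under several assumptions and is not hpotk's implementation: the PARENTS hierarchy is a made-up single-parent tree (the real HPO is a graph where a term can have several parents), the distance between two terms is taken as the number of edges on the path through their closest common ancestor, and clusters are merged greedily by the smallest distance between any two of their members. All names in the block are hypothetical.

```python
# Hypothetical toy hierarchy: child -> parent. Not the real HPO structure.
PARENTS = {
    'root': None,
    'abdomen': 'root',
    'nervous system': 'root',
    'limbs': 'root',
    'seizure': 'nervous system',
    'splenomegaly': 'abdomen',
    'hepatomegaly': 'abdomen',
    'clonic seizure': 'seizure',
    'focal motor seizure': 'seizure',
    'slender finger': 'limbs',
}


def path_to_root(node):
    """List the node and all of its ancestors, ending with the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENTS[node]
    return path


def edge_distance(a, b):
    """Count the edges between `a` and `b` via their closest common ancestor."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for steps_from_b, node in enumerate(path_to_root(b)):
        if node in steps_from_a:
            return steps_from_a[node] + steps_from_b
    raise ValueError('The nodes do not share an ancestor')


def greedy_order(items):
    """Repeatedly merge the two closest clusters and keep their members adjacent."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance: the smallest distance between any two members.
                d = min(edge_distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


subject_labels = (
    'splenomegaly', 'clonic seizure', 'slender finger',
    'focal motor seizure', 'hepatomegaly',
)
result = greedy_order(subject_labels)
print(result)
```

In this sketch, the two seizure labels end up next to each other, and so do splenomegaly and hepatomegaly, because each of those pairs is only two edges apart in the toy hierarchy.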

We’ll use a toy HPO with several terms to present the functionality:

>>> import os
>>> fpath_hpo = os.path.join('docs', 'data', 'hp.toy.json')
>>> hpo = hpotk.load_minimal_ontology(fpath_hpo)

>>> from hpotk.util.sort import HierarchicalEdgeTermIdSorting
>>> sorting = HierarchicalEdgeTermIdSorting(hpo)

We can obtain the indices that will sort the HPO terms and prepare a tuple with sorted terms:

>>> indices = sorting.argsort(term_ids)
>>> ordered = tuple(term_ids[idx] for idx in indices)

Now let’s look at the order. Originally, the HPO terms were ordered as follows:

'HP:0001744'   # Splenomegaly
'HP:0020221'   # Clonic seizure
'HP:0001238'   # Slender finger
'HP:0011153'   # Focal motor seizure
'HP:0002240'   # Hepatomegaly

After the sorting, we get this order:

>>> for term_id in ordered:
...   print(hpo.get_term(term_id).name)
Focal motor seizure
Clonic seizure
Hepatomegaly
Splenomegaly
Slender finger

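Now the two seizure terms are adjacent, and so are Hepatomegaly and Splenomegaly, which makes the annotations much easier to interpret.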