upgrade_prefix_map

upgrade_prefix_map(prefix_map: Mapping[str, str]) → List[Record]

Convert a (potentially problematic) prefix map (i.e., not bijective) into a list of records.

A prefix map is bijective if it has no duplicate CURIE prefixes (i.e., keys in a dictionary) and no duplicate URI prefixes (i.e., values in a dictionary). Because of the way that dictionaries work in Python, we are always guaranteed that there are no duplicate keys.

However, it is both possible and frequent to have duplicate values. This happens because many semantic spaces have multiple synonymous CURIE prefixes. For example, the OBO in OWL vocabulary has two common, interchangeable prefixes: oio and oboInOwl (and the case variant oboinowl). Therefore, a prefix map might contain the following parts that make it non-bijective:

{
  "oio": "http://www.geneontology.org/formats/oboInOwl#",
  "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#"
}

This is bad because this prefix map can’t be used to deterministically compress a URI. For example, should http://www.geneontology.org/formats/oboInOwl#hasDbXref be compressed to oio:hasDbXref or oboInOwl:hasDbXref? Neither is necessarily incorrect, but the data modeler has not made an explicit choice, meaning that data compressed into CURIEs with this non-bijective map might not be readily integrable with other datasets.
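A quick way to see whether a given prefix map is affected is to group its CURIE prefixes by URI prefix and keep the groups with more than one member. The following is a minimal sketch using only the standard library; the helper name find_duplicate_uri_prefixes and the extra rdfs entry are illustrative assumptions and are not part of curies.

```python
from collections import defaultdict


def find_duplicate_uri_prefixes(prefix_map):
    """Return the URI prefixes that more than one CURIE prefix points to.

    Hypothetical helper for illustration; not part of the curies API.
    """
    by_uri_prefix = defaultdict(list)
    for curie_prefix, uri_prefix in prefix_map.items():
        by_uri_prefix[uri_prefix].append(curie_prefix)
    # Keep only the URI prefixes shared by two or more CURIE prefixes
    return {
        uri_prefix: curie_prefixes
        for uri_prefix, curie_prefixes in by_uri_prefix.items()
        if len(curie_prefixes) > 1
    }


prefix_map = {
    "oio": "http://www.geneontology.org/formats/oboInOwl#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
duplicates = find_duplicate_uri_prefixes(prefix_map)
# An empty result would mean the prefix map is bijective
is_bijective = not duplicates
```

Here, duplicates contains a single entry for the OBO in OWL URI prefix, listing both oio and oboInOwl.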

The best solution to this situation is not more code, but rather for the data modeler to address the issue upstream in the following steps:

1. Choose which of the prefix synonyms is going to be the primary prefix. If you’re not sure, the Bioregistry is a comprehensive registry of prefixes and their synonyms applicable in the semantic web and the natural sciences. It gives a good suggestion of what the best prefix is. In the OBO in OWL case, it suggests oboInOwl.
2. Update all related data artifacts to only use that preferred prefix.
3. Either 1) remove the other synonyms (in this example, oio) from the prefix map or 2) transition to using Extended Prefix Maps, a more modern data structure for supporting URI and CURIE interconversion.

The first part of step 3 in this solution highlights one of the key shortcomings of prefix maps themselves - they can’t keep track of synonyms, which are often useful in data integration, especially when a single prefix map is defined on the level of a project or community. The extended prefix map is a simple data structure proposed to address this.
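To make the contrast concrete, the OBO in OWL example could be written as an extended prefix map roughly as follows. This is a sketch using plain Python data: the field names prefix, prefix_synonyms, and uri_prefix follow the record fields described in the curies documentation, but the lookup tables built afterwards are only an illustration of the idea and do not reflect how curies works internally.

```python
# One record per semantic space: a primary prefix plus any synonyms
extended_prefix_map = [
    {
        "prefix": "oboInOwl",
        "prefix_synonyms": ["oio"],
        "uri_prefix": "http://www.geneontology.org/formats/oboInOwl#",
    },
]

# Expansion accepts the primary prefix and every synonym
expansion = {}
for record in extended_prefix_map:
    for curie_prefix in [record["prefix"], *record.get("prefix_synonyms", [])]:
        expansion[curie_prefix] = record["uri_prefix"]

# Compression always uses the primary prefix, so it is deterministic
compression = {
    record["uri_prefix"]: record["prefix"] for record in extended_prefix_map
}
```

Both oio and oboInOwl remain usable for expansion, while compression has exactly one answer per URI prefix.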

This function is for people who are not in the position to make the sustainable fix, and want to automate the assignment of which is the preferred prefix. It uses a deterministic algorithm to choose from two or more CURIE prefixes that have the same URI prefix and generate an extended prefix map in which they have been collapsed into a single record. More specifically, the algorithm is based on a case-sensitive lexical sort of the prefixes. The first in the sort order becomes the primary prefix and the others become synonyms in the resulting record.
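The algorithm described above can be sketched with the standard library alone. The function name collapse_prefix_map is hypothetical, and it returns plain dictionaries rather than curies.Record objects; it is meant to illustrate the grouping and sorting steps, not to reproduce the actual implementation.

```python
from collections import defaultdict


def collapse_prefix_map(prefix_map):
    """Collapse CURIE prefixes that share a URI prefix into one record each.

    Hypothetical illustration; not the curies implementation.
    """
    grouped = defaultdict(list)
    for curie_prefix, uri_prefix in prefix_map.items():
        grouped[uri_prefix].append(curie_prefix)

    records = []
    for uri_prefix, curie_prefixes in grouped.items():
        # Python's default string sort is case-sensitive: the first
        # prefix becomes primary and the rest become synonyms
        primary, *synonyms = sorted(curie_prefixes)
        records.append(
            {
                "prefix": primary,
                "prefix_synonyms": synonyms,
                "uri_prefix": uri_prefix,
            }
        )
    return records


records = collapse_prefix_map(
    {
        "oio": "http://www.geneontology.org/formats/oboInOwl#",
        "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
    }
)
```

In this sketch, oboInOwl sorts before oio and so becomes the primary prefix, with oio recorded as its synonym. Note that a case-sensitive sort places uppercase letters before lowercase ones.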

Parameters: prefix_map – A mapping whose keys represent CURIE prefixes and values represent URI prefixes
Returns: A list of curies.Record objects that together constitute an extended prefix map

>>> from curies import Converter, upgrade_prefix_map
>>> pm = {"a": "https://example.com/a/", "b": "https://example.com/a/"}
>>> records = upgrade_prefix_map(pm)
>>> converter = Converter(records)
>>> converter.expand("a:1")
'https://example.com/a/1'
>>> converter.expand("b:1")
'https://example.com/a/1'
>>> converter.compress("https://example.com/a/1")
'a:1'

Note

Thanks to Joe Flack for proposing this algorithm in this discussion.