URI Prefix Discovery
Discover new entries for a Converter.
The curies.discover() functionality is intended to be used in a “data science” workflow. Its goal is to enable a data scientist to semi-interactively explore data (e.g., coming from an ontology, SSSOM, or RDF) that doesn’t come with a complete (extended) prefix map and to identify common URI prefixes. It returns the discovered URI prefixes in a curies.Converter object with “dummy” CURIE prefixes. This makes it possible to convert the URIs appearing in the data into CURIEs and therefore enables their usage in places where CURIEs are expected.
However, it’s suggested that after discovering URI prefixes, the data scientist more carefully construct a meaningful prefix map based on the discovered one. This might include some or all of the following steps:

- Replace dummy CURIE prefixes with meaningful ones.
- Remove spurious URI prefixes that appear but do not represent a semantic space. This often happens due to using `_` as a delimiter or having a frequency cutoff of zero (see the parameters for this function).
- Consider chaining a comprehensive extended prefix map such as the Bioregistry (from curies.get_bioregistry_converter()) onto the converter passed to this function so pre-existing URI prefixes are not re-discovered.
Finally, you should save the prefix map that you create in a persistent place (e.g., inside a JSON file) so that it can be reused.
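Persisting the curated prefix map can be as simple as dumping a plain dictionary to JSON. The sketch below uses only the standard library; the file name and the example `GO` entry are illustrative, and the comment about `curies.load_prefix_map` refers to the loader this package provides for exactly this kind of file.

```python
import json
from pathlib import Path

# A curated prefix map; in curies, converter.prefix_map exposes such a dict
prefix_map = {"GO": "http://purl.obolibrary.org/obo/GO_"}

# Write it to a JSON file so later sessions can reuse it
path = Path("prefix_map.json")
path.write_text(json.dumps(prefix_map, indent=2, sort_keys=True))

# Later, rebuild a converter with curies.load_prefix_map(...) on the
# reloaded dictionary
reloaded = json.loads(path.read_text())
assert reloaded == prefix_map
```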
Algorithm
The curies.discover() function implements the following algorithm. For each URI:

1. For each delimiter (in the priority order they are given), check if the delimiter is present.
2. If it’s present, split the URI into two parts based on the rightmost appearance of the delimiter.
3. If the right part after splitting consists of all alphanumeric characters, save the URI prefix (with the delimiter attached).
4. If a delimiter is successfully used to identify a URI prefix, don’t check any of the following delimiters.

After identifying putative URI prefixes, the second part of the algorithm does the following:

1. If a cutoff was provided, remove all putative URI prefixes for which there were fewer examples than the cutoff.
2. Sort the URI prefixes lexicographically (i.e., with sorted()).
3. Assign a dummy CURIE prefix to each URI prefix, counting upwards from 1.
4. Construct a converter from this prefix map and return it.
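The steps above can be sketched in a few lines of plain Python. This is a simplified stand-in for illustration, not the actual implementation; the function name `discover_sketch` and the default delimiter order `"#/_"` are assumptions, and the real curies.discover() returns a Converter rather than a dict.

```python
from collections import Counter


def discover_sketch(uris, delimiters="#/_", cutoff=None):
    """Sketch of the URI prefix discovery algorithm described above."""
    counts = Counter()
    for uri in uris:
        # Check delimiters in priority order
        for delimiter in delimiters:
            if delimiter not in uri:
                continue
            # Split on the rightmost appearance of the delimiter
            prefix, luid = uri.rsplit(delimiter, 1)
            # Only keep the split if the remaining local part is alphanumeric
            if luid.isalnum():
                counts[prefix + delimiter] += 1
                break  # don't check any lower-priority delimiters

    # Apply the frequency cutoff, if one was given
    if cutoff:
        counts = {p: c for p, c in counts.items() if c >= cutoff}

    # Sort lexicographically and assign dummy CURIE prefixes ns1, ns2, ...
    return {
        f"ns{i}": uri_prefix
        for i, uri_prefix in enumerate(sorted(counts), start=1)
    }


uris = [
    "http://example.org/obo/GO_0000001",
    "http://example.org/obo/GO_0000002",
    "http://example.org/vocab#name",
]
```

Note that splitting the first URI on `/` fails (the local part `GO_0000001` is not alphanumeric), so the algorithm falls through to `_` and correctly recovers the `.../GO_` URI prefix.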
Discovering URI Prefixes from an Ontology
A common place where discovering URI prefixes is important is when working with new ontologies. In the following example, we look at the Academic Event Ontology (AEON). This is an ontology developed under OBO Foundry principles describing academic events. Accordingly, it includes many URI references to terms in OBO Foundry ontologies.
In this tutorial, we use curies.discover() (and later curies.discover_from_rdf() as a convenience function) to load the ontology in the RDF/XML format and discover putative URI prefixes.
import curies
from curies.discovery import get_uris_from_rdf
ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"
uris = get_uris_from_rdf(ONTOLOGY_URL, format="xml")
discovered_converter = curies.discover(uris)
# note, these two steps can be combined with curies.discover_from_rdf,
# and we'll do that in the following examples
We discovered fifty URI prefixes, listed in the following table. Many of them appear to be OBO Foundry URI prefixes or semantic web prefixes, so in the next step, we’ll use prior knowledge to reduce the false discovery rate.
curie_prefix | uri_prefix
---|---
ns1 |
ns2 |
ns3 |
ns4 |
ns5 |
ns6 |
ns7 |
ns8 |
ns9 |
ns10 |
ns11 |
ns12 |
ns13 |
ns14 |
ns15 |
ns16 |
ns17 |
ns18 |
ns19 |
ns20 |
ns21 |
ns22 |
ns23 |
ns24 |
ns25 |
ns26 |
ns27 |
ns28 |
ns29 |
ns30 |
ns31 |
ns32 |
ns33 |
ns34 |
ns35 |
ns36 |
ns37 |
ns38 |
ns39 |
ns40 |
ns41 |
ns42 |
ns43 |
ns44 |
ns45 |
ns46 |
ns47 |
ns48 |
ns49 |
ns50 |
In the following block, we chain together (extended) prefix maps from the OBO Foundry as well as
a “semantic web” prefix map to try and reduce the number of false positives by passing them
through the converter
keyword argument.
import curies
ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"
SEMWEB_URL = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.jsonld"
base_converter = curies.chain([
curies.load_jsonld_context(SEMWEB_URL),
curies.get_obo_converter(),
])
discovered_converter = curies.discover_from_rdf(
ONTOLOGY_URL, format="xml", converter=base_converter
)
We reduced the number of putative URI prefixes by half in the following table. However, we can still identify some putative URI prefixes that likely would have appeared in a more comprehensive (extended) prefix map such as the Bioregistry, for example:

- https://ror.org/ for the Research Organization Registry (ROR)
- https://w3id.org/seo# for the Scientific Event Ontology (SEO)
- http://usefulinc.com/ns/doap# for the Description of a Project (DOAP) vocabulary
Despite this, we’re on our way! It’s also apparent that several of the remaining putative URI prefixes come from non-standard usage of the OBO PURL system (e.g., http://purl.obolibrary.org/obo/valid_for_go_annotation_) and some are proper false positives due to using `_` as a delimiter (e.g., https://www.confident-conference.org/index.php/Event:VIVO_2021_talk2_).
curie_prefix | uri_prefix
---|---
ns1 |
ns2 |
ns3 |
ns4 |
ns5 |
ns6 |
ns7 |
ns8 |
ns9 |
ns10 |
ns11 |
ns12 |
ns13 |
ns14 |
ns15 |
ns16 |
ns17 |
ns18 |
ns19 |
ns20 |
ns21 |
ns22 |
ns23 |
ns24 |
ns25 |
As a final step in our iterative journey of URI prefix discovery, we’re going to use a cutoff for a minimum of two appearances of a URI prefix to reduce the most spurious false positives.
import curies
ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"
SEMWEB_URL = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.jsonld"
base_converter = curies.chain([
curies.load_jsonld_context(SEMWEB_URL),
curies.get_obo_converter(),
])
discovered_converter = curies.discover_from_rdf(
ONTOLOGY_URL, format="xml", converter=base_converter, cutoff=2
)
We have reduced the list to a manageable set of 9 putative URI prefixes in the following table.
curie_prefix | uri_prefix
---|---
ns1 |
ns2 |
ns3 |
ns4 |
ns5 |
ns6 |
ns7 |
ns8 |
ns9 |
Here are the calls to be made:

- ns1 represents the AEON vocabulary itself and should be given the aeon prefix.
- ns2 and ns3 are both false positives.
- ns6, ns7, and ns8 are a tricky case: they have a meaningful overlap that can’t (yet) be easily detected automatically. In this case, it makes the most sense to add the shortest one manually to the base converter with some unique name (don’t use ns6 as it will cause conflicts later), like in:

  base_converter = curies.chain([
      curies.load_jsonld_context(SEMWEB_URL),
      curies.get_obo_converter(),
      curies.load_prefix_map({"confident_event_vivo_2021": "https://www.confident-conference.org/index.php/Event:VIVO_2021_"}),
  ])

  In reality, these are all part of the ConfIDent Event vocabulary, which has the URI prefix https://www.confident-conference.org/index.php/Event:.
- ns4 represents the Conference Ontology and should be given the conference prefix.
- ns5 represents the Scientific Event Ontology (SEO) and should be given the seo prefix.
- ns9 represents the Semantic Web Rule Language (SWRL), though using URNs is an interesting choice in serialization.
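These curation decisions amount to renaming, dropping, or merging entries of the discovered dummy prefix map. The sketch below applies them to a plain dictionary; the `"..."` URI prefixes are placeholders for values not reproduced here (only SEO’s and ConfIDent’s URI prefixes appear in the text above), and the `confident.event` prefix name is a hypothetical choice.

```python
# Dummy prefix map as discovered; "..." stands in for URI prefix values
# not shown in the tables above
discovered = {
    "ns1": "...",                    # AEON itself
    "ns2": "...",                    # false positive
    "ns3": "...",                    # false positive
    "ns4": "...",                    # Conference Ontology
    "ns5": "https://w3id.org/seo#",  # SEO, given in the text above
    # ns6-ns8 overlap and are collapsed into one ConfIDent entry below
    "ns9": "...",                    # SWRL (URN-based)
}

# Apply the curation decisions: rename the keepers, drop the false
# positives, and merge the overlapping ConfIDent prefixes into one entry
curated = {
    "aeon": discovered["ns1"],
    "conference": discovered["ns4"],
    "seo": discovered["ns5"],
    # hypothetical prefix name for the shared ConfIDent URI prefix
    "confident.event": "https://www.confident-conference.org/index.php/Event:",
    "swrl": discovered["ns9"],
}
```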
After we’ve made these calls, it’s a good idea to write an (extended) prefix map. In this case, since we aren’t working with CURIE prefix synonyms nor URI prefix synonyms, it’s okay to write a simple prefix map or a JSON-LD context without losing information.
Note
Postscript: throughout this guide, we used the following Python code to create the RST tables:
def print_converter(converter) -> None:
    from tabulate import tabulate

    rows = sorted(
        [
            (record.prefix, f"``{record.uri_prefix}``")
            for record in converter.records
        ],
        key=lambda t: int(t[0].removeprefix("ns")),
    )
    print(tabulate(rows, headers=["curie_prefix", "uri_prefix"], tablefmt="rst"))
Just Make It Work, or, A Guide to Being a Questionable Semantic Citizen
The goal of the curies package is to provide the tools for making semantically well-defined data, which has a meaningful (extended) prefix map associated with it. Maybe you’re in an organization that doesn’t really care about the utility of nice prefix maps and just wants to get the job done, where you need to turn URIs into _some_ CURIEs.
Here’s a recipe for doing this, based on the last example with AEON:
import curies
ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"
# Use the Bioregistry as the base converter since it's the most comprehensive one
base_converter = curies.get_bioregistry_converter()
# Only discover what the Bioregistry doesn't already have
discovered_converter = curies.discover_from_rdf(
ONTOLOGY_URL, format="xml", converter=base_converter
)
# Chain together the base converter with the discoveries
augmented_converter = curies.chain([base_converter, discovered_converter])
With the augmented converter, you can now convert all URIs in the ontology into CURIEs. They will have a smattering of unintelligible prefixes with no meaning, but at least the job is done!
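To make concrete what that final conversion does, here is a self-contained sketch of the longest-match URI-to-CURIE compression a converter performs. This is a simplified stand-in for curies.Converter.compress, not its implementation, and the two-entry prefix map is illustrative: one meaningful prefix from the base converter alongside one unintelligible discovered dummy.

```python
def compress(uri, prefix_map):
    """Longest-match URI-to-CURIE compression, as a converter performs it
    (a simplified stand-in for curies.Converter.compress)."""
    best = None
    for curie_prefix, uri_prefix in prefix_map.items():
        # Prefer the longest matching URI prefix
        if uri.startswith(uri_prefix) and (
            best is None or len(uri_prefix) > len(prefix_map[best])
        ):
            best = curie_prefix
    if best is None:
        return None  # URI not covered by the prefix map
    return f"{best}:{uri[len(prefix_map[best]):]}"


prefix_map = {
    "GO": "http://purl.obolibrary.org/obo/GO_",  # from the Bioregistry base
    "ns1": "http://example.org/unknown/",        # a discovered dummy prefix
}
```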