URI Prefix Discovery

Discover new entries for a Converter.

The curies.discover() functionality is intended to be used in a “data science” workflow. Its goal is to enable a data scientist to semi-interactively explore data (e.g., coming from an ontology, SSSOM, RDF) that doesn’t come with a complete (extended) prefix map and identify common URI prefixes.

It returns the discovered URI prefixes in a curies.Converter object with “dummy” CURIE prefixes. This makes it possible to convert the URIs appearing in the data into CURIEs and therefore enables their usage in places where CURIEs are expected.

However, it’s suggested that after discovering URI prefixes, the data scientist more carefully constructs a meaningful prefix map based on the discovered one. This might include some or all of the following steps:

  1. Replace dummy CURIE prefixes with meaningful ones

  2. Remove spurious URI prefixes that appear but do not represent a semantic space. This happens often due to using _ as a delimiter or having a frequency cutoff of zero (see the parameters for this function).

  3. Consider chaining a comprehensive extended prefix map, such as the Bioregistry (from curies.get_bioregistry_converter()), onto the converter passed to this function so that pre-existing URI prefixes are not re-discovered.

Finally, you should save the prefix map that you create in a persistent place (e.g., inside a JSON file) so that it can be reused.
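For example, a converter’s simple prefix map can be dumped to a JSON file with the standard library. The prefix map and file name below are illustrative.

import json

import curies

# an illustrative, manually curated prefix map
converter = curies.load_prefix_map({
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "GO": "http://purl.obolibrary.org/obo/GO_",
})

# Converter.prefix_map is a plain prefix -> URI prefix dictionary
with open("prefix_map.json", "w") as file:
    json.dump(converter.prefix_map, file, indent=2, sort_keys=True)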

Algorithm

The curies.discover() function implements an algorithm that does the following for each URI (a simplified sketch of the full procedure is shown after the two lists below):

  1. For each delimiter (in the priority order they are given) check if the delimiter is present.

  2. If it’s present, split the URI into two parts based on the rightmost appearance of the delimiter.

  3. If the right part after splitting is all alphanumeric characters, save the URI prefix (with the delimiter attached)

  4. If a delimiter is successfully used to identify a URI prefix, don’t check any of the following delimiters

After identifying putative URI prefixes, the second part of the algorithm does the following:

  1. If a cutoff was provided, remove all putative URI prefixes for which there were fewer examples than the cutoff

  2. Sort the URI prefixes lexicographically (i.e., with sorted())

  3. Assign a dummy CURIE prefix to each URI prefix, counting upwards from 1

  4. Construct a converter from this prefix map and return it
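To make these two stages concrete, here is a simplified, self-contained sketch of the procedure. It is not the actual implementation in curies.discovery, and the default delimiter priority shown is an illustrative assumption.

from collections import Counter


def discover_sketch(uris, delimiters=("#", "/", "_"), cutoff=None, metaprefix="ns"):
    """Return a prefix map from dummy CURIE prefixes to putative URI prefixes."""
    counter = Counter()
    for uri in uris:
        # check each delimiter in the given priority order
        for delimiter in delimiters:
            if delimiter not in uri:
                continue
            # split on the rightmost appearance of the delimiter
            uri_prefix, _, luid = uri.rpartition(delimiter)
            if luid.isalnum():
                # save the URI prefix (with the delimiter attached), then stop
                # checking the remaining delimiters
                counter[uri_prefix + delimiter] += 1
                break
    # apply the frequency cutoff, sort lexicographically, and assign dummy
    # CURIE prefixes counting upwards from 1
    uri_prefixes = sorted(
        uri_prefix
        for uri_prefix, count in counter.items()
        if cutoff is None or count >= cutoff
    )
    prefix_map = {
        f"{metaprefix}{index}": uri_prefix
        for index, uri_prefix in enumerate(uri_prefixes, start=1)
    }
    # a Converter can then be constructed, e.g., with curies.load_prefix_map(prefix_map)
    return prefix_map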

Discovering URI Prefixes from an Ontology

A common place where discovering URI prefixes is important is when working with new ontologies. In the following example, we look at the Academic Event Ontology (AEON), an ontology describing academic events that is developed following OBO Foundry principles. Accordingly, it includes many URI references to terms in OBO Foundry ontologies.

In this tutorial, we use curies.discover() (and later the convenience function curies.discover_from_rdf()) to load the ontology in the RDF/XML format and discover putative URI prefixes.

import curies
from curies.discovery import get_uris_from_rdf

ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"

uris = get_uris_from_rdf(ONTOLOGY_URL, format="xml")
discovered_converter = curies.discover(uris)
# note, these two steps can be combined using curies.discover_from_rdf,
# and we'll do that in the following examples

We discovered fifty URI prefixes, shown in the following table. Many of them appear to be OBO Foundry URI prefixes or semantic web prefixes, so in the next step, we’ll use prior knowledge to reduce the false discovery rate.

curie_prefix    uri_prefix
==============  ===========================================================================
ns1             http://ontologydesignpatterns.org/wiki/Community:Parts_and_
ns2             http://protege.stanford.edu/plugins/owl/protege#
ns3             http://purl.obolibrary.org/obo/AEON_
ns4             http://purl.obolibrary.org/obo/APOLLO_SV_
ns5             http://purl.obolibrary.org/obo/BFO_
ns6             http://purl.obolibrary.org/obo/CRO_
ns7             http://purl.obolibrary.org/obo/ENVO_
ns8             http://purl.obolibrary.org/obo/IAO_
ns9             http://purl.obolibrary.org/obo/ICO_
ns10            http://purl.obolibrary.org/obo/NCBITaxon_
ns11            http://purl.obolibrary.org/obo/OBIB_
ns12            http://purl.obolibrary.org/obo/OBI_
ns13            http://purl.obolibrary.org/obo/OMO_
ns14            http://purl.obolibrary.org/obo/OOSTT_
ns15            http://purl.obolibrary.org/obo/RO_
ns16            http://purl.obolibrary.org/obo/TXPO_
ns17            http://purl.obolibrary.org/obo/bfo/axiom/
ns18            http://purl.obolibrary.org/obo/valid_for_
ns19            http://purl.obolibrary.org/obo/valid_for_go_
ns20            http://purl.obolibrary.org/obo/valid_for_go_annotation_
ns21            http://purl.obolibrary.org/obo/wikiCFP_
ns22            http://purl.org/dc/elements/1.1/
ns23            http://purl.org/dc/terms/
ns24            http://usefulinc.com/ns/doap#
ns25            http://wiki.geneontology.org/index.php/Involved_
ns26            http://www.geneontology.org/formats/oboInOwl#
ns27            http://www.geneontology.org/formats/oboInOwl#created_
ns28            http://www.w3.org/1999/02/22-rdf-syntax-ns#
ns29            http://www.w3.org/2000/01/rdf-schema#
ns30            http://www.w3.org/2001/XMLSchema#
ns31            http://www.w3.org/2002/07/owl#
ns32            http://www.w3.org/2003/11/swrl#
ns33            http://www.w3.org/2004/02/skos/core#
ns34            http://www.w3.org/ns/prov#
ns35            http://xmlns.com/foaf/0.1/
ns36            https://en.wikipedia.org/wiki/Allen%27s_interval_
ns37            https://groups.google.com/d/msg/bfo-owl-devel/s9Uug5QmAws/ZDRnpiIi_
ns38            https://ror.org/
ns39            https://w3id.org/scholarlydata/ontology/conference-ontology.owl#
ns40            https://w3id.org/seo#
ns41            https://www.confident-conference.org/index.php/Academic_Field:Information_
ns42            https://www.confident-conference.org/index.php/Event:VIVO_
ns43            https://www.confident-conference.org/index.php/Event:VIVO_2021_
ns44            https://www.confident-conference.org/index.php/Event:VIVO_2021_orga_
ns45            https://www.confident-conference.org/index.php/Event:VIVO_2021_talk1_
ns46            https://www.confident-conference.org/index.php/Event:VIVO_2021_talk2_
ns47            https://www.confident-conference.org/index.php/Event_Series:VIVO_
ns48            https://www.wikidata.org/wiki/
ns49            https://www.wikidata.org/wiki/Wikidata:Property_proposal/colocated_
ns50            urn:swrl#
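Even with only dummy CURIE prefixes, the discovered converter can already be used for conversion. For example, based on the table above, the AEON URI prefix was assigned the dummy prefix ns3, so a round trip looks roughly like the following (the local identifier is illustrative):

# compress a URI from the data into a (dummy) CURIE
discovered_converter.compress("http://purl.obolibrary.org/obo/AEON_0000001")
# -> 'ns3:0000001'

# expand it back into the original URI
discovered_converter.expand("ns3:0000001")
# -> 'http://purl.obolibrary.org/obo/AEON_0000001'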

In the following block, we chain together (extended) prefix maps from the OBO Foundry as well as a “semantic web” prefix map to try to reduce the number of false positives, passing the result via the converter keyword argument.

import curies

ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"
SEMWEB_URL = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.jsonld"

base_converter = curies.chain([
    curies.load_jsonld_context(SEMWEB_URL),
    curies.get_obo_converter(),
])

discovered_converter = curies.discover_from_rdf(
    ONTOLOGY_URL, format="xml", converter=base_converter
)

We reduced the number of putative URI prefixes by half, as shown in the following table. However, we can still identify some putative URI prefixes that would likely have appeared in a more comprehensive (extended) prefix map such as the Bioregistry, for example https://ror.org/ and https://www.wikidata.org/wiki/.

Despite this, we’re on our way! It’s also obvious that several of the remaining putative URI prefixes come from non-standard usage of the OBO PURL system (e.g., http://purl.obolibrary.org/obo/valid_for_go_annotation_) and some are proper false positives due to using _ as a delimiter (e.g., https://www.confident-conference.org/index.php/Event:VIVO_2021_talk2_).

curie_prefix    uri_prefix
==============  ===========================================================================
ns1             http://ontologydesignpatterns.org/wiki/Community:Parts_and_
ns2             http://protege.stanford.edu/plugins/owl/protege#
ns3             http://purl.obolibrary.org/obo/AEON_
ns4             http://purl.obolibrary.org/obo/bfo/axiom/
ns5             http://purl.obolibrary.org/obo/valid_for_
ns6             http://purl.obolibrary.org/obo/valid_for_go_
ns7             http://purl.obolibrary.org/obo/valid_for_go_annotation_
ns8             http://purl.obolibrary.org/obo/wikiCFP_
ns9             http://usefulinc.com/ns/doap#
ns10            http://wiki.geneontology.org/index.php/Involved_
ns11            https://en.wikipedia.org/wiki/Allen%27s_interval_
ns12            https://groups.google.com/d/msg/bfo-owl-devel/s9Uug5QmAws/ZDRnpiIi_
ns13            https://ror.org/
ns14            https://w3id.org/scholarlydata/ontology/conference-ontology.owl#
ns15            https://w3id.org/seo#
ns16            https://www.confident-conference.org/index.php/Academic_Field:Information_
ns17            https://www.confident-conference.org/index.php/Event:VIVO_
ns18            https://www.confident-conference.org/index.php/Event:VIVO_2021_
ns19            https://www.confident-conference.org/index.php/Event:VIVO_2021_orga_
ns20            https://www.confident-conference.org/index.php/Event:VIVO_2021_talk1_
ns21            https://www.confident-conference.org/index.php/Event:VIVO_2021_talk2_
ns22            https://www.confident-conference.org/index.php/Event_Series:VIVO_
ns23            https://www.wikidata.org/wiki/
ns24            https://www.wikidata.org/wiki/Wikidata:Property_proposal/colocated_
ns25            urn:swrl#

As a final step in our iterative journey of URI prefix discovery, we’re going to use a cutoff for a minimum of two appearances of a URI prefix to reduce the most spurious false positives.

import curies

ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"
SEMWEB_URL = "https://raw.githubusercontent.com/biopragmatics/bioregistry/main/exports/contexts/semweb.context.jsonld"

base_converter = curies.chain([
    curies.load_jsonld_context(SEMWEB_URL),
    curies.get_obo_converter(),
])

discovered_converter = curies.discover_from_rdf(
    ONTOLOGY_URL, format="xml", converter=base_converter, cutoff=2
)

We have reduced the list to a manageable set of nine putative URI prefixes, shown in the following table.

curie_prefix    uri_prefix
==============  ===========================================================================
ns1             http://purl.obolibrary.org/obo/AEON_
ns2             http://purl.obolibrary.org/obo/bfo/axiom/
ns3             http://purl.obolibrary.org/obo/valid_for_go_
ns4             https://w3id.org/scholarlydata/ontology/conference-ontology.owl#
ns5             https://w3id.org/seo#
ns6             https://www.confident-conference.org/index.php/Event:VIVO_2021_
ns7             https://www.confident-conference.org/index.php/Event:VIVO_2021_talk1_
ns8             https://www.confident-conference.org/index.php/Event:VIVO_2021_talk2_
ns9             urn:swrl#

Here are the judgment calls to be made:

  • ns1 represents the AEON vocabulary itself and should be given the aeon prefix.

  • ns2 and ns3 are both false positives

  • ns6, ns7, and ns8 are a tricky case - they have a meaningful overlap that can’t be easily automatically detected (yet). In this case, it makes the most sense to add the shortest one manually to the base converter with some unique name (don’t use ns6 as it will cause conflicts later), like in:

    base_converter = curies.chain([
        curies.load_jsonld_context(SEMWEB_URL),
        curies.get_obo_converter(),
        curies.load_prefix_map({"confident_event_vivo_2021": "https://www.confident-conference.org/index.php/Event:VIVO_2021_"}),
    ])
    

    In reality, these are all part of the ConfIDent Event vocabulary, which has the URI prefix https://www.confident-conference.org/index.php/Event:.

  • ns4 represents the Conference Ontology and should be given the conference prefix.

  • ns5 represents the Scientific Event Ontology (SEO) and should be given the seo prefix.

  • ns9 represents the Semantic Web Rule Language, though using URNs is an interesting choice in serialization.

After we’ve made these calls, it’s a good idea to write an (extended) prefix map. In this case, since we aren’t working with CURIE prefix synonyms nor URI prefix synonyms, it’s okay to write a simple prefix map or a JSON-LD context without losing information.
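For example, the judgment calls above might be collected into a simple curated prefix map, loaded into a converter, and written to a JSON file for reuse. The CURIE prefixes chosen below for the ConfIDent Event vocabulary and SWRL are illustrative, not canonical.

import json

import curies

curated_converter = curies.load_prefix_map({
    "aeon": "http://purl.obolibrary.org/obo/AEON_",
    "conference": "https://w3id.org/scholarlydata/ontology/conference-ontology.owl#",
    "seo": "https://w3id.org/seo#",
    # illustrative prefix names for the ConfIDent Event vocabulary and SWRL
    "confident_event": "https://www.confident-conference.org/index.php/Event:",
    "swrl": "urn:swrl#",
})

# persist the simple prefix map so it can be reused later
with open("aeon_prefix_map.json", "w") as file:
    json.dump(curated_converter.prefix_map, file, indent=2, sort_keys=True)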

Note

Postscript: throughout this guide, we used the following Python code to create the RST tables:

def print_converter(converter) -> None:
    from tabulate import tabulate

    rows = sorted(
        [
            (record.prefix, f"``{record.uri_prefix}``")
            for record in converter.records
        ],
        # sort by the numeric part of the dummy prefix (ns1, ns2, ...)
        key=lambda t: int(t[0].removeprefix("ns")),
    )
    print(tabulate(rows, headers=["curie_prefix", "uri_prefix"], tablefmt="rst"))

Just Make It Work, or, A Guide to Being a Questionable Semantic Citizen

The goal of the curies package is to provide tools for making semantically well-defined data that has a meaningful (extended) prefix map associated with it. Maybe you’re in an organization that doesn’t really care about nice prefix maps and just wants to get the job done, i.e., you need to turn URIs into _some_ CURIEs.

Here’s a recipe for doing this, based on the last example with AEON:

import curies

ONTOLOGY_URL = "https://raw.githubusercontent.com/tibonto/aeon/main/aeon.owl"

# Use the Bioregistry as a base prefix since it's the most comprehensive one
base_converter = curies.get_bioregistry_converter()

# Only discover what the Bioregistry doesn't already have
discovered_converter = curies.discover_from_rdf(
    ONTOLOGY_URL, format="xml", converter=base_converter
)

# Chain together the base converter with the discoveries
augmented_converter = curies.chain([base_converter, discovered_converter])

With the augmented converter, you can now convert all URIs in the ontology into CURIEs. They will have a smattering of unintelligible prefixes with no meaning, but at least the job is done!
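Continuing from the recipe above, every URI now compresses to some CURIE, whether or not the prefix is meaningful. The local identifiers below are illustrative, and the exact dummy prefix depends on what was discovered.

# a URI already covered by the Bioregistry compresses to a meaningful CURIE
augmented_converter.compress("http://purl.obolibrary.org/obo/BFO_0000066")
# -> e.g., 'bfo:0000066'

# a URI only covered by a discovered dummy prefix still becomes *some* CURIE
augmented_converter.compress("http://ontologydesignpatterns.org/wiki/Community:Parts_and_Collections")
# -> e.g., 'ns1:Collections' (the number depends on discovery order)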