discover

discover(uris: Iterable[str], *, delimiters: Sequence[str] | None = None, cutoff: int | None = None, metaprefix: str = 'ns', converter: Converter | None = None) → Converter

Discover new URI prefixes and construct a converter with a unique dummy CURIE prefix for each.

Parameters:

uris – An iterable of URIs to search through. The URIs are deduplicated first, so each unique URI is considered only once.
delimiters –
The character(s) that delimit a URI prefix from a local unique identifier. If none are given, defaults to using /, #, and _. For example:
- / is the delimiter in https://www.ncbi.nlm.nih.gov/pubmed/37929212, which separates the URI prefix https://www.ncbi.nlm.nih.gov/pubmed/ from the local unique identifier 37929212 for the article “New insights into osmobiosis and chemobiosis in tardigrades” in PubMed.
- # is the delimiter in http://www.w3.org/2000/01/rdf-schema#label, which separates the URI prefix http://www.w3.org/2000/01/rdf-schema# from the local unique identifier label for the term “label” in the RDF Schema. The # is typically used in a URL to denote a fragment and commonly appears in small semantic web vocabularies that are shown as a single HTML page.
- _ is the delimiter in http://purl.obolibrary.org/obo/GO_0032571, which separates the URI prefix http://purl.obolibrary.org/obo/GO_ from the local unique identifier 0032571 for the term “response to vitamin K” in the Gene Ontology.
Note

The delimiter is itself a part of the URI prefix.
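The splitting rule described above can be illustrated with a minimal sketch. The helper split_uri is hypothetical and not part of curies; it assumes the split happens at the last occurrence of the delimiter, which is consistent with the three examples above but is not confirmed here as the library's exact behavior.

```python
from typing import Optional, Tuple


def split_uri(uri: str, delimiter: str) -> Optional[Tuple[str, str]]:
    """Split a URI at the last occurrence of the delimiter.

    Hypothetical helper for illustration only, not the curies implementation.
    Returns (uri_prefix, local_unique_identifier), or None if the delimiter
    does not appear in the URI.
    """
    if delimiter not in uri:
        return None
    head, luid = uri.rsplit(delimiter, maxsplit=1)
    # The delimiter is kept as the final character of the URI prefix.
    return head + delimiter, luid


# The three URIs used in the parameter description, each with its delimiter.
examples = [
    ("https://www.ncbi.nlm.nih.gov/pubmed/37929212", "/"),
    ("http://www.w3.org/2000/01/rdf-schema#label", "#"),
    ("http://purl.obolibrary.org/obo/GO_0032571", "_"),
]
results = [split_uri(uri, delimiter) for uri, delimiter in examples]
```

Each entry in results is a pair whose first element ends with the delimiter, which is the point made in the note above.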
cutoff –
If given, a URI prefix is kept only if more than cutoff unique local unique identifiers are associated with it.

Defaults to None, which applies no cutoff. This increases recall (i.e., the likelihood of getting all possible URI prefixes) but decreases precision (i.e., more of the results might be false positives / spurious). If you get a lot of false positives, try increasing the cutoff to 1, then 2, then higher.
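The effect of the cutoff can be sketched with the standard library alone. The function filter_by_cutoff and the sample pairs are hypothetical; the strict "more than" comparison follows the wording above and is an assumption about the library, whose exact comparison may differ.

```python
from collections import defaultdict


def filter_by_cutoff(pairs, cutoff=None):
    """Keep URI prefixes seen with more than `cutoff` unique local identifiers.

    Hypothetical illustration, not the curies implementation. `pairs` is an
    iterable of (uri_prefix, local_unique_identifier) tuples.
    """
    luids_by_prefix = defaultdict(set)
    for uri_prefix, luid in pairs:
        # Sets deduplicate, so a repeated identifier is counted once.
        luids_by_prefix[uri_prefix].add(luid)
    return sorted(
        uri_prefix
        for uri_prefix, luids in luids_by_prefix.items()
        # Assumption: strict comparison, following the parameter description.
        if cutoff is None or len(luids) > cutoff
    )


pairs = [
    ("http://ran.dom/", "001"),
    ("http://ran.dom/", "002"),
    ("http://ran.dom/", "002"),  # duplicate, counted once
    ("http://example.org/page_", "7"),
]
kept_without_cutoff = filter_by_cutoff(pairs)
kept_with_cutoff = filter_by_cutoff(pairs, cutoff=1)
```

Without a cutoff both candidate prefixes survive; with a cutoff of 1, only the prefix that has two distinct identifiers is kept.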
metaprefix – The beginning part of each dummy prefix, followed by a number. The default value is ns, so dummy prefixes are named ns1, ns2, and so on.
converter –
If a pre-existing converter is passed, then URIs that can be parsed using the pre-existing converter are not considered during discovery.

For example, if you are working with URIs from an OBO ontology, it makes sense to pass the converter from curies.get_obo_converter() to reduce false positive discoveries. More generally, a comprehensive converter like the Bioregistry (from curies.get_bioregistry_converter()) can greatly reduce false positive discoveries. This in turn reduces the burden on the data scientist using this function, who needs to understand the results and carefully curate a prefix map based on the discoveries.
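As a rough sketch of why a pre-existing converter helps, the snippet below drops URIs that already match a known URI prefix before any discovery would take place. The names known_uri_prefixes and drop_known are hypothetical, and a plain string-prefix check stands in for whatever parsing a real converter performs.

```python
def drop_known(uris, known_uri_prefixes):
    """Return the URIs that do not start with any already-known URI prefix.

    Hypothetical stand-in for the filtering a pre-existing converter enables;
    this is not the curies implementation.
    """
    known = tuple(known_uri_prefixes)
    return [uri for uri in uris if not uri.startswith(known)]


# Assumption: a single known prefix, standing in for a full prefix map.
known_uri_prefixes = ["http://purl.obolibrary.org/obo/GO_"]
uris = [
    "http://purl.obolibrary.org/obo/GO_0032571",
    "http://ran.dom/001",
    "http://ran.dom/002",
]
remaining = drop_known(uris, known_uri_prefixes)
```

Only the URIs left in remaining would be candidates for new dummy prefixes, so the Gene Ontology URI cannot produce a spurious discovery.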

Returns:

A converter with a dummy CURIE prefix for each discovered URI prefix.

>>> import curies
>>> uris = [f"http://ran.dom/{i:03}" for i in range(30)]
>>> discovered_converter = curies.discover(uris)
>>> discovered_converter.records
[Record(prefix="ns1", uri_prefix="http://ran.dom/", prefix_synonyms=[], uri_prefix_synonyms=[], pattern=None)]
>>> discovered_converter.compress("http://ran.dom/002")
'ns1:002'