discover
- discover(uris: Iterable[str], *, delimiters: Sequence[str] | None = None, cutoff: int | None = None, metaprefix: str = 'ns', converter: Converter | None = None) → Converter
Discover new URI prefixes and construct a converter with a unique dummy CURIE prefix for each.
- Parameters:
uris – An iterable of URIs to search through. Duplicates are collapsed, so each unique URI is considered only once.
delimiters –
The character(s) that delimit a URI prefix from a local unique identifier. If none are given, defaults to /, #, and _. For example:
- / is the delimiter in https://www.ncbi.nlm.nih.gov/pubmed/37929212, which separates the URI prefix https://www.ncbi.nlm.nih.gov/pubmed/ from the local unique identifier 37929212 for the article “New insights into osmobiosis and chemobiosis in tardigrades” in PubMed.
- # is the delimiter in http://www.w3.org/2000/01/rdf-schema#label, which separates the URI prefix http://www.w3.org/2000/01/rdf-schema# from the local unique identifier label for the term “label” in the RDF Schema. The # is typically used in a URL to denote a fragment and commonly appears in small semantic web vocabularies that are shown as a single HTML page.
- _ is the delimiter in http://purl.obolibrary.org/obo/GO_0032571, which separates the URI prefix http://purl.obolibrary.org/obo/GO_ from the local unique identifier 0032571 for the term “response to vitamin K” in the Gene Ontology.
Note
The delimiter is itself a part of the URI prefix.
cutoff –
If given, a URI prefix is kept only when more than cutoff distinct local unique identifiers are associated with it. Defaults to zero, which increases recall (i.e., the likelihood of getting all possible URI prefixes) but decreases precision (i.e., more of the results might be false positives / spurious). If you get a lot of false positives, try increasing the cutoff first to 1 or 2, then higher if needed.
metaprefix – The beginning part of each dummy prefix, followed by a number. The default value is ns, so dummy prefixes are named ns1, ns2, and so on.
converter –
If a pre-existing converter is passed, then URIs that can be parsed using the pre-existing converter are not considered during discovery.
For example, if you’re an OBO person working with URIs coming from an OBO ontology, it makes sense to pass the converter from curies.get_obo_converter() to reduce false positive discoveries. More generally, a comprehensive converter like the Bioregistry (from curies.get_bioregistry_converter()) can massively reduce false positive discoveries and ultimately reduce the burden on the data scientist who uses this function, has to understand the results, and must carefully curate a prefix map based on the discoveries.
- Returns:
A converter with dummy prefixes
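The way delimiters, cutoff, and metaprefix interact can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the library's implementation: the helper names split_uri and sketch_discover are hypothetical, delimiters are assumed to be single characters, each URI is split at its right-most delimiter, and the cutoff comparison follows the "more than" wording of the parameter description.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Sequence

DEFAULT_DELIMITERS = ("/", "#", "_")


def split_uri(
    uri: str, delimiters: Sequence[str] = DEFAULT_DELIMITERS
) -> tuple[str, str] | None:
    """Split a URI at its right-most delimiter (hypothetical helper)."""
    # Index of the last occurrence of any delimiter, or -1 if none occurs.
    index = max(uri.rfind(delimiter) for delimiter in delimiters)
    if index < 0 or index == len(uri) - 1:
        # No delimiter at all, or nothing after it to serve as an identifier.
        return None
    # The delimiter itself stays at the end of the URI prefix.
    return uri[: index + 1], uri[index + 1 :]


def sketch_discover(
    uris: Iterable[str],
    delimiters: Sequence[str] = DEFAULT_DELIMITERS,
    cutoff: int = 0,
    metaprefix: str = "ns",
) -> dict[str, str]:
    """Map dummy prefixes to candidate URI prefixes (hypothetical helper)."""
    identifiers_by_prefix: dict[str, set[str]] = defaultdict(set)
    for uri in set(uris):  # each unique URI is considered once
        parts = split_uri(uri, delimiters)
        if parts is not None:
            uri_prefix, identifier = parts
            identifiers_by_prefix[uri_prefix].add(identifier)
    # Assumption: "more than cutoff" is read as a strict comparison.
    kept = sorted(
        uri_prefix
        for uri_prefix, identifiers in identifiers_by_prefix.items()
        if len(identifiers) > cutoff
    )
    # Number the surviving URI prefixes: ns1, ns2, ...
    return {f"{metaprefix}{i}": p for i, p in enumerate(kept, start=1)}


example_uris = [f"http://ran.dom/{i:03}" for i in range(30)]
example_uris.append("http://example.org/one-off/1")
loose = sketch_discover(example_uris)  # keeps both URI prefixes
strict = sketch_discover(example_uris, cutoff=2)  # drops the one-off prefix
```

With the default cutoff, the single URI under http://example.org/one-off/ is enough to produce a dummy prefix; raising the cutoff to 2 leaves only the prefix that has many identifiers behind it.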
>>> import curies
# Generate some example URIs
>>> uris = [f"http://ran.dom/{i:03}" for i in range(30)]
>>> discovered_converter = curies.discover(uris)
>>> discovered_converter.records
[Record(prefix="ns1", uri_prefix="http://ran.dom/")]
# Now, you can compress the URIs to dummy CURIEs
>>> discovered_converter.compress("http://ran.dom/002")
'ns1:002'
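The effect of the converter parameter can be pictured as a filter that runs before discovery. The sketch below is a plain-Python approximation rather than the library's code: drop_known_uris is a hypothetical helper, and it treats "can be parsed by the pre-existing converter" as a simple startswith check against a prefix map, whereas a real Converter may recognize URIs in additional ways.

```python
from __future__ import annotations

from typing import Iterable, Mapping


def drop_known_uris(
    uris: Iterable[str], known_prefix_map: Mapping[str, str]
) -> list[str]:
    """Keep only URIs that no known URI prefix covers (hypothetical helper)."""
    # str.startswith accepts a tuple and returns True if any entry matches.
    known_uri_prefixes = tuple(known_prefix_map.values())
    return [uri for uri in uris if not uri.startswith(known_uri_prefixes)]


# Assumed prefix map for illustration; the URI prefix is taken from the
# Gene Ontology example in the delimiters description.
known = {"GO": "http://purl.obolibrary.org/obo/GO_"}
candidates = [
    "http://purl.obolibrary.org/obo/GO_0032571",
    "http://ran.dom/001",
]
remaining = drop_known_uris(candidates, known)
```

Only http://ran.dom/001 is left for discovery here, which is why a broad pre-existing converter tends to shrink the list of dummy prefixes that need manual review.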