Converter with Preprocessing

When simple expansion and contraction aren’t enough, and you want to inject global or context-specific rewrite rules, you can wrap a curies.Converter and preprocessing rules encoded in an instance of curies.PreprocessingRules inside a curies.PreprocessingConverter.

Rewrites

For example, you always want to fix legacy references to the OBO_REL namespace:

import curies
from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites

rules = PreprocessingRules(
    rewrites=PreprocessingRewrites(
        full={"OBO_REL:is_a": "rdfs:subClassOf"},
    ),
)

converter = curies.get_obo_converter()
converter = PreprocessingConverter.from_converter(converter, rules=rules)

parsed = converter.parse_curie("OBO_REL:is_a")
assert parsed == ReferenceTuple("rdfs", "subClassOf")

Similarly, there may be a whole class of references that need to be fixed based on their prefix, such as the APOLLO:SV_ references that are mangled by the OWLAPI due to the OBO Foundry’s PURL rules

import curies
from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites

rules = PreprocessingRules(
    rewrites=PreprocessingRewrites(
        prefix={"APOLLO:SV_": "APOLLO_SV:"},
    )
)

converter = curies.get_obo_converter()
converter = PreprocessingConverter.from_converter(converter, rules=rules)

parsed = converter.parse_curie("APOLLO:SV_1234567")
assert parsed == ReferenceTuple("APOLLO_SV", "1234567")

The CURIE and URI rewrites are unified. Therefore, you can also use a URI as a rewrite, such as handling Creative Commons license URLs, which unfortunately aren’t themselves part of a semantic space for licenses. Luckily, SPDX is, and we can remap to that.

import curies
from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites

rules = PreprocessingRules(
    rewrites=PreprocessingRewrites(
        full={
            "http://creativecommons.org/licenses/by/3.0/": "spdx:CC-BY-3.0",
        },
    )
)

converter = curies.get_obo_converter()
converter.add_prefix("spdx", "https://spdx.org/licenses/")
converter = PreprocessingConverter.from_converter(converter, rules=rules)

parsed = converter.parse_uri("http://creativecommons.org/licenses/by/3.0/")
assert parsed == ReferenceTuple("spdx", "CC-BY-3.0")

Some rewrite rules only apply to a specific resource, because of its own quirks in curation or encoding. For example, CHMO encodes OrangeBook entries with orange as a prefix, which is not typically specific enough to warrant curating orange as a prefix, e.g., in the Bioregistry

import curies
from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites

rules = PreprocessingRules(
    rewrites=PreprocessingRewrites(
        resource_prefix={
            "CHMO": {"orange:": "orangebook:"},
        },
    ),
)

converter = curies.get_obo_converter()
converter.add_prefix("orangebook", "https://bioregistry.io/orangebook:")
converter = PreprocessingConverter.from_converter(converter, rules=rules)

parsed = converter.parse_curie("orange:10.2.1.1.3")
assert parsed == ReferenceTuple("orangebook", "10.2.1.1.3")

Similarly, this can be used to inject knowledge about resources that improperly import EDAM sub-trees such as MCRO, which uses format as a prefix where it means edam.format

Blocks

Some references are _never_ informative, and can be configured to be thrown away, such as Bgee:curators, BioGRID:curators, GROUP:OBI, and similar group curation flags.

import curies
from curies import PreprocessingRules, PreprocessingConverter, PreprocessingBlocklists

rules = PreprocessingRules(
    blocklists=PreprocessingBlocklists(
        full=["Bgee:curators", "BioGRID:curators", "GROUP:OBI"],
    ),
)

converter = curies.get_obo_converter()
converter = PreprocessingConverter.from_converter(converter, rules=rules)

# raises a BlocklistError
converter.parse_curie("GROUP:OBI")

Blocklists cause throwing an exception that can be handled by downstream code, such as returning a None. This is done because in some places, it’s nice to have the distinction between None being returned by parsing failing, versus actively being blocked. This can be toggled with the block_action argument.

Suffix Post-processing

URIs from some prefixes that are known to sometimes have unneeded suffixes can be configured to be stripped in the following way

import curies
from curies import PreprocessingRules, PreprocessingConverter, PostprocessingRewrites

converter = Converter.from_prefixes(
    {"emedicine": "http://emedicine.medscape.com/article/"}
)
rules = PreprocessingRules(
    postprocessing=PostprocessingRewrites(
        suffix={
            "emedicine": "-overview",
        },
    ),
)

converter = PreprocessingConverter.from_converter(converter, rules=rules)

parsed = converter.parse_curie("http://emedicine.medscape.com/article/276512-overview")
assert parsed == ReferenceTuple("emedicine", "276512")