Enrichment

Mixins for enrichment analysis using STRING/UniProt.

Provides methods for STRING-based functional and protein–protein interaction (PPI) enrichment.

This mixin includes utilities for:

- Running functional enrichment on differentially expressed or user-supplied gene lists.
- Performing STRING PPI enrichment to identify interaction networks.
- Generating STRING network visualization links and embedded SVGs.
- Listing and accessing enrichment results stored in `.stats`.
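
As a rough illustration of the enrichment requests listed above, the sketch below builds the form payload that is posted to STRING's `enrichment` endpoint. The helper name `build_enrichment_payload` and the example identifiers are hypothetical and not part of scpviz; the field names and the `%0d` separator follow the source listing on this page. No request is sent here.

```python
def build_enrichment_payload(string_ids, species_id, background_ids=None,
                             caller_identity="scpviz"):
    """Assemble the POST fields for a STRING functional enrichment call.

    Hypothetical helper: it mirrors the payload built inside
    `enrichment_functional`, but it is not part of scpviz.
    """
    payload = {
        # Identifiers are joined with "%0d", the separator used in the source.
        "identifiers": "%0d".join(string_ids),
        "species": species_id,
        "caller_identity": caller_identity,
    }
    # A background set is optional and only added when supplied.
    if background_ids is not None:
        payload["background_string_identifiers"] = "%0d".join(background_ids)
    return payload


# Placeholder STRING IDs, for illustration only.
payload = build_enrichment_payload(["9606.ENSP_A", "9606.ENSP_B"], "9606")
print(payload["identifiers"])
```

To run the query, the payload would be passed to `requests.post("https://string-db.org/api/json/enrichment", data=payload)`, as the source does.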

Methods:

| Name | Description |
|------|-------------|
| `enrichment_functional` | Runs STRING functional enrichment on DE results or a custom gene list. |
| `enrichment_ppi` | Runs STRING PPI enrichment on a user-supplied gene or accession list. |
| `list_enrichments` | Lists available enrichment results and DE comparisons. |
| `plot_enrichment_svg` | Displays a STRING enrichment SVG inline or saves it to file. |
| `get_string_mappings` | Maps UniProt accessions to STRING IDs using the STRING API. |
| `resolve_to_accessions` | Resolves gene names or mixed inputs to accessions using internal mappings. |
| `get_string_network_link` | Generates a direct STRING network URL for visualization. |
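
The two helper methods, `resolve_to_accessions` and `get_string_mappings`, follow a resolve-then-map pattern with a cache. The following is a minimal, standalone sketch of that pattern, assuming a plain dict as the cache instead of `.prot.var`; the function `map_with_cache`, the stand-in `fake_lookup`, and all identifiers are hypothetical and only illustrate the control flow.

```python
def resolve_to_accessions(mixed_list, known_accessions, gene_to_acc):
    """Split inputs into resolved accessions and unresolved items."""
    accs, unresolved = [], []
    for item in mixed_list:
        if item in known_accessions:
            accs.append(item)                 # already an accession
        elif item in gene_to_acc:
            accs.append(gene_to_acc[item])    # gene name -> accession
        else:
            unresolved.append(item)
    return accs, unresolved


def map_with_cache(accessions, cache, lookup, overwrite=False):
    """Return {accession: STRING ID or None}, calling `lookup` only when needed."""
    missing = [a for a in accessions if overwrite or not cache.get(a)]
    if missing:
        for acc, sid in lookup(missing).items():
            if sid:                           # keep only non-empty mappings
                cache[acc] = sid
    return {a: cache.get(a) for a in accessions}


# Toy data (hypothetical): two known accessions and one gene-name alias.
known = {"ACC1", "ACC2"}
gene_map = {"GENE2": "ACC2"}
cache = {"ACC1": "9606.ENSP_A"}               # ACC1 is already cached
calls = []


def fake_lookup(ids):
    """Stand-in for the remote UniProt/STRING lookup; records what was asked."""
    calls.append(list(ids))
    return {i: f"9606.ENSP_{i}" for i in ids}


accs, unresolved = resolve_to_accessions(["ACC1", "GENE2", "UNKNOWN"], known, gene_map)
mapping = map_with_cache(accs, cache, fake_lookup)
print(accs, unresolved, mapping)
```

Only `ACC2` reaches the lookup function in this run, because `ACC1` is served from the cache; passing `overwrite=True` would send both.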

Source code in src/scpviz/pAnnData/enrichment.py

class EnrichmentMixin:
    """
    Provides methods for STRING-based functional and protein–protein interaction (PPI) enrichment.

    This mixin includes utilities for:

    - Running functional enrichment on differentially expressed or user-supplied gene lists.
    - Performing STRING PPI enrichment to identify interaction networks.
    - Generating STRING network visualization links and embedded SVGs.
    - Listing and accessing enrichment results stored in `.stats`.

    Functions:
        enrichment_functional: Runs STRING functional enrichment on DE results or a custom gene list.
        enrichment_ppi: Runs STRING PPI enrichment on a user-supplied gene or accession list.
        list_enrichments: Lists available enrichment results and DE comparisons.
        plot_enrichment_svg: Displays a STRING enrichment SVG inline or saves it to file.
        get_string_mappings: Maps UniProt accessions to STRING IDs using the STRING API.
        resolve_to_accessions: Resolves gene names or mixed inputs to accessions using internal mappings.
        get_string_network_link: Generates a direct STRING network URL for visualization.
    """

    def get_string_mappings(self, identifiers, overwrite=False, cache_col="STRING", batch_size=100, debug=False):
        """
        Resolve STRING IDs for UniProt accessions with a two-step strategy:

        1. Use the UniProt stream (field: `xref_string`) to fill the cache quickly.
        2. For any rows still missing, query STRING `get_string_ids`, batched by organism_id.

        This method retrieves the STRING identifiers for a list of UniProt accessions
        and caches them in `self.prot.var[cache_col]` (default `"STRING"`), together with
        `self.prot.var["ncbi_taxon_id"]`, for downstream use.

        Args:
            identifiers (list of str): List of UniProt accession IDs to map.
            overwrite (bool): If True, ignores cached STRING IDs and looks up every accession found in `.prot.var` again.
            cache_col (str): Column in `.prot.var` used to cache STRING IDs (default is `"STRING"`).
            batch_size (int): Number of accessions to include in each API query (default is 100).
            debug (bool): If True, prints progress and response info.

        Returns:
            pd.DataFrame: Mapping table with columns: `input_identifier`, `string_identifier`, and `ncbi_taxon_id`.

        Note:
            This is a helper method used primarily by `enrichment_functional()` and `enrichment_ppi()`.
        """

        identifiers = [str(x).strip() for x in identifiers if x is not None and str(x).strip()]
        if debug:
            print(f"{format_log_prefix('info')} Resolving STRING IDs for {len(identifiers)} identifiers...")

        prot_var = self.prot.var
        if cache_col not in prot_var.columns:
            prot_var[cache_col] = pd.NA
        if "ncbi_taxon_id" not in prot_var.columns:
            prot_var["ncbi_taxon_id"] = pd.NA

        # Use cached STRING IDs if available
        valid_ids = [i for i in identifiers if i in prot_var.index]
        existing = prot_var.loc[valid_ids, cache_col]
        found_ids = {i: sid for i, sid in existing.items() if pd.notna(sid) and str(sid).strip()}
        missing = [i for i in identifiers if i not in found_ids]

        if overwrite:
            print(f"{format_log_prefix('info_only',2)} Overwriting cached STRING IDs.")
            missing = valid_ids
            found_ids = {}

        print(f"{format_log_prefix('info_only',2)} Found {len(found_ids)} cached STRING IDs. {len(missing)} need lookup.")
        if debug:
            print(missing)

        # 1. UniProt stream (fast): use the UniProt xref_string field to fill the cache quickly.
        # 2. STRING API for any accessions that are still missing.

        if missing:
            map_df = utils.get_string_mappings(
                missing,
                use_uniprot=True,
                use_string=True,
                caller_identity="scpviz",
                batch_size=batch_size,
                debug=debug,
            )
        else:
            map_df = pd.DataFrame(columns=["input_identifier", "string_identifier", "ncbi_taxon_id"])

        # Combine all new mappings
        if not map_df.empty:
            updated = 0
            updated_tax = 0

            for _, row in map_df.iterrows():
                acc = row["input_identifier"]
                sid = row["string_identifier"]
                tax = row.get("ncbi_taxon_id", pd.NA)

                if acc is None or acc not in prot_var.index:
                    continue

                if pd.notna(sid) and str(sid).strip():
                    self.prot.var.at[acc, cache_col] = sid
                    found_ids[acc] = sid
                    updated += 1
                elif debug:
                    print(f"[DEBUG] No STRING ID returned for accession '{acc}'; skipping.")

                tax = utils.scalarize_taxon(tax)
                if tax is not pd.NA and pd.notna(tax):
                    prot_var.at[acc, "ncbi_taxon_id"] = str(tax)
                    updated_tax += 1

            print(f"{format_log_prefix('info_only',3)} Cached {updated} STRING ID mappings.")
            if debug:
                print(f"{format_log_prefix('info_only',3)} Cached {updated_tax} ncbi_taxon_id values.")

        elif missing:
            print(f"{format_log_prefix('warn_only',3)} No STRING mappings returned from STRING API.")


        # ------------------------------------
        # Build and MERGE UniProt results into out_df
        # ------------------------------------
        out_df = pd.DataFrame({"input_identifier": identifiers})

        # Prefer found_ids (fresh + cached), else fall back to cache for known indices
        out_df["string_identifier"] = out_df["input_identifier"].map(
            lambda acc: found_ids.get(
                acc,
                prot_var.at[acc, cache_col] if acc in prot_var.index else pd.NA,
            )
        )

        out_df["ncbi_taxon_id"] = out_df["input_identifier"].map(
            lambda acc: prot_var.at[acc, "ncbi_taxon_id"] if acc in prot_var.index else pd.NA
        )
        out_df["ncbi_taxon_id"] = out_df["ncbi_taxon_id"].apply(utils.scalarize_taxon)

        return out_df


    def resolve_to_accessions(self, mixed_list):
        """
        Convert gene names or accessions into standardized UniProt accession IDs.

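        This method resolves input items using the internal gene-to-accession map,
        so that every resolved entry is an accession present in the `.prot` object.
        Items that match neither an accession nor a known gene name are collected separately.

        Args:
            mixed_list (list of str): A list containing gene names and/or UniProt accessions.

        Returns:
            tuple (list of str, list of str): The resolved UniProt accession IDs, followed by
                the input items that could not be resolved.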

        Note:
            This function is similar to `utils.resolve_accessions()` but operates in the context 
            of the current `pAnnData` object and its internal gene mappings.

        Todo:
            Add example comparing results from `resolve_to_accessions()` and `utils.resolve_accessions()`.
        """
        gene_to_acc, _ = self.get_gene_maps(on='protein') 
        accs = []
        unresolved_accs = []
        for item in mixed_list:
            if item in self.prot.var.index:
                accs.append(item)  # already an accession
            elif item in gene_to_acc:
                accs.append(gene_to_acc[item])
            else:
                unresolved_accs.append(item)
                # print(f"{format_log_prefix('warn_only',2)} Could not resolve '{item}' to an accession — skipping.")
        return accs, unresolved_accs

    def enrichment_functional(
        self,
        genes=None,
        from_de=True,
        top_n=150,
        score_col="significance_score",
        gene_col="Genes",
        de_key="de_results",
        store_key=None,
        species=None,
        background=None,
        debug=False,
        **kwargs
    ):
        """
        Run functional enrichment analysis using STRING on a gene list.

        This method performs ranked or unranked enrichment analysis using STRING's API.
        It supports both differential expression-based analysis (up- and down-regulated genes)
        and custom gene lists provided by the user. Enrichment results are stored in
        `.stats["functional"]` for later access and plotting.

        Args:
            genes (list of str, optional): List of gene symbols or accessions to analyze. If provided, it takes precedence over `from_de`.
            from_de (bool): If True (default) and `genes` is None, selects genes from stored differential expression results.
            top_n (int): Number of top-ranked genes to use per direction when running on DE results (default is 150).
            score_col (str): Column name in the DE table to rank genes by (default is `"significance_score"`).
            gene_col (str): Column name in the DE table that contains gene names (default is `"Genes"`).
            de_key (str): Key used to look up stored DE results in `.stats` (default is `"de_results"`).
            store_key (str, optional): Custom key to store enrichment results. Ignored for DE-based enrichment, which uses auto-generated keys.
            species (str, optional): Organism name or NCBI taxonomy ID. If None, inferred from the most common `ncbi_taxon_id` among the mapped proteins.
            background (str or list of str, optional): Background gene list to use for enrichment.

                - If `"all_quantified"`, uses the non-significant proteins from the DE table (DE-based runs) or all other quantified proteins (user-supplied lists).
                - If a list, must contain valid gene names or accessions. Note that the implementation shown here only acts on `"all_quantified"`.
            debug (bool): If True, prints API request info and diagnostic messages.
            **kwargs: Additional keyword arguments passed to the STRING enrichment API.

        Returns:
            dict or pd.DataFrame:

                - If `from_de=True`, returns a dictionary of enrichment DataFrames for "up" and "down" gene sets.
                - If `genes` is provided, returns a single enrichment DataFrame.

        Example:
            Run differential expression, then perform STRING enrichment on top-ranked genes:
                ```python
                case1 = {'cellline': 'AS', 'treatment': 'sc'} # legacy style: class_type = ["group", "condition"]
                case2 = {'cellline': 'BE', 'treatment': 'sc'} # legacy style: values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]
                pdata.de(values=[case1, case2]) # or legacy style: pdata.de(classes=class_type, values=values)
                pdata.list_enrichments()  # list available DE result keys
                pdata.enrichment_functional(from_de=True, de_key="GroupA_Treatment1 vs GroupA_Control")
                ```

            Perform enrichment on a custom list of genes:
                ```python
                genelist = ["P55072", "NPLOC4", "UFD1", "STX5A", "NSFL1C", "UBXN2A",
                            "UBXN4", "UBE4B", "YOD1", "WASHC5", "PLAA", "UBXN10"]
                pdata.enrichment_functional(genes=genelist, from_de=False)
                ```

        Note:
            Internally uses `resolve_to_accessions()` and `get_string_mappings()`, and stores results 
            in `.stats["functional"]`. Results can be accessed or visualized via `plot_enrichment_svg()`
            or by visiting the linked STRING URLs.
        """
        def query_functional_enrichment(query_ids, species_id, background_ids=None, debug=False):
            if debug:
                print(f"{format_log_prefix('info_only',2)} Running enrichment on {len(query_ids)} STRING IDs (species {species_id})...")
            url = "https://string-db.org/api/json/enrichment"
            payload = {
                "identifiers": "%0d".join(query_ids),
                "species": species_id,
                "caller_identity": "scpviz"
            }
            if background_ids is not None:
                print(f"{format_log_prefix('info_only')} Using background of {len(background_ids)} STRING IDs.")
                payload["background_string_identifiers"] = "%0d".join(background_ids)

            if debug:
                print(payload)
            response = requests.post(url, data=payload)
            response.raise_for_status()
            return pd.DataFrame(response.json())

        # Ensure string metadata section exists
        if "functional" not in self.stats:
            self.stats["functional"] = {}

        if genes is None and from_de:
            resolved_key = _resolve_de_key(self.stats, de_key)
            de_df = self.stats[resolved_key]
            sig_df = de_df[de_df["significance"] != "not significant"].copy()
            print(f"{format_log_prefix('user')} Running STRING enrichment [DE-based: {resolved_key}]")

            up_genes = sig_df[sig_df[score_col] > 0][gene_col].dropna().head(top_n).tolist()
            down_genes = sig_df[sig_df[score_col] < 0][gene_col].dropna().head(top_n).tolist()

            up_accs, up_unresolved = self.resolve_to_accessions(up_genes)
            down_accs, down_unresolved = self.resolve_to_accessions(down_genes)

            background_accs = None
            background_string_ids = None
            if background == "all_quantified":
                print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
                background_accs = de_df[de_df["significance"] == "not significant"].index.tolist()

            if background_accs:
                bg_map = self.get_string_mappings(background_accs,debug=debug)
                bg_map = bg_map[bg_map["string_identifier"].notna()]
                background_string_ids = bg_map["string_identifier"].tolist()

            if store_key is not None:
                print(f"{format_log_prefix('warn')} Ignoring `store_key` for DE-based enrichment. Using auto-generated pretty keys.")

            results = {}
            for label, accs in zip(["up", "down"], [up_accs, down_accs]):
                print(f"\n🔹 {label.capitalize()}-regulated proteins")
                t0 = time.time()

                if not accs:
                    print(f"{format_log_prefix('warn')} No {label}-regulated proteins to analyze.")
                    continue

                mapping_df = self.get_string_mappings(accs, debug=debug)
                mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
                if mapping_df.empty:
                    print(f"{format_log_prefix('warn')} No valid STRING mappings found for {label}-regulated proteins.")
                    continue

                string_ids = mapping_df["string_identifier"].tolist()
                inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
                if species is not None:
                    # Compare as strings: cached taxon IDs are stored as str, but users may pass an int.
                    if str(inferred_species) != str(species):
                        print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
                    species_id = species
                else:
                    species_id = inferred_species

                print(f"   🔸 Proteins: {len(accs)} → STRING IDs: {len(string_ids)}")
                print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
                if label == "up":
                    if up_unresolved:
                        print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(up_unresolved)}")
                else:
                    if down_unresolved:
                        print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(down_unresolved)}")

                enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
                enrich_key = f"{resolved_key}_{label}"
                pretty_base = _pretty_vs_key(resolved_key)
                pretty_key = f"{pretty_base}_{label}"
                string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

                self.stats["functional"][pretty_key] = {
                    "string_ids": string_ids,
                    "background_string_ids": background_string_ids,
                    "species": species_id,
                    "input_key": resolved_key if from_de else None,
                    "string_url": string_url,
                    "result": enrichment_df
                }

                print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
                print(f"   • Access result: pdata.stats['functional'][\"{pretty_key}\"][\"result\"]")
                print(f"   • Plot command : pdata.plot_enrichment_svg(\"{pretty_base}\", direction=\"{label}\")")
                print(f"   • View online  : {string_url}\n")

                results[label] = enrichment_df

            # Return the per-direction results, as documented in the docstring.
            return results

        elif genes is not None:
            t0 = time.time()
            print(f"{format_log_prefix('user')} Running STRING enrichment [user-supplied]")

            if store_key is None:
                # Pick the first unused "UserSearchN" key so existing results are never overwritten.
                prefix = "UserSearch"
                next_id = 1
                while f"{prefix}{next_id}" in self.stats["functional"]:
                    next_id += 1
                store_key = f"{prefix}{next_id}"

            input_accs, unresolved_accs = self.resolve_to_accessions(genes)
            mapping_df = self.get_string_mappings(input_accs, debug=debug)
            mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
            if mapping_df.empty:
                raise ValueError("No valid STRING mappings found for the provided identifiers.")

            string_ids = mapping_df["string_identifier"].tolist()
            inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
            if species is not None:
                # Compare as strings: cached taxon IDs are stored as str, but users may pass an int.
                if str(inferred_species) != str(species):
                    print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
                species_id = species
            else:
                species_id = inferred_species

            background_string_ids = None
            if background == "all_quantified":
                print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
                all_accs = list(self.prot.var_names)
                background_accs = list(set(all_accs) - set(input_accs))
                bg_map = self.get_string_mappings(background_accs, debug=debug)
                bg_map = bg_map[bg_map["string_identifier"].notna()]
                background_string_ids = bg_map["string_identifier"].tolist()

            print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(string_ids)}")
            print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
            if unresolved_accs:
                print(f"{format_log_prefix('warn',2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

            enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
            string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

            self.stats["functional"][store_key] = {
                "string_ids": string_ids,
                "background_string_ids": background_string_ids,
                "species": species_id,
                "input_key": None,
                "string_url": string_url,
                "result": enrichment_df
            }

            print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
            print(f"   • Access result: pdata.stats['functional'][\"{store_key}\"][\"result\"]")
            print(f"   • Plot command : pdata.plot_enrichment_svg(\"{store_key}\")")
            print(f"   • View online  : {string_url}\n")

            return enrichment_df

        else:
            raise ValueError("Must provide 'genes' or set from_de=True to use DE results.") 

    def enrichment_ppi(self, genes, species=None, store_key=None, debug=False):
        """
        Run STRING PPI (protein–protein interaction) enrichment on a user-supplied gene or accession list.

        This method maps the input gene names or UniProt accessions to STRING IDs, infers the species 
        if not provided, and submits the list to STRING's PPI enrichment endpoint. Results are stored 
        in `.stats["ppi"]` for later retrieval or visualization.

        Args:
            genes (list of str): A list of gene names or UniProt accessions to analyze.
            species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). If None, inferred from STRING mappings.
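            store_key (str, optional): Key to store the enrichment result under `.stats["ppi"]`.
                If None, a unique key is auto-generated.
            debug (bool): If True, prints diagnostic messages while mapping identifiers and querying STRING.

        Returns:
            dict: STRING PPI enrichment result, including `number_of_edges`,
                `expected_number_of_edges`, and `p_value`.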

        Example:
            Run differential expression, then perform STRING PPI enrichment on significant genes:
                ```python
                class_type = ["group", "condition"]
                values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]

                pdata.de(classes=class_type, values=values)
                pdata.list_enrichments()
                sig_genes = pdata.stats["GroupA_Treatment1 vs GroupA_Control"]  # DE tables are stored under their comparison key
                sig_genes = sig_genes[sig_genes["significance"] != "not significant"]["Genes"].dropna().tolist()

                pdata.enrichment_ppi(genes=sig_genes)
                ```
        """
        def query_ppi_enrichment(string_ids, species):
            # print(f"[INFO] Running PPI enrichment for {len(string_ids)} STRING IDs (species {species})...")
            url = "https://string-db.org/api/json/ppi_enrichment"
            payload = {
                "identifiers": "%0d".join(string_ids),
                "species": species,
                "caller_identity": "scpviz"
            }

            response = requests.post(url, data=payload)
            response.raise_for_status()

            result = response.json()
            if debug:
                print("[DEBUG] PPI enrichment result:", result)
            return result[0] if isinstance(result, list) else result

        print(f"{format_log_prefix('user')} Running STRING PPI enrichment")
        t0 = time.time()
        input_accs, unresolved_accs = self.resolve_to_accessions(genes)
        mapping_df = self.get_string_mappings(input_accs, debug=debug)
        mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
        if mapping_df.empty:
            raise ValueError("No valid STRING mappings found for the provided genes/accessions.")

        string_ids = mapping_df["string_identifier"].tolist()
        inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
        species_id = species if species is not None else inferred_species

        print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(mapping_df)}")
        print(f"   🔸 Species: {species_id}")
        if unresolved_accs:
            print(f"{format_log_prefix('warn', 2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

        result = query_ppi_enrichment(string_ids, species_id)

        # Store results
        if "ppi" not in self.stats:
            self.stats["ppi"] = {}

        if store_key is None:
            base = "UserPPI"
            counter = 1
            while f"{base}{counter}" in self.stats["ppi"]:
                counter += 1
            store_key = f"{base}{counter}"

        self.stats["ppi"][store_key] = {
            "result": result,
            "string_ids": string_ids,
            "species": species_id
        }

        print(f"{format_log_prefix('result')} PPI enrichment complete ({time.time() - t0:.2f}s)")
        print(f"   • STRING IDs   : {len(string_ids)}")
        print(f"   • Edges found  : {result['number_of_edges']} vs {result['expected_number_of_edges']} expected")
        print(f"   • p-value      : {result['p_value']:.2e}")
        print(f"   • Access result: pdata.stats['ppi']['{store_key}']['result']\n")

        return result

    def list_enrichments(self):
        """
        List available STRING enrichment results and unprocessed DE contrasts.

        This method prints available functional and PPI enrichment entries stored in
        `.stats["functional"]` and `.stats["ppi"]`, as well as DE comparisons in 
        `.stats["de_results"]` that have not yet been analyzed.

        Returns:
            None

        Example:
            List enrichment results stored after running functional or PPI enrichment:
                ```python
                pdata.list_enrichments()
                ```
        """

        functional = self.stats.get("functional", {})
        ppi_keys = self.stats.get("ppi", {}).keys()
        de_keys = {k for k in self.stats if "vs" in k and not k.endswith(("_up", "_down"))}

        # Collect enriched DE keys based on input_key metadata
        enriched_de = set()
        enriched_results = []

        for k, meta in functional.items():
            input_key = meta.get("input_key", None)
            is_de = "vs" in k

            if input_key and input_key in de_keys:
                base = input_key
                suffix = k.rsplit("_", 1)[-1]
                pretty = f"{_pretty_vs_key(base)}_{suffix}"
                enriched_de.add(base)
                enriched_results.append((pretty, k, "DE-based"))
            else:
                enriched_results.append((k, k, "User"))

        de_unenriched = sorted(_pretty_vs_key(k) for k in (de_keys - enriched_de))

        print(f"{format_log_prefix('user')} Listing STRING enrichment status\n")

        print(f"{format_log_prefix('info_only',2)} Available DE comparisons (not yet enriched):")
        if de_unenriched:
            for pk in de_unenriched:
                print(f"        - {pk}")
        else:
            print("  (none)\n")

        print("\n  🔹 To run enrichment:")
        print("      pdata.enrichment_functional(from_de=True, de_key=\"...\")")

        print(f"\n{format_log_prefix('result_only')} Completed STRING enrichment results:")
        if not enriched_results:
            print("    (none)")
        for pretty, raw_key, kind in enriched_results:
            if kind == "DE-based":
                base, suffix = pretty.rsplit("_", 1)
                print(f"  - {pretty} ({kind})")
                print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
                print(f"    • Plot : pdata.plot_enrichment_svg(\"{base}\", direction=\"{suffix}\")")
                url = self.stats["functional"].get(raw_key, {}).get("string_url")
                if url:
                    print(f"    • Link  : {url}")
            else:
                print(f"  - {pretty} ({kind})")
                print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
                print(f"    • Plot : pdata.plot_enrichment_svg(\"{pretty}\")")
                url = self.stats["functional"].get(raw_key, {}).get("string_url")
                if url:
                    print(f"    • Link  : {url}")

        print(f"\n{format_log_prefix('result_only')} Completed STRING PPI results:")
        if ppi_keys:
            for key in sorted(ppi_keys):
                print(f"  - {key} (User)")
                print(f"    • Table: pdata.stats['ppi']['{key}']['result']")
        else:
            print("    (none)")

    def plot_enrichment_svg(self, key, direction=None, category=None, save_as=None):
        """
        Display STRING enrichment SVG inline in a Jupyter notebook.

        This method fetches and renders a STRING-generated SVG for a previously completed
        functional enrichment result. Optionally, the SVG can also be saved to disk.

        Args:
            key (str): Enrichment result key from `.stats["functional"]`. For DE-based comparisons, pass the
                contrast without the direction suffix (e.g., `"GroupA_Treatment1_vs_Control"`) and set `direction`.
            direction (str, optional): Direction of DE result, either `"up"` or `"down"`. Required for DE-based
                keys; use `None` for user-defined gene lists.
            category (str, optional): STRING enrichment category to filter by (e.g., `"Process"`, `"KEGG"`). See the note below for the supported options.
            save_as (str, optional): If provided, saves the retrieved SVG to the given file path.

        Returns:
            None

        Example:
            Display a STRING enrichment network for a user-supplied gene list:
                ```python
                pdata.plot_enrichment_svg("UserSearch1")
                ```

        !!! note "Supported STRING Enrichment Categories"
            The following category IDs are supported for functional enrichment.  
            More details are available on the [STRING API documentation site](https://string-db.org/cgi/help?subpage=api).

            | Category ID          | Description                                      |
            |----------------------|--------------------------------------------------|
            | **Process**          | Biological Process (Gene Ontology)               |
            | **Function**         | Molecular Function (Gene Ontology)               |
            | **Component**        | Cellular Component (Gene Ontology)               |
            | **Keyword**          | Annotated Keywords (UniProt)                     |
            | **KEGG**             | KEGG Pathways                                    |
            | **RCTM**             | Reactome Pathways                                |
            | **HPO**              | Human Phenotype (Monarch)                        |
            | **MPO**              | Mammalian Phenotype Ontology (Monarch)           |
            | **DPO**              | Drosophila Phenotype (Monarch)                   |
            | **WPO**              | *C. elegans* Phenotype Ontology (Monarch)        |
            | **ZPO**              | Zebrafish Phenotype Ontology (Monarch)           |
            | **FYPO**             | Fission Yeast Phenotype Ontology (Monarch)       |
            | **Pfam**             | Protein Domains (Pfam)                           |
            | **SMART**            | Protein Domains (SMART)                          |
            | **InterPro**         | Protein Domains and Features (InterPro)          |
            | **PMID**             | Reference Publications (PubMed)                  |
            | **NetworkNeighborAL**| Local Network Cluster (STRING)                   |
            | **COMPARTMENTS**     | Subcellular Localization (COMPARTMENTS)          |
            | **TISSUES**          | Tissue Expression (TISSUES)                      |
            | **DISEASES**         | Disease–gene Associations (DISEASES)             |
            | **WikiPathways**     | WikiPathways                                     |

        Note:
            The `key` must correspond to an existing entry in `.stats["functional"]`, created via 
            `enrichment_functional()`.
        """
        from xml.parsers.expat import ExpatError

        if "functional" not in self.stats:
            raise ValueError("No STRING enrichment results found in .stats['functional'].")

        all_keys = list(self.stats["functional"].keys())

        # Handle DE-type key
        if "vs" in key:
            if direction not in {"up", "down"}:
                raise ValueError("You must specify direction='up' or 'down' for DE-based enrichment keys.")
            lookup_key = _resolve_de_key(self.stats["functional"], f"{key}_{direction}")
        else:
            # Handle user-supplied key (e.g. "userSearch1")
            if direction is not None:
                print(f"[WARNING] Ignoring direction='{direction}' for user-supplied key: '{key}'")
            lookup_key = key

        if lookup_key not in self.stats["functional"]:
            available = "\n".join(f"  - {k}" for k in self.stats["functional"].keys())
            raise ValueError(f"Could not find enrichment results for '{lookup_key}'. Available keys:\n{available}")

        meta = self.stats["functional"][lookup_key]
        string_ids = meta["string_ids"]
        species_id = meta["species"]

        url = "https://string-db.org/api/svg/enrichmentfigure"
        params = {
            "identifiers": "%0d".join(string_ids),
            "species": species_id
        }
        if category:
            params["category"] = category

        print(f"{format_log_prefix('user')} Fetching STRING SVG for key '{lookup_key}' (n={len(string_ids)})...")
        response = requests.get(url, params=params)
        response.raise_for_status()

        if save_as:
            with open(save_as, "wb") as f:
                f.write(response.content)
            print(f"{format_log_prefix('info_only')} Saved SVG to: {save_as}")

        with tempfile.NamedTemporaryFile("wb", suffix=".svg", delete=False) as tmp:
            tmp.write(response.content)
            tmp_path = tmp.name

        try:
            try:
                display(SVG(filename=tmp_path))
            except ExpatError:
                print(
                    f"{format_log_prefix('info_only')} No enrichment figure available "
                    f"for key '{lookup_key}'"
                    + (f" (category='{category}')." if category else ".")
                )
        finally:
            os.remove(tmp_path)

    def get_string_network_link(self, key=None, string_ids=None, species=None, show_labels=True):
        """
        Generate a direct STRING network URL to visualize protein interactions online.

        This method constructs a STRING website link to view a network of proteins,
        using either a list of STRING IDs or a key from previously stored enrichment results.

        Args:
            key (str, optional): Key from `.stats["functional"]` to extract STRING IDs and species info.
            string_ids (list of str, optional): List of STRING identifiers to include in the network.
            species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). Required if not using a stored key.
            show_labels (bool): If True (default), node labels will be shown in the network view.

        Returns:
            str: URL to open the network in the STRING web interface.

        Example:
            Get a STRING network link for a stored enrichment result:
                ```python
                url = pdata.get_string_network_link(key="UserSearch1")
                print(url)
                ```
        """
        if string_ids is None:
            if key is None:
                raise ValueError("Must provide either a list of STRING IDs or a key.")
            metadata = self.stats.get("functional", {}).get(key)
            if metadata is None:
                raise ValueError(f"Key '{key}' not found in self.stats['functional'].")
            string_ids = metadata.get("string_ids")
            species = species or metadata.get("species")

        if not string_ids:
            raise ValueError("No STRING IDs found or provided.")

        base_url = "https://string-db.org/cgi/network"
        params = [
            f"identifiers={'%0d'.join(string_ids)}",
            f"caller_identity=scpviz"
        ]
        if species:
            params.append(f"species={species}")
        if show_labels:
            params.append("show_query_node_labels=1")

        return f"{base_url}?{'&'.join(params)}"

enrichment_functional

enrichment_functional(genes=None, from_de=True, top_n=150, score_col='significance_score', gene_col='Genes', de_key='de_results', store_key=None, species=None, background=None, debug=False, **kwargs)

Run functional enrichment analysis using STRING on a gene list.

This method performs ranked or unranked enrichment analysis using STRING's API. It supports both differential expression-based analysis (up- and down-regulated genes) and custom gene lists provided by the user. Enrichment results are stored in .stats["functional"] for later access and plotting.
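As a rough illustration of the DE-based mode, the sketch below splits significant rows by the sign of their score and keeps the first `top_n` genes on each side. It uses plain tuples rather than the stored DE table. The function name `split_up_down`, the sample rows, and the `"upregulated"`/`"downregulated"` labels are hypothetical. Like the source, it keeps the rows in their existing order instead of re-sorting them.

```python
def split_up_down(rows, top_n=150):
    """Split (gene, score, significance) rows into up/down gene lists.

    Hypothetical stand-in for the DE-table filtering; rows keep their input order.
    """
    # Drop rows flagged as not significant, as the DE-based branch does.
    significant = [r for r in rows if r[2] != "not significant"]
    # Positive scores go to "up", negative scores to "down"; missing gene names are skipped.
    up = [g for g, score, _ in significant if g is not None and score > 0][:top_n]
    down = [g for g, score, _ in significant if g is not None and score < 0][:top_n]
    return up, down


# Made-up rows for illustration only.
rows = [
    ("VCP", 3.2, "upregulated"),
    ("NPLOC4", -2.1, "downregulated"),
    ("UFD1", 0.4, "not significant"),
    (None, 1.8, "upregulated"),
    ("PLAA", 2.7, "upregulated"),
]
up, down = split_up_down(rows, top_n=2)
print(up, down)
```

Each of the two lists is then mapped to STRING IDs and submitted separately, which is why DE-based results are stored per direction.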

Parameters:

Name	Type	Description	Default
`genes`	`list of str`	List of gene symbols or accessions to analyze. When provided, it is used instead of the DE results, even if `from_de=True`.	`None`
`from_de`	`bool`	If True (default), selects genes from stored differential expression results.	`True`
`top_n`	`int`	Number of top-ranked genes to use when `from_de=True` (default is 150).	`150`
`score_col`	`str`	Column name in the DE table to rank genes by (default is `"significance_score"`).	`'significance_score'`
`gene_col`	`str`	Column name in `.prot.var` or DE results that contains gene names.	`'Genes'`
`de_key`	`str`	Key to retrieve stored DE results from `.stats["de_results"]`.	`'de_results'`
`store_key`	`str`	Custom key to store enrichment results. Ignored for DE-based enrichment, which uses auto-generated keys.	`None`
`species`	`str`	Organism name or NCBI taxonomy ID. If None, inferred from STRING response.	`None`
`background`	`str or list of str`	Background gene list to use for enrichment. If `"all_quantified"`, uses non-significant proteins from DE or all other quantified proteins. If a list, must contain valid gene names or accessions.	`None`
`debug`	`bool`	If True, prints API request info and diagnostic messages.	`False`
`**kwargs`		Additional keyword arguments passed to the STRING enrichment API.	`{}`

Returns:

Type	Description
`dict or pd.DataFrame`	For DE-based enrichment, a dictionary of enrichment DataFrames keyed by "up" and "down". If `genes` is provided, a single enrichment DataFrame.

Example

Run differential expression, then perform STRING enrichment on top-ranked genes:

case1 = {'cellline': 'AS', 'treatment': 'sc'} # legacy style: class_type = ["group", "condition"]
case2 = {'cellline': 'BE', 'treatment': 'sc'} # legacy style: values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]
pdata.de(values=[case1, case2]) # or legacy style: pdata.de(classes=class_type, values=values)
pdata.list_enrichments()  # list available DE result keys
pdata.enrichment_functional(from_de=True, de_key="GroupA_Treatment1 vs GroupA_Control")  # legacy-style key; use a key listed by list_enrichments()

Perform enrichment on a custom list of genes:

genelist = ["P55072", "NPLOC4", "UFD1", "STX5A", "NSFL1C", "UBXN2A",
            "UBXN4", "UBE4B", "YOD1", "WASHC5", "PLAA", "UBXN10"]
pdata.enrichment_functional(genes=genelist, from_de=False)

Note

Internally uses resolve_to_accessions() and get_string_mappings(), and stores results in .stats["functional"]. Results can be accessed or visualized via plot_enrichment_svg() or by visiting the linked STRING URLs.
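To make the storage layout concrete, here is a minimal sketch of what `.stats["functional"]` might look like and how a DE-style or user-supplied key is looked up. The field names follow the entries written by `enrichment_functional()`. Everything else is an assumption for illustration: the key strings, the STRING IDs, the `result` rows (a list of dicts standing in for the real DataFrame), and the helper `get_entry`. The real lookup also goes through `_resolve_de_key`, which is omitted here.

```python
# Hypothetical snapshot of pdata.stats["functional"]; all values are made up.
functional = {
    "GroupA_Treatment1 vs GroupA_Control_up": {
        "string_ids": ["9606.ENSP0000000001", "9606.ENSP0000000002"],
        "background_string_ids": None,
        "species": 9606,
        "input_key": "GroupA_Treatment1 vs GroupA_Control",
        "string_url": "<network URL>",
        "result": [{"category": "Process", "description": "example term", "fdr": 0.001}],
    },
    "UserSearch1": {
        "string_ids": ["9606.ENSP0000000003"],
        "background_string_ids": None,
        "species": 9606,
        "input_key": None,
        "string_url": "<network URL>",
        "result": [],
    },
}


def get_entry(store, key, direction=None):
    """Return the stored entry for a DE-style ("A vs B") or user-supplied key."""
    if "vs" in key:
        # DE-based results are stored once per direction.
        if direction not in {"up", "down"}:
            raise ValueError("Specify direction='up' or 'down' for DE-based keys.")
        lookup_key = f"{key}_{direction}"
    else:
        lookup_key = key
    if lookup_key not in store:
        raise KeyError(f"No enrichment results for '{lookup_key}'.")
    return store[lookup_key]


entry = get_entry(functional, "GroupA_Treatment1 vs GroupA_Control", direction="up")
significant = [row["description"] for row in entry["result"] if row["fdr"] < 0.05]
print(entry["species"], significant)
```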

Source code in src/scpviz/pAnnData/enrichment.py

def enrichment_functional(
    self,
    genes=None,
    from_de=True,
    top_n=150,
    score_col="significance_score",
    gene_col="Genes",
    de_key="de_results",
    store_key=None,
    species=None,
    background=None,
    debug=False,
    **kwargs
):
    """
    Run functional enrichment analysis using STRING on a gene list.

    This method performs ranked or unranked enrichment analysis using STRING's API.
    It supports both differential expression-based analysis (up- and down-regulated genes)
    and custom gene lists provided by the user. Enrichment results are stored in
    `.stats["functional"]` for later access and plotting.

    Args:
        genes (list of str, optional): List of gene symbols or accessions to analyze. When provided, it is used instead of the DE results, even if `from_de=True`.
        from_de (bool): If True (default), selects genes from stored differential expression results.
        top_n (int): Number of top-ranked genes to use when `from_de=True` (default is 150).
        score_col (str): Column name in the DE table to rank genes by (default is `"significance_score"`).
        gene_col (str): Column name in `.prot.var` or DE results that contains gene names.
        de_key (str): Key to retrieve stored DE results from `.stats["de_results"]`.
        store_key (str, optional): Custom key to store enrichment results. Ignored for DE-based enrichment, which uses auto-generated keys.
        species (str, optional): Organism name or NCBI taxonomy ID. If None, inferred from STRING response.
        background (str or list of str, optional): Background gene list to use for enrichment.

            - If `"all_quantified"`, uses non-significant proteins from DE or all other quantified proteins.
            - If a list, must contain valid gene names or accessions.
        debug (bool): If True, prints API request info and diagnostic messages.
        **kwargs: Additional keyword arguments passed to the STRING enrichment API.

    Returns:
        dict or pd.DataFrame:

            - If `from_de=True`, returns a dictionary of enrichment DataFrames for "up" and "down" gene sets.
            - If `genes` is provided, returns a single enrichment DataFrame.

    Example:
        Run differential expression, then perform STRING enrichment on top-ranked genes:
            ```python
            case1 = {'cellline': 'AS', 'treatment': 'sc'} # legacy style: class_type = ["group", "condition"]
            case2 = {'cellline': 'BE', 'treatment': 'sc'} # legacy style: values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]
            pdata.de(values=[case1, case2]) # or legacy style: pdata.de(classes=class_type, values=values)
            pdata.list_enrichments()  # list available DE result keys
            pdata.enrichment_functional(from_de=True, de_key="GroupA_Treatment1 vs GroupA_Control")  # legacy-style key; use a key listed by list_enrichments()
            ```

        Perform enrichment on a custom list of genes:
            ```python
            genelist = ["P55072", "NPLOC4", "UFD1", "STX5A", "NSFL1C", "UBXN2A",
                        "UBXN4", "UBE4B", "YOD1", "WASHC5", "PLAA", "UBXN10"]
            pdata.enrichment_functional(genes=genelist, from_de=False)
            ```

    Note:
        Internally uses `resolve_to_accessions()` and `get_string_mappings()`, and stores results 
        in `.stats["functional"]`. Results can be accessed or visualized via `plot_enrichment_svg()`
        or by visiting the linked STRING URLs.
    """
    def query_functional_enrichment(query_ids, species_id, background_ids=None, debug=False):
        print(f"{format_log_prefix('info_only',2)} Running enrichment on {len(query_ids)} STRING IDs (species {species_id})...") if debug else None
        url = "https://string-db.org/api/json/enrichment"
        payload = {
            "identifiers": "%0d".join(query_ids),
            "species": species_id,
            "caller_identity": "scpviz"
        }
        if background_ids is not None:
            print(f"{format_log_prefix('info_only')} Using background of {len(background_ids)} STRING IDs.")
            payload["background_string_identifiers"] = "%0d".join(background_ids)

        print(payload) if debug else None
        response = requests.post(url, data=payload)
        response.raise_for_status()
        return pd.DataFrame(response.json())

    # Ensure string metadata section exists
    if "functional" not in self.stats:
        self.stats["functional"] = {}

    if genes is None and from_de:
        resolved_key = _resolve_de_key(self.stats, de_key)
        de_df = self.stats[resolved_key]
        sig_df = de_df[de_df["significance"] != "not significant"].copy()
        print(f"{format_log_prefix('user')} Running STRING enrichment [DE-based: {resolved_key}]")

        up_genes = sig_df[sig_df[score_col] > 0][gene_col].dropna().head(top_n).tolist()
        down_genes = sig_df[sig_df[score_col] < 0][gene_col].dropna().head(top_n).tolist()

        up_accs, up_unresolved = self.resolve_to_accessions(up_genes)
        down_accs, down_unresolved = self.resolve_to_accessions(down_genes)

        background_accs = None
        background_string_ids = None
        if background == "all_quantified":
            print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
            background_accs = de_df[de_df["significance"] == "not significant"].index.tolist()

        if background_accs:
            bg_map = self.get_string_mappings(background_accs,debug=debug)
            bg_map = bg_map[bg_map["string_identifier"].notna()]
            background_string_ids = bg_map["string_identifier"].tolist()

        if store_key is not None:
            print(f"{format_log_prefix('warn')} Ignoring `store_key` for DE-based enrichment. Using auto-generated pretty keys.")

        results = {}
        for label, accs in zip(["up", "down"], [up_accs, down_accs]):
            print(f"\n🔹 {label.capitalize()}-regulated proteins")
            t0 = time.time()

            if not accs:
                print(f"{format_log_prefix('warn')} No {label}-regulated proteins to analyze.")
                continue

            mapping_df = self.get_string_mappings(accs, debug=debug)
            mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
            if mapping_df.empty:
                print(f"{format_log_prefix('warn')} No valid STRING mappings found for {label}-regulated proteins.")
                continue

            string_ids = mapping_df["string_identifier"].tolist()
            inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
            if species is not None:
                # check if user species is same as inferred
                if inferred_species != species:
                    print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
                species_id = species
            else:
                species_id = inferred_species

            print(f"   🔸 Proteins: {len(accs)} → STRING IDs: {len(string_ids)}")
            print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
            if label == "up":
                if up_unresolved:
                    print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(up_unresolved)}")
            else:
                if down_unresolved:
                    print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(down_unresolved)}")

            enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
            enrich_key = f"{resolved_key}_{label}"
            pretty_base = _pretty_vs_key(resolved_key)
            pretty_key = f"{pretty_base}_{label}"
            string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

            self.stats["functional"][pretty_key] = {
                "string_ids": string_ids,
                "background_string_ids": background_string_ids,
                "species": species_id,
                "input_key": resolved_key if from_de else None,
                "string_url": string_url,
                "result": enrichment_df
            }

            print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
            print(f"   • Access result: pdata.stats['functional'][\"{pretty_key}\"][\"result\"]")
            print(f"   • Plot command : pdata.plot_enrichment_svg(\"{pretty_base}\", direction=\"{label}\")")
            print(f"   • View online  : {string_url}\n")

            results[label] = enrichment_df

        # Return the per-direction results, as documented in the Returns section
        return results

    elif genes is not None:
        t0 = time.time()
        print(f"{format_log_prefix('user')} Running STRING enrichment [user-supplied]")

        if store_key is None:
            prefix = "UserSearch"
            existing = self.stats["functional"].keys() if "functional" in self.stats else []
            existing_ids = [k for k in existing if k.startswith(prefix)]
            next_id = len(existing_ids) + 1
            store_key = f"{prefix}{next_id}"

        input_accs, unresolved_accs = self.resolve_to_accessions(genes)
        mapping_df = self.get_string_mappings(input_accs, debug=debug)
        mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
        if mapping_df.empty:
            raise ValueError("No valid STRING mappings found for the provided identifiers.")

        string_ids = mapping_df["string_identifier"].tolist()
        inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
        if species is not None:
            # check if user species is same as inferred
            if inferred_species != species:
                print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
            species_id = species
        else:
            species_id = inferred_species

        background_string_ids = None
        if background == "all_quantified":
            print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
            all_accs = list(self.prot.var_names)
            background_accs = list(set(all_accs) - set(input_accs))
            bg_map = self.get_string_mappings(background_accs, debug=debug)
            bg_map = bg_map[bg_map["string_identifier"].notna()]
            background_string_ids = bg_map["string_identifier"].tolist()

        print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(string_ids)}")
        print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
        if unresolved_accs:
            print(f"{format_log_prefix('warn',2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

        enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
        string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

        self.stats["functional"][store_key] = {
            "string_ids": string_ids,
            "background_string_ids": background_string_ids,
            "species": species_id,
            "input_key": None,
            "string_url": string_url,
            "result": enrichment_df
        }

        print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
        print(f"   • Access result: pdata.stats['functional'][\"{store_key}\"][\"result\"]")
        print(f"   • Plot command : pdata.plot_enrichment_svg(\"{store_key}\")")
        print(f"   • View online  : {string_url}\n")

        return enrichment_df

    else:
        raise ValueError("Must provide 'genes' or set from_de=True to use DE results.") 

enrichment_ppi

enrichment_ppi(genes, species=None, store_key=None, debug=False)

Run STRING PPI (protein–protein interaction) enrichment on a user-supplied gene or accession list.

This method maps the input gene names or UniProt accessions to STRING IDs, infers the species if not provided, and submits the list to STRING's PPI enrichment endpoint. Results are stored in .stats["ppi"] for later retrieval or visualization.
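The PPI result is a small summary rather than a table of terms. The sketch below shows one way to read it, using the three fields that the method itself prints (`number_of_edges`, `expected_number_of_edges`, `p_value`). The helper name `summarize_ppi`, the `alpha` cutoff, and the sample numbers are assumptions for illustration.

```python
def summarize_ppi(result, alpha=0.05):
    """Condense a STRING PPI enrichment summary into a few derived values."""
    observed = result["number_of_edges"]
    expected = result["expected_number_of_edges"]
    # Ratio of observed to expected edges; guard against a zero expectation.
    ratio = observed / expected if expected else float("inf")
    return {
        "observed": observed,
        "expected": expected,
        "ratio": ratio,
        "enriched": result["p_value"] < alpha,  # alpha is an arbitrary cutoff
    }


# Made-up values in the shape of a stored PPI result.
example = {"number_of_edges": 30, "expected_number_of_edges": 12, "p_value": 1.5e-6}
summary = summarize_ppi(example)
print(summary)
```

A ratio well above 1 with a small p-value suggests the input proteins are more interconnected than a random set of the same size.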

Parameters:

Name	Type	Description	Default
`genes`	`list of str`	A list of gene names or UniProt accessions to analyze.	required
`species`	`int or str`	NCBI taxonomy ID (e.g., 9606 for human). If None, inferred from STRING mappings.	`None`
`store_key`	`str`	Key to store the enrichment result under `.stats["ppi"]`. If None, a unique key is auto-generated.	`None`
`debug`	`bool`	If True, prints diagnostic messages while resolving STRING mappings.	`False`

Returns:

Type	Description
`dict`	STRING PPI enrichment summary returned by the API, including `number_of_edges`, `expected_number_of_edges`, and `p_value`.
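When `store_key` is None, the key is generated by counting upward from `UserPPI1` until an unused name is found. A minimal standalone sketch of that scheme follows; the function name `next_store_key` is hypothetical.

```python
def next_store_key(existing_keys, base="UserPPI"):
    """Return the first '<base><n>' (n = 1, 2, ...) not already in existing_keys."""
    counter = 1
    while f"{base}{counter}" in existing_keys:
        counter += 1
    return f"{base}{counter}"


first_free = next_store_key({"UserPPI1", "UserPPI2"})
gap_filled = next_store_key({"UserPPI2"})
print(first_free, gap_filled)
```

Because the loop checks membership instead of counting existing entries, a deleted key leaves a gap that is reused rather than causing a collision.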

Example

Run differential expression, then perform STRING PPI enrichment on significant genes:

class_type = ["group", "condition"]
values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]

pdata.de(classes=class_type, values=values)
pdata.list_enrichments()
sig_genes = pdata.stats["de_results"]["GroupA_Treatment1 vs GroupA_Control"]
sig_genes = sig_genes[sig_genes["significance"] != "not significant"]["Genes"].dropna().tolist()

pdata.enrichment_ppi(genes=sig_genes)

Source code in src/scpviz/pAnnData/enrichment.py

def enrichment_ppi(self, genes, species=None, store_key=None, debug=False):
    """
    Run STRING PPI (protein–protein interaction) enrichment on a user-supplied gene or accession list.

    This method maps the input gene names or UniProt accessions to STRING IDs, infers the species 
    if not provided, and submits the list to STRING's PPI enrichment endpoint. Results are stored 
    in `.stats["ppi"]` for later retrieval or visualization.

    Args:
        genes (list of str): A list of gene names or UniProt accessions to analyze.
        species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). If None, inferred from STRING mappings.
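        store_key (str, optional): Key to store the enrichment result under `.stats["ppi"]`.
            If None, a unique key is auto-generated.
        debug (bool): If True, prints diagnostic messages while resolving STRING mappings.

    Returns:
        dict: STRING PPI enrichment summary returned by the API, including
            `number_of_edges`, `expected_number_of_edges`, and `p_value`.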

    Example:
        Run differential expression, then perform STRING PPI enrichment on significant genes:
            ```python
            class_type = ["group", "condition"]
            values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]

            pdata.de(classes=class_type, values=values)
            pdata.list_enrichments()
            sig_genes = pdata.stats["de_results"]["GroupA_Treatment1 vs GroupA_Control"]
            sig_genes = sig_genes[sig_genes["significance"] != "not significant"]["Genes"].dropna().tolist()

            pdata.enrichment_ppi(genes=sig_genes)
            ```
    """
    def query_ppi_enrichment(string_ids, species):
        # print(f"[INFO] Running PPI enrichment for {len(string_ids)} STRING IDs (species {species})...")
        url = "https://string-db.org/api/json/ppi_enrichment"
        payload = {
            "identifiers": "%0d".join(string_ids),
            "species": species,
            "caller_identity": "scpviz"
        }

        response = requests.post(url, data=payload)
        response.raise_for_status()

        result = response.json()
        # Only print the raw API response when debugging is requested
        if debug:
            print("[DEBUG] PPI enrichment result:", result)
        return result[0] if isinstance(result, list) else result

    print(f"{format_log_prefix('user')} Running STRING PPI enrichment")
    t0 = time.time()
    input_accs, unresolved_accs = self.resolve_to_accessions(genes)
    mapping_df = self.get_string_mappings(input_accs, debug=debug)
    mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
    if mapping_df.empty:
        raise ValueError("No valid STRING mappings found for the provided genes/accessions.")

    string_ids = mapping_df["string_identifier"].tolist()
    inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
    species_id = species if species is not None else inferred_species

    print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(mapping_df)}")
    print(f"   🔸 Species: {species_id}")
    if unresolved_accs:
        print(f"{format_log_prefix('warn', 2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

    result = query_ppi_enrichment(string_ids, species_id)

    # Store results
    if "ppi" not in self.stats:
        self.stats["ppi"] = {}

    if store_key is None:
        base = "UserPPI"
        counter = 1
        while f"{base}{counter}" in self.stats["ppi"]:
            counter += 1
        store_key = f"{base}{counter}"

    self.stats["ppi"][store_key] = {
        "result": result,
        "string_ids": string_ids,
        "species": species_id
    }

    print(f"{format_log_prefix('result')} PPI enrichment complete ({time.time() - t0:.2f}s)")
    print(f"   • STRING IDs   : {len(string_ids)}")
    print(f"   • Edges found  : {result['number_of_edges']} vs {result['expected_number_of_edges']} expected")
    print(f"   • p-value      : {result['p_value']:.2e}")
    print(f"   • Access result: pdata.stats['ppi']['{store_key}']['result']\n")

    return result

get_string_mappings

get_string_mappings(identifiers, overwrite=False, cache_col='STRING', batch_size=100, debug=False)

Resolve STRING IDs for UniProt accessions with a two-step strategy:

1) Use the UniProt stream (fields: xref_string) to fill the cache quickly.
2) For any rows still missing, query STRING get_string_ids, batched by organism_id.

This method retrieves the corresponding STRING identifiers for a list of UniProt accessions and caches them in self.prot.var[cache_col] (default column "STRING"), together with self.prot.var["ncbi_taxon_id"], for downstream use.
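The caching behaviour can be sketched without any network access. In the example below, a plain dict stands in for the cache column in `.prot.var`, and a stub function stands in for the remote lookups (the UniProt and STRING steps are collapsed into one call). The names `map_with_cache` and `fake_fetch`, and all STRING IDs shown, are hypothetical.

```python
def map_with_cache(identifiers, cache, fetch_batch, batch_size=100):
    """Return {identifier: STRING ID or None}, querying only uncached identifiers."""
    # Anything without a non-empty cached value needs a lookup.
    missing = [i for i in identifiers if not cache.get(i)]
    for start in range(0, len(missing), batch_size):
        batch = missing[start:start + batch_size]
        cache.update(fetch_batch(batch))  # remember fresh mappings
    return {i: cache.get(i) for i in identifiers}


# Hypothetical remote table and a stub that records each batch it receives.
REMOTE = {"Q8TAT6": "9606.ENSP_B", "Q92890": "9606.ENSP_C"}
calls = []


def fake_fetch(batch):
    calls.append(list(batch))
    return {acc: REMOTE[acc] for acc in batch if acc in REMOTE}


cache = {"P55072": "9606.ENSP_A"}
mapping = map_with_cache(
    ["P55072", "Q8TAT6", "Q92890", "X00000"], cache, fake_fetch, batch_size=2
)
print(mapping)
print(calls)
```

The cached accession is never sent to the stub, and an identifier with no mapping simply comes back as `None`, which mirrors how callers filter on non-null `string_identifier` values.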

Parameters:

Name	Type	Description	Default
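`identifiers`	`list of str`	List of UniProt accession IDs to map.	required
`overwrite`	`bool`	If True, ignores cached STRING IDs and looks up all identifiers present in `.prot.var` again.	`False`
`cache_col`	`str`	Column in `.prot.var` used to cache STRING IDs.	`'STRING'`
`batch_size`	`int`	Number of accessions to include in each API query (default is 100).	`100`
`debug`	`bool`	If True, prints progress and response info.	`False`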

Returns:

Type	Description
`pd.DataFrame`	Mapping table with columns: `input_identifier`, `string_identifier`, and `ncbi_taxon_id`.

Note

This is a helper method used primarily by enrichment_functional() and enrichment_ppi().

Source code in src/scpviz/pAnnData/enrichment.py

def get_string_mappings(self, identifiers, overwrite=False, cache_col="STRING", batch_size=100, debug=False):
    """
    Resolve STRING IDs for UniProt accessions with a 2-step strategy:
    1) Use UniProt stream (fields: xref_string) to fill cache quickly.
    2) For any still-missing rows, query STRING get_string_ids, batched by organism_id.

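    This method retrieves the corresponding STRING identifiers for a list of UniProt accessions
    and caches them in `self.prot.var[cache_col]` (default column `"STRING"`) for downstream use.

    Args:
        identifiers (list of str): List of UniProt accession IDs to map.
        overwrite (bool): If True, ignores cached STRING IDs and looks up all identifiers present in `.prot.var` again.
        cache_col (str): Column in `.prot.var` used to cache STRING IDs (default is `"STRING"`).
        batch_size (int): Number of accessions to include in each API query (default is 100).
        debug (bool): If True, prints progress and response info.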

    Returns:
        pd.DataFrame: Mapping table with columns: `input_identifier`, `string_identifier`, and `ncbi_taxon_id`.

    Note:
        This is a helper method used primarily by `enrichment_functional()` and `enrichment_ppi()`.
    """

    identifiers = [str(x).strip() for x in identifiers if x is not None and str(x).strip()]
    if debug:
        print(f"{format_log_prefix('info')} Resolving STRING IDs for {len(identifiers)} identifiers...")

    prot_var = self.prot.var
    if cache_col not in prot_var.columns:
        prot_var[cache_col] = pd.NA
    if "ncbi_taxon_id" not in prot_var.columns:
        prot_var["ncbi_taxon_id"] = pd.NA

    # Use cached STRING IDs if available
    valid_ids = [i for i in identifiers if i in prot_var.index]
    existing = prot_var.loc[valid_ids, cache_col]
    found_ids = {i: sid for i, sid in existing.items() if pd.notna(sid) and str(sid).strip()}
    missing = [i for i in identifiers if i not in found_ids]

    if overwrite:
        print(f"{format_log_prefix('info_only',2)} Overwriting cached STRING IDs.")
        missing = valid_ids
        found_ids = {}

    print(f"{format_log_prefix('info_only',2)} Found {len(found_ids)} cached STRING IDs. {len(missing)} need lookup.")
    print(missing) if debug else None

    # 1. UniProt stream (fast): use the UniProt xref_string field to fill the cache quickly
    # 2. STRING API for still-missing ones

    if missing:
        map_df = utils.get_string_mappings(
            missing,
            use_uniprot=True,
            use_string=True,
            caller_identity="scpviz",
            batch_size=batch_size,
            debug=debug,
        )
    else:
        map_df = pd.DataFrame(columns=["input_identifier", "string_identifier", "ncbi_taxon_id"])

    # Combine all new mappings
    if not map_df.empty:
        updated = 0
        updated_tax = 0

        for _, row in map_df.iterrows():
            acc = row["input_identifier"]
            sid = row["string_identifier"]
            tax = row.get("ncbi_taxon_id", pd.NA)

            if acc is None or acc not in prot_var.index:
                continue

            if pd.notna(sid) and str(sid).strip():
                self.prot.var.at[acc, cache_col] = sid
                found_ids[acc] = sid
                updated += 1
            else:
                print(f"[DEBUG] Skipping unknown accession '{acc}'") if debug else None

            tax = utils.scalarize_taxon(tax)
            if tax is not pd.NA and pd.notna(tax):
                prot_var.at[acc, "ncbi_taxon_id"] = str(tax)
                updated_tax += 1

        print(f"{format_log_prefix('info_only',3)} Cached {updated} STRING ID mappings.")
        if debug:
            print(f"{format_log_prefix('info_only',3)} Cached {updated_tax} ncbi_taxon_id values.")

    elif missing:
        print(f"{format_log_prefix('warn_only',3)} No STRING mappings returned from STRING API.")


    # ------------------------------------
    # Build and MERGE UniProt results into out_df
    # ------------------------------------
    out_df = pd.DataFrame({"input_identifier": identifiers})

    # Prefer found_ids (fresh + cached), else fall back to cache for known indices
    out_df["string_identifier"] = out_df["input_identifier"].map(
        lambda acc: found_ids.get(
            acc,
            prot_var.at[acc, cache_col] if acc in prot_var.index else pd.NA,
        )
    )

    out_df["ncbi_taxon_id"] = out_df["input_identifier"].map(
        lambda acc: prot_var.at[acc, "ncbi_taxon_id"] if acc in prot_var.index else pd.NA
    )
    out_df["ncbi_taxon_id"] = out_df["ncbi_taxon_id"].apply(utils.scalarize_taxon)

    return out_df

get_string_network_link

get_string_network_link(key=None, string_ids=None, species=None, show_labels=True)

Generate a direct STRING network URL to visualize protein interactions online.

This method constructs a STRING website link to view a network of proteins, using either a list of STRING IDs or a key from previously stored enrichment results.
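The URL construction can be reproduced in a few lines. The sketch below is a standalone function, not the method itself (the name `build_string_network_url` and the placeholder STRING IDs are assumptions). It joins the identifiers with `%0d` and appends the same query parameters the method uses.

```python
def build_string_network_url(string_ids, species=None, show_labels=True):
    """Assemble a STRING network URL from a list of STRING identifiers."""
    if not string_ids:
        raise ValueError("No STRING IDs provided.")
    # Identifiers are separated by an encoded carriage return (%0d).
    params = [f"identifiers={'%0d'.join(string_ids)}", "caller_identity=scpviz"]
    if species:
        params.append(f"species={species}")
    if show_labels:
        params.append("show_query_node_labels=1")
    return "https://string-db.org/cgi/network?" + "&".join(params)


# Placeholder identifiers for illustration only.
url = build_string_network_url(["9606.A", "9606.B"], species=9606)
print(url)
```

The `species` parameter is only appended when a value is available, so a link built from raw STRING IDs without a species still resolves to a valid URL string.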

Parameters:

Name	Type	Description	Default
`key`	`str`	Key from `.stats["functional"]` to extract STRING IDs and species info.	`None`
`string_ids`	`list of str`	List of STRING identifiers to include in the network.	`None`
`species`	`int or str`	NCBI taxonomy ID (e.g., 9606 for human). Required if not using a stored key.	`None`
`show_labels`	`bool`	If True (default), node labels will be shown in the network view.	`True`

Returns:

Type	Description
`str`	URL to open the network in the STRING web interface.

Example

Get a STRING network link for a stored enrichment result:

url = pdata.get_string_network_link(key="UserSearch1")
print(url)

Source code in src/scpviz/pAnnData/enrichment.py

def get_string_network_link(self, key=None, string_ids=None, species=None, show_labels=True):
    """
    Generate a direct STRING network URL to visualize protein interactions online.

    This method constructs a STRING website link to view a network of proteins,
    using either a list of STRING IDs or a key from previously stored enrichment results.

    Args:
        key (str, optional): Key from `.stats["functional"]` to extract STRING IDs and species info.
        string_ids (list of str, optional): List of STRING identifiers to include in the network.
        species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). Required if not using a stored key.
        show_labels (bool): If True (default), node labels will be shown in the network view.

    Returns:
        str: URL to open the network in the STRING web interface.

    Example:
        Get a STRING network link for a stored enrichment result:
            ```python
            url = pdata.get_string_network_link(key="UserSearch1")
            print(url)
            ```
    """
    if string_ids is None:
        if key is None:
            raise ValueError("Must provide either a list of STRING IDs or a key.")
        metadata = self.stats.get("functional", {}).get(key)
        if metadata is None:
            raise ValueError(f"Key '{key}' not found in self.stats['functional'].")
        string_ids = metadata.get("string_ids")
        species = species or metadata.get("species")

    if not string_ids:
        raise ValueError("No STRING IDs found or provided.")

    base_url = "https://string-db.org/cgi/network"
    params = [
        f"identifiers={'%0d'.join(string_ids)}",
        "caller_identity=scpviz",
    ]
    if species:
        params.append(f"species={species}")
    if show_labels:
        params.append("show_query_node_labels=1")

    return f"{base_url}?{'&'.join(params)}"

list_enrichments

list_enrichments()

List available STRING enrichment results and unprocessed DE contrasts.

This method prints available functional and PPI enrichment entries stored in .stats["functional"] and .stats["ppi"], as well as DE comparisons stored in .stats (keys containing "vs") that have not yet been enriched.

Returns:

Type	Description
	None

Example

List enrichment results stored after running functional or PPI enrichment:

pdata.list_enrichments()
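
How the listing separates pending DE comparisons from completed enrichments can be sketched on a plain dictionary. This is a simplified illustration, assuming a `.stats`-like layout with hypothetical keys; `summarize_enrichment_status` is not an scpviz function, and the key prettifying done by the real method is omitted.

```python
def summarize_enrichment_status(stats):
    """Hypothetical helper: classify DE keys and functional enrichment entries."""
    functional = stats.get("functional", {})
    # DE comparisons are recognised by "vs" in the key, excluding _up/_down entries.
    de_keys = {k for k in stats if "vs" in k and not k.endswith(("_up", "_down"))}

    enriched_de = set()
    results = []
    for key, meta in functional.items():
        input_key = meta.get("input_key")
        if input_key in de_keys:
            enriched_de.add(input_key)
            results.append((key, "DE-based"))
        else:
            results.append((key, "User"))

    return sorted(de_keys - enriched_de), results


# Mock structure with made-up keys; real DE results hold tables, not empty dicts.
mock_stats = {
    "A_vs_B": {},
    "C_vs_D": {},
    "functional": {
        "A_vs_B_up": {"input_key": "A_vs_B"},
        "UserSearch1": {"input_key": None},
    },
}

pending, done = summarize_enrichment_status(mock_stats)
print(pending)  # ['C_vs_D']
print(done)     # [('A_vs_B_up', 'DE-based'), ('UserSearch1', 'User')]
```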

Source code in src/scpviz/pAnnData/enrichment.py

def list_enrichments(self):
    """
    List available STRING enrichment results and unprocessed DE contrasts.

    This method prints available functional and PPI enrichment entries stored in
    `.stats["functional"]` and `.stats["ppi"]`, as well as DE comparisons stored in
    `.stats` (keys containing `"vs"`) that have not yet been enriched.

    Returns:
        None

    Example:
        List enrichment results stored after running functional or PPI enrichment:
            ```python
            pdata.list_enrichments()
            ```
    """

    functional = self.stats.get("functional", {})
    ppi_keys = self.stats.get("ppi", {}).keys()
    de_keys = {k for k in self.stats if "vs" in k and not k.endswith(("_up", "_down"))}

    # Collect enriched DE keys based on input_key metadata
    enriched_de = set()
    enriched_results = []

    for k, meta in functional.items():
        input_key = meta.get("input_key")

        if input_key and input_key in de_keys:
            base = input_key
            suffix = k.rsplit("_", 1)[-1]
            pretty = f"{_pretty_vs_key(base)}_{suffix}"
            enriched_de.add(base)
            enriched_results.append((pretty, k, "DE-based"))
        else:
            enriched_results.append((k, k, "User"))

    de_unenriched = sorted(_pretty_vs_key(k) for k in (de_keys - enriched_de))

    print(f"{format_log_prefix('user')} Listing STRING enrichment status\n")

    print(f"{format_log_prefix('info_only',2)} Available DE comparisons (not yet enriched):")
    if de_unenriched:
        for pk in de_unenriched:
            print(f"        - {pk}")
    else:
        print("  (none)\n")

    print("\n  🔹 To run enrichment:")
    print("      pdata.enrichment_functional(from_de=True, de_key=\"...\")")

    print(f"\n{format_log_prefix('result_only')} Completed STRING enrichment results:")
    if not enriched_results:
        print("    (none)")
    for pretty, raw_key, kind in enriched_results:
        if kind == "DE-based":
            base, suffix = pretty.rsplit("_", 1)
            print(f"  - {pretty} ({kind})")
            print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
            print(f"    • Plot : pdata.plot_enrichment_svg(\"{base}\", direction=\"{suffix}\")")
            url = self.stats["functional"].get(raw_key, {}).get("string_url")
            if url:
                print(f"    • Link  : {url}")
        else:
            print(f"  - {pretty} ({kind})")
            print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
            print(f"    • Plot : pdata.plot_enrichment_svg(\"{pretty}\")")
            url = self.stats["functional"].get(raw_key, {}).get("string_url")
            if url:
                print(f"    • Link  : {url}")

    if ppi_keys:
        print(f"\n{format_log_prefix('result_only')} Completed STRING PPI results:")
        for key in sorted(ppi_keys):
            print(f"  - {key} (User)")
            print(f"    • Table: pdata.stats['ppi']['{key}']['result']")
    else:
        print(f"\n{format_log_prefix('result_only')} Completed STRING PPI results:")
        print("    (none)")

plot_enrichment_svg

plot_enrichment_svg(key, direction=None, category=None, save_as=None)

Display STRING enrichment SVG inline in a Jupyter notebook.

This method fetches and renders a STRING-generated SVG for a previously completed functional enrichment result. Optionally, the SVG can also be saved to disk.

Parameters:

Name	Type	Description	Default
`key`	`str`	Enrichment result key. For user-supplied gene lists, this is the key in `.stats["functional"]` (e.g., `"UserSearch1"`). For DE-based comparisons, pass the contrast name without the direction suffix (e.g., `"GroupA_Treatment1_vs_Control"`) and set `direction`.	required
`direction`	`str`	Direction of DE result, either `"up"` or `"down"`. Required when `key` contains `"vs"`; ignored with a warning for user-defined gene lists.	`None`
`category`	`str`	STRING enrichment category to filter by (e.g., `"Process"`, `"KEGG"`). See the note below for supported categories.	`None`
`save_as`	`str`	If provided, saves the retrieved SVG to the given file path.	`None`

Returns:

Type	Description
	None

Example

Display a STRING enrichment network for a user-supplied gene list:

pdata.plot_enrichment_svg("UserSearch1")

Supported STRING Enrichment Categories

The following category IDs are supported for functional enrichment.
More details are available on the STRING API documentation site.

Category ID	Description
Process	Biological Process (Gene Ontology)
Function	Molecular Function (Gene Ontology)
Component	Cellular Component (Gene Ontology)
Keyword	Annotated Keywords (UniProt)
KEGG	KEGG Pathways
RCTM	Reactome Pathways
HPO	Human Phenotype (Monarch)
MPO	Mammalian Phenotype Ontology (Monarch)
DPO	Drosophila Phenotype (Monarch)
WPO	C. elegans Phenotype Ontology (Monarch)
ZPO	Zebrafish Phenotype Ontology (Monarch)
FYPO	Fission Yeast Phenotype Ontology (Monarch)
Pfam	Protein Domains (Pfam)
SMART	Protein Domains (SMART)
InterPro	Protein Domains and Features (InterPro)
PMID	Reference Publications (PubMed)
NetworkNeighborAL	Local Network Cluster (STRING)
COMPARTMENTS	Subcellular Localization (COMPARTMENTS)
TISSUES	Tissue Expression (TISSUES)
DISEASES	Disease–gene Associations (DISEASES)
WikiPathways	WikiPathways
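
The source shown below forwards `category` to STRING without checking it, so a typo would only surface in the response. A small pre-check against the table above might look like the following sketch; `check_category` is a hypothetical helper, and the exact, case-sensitive match is an assumption of the sketch rather than documented STRING behavior.

```python
# Category IDs copied from the table above.
SUPPORTED_CATEGORIES = {
    "Process", "Function", "Component", "Keyword", "KEGG", "RCTM",
    "HPO", "MPO", "DPO", "WPO", "ZPO", "FYPO", "Pfam", "SMART",
    "InterPro", "PMID", "NetworkNeighborAL", "COMPARTMENTS",
    "TISSUES", "DISEASES", "WikiPathways",
}


def check_category(category):
    """Hypothetical pre-check: return the category unchanged or raise on unknown IDs."""
    if category is not None and category not in SUPPORTED_CATEGORIES:
        raise ValueError(f"Unsupported STRING category: {category!r}")
    return category


print(check_category("KEGG"))
```

The checked value could then be passed on, for example `pdata.plot_enrichment_svg("UserSearch1", category=check_category("KEGG"))`.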

Note

The key (combined with direction for DE-based results) must correspond to an existing entry in .stats["functional"], created via enrichment_functional().
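
For DE-based results, the stored entry is located by joining the contrast name and the direction. The sketch below illustrates only that step; `build_lookup_key` is a hypothetical helper, and the real method additionally passes the combined key through an internal resolver that is not shown here.

```python
def build_lookup_key(key, direction=None):
    """Hypothetical helper: combine a contrast name and direction into a lookup key."""
    if "vs" in key:
        # DE-based keys need an explicit direction.
        if direction not in {"up", "down"}:
            raise ValueError(
                "You must specify direction='up' or 'down' for DE-based enrichment keys."
            )
        return f"{key}_{direction}"
    # User-supplied keys are used as-is; any direction is ignored.
    return key


print(build_lookup_key("GroupA_Treatment1_vs_Control", direction="up"))
print(build_lookup_key("UserSearch1"))
```

The corresponding call for a DE-based result would be `pdata.plot_enrichment_svg("GroupA_Treatment1_vs_Control", direction="up")`.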

Source code in src/scpviz/pAnnData/enrichment.py

def plot_enrichment_svg(self, key, direction=None, category=None, save_as=None):
    """
    Display STRING enrichment SVG inline in a Jupyter notebook.

    This method fetches and renders a STRING-generated SVG for a previously completed
    functional enrichment result. Optionally, the SVG can also be saved to disk.

    Args:
        key (str): Enrichment result key. For user-supplied gene lists, this is the key in
            `.stats["functional"]` (e.g., `"UserSearch1"`). For DE-based comparisons, pass the contrast
            name without the direction suffix (e.g., `"GroupA_Treatment1_vs_Control"`) and set `direction`.
        direction (str, optional): Direction of DE result, either `"up"` or `"down"`. Required when
            `key` contains `"vs"`; ignored with a warning for user-defined gene lists.
        category (str, optional): STRING enrichment category to filter by (e.g., `"Process"`, `"KEGG"`).
            See the note below for supported categories.
        save_as (str, optional): If provided, saves the retrieved SVG to the given file path.

    Returns:
        None

    Example:
        Display a STRING enrichment network for a user-supplied gene list:
            ```python
            pdata.plot_enrichment_svg("UserSearch1")
            ```

    !!! note "Supported STRING Enrichment Categories"
        The following category IDs are supported for functional enrichment.  
        More details are available on the [STRING API documentation site](https://string-db.org/cgi/help?subpage=api).

        | Category ID          | Description                                      |
        |----------------------|--------------------------------------------------|
        | **Process**          | Biological Process (Gene Ontology)               |
        | **Function**         | Molecular Function (Gene Ontology)               |
        | **Component**        | Cellular Component (Gene Ontology)               |
        | **Keyword**          | Annotated Keywords (UniProt)                     |
        | **KEGG**             | KEGG Pathways                                    |
        | **RCTM**             | Reactome Pathways                                |
        | **HPO**              | Human Phenotype (Monarch)                        |
        | **MPO**              | Mammalian Phenotype Ontology (Monarch)           |
        | **DPO**              | Drosophila Phenotype (Monarch)                   |
        | **WPO**              | *C. elegans* Phenotype Ontology (Monarch)        |
        | **ZPO**              | Zebrafish Phenotype Ontology (Monarch)           |
        | **FYPO**             | Fission Yeast Phenotype Ontology (Monarch)       |
        | **Pfam**             | Protein Domains (Pfam)                           |
        | **SMART**            | Protein Domains (SMART)                          |
        | **InterPro**         | Protein Domains and Features (InterPro)          |
        | **PMID**             | Reference Publications (PubMed)                  |
        | **NetworkNeighborAL**| Local Network Cluster (STRING)                   |
        | **COMPARTMENTS**     | Subcellular Localization (COMPARTMENTS)          |
        | **TISSUES**          | Tissue Expression (TISSUES)                      |
        | **DISEASES**         | Disease–gene Associations (DISEASES)             |
        | **WikiPathways**     | WikiPathways                                     |

    Note:
        The `key` (combined with `direction` for DE-based results) must correspond to an existing
        entry in `.stats["functional"]`, created via `enrichment_functional()`.
    """
    from xml.parsers.expat import ExpatError

    if "functional" not in self.stats:
        raise ValueError("No STRING enrichment results found in .stats['functional'].")

    all_keys = list(self.stats["functional"].keys())

    # Handle DE-type key
    if "vs" in key:
        if direction not in {"up", "down"}:
            raise ValueError("You must specify direction='up' or 'down' for DE-based enrichment keys.")
        lookup_key = _resolve_de_key(self.stats["functional"], f"{key}_{direction}")
    else:
        # Handle user-supplied key (e.g. "UserSearch1")
        if direction is not None:
            print(f"[WARNING] Ignoring direction='{direction}' for user-supplied key: '{key}'")
        lookup_key = key

    if lookup_key not in self.stats["functional"]:
        available = "\n".join(f"  - {k}" for k in self.stats["functional"].keys())
        raise ValueError(f"Could not find enrichment results for '{lookup_key}'. Available keys:\n{available}")

    meta = self.stats["functional"][lookup_key]
    string_ids = meta["string_ids"]
    species_id = meta["species"]

    url = "https://string-db.org/api/svg/enrichmentfigure"
    params = {
        "identifiers": "%0d".join(string_ids),
        "species": species_id
    }
    if category:
        params["category"] = category

    print(f"{format_log_prefix('user')} Fetching STRING SVG for key '{lookup_key}' (n={len(string_ids)})...")
    response = requests.get(url, params=params)
    response.raise_for_status()

    if save_as:
        with open(save_as, "wb") as f:
            f.write(response.content)
        print(f"{format_log_prefix('info_only')} Saved SVG to: {save_as}")

    with tempfile.NamedTemporaryFile("wb", suffix=".svg", delete=False) as tmp:
        tmp.write(response.content)
        tmp_path = tmp.name

    try:
        try:
            display(SVG(filename=tmp_path))
        except ExpatError:
            print(
                f"{format_log_prefix('info_only')} No enrichment figure available "
                f"for key '{lookup_key}'"
                + (f" (category='{category}')." if category else ".")
            )
    finally:
        os.remove(tmp_path)

resolve_to_accessions

resolve_to_accessions(mixed_list)

Convert gene names or accessions into standardized UniProt accession IDs.

This method resolves input items using the internal gene-to-accession map, ensuring all returned entries are accessions present in the .prot object.

Parameters:

Name	Type	Description	Default
`mixed_list`	`list of str`	A list containing gene names and/or UniProt accessions.	required

Returns:

Type	Description
`tuple`	`(accs, unresolved_accs)`: a list of resolved UniProt accession IDs, followed by a list of input items that could not be resolved.

Note

This function is similar to utils.resolve_accessions() but operates in the context of the current pAnnData object and its internal gene mappings.

Todo

Add example comparing results from resolve_to_accessions() and utils.resolve_accessions().
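
The resolution logic can be illustrated without a pAnnData object. The sketch below is a minimal standalone version, assuming a set of known accessions and a gene-to-accession dictionary in place of `.prot.var.index` and the internal gene map; `resolve_mixed` is a hypothetical name.

```python
def resolve_mixed(mixed_list, known_accessions, gene_to_acc):
    """Hypothetical helper: split inputs into resolved accessions and unresolved items."""
    resolved, unresolved = [], []
    for item in mixed_list:
        if item in known_accessions:
            resolved.append(item)               # already an accession
        elif item in gene_to_acc:
            resolved.append(gene_to_acc[item])  # gene name mapped to its accession
        else:
            unresolved.append(item)
    return resolved, unresolved


known = {"P04637", "Q00987"}
gene_map = {"TP53": "P04637", "MDM2": "Q00987"}

resolved, unresolved = resolve_mixed(["TP53", "Q00987", "NOT_A_GENE"], known, gene_map)
print(resolved)    # ['P04637', 'Q00987']
print(unresolved)  # ['NOT_A_GENE']
```

On a pAnnData object, the equivalent call unpacks two values: `accs, unresolved = pdata.resolve_to_accessions(["TP53", "Q00987"])`.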

Source code in src/scpviz/pAnnData/enrichment.py

def resolve_to_accessions(self, mixed_list):
    """
    Convert gene names or accessions into standardized UniProt accession IDs.

    This method resolves input items using the internal gene-to-accession map,
    ensuring all returned entries are accessions present in the `.prot` object.

    Args:
        mixed_list (list of str): A list containing gene names and/or UniProt accessions.

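    Returns:
        tuple: `(accs, unresolved_accs)`, where `accs` is a list of resolved UniProt accession IDs
            and `unresolved_accs` is a list of input items that could not be resolved.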

    Note:
        This function is similar to `utils.resolve_accessions()` but operates in the context 
        of the current `pAnnData` object and its internal gene mappings.

    Todo:
        Add example comparing results from `resolve_to_accessions()` and `utils.resolve_accessions()`.
    """
    gene_to_acc, _ = self.get_gene_maps(on='protein') 
    accs = []
    unresolved_accs = []
    for item in mixed_list:
        if item in self.prot.var.index:
            accs.append(item)  # already an accession
        elif item in gene_to_acc:
            accs.append(gene_to_acc[item])
        else:
            unresolved_accs.append(item)
            # print(f"{format_log_prefix('warn_only',2)} Could not resolve '{item}' to an accession — skipping.")
    return accs, unresolved_accs