Enrichment

Mixins for enrichment analysis using STRING/UniProt.


Provides methods for STRING-based functional and protein–protein interaction (PPI) enrichment.

This mixin includes utilities for:

  • Running functional enrichment on differentially expressed or user-supplied gene lists.
  • Performing STRING PPI enrichment to identify interaction networks.
  • Generating STRING network visualization links and embedded SVGs.
  • Listing and accessing enrichment results stored in .stats.
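
As a rough illustration of what the functional-enrichment step sends to STRING, the sketch below assembles the request payload the same way the source listing on this page does: identifiers joined with `%0d`, plus `species` and `caller_identity`. The helper name `build_enrichment_payload` is hypothetical and not part of scpviz, the STRING-style IDs are made up, and no request is sent.

```python
def build_enrichment_payload(string_ids, species_id, background_ids=None):
    """Hypothetical helper: assemble form data for STRING's enrichment endpoint."""
    payload = {
        "identifiers": "%0d".join(string_ids),  # STRING expects %0d-separated IDs
        "species": species_id,                  # NCBI taxonomy ID, e.g. 9606 for human
        "caller_identity": "scpviz",
    }
    # A custom background is optional and only added when one is supplied.
    if background_ids:
        payload["background_string_identifiers"] = "%0d".join(background_ids)
    return payload


# Illustrative (made-up) STRING-style identifiers.
ids = ["9606.ENSP00000000001", "9606.ENSP00000000002"]
payload = build_enrichment_payload(ids, 9606)
print(payload["identifiers"])
```

In the listing, a payload of this shape is posted to `https://string-db.org/api/json/enrichment` and the JSON response is converted to a DataFrame.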

Methods:

| Name | Description |
|------|-------------|
| enrichment_functional | Runs STRING functional enrichment on DE results or a custom gene list. |
| enrichment_ppi | Runs STRING PPI enrichment on a user-supplied gene or accession list. |
| list_enrichments | Lists available enrichment results and DE comparisons. |
| plot_enrichment_svg | Displays a STRING enrichment SVG inline or saves it to file. |
| get_string_mappings | Maps UniProt accessions to STRING IDs using the STRING API. |
| resolve_to_accessions | Resolves gene names or mixed inputs to accessions using internal mappings. |
| get_string_network_link | Generates a direct STRING network URL for visualization. |
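
The two mapping helpers work as a pair: inputs are first resolved to accessions, then accessions are mapped to STRING IDs, reusing cached values where they exist. A minimal sketch of that flow with plain Python and pandas is shown here. The functions `resolve` and `split_cached`, the toy `var` table, and all gene, accession, and STRING values are illustrative stand-ins, not the scpviz API.

```python
import pandas as pd

# Toy stand-in for `.prot.var`: index = accessions, one cached STRING column.
var = pd.DataFrame(
    {
        "Genes": ["GENE1", "GENE2", "GENE3"],
        "STRING": ["9606.ENSP00000000001", pd.NA, pd.NA],
    },
    index=["ACC1", "ACC2", "ACC3"],
)
gene_to_acc = dict(zip(var["Genes"], var.index))


def resolve(mixed):
    """Split mixed inputs into resolved accessions and unresolved items."""
    resolved, unresolved = [], []
    for item in mixed:
        if item in var.index:          # already an accession
            resolved.append(item)
        elif item in gene_to_acc:      # gene name with a known accession
            resolved.append(gene_to_acc[item])
        else:
            unresolved.append(item)
    return resolved, unresolved


def split_cached(accs, cache_col="STRING"):
    """Return (cached mapping, accessions that still need a lookup)."""
    cached = {a: var.at[a, cache_col] for a in accs if pd.notna(var.at[a, cache_col])}
    missing = [a for a in accs if a not in cached]
    return cached, missing


accs, unresolved = resolve(["ACC1", "GENE2", "NOT_A_GENE"])
cached, missing = split_cached(accs)
print(accs, unresolved, cached, missing)
```

Only the `missing` accessions would need a remote lookup. Unresolved inputs are reported rather than silently dropped.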

Source code in src/scpviz/pAnnData/enrichment.py
class EnrichmentMixin:
    """
    Provides methods for STRING-based functional and protein–protein interaction (PPI) enrichment.

    This mixin includes utilities for:

    - Running functional enrichment on differentially expressed or user-supplied gene lists.
    - Performing STRING PPI enrichment to identify interaction networks.
    - Generating STRING network visualization links and embedded SVGs.
    - Listing and accessing enrichment results stored in `.stats`.

    Functions:
        enrichment_functional: Runs STRING functional enrichment on DE results or a custom gene list.
        enrichment_ppi: Runs STRING PPI enrichment on a user-supplied gene or accession list.
        list_enrichments: Lists available enrichment results and DE comparisons.
        plot_enrichment_svg: Displays a STRING enrichment SVG inline or saves it to file.
        get_string_mappings: Maps UniProt accessions to STRING IDs using the STRING API.
        resolve_to_accessions: Resolves gene names or mixed inputs to accessions using internal mappings.
        get_string_network_link: Generates a direct STRING network URL for visualization.
    """

    def get_string_mappings(self, identifiers, overwrite=False, cache_col="STRING", batch_size=100, debug=False):
        """
        Resolve STRING IDs for UniProt accessions with a 2-step strategy:
        1) Use UniProt stream (fields: xref_string) to fill cache quickly.
        2) For any still-missing rows, query STRING get_string_ids, batched by organism_id.

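        This method retrieves corresponding STRING identifiers for a list of UniProt accessions
        and caches the result in `self.prot.var[cache_col]` (default `"STRING"`) for downstream use.

        Args:
            identifiers (list of str): List of UniProt accession IDs to map.
            overwrite (bool): If True, ignores cached STRING IDs and looks them up again (default is False).
            cache_col (str): Column in `self.prot.var` used to cache STRING IDs (default is `"STRING"`).
            batch_size (int): Number of accessions to include in each API query (default is 100).
            debug (bool): If True, prints progress and response info.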

        Returns:
            pd.DataFrame: Mapping table with columns: `input_identifier`, `string_identifier`, and `ncbi_taxon_id`.

        Note:
            This is a helper method used primarily by `enrichment_functional()` and `enrichment_ppi()`.
        """

        identifiers = [str(x).strip() for x in identifiers if x is not None and str(x).strip()]
        if debug:
            print(f"{format_log_prefix('info')} Resolving STRING IDs for {len(identifiers)} identifiers...")

        prot_var = self.prot.var
        if cache_col not in prot_var.columns:
            prot_var[cache_col] = pd.NA
        if "ncbi_taxon_id" not in prot_var.columns:
            prot_var["ncbi_taxon_id"] = pd.NA

        # Use cached STRING IDs if available
        valid_ids = [i for i in identifiers if i in prot_var.index]
        existing = prot_var.loc[valid_ids, cache_col]
        found_ids = {i: sid for i, sid in existing.items() if pd.notna(sid) and str(sid).strip()}
        missing = [i for i in identifiers if i not in found_ids]

        if overwrite:
            print(f"{format_log_prefix('info_only',2)} Overwriting cached STRING IDs.")
            missing = valid_ids
            found_ids = {}

        print(f"{format_log_prefix('info_only',2)} Found {len(found_ids)} cached STRING IDs. {len(missing)} need lookup.")
        print(missing) if debug else None

        # 1. UniProt stream (fast): use the xref_string field to fill the cache quickly
        # 2. STRING API for still-missing ones

        if missing:
            map_df = utils.get_string_mappings(
                missing,
                use_uniprot=True,
                use_string=True,
                caller_identity="scpviz",
                batch_size=batch_size,
                debug=debug,
            )
        else:
            map_df = pd.DataFrame(columns=["input_identifier", "string_identifier", "ncbi_taxon_id"])

        # Combine all new mappings
        if not map_df.empty:
            updated = 0
            updated_tax = 0

            for _, row in map_df.iterrows():
                acc = row["input_identifier"]
                sid = row["string_identifier"]
                tax = row.get("ncbi_taxon_id", pd.NA)

                if acc is None or acc not in prot_var.index:
                    continue

                if pd.notna(sid) and str(sid).strip():
                    self.prot.var.at[acc, cache_col] = sid
                    found_ids[acc] = sid
                    updated += 1
                else:
                    print(f"[DEBUG] No STRING ID returned for accession '{acc}'") if debug else None

                tax = utils.scalarize_taxon(tax)
                if tax is not pd.NA and pd.notna(tax):
                    prot_var.at[acc, "ncbi_taxon_id"] = str(tax)
                    updated_tax += 1

            print(f"{format_log_prefix('info_only',3)} Cached {updated} STRING ID mappings.")
            if debug:
                print(f"{format_log_prefix('info_only',3)} Cached {updated_tax} ncbi_taxon_id values.")

        elif missing:
            print(f"{format_log_prefix('warn_only',3)} No STRING mappings returned from STRING API.")


        # ------------------------------------
        # Build and MERGE UniProt results into out_df
        # ------------------------------------
        out_df = pd.DataFrame({"input_identifier": identifiers})

        # Prefer found_ids (fresh + cached), else fall back to cache for known indices
        out_df["string_identifier"] = out_df["input_identifier"].map(
            lambda acc: found_ids.get(
                acc,
                prot_var.at[acc, cache_col] if acc in prot_var.index else pd.NA,
            )
        )

        out_df["ncbi_taxon_id"] = out_df["input_identifier"].map(
            lambda acc: prot_var.at[acc, "ncbi_taxon_id"] if acc in prot_var.index else pd.NA
        )
        out_df["ncbi_taxon_id"] = out_df["ncbi_taxon_id"].apply(utils.scalarize_taxon)

        return out_df


    def resolve_to_accessions(self, mixed_list):
        """
        Convert gene names or accessions into standardized UniProt accession IDs.

        This method resolves input items using the internal gene-to-accession map,
        ensuring all returned entries are accessions present in the `.prot` object.

        Args:
            mixed_list (list of str): A list containing gene names and/or UniProt accessions.

        Returns:
            tuple of (list of str, list of str): Resolved UniProt accession IDs, followed by the
                input items that could not be resolved.

        Note:
            This function is similar to `utils.resolve_accessions()` but operates in the context 
            of the current `pAnnData` object and its internal gene mappings.

        Todo:
            Add example comparing results from `resolve_to_accessions()` and `utils.resolve_accessions()`.
        """
        gene_to_acc, _ = self.get_gene_maps(on='protein') 
        accs = []
        unresolved_accs = []
        for item in mixed_list:
            if item in self.prot.var.index:
                accs.append(item)  # already an accession
            elif item in gene_to_acc:
                accs.append(gene_to_acc[item])
            else:
                unresolved_accs.append(item)
                # print(f"{format_log_prefix('warn_only',2)} Could not resolve '{item}' to an accession — skipping.")
        return accs, unresolved_accs

    def enrichment_functional(
        self,
        genes=None,
        from_de=True,
        top_n=150,
        score_col="significance_score",
        gene_col="Genes",
        de_key="de_results",
        store_key=None,
        species=None,
        background=None,
        debug=False,
        **kwargs
    ):
        """
        Run functional enrichment analysis using STRING on a gene list.

        This method performs ranked or unranked enrichment analysis using STRING's API.
        It supports both differential expression-based analysis (up- and down-regulated genes)
        and custom gene lists provided by the user. Enrichment results are stored in
        `.stats["functional"]` for later access and plotting.

        Args:
            genes (list of str, optional): List of gene symbols or accessions to analyze. If provided, this list is used and DE results are not consulted.
            from_de (bool): If True (default) and `genes` is None, selects genes from stored differential expression results.
            top_n (int): Number of top-ranked genes to use when `from_de=True` (default is 150).
            score_col (str): Column name in the DE table to rank genes by (default is `"significance_score"`).
            gene_col (str): Column name in `.prot.var` or DE results that contains gene names.
            de_key (str): Key of the stored DE result in `.stats` (default is `"de_results"`).
            store_key (str, optional): Custom key to store enrichment results. Ignored when `from_de=True`.
            species (str, optional): Organism name or NCBI taxonomy ID. If None, inferred from STRING response.
            background (str, optional): Background set to use for enrichment.

                - If `"all_quantified"`, uses non-significant proteins from the DE table (DE-based mode) or all other quantified proteins (user-supplied mode).
                - If None (default), no custom background is sent to STRING.
            debug (bool): If True, prints API request info and diagnostic messages.
            **kwargs: Additional keyword arguments passed to the STRING enrichment API.

        Returns:
            dict or pd.DataFrame:

                - If `from_de=True`, returns a dictionary of enrichment DataFrames for "up" and "down" gene sets.
                - If `genes` is provided, returns a single enrichment DataFrame.

        Example:
            Run differential expression, then perform STRING enrichment on top-ranked genes:
                ```python
                case1 = {'cellline': 'AS', 'treatment': 'sc'} # legacy style: class_type = ["group", "condition"]
                case2 = {'cellline': 'BE', 'treatment': 'sc'} # legacy style: values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]
                case_values = [case1, case2]
                pdata.de(values=case_values) # or legacy style: pdata.de(classes=class_type, values=values)
                pdata.list_enrichments()  # list available DE result keys
                pdata.enrichment_functional(from_de=True, de_key="GroupA_Treatment1 vs GroupA_Control")  # use a key listed above
                ```

            Perform enrichment on a custom list of genes:
                ```python
                genelist = ["P55072", "NPLOC4", "UFD1", "STX5A", "NSFL1C", "UBXN2A",
                            "UBXN4", "UBE4B", "YOD1", "WASHC5", "PLAA", "UBXN10"]
                pdata.enrichment_functional(genes=genelist, from_de=False)
                ```

        Note:
            Internally uses `resolve_to_accessions()` and `get_string_mappings()`, and stores results 
            in `.stats["functional"]`. Results can be accessed or visualized via `plot_enrichment_svg()`
            or by visiting the linked STRING URLs.
        """
        def query_functional_enrichment(query_ids, species_id, background_ids=None, debug=False):
            print(f"{format_log_prefix('info_only',2)} Running enrichment on {len(query_ids)} STRING IDs (species {species_id})...") if debug else None
            url = "https://string-db.org/api/json/enrichment"
            payload = {
                "identifiers": "%0d".join(query_ids),
                "species": species_id,
                "caller_identity": "scpviz"
            }
            if background_ids is not None:
                print(f"{format_log_prefix('info_only')} Using background of {len(background_ids)} STRING IDs.")
                payload["background_string_identifiers"] = "%0d".join(background_ids)

            print(payload) if debug else None
            response = requests.post(url, data=payload)
            response.raise_for_status()
            return pd.DataFrame(response.json())

        # Ensure string metadata section exists
        if "functional" not in self.stats:
            self.stats["functional"] = {}

        if genes is None and from_de:
            resolved_key = _resolve_de_key(self.stats, de_key)
            de_df = self.stats[resolved_key]
            sig_df = de_df[de_df["significance"] != "not significant"].copy()
            print(f"{format_log_prefix('user')} Running STRING enrichment [DE-based: {resolved_key}]")

            up_genes = sig_df[sig_df[score_col] > 0][gene_col].dropna().head(top_n).tolist()
            down_genes = sig_df[sig_df[score_col] < 0][gene_col].dropna().head(top_n).tolist()

            up_accs, up_unresolved = self.resolve_to_accessions(up_genes)
            down_accs, down_unresolved = self.resolve_to_accessions(down_genes)

            background_accs = None
            background_string_ids = None
            if background == "all_quantified":
                print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
                background_accs = de_df[de_df["significance"] == "not significant"].index.tolist()

            if background_accs:
                bg_map = self.get_string_mappings(background_accs,debug=debug)
                bg_map = bg_map[bg_map["string_identifier"].notna()]
                background_string_ids = bg_map["string_identifier"].tolist()

            if store_key is not None:
                print(f"{format_log_prefix('warn')} Ignoring `store_key` for DE-based enrichment. Using auto-generated pretty keys.")

            results = {}
            for label, accs in zip(["up", "down"], [up_accs, down_accs]):
                print(f"\n🔹 {label.capitalize()}-regulated proteins")
                t0 = time.time()

                if not accs:
                    print(f"{format_log_prefix('warn')} No {label}-regulated proteins to analyze.")
                    continue

                mapping_df = self.get_string_mappings(accs, debug=debug)
                mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
                if mapping_df.empty:
                    print(f"{format_log_prefix('warn')} No valid STRING mappings found for {label}-regulated proteins.")
                    continue

                string_ids = mapping_df["string_identifier"].tolist()
                inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
                if species is not None:
                    # check if user species is same as inferred
                    if str(inferred_species) != str(species):
                        print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
                    species_id = species
                else:
                    species_id = inferred_species

                print(f"   🔸 Proteins: {len(accs)} → STRING IDs: {len(string_ids)}")
                print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
                if label == "up":
                    if up_unresolved:
                        print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(up_unresolved)}")
                else:
                    if down_unresolved:
                        print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(down_unresolved)}")

                enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
                enrich_key = f"{resolved_key}_{label}"
                pretty_base = _pretty_vs_key(resolved_key)
                pretty_key = f"{pretty_base}_{label}"
                string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

                self.stats["functional"][pretty_key] = {
                    "string_ids": string_ids,
                    "background_string_ids": background_string_ids,
                    "species": species_id,
                    "input_key": resolved_key if from_de else None,
                    "string_url": string_url,
                    "result": enrichment_df
                }

                print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
                print(f"   • Access result: pdata.stats['functional'][\"{pretty_key}\"][\"result\"]")
                print(f"   • Plot command : pdata.plot_enrichment_svg(\"{pretty_base}\", direction=\"{label}\")")
                print(f"   • View online  : {string_url}\n")

                results[label] = enrichment_df

            return results

        elif genes is not None:
            t0 = time.time()
            print(f"{format_log_prefix('user')} Running STRING enrichment [user-supplied]")

            if store_key is None:
                prefix = "UserSearch"
                existing = self.stats["functional"].keys() if "functional" in self.stats else []
                existing_ids = [k for k in existing if k.startswith(prefix)]
                next_id = len(existing_ids) + 1
                store_key = f"{prefix}{next_id}"

            input_accs, unresolved_accs = self.resolve_to_accessions(genes)
            mapping_df = self.get_string_mappings(input_accs, debug=debug)
            mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
            if mapping_df.empty:
                raise ValueError("No valid STRING mappings found for the provided identifiers.")

            string_ids = mapping_df["string_identifier"].tolist()
            inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
            if species is not None:
                # check if user species is same as inferred
                if str(inferred_species) != str(species):
                    print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
                species_id = species
            else:
                species_id = inferred_species

            background_string_ids = None
            if background == "all_quantified":
                print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
                all_accs = list(self.prot.var_names)
                background_accs = list(set(all_accs) - set(input_accs))
                bg_map = self.get_string_mappings(background_accs, debug=debug)
                bg_map = bg_map[bg_map["string_identifier"].notna()]
                background_string_ids = bg_map["string_identifier"].tolist()

            print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(string_ids)}")
            print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
            if unresolved_accs:
                print(f"{format_log_prefix('warn',2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

            enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
            string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

            self.stats["functional"][store_key] = {
                "string_ids": string_ids,
                "background_string_ids": background_string_ids,
                "species": species_id,
                "input_key": None,
                "string_url": string_url,
                "result": enrichment_df
            }

            print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
            print(f"   • Access result: pdata.stats['functional'][\"{store_key}\"][\"result\"]")
            print(f"   • Plot command : pdata.plot_enrichment_svg(\"{store_key}\")")
            print(f"   • View online  : {string_url}\n")

            return enrichment_df

        else:
            raise ValueError("Must provide 'genes' or set from_de=True to use DE results.") 

    def enrichment_ppi(self, genes, species=None, store_key=None, debug=False):
        """
        Run STRING PPI (protein–protein interaction) enrichment on a user-supplied gene or accession list.

        This method maps the input gene names or UniProt accessions to STRING IDs, infers the species 
        if not provided, and submits the list to STRING's PPI enrichment endpoint. Results are stored 
        in `.stats["ppi"]` for later retrieval or visualization.

        Args:
            genes (list of str): A list of gene names or UniProt accessions to analyze.
            species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). If None, inferred from STRING mappings.
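            store_key (str, optional): Key to store the enrichment result under `.stats["ppi"]`.
                If None, a unique key is auto-generated.
            debug (bool): If True, prints diagnostic messages.

        Returns:
            dict: STRING PPI enrichment summary, including `number_of_edges`,
                `expected_number_of_edges`, and `p_value`.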

        Example:
            Run differential expression, then perform STRING PPI enrichment on significant genes:
                ```python
                class_type = ["group", "condition"]
                values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]

                pdata.de(classes=class_type, values=values)
                pdata.list_enrichments()
                sig_genes = pdata.stats["GroupA_Treatment1 vs GroupA_Control"]
                sig_genes = sig_genes[sig_genes["significance"] != "not significant"]["Genes"].dropna().tolist()

                pdata.enrichment_ppi(genes=sig_genes)
                ```
        """
        def query_ppi_enrichment(string_ids, species):
            # print(f"[INFO] Running PPI enrichment for {len(string_ids)} STRING IDs (species {species})...")
            url = "https://string-db.org/api/json/ppi_enrichment"
            payload = {
                "identifiers": "%0d".join(string_ids),
                "species": species,
                "caller_identity": "scpviz"
            }

            response = requests.post(url, data=payload)
            response.raise_for_status()

            result = response.json()
            if debug:
                print("[DEBUG] PPI enrichment result:", result)
            return result[0] if isinstance(result, list) else result

        print(f"{format_log_prefix('user')} Running STRING PPI enrichment")
        t0 = time.time()
        input_accs, unresolved_accs = self.resolve_to_accessions(genes)
        mapping_df = self.get_string_mappings(input_accs, debug=debug)
        mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
        if mapping_df.empty:
            raise ValueError("No valid STRING mappings found for the provided genes/accessions.")

        string_ids = mapping_df["string_identifier"].tolist()
        inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
        species_id = species if species is not None else inferred_species

        print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(mapping_df)}")
        print(f"   🔸 Species: {species_id}")
        if unresolved_accs:
            print(f"{format_log_prefix('warn', 2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

        result = query_ppi_enrichment(string_ids, species_id)

        # Store results
        if "ppi" not in self.stats:
            self.stats["ppi"] = {}

        if store_key is None:
            base = "UserPPI"
            counter = 1
            while f"{base}{counter}" in self.stats["ppi"]:
                counter += 1
            store_key = f"{base}{counter}"

        self.stats["ppi"][store_key] = {
            "result": result,
            "string_ids": string_ids,
            "species": species_id
        }

        print(f"{format_log_prefix('result')} PPI enrichment complete ({time.time() - t0:.2f}s)")
        print(f"   • STRING IDs   : {len(string_ids)}")
        print(f"   • Edges found  : {result['number_of_edges']} vs {result['expected_number_of_edges']} expected")
        print(f"   • p-value      : {result['p_value']:.2e}")
        print(f"   • Access result: pdata.stats['ppi']['{store_key}']['result']\n")

        return result

    def list_enrichments(self):
        """
        List available STRING enrichment results and unprocessed DE contrasts.

        This method prints available functional and PPI enrichment entries stored in
        `.stats["functional"]` and `.stats["ppi"]`, as well as DE comparisons stored in
        `.stats` that have not yet been analyzed.

        Returns:
            None

        Example:
            List enrichment results stored after running functional or PPI enrichment:
                ```python
                pdata.list_enrichments()
                ```
        """

        functional = self.stats.get("functional", {})
        ppi_keys = self.stats.get("ppi", {}).keys()
        de_keys = {k for k in self.stats if "vs" in k and not k.endswith(("_up", "_down"))}

        # Collect enriched DE keys based on input_key metadata
        enriched_de = set()
        enriched_results = []

        for k, meta in functional.items():
            input_key = meta.get("input_key", None)
            is_de = "vs" in k

            if input_key and input_key in de_keys:
                base = input_key
                suffix = k.rsplit("_", 1)[-1]
                pretty = f"{_pretty_vs_key(base)}_{suffix}"
                enriched_de.add(base)
                enriched_results.append((pretty, k, "DE-based"))
            else:
                enriched_results.append((k, k, "User"))

        de_unenriched = sorted(_pretty_vs_key(k) for k in (de_keys - enriched_de))

        print(f"{format_log_prefix('user')} Listing STRING enrichment status\n")

        print(f"{format_log_prefix('info_only',2)} Available DE comparisons (not yet enriched):")
        if de_unenriched:
            for pk in de_unenriched:
                print(f"        - {pk}")
        else:
            print("  (none)\n")

        print("\n  🔹 To run enrichment:")
        print("      pdata.enrichment_functional(from_de=True, de_key=\"...\")")

        print(f"\n{format_log_prefix('result_only')} Completed STRING enrichment results:")
        if not enriched_results:
            print("    (none)")
        for pretty, raw_key, kind in enriched_results:
            if kind == "DE-based":
                base, suffix = pretty.rsplit("_", 1)
                print(f"  - {pretty} ({kind})")
                print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
                print(f"    • Plot : pdata.plot_enrichment_svg(\"{base}\", direction=\"{suffix}\")")
                url = self.stats["functional"].get(raw_key, {}).get("string_url")
                if url:
                    print(f"    • Link  : {url}")
            else:
                print(f"  - {pretty} ({kind})")
                print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
                print(f"    • Plot : pdata.plot_enrichment_svg(\"{pretty}\")")
                url = self.stats["functional"].get(raw_key, {}).get("string_url")
                if url:
                    print(f"    • Link  : {url}")

        if ppi_keys:
            print(f"\n{format_log_prefix('result_only')} Completed STRING PPI results:")
            for key in sorted(ppi_keys):
                print(f"  - {key} (User)")
                print(f"    • Table: pdata.stats['ppi']['{key}']['result']")
        else:
            print(f"\n{format_log_prefix('result_only')} Completed STRING PPI results:")
            print("    (none)")

    def plot_enrichment_svg(self, key, direction=None, category=None, save_as=None):
        """
        Display STRING enrichment SVG inline in a Jupyter notebook.

        This method fetches and renders a STRING-generated SVG for a previously completed
        functional enrichment result. Optionally, the SVG can also be saved to disk.

        Args:
            key (str): Enrichment result key from `.stats["functional"]`. For DE-based comparisons, pass the
                contrast without the direction suffix (e.g., `"GroupA_Treatment1_vs_Control"`) and set `direction`.
            direction (str, optional): Direction of DE result, either `"up"` or `"down"`. Required for DE-based
                keys; use `None` for user-defined gene lists.
            category (str, optional): STRING enrichment category to filter by (e.g., `"Process"`, `"KEGG"`). See the category table in the note below.
            save_as (str, optional): If provided, saves the retrieved SVG to the given file path.

        Returns:
            None

        Example:
            Display a STRING enrichment network for a user-supplied gene list:
                ```python
                pdata.plot_enrichment_svg("UserSearch1")
                ```

        !!! note "Supported STRING Enrichment Categories"
            The following category IDs are supported for functional enrichment.  
            More details are available on the [STRING API documentation site](https://string-db.org/cgi/help?subpage=api).

            | Category ID          | Description                                      |
            |----------------------|--------------------------------------------------|
            | **Process**          | Biological Process (Gene Ontology)               |
            | **Function**         | Molecular Function (Gene Ontology)               |
            | **Component**        | Cellular Component (Gene Ontology)               |
            | **Keyword**          | Annotated Keywords (UniProt)                     |
            | **KEGG**             | KEGG Pathways                                    |
            | **RCTM**             | Reactome Pathways                                |
            | **HPO**              | Human Phenotype (Monarch)                        |
            | **MPO**              | Mammalian Phenotype Ontology (Monarch)           |
            | **DPO**              | Drosophila Phenotype (Monarch)                   |
            | **WPO**              | *C. elegans* Phenotype Ontology (Monarch)        |
            | **ZPO**              | Zebrafish Phenotype Ontology (Monarch)           |
            | **FYPO**             | Fission Yeast Phenotype Ontology (Monarch)       |
            | **Pfam**             | Protein Domains (Pfam)                           |
            | **SMART**            | Protein Domains (SMART)                          |
            | **InterPro**         | Protein Domains and Features (InterPro)          |
            | **PMID**             | Reference Publications (PubMed)                  |
            | **NetworkNeighborAL**| Local Network Cluster (STRING)                   |
            | **COMPARTMENTS**     | Subcellular Localization (COMPARTMENTS)          |
            | **TISSUES**          | Tissue Expression (TISSUES)                      |
            | **DISEASES**         | Disease–gene Associations (DISEASES)             |
            | **WikiPathways**     | WikiPathways                                     |

        Note:
            The `key` must correspond to an existing entry in `.stats["functional"]`, created via 
            `enrichment_functional()`.
        """
        from xml.parsers.expat import ExpatError

        if "functional" not in self.stats:
            raise ValueError("No STRING enrichment results found in .stats['functional'].")

        all_keys = list(self.stats["functional"].keys())

        # Handle DE-type key
        if "vs" in key:
            if direction not in {"up", "down"}:
                raise ValueError("You must specify direction='up' or 'down' for DE-based enrichment keys.")
            lookup_key = _resolve_de_key(self.stats["functional"], f"{key}_{direction}")
        else:
            # Handle user-supplied key (e.g. "UserSearch1")
            if direction is not None:
                print(f"[WARNING] Ignoring direction='{direction}' for user-supplied key: '{key}'")
            lookup_key = key

        if lookup_key not in self.stats["functional"]:
            available = "\n".join(f"  - {k}" for k in self.stats["functional"].keys())
            raise ValueError(f"Could not find enrichment results for '{lookup_key}'. Available keys:\n{available}")

        meta = self.stats["functional"][lookup_key]
        string_ids = meta["string_ids"]
        species_id = meta["species"]

        url = "https://string-db.org/api/svg/enrichmentfigure"
        params = {
            "identifiers": "%0d".join(string_ids),
            "species": species_id
        }
        if category:
            params["category"] = category

        print(f"{format_log_prefix('user')} Fetching STRING SVG for key '{lookup_key}' (n={len(string_ids)})...")
        response = requests.get(url, params=params)
        response.raise_for_status()

        if save_as:
            with open(save_as, "wb") as f:
                f.write(response.content)
            print(f"{format_log_prefix('info_only')} Saved SVG to: {save_as}")

        with tempfile.NamedTemporaryFile("wb", suffix=".svg", delete=False) as tmp:
            tmp.write(response.content)
            tmp_path = tmp.name

        try:
            try:
                display(SVG(filename=tmp_path))
            except ExpatError:
                print(
                    f"{format_log_prefix('info_only')} No enrichment figure available "
                    f"for key '{lookup_key}'"
                    + (f" (category='{category}')." if category else ".")
                )
        finally:
            os.remove(tmp_path)

    def get_string_network_link(self, key=None, string_ids=None, species=None, show_labels=True):
        """
        Generate a direct STRING network URL to visualize protein interactions online.

        This method constructs a STRING website link to view a network of proteins,
        using either a list of STRING IDs or a key from previously stored enrichment results.

        Args:
            key (str, optional): Key from `.stats["functional"]` to extract STRING IDs and species info.
            string_ids (list of str, optional): List of STRING identifiers to include in the network.
            species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). Taken from the stored entry when `key` is used; recommended when passing `string_ids` directly.
            show_labels (bool): If True (default), node labels will be shown in the network view.

        Returns:
            str: URL to open the network in the STRING web interface.

        Example:
            Get a STRING network link for a stored enrichment result:
                ```python
                url = pdata.get_string_network_link(key="UserSearch1")
                print(url)
                ```
        """
        if string_ids is None:
            if key is None:
                raise ValueError("Must provide either a list of STRING IDs or a key.")
            metadata = self.stats.get("functional", {}).get(key)
            if metadata is None:
                raise ValueError(f"Key '{key}' not found in self.stats['functional'].")
            string_ids = metadata.get("string_ids")
            species = species or metadata.get("species")

        if not string_ids:
            raise ValueError("No STRING IDs found or provided.")

        base_url = "https://string-db.org/cgi/network"
        params = [
            f"identifiers={'%0d'.join(string_ids)}",
            "caller_identity=scpviz"
        ]
        if species:
            params.append(f"species={species}")
        if show_labels:
            params.append("show_query_node_labels=1")

        return f"{base_url}?{'&'.join(params)}"

enrichment_functional

enrichment_functional(genes=None, from_de=True, top_n=150, score_col='significance_score', gene_col='Genes', de_key='de_results', store_key=None, species=None, background=None, debug=False, **kwargs)

Run functional enrichment analysis using STRING on a gene list.

This method performs ranked or unranked enrichment analysis using STRING's API. It supports both differential expression-based analysis (up- and down-regulated genes) and custom gene lists provided by the user. Enrichment results are stored in .stats["functional"] for later access and plotting.

Parameters:

Name Type Description Default
genes list of str

List of gene symbols or accessions to analyze. When provided, the list is used even if from_de=True.

None
from_de bool

If True (default), selects genes from stored differential expression results.

True
top_n int

Number of top-ranked genes to use when from_de=True (default is 150).

150
score_col str

Column name in the DE table to rank genes by (default is "significance_score").

'significance_score'
gene_col str

Column name in .prot.var or DE results that contains gene names.

'Genes'
de_key str

Key to retrieve stored DE results from .stats["de_results"].

'de_results'
store_key str

Custom key to store enrichment results. Ignored for DE-based enrichment, which uses auto-generated keys.

None
species str

Organism name or NCBI taxonomy ID. If None, inferred from STRING response.

None
background str or list of str

Background gene list to use for enrichment.

  • If "all_quantified", uses non-significant proteins from DE or all other quantified proteins.
  • If a list, must contain valid gene names or accessions.
None
debug bool

If True, prints API request info and diagnostic messages.

False
**kwargs

Additional keyword arguments passed to the STRING enrichment API.

{}

Returns:

Type Description

dict or pd.DataFrame:

  • If from_de=True, returns a dictionary of enrichment DataFrames for "up" and "down" gene sets.
  • If genes is provided, returns a single enrichment DataFrame.
Example

Run differential expression, then perform STRING enrichment on top-ranked genes:

case1 = {'cellline': 'AS', 'treatment': 'sc'}  # legacy style: class_type = ["group", "condition"]
case2 = {'cellline': 'BE', 'treatment': 'sc'}  # legacy style: values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]
case_values = [case1, case2]
pdata.de(values=case_values)  # or legacy style: pdata.de(classes=class_type, values=values)
pdata.list_enrichments()  # list available DE result keys
pdata.enrichment_functional(from_de=True, de_key="GroupA_Treatment1 vs GroupA_Control")  # use a key reported by list_enrichments()

Perform enrichment on a custom list of genes:

genelist = ["P55072", "NPLOC4", "UFD1", "STX5A", "NSFL1C", "UBXN2A",
            "UBXN4", "UBE4B", "YOD1", "WASHC5", "PLAA", "UBXN10"]
pdata.enrichment_functional(genes=genelist, from_de=False)

Note

Internally uses resolve_to_accessions() and get_string_mappings(), and stores results in .stats["functional"]. Results can be accessed or visualized via plot_enrichment_svg() or by visiting the linked STRING URLs.
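
Once a run has finished, the stored table can be filtered like any pandas DataFrame. The sketch below uses a small hand-made table in place of a real STRING response, so every value in it is invented for illustration. The column names (`category`, `term`, `description`, `fdr`) are assumed to follow STRING's enrichment output and should be checked against your own result. `top_terms` is a hypothetical helper, not part of scpviz.

```python
import pandas as pd

# Hand-made stand-in for pdata.stats["functional"]["UserSearch1"]["result"].
# Column names are assumed to match STRING's enrichment output; values are invented.
enrichment_df = pd.DataFrame({
    "category": ["Process", "Process", "KEGG", "Component"],
    "term": ["term_a", "term_b", "term_c", "term_d"],
    "description": ["example A", "example B", "example C", "example D"],
    "fdr": [3.4e-5, 1.2e-8, 2.0e-6, 0.2],
})

def top_terms(df, category=None, max_fdr=0.05, n=10):
    """Hypothetical helper: keep significant terms, optionally from one category."""
    out = df[df["fdr"] <= max_fdr]
    if category is not None:
        out = out[out["category"] == category]
    # Most significant terms first
    return out.sort_values("fdr").head(n).reset_index(drop=True)

process_terms = top_terms(enrichment_df, category="Process")
print(process_terms[["term", "description", "fdr"]])
```

With a real run, `enrichment_df` would be replaced by the `result` entry stored under the relevant key in `.stats["functional"]`.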

Source code in src/scpviz/pAnnData/enrichment.py
def enrichment_functional(
    self,
    genes=None,
    from_de=True,
    top_n=150,
    score_col="significance_score",
    gene_col="Genes",
    de_key="de_results",
    store_key=None,
    species=None,
    background=None,
    debug=False,
    **kwargs
):
    """
    Run functional enrichment analysis using STRING on a gene list.

    This method performs ranked or unranked enrichment analysis using STRING's API.
    It supports both differential expression-based analysis (up- and down-regulated genes)
    and custom gene lists provided by the user. Enrichment results are stored in
    `.stats["functional"]` for later access and plotting.

    Args:
        genes (list of str, optional): List of gene symbols or accessions to analyze. When provided, the list is used even if `from_de=True`.
        from_de (bool): If True (default), selects genes from stored differential expression results.
        top_n (int): Number of top-ranked genes to use when `from_de=True` (default is 150).
        score_col (str): Column name in the DE table to rank genes by (default is `"significance_score"`).
        gene_col (str): Column name in `.prot.var` or DE results that contains gene names.
        de_key (str): Key to retrieve stored DE results from `.stats["de_results"]`.
        store_key (str, optional): Custom key to store enrichment results. Ignored for DE-based enrichment, which uses auto-generated keys.
        species (str, optional): Organism name or NCBI taxonomy ID. If None, inferred from STRING response.
        background (str or list of str, optional): Background gene list to use for enrichment.

            - If `"all_quantified"`, uses non-significant proteins from DE or all other quantified proteins.
            - If a list, must contain valid gene names or accessions.
        debug (bool): If True, prints API request info and diagnostic messages.
        **kwargs: Additional keyword arguments passed to the STRING enrichment API.

    Returns:
        dict or pd.DataFrame:

            - If `from_de=True`, returns a dictionary of enrichment DataFrames for "up" and "down" gene sets.
            - If `genes` is provided, returns a single enrichment DataFrame.

    Example:
        Run differential expression, then perform STRING enrichment on top-ranked genes:
            ```python
            case1 = {'cellline': 'AS', 'treatment': 'sc'}  # legacy style: class_type = ["group", "condition"]
            case2 = {'cellline': 'BE', 'treatment': 'sc'}  # legacy style: values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]
            case_values = [case1, case2]
            pdata.de(values=case_values)  # or legacy style: pdata.de(classes=class_type, values=values)
            pdata.list_enrichments()  # list available DE result keys
            pdata.enrichment_functional(from_de=True, de_key="GroupA_Treatment1 vs GroupA_Control")  # use a key reported by list_enrichments()
            ```

        Perform enrichment on a custom list of genes:
            ```python
            genelist = ["P55072", "NPLOC4", "UFD1", "STX5A", "NSFL1C", "UBXN2A",
                        "UBXN4", "UBE4B", "YOD1", "WASHC5", "PLAA", "UBXN10"]
            pdata.enrichment_functional(genes=genelist, from_de=False)
            ```

    Note:
        Internally uses `resolve_to_accessions()` and `get_string_mappings()`, and stores results 
        in `.stats["functional"]`. Results can be accessed or visualized via `plot_enrichment_svg()`
        or by visiting the linked STRING URLs.
    """
    def query_functional_enrichment(query_ids, species_id, background_ids=None, debug=False):
        print(f"{format_log_prefix('info_only',2)} Running enrichment on {len(query_ids)} STRING IDs (species {species_id})...") if debug else None
        url = "https://string-db.org/api/json/enrichment"
        payload = {
            "identifiers": "%0d".join(query_ids),
            "species": species_id,
            "caller_identity": "scpviz"
        }
        if background_ids is not None:
            print(f"{format_log_prefix('info_only')} Using background of {len(background_ids)} STRING IDs.")
            payload["background_string_identifiers"] = "%0d".join(background_ids)

        print(payload) if debug else None
        response = requests.post(url, data=payload)
        response.raise_for_status()
        return pd.DataFrame(response.json())

    # Ensure string metadata section exists
    if "functional" not in self.stats:
        self.stats["functional"] = {}

    if genes is None and from_de:
        resolved_key = _resolve_de_key(self.stats, de_key)
        de_df = self.stats[resolved_key]
        sig_df = de_df[de_df["significance"] != "not significant"].copy()
        print(f"{format_log_prefix('user')} Running STRING enrichment [DE-based: {resolved_key}]")

        up_genes = sig_df[sig_df[score_col] > 0][gene_col].dropna().head(top_n).tolist()
        down_genes = sig_df[sig_df[score_col] < 0][gene_col].dropna().head(top_n).tolist()

        up_accs, up_unresolved = self.resolve_to_accessions(up_genes)
        down_accs, down_unresolved = self.resolve_to_accessions(down_genes)

        background_accs = None
        background_string_ids = None
        if background == "all_quantified":
            print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
            background_accs = de_df[de_df["significance"] == "not significant"].index.tolist()

        if background_accs:
            bg_map = self.get_string_mappings(background_accs,debug=debug)
            bg_map = bg_map[bg_map["string_identifier"].notna()]
            background_string_ids = bg_map["string_identifier"].tolist()

        if store_key is not None:
            print(f"{format_log_prefix('warn')} Ignoring `store_key` for DE-based enrichment. Using auto-generated pretty keys.")

        results = {}
        for label, accs in zip(["up", "down"], [up_accs, down_accs]):
            print(f"\n🔹 {label.capitalize()}-regulated proteins")
            t0 = time.time()

            if not accs:
                print(f"{format_log_prefix('warn')} No {label}-regulated proteins to analyze.")
                continue

            mapping_df = self.get_string_mappings(accs, debug=debug)
            mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
            if mapping_df.empty:
                print(f"{format_log_prefix('warn')} No valid STRING mappings found for {label}-regulated proteins.")
                continue

            string_ids = mapping_df["string_identifier"].tolist()
            inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
            if species is not None:
                # check if user species is same as inferred
                if inferred_species != species:
                    print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
                species_id = species
            else:
                species_id = inferred_species

            print(f"   🔸 Proteins: {len(accs)} → STRING IDs: {len(string_ids)}")
            print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
            if label == "up":
                if up_unresolved:
                    print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(up_unresolved)}")
            else:
                if down_unresolved:
                    print(f"{format_log_prefix('warn',2)} Some accessions unresolved for {label}-regulated proteins: {', '.join(down_unresolved)}")

            enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
            enrich_key = f"{resolved_key}_{label}"
            pretty_base = _pretty_vs_key(resolved_key)
            pretty_key = f"{pretty_base}_{label}"
            string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

            self.stats["functional"][pretty_key] = {
                "string_ids": string_ids,
                "background_string_ids": background_string_ids,
                "species": species_id,
                "input_key": resolved_key if from_de else None,
                "string_url": string_url,
                "result": enrichment_df
            }

            print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
            print(f"   • Access result: pdata.stats['functional'][\"{pretty_key}\"][\"result\"]")
            print(f"   • Plot command : pdata.plot_enrichment_svg(\"{pretty_base}\", direction=\"{label}\")")
            print(f"   • View online  : {string_url}\n")

            results[label] = enrichment_df

        # Return the per-direction tables, as documented for from_de=True
        return results

    elif genes is not None:
        t0 = time.time()
        print(f"{format_log_prefix('user')} Running STRING enrichment [user-supplied]")

        if store_key is None:
            prefix = "UserSearch"
            existing = self.stats["functional"].keys() if "functional" in self.stats else []
            existing_ids = [k for k in existing if k.startswith(prefix)]
            next_id = len(existing_ids) + 1
            store_key = f"{prefix}{next_id}"

        input_accs, unresolved_accs = self.resolve_to_accessions(genes)
        mapping_df = self.get_string_mappings(input_accs, debug=debug)
        mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
        if mapping_df.empty:
            raise ValueError("No valid STRING mappings found for the provided identifiers.")

        string_ids = mapping_df["string_identifier"].tolist()
        inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
        if species is not None:
            # check if user species is same as inferred
            if inferred_species != species:
                print(f"{format_log_prefix('warn',2)} Inferred species ({inferred_species}) does not match user-specified ({species}). Using user-specified species.")
            species_id = species
        else:
            species_id = inferred_species

        background_string_ids = None
        if background == "all_quantified":
            print(f"{format_log_prefix('warn')} Mapping background proteins may take a long time due to batching.")
            all_accs = list(self.prot.var_names)
            background_accs = list(set(all_accs) - set(input_accs))
            bg_map = self.get_string_mappings(background_accs, debug=debug)
            bg_map = bg_map[bg_map["string_identifier"].notna()]
            background_string_ids = bg_map["string_identifier"].tolist()

        print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(string_ids)}")
        print(f"   🔸 Species: {species_id} | Background: {'None' if background_string_ids is None else 'custom'}")
        if unresolved_accs:
            print(f"{format_log_prefix('warn',2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

        enrichment_df = query_functional_enrichment(string_ids, species_id, background_string_ids, debug=debug)
        string_url = self.get_string_network_link(string_ids=string_ids, species=species_id)

        self.stats["functional"][store_key] = {
            "string_ids": string_ids,
            "background_string_ids": background_string_ids,
            "species": species_id,
            "input_key": None,
            "string_url": string_url,
            "result": enrichment_df
        }

        print(f"{format_log_prefix('result')} Enrichment complete ({time.time() - t0:.2f}s)")
        print(f"   • Access result: pdata.stats['functional'][\"{store_key}\"][\"result\"]")
        print(f"   • Plot command : pdata.plot_enrichment_svg(\"{store_key}\")")
        print(f"   • View online  : {string_url}\n")

        return enrichment_df

    else:
        raise ValueError("Must provide 'genes' or set from_de=True to use DE results.") 

enrichment_ppi

enrichment_ppi(genes, species=None, store_key=None, debug=False)

Run STRING PPI (protein–protein interaction) enrichment on a user-supplied gene or accession list.

This method maps the input gene names or UniProt accessions to STRING IDs, infers the species if not provided, and submits the list to STRING's PPI enrichment endpoint. Results are stored in .stats["ppi"] for later retrieval or visualization.
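
The species inference described above can be sketched as follows: the source takes the most frequent `ncbi_taxon_id` among rows that received a STRING ID, and an explicit `species` argument overrides it. The table below is a toy stand-in with invented identifiers, and `pick_species` is a hypothetical helper used only for illustration.

```python
import pandas as pd

# Toy mapping table with the columns returned by get_string_mappings().
# All identifiers are invented placeholders.
mapping_df = pd.DataFrame({
    "input_identifier": ["ACC1", "ACC2", "ACC3", "ACC4"],
    "string_identifier": ["9606.ENSP0001", "9606.ENSP0002", None, "10090.ENSMUSP0001"],
    "ncbi_taxon_id": ["9606", "9606", None, "10090"],
})

def pick_species(mapping_df, species=None):
    """Hypothetical helper: explicit species wins, else the modal taxon ID."""
    mapped = mapping_df[mapping_df["string_identifier"].notna()]
    if mapped.empty:
        raise ValueError("No valid STRING mappings found.")
    inferred = mapped["ncbi_taxon_id"].mode().iloc[0]
    return species if species is not None else inferred

print(pick_species(mapping_df))         # most frequent taxon among mapped rows
print(pick_species(mapping_df, 10090))  # explicit value is used as given
```

For mixed-species inputs the modal taxon may not be the one you intend, so passing `species` explicitly is the safer choice in that case.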

Parameters:

Name Type Description Default
genes list of str

A list of gene names or UniProt accessions to analyze.

required
species int or str

NCBI taxonomy ID (e.g., 9606 for human). If None, inferred from STRING mappings.

None
store_key str

Key to store the enrichment result under .stats["ppi"]. If None, a unique key is auto-generated.

None
debug bool

If True, prints diagnostic output while mapping identifiers to STRING IDs.

False

Returns:

Type Description

dict: STRING PPI enrichment summary, including number_of_edges, expected_number_of_edges, and p_value.

Example

Run differential expression, then perform STRING PPI enrichment on significant genes:

class_type = ["group", "condition"]
values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]

pdata.de(classes=class_type, values=values)
pdata.list_enrichments()
sig_genes = pdata.stats["de_results"]["GroupA_Treatment1 vs GroupA_Control"]
sig_genes = sig_genes[sig_genes["significance"] != "not significant"]["Genes"].dropna().tolist()

pdata.enrichment_ppi(genes=sig_genes)
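
The summary printed by the method reads `number_of_edges`, `expected_number_of_edges`, and `p_value` from the result. A minimal sketch of how such a record might be interpreted is shown below, assuming a mapping with those three keys. The numbers are made up, and `summarize_ppi` and the `alpha` threshold are hypothetical, not part of scpviz.

```python
# Made-up record shaped like the entry stored in pdata.stats["ppi"][key]["result"].
ppi_result = {
    "number_of_edges": 42,
    "expected_number_of_edges": 12,
    "p_value": 1.5e-9,
}

def summarize_ppi(result, alpha=0.05):
    """Hypothetical helper: compare observed and expected edge counts."""
    observed = result["number_of_edges"]
    expected = result["expected_number_of_edges"]
    # Guard against division by zero when no edges are expected
    ratio = observed / expected if expected else float("inf")
    return {
        "observed_over_expected": ratio,
        "enriched": result["p_value"] < alpha and observed > expected,
    }

summary = summarize_ppi(ppi_result)
print(f"{summary['observed_over_expected']:.1f}x expected edges; enriched={summary['enriched']}")
```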

Source code in src/scpviz/pAnnData/enrichment.py
def enrichment_ppi(self, genes, species=None, store_key=None, debug=False):
    """
    Run STRING PPI (protein–protein interaction) enrichment on a user-supplied gene or accession list.

    This method maps the input gene names or UniProt accessions to STRING IDs, infers the species 
    if not provided, and submits the list to STRING's PPI enrichment endpoint. Results are stored 
    in `.stats["ppi"]` for later retrieval or visualization.

    Args:
        genes (list of str): A list of gene names or UniProt accessions to analyze.
        species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). If None, inferred from STRING mappings.
        store_key (str, optional): Key to store the enrichment result under `.stats["ppi"]`.
            If None, a unique key is auto-generated.
        debug (bool): If True, prints diagnostic output while mapping identifiers to STRING IDs.

    Returns:
        dict: STRING PPI enrichment summary, including `number_of_edges`, `expected_number_of_edges`, and `p_value`.

    Example:
        Run differential expression, then perform STRING PPI enrichment on significant genes:
            ```python
            class_type = ["group", "condition"]
            values = [["GroupA", "Treatment1"], ["GroupA", "Control"]]

            pdata.de(classes=class_type, values=values)
            pdata.list_enrichments()
            sig_genes = pdata.stats["de_results"]["GroupA_Treatment1 vs GroupA_Control"]
            sig_genes = sig_genes[sig_genes["significance"] != "not significant"]["Genes"].dropna().tolist()

            pdata.enrichment_ppi(genes=sig_genes)
            ```
    """
    def query_ppi_enrichment(string_ids, species):
        # print(f"[INFO] Running PPI enrichment for {len(string_ids)} STRING IDs (species {species})...")
        url = "https://string-db.org/api/json/ppi_enrichment"
        payload = {
            "identifiers": "%0d".join(string_ids),
            "species": species,
            "caller_identity": "scpviz"
        }

        response = requests.post(url, data=payload)
        response.raise_for_status()

        result = response.json()
        if debug:
            print("[DEBUG] PPI enrichment result:", result)
        return result[0] if isinstance(result, list) else result

    print(f"{format_log_prefix('user')} Running STRING PPI enrichment")
    t0 = time.time()
    input_accs, unresolved_accs = self.resolve_to_accessions(genes)
    mapping_df = self.get_string_mappings(input_accs, debug=debug)
    mapping_df = mapping_df[mapping_df["string_identifier"].notna()]
    if mapping_df.empty:
        raise ValueError("No valid STRING mappings found for the provided genes/accessions.")

    string_ids = mapping_df["string_identifier"].tolist()
    inferred_species = mapping_df["ncbi_taxon_id"].mode().iloc[0]
    species_id = species if species is not None else inferred_species

    print(f"   🔸 Input genes: {len(genes)} → Resolved STRING IDs: {len(mapping_df)}")
    print(f"   🔸 Species: {species_id}")
    if unresolved_accs:
        print(f"{format_log_prefix('warn', 2)} Some accessions unresolved: {', '.join(unresolved_accs)}")

    result = query_ppi_enrichment(string_ids, species_id)

    # Store results
    if "ppi" not in self.stats:
        self.stats["ppi"] = {}

    if store_key is None:
        base = "UserPPI"
        counter = 1
        while f"{base}{counter}" in self.stats["ppi"]:
            counter += 1
        store_key = f"{base}{counter}"

    self.stats["ppi"][store_key] = {
        "result": result,
        "string_ids": string_ids,
        "species": species_id
    }

    print(f"{format_log_prefix('result')} PPI enrichment complete ({time.time() - t0:.2f}s)")
    print(f"   • STRING IDs   : {len(string_ids)}")
    print(f"   • Edges found  : {result['number_of_edges']} vs {result['expected_number_of_edges']} expected")
    print(f"   • p-value      : {result['p_value']:.2e}")
    print(f"   • Access result: pdata.stats['ppi']['{store_key}']['result']\n")

    return result

get_string_mappings

get_string_mappings(identifiers, overwrite=False, cache_col='STRING', batch_size=100, debug=False)

Resolve STRING IDs for UniProt accessions with a 2-step strategy:

  1. Use UniProt stream (fields: xref_string) to fill the cache quickly.
  2. For any still-missing rows, query STRING get_string_ids, batched by organism_id.

This method retrieves corresponding STRING identifiers for a list of UniProt accessions and caches them in self.prot.var[cache_col] (default column "STRING") for downstream use.

Parameters:

Name Type Description Default
identifiers list of str

List of UniProt accession IDs to map.

required
overwrite bool

If True, ignores cached STRING IDs and looks up all known accessions again.

False
cache_col str

Column in .prot.var used to cache STRING IDs.

'STRING'
batch_size int

Number of accessions to include in each API query (default is 100).

100
debug bool

If True, prints progress and response info.

False

Returns:

Type Description

pd.DataFrame: Mapping table with columns: input_identifier, string_identifier, and ncbi_taxon_id.

Note

This is a helper method used primarily by enrichment_functional() and enrichment_ppi().
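
Before either remote step runs, the method checks the cache column and only looks up identifiers that have no usable cached value. That partition can be sketched on a toy table as below. `split_cached` is a hypothetical helper that mirrors the filtering in the source, and the accessions and STRING IDs are invented placeholders.

```python
import pandas as pd

# Toy stand-in for pdata.prot.var, indexed by accession.
# "STRING" is the default cache_col; values are invented.
prot_var = pd.DataFrame(
    {"STRING": ["9606.ENSP0001", pd.NA, "  "]},
    index=["ACC1", "ACC2", "ACC3"],
)

def split_cached(identifiers, prot_var, cache_col="STRING"):
    """Hypothetical helper: return (cached mapping, identifiers needing lookup)."""
    # Drop None and blank entries, as the method does
    ids = [str(x).strip() for x in identifiers if x is not None and str(x).strip()]
    known = [i for i in ids if i in prot_var.index]
    cached = {
        i: sid
        for i, sid in prot_var.loc[known, cache_col].items()
        if pd.notna(sid) and str(sid).strip()
    }
    missing = [i for i in ids if i not in cached]
    return cached, missing

cached, missing = split_cached(["ACC1", "ACC2", "ACC3", "ACC9", None], prot_var)
print(cached)   # identifiers with a usable cached STRING ID
print(missing)  # identifiers that would go to UniProt and then STRING
```

Note that identifiers absent from the table (here `ACC9`) are still looked up, but the source only writes results back to the cache for accessions present in the index.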

Source code in src/scpviz/pAnnData/enrichment.py
def get_string_mappings(self, identifiers, overwrite=False, cache_col="STRING", batch_size=100, debug=False):
    """
    Resolve STRING IDs for UniProt accessions with a 2-step strategy:
    1) Use UniProt stream (fields: xref_string) to fill cache quickly.
    2) For any still-missing rows, query STRING get_string_ids, batched by organism_id.

    This method retrieves corresponding STRING identifiers for a list of UniProt accessions
    and caches them in `self.prot.var[cache_col]` (default column `"STRING"`) for downstream use.

    Args:
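        identifiers (list of str): List of UniProt accession IDs to map.
        overwrite (bool): If True, ignores cached STRING IDs and looks up all known accessions again.
        cache_col (str): Column in `.prot.var` used to cache STRING IDs (default is `"STRING"`).
        batch_size (int): Number of accessions to include in each API query (default is 100).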
        debug (bool): If True, prints progress and response info.

    Returns:
        pd.DataFrame: Mapping table with columns: `input_identifier`, `string_identifier`, and `ncbi_taxon_id`.

    Note:
        This is a helper method used primarily by `enrichment_functional()` and `enrichment_ppi()`.
    """

    identifiers = [str(x).strip() for x in identifiers if x is not None and str(x).strip()]
    if debug:
        print(f"{format_log_prefix('info')} Resolving STRING IDs for {len(identifiers)} identifiers...")

    prot_var = self.prot.var
    if cache_col not in prot_var.columns:
        prot_var[cache_col] = pd.NA
    if "ncbi_taxon_id" not in prot_var.columns:
        prot_var["ncbi_taxon_id"] = pd.NA

    # Use cached STRING IDs if available
    valid_ids = [i for i in identifiers if i in prot_var.index]
    existing = prot_var.loc[valid_ids, cache_col]
    found_ids = {i: sid for i, sid in existing.items() if pd.notna(sid) and str(sid).strip()}
    missing = [i for i in identifiers if i not in found_ids]

    if overwrite:
        print(f"{format_log_prefix('info_only',2)} Overwriting cached STRING IDs.")
        missing = valid_ids
        found_ids = {}

    print(f"{format_log_prefix('info_only',2)} Found {len(found_ids)} cached STRING IDs. {len(missing)} need lookup.")
    print(missing) if debug else None

    # 1. UniProt stream (fast)         # Use UniProt xref_string field to fill cache quickly
    # 2. STRING API for still-missing ones

    if missing:
        map_df = utils.get_string_mappings(
            missing,
            use_uniprot=True,
            use_string=True,
            caller_identity="scpviz",
            batch_size=batch_size,
            debug=debug,
        )
    else:
        map_df = pd.DataFrame(columns=["input_identifier", "string_identifier", "ncbi_taxon_id"])

    # Combine all new mappings
    if not map_df.empty:
        updated = 0
        updated_tax = 0

        for _, row in map_df.iterrows():
            acc = row["input_identifier"]
            sid = row["string_identifier"]
            tax = row.get("ncbi_taxon_id", pd.NA)

            if acc is None or acc not in prot_var.index:
                continue

            if pd.notna(sid) and str(sid).strip():
                self.prot.var.at[acc, cache_col] = sid
                found_ids[acc] = sid
                updated += 1
            elif debug:
                print(f"[DEBUG] Skipping unknown accession '{acc}'")

            tax = utils.scalarize_taxon(tax)
            if tax is not pd.NA and pd.notna(tax):
                prot_var.at[acc, "ncbi_taxon_id"] = str(tax)
                updated_tax += 1

        print(f"{format_log_prefix('info_only',3)} Cached {updated} STRING ID mappings.")
        if debug:
            print(f"{format_log_prefix('info_only',3)} Cached {updated_tax} ncbi_taxon_id values.")

    elif missing:
        print(f"{format_log_prefix('warn_only',3)} No STRING mappings returned from STRING API.")


    # ------------------------------------
    # Build and MERGE UniProt results into out_df
    # ------------------------------------
    out_df = pd.DataFrame({"input_identifier": identifiers})

    # Prefer found_ids (fresh + cached), else fall back to cache for known indices
    out_df["string_identifier"] = out_df["input_identifier"].map(
        lambda acc: found_ids.get(
            acc,
            prot_var.at[acc, cache_col] if acc in prot_var.index else pd.NA,
        )
    )

    out_df["ncbi_taxon_id"] = out_df["input_identifier"].map(
        lambda acc: prot_var.at[acc, "ncbi_taxon_id"] if acc in prot_var.index else pd.NA
    )
    out_df["ncbi_taxon_id"] = out_df["ncbi_taxon_id"].apply(utils.scalarize_taxon)

    return out_df
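
The cache-first step in the source above can be sketched in isolation. This is a minimal illustration, not the library's implementation: the helper name `split_cached_and_missing`, the column name `STRING_id`, and the table contents are assumptions chosen for the example.

```python
import pandas as pd


def split_cached_and_missing(var, identifiers, cache_col="STRING_id"):
    """Return (cached, missing) for a list of identifiers.

    `cached` maps identifier -> stored STRING ID; `missing` lists everything
    that still needs a remote lookup. The column name is a hypothetical default.
    """
    # Only identifiers present in the index can have a cached value.
    known = [i for i in identifiers if i in var.index]
    cached = {
        i: sid
        for i, sid in var.loc[known, cache_col].items()
        if pd.notna(sid) and str(sid).strip()
    }
    # Anything without a usable cached value is queued for lookup.
    missing = [i for i in identifiers if i not in cached]
    return cached, missing


# Placeholder table: one cached entry, one empty entry.
var = pd.DataFrame(
    {"STRING_id": ["9606.ENSP00000269305", pd.NA]},
    index=["P04637", "P38398"],
)
cached, missing = split_cached_and_missing(var, ["P04637", "P38398", "NOT_IN_INDEX"])
print(cached)
print(missing)
```

Identifiers that are absent from the index end up in `missing` as well, which matches the source above, where `missing` is computed from the full input list rather than from `valid_ids`.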

get_string_network_link

get_string_network_link(key=None, string_ids=None, species=None, show_labels=True)

Generate a direct STRING network URL to visualize protein interactions online.

This method constructs a STRING website link to view a network of proteins, using either a list of STRING IDs or a key from previously stored enrichment results.

Parameters:

Name Type Description Default
key str

Key from .stats["functional"] to extract STRING IDs and species info.

None
string_ids list of str

List of STRING identifiers to include in the network.

None
species int or str

NCBI taxonomy ID (e.g., 9606 for human). Required if not using a stored key.

None
show_labels bool

If True (default), node labels will be shown in the network view.

True

Returns:

Name Type Description
str

URL to open the network in the STRING web interface.

Example

Get a STRING network link for a stored enrichment result:

url = pdata.get_string_network_link(key="UserSearch1")
print(url)
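
When no stored key is available, the same kind of link can be assembled from STRING IDs directly. The sketch below mirrors the URL construction in the source that follows; the standalone function name `build_string_network_url` and the STRING IDs are placeholders for illustration, not part of scpviz.

```python
def build_string_network_url(string_ids, species=None, show_labels=True):
    """Assemble a STRING network URL from a list of STRING identifiers."""
    if not string_ids:
        raise ValueError("No STRING IDs provided.")

    # STRING expects multiple identifiers separated by an encoded carriage return (%0d).
    params = ["identifiers=" + "%0d".join(string_ids), "caller_identity=scpviz"]
    if species:
        params.append(f"species={species}")
    if show_labels:
        params.append("show_query_node_labels=1")

    return "https://string-db.org/cgi/network?" + "&".join(params)


# Placeholder STRING IDs with the human taxonomy ID.
url = build_string_network_url(
    ["9606.ENSP00000269305", "9606.ENSP00000418960"], species=9606
)
print(url)
```

With the method itself, the equivalent call would pass `string_ids=` and `species=` instead of `key=`.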

Source code in src/scpviz/pAnnData/enrichment.py
def get_string_network_link(self, key=None, string_ids=None, species=None, show_labels=True):
    """
    Generate a direct STRING network URL to visualize protein interactions online.

    This method constructs a STRING website link to view a network of proteins,
    using either a list of STRING IDs or a key from previously stored enrichment results.

    Args:
        key (str, optional): Key from `.stats["functional"]` to extract STRING IDs and species info.
        string_ids (list of str, optional): List of STRING identifiers to include in the network.
        species (int or str, optional): NCBI taxonomy ID (e.g., 9606 for human). Required if not using a stored key.
        show_labels (bool): If True (default), node labels will be shown in the network view.

    Returns:
        str: URL to open the network in the STRING web interface.

    Example:
        Get a STRING network link for a stored enrichment result:
            ```python
            url = pdata.get_string_network_link(key="UserSearch1")
            print(url)
            ```
    """
    if string_ids is None:
        if key is None:
            raise ValueError("Must provide either a list of STRING IDs or a key.")
        metadata = self.stats.get("functional", {}).get(key)
        if metadata is None:
            raise ValueError(f"Key '{key}' not found in self.stats['functional'].")
        string_ids = metadata.get("string_ids")
        species = species or metadata.get("species")

    if not string_ids:
        raise ValueError("No STRING IDs found or provided.")

    base_url = "https://string-db.org/cgi/network"
    params = [
        f"identifiers={'%0d'.join(string_ids)}",
        "caller_identity=scpviz",
    ]
    if species:
        params.append(f"species={species}")
    if show_labels:
        params.append("show_query_node_labels=1")

    return f"{base_url}?{'&'.join(params)}"

list_enrichments

list_enrichments()

List available STRING enrichment results and unprocessed DE contrasts.

This method prints available functional and PPI enrichment entries stored in .stats["functional"] and .stats["ppi"], as well as DE comparisons in .stats["de_results"] that have not yet been analyzed.

Returns:

Type Description

None

Example

List enrichment results stored after running functional or PPI enrichment:

pdata.list_enrichments()
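
The way DE comparisons are classified as "not yet enriched" can be sketched on a plain dictionary. This is a simplified illustration based on the source below: the function name `unenriched_de_keys` and the example `stats` contents are assumptions, and the real method additionally pretty-prints keys through an internal helper.

```python
def unenriched_de_keys(stats):
    """Return DE comparison keys that have no functional enrichment yet."""
    functional = stats.get("functional", {})

    # DE comparisons are stored under keys containing "vs" without a direction suffix.
    de_keys = {k for k in stats if "vs" in k and not k.endswith(("_up", "_down"))}

    # An enrichment entry counts as DE-based when its input_key names a DE comparison.
    enriched = {
        meta.get("input_key")
        for meta in functional.values()
        if meta.get("input_key") in de_keys
    }
    return sorted(de_keys - enriched)


# Hypothetical stats content: two DE comparisons, one of them already enriched.
stats = {
    "A_vs_B": {},
    "C_vs_D": {},
    "functional": {
        "A_vs_B_up": {"input_key": "A_vs_B"},
        "UserSearch1": {"input_key": None},
    },
}
pending = unenriched_de_keys(stats)
print(pending)
```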

Source code in src/scpviz/pAnnData/enrichment.py
def list_enrichments(self):
    """
    List available STRING enrichment results and unprocessed DE contrasts.

    This method prints available functional and PPI enrichment entries stored in
    `.stats["functional"]` and `.stats["ppi"]`, as well as DE comparisons in 
    `.stats["de_results"]` that have not yet been analyzed.

    Returns:
        None

    Example:
        List enrichment results stored after running functional or PPI enrichment:
            ```python
            pdata.list_enrichments()
            ```
    """

    functional = self.stats.get("functional", {})
    ppi_keys = self.stats.get("ppi", {}).keys()
    de_keys = {k for k in self.stats if "vs" in k and not k.endswith(("_up", "_down"))}

    # Collect enriched DE keys based on input_key metadata
    enriched_de = set()
    enriched_results = []

    for k, meta in functional.items():
        input_key = meta.get("input_key", None)

        if input_key and input_key in de_keys:
            base = input_key
            suffix = k.rsplit("_", 1)[-1]
            pretty = f"{_pretty_vs_key(base)}_{suffix}"
            enriched_de.add(base)
            enriched_results.append((pretty, k, "DE-based"))
        else:
            enriched_results.append((k, k, "User"))

    de_unenriched = sorted(_pretty_vs_key(k) for k in (de_keys - enriched_de))

    print(f"{format_log_prefix('user')} Listing STRING enrichment status\n")

    print(f"{format_log_prefix('info_only',2)} Available DE comparisons (not yet enriched):")
    if de_unenriched:
        for pk in de_unenriched:
            print(f"        - {pk}")
    else:
        print("  (none)\n")

    print("\n  🔹 To run enrichment:")
    print("      pdata.enrichment_functional(from_de=True, de_key=\"...\")")

    print(f"\n{format_log_prefix('result_only')} Completed STRING enrichment results:")
    if not enriched_results:
        print("    (none)")
    for pretty, raw_key, kind in enriched_results:
        if kind == "DE-based":
            base, suffix = pretty.rsplit("_", 1)
            print(f"  - {pretty} ({kind})")
            print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
            print(f"    • Plot : pdata.plot_enrichment_svg(\"{base}\", direction=\"{suffix}\")")
            url = self.stats["functional"].get(raw_key, {}).get("string_url")
            if url:
                print(f"    • Link  : {url}")
        else:
            print(f"  - {pretty} ({kind})")
            print(f"    • Table: pdata.stats['functional'][\"{raw_key}\"]['result']")
            print(f"    • Plot : pdata.plot_enrichment_svg(\"{pretty}\")")
            url = self.stats["functional"].get(raw_key, {}).get("string_url")
            if url:
                print(f"    • Link  : {url}")

    if ppi_keys:
        print(f"\n{format_log_prefix('result_only')} Completed STRING PPI results:")
        for key in sorted(ppi_keys):
            print(f"  - {key} (User)")
            print(f"    • Table: pdata.stats['ppi']['{key}']['result']")
    else:
        print(f"\n{format_log_prefix('result_only')} Completed STRING PPI results:")
        print("    (none)")

plot_enrichment_svg

plot_enrichment_svg(key, direction=None, category=None, save_as=None)

Display STRING enrichment SVG inline in a Jupyter notebook.

This method fetches and renders a STRING-generated SVG for a previously completed functional enrichment result. Optionally, the SVG can also be saved to disk.

Parameters:

Name Type Description Default
key str

Enrichment result key from .stats["functional"]. For DE-based comparisons, pass the contrast name without the direction suffix (e.g., "GroupA_Treatment1_vs_Control") and set direction; the method combines the two to look up the stored result.

required
direction str

Direction of DE result, either "up" or "down". Use None for user-defined gene lists.

None
category str

STRING enrichment category to filter by (e.g., "Process", "KEGG"). See the table of supported categories below for options.

None
save_as str

If provided, saves the retrieved SVG to the given file path.

None

Returns:

Type Description

None

Example

Display a STRING enrichment network for a user-supplied gene list:

pdata.plot_enrichment_svg("UserSearch1")

Supported STRING Enrichment Categories

The following category IDs are supported for functional enrichment.
More details are available on the STRING API documentation site.

Category ID Description
Process Biological Process (Gene Ontology)
Function Molecular Function (Gene Ontology)
Component Cellular Component (Gene Ontology)
Keyword Annotated Keywords (UniProt)
KEGG KEGG Pathways
RCTM Reactome Pathways
HPO Human Phenotype (Monarch)
MPO Mammalian Phenotype Ontology (Monarch)
DPO Drosophila Phenotype (Monarch)
WPO C. elegans Phenotype Ontology (Monarch)
ZPO Zebrafish Phenotype Ontology (Monarch)
FYPO Fission Yeast Phenotype Ontology (Monarch)
Pfam Protein Domains (Pfam)
SMART Protein Domains (SMART)
InterPro Protein Domains and Features (InterPro)
PMID Reference Publications (PubMed)
NetworkNeighborAL Local Network Cluster (STRING)
COMPARTMENTS Subcellular Localization (COMPARTMENTS)
TISSUES Tissue Expression (TISSUES)
DISEASES Disease–gene Associations (DISEASES)
WikiPathways WikiPathways
Note

The key must correspond to an existing entry in .stats["functional"], created via enrichment_functional().
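
How `key` and `direction` combine into the stored lookup key can be sketched as follows. This is a simplified stand-in under stated assumptions: the function name `resolve_lookup_key` is hypothetical, and plain concatenation replaces the internal `_resolve_de_key` helper, whose exact behavior is not shown here.

```python
def resolve_lookup_key(functional, key, direction=None):
    """Map a user-facing key (plus optional direction) to a stored result key."""
    if "vs" in key:
        # DE-based keys need a direction to select the up- or down-regulated set.
        if direction not in {"up", "down"}:
            raise ValueError("Specify direction='up' or 'down' for DE-based keys.")
        lookup_key = f"{key}_{direction}"
    else:
        # User-supplied keys are used as-is; any direction is ignored.
        lookup_key = key

    if lookup_key not in functional:
        raise ValueError(f"Could not find enrichment results for '{lookup_key}'.")
    return lookup_key


# Hypothetical stored results.
functional = {"A_vs_B_up": {}, "UserSearch1": {}}
print(resolve_lookup_key(functional, "A_vs_B", direction="up"))
print(resolve_lookup_key(functional, "UserSearch1"))
```

Under this reading, a DE-based result is plotted with `pdata.plot_enrichment_svg("A_vs_B", direction="up")`, which is also the form that `list_enrichments()` prints.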

Source code in src/scpviz/pAnnData/enrichment.py
def plot_enrichment_svg(self, key, direction=None, category=None, save_as=None):
    """
    Display STRING enrichment SVG inline in a Jupyter notebook.

    This method fetches and renders a STRING-generated SVG for a previously completed
    functional enrichment result. Optionally, the SVG can also be saved to disk.

    Args:
        key (str): Enrichment result key from `.stats["functional"]`. For DE-based comparisons, pass 
            the contrast name without the direction suffix (e.g., `"GroupA_Treatment1_vs_Control"`) and set `direction`.
        direction (str, optional): Direction of DE result, either `"up"` or `"down"`. Use `None` for 
            user-defined gene lists.
        category (str, optional): STRING enrichment category to filter by (e.g., `"Process"`, `"KEGG"`). See the table in the note below for options.
        save_as (str, optional): If provided, saves the retrieved SVG to the given file path.

    Returns:
        None

    Example:
        Display a STRING enrichment network for a user-supplied gene list:
            ```python
            pdata.plot_enrichment_svg("UserSearch1")
            ```

    !!! note "Supported STRING Enrichment Categories"
        The following category IDs are supported for functional enrichment.  
        More details are available on the [STRING API documentation site](https://string-db.org/cgi/help?subpage=api).

        | Category ID          | Description                                      |
        |----------------------|--------------------------------------------------|
        | **Process**          | Biological Process (Gene Ontology)               |
        | **Function**         | Molecular Function (Gene Ontology)               |
        | **Component**        | Cellular Component (Gene Ontology)               |
        | **Keyword**          | Annotated Keywords (UniProt)                     |
        | **KEGG**             | KEGG Pathways                                    |
        | **RCTM**             | Reactome Pathways                                |
        | **HPO**              | Human Phenotype (Monarch)                        |
        | **MPO**              | Mammalian Phenotype Ontology (Monarch)           |
        | **DPO**              | Drosophila Phenotype (Monarch)                   |
        | **WPO**              | *C. elegans* Phenotype Ontology (Monarch)        |
        | **ZPO**              | Zebrafish Phenotype Ontology (Monarch)           |
        | **FYPO**             | Fission Yeast Phenotype Ontology (Monarch)       |
        | **Pfam**             | Protein Domains (Pfam)                           |
        | **SMART**            | Protein Domains (SMART)                          |
        | **InterPro**         | Protein Domains and Features (InterPro)          |
        | **PMID**             | Reference Publications (PubMed)                  |
        | **NetworkNeighborAL**| Local Network Cluster (STRING)                   |
        | **COMPARTMENTS**     | Subcellular Localization (COMPARTMENTS)          |
        | **TISSUES**          | Tissue Expression (TISSUES)                      |
        | **DISEASES**         | Disease–gene Associations (DISEASES)             |
        | **WikiPathways**     | WikiPathways                                     |

    Note:
        The `key` must correspond to an existing entry in `.stats["functional"]`, created via 
        `enrichment_functional()`.
    """
    from xml.parsers.expat import ExpatError

    if "functional" not in self.stats:
        raise ValueError("No STRING enrichment results found in .stats['functional'].")

    # Handle DE-type key
    if "vs" in key:
        if direction not in {"up", "down"}:
            raise ValueError("You must specify direction='up' or 'down' for DE-based enrichment keys.")
        lookup_key = _resolve_de_key(self.stats["functional"], f"{key}_{direction}")
    else:
        # Handle user-supplied key (e.g. "userSearch1")
        if direction is not None:
            print(f"[WARNING] Ignoring direction='{direction}' for user-supplied key: '{key}'")
        lookup_key = key

    if lookup_key not in self.stats["functional"]:
        available = "\n".join(f"  - {k}" for k in self.stats["functional"].keys())
        raise ValueError(f"Could not find enrichment results for '{lookup_key}'. Available keys:\n{available}")

    meta = self.stats["functional"][lookup_key]
    string_ids = meta["string_ids"]
    species_id = meta["species"]

    url = "https://string-db.org/api/svg/enrichmentfigure"
    params = {
        "identifiers": "%0d".join(string_ids),
        "species": species_id
    }
    if category:
        params["category"] = category

    print(f"{format_log_prefix('user')} Fetching STRING SVG for key '{lookup_key}' (n={len(string_ids)})...")
    response = requests.get(url, params=params)
    response.raise_for_status()

    if save_as:
        with open(save_as, "wb") as f:
            f.write(response.content)
        print(f"{format_log_prefix('info_only')} Saved SVG to: {save_as}")

    with tempfile.NamedTemporaryFile("wb", suffix=".svg", delete=False) as tmp:
        tmp.write(response.content)
        tmp_path = tmp.name

    try:
        try:
            display(SVG(filename=tmp_path))
        except ExpatError:
            print(
                f"{format_log_prefix('info_only')} No enrichment figure available "
                f"for key '{lookup_key}'"
                + (f" (category='{category}')." if category else ".")
            )
    finally:
        os.remove(tmp_path)

resolve_to_accessions

resolve_to_accessions(mixed_list)

Convert gene names or accessions into standardized UniProt accession IDs.

This method resolves input items using the internal gene-to-accession map, ensuring all returned entries are accessions present in the .prot object.

Parameters:

Name Type Description Default
mixed_list list of str

A list containing gene names and/or UniProt accessions.

required

Returns:

Type Description

tuple of (list of str, list of str): The resolved UniProt accession IDs, followed by the input items that could not be resolved.

Note

This function is similar to utils.resolve_accessions() but operates in the context of the current pAnnData object and its internal gene mappings.

Todo

Add example comparing results from resolve_to_accessions() and utils.resolve_accessions().
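
The resolution logic can be sketched without a pAnnData object. In this minimal illustration the function name `resolve_mixed` is hypothetical, and a plain set and dictionary stand in for `.prot.var.index` and the internal gene-to-accession map.

```python
def resolve_mixed(mixed_list, known_accessions, gene_to_acc):
    """Split inputs into resolved accessions and unresolved items."""
    accs, unresolved = [], []
    for item in mixed_list:
        if item in known_accessions:
            accs.append(item)  # already an accession
        elif item in gene_to_acc:
            accs.append(gene_to_acc[item])  # gene name mapped to its accession
        else:
            unresolved.append(item)  # neither a known accession nor a known gene
    return accs, unresolved


# Stand-ins for the protein index and the gene-to-accession map.
known = {"P04637", "P38398"}
gene_map = {"TP53": "P04637", "BRCA1": "P38398"}

accs, unresolved = resolve_mixed(["TP53", "P38398", "NOTAGENE"], known, gene_map)
print(accs)
print(unresolved)
```

As in the source below, two values come back, so callers unpack the result rather than treating it as a single list.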

Source code in src/scpviz/pAnnData/enrichment.py
def resolve_to_accessions(self, mixed_list):
    """
    Convert gene names or accessions into standardized UniProt accession IDs.

    This method resolves input items using the internal gene-to-accession map,
    ensuring all returned entries are accessions present in the `.prot` object.

    Args:
        mixed_list (list of str): A list containing gene names and/or UniProt accessions.

    Returns:
        tuple of (list of str, list of str): The resolved UniProt accession IDs, followed by
            the input items that could not be resolved.

    Note:
        This function is similar to `utils.resolve_accessions()` but operates in the context 
        of the current `pAnnData` object and its internal gene mappings.

    Todo:
        Add example comparing results from `resolve_to_accessions()` and `utils.resolve_accessions()`.
    """
    gene_to_acc, _ = self.get_gene_maps(on='protein') 
    accs = []
    unresolved_accs = []
    for item in mixed_list:
        if item in self.prot.var.index:
            accs.append(item)  # already an accession
        elif item in gene_to_acc:
            accs.append(gene_to_acc[item])
        else:
            unresolved_accs.append(item)
            # print(f"{format_log_prefix('warn_only',2)} Could not resolve '{item}' to an accession — skipping.")
    return accs, unresolved_accs