Utilities
Utility functions for scpviz.
This package provides helper and processing functions used throughout scpviz. Import as:
from scpviz import utils as scutils
Submodules (for maintainers): formatting, data, class_filter, id_maps, stats.
Text / formatting
Functions:
| Name | Description |
|---|---|
| `format_log_prefix` | Return standardized log prefixes for messages. |
Data access + transformation
Functions:
| Name | Description |
|---|---|
| `parse_filename_index` | Parse sample metadata from filename columns. |
| `get_samplenames` | Resolve sample names for given classes from `.obs` metadata. |
| `get_classlist` | Return unique class values for specified metadata columns. |
| `get_adata_layer` | Safely extract a matrix from `.X` or `.layers[...]`. |
| `get_adata` | Retrieve the protein- or peptide-level AnnData object. |
| `get_abundance` | Extract abundance data from pAnnData or AnnData. |
| `resolve_accessions` | Map gene names or accessions to accession IDs in `.var_names`. |
| `get_pep_prot_mapping` | Determine peptide-to-protein mapping column. |
| `update_layer_provenance` | Register a matrix layer in the layer-provenance registry. |
| `resolve_input_layer` | Map |
| `infer_layer_is_log` | Infer log-transformed layers via provenance or name heuristic. |
Sample selection / set logic
Functions:
| Name | Description |
|---|---|
| `format_class_filter` | Standardize class/value inputs for filtering. |
| `filter` | Legacy sample filtering (prefer `pAnnData.filter_sample_values()`). |
| `resolve_class_filter` | Resolve class/value pairs and apply filtering. |
| `get_upset_contents` | Build contents for UpSet plots from pAnnData. |
| `get_upset_query` | Query features present/absent in UpSet contents. |
Identifier mappings (UniProt / STRING)
Functions:
| Name | Description |
|---|---|
| `get_uniprot_fields_worker` | Low-level UniProt REST API query function (batch up to 1024). |
| `get_uniprot_fields` | High-level UniProt API wrapper with batching. |
| `standardize_uniprot_columns` | Normalize UniProt column names for stable downstream use. |
| `get_string_mappings` | Map UniProt accessions to STRING IDs (UniProt + STRING fallback). |
| `convert_identifiers` | Convert between accession / gene / STRING / organism_id. |
Statistics
Functions:
| Name | Description |
|---|---|
| `pairwise_log2fc` | Compute pairwise median log2 fold change between groups. |
| `de_adata` | Differential expression helper over AnnData matrices. |
| `get_pca_importance` | Identify most important features for PCA components. |
| `get_protein_clusters` | Retrieve hierarchical clusters from stored linkage. |
Warning
Many functions here are internal helpers. For common workflows (filtering, plotting, enrichment),
prefer the corresponding pAnnData methods when available.
convert_identifiers
convert_identifiers(
ids: list[str],
from_type: str,
to_type: str | list[str],
pdata: pAnnData | None = None,
use_cache: bool = True,
return_type: str = "dict",
verbose: bool = True,
) -> (
dict[str, dict[str, Any]]
| pd.DataFrame
| tuple[dict[str, dict[str, Any]], pd.DataFrame]
)
Convert identifiers between UniProt-compatible types.
Supports mapping between protein accessions, gene names, STRING IDs, and organism IDs. Multiple output types may be requested at once.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ids | list of str | Input identifiers. | required |
| from_type | str | Source identifier type ('accession', 'gene'). 'organism_id' cannot be used as a source. | required |
| to_type | str or list of str | Target identifier type(s). May include any of: ['gene', 'string', 'organism_id']. | required |
| pdata | pAnnData | pAnnData object providing cached accession–gene mappings. | None |
| use_cache | bool | Whether to use cached mappings from `pdata`. | True |
| return_type | str | Output format. 'dict': {input → {to_type → value}}; 'df': DataFrame with columns [from_type, *to_type]; 'both': (dict, DataFrame). | 'dict' |
| verbose | bool | Whether to print progress messages. | True |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, Any]] \| DataFrame \| tuple[dict[str, dict[str, Any]], DataFrame] | dict, pandas.DataFrame, or tuple, depending on `return_type`. |
Example
convert_identifiers(["P12345", "Q9XYZ1"], "accession", "gene", pdata=pdata)
convert_identifiers(["P12345"], "accession", ["gene", "string", "organism_id"], return_type="df")
Source code in src/scpviz/utils/id_maps.py
de_adata
de_adata(
adata: AnnData,
values: list[dict[str, Any] | list[str]] | None = None,
class_type: str | list[str] | None = None,
method: str = "ttest",
fold_change_mode: str = "mean",
layer: str = "X",
pval: float = 0.05,
log2fc: float = 1.0,
data_is_log: bool = False,
log_base: float = 2.0,
pseudocount: float = 1.0,
gene_col: str | None = None,
) -> pd.DataFrame
Standalone DE analysis for AnnData. Produces a volcano-ready DataFrame identical to pdata.de().
Supports
- Legacy-style: class_type="condition", values=["A","B"]
- Legacy multi-col: class_type=["cellline","treatment"], values=[["HCT116","DMSO"], ["HCT116","Drug"]]
- Dictionary-style: values=[{"cellline":"HCT116","treatment":"DMSO"}, {...}]
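All three input styles above describe the same thing: one set of column/value pairs per group. The sketch below shows one way such inputs could be reduced to a common list-of-dicts form. `normalize_group_filters` is a hypothetical helper written for illustration only; it is not part of scpviz, and the library's internal handling may differ.

```python
from __future__ import annotations

from typing import Any


def normalize_group_filters(
    values: list[Any],
    class_type: str | list[str] | None = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: turn each group spec into a {column: value} dict."""
    groups = []
    for group in values:
        if isinstance(group, dict):
            # Dictionary-style: already in the target form.
            groups.append(dict(group))
        elif isinstance(class_type, str):
            # Legacy-style: a single column, one value per group.
            groups.append({class_type: group})
        elif isinstance(class_type, list):
            # Legacy multi-column: values align positionally with class_type.
            if len(group) != len(class_type):
                raise ValueError("each group needs one value per class_type column")
            groups.append(dict(zip(class_type, group)))
        else:
            raise ValueError("class_type is required for non-dict values")
    return groups


legacy = normalize_group_filters(["A", "B"], class_type="condition")
multi = normalize_group_filters(
    [["HCT116", "DMSO"], ["HCT116", "Drug"]],
    class_type=["cellline", "treatment"],
)
```

Under these assumptions, `multi` and the dictionary-style input describe the same two groups.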
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | AnnData object. | required |
| values | list of dict or list of list | Sample group filters to compare. | None |
| class_type | str or list of str | Legacy-style class label(s) used to interpret `values`. | None |
| method | str | 'ttest', 'mannwhitneyu', 'wilcoxon'. | 'ttest' |
| fold_change_mode | str | 'mean' or 'pairwise_median'. | 'mean' |
| layer | str | Layer to use. Default is 'X'. | 'X' |
| pval | float | p-value threshold. | 0.05 |
| log2fc | float | log2 fold change threshold. | 1.0 |
| data_is_log | bool | If True, treat the data as already log-transformed. | False |
| log_base | float | Base of the log used in the log-transformed data. | 2.0 |
| pseudocount | float | If data is log of (x + pseudocount), provide that here (e.g., 1.0 for log2(x+1)). | 1.0 |
| gene_col | str | Column in | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | pandas.DataFrame: DE results with volcano-ready columns. |
Source code in src/scpviz/utils/stats.py
filter
filter(
pdata: pAnnData | AnnData,
class_type: str | list[str],
values: dict[str, Any] | list[Any] | str,
exact_cases: bool = False,
debug: bool = False,
) -> pAnnData | ad.AnnData
Legacy-style filtering of samples in pAnnData or AnnData objects.
This function filters samples based on metadata values using the older
(class_type, values) interface. For pAnnData objects, it automatically
delegates to .filter_sample_values() after converting the input into the
recommended dictionary-style format.
Warning
For pAnnData users, prefer .filter_sample_values() with dictionary-style
input, as it is more flexible and consistent. The filter() utility is
retained primarily for backward compatibility and direct AnnData usage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdata | pAnnData or AnnData | Input data object to filter. | required |
| class_type | str or list of str | Metadata field(s) to filter on. | required |
| values | list, dict, or list of dict | Metadata values to match. | required |
| exact_cases | bool | Whether to interpret `values` as exact case combinations. | False |
| debug | bool | If True, print the query string used for filtering. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| filtered | pAnnData or AnnData | A filtered object of the same type as `pdata`. |
Raises:
| Type | Description |
|---|---|
| ValueError | If input types are invalid, if fields are missing in |
Example
Filter samples by a single metadata field:
Filter by multiple fields with OR logic:
samples = utils.filter(
adata,
class_type=["cell_type", "treatment"],
values=[["wt", "kd"], ["control", "treatment"]]
)
# returns samples where cell_type is either 'wt' or 'kd' and treatment is either 'control' or 'treatment'
Filter by exact case combinations:
samples = utils.filter(
adata,
class_type=["cell_type", "treatment"],
values=[{"cell_type": "wt", "treatment": "control"},
{"cell_type": "kd", "treatment": "treatment"}],
exact_cases=True
)
# returns samples where cell_type is 'wt' and treatment is 'control', or cell_type is 'kd' and treatment is 'treatment'
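The difference between the two modes is easiest to see on plain records. The following is a minimal sketch of the matching semantics only, using hypothetical helpers `match_or` and `match_exact` on a list of dictionaries; it does not call scpviz and makes no claim about how the library builds its query.

```python
from __future__ import annotations

from typing import Any


def match_or(sample: dict[str, Any], spec: dict[str, list[Any]]) -> bool:
    """Loose matching: every field must take one of its allowed values."""
    return all(sample[field] in allowed for field, allowed in spec.items())


def match_exact(sample: dict[str, Any], cases: list[dict[str, Any]]) -> bool:
    """Exact cases: the sample must equal at least one full combination."""
    return any(
        all(sample[field] == value for field, value in case.items())
        for case in cases
    )


samples = [
    {"cell_type": "wt", "treatment": "control"},
    {"cell_type": "wt", "treatment": "treatment"},
    {"cell_type": "kd", "treatment": "treatment"},
    {"cell_type": "ko", "treatment": "control"},
]

loose = [
    s for s in samples
    if match_or(s, {"cell_type": ["wt", "kd"], "treatment": ["control", "treatment"]})
]
exact = [
    s for s in samples
    if match_exact(s, [
        {"cell_type": "wt", "treatment": "control"},
        {"cell_type": "kd", "treatment": "treatment"},
    ])
]
```

Loose matching keeps every `wt` or `kd` sample regardless of pairing, while exact matching keeps only the two listed combinations.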
Source code in src/scpviz/utils/class_filter.py
format_class_filter
format_class_filter(
classes: str | list[str],
class_value: str | list[str] | list[list[str]],
exact_cases: bool = False,
) -> dict[str, Any] | list[dict[str, Any]]
Convert legacy-style filter input into dictionary-style format.
This function standardizes (classes, class_value) input into the dictionary
format expected by pAnnData.filter_sample_values(). It supports both loose
OR-style filtering and exact case matching across multiple metadata fields.
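As a rough illustration of that conversion, the sketch below reimplements the idea in plain Python. `to_dict_filter` is a hypothetical stand-in, not the library function: its handling of shapes and of underscore-joined strings is an assumption, and splitting on `_` would misbehave for values that themselves contain underscores.

```python
from __future__ import annotations

from typing import Any


def to_dict_filter(
    classes: str | list[str],
    class_value: Any,
    exact_cases: bool = False,
) -> dict[str, Any] | list[dict[str, Any]]:
    """Hypothetical sketch of legacy (classes, class_value) -> dict-style input."""
    if isinstance(classes, str):
        # Single field: the values are used as-is (OR logic across values).
        return {classes: class_value}
    if not exact_cases:
        # Loose matching: one entry of class_value per field.
        if len(class_value) != len(classes):
            raise ValueError("class_value must have one entry per class")
        return dict(zip(classes, class_value))
    cases = []
    for case in class_value:
        # Exact cases: accept "wt_control" or ["wt", "control"].
        parts = case.split("_") if isinstance(case, str) else list(case)
        if len(parts) != len(classes):
            raise ValueError("each case needs one value per class")
        cases.append(dict(zip(classes, parts)))
    return cases


loose = to_dict_filter(["cell_type", "treatment"], [["wt", "kd"], ["control", "treatment"]])
exact = to_dict_filter(["cell_type", "treatment"], ["wt_control", "kd_treatment"], exact_cases=True)
```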
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| classes | str or list of str | Metadata field(s) to filter on. | required |
| class_value | str, list of str, or list of list | Values to filter by. A str may be underscore-joined. | required |
| exact_cases | bool | If True, return a list of dictionaries representing exact combinations across fields. If False, return a dictionary with OR logic applied. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| formatted | dict or list of dict | Dictionary-style filter input compatible with `pAnnData.filter_sample_values()`. |
Raises:
| Type | Description |
|---|---|
| ValueError | If input shapes are inconsistent with the number of classes. |
Example
Single class with OR logic:
Multiple classes with loose matching:
Multiple classes with exact cases (underscore-joined strings):
Multiple classes with exact cases (list of lists):
Note
This function is primarily used internally by utils.filter() and
pAnnData.filter_sample_values(). End users should generally call
.filter_sample_values() directly on pAnnData objects instead of
using this helper.
Source code in src/scpviz/utils/class_filter.py
get_abundance
Wrapper to extract abundance from either pAnnData or AnnData.
This is a convenience wrapper that dispatches to the appropriate method:
- If pdata is a pAnnData object, it calls pdata.get_abundance().
- If pdata is an AnnData object, it falls back to the internal
helper _get_abundance_from_adata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdata | pAnnData or AnnData | Input object to extract abundance from. | required |
| *args | Any | Positional arguments forwarded to the underlying `get_abundance` implementation. | () |
| **kwargs | Any | Keyword arguments forwarded to the underlying `get_abundance` implementation. | {} |
|
Note
See pAnnData.get_abundance for full parameter documentation. Briefly,
- namelist (list of str, optional): List of accessions or gene names to extract.
- layer (str): Data layer name (default = "X").
- on (str): "protein" or "peptide".
- classes (str or list of str, optional): Sample-level `.obs` column(s) to include.
- log (bool): If True, applies log2 transform to abundance values.
- x_label (str): Label features by "gene" or "accession".
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | Long-form abundance DataFrame, optionally with sample metadata and protein/peptide annotations. |
See Also
- pAnnData.get_abundance (EditingMixin): Full-featured version with detailed docs.
- get_adata_layer: Helper to access abundance matrices from AnnData layers.
Source code in src/scpviz/utils/data.py
get_adata
Retrieve the protein- or peptide-level AnnData object from a pAnnData container.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdata | pAnnData | The parent pAnnData object containing both protein- and peptide-level data. | required |
| on | str | Which data object to return. | 'protein' |
Returns:
| Name | Type | Description |
|---|---|---|
| adata | AnnData | The requested AnnData object. |
Source code in src/scpviz/utils/data.py
get_adata_layer
Safely extract layer data as dense numpy array.
This helper returns the requested layer as a dense numpy.ndarray,
ensuring compatibility for downstream operations. Supports both
.X and .layers[...].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | AnnData object containing data matrices. | required |
| layer | str | Layer key. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| data | ndarray | Dense matrix representation of the requested layer. |
Source code in src/scpviz/utils/data.py
get_classlist
get_classlist(
adata: AnnData,
classes: str | list[str] | None = None,
order: list[str] | None = None,
) -> list[str]
Retrieve unique class values for specified metadata columns. Useful for plot legends.
Unlike get_samplenames, which returns one identifier per row/sample,
this function extracts the set of unique class values for grouping
purposes (e.g., plotting categories). Supports optional reordering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | AnnData object containing sample metadata. | required |
| classes | str or list of str | Column(s) in `.obs`. | None |
| order | list of str | Custom order of categories. Must exactly match the unique values; otherwise, a `ValueError` is raised. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| class_list | list of str | Unique class values. |
Raises:
| Type | Description |
|---|---|
| ValueError | If invalid columns are provided, or if `order` does not match the unique values. |
Example
Get unique values from one metadata column:
Combine two columns and return unique class labels:
Reorder categories explicitly:
Source code in src/scpviz/utils/data.py
get_pca_importance
get_pca_importance(
model: dict[str, Any] | PCA,
initial_feature_names: list[str],
n: int = 1,
) -> pd.DataFrame
Identify the most important features for each principal component.
This function ranks features by their absolute PCA loading values and extracts the top contributors for each principal component.
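The ranking step can be sketched without any dependencies. The example below assumes loadings arranged as one row per component and one column per feature (the layout of scikit-learn's `PCA.components_`); `top_features` and the `PC1`, `PC2` labels are illustrative names, not the scpviz implementation or its output format.

```python
def top_features(components, feature_names, n=1):
    """Return the n features with the largest absolute loading per component."""
    result = {}
    for i, loadings in enumerate(components, start=1):
        # Sort feature indices by |loading|, largest first.
        ranked = sorted(
            range(len(loadings)),
            key=lambda j: abs(loadings[j]),
            reverse=True,
        )
        result[f"PC{i}"] = [feature_names[j] for j in ranked[:n]]
    return result


# Toy loadings: 2 components x 3 features.
components = [[0.1, -0.9, 0.4], [0.7, 0.2, -0.3]]
names = ["P1", "P2", "P3"]
top = top_features(components, names, n=2)
```

Note that the sign of a loading is ignored for ranking, so a strongly negative loading counts as an important feature.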
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | PCA or dict | Either a fitted PCA model from scikit-learn, or a dictionary with key | required |
| initial_feature_names | list of str | Names of the features. | required |
| n | int | Number of top features to return per principal component (default = 1). | 1 |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | DataFrame with one row per principal component, listing the top contributing features. |
Example
Retrieve the top 5 features contributing to each PC:
Source code in src/scpviz/utils/stats.py
get_pep_prot_mapping
Retrieve the peptide-to-protein mapping column or mapping values.
This function resolves the appropriate .pep.var column for peptide-to-protein
mapping based on the data source recorded in pdata.metadata["source"].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdata | pAnnData | The annotated proteomics object containing `.pep`. | required |
| return_series | bool | If True, return a pandas Series of peptide-to-protein mappings. If False (default), return the column name as a string. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| col | str | Column name in `.pep.var`, if `return_series=False`. |
| mapping | Series | Series mapping peptides to proteins, if `return_series=True`. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the data source is unrecognized or no valid mapping column is found. |
Note
The mapping column depends on the import source:
- Proteome Discoverer → "Master Protein Accessions"
- DIA-NN → "Protein.Group"
- MaxQuant → "Leading razor protein"
Source code in src/scpviz/utils/data.py
get_protein_clusters
get_protein_clusters(
pdata: pAnnData,
on: str = "prot",
layer: str = "X",
t: int = 5,
criterion: str = "maxclust",
) -> dict[Any, list[str]] | None
Retrieve hierarchical clusters of proteins from stored linkage.
This function uses linkage information stored in pdata.stats to
partition proteins into clusters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdata | pAnnData | Input object containing stored linkage in `pdata.stats`. | required |
| on | str | Data level to use. | 'prot' |
| layer | str | Data layer name used when the linkage was computed (default = 'X'). | 'X' |
| t | int or float | Number of clusters (if `criterion` is 'maxclust'). | 5 |
| criterion | str | Clustering criterion passed to | 'maxclust' |
Returns:
| Name | Type | Description |
|---|---|---|
| clusters | dict | Mapping of |
| None | dict[Any, list[str]] \| None | If no linkage is found in `pdata.stats`. |
Note
Requires that a clustermap has been previously computed and linkage
stored under pdata.stats[f"{on}_{layer}_clustermap"].
Source code in src/scpviz/utils/stats.py
get_samplenames
Retrieve sample names for specified class values.
This function resolves .obs metadata into sample-level identifiers
(one name per row). It is typically used for plotting functions where
sample names are required for labeling or grouping.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | AnnData object containing sample metadata. | required |
| classes | str or list of str | Column(s) in `.obs`. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sample_names | list of str | Sample names derived from `classes`. |
Example
Get sample names from a single metadata column:
Combine multiple columns into sample identifiers:
Source code in src/scpviz/utils/data.py
get_string_mappings
get_string_mappings(
identifiers: list[str],
use_uniprot: bool = True,
use_string: bool = True,
caller_identity: str = "scpviz",
batch_size: int = 100,
debug: bool = False,
) -> pd.DataFrame
Resolve STRING identifiers for a list of UniProt accessions.
This function maps UniProt protein accessions to STRING IDs using a two-step strategy:
- UniProt lookup – retrieves STRING cross-references (xref_string) and organism IDs via the UniProt API (fast).
- STRING API lookup – queries the STRING get_string_ids endpoint for any identifiers not resolved via UniProt.
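The two-step strategy is a lookup with fallback. The sketch below shows that control flow with the network calls replaced by injected callables, so it runs offline. `map_with_fallback` is a hypothetical name, and the accessions and STRING IDs in the toy tables are made-up placeholders rather than real records.

```python
def map_with_fallback(identifiers, primary, fallback):
    """Try `primary` for every identifier, then `fallback` for the misses."""
    # Step 1: fast lookup for all identifiers (stands in for UniProt).
    mapping = {acc: primary(acc) for acc in identifiers}
    # Step 2: only unresolved identifiers go to the slower lookup (stands in for STRING).
    unresolved = [acc for acc, string_id in mapping.items() if string_id is None]
    for acc in unresolved:
        mapping[acc] = fallback(acc)
    return mapping


# Placeholder lookup tables; real calls would hit the UniProt and STRING APIs.
uniprot_table = {"ACC_A": "0000.PLACEHOLDER_A"}
string_table = {"ACC_B": "0000.PLACEHOLDER_B"}

result = map_with_fallback(
    ["ACC_A", "ACC_B", "ACC_C"],
    primary=uniprot_table.get,
    fallback=string_table.get,
)
```

Identifiers that neither source resolves stay mapped to `None` in this sketch; how the library represents such misses is not specified here.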
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| identifiers | list of str | List of UniProt accession IDs to map. | required |
| use_uniprot | bool | If True (default), attempt mapping via UniProt. | True |
| use_string | bool | If True (default), query the STRING API for any identifiers still unresolved after the UniProt step. | True |
| caller_identity | str | Identifier passed to the STRING API (default: "scpviz"). | 'scpviz' |
| batch_size | int | Number of identifiers per batch when querying external APIs (default=100). | 100 |
| debug | bool | If True, print progress and debug information. | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | pandas.DataFrame: Mapping table with one row per input identifier and the following columns: |
Example
Map a small set of UniProt accessions to STRING IDs:
Disable the UniProt shortcut and query STRING directly (takes longer than UniProt):
Source code in src/scpviz/utils/id_maps.py
get_uniprot_fields
get_uniprot_fields(
prot_list: list[str],
search_fields: list[str] | None = None,
batch_size: int = 100,
verbose: bool = True,
standardize: bool = True,
worker_verbose: bool = False,
) -> pd.DataFrame
Retrieve UniProt metadata for a list of protein accessions.
This function wraps get_uniprot_fields_worker to handle batching of
protein IDs, returning results as a single DataFrame.
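Batching here amounts to slicing the accession list into fixed-size chunks and querying each chunk in turn. The following minimal sketch shows the chunking only. `iter_batches` is a hypothetical helper, and raising on a batch size above 1024 is an assumption about how the documented limit might be enforced.

```python
def iter_batches(items, batch_size=100):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if not 1 <= batch_size <= 1024:
        # Assumed guard, mirroring the documented maximum of 1024 per request.
        raise ValueError("batch_size must be between 1 and 1024")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 250 placeholder accessions split into batches of 100.
accessions = [f"ACC{i:04d}" for i in range(250)]
batches = list(iter_batches(accessions, batch_size=100))
sizes = [len(batch) for batch in batches]
```

Each batch would then be passed to the worker and the per-batch results concatenated into one table.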
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prot_list | list of str | List of protein accessions. | required |
| search_fields | list of str | UniProt fields to return. Defaults include accession, gene names, GO terms, and STRING IDs. | None |
| batch_size | int | Number of accessions per batch (max 1024, default=100). | 100 |
| verbose | bool | If True, print progress messages. | True |
| standardize | bool | If True (default), normalize UniProt column names to canonical lowercase keys (e.g., "gene_primary", "organism_id", "xref_string") for consistent downstream processing. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | DataFrame containing UniProt metadata for the input proteins. |
Example
Query UniProt for a small set of proteins:
proteins = ["P40925", "P40926"]
df = get_uniprot_fields(proteins)
df[["Entry", "Gene Names", "Organism Id"]].head()
Retrieve raw UniProt field names without renaming:
df_raw = get_uniprot_fields(proteins, standardize=False)
Source code in src/scpviz/utils/id_maps.py
get_uniprot_fields_worker
get_uniprot_fields_worker(
prot_list: list[str],
search_fields: list[str] | None = None,
verbose: bool = False,
) -> pd.DataFrame
Query UniProt for a batch of protein accessions.
This function sends requests to the UniProt REST API for up to 1024 proteins at a time and returns the requested fields as a DataFrame. It handles isoform accessions, fallback queries, and UniProt ID redirects automatically.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prot_list | list of str | List of protein accessions or IDs. | required |
| search_fields | list of str | UniProt return fields. See: https://www.uniprot.org/help/return_fields | None |
| verbose | bool | If True, print progress messages and missing accessions. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | DataFrame containing UniProt metadata for the input proteins. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Info
- This function is intended as a worker and is usually called by get_uniprot_fields.
- It automatically resolves canonical vs. isoform accessions and will attempt UniProt ID mapping if some accessions cannot be found.
Source code in src/scpviz/utils/id_maps.py
get_upset_contents
get_upset_contents(
pdata: pAnnData,
classes: str | list[str],
on: str = "protein",
upsetForm: bool = True,
debug: bool = False,
) -> pd.DataFrame | dict[str, list[str]]
Construct contents for an UpSet plot from a pAnnData object.
This function extracts feature sets (proteins or peptides) present in
specified sample classes and returns them either as a dictionary or
in an upsetplot-compatible format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdata | pAnnData | The pAnnData object containing | required |
| classes | str or list of str | Metadata column(s) in | required |
| on | str | Data level to use. | 'protein' |
| upsetForm | bool | If True, return an upsetplot-compatible DataFrame; if False, return a dictionary. | True |
| debug | bool | If True, print filtering steps and class resolution details. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| upset_data | DataFrame | Binary presence/absence DataFrame for use with upsetplot, if `upsetForm=True`. |
| upset_dict | dict | Mapping of class → list of present features, if `upsetForm=False`. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
Example
Get contents for an UpSet plot of sample classes:
upset_data = get_upset_contents(pdata, classes="treatment")
from upsetplot import UpSet
UpSet(upset_data, subset_size="count").plot()
Retrieve raw dictionary of sets instead:
Query proteins from a set and highlight them in a plot:
upset_data = scutils.get_upset_contents(pdata, classes="condition")
prot_df = scutils.get_upset_query(upset_data, present=["treated"], absent=["control"])
scplt.plot_rankquant(ax, pdata, classes="condition", cmap=cmaps, color=colors)
scplt.mark_rankquant(ax, pdata, mark_df=prot_df, class_values=["treated"], color="black")
Source code in src/scpviz/utils/class_filter.py
get_upset_query
Query features from UpSet contents given inclusion and exclusion criteria.
This function extracts the set of features (proteins or peptides) that are present in all specified groups and absent in others. It then queries UniProt metadata for the resulting accessions.
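The selection itself is ordinary set algebra: intersect the `present` groups, then subtract every `absent` group. The sketch below shows this on a plain dictionary of sets. `query_sets` is a hypothetical helper that operates on a class → features mapping rather than on the UpSet DataFrame, and it omits the UniProt annotation step.

```python
def query_sets(contents, present, absent):
    """Features found in every `present` group and in no `absent` group."""
    if not present:
        raise ValueError("at least one 'present' group is required")
    # Features shared by all required groups.
    selected = set.intersection(*(set(contents[group]) for group in present))
    # Drop anything seen in an excluded group.
    for group in absent:
        selected -= set(contents[group])
    return sorted(selected)


# Toy contents with placeholder feature names.
contents = {
    "treated": ["P1", "P2", "P3"],
    "control": ["P2", "P4"],
    "vehicle": ["P1", "P3", "P5"],
}
only_treated = query_sets(contents, present=["treated"], absent=["control"])
shared = query_sets(contents, present=["treated", "vehicle"], absent=["control"])
```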
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| upset_content | DataFrame | Output from `get_upset_contents`. | required |
| present | list of str | List of groups in which the features must be present. | required |
| absent | list of str | List of groups in which the features must be absent. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| prot_query_df | DataFrame | DataFrame of features matching the query, annotated with UniProt metadata. |
Example
Query proteins unique to one group and highlight them in a plot:
upset_data = scutils.get_upset_contents(pdata, classes="condition")
prot_df = scutils.get_upset_query(upset_data, present=["treated"], absent=["control"])
scplt.plot_rankquant(ax, pdata, classes="condition", cmap=cmaps, color=colors)
scplt.mark_rankquant(ax, pdata, mark_df=prot_df, class_values=["treated"], color="black")
Source code in src/scpviz/utils/class_filter.py
infer_layer_is_log
Infer whether a layer contains log-transformed values.
- Registry (if adata is given and adata.uns['layer_provenance'] exists): walk ancestors via input_layer (cycle-safe). If any step has op == "log_transform", return True. If layer is registered and no log_transform appears, return False.
- Name fallback: "log" in layer.lower().
Standalone AnnData objects (e.g. passed into low-level utils helpers)
often have no layer_provenance and no pAnnData .history; only the
name heuristic applies unless you populate uns['layer_provenance'] yourself.
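The lookup order described above can be sketched with a plain dictionary standing in for `uns['layer_provenance']`. The registry shape used here (layer name → entry with `op` and `input_layer` keys) is an assumption made for illustration, and `layer_is_log` is a hypothetical stand-in rather than the library function.

```python
def layer_is_log(layer, provenance=None):
    """Registry walk first, then fall back to the name heuristic."""
    if provenance and layer in provenance:
        seen = set()
        current = layer
        # Follow input_layer links; `seen` stops the walk on a cycle.
        while current in provenance and current not in seen:
            seen.add(current)
            entry = provenance[current]
            if entry.get("op") == "log_transform":
                return True
            current = entry.get("input_layer")
        # Registered layer with no log_transform among its ancestors.
        return False
    # Unregistered layer: name heuristic only.
    return "log" in layer.lower()


provenance = {
    "X_norm": {"op": "normalize", "input_layer": "X_l2"},
    "X_l2": {"op": "log_transform", "input_layer": "X"},
    "a": {"op": "impute", "input_layer": "b"},
    "b": {"op": "scale", "input_layer": "a"},  # deliberate cycle a <-> b
}
```

In this sketch `X_norm` is reported as log-transformed even though its name contains no "log", because its ancestor `X_l2` was produced by a `log_transform` step.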
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| layer | str | Layer name to inspect. | required |
| adata | Optional[AnnData] | Optional AnnData carrying `uns['layer_provenance']`. | None |
Returns:
| Type | Description |
|---|---|
| bool | True if the layer is treated as log-transformed. |
Source code in src/scpviz/utils/data.py
pairwise_log2fc
Compute pairwise median log2 fold change (log2FC) between two groups.
This function calculates all pairwise log2 ratios between features in
two groups of samples and returns the median value per feature. It is
primarily used as a helper for fold-change strategies in pAnnData.de().
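For intuition, here is a dependency-free sketch of the computation. It assumes rows are samples, columns are features, values are raw positive abundances, and the ratio is taken as group 1 over group 2; none of these conventions is confirmed by the text above. `pairwise_median_log2fc` is a hypothetical name, not the library function.

```python
import math
from statistics import median


def pairwise_median_log2fc(group1, group2):
    """Median of log2(a / b) over all sample pairs (a in group1, b in group2)."""
    n_features = len(group1[0])
    result = []
    for j in range(n_features):
        # Every sample in group1 is compared against every sample in group2.
        ratios = [math.log2(a[j] / b[j]) for a in group1 for b in group2]
        result.append(median(ratios))
    return result


# Two samples per group, two features.
group1 = [[8.0, 2.0], [16.0, 2.0]]
group2 = [[2.0, 2.0], [4.0, 4.0]]
fold_changes = pairwise_median_log2fc(group1, group2)
```

For the first feature the four pairwise log2 ratios are 2, 1, 3, and 2, giving a median of 2.0. For the second they are 0, -1, 0, and -1, giving -0.5.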
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data1 | ndarray | Array of shape | required |
| data2 | ndarray | Array of shape | required |
Returns:
| Name | Type | Description |
|---|---|---|
| median_log2fc | ndarray | Array containing the median pairwise log2 fold change for each feature. |
Note
This is an internal helper for differential expression calculations.
End users should call pAnnData.de() instead of using this function directly.
Source code in src/scpviz/utils/stats.py
parse_filename_index
parse_filename_index(
df: DataFrame,
obs_columns: list[str],
delimiter: str = "_",
condition: str | None = None,
) -> pd.DataFrame
Parse DataFrame index (filenames) into metadata columns based on a list of obs_columns. Can label a subset based on condition.
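The core step is splitting each filename on the delimiter and pairing the pieces with `obs_columns`. The sketch below shows this for a single name. `parse_filename` is a hypothetical helper, the example filename is invented, and raising on a length mismatch is an assumption, since the library's behavior in that case is not documented here.

```python
def parse_filename(name, obs_columns, delimiter="_"):
    """Split one filename into a {column: token} mapping."""
    parts = name.split(delimiter)
    if len(parts) != len(obs_columns):
        # Assumed behavior: refuse names that do not match the column list.
        raise ValueError(
            f"{name!r} has {len(parts)} fields, expected {len(obs_columns)}"
        )
    return dict(zip(obs_columns, parts))


row = parse_filename("HCT116_DMSO_rep1", ["cellline", "treatment", "replicate"])
```

Applied to every index entry, mappings like `row` become the added metadata columns.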
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame whose index contains delimited filenames. | required |
| obs_columns | list of str | Names of the metadata columns to extract from the filename. | required |
| delimiter | str | Character used to split the filename. Default is "_". | '_' |
| condition | str or None | Optional boolean expression (evaluated with df.eval) that selects a subset of rows for parsing. If None, parse all rows. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | pd.DataFrame: Copy of df with added metadata columns. |
Source code in src/scpviz/utils/data.py
resolve_accessions
resolve_accessions(
adata: AnnData | pAnnData,
namelist: list[str],
gene_col: str = "Genes",
gene_map: dict[str, str] | None = None,
) -> list[str] | None
Resolve gene or accession names to accession IDs from .var_names.
This function maps user-specified identifiers (gene names or accession IDs)
to the canonical accession IDs in an AnnData or pAnnData object. It first
checks .var_names for exact matches, then optionally resolves gene names
via a specified column (default "Genes"). Unmatched names are reported.
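The resolution order can be illustrated on plain Python containers. In the sketch below, `resolve_names` is a hypothetical stand-in that takes the accession list and the gene → accession map directly; the identifiers are placeholders, and returning the unmatched names alongside the result is a choice made for the example, not the library's return signature.

```python
def resolve_names(namelist, var_names, gene_map):
    """Exact accession match first, then gene-name lookup."""
    known = set(var_names)
    resolved, unmatched = [], []
    for name in namelist:
        if name in known:
            # Already an accession present in var_names.
            resolved.append(name)
        elif name in gene_map:
            # Gene name: translate to its accession.
            resolved.append(gene_map[name])
        else:
            unmatched.append(name)
    if not resolved:
        raise ValueError("none of the provided names could be resolved")
    return resolved, unmatched


var_names = ["ACC_1", "ACC_2"]
gene_map = {"GENE_A": "ACC_1"}
resolved, unmatched = resolve_names(["GENE_A", "ACC_2", "UNKNOWN"], var_names, gene_map)
```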
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData or pAnnData | AnnData-like object whose .var_names hold the accession IDs. | required |
| namelist | list of str | Input identifiers to resolve (genes or accessions). | required |
| gene_col | str | Column used to resolve gene names. | 'Genes' |
| gene_map | dict | Precomputed mapping of gene → accession. If None, a mapping is constructed from the gene_col column. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| resolved | list of str | List of accession IDs corresponding to the input names. |
Raises:
| Type | Description |
|---|---|
| ValueError | If none of the provided names can be resolved to accession IDs in .var_names. |
Example
Resolve gene symbols to accession IDs:
Resolve accessions directly:
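The lookup order described above (exact match in .var_names first, then gene name via a mapping, with unmatched names reported) can be sketched without AnnData. The function `resolve_names` and its arguments are hypothetical stand-ins, and the accessions are used purely as sample data; this is an illustration of the logic, not the scpviz code.

```python
def resolve_names(var_names, namelist, gene_map=None):
    """Hypothetical sketch of the lookup order described above."""
    gene_map = gene_map or {}
    known = set(var_names)
    resolved, unmatched = [], []
    for name in namelist:
        if name in known:  # 1. exact accession match
            resolved.append(name)
        elif name in gene_map:  # 2. gene name -> accession
            resolved.append(gene_map[name])
        else:
            unmatched.append(name)
    if unmatched:
        print(f"Unmatched names: {unmatched}")
    if not resolved:
        raise ValueError("None of the provided names could be resolved.")
    return resolved


var_names = ["P04637", "P38398"]
gene_to_accession = {"TP53": "P04637", "BRCA1": "P38398"}

hits = resolve_names(var_names, ["TP53", "P38398", "FOO"], gene_to_accession)
print(hits)
```

Passing a precomputed mapping, as with gene_map, avoids rebuilding the gene lookup on every call when many name lists are resolved against the same object.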
Source code in src/scpviz/utils/data.py
resolve_class_filter
resolve_class_filter(
adata: pAnnData | AnnData,
classes: str | list[str],
class_value: str | list[str],
debug: bool = False,
*,
filter_func: (
Callable[..., pAnnData | AnnData] | None
) = None
) -> pAnnData | ad.AnnData
Resolve (classes, class_value) inputs and apply filtering.
This helper standardizes class/value pairs into dictionary-style filters and applies them to an AnnData or pAnnData object. It is primarily used internally by plotting and analysis functions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData or pAnnData | Input data object to filter. | required |
| classes | str or list of str | Metadata field(s) used for filtering. | required |
| class_value | str or list of str | Value(s) corresponding to classes. | required |
| debug | bool | If True, print resolved class/value pairs. | False |
| filter_func | callable | Filtering function to apply. If None, a default filter is used. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| filtered | AnnData or pAnnData | Subset of the input object, with the same type as the input. |
Warning
This is an internal helper for use inside functions such as
plot_rankquant and plot_raincloud. End users should call
pAnnData.filter_sample_values() instead.
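To make "dictionary-style filters" concrete, here is a minimal sketch of how a (classes, class_value) pair might be standardized and applied. Both helpers are hypothetical, samples are represented as plain dicts of metadata, and the convention that a combined value such as "AS_kd" is split on underscores is an assumption for the example rather than confirmed behavior.

```python
def to_filter_dict(classes, class_value, sep="_"):
    """Hypothetical: turn (classes, class_value) into {class: value}."""
    if isinstance(classes, str):
        if not isinstance(class_value, str):
            raise TypeError("A single class takes a single value in this sketch.")
        return {classes: class_value}
    # Assumption: a combined string carries one value per class, joined by sep.
    if isinstance(class_value, str):
        values = class_value.split(sep)
    else:
        values = list(class_value)
    if len(values) != len(classes):
        raise ValueError("Need exactly one value per class.")
    return dict(zip(classes, values))


def apply_filter(samples, filter_dict):
    """Keep samples whose metadata matches every entry of the filter."""
    return [
        s for s in samples
        if all(s.get(key) == value for key, value in filter_dict.items())
    ]


samples = [
    {"cellline": "AS", "treatment": "kd"},
    {"cellline": "AS", "treatment": "sc"},
    {"cellline": "BE", "treatment": "kd"},
]
query = to_filter_dict(["cellline", "treatment"], "AS_kd")
kept = apply_filter(samples, query)
print(query, kept)
```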
Source code in src/scpviz/utils/class_filter.py
resolve_input_layer
Resolve the source layer name for provenance when the user passes layer='X'.
The active matrix .X tracks its logical source in adata.uns['current_X_layer']
(maintained by set_X() and set at import). For any other layer string,
return it unchanged.
If current_X_layer is missing (legacy objects), falls back to "X_raw".
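The rule above is small enough to state as code. The sketch below uses a plain dict in place of adata.uns and a hypothetical function name, since the real signature is not shown here; only the 'X' lookup and the "X_raw" fallback are taken from the description.

```python
def resolve_layer_name(uns, layer):
    """Hypothetical stand-in operating on a dict in place of adata.uns."""
    if layer == "X":
        # .X tracks its logical source; legacy objects fall back to "X_raw".
        return uns.get("current_X_layer", "X_raw")
    return layer  # any other layer string is returned unchanged


tracked = {"current_X_layer": "X_norm"}
legacy = {}

print(resolve_layer_name(tracked, "X"))
print(resolve_layer_name(legacy, "X"))
print(resolve_layer_name(tracked, "X_log"))
```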
Source code in src/scpviz/utils/data.py
scalarize_taxon
Normalize taxon-id values so they never contain lists or arrays.
Returns:
| Type | Description |
|---|---|
| object | Scalar string-like taxon id, or pd.NA. |
Source code in src/scpviz/utils/id_maps.py
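One way to picture this kind of normalization is shown below. The helper `scalarize` is hypothetical: it uses None where the documented function returns pd.NA, and both the choice of the first non-missing element and the conversion to str are assumptions, since the actual rules are not described here.

```python
def scalarize(value):
    """Hypothetical: collapse a list-like taxon id to one string, else None."""
    if isinstance(value, (list, tuple)):
        items = [v for v in value if v is not None]
        # Assumption: the first non-missing element is kept.
        value = items[0] if items else None
    if value is None:
        return None  # the documented function returns pd.NA here
    return str(value)


print(scalarize(["9606"]))
print(scalarize(9606))
print(scalarize([]))
```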
standardize_uniprot_columns
Normalize UniProt DataFrame column names to a consistent lowercase, snake_case schema.
This ensures stability across UniProt REST API version changes while keeping the user informed only when critical fields are affected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Raw UniProt metadata table. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame or None | Copy of the DataFrame with standardized column names. |
Source code in src/scpviz/utils/id_maps.py
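A generic lowercase snake_case normalizer gives a feel for the kind of renaming involved. The function `to_snake_case` and the sample headers are illustrative assumptions; the real function may apply different or additional rules, such as explicit aliases for fields that changed between API versions.

```python
import re


def to_snake_case(name):
    """Hypothetical normalizer: non-alphanumeric runs become underscores."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()


# Sample headers for illustration only.
columns = [
    "Entry",
    "Gene Names (primary)",
    "Organism (ID)",
    "Cross-reference (STRING)",
]
renamed = {col: to_snake_case(col) for col in columns}
print(renamed)
```

Normalizing once, right after the API response is parsed, lets downstream code refer to a single stable name per field.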
update_layer_provenance
update_layer_provenance(
adata: AnnData,
layer_name: str,
op: str,
input_layer: str,
**kwargs: Any
) -> str
Register a layer in the provenance registry stored in adata.uns.
Preprocessing methods (normalize, impute, log_transform) call this
before assigning adata.layers[...]. Chains are reconstructable by following
input_layer pointers.
If layer_name already exists with a different op or input_layer,
a warning is printed and the record is stored under layer_name_1, layer_name_2, …
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | AnnData to update (must not rely on pAnnData). | required |
| layer_name | str | Intended output layer key. | required |
| op | str | Operation that produces the layer (one of a fixed set of operation names). | required |
| input_layer | str | Source layer name. | required |
| **kwargs | Any | Extra metadata to record. | {} |
Returns:
| Type | Description |
|---|---|
| str | Actual layer key to use in adata.layers. |
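The registration and conflict rule described above, together with the idea of rebuilding a chain from input_layer pointers, can be sketched with a plain dict standing in for the registry kept in adata.uns. The names `register_layer` and `provenance_chain` are hypothetical, and overwriting the record when the same op and input_layer are registered again is an assumption.

```python
def register_layer(registry, layer_name, op, input_layer, **kwargs):
    """Hypothetical sketch: record provenance and return the key actually used."""
    record = {"op": op, "input_layer": input_layer, **kwargs}
    key = layer_name
    suffix = 0
    while key in registry:
        existing = registry[key]
        if existing["op"] == op and existing["input_layer"] == input_layer:
            break  # same provenance: reuse the key (assumption)
        suffix += 1
        key = f"{layer_name}_{suffix}"
    if key != layer_name:
        print(f"Warning: '{layer_name}' already registered; using '{key}'.")
    registry[key] = record
    return key


def provenance_chain(registry, layer_name):
    """Follow input_layer pointers back to the first unregistered source."""
    chain = [layer_name]
    while chain[-1] in registry:
        parent = registry[chain[-1]]["input_layer"]
        if parent in chain:  # guard against cycles
            break
        chain.append(parent)
    return chain


registry = {}
k1 = register_layer(registry, "X_norm", "normalize", "X_raw", method="median")
k2 = register_layer(registry, "X_norm_log", "log_transform", "X_norm")
k3 = register_layer(registry, "X_norm", "impute", "X_raw")  # conflicting record
print(k1, k2, k3, provenance_chain(registry, "X_norm_log"))
```

Because the returned key can differ from the requested one, callers should assign the new matrix under the returned key rather than under layer_name.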