Tutorial 2: Filtering and Normalization

Learn how to filter proteins and peptides in your dataset.

All filter functions return a copy of the pAnnData object unless inplace=True is specified.

There are three main filtering functions:

- pdata.filter_prot() – filters proteins from the dataset
  - pdata.filter_prot_found() – sub-function for proteins found within specified samples
  - pdata.filter_prot_significant() – sub-function for proteins significant within specified samples
- pdata.filter_sample() – filters samples from the dataset
- pdata.filter_rs() – RS-based filtering (e.g., filtering by unique peptides)

`filter_prot()`

Filter protein data based on metadata conditions or accession lists (protein or gene names).

Tip

Multiple filters can be combined in a single call. For example:

condition = "protein_quant > 0.75"
pdata_filtered = pdata.filter_prot(condition=condition, valid_genes=True, unique_profiles=True)

Condition-based filtering

A condition string to filter protein metadata. Supports:
- Standard comparisons, e.g. "Protein FDR Confidence: Combined == 'High'"
- Substring queries using includes, e.g. "Description includes 'p97'"

Filter by matching condition metadata

condition = "Protein FDR Confidence: Combined == 'High'"
pdata_filtered = pdata.filter_prot(condition=condition)

Filter by numerical condition on metadata

pdata_filtered = pdata.filter_prot(condition="unique_peptides >= 2")

Note: For condition, the first variable must match a column name in prot.var. Otherwise, an error will be raised.

Substring match on protein description

condition = "Description includes 'VCP'"
pdata_filtered = pdata.filter_prot(condition=condition)

pdata_filtered.prot.var

Accession		Description		Genes
P55072	...	Transitional endoplasmic reticulum ATPase OS=Homo sapiens OX=9606 GN=VCP PE=1 SV=4	...	VCP
Q96JH7	...	Deubiquitinating protein VCPIP1 OS=Homo sapiens OX=9606 GN=VCPIP1 PE=1 SV=2	...	VCPIP1
Q8NHG7	...	Small VCP/p97-interacting protein OS=Homo sapiens OX=9606 GN=SVIP PE=1 SV=1	...	SVIP
Q9H867	...	Protein N-lysine methyltransferase METTL21D OS=Homo sapiens OX=9606 GN=VCPKMT PE=1 SV=2	...	VCPKMT
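
As a rough illustration of what the two condition styles select, the sketch below applies them to a small pandas DataFrame standing in for prot.var. It is not the library's condition parser: the toy table, its values, and the translation into DataFrame.query and str.contains are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for pdata.prot.var (values are made up for illustration).
var = pd.DataFrame(
    {
        "Description": [
            "Transitional endoplasmic reticulum ATPase GN=VCP",
            "Glyceraldehyde-3-phosphate dehydrogenase GN=GAPDH",
            "Small VCP/p97-interacting protein GN=SVIP",
        ],
        "unique_peptides": [12, 5, 1],
    },
    index=["P55072", "P04406", "Q8NHG7"],
)

# A comparison such as "unique_peptides >= 2" acts like a pandas query.
numeric_hits = var.query("unique_peptides >= 2")

# "Description includes 'VCP'" acts like a literal substring match.
substring_hits = var[var["Description"].str.contains("VCP", regex=False)]

print(list(numeric_hits.index))
print(list(substring_hits.index))
```

In this toy table the numeric condition keeps P55072 and P04406, while the substring condition keeps P55072 and Q8NHG7.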

Accession-based filtering (accession list or gene names)

Accession-based filtering accepts both UniProt accessions and gene names. On import, pAnnData objects automatically query the UniProt API for primary gene names and store them in the object.

Filter by specific accessions or genes

accessions = ['GAPDH', 'P53']
pdata_filtered = pdata.filter_prot(accessions=accessions)

Valid genes

This removes rows with missing gene names and resolves duplicate gene names by appending numeric suffixes.

Filter proteins with valid genes only

pdata_filtered = pdata.filter_prot(valid_genes=True)

Accession	Genes (before)	Genes (after)
P12345	GAPDH	GAPDH
P23456	ACTB	ACTB
P34567	NaN	❌ (removed)
P45678	HSP90AA1	HSP90AA1
P45679	HSP90AA1	HSP90AA1_2
P56789	TUBB	TUBB
…	…	…
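
The two steps described above (drop missing gene names, then suffix duplicates) can be sketched with plain pandas. This is only an approximation of the behaviour shown in the table, not the library's implementation; the Series below simply reuses the table's example rows.

```python
import pandas as pd

# Gene column from the table above, indexed by accession.
genes = pd.Series(
    ["GAPDH", "ACTB", None, "HSP90AA1", "HSP90AA1", "TUBB"],
    index=["P12345", "P23456", "P34567", "P45678", "P45679", "P56789"],
    name="Genes",
)

# Step 1: drop proteins without a gene name.
valid = genes.dropna()

# Step 2: number repeated names. The first occurrence keeps its name,
# later ones get _2, _3, ... (suffix format taken from the table above).
occurrence = valid.groupby(valid).cumcount() + 1
resolved = valid.where(occurrence == 1, valid + "_" + occurrence.astype(str))

print(resolved.to_dict())
```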

Unique profiles

Removes rows with duplicate abundance profiles across samples (typically for isoforms with no distinguishing peptides).

Tip

Recommended for single-cell data, where higher sparsity and more missing peptide values frequently lead to duplicated profiles.

Filter proteins with unique abundance profiles

pdata_filtered = pdata.filter_prot(unique_profiles=True)
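
To make "duplicate abundance profiles" concrete, here is a minimal pandas sketch. The proteins-by-samples layout, the toy values, and the choice to keep the first of each duplicated set are assumptions for illustration; the library may orient the matrix or pick the retained row differently.

```python
import numpy as np
import pandas as pd

# Rows are proteins, columns are samples (toy abundances).
abundance = pd.DataFrame(
    [
        [10.0, 12.0, np.nan],
        [10.0, 12.0, np.nan],  # same profile as the first row
        [8.0, np.nan, 5.0],
    ],
    index=["P1", "P1-2", "P2"],
    columns=["S1", "S2", "S3"],
)

# duplicated() treats NaN in the same position as equal, so rows that
# agree in every sample (including missing ones) count as duplicates.
unique_profiles = abundance[~abundance.duplicated(keep="first")]

print(list(unique_profiles.index))
```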

For more information, see the API documentation for filter_prot()

`filter_prot_found()`

Filter proteins or peptides based on "Found In" detection across samples or groups.

This method filters features by checking whether they are found in a minimum number or proportion of samples, either at the group level (e.g., biological condition) or based on individual files.

Note: If match_any=True, proteins found in any group/file are retained (OR logic). If False, proteins must be found in all groups/files (AND logic). The flag defaults to True.

`groups` and `match_any`

Filter proteins found in both cell lines (match_any=False, AND logic)

pdata_filtered = pdata.filter_prot_found(group="cellline", min_count=2, match_any=False)

In this example, the class column "cellline" contains two groups: A and B.
Proteins must be detected in at least two samples within each cell line to be retained.

Cell line A		Cell line B		Result
F1	F2	F3	F4
🟩	🟩	🟩	🟩	✅ Kept (found ≥2 per cell line)
🟩	🟩	⚪	⚪	❌ Filtered (not enough in cell line B), kept if `match_any=True`

The group parameter refers to a sample class column (e.g., "cellline", "treatment", "condition").
Each unique value in that column (e.g., A, B) is treated as a separate subgroup.

Filter proteins found in any cell line (match_any=True, OR logic, ratio ≥ 0.4)

pdata_filtered = pdata.filter_prot_found(group="cellline", min_ratio=0.4, match_any=True)

This example uses the class column "cellline" containing A and B.
Proteins are retained if they are found in at least 40% of samples within any one cell line.

Cell line A			Cell line B		Ratio (A, B)	Result
F1	F2	F3	F4	F5
🟩	🟩	⚪	⚪	⚪	(0.67, 0.00)	✅ Kept (≥0.4 in A)
🟩	⚪	⚪	🟩	⚪	(0.33, 0.50)	✅ Kept (≥0.4 in B)
⚪	⚪	🟩	⚪	⚪	(0.33, 0.00)	❌ Filtered (<0.4 in both)

With match_any=True, OR logic is applied across groups:
a protein passes if it meets the minimum ratio threshold in any one subgroup.
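
The group logic can be reproduced on a small boolean table. The helper below, keep_found, is hypothetical and written only for this tutorial; it is a sketch of the AND/OR rule described here, not the code behind filter_prot_found().

```python
import pandas as pd


def keep_found(found, groups, min_ratio=None, min_count=None, match_any=True):
    """Return a boolean Series marking proteins that pass the threshold.

    found  : DataFrame of booleans, proteins x samples
    groups : dict mapping group name -> list of sample columns
    """
    if (min_ratio is None) == (min_count is None):
        raise ValueError("give exactly one of min_ratio or min_count")
    passes = {}
    for name, cols in groups.items():
        if min_ratio is not None:
            passes[name] = found[cols].mean(axis=1) >= min_ratio
        else:
            passes[name] = found[cols].sum(axis=1) >= min_count
    passes = pd.DataFrame(passes)
    # OR across groups when match_any is True, AND otherwise.
    return passes.any(axis=1) if match_any else passes.all(axis=1)


# The three rows of the ratio table above (F1-F3 = cell line A, F4-F5 = B).
found = pd.DataFrame(
    [
        [True, True, False, False, False],
        [True, False, False, True, False],
        [False, False, True, False, False],
    ],
    columns=["F1", "F2", "F3", "F4", "F5"],
)
groups = {"A": ["F1", "F2", "F3"], "B": ["F4", "F5"]}

any_group = keep_found(found, groups, min_ratio=0.4, match_any=True)
all_groups = keep_found(found, groups, min_ratio=0.4, match_any=False)
print(any_group.tolist())
print(all_groups.tolist())
```

With OR logic the first two rows pass, matching the table. With AND logic none of the three rows pass, because no row reaches 40% in both cell lines.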

Filter by found in file-list

Filter proteins found in all three input files

pdata_filtered = pdata.filter_prot_found(group=["F1", "F2", "F3"], match_any=False)

F1	F2	F3	Result
🟩	🟩	🟩	✅ Kept (found in all 3 files)
🟩	⚪	🟩	❌ Filtered (not found in File 2)

Filter proteins found in files of a specific sub-group with AND logic

pdata.annotate_found(classes=['group', 'condition'])
pdata_filtered = pdata.filter_prot_found(group=['groupA_control', 'groupB_treated'], min_ratio=0.5, match_any=False)

groupA_control	groupA_treated	groupB_control	groupB_treated	Result
🟩	⚪	⚪	🟩	✅ Kept (Found in both specified groups)
🟩	🟩	🟩	⚪	❌ Filtered (Not found in `groupB_treated` group)

Filter by class column, based on a minimum ratio (e.g., at least 50% in each cell line)

pdata_filtered = pdata.filter_prot_found(group="cellline", min_ratio=0.5, match_any=False)

Cell line A		Cell line B			Ratio	Result
F1	F2	F3	F4	F5	≥ 0.5
🟩	⚪	🟩	🟩	⚪	(0.5, 0.67)	✅
🟩	⚪	🟩	⚪	⚪	(0.5, 0.33)	❌ (✅ if `match_any=True`)
🟩	⚪	⚪	⚪	⚪	(0.5, 0)	❌ (✅ if `match_any=True`)

For more information, see the API documentation for filter_prot_found()

`filter_prot_significant()`

Filter proteins based on significance across samples or groups using FDR thresholds.

This method filters proteins by checking whether they are significant (e.g., PG.Q.Value < 0.01) in a minimum number or proportion of samples, either per file or grouped.

The grouping logic is akin to that of filter_prot_found().

Warning

Only DIA-NN files contain sample-specific q-values. For PD files, use pdata.filter_prot_significant() to filter based on global q-values.

Note: If match_any=True, proteins significant in any group/file are retained (OR logic). If False, proteins must be significant in all groups/files (AND logic). The flag defaults to True.
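
The same idea can be sketched for significance: threshold a per-sample q-value table, count significant samples per group, then combine groups with AND or OR. The q-values, the group layout, and the requirement of at least one significant sample per group are all made up for this illustration; this is not the library's implementation.

```python
import numpy as np
import pandas as pd

# Toy per-sample q-values (proteins x samples); NaN = not reported.
qvals = pd.DataFrame(
    [
        [0.001, 0.004, 0.020, 0.003],
        [0.002, 0.008, 0.200, np.nan],
        [0.300, 0.050, 0.002, 0.001],
    ],
    index=["P1", "P2", "P3"],
    columns=["F1", "F2", "F3", "F4"],
)
groups = {"A": ["F1", "F2"], "B": ["F3", "F4"]}
fdr = 0.01  # threshold used in the description above

significant = qvals < fdr  # NaN compares as False

# A group passes if at least one of its samples is significant.
per_group = pd.DataFrame(
    {name: significant[cols].sum(axis=1) >= 1 for name, cols in groups.items()}
)
keep_all = per_group.all(axis=1)  # AND logic (match_any=False)
keep_any = per_group.any(axis=1)  # OR logic (match_any=True)

print(keep_all.to_dict())
print(keep_any.to_dict())
```

Only P1 is significant in both groups, so it is the only protein kept under AND logic, while all three pass under OR logic.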

Filter proteins significant in all 'cellline' groups ('cellline_A' and 'cellline_B')

pdata_filtered = pdata.filter_prot_significant(group=["cellline"], min_count=2, match_any=False)

Filter proteins significant in all three input files

pdata_filtered = pdata.filter_prot_significant(group=["F1", "F2", "F3"], match_any=False)

Filter proteins significant in files of a specific sub-group

pdata.annotate_significant(classes=['group', 'treatment'])
pdata_filtered = pdata.filter_prot_significant(group=["groupA_control", "groupB_treated"])

For more information, see the API documentation for filter_prot_significant()

`filter_sample()`

Filter samples in a pAnnData object based on categorical, numeric, or identifier-based criteria.
Accepts exactly one of the following arguments:

values: A dictionary or list of dictionaries specifying class-based filters (e.g., treatment, cellline).
condition: A logical condition string evaluated against summary-level numeric metadata (e.g., protein count).
file_list: A list of sample or file names to retain.

Filter by value

Categorical metadata filtering allows selection of samples based on .obs or .summary fields such as treatment, cell line, or condition.
This supports:

A single dictionary, e.g. {'cellline': 'A'}
A list of dictionaries for multiple matching cases, e.g. [{...}, {...}]
Exact matching:
- exact_cases=True for strict combination matching across all key–value pairs
- exact_cases=False applies an OR logic within fields and AND logic across fields.

`exact_cases=False`

Filter by metadata values (exact_cases=False)

pdata_filtered = pdata.filter_sample(values={'condition': ['kd','sc'], 'cellline': 'A'})

When exact_cases=False, the logic is OR within fields and AND across fields.
This means any sample whose condition is in ['kd','sc'] and whose cellline is 'A' is kept.

Sample	Condition	Cell line	Match logic	Result
1	sc	A	✅ condition in [kd, sc] and ✅ cellline=A	✅ Kept
2	kd	A	✅ condition in [kd, sc] and ✅ cellline=A	✅ Kept
3	sc	B	✅ condition in [kd, sc] and ❌ cellline=A	❌
4	kd	B	✅ condition in [kd, sc] and ❌ cellline=A	❌

`exact_cases=True`

Filter with multiple exact matching cases (exact_cases=True)

pdata_filtered = pdata.filter_sample(
    values=[
        {'condition': 'kd', 'cellline': 'A'},
        {'condition': 'sc', 'cellline': 'B'}
    ],
    exact_cases=True
)

When exact_cases=True, the logic requires an exact match to one of the full dictionaries.
Here, only samples matching either {condition: 'kd', cellline: 'A'} or {condition: 'sc', cellline: 'B'} are kept.

Sample	Condition	Cell line	Match dictionary	Result
1	sc	A	❌ No exact match	❌
2	kd	A	✅ Matches	✅ Kept
3	sc	B	✅ Matches	✅ Kept
4	kd	B	❌ No exact match	❌
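
Both matching modes can be written out as boolean masks over a small metadata table. The helpers match_loose and match_exact are hypothetical names used only in this sketch; they mirror the two tables above rather than the internals of filter_sample().

```python
import pandas as pd

# The four samples from the tables above.
obs = pd.DataFrame(
    {"condition": ["sc", "kd", "sc", "kd"], "cellline": ["A", "A", "B", "B"]},
    index=["1", "2", "3", "4"],
)


def match_loose(obs, values):
    """exact_cases=False: OR within a field, AND across fields."""
    mask = pd.Series(True, index=obs.index)
    for column, allowed in values.items():
        allowed = allowed if isinstance(allowed, list) else [allowed]
        mask &= obs[column].isin(allowed)
    return mask


def match_exact(obs, cases):
    """exact_cases=True: a sample must match one whole dictionary."""
    mask = pd.Series(False, index=obs.index)
    for case in cases:
        case_mask = pd.Series(True, index=obs.index)
        for column, value in case.items():
            case_mask &= obs[column] == value
        mask |= case_mask
    return mask


loose = match_loose(obs, {"condition": ["kd", "sc"], "cellline": "A"})
exact = match_exact(
    obs,
    [{"condition": "kd", "cellline": "A"}, {"condition": "sc", "cellline": "B"}],
)
print(list(obs.index[loose]))
print(list(obs.index[exact]))
```

The loose mask keeps samples 1 and 2, and the exact mask keeps samples 2 and 3, as in the two tables.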

Filter by condition

Use a logical condition string referencing columns in pdata.summary.
This enables numeric and boolean filtering based on sample-level summary statistics.

Keep samples with more than 1000 proteins

pdata_filtered = pdata.filter_sample(condition="protein_count > 1000")

Using `min_prot`

A convenience shortcut for filtering based on a minimum protein count.

Remove samples with fewer than 1000 proteins

pdata_filtered = pdata.filter_sample(min_prot=1000)

Filter by file list

Filter samples directly by their file or sample identifiers.

Keep specific samples by name

pdata_filtered = pdata.filter_sample(file_list=['Sample_001', 'Sample_007'])

Exclude specific samples by name

pdata_filtered = pdata.filter_sample(exclude_file_list=['Sample_001', 'Sample_007'])

Advanced query mode

Enable advanced filtering with query_mode=True to execute raw pandas-style queries.
This interprets values or condition as a raw .query() string evaluated directly on .obs or .summary.

Query .obs metadata using values

pdata_filtered = pdata.filter_sample(values="cellline == 'AS' and condition == 'kd'", query_mode=True)

Query .summary metadata using condition

pdata_filtered = pdata.filter_sample(condition="protein_count > 1000 and protein_quant > 0.9", query_mode=True)

Complex logical expressions such as (A and B) or C are supported.
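
Since query mode passes the string to a pandas-style .query(), an expression can be tried out on an ordinary DataFrame first. The table below is a made-up stand-in for .summary; the sample names, the 'BE' cell line, and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for pdata.summary.
summary = pd.DataFrame(
    {
        "cellline": ["AS", "AS", "BE"],
        "condition": ["kd", "sc", "kd"],
        "protein_count": [1500, 900, 2100],
        "protein_quant": [0.95, 0.97, 0.80],
    },
    index=["Sample_001", "Sample_002", "Sample_003"],
)

# An "(A and B) or C" style expression.
expr = "(cellline == 'AS' and condition == 'kd') or protein_count > 2000"
kept = summary.query(expr)

print(list(kept.index))
```

Sample_001 satisfies the parenthesised part and Sample_003 satisfies the protein count clause, so both are kept; Sample_002 matches neither.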

Additional flags

cleanup: If True (default), remove proteins that become all-NaN or all-zero after sample filtering and synchronize RS/peptide matrices.
Set to False to retain all proteins (useful for downstream DE analyses requiring consistent feature alignment).
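
What cleanup removes can be seen on a toy samples-by-proteins matrix. This sketch is an assumption about the behaviour described above, not the library's code; in particular it also treats a protein whose remaining values are a mix of NaN and zero as empty.

```python
import numpy as np
import pandas as pd

# Samples x proteins, following the AnnData orientation (toy values).
X = pd.DataFrame(
    [
        [5.0, np.nan, 0.0],
        [6.0, np.nan, 0.0],
        [7.0, 3.0, 2.0],
    ],
    index=["S1", "S2", "S3"],
    columns=["P1", "P2", "P3"],
)

# Sample filtering step: S3 is dropped.
kept_samples = X.loc[["S1", "S2"]]

# A protein is empty if every remaining value is NaN or zero.
empty = (kept_samples.isna() | (kept_samples == 0)).all(axis=0)
cleaned = kept_samples.loc[:, ~empty]

print(list(cleaned.columns))
```

After S3 is removed, P2 is all-NaN and P3 is all-zero, so only P1 survives the cleanup step.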

For more information, see the API documentation for filter_sample()

`filter_rs()`

Filter the RS matrix and associated .prot and .pep data based on peptide–protein relationships.
This method applies rules for retaining proteins with sufficient peptide evidence and/or removing ambiguous peptides.

Key Parameters

min_peptides_per_protein (int, optional) – Minimum total number of peptides required per protein.
min_unique_peptides_per_protein (int, optional) – Minimum number of unique peptides required per protein.
max_proteins_per_peptide (int, optional) – Maximum number of proteins a peptide can map to (peptides exceeding this are removed).
preset (str or dict, optional) – Predefined filter preset, or a dictionary specifying thresholds manually. The default preset is "default".
- "default" → unique peptides ≥ 2
- "lenient" → total peptides ≥ 2

Filter by unique peptides

Filter proteins with ≥2 unique peptides

pdata_filtered = pdata.filter_rs(min_unique_peptides_per_protein=2)

Protein	Peptide 1	Peptide 2	Peptide 3	Unique peptides	Result
P001	🟩	🟩	⚪	2	✅ Kept
P002	🟩	⚪	⚪	1	❌ Filtered
P003	🟩	🟩	🟩	3	✅ Kept

Proteins with fewer than two unique peptides are removed by default.
The filtering operation updates both .prot and .pep tables and synchronizes their mappings in the RS matrix.
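
The difference between total and unique peptide counts can be sketched with a small binary matrix. The proteins-by-peptides orientation and the toy mapping are assumptions for illustration; "unique" here means a peptide that maps to exactly one protein.

```python
import numpy as np

# Toy RS matrix: rows = proteins, columns = peptides; 1 = peptide maps to protein.
proteins = ["P001", "P002", "P003"]
rs = np.array(
    [
        [1, 1, 0, 0, 1],
        [0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1],
    ]
)

proteins_per_peptide = rs.sum(axis=0)   # how many proteins share each peptide
is_unique = proteins_per_peptide == 1   # peptide belongs to a single protein
unique_per_protein = rs[:, is_unique].sum(axis=1)
total_per_protein = rs.sum(axis=1)

keep_default = unique_per_protein >= 2  # rule of the "default" preset
keep_lenient = total_per_protein >= 2   # rule of the "lenient" preset

print(dict(zip(proteins, keep_default.tolist())))
print(dict(zip(proteins, keep_lenient.tolist())))
```

The last peptide is shared by all three proteins, so it counts toward the total but not toward the unique count. Only P001 passes the unique-peptide rule, while all three proteins pass the total-peptide rule.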

For more information, see the API documentation for filter_rs()