
Hidden Functions

Internal helper functions shared across all pAnnData mixins.

Advanced / Internal

The functions in this section are internal utilities. They may change without notice and are not guaranteed to remain stable across releases. Use only if you understand the internal architecture of pAnnData.


src.scpviz.pAnnData.analysis

src.scpviz.pAnnData.base

src.scpviz.pAnnData.editing

src.scpviz.pAnnData.enrichment

_pretty_vs_key

_pretty_vs_key(k)

Format a DE contrast key into a human-readable string.

This function attempts to convert a string representation of a DE comparison (e.g., a list of dictionaries) into a simplified "group1 vs group2" format, using the values from each dictionary in the left and right group.

Parameters:

- k (str, required): DE key string, typically in the format "[{...}] vs [{...}]".

Returns:

- str: A simplified, human-readable version of the DE comparison key.

Source code in src/scpviz/pAnnData/enrichment.py
def _pretty_vs_key(k):
    """
    Format a DE contrast key into a human-readable string.

    This function attempts to convert a string representation of a DE comparison
    (e.g., a list of dictionaries) into a simplified `"group1 vs group2"` format,
    using the values from each dictionary in the left and right group.

    Args:
        k (str): DE key string, typically in the format `"[{...}] vs [{...}]"`.

    Returns:
        str: A simplified, human-readable version of the DE comparison key.
    """
    import ast
    try:
        parts = k.split(" vs ")
        left = "_".join(str(v) for d in ast.literal_eval(parts[0]) for v in d.values())
        right = "_".join(str(v) for d in ast.literal_eval(parts[1]) for v in d.values())
        return f"{left} vs {right}"
    except Exception:
        return k  # fallback to raw key if anything goes wrong

_resolve_de_key

_resolve_de_key(stats_dict, user_key, debug=False)

Resolve a user-supplied DE key to a valid key stored in .stats["de_results"].

This function matches a flexible, human-readable DE key against the internal keys stored in the DE results dictionary. It supports both raw and pretty-formatted keys, and can handle suffixes like _up or _down for directional analysis.

Parameters:

- stats_dict (dict, required): Dictionary of DE results (typically pdata.stats["de_results"]).
- user_key (str, required): User-supplied key to resolve, e.g., "AS_kd vs AS_sc_down".
- debug (bool, default False): If True, prints detailed debug output for tracing.

Returns:

- str: The matching internal DE result key.

Raises:

- ValueError: If no matching key is found.

Source code in src/scpviz/pAnnData/enrichment.py
def _resolve_de_key(stats_dict, user_key, debug=False):
    """
    Resolve a user-supplied DE key to a valid key stored in `.stats["de_results"]`.

    This function matches a flexible, human-readable DE key against the internal keys
    stored in the DE results dictionary. It supports both raw and pretty-formatted keys,
    and can handle suffixes like `_up` or `_down` for directional analysis.

    Args:
        stats_dict (dict): Dictionary of DE results (typically `pdata.stats["de_results"]`).
        user_key (str): User-supplied key to resolve, e.g., "AS_kd vs AS_sc_down".
        debug (bool): If True, prints detailed debug output for tracing.

    Returns:
        str: The matching internal DE result key.

    Raises:
        ValueError: If no matching key is found.
    """
    import re
    if debug:
        print(f"[DEBUG] Resolving user key: {user_key}")

    # Extract suffix
    suffix = ""
    if user_key.endswith("_up") or user_key.endswith("_down"):
        user_key, suffix = re.match(r"(.+)(_up|_down)", user_key).groups()
        if debug:
            print(f"[DEBUG] Split into base='{user_key}', suffix='{suffix}'")

    # Build pretty key mapping
    pretty_map = {}
    for full_key in stats_dict:
        if "vs" not in full_key:
            continue

        if full_key.endswith("_up") or full_key.endswith("_down"):
            base = full_key.rsplit("_", 1)[0]
            full_suffix = "_" + full_key.rsplit("_", 1)[1]
        else:
            base = full_key
            full_suffix = ""

        pretty_key = _pretty_vs_key(base)
        final_key = pretty_key + full_suffix
        pretty_map[final_key] = full_key
        if debug:
            print(f"[DEBUG] Mapped '{final_key}' → '{full_key}'")

    full_user_key = user_key + suffix
    if debug:
        print(f"[DEBUG] Full user key for lookup: '{full_user_key}'")

    if full_user_key in stats_dict:
        if debug:
            print("[DEBUG] Found direct match in stats.")
        return full_user_key
    elif full_user_key in pretty_map:
        if debug:
            print(f"[DEBUG] Found in pretty map: {pretty_map[full_user_key]}")
        return pretty_map[full_user_key]
    else:
        pretty_keys = "\n".join(f"  - {k}" for k in pretty_map.keys()) if pretty_map else "  (none found)"
        raise ValueError(
            f"'{full_user_key}' not found in stats.\n"
            f"Available DE keys:\n{pretty_keys}"
        )
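The resolution logic can be demonstrated with a condensed, self-contained re-implementation (debug printing stripped; the stats dictionary and its raw key below are hypothetical):

```python
import ast
import re

def _pretty_vs_key(k):
    # As documented above: "[{...}] vs [{...}]" -> "group1 vs group2"
    try:
        parts = k.split(" vs ")
        left = "_".join(str(v) for d in ast.literal_eval(parts[0]) for v in d.values())
        right = "_".join(str(v) for d in ast.literal_eval(parts[1]) for v in d.values())
        return f"{left} vs {right}"
    except Exception:
        return k

def resolve(stats_dict, user_key):
    # Condensed sketch of _resolve_de_key without the debug output
    suffix = ""
    if user_key.endswith(("_up", "_down")):
        user_key, suffix = re.match(r"(.+)(_up|_down)", user_key).groups()
    # Map pretty-formatted keys (plus any directional suffix) back to raw keys
    pretty_map = {}
    for full_key in stats_dict:
        if "vs" not in full_key:
            continue
        if full_key.endswith(("_up", "_down")):
            base, tail = full_key.rsplit("_", 1)
            full_suffix = "_" + tail
        else:
            base, full_suffix = full_key, ""
        pretty_map[_pretty_vs_key(base) + full_suffix] = full_key
    full_user_key = user_key + suffix
    if full_user_key in stats_dict:      # raw key supplied directly
        return full_user_key
    if full_user_key in pretty_map:      # pretty key supplied
        return pretty_map[full_user_key]
    raise ValueError(f"'{full_user_key}' not found in stats.")

raw = "[{'treatment': 'kd'}] vs [{'treatment': 'sc'}]_up"
stats = {raw: "..."}
print(resolve(stats, "kd vs sc_up"))  # resolves the pretty key back to the raw key
```

Both the raw internal key and the pretty `"kd vs sc_up"` form resolve to the same entry, which is what lets users pass human-readable keys downstream.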

src.scpviz.pAnnData.filtering

_detect_ambiguous_input

_detect_ambiguous_input(group, var, group_metrics=None)

Detects ambiguous user input mixing file and group identifiers.

This helper checks whether the group list includes both file-like identifiers (present in .var as 'Found In: <file>' columns) and group-like identifiers (present in .uns as ('group', 'count') tuples).

Parameters:

- group (list of str, required): User-provided identifiers for filtering.
- var (DataFrame, required): The .var table of the corresponding AnnData object.
- group_metrics (DataFrame, default None): MultiIndex DataFrame from .uns containing per-group ('count', 'ratio') metrics.

Returns:

- tuple (bool, list, list): (is_ambiguous, annotated_files, annotated_groups)
  - is_ambiguous (bool): True if both file-like and group-like entries coexist.
  - annotated_files (list): Entries that match file-level columns.
  - annotated_groups (list): Entries that match group-level metrics.

Source code in src/scpviz/pAnnData/filtering.py
def _detect_ambiguous_input(group, var, group_metrics=None):
    """
    Detects ambiguous user input mixing file and group identifiers.

    This helper checks whether the `group` list includes both
    file-like identifiers (present in `.var` as 'Found In: <file>')
    and group-like identifiers (present in `.uns` as ('group', 'count') tuples).

    Args:
        group (list of str): User-provided identifiers for filtering.
        var (pd.DataFrame): The `.var` table of the corresponding AnnData object.
        group_metrics (pd.DataFrame, optional): MultiIndex DataFrame from `.uns`
            containing per-group ('count', 'ratio') metrics.

    Returns:
        tuple(bool, list, list):
            (is_ambiguous, annotated_files, annotated_groups)
            - is_ambiguous (bool): True if both file-like and group-like entries coexist.
            - annotated_files (list): Entries that match file-level columns.
            - annotated_groups (list): Entries that match group-level metrics.
    """
    annotated_files = [g for g in group if f"Found In: {g}" in var.columns]
    annotated_groups = []
    if group_metrics is not None:
        annotated_groups = [
            g for g in group
            if (g, "count") in group_metrics.columns or (g, "ratio") in group_metrics.columns
        ]

    has_file_like = bool(annotated_files)
    has_group_like = bool(annotated_groups)

    # Ambiguous only if both exist and sets are distinct
    is_ambiguous = has_file_like and has_group_like and set(annotated_files) != set(annotated_groups)

    return is_ambiguous, annotated_files, annotated_groups
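Since the function only inspects column labels, it is easy to exercise with empty frames; the file and group names below are hypothetical:

```python
import pandas as pd

def _detect_ambiguous_input(group, var, group_metrics=None):
    # As above: classify entries as file-like vs group-like by column membership
    annotated_files = [g for g in group if f"Found In: {g}" in var.columns]
    annotated_groups = []
    if group_metrics is not None:
        annotated_groups = [
            g for g in group
            if (g, "count") in group_metrics.columns or (g, "ratio") in group_metrics.columns
        ]
    has_file_like = bool(annotated_files)
    has_group_like = bool(annotated_groups)
    # Ambiguous only if both exist and sets are distinct
    is_ambiguous = has_file_like and has_group_like and set(annotated_files) != set(annotated_groups)
    return is_ambiguous, annotated_files, annotated_groups

# Minimal stand-ins for .var and the .uns group-metrics table
var = pd.DataFrame(columns=["Accession", "Found In: fileA"])
group_metrics = pd.DataFrame(
    columns=pd.MultiIndex.from_tuples([("groupB", "count"), ("groupB", "ratio")])
)

ambiguous, files, groups = _detect_ambiguous_input(["fileA", "groupB"], var, group_metrics)
print(ambiguous, files, groups)  # True ['fileA'] ['groupB']
```

Mixing "fileA" (a file-level column) with "groupB" (a group-level metric) trips the ambiguity flag, while a list containing only one kind would not.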

src.scpviz.pAnnData.history

src.scpviz.pAnnData.identifier

src.scpviz.pAnnData.io

_import_proteomeDiscoverer

_import_proteomeDiscoverer(prot_file: Optional[str] = None, pep_file: Optional[str] = None, obs_columns: Optional[List[str]] = ['sample'], **kwargs)
Source code in src/scpviz/pAnnData/io.py
def _import_proteomeDiscoverer(prot_file: Optional[str] = None, pep_file: Optional[str] = None, obs_columns: Optional[List[str]] = ['sample'], **kwargs):
    if not prot_file and not pep_file:
        raise ValueError(f"{format_log_prefix('error')} At least one of prot_file or pep_file must be provided to function. Try prot_file='proteome_discoverer_prot.txt' or pep_file='proteome_discoverer_pep.txt'.")
    print("--------------------------\nStarting import [Proteome Discoverer]\n")

    if prot_file:
        # -----------------------------
        print(f"Source file: {prot_file} / {pep_file}")
        # PROTEIN DATA
        # check file format, if '.txt' then use read_csv, if '.xlsx' then use read_excel
        if prot_file.endswith('.txt') or prot_file.endswith('.tsv'):
            prot_all = pd.read_csv(prot_file, sep='\t')
        elif prot_file.endswith('.xlsx'):
            print("💡 Tip: The read_excel function is slower compared to reading .tsv or .txt files. For improved performance, consider converting your data to .tsv or .txt format.")
            prot_all = pd.read_excel(prot_file)
        # prot_X: sparse data matrix
        prot_X = sparse.csr_matrix(prot_all.filter(regex='Abundance: F', axis=1).values).transpose()
        # prot_layers['mbr']: protein MBR identification
        prot_layers_mbr = prot_all.filter(regex='Found in Sample', axis=1).values.transpose()
        # prot_var_names: protein names
        prot_var_names = prot_all['Accession'].values
        # prot_var: protein metadata
        used_patterns = [
            r'^Abundance: F',           # sample abundance columns (if any)
            r'^Found In',               # found-in sample flags
            r'^Significant In',         # significance flags
            r'^Ratio:',                 # ratio columns (if PD quant ratios exist)
            r'^Abundances \(Grouped\)', # grouped abundance or CV fields
        ]

        exclude_cols = prot_all.filter(regex='|'.join(used_patterns), axis=1).columns
        prot_var = prot_all.drop(columns=exclude_cols, errors='ignore').copy()        

        if 'Exp. q-value: Combined' in prot_all.columns:
            prot_var['Exp. q-value: Combined'] = prot_all['Exp. q-value: Combined']
        elif 'Exp. Protein Group q-value: Combined' in prot_all.columns:
            prot_var['Exp. q-value: Combined'] = prot_all['Exp. Protein Group q-value: Combined']
        else:
            warnings.warn("⚠️ Neither 'Exp. q-value: Combined' nor 'Exp. Protein Group q-value: Combined' found in input file.")

        prot_var.rename(columns={
            'Gene Symbol': 'Genes',
            'Exp. q-value: Combined': 'Global_Q_value',
            '# Unique Peptides': 'unique_peptides'
        }, inplace=True)
        # prot_obs_names: file names
        prot_obs_names = prot_all.filter(regex='Abundance: F', axis=1).columns.str.extract(r'Abundance: (F\d+):')[0].values
        # prot_obs: sample typing from the column name, drop column if all 'n/a'
        prot_obs = prot_all.filter(regex='Abundance: F', axis=1).columns.str.extract(r'Abundance: F\d+: (.+)$')[0].values
        prot_obs = pd.DataFrame(prot_obs, columns=['metadata'])['metadata'] \
            .str.split(',', expand=True)
        prot_obs = _safe_strip(prot_obs).astype('category')

        if (prot_obs == "n/a").all().any():
            print(f"{format_log_prefix('warn')} Found columns with all 'n/a'. Dropping these columns.")
            prot_obs = prot_obs.loc[:, ~(prot_obs == "n/a").all()]

        print(f"Number of files: {len(prot_obs_names)}")
        print(f"Proteins: {len(prot_var)}")
    else:
        prot_X = prot_layers_mbr = prot_var_names = prot_var = prot_obs_names = prot_obs = None

    if pep_file:
        # -----------------------------
        # PEPTIDE DATA
        if pep_file.endswith('.txt') or pep_file.endswith('.tsv'):
            pep_all = pd.read_csv(pep_file, sep='\t')
        elif pep_file.endswith('.xlsx'):
            print(f"{format_log_prefix('warn')} The read_excel function is slower compared to reading .tsv or .txt files. For improved performance, consider converting your data to .tsv or .txt format.")
            pep_all = pd.read_excel(pep_file)

        # Filter out any peptides that are not matched to a protein
        col = "Master Protein Accessions"
        if col in pep_all.columns:
            # treat NaN and empty/whitespace-only strings as missing
            mask_has_master = pep_all[col].notna() & pep_all[col].astype(str).str.strip().ne("")
            n_drop = (~mask_has_master).sum()
            if n_drop:
                print(f"{format_log_prefix('warn')} Dropping {n_drop} peptide rows with missing '{col}'.")
                pep_all = pep_all.loc[mask_has_master].copy()
        else:
            print(f"{format_log_prefix('warn')} Column '{col}' not found in peptide file; cannot filter missing master accessions.")

        # pep_X: sparse data matrix
        pep_X = sparse.csr_matrix(pep_all.filter(regex='Abundance: F', axis=1).values).transpose()
        # pep_layers['mbr']: peptide MBR identification
        pep_layers_mbr = pep_all.filter(regex='Found in Sample', axis=1).values.transpose()
        # pep_var_names: peptide sequence with modifications
        if 'Modifications' in pep_all.columns: # old PD version
            mod_col = 'Modifications'
        elif 'Modifications in Master Proteins' in pep_all.columns: # new PD version
            mod_col = 'Modifications in Master Proteins'
        else:
            mod_col = None # handle fallback if user didn't import?

        if mod_col is not None:
            pep_all[mod_col] = pep_all[mod_col].astype(str)
            pep_var_names = (
                pep_all['Annotated Sequence'] +
                np.where(pep_all[mod_col].isin(['nan', 'None', 'NaN']), '', ' MOD:' + pep_all[mod_col])
            ).values
        else:
            pep_var_names = pep_all['Annotated Sequence'].values
        # pep_obs_names: file names
        pep_obs_names = pep_all.filter(regex='Abundance: F', axis=1).columns.str.extract(r'Abundance: (F\d+):')[0].values
        # pep_var: peptide metadata
        pep_all.rename(columns={
            'Qvality q-value': 'Global_Q_Value',
            'q-Value': 'Global_Q_Value',
            'q-Value (Best File Local)': 'Global_Q_Value',
            'Qvality PEP': 'PEP',
            'PEP (Best File Local)': 'PEP',
        }, inplace=True)

        used_patterns = ['^Abundance: F', '^Found in Sample', '^Abundances \\(Grouped', '^Abundances \\(Grouped\\) CV']
        exclude_cols = pep_all.filter(regex='|'.join(used_patterns), axis=1).columns
        pep_var = pep_all.drop(columns=exclude_cols, errors='ignore')

        # ensure pep_var has only 1D columns
        for col in pep_var.columns:
            if isinstance(pep_var[col].iloc[0], (pd.Series, pd.DataFrame, np.ndarray, list, tuple)):
                print(f"{format_log_prefix('warn')} Dropping nested column '{col}' from peptide metadata.")
                pep_var = pep_var.drop(columns=[col])

        pep_var = pep_var.copy()
        for c in pep_var.columns:
            if not np.issubdtype(pep_var[c].dtype, np.number):
                pep_var[c] = pep_var[c].astype(str)

        # prot_obs: sample typing from the column name, drop column if all 'n/a'
        pep_obs = pep_all.filter(regex='Abundance: F', axis=1).columns.str.extract(r'Abundance: F\d+: (.+)$')[0].values
        pep_obs = pd.DataFrame(pep_obs, columns=['metadata'])['metadata'] \
            .str.split(',', expand=True)
        pep_obs = _safe_strip(pep_obs).astype('category')

        if (pep_obs == "n/a").all().any():
            print(f"{format_log_prefix('warn')} Found columns with all 'n/a'. Dropping these columns.")
            pep_obs = pep_obs.loc[:, ~(pep_obs == "n/a").all()]

        print(f"Peptides: {len(pep_var)}")
        if mod_col == 'Modifications in Master Proteins':
            print(f"\n{format_log_prefix('info_only')} Using 'Modifications in Master Proteins' for modification annotation.")
        elif mod_col == 'Modifications':
            print(f"\n{format_log_prefix('info_only')} Using 'Modifications' for modification annotation.")
        elif mod_col is None:
            print(f"\n{format_log_prefix('warn')} No modification column found. Peptide modifications were not annotated. Please check if 'Modifications' or 'Modifications in Master Protein' columns were exported from PD.")

    else:
        pep_X = pep_layers_mbr = pep_var_names = pep_var = pep_obs_names = pep_obs = None

    if prot_file and pep_file:
        # -----------------------------
        # RS DATA
        # rs is in the form of a binary matrix, protein x peptide
        pep_prot_list = pep_all['Master Protein Accessions'].str.split('; ')
        rs, mlb = _build_rs_matrix(pep_prot_list, prot_var_names = prot_var_names)

    else:
        rs = None

    # ASSERTIONS
    # -----------------------------
    # check if mlb.classes_ has overlap with prot_var
    if prot_file and pep_file:
        mlb_classes_set = set(mlb.classes_)
        prot_var_set = set(prot_var_names)

        if mlb_classes_set != prot_var_set:
            print(
                f"{format_log_prefix('warn')} Master proteins in the peptide matrix do not match proteins in the protein data, please check if files correspond to the same data.\n"
                f"{format_log_prefix('info')} If using PD3.2, this is a known issue due to changed protein grouping rules."
            )

    pdata = _create_pAnnData_from_parts(
        prot_X, pep_X, rs,
        prot_obs, prot_var, prot_obs_names, prot_var_names,
        pep_obs, pep_var, pep_obs_names, pep_var_names,
        obs_columns=obs_columns,
        X_mbr_prot=prot_layers_mbr,
        X_mbr_pep=pep_layers_mbr,
        metadata={
            "source": "proteomeDiscoverer",
            "prot_file": prot_file,
            "pep_file": pep_file
        },
        history_msg=f"Imported Proteome Discoverer data using source file(s): {prot_file}, {pep_file}."
    )

    return pdata
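The sample-name and metadata parsing from PD-style abundance headers is the least obvious step above; it can be sketched in isolation. The header strings here are hypothetical, and a plain `.str.strip()` stands in for the internal `_safe_strip` helper:

```python
import pandas as pd

# Hypothetical Proteome Discoverer abundance column headers
cols = pd.Index([
    "Abundance: F1: Sample, ctrl",
    "Abundance: F2: Sample, kd",
])

# File identifiers (F1, F2, ...) become obs_names
obs_names = cols.str.extract(r'Abundance: (F\d+):')[0].values

# The trailing comma-separated annotation becomes the obs metadata columns
meta = cols.str.extract(r'Abundance: F\d+: (.+)$')[0].values
obs = pd.DataFrame(meta, columns=['metadata'])['metadata'].str.split(',', expand=True)
obs = obs.apply(lambda s: s.str.strip()).astype('category')  # stand-in for _safe_strip

print(list(obs_names))  # ['F1', 'F2']
print(obs.iloc[0].tolist())  # ['Sample', 'ctrl']
```

Columns whose annotation is entirely "n/a" are subsequently dropped, as the source above shows.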

_import_diann

_import_diann(report_file: Optional[str] = None, obs_columns: Optional[List[str]] = None, delimiter: Optional[str] = '_', obs: Optional[pd.DataFrame] = None, prot_value='PG.MaxLFQ', pep_value='Precursor.Normalised', prot_var_columns=['Genes', 'Master.Protein'], pep_var_columns=['Genes', 'Protein.Group', 'Precursor.Charge', 'Modified.Sequence', 'Stripped.Sequence', 'Precursor.Id', 'All Mapped Proteins', 'All Mapped Genes'], **kwargs)
Source code in src/scpviz/pAnnData/io.py
def _import_diann(report_file: Optional[str] = None, obs_columns: Optional[List[str]] = None, delimiter: Optional[str] = '_', obs: Optional[pd.DataFrame] = None, prot_value = 'PG.MaxLFQ', pep_value = 'Precursor.Normalised', prot_var_columns = ['Genes', 'Master.Protein'], pep_var_columns = ['Genes', 'Protein.Group', 'Precursor.Charge','Modified.Sequence', 'Stripped.Sequence', 'Precursor.Id', 'All Mapped Proteins', 'All Mapped Genes'], **kwargs):
    if not report_file:
        raise ValueError(f"{format_log_prefix('error')} Importing from DIA-NN: report.tsv or report.parquet must be provided to function. Try report_file='report.tsv' or report_file='report.parquet'")
    print("--------------------------\nStarting import [DIA-NN]\n")

    print(f"Source file: {report_file}")
    # .tsv → pd.read_csv with tab separator; .parquet → pd.read_parquet(engine='pyarrow')
    if report_file.endswith('.tsv'):
        report_all = pd.read_csv(report_file, sep='\t')
    elif report_file.endswith('.parquet'):
        report_all = pd.read_parquet(report_file, engine='pyarrow')
    report_all['Master.Protein'] = report_all['Protein.Group'].str.split(';')
    report_all = report_all.explode('Master.Protein')
    # -----------------------------
    # PROTEIN DATA
    # prot_X: sparse data matrix
    if prot_value != 'PG.MaxLFQ':
        if report_file.endswith('.tsv') and prot_value == 'PG.Quantity':
            # check if 'PG.Quantity' is in the columns, if yes then pass, if not then throw an error that DIA-NN version >2.0 does not have PG.quantity
            if 'PG.Quantity' not in report_all.columns:
                raise ValueError("Reports generated with DIA-NN version >2.0 do not contain PG.Quantity values; please use PG.MaxLFQ.")
        else:
            print(f"{format_log_prefix('info')} Protein value specified is not PG.MaxLFQ nor PG.Quantity, please check if correct.")
    prot_X_pivot = report_all.pivot_table(index='Master.Protein', columns='Run', values=prot_value, aggfunc='first', sort=False)
    prot_X = sparse.csr_matrix(prot_X_pivot.values).T
    # prot_var_names: protein names
    prot_var_names = prot_X_pivot.index.values
    # prot_obs: file names
    prot_obs_names = prot_X_pivot.columns.values

    # prot_var: protein metadata (default: Genes, Master.Protein)
    if 'First.Protein.Description' in report_all.columns and 'First.Protein.Description' not in prot_var_columns:
        prot_var_columns.insert(0, 'First.Protein.Description')

    if 'Global.PG.Q.Value' in report_all.columns and 'Global.PG.Q.Value' not in prot_var_columns:
        prot_var_columns.append('Global.PG.Q.Value')

    existing_prot_var_columns = [col for col in prot_var_columns if col in report_all.columns]
    missing_columns = set(prot_var_columns) - set(existing_prot_var_columns)

    if missing_columns:
        warnings.warn(
            f"{format_log_prefix('warn')} The following columns are missing: {', '.join(missing_columns)}. "
        )

    prot_var = report_all.loc[:, existing_prot_var_columns].drop_duplicates(subset='Master.Protein').drop(columns='Master.Protein').rename(columns={'Global.PG.Q.Value': 'Global_Q_value'})
    # prot_obs: sample typing from the column name
    if obs is not None:
        prot_obs = obs
        # obs_columns = obs_columns
    else:
        prot_obs = pd.DataFrame(prot_X_pivot.columns.values, columns=['Run'])['Run'].str.split(delimiter, expand=True).rename(columns=dict(enumerate(obs_columns)))

    # PG.Q.Value layer (sample x protein)
    if 'PG.Q.Value' in report_all.columns:
        pgq_pivot = report_all.pivot_table(index='Master.Protein', columns='Run', values='PG.Q.Value', aggfunc='first', sort=False)
        prot_pgq_layer = sparse.csr_matrix(pgq_pivot.values).T
    else:
        prot_pgq_layer = None

    print(f"Number of files: {len(prot_obs_names)}")
    print(f"Proteins: {len(prot_var)}")

    # -----------------------------
    # PEPTIDE DATA
    # pep_X: sparse data matrix
    pep_X_pivot = report_all.pivot_table(index='Precursor.Id', columns='Run', values=pep_value, aggfunc='first', sort=False)
    pep_X = sparse.csr_matrix(pep_X_pivot.values).T
    # pep_var_names: peptide sequence
    pep_var_names = pep_X_pivot.index.values
    # pep_obs_names: file names
    pep_obs_names = pep_X_pivot.columns.values
    # pep_var: peptide sequence with modifications (default: Genes, Protein.Group, Precursor.Charge, Modified.Sequence, Stripped.Sequence, Precursor.Id, All Mapped Proteins, All Mapped Genes)
    existing_pep_var_columns = [col for col in pep_var_columns if col in report_all.columns]
    missing_columns = set(pep_var_columns) - set(existing_pep_var_columns)
    # if missing columns are ['All Mapped Proteins'] and ['All Mapped Genes'], then it is likely that the DIA-NN version is <1.8.1, so we can skip the warning
    if missing_columns == {'All Mapped Proteins', 'All Mapped Genes'}:
        missing_columns = set()

    # Precursor.Quantity layer (if using directLFQ for normalization)
    if 'Precursor.Quantity' in report_all.columns:
        precursor_q_pivot = report_all.pivot_table(index='Precursor.Id', columns='Run', values='Precursor.Quantity', aggfunc='first', sort=False)
        precursor_q_layer = sparse.csr_matrix(precursor_q_pivot.values).T
    else:
        precursor_q_layer = None


    pep_var = report_all.loc[:, existing_pep_var_columns].drop_duplicates(subset='Precursor.Id').drop(columns='Precursor.Id')
    # pep_obs: sample typing from the column name, same as prot_obs
    pep_obs = prot_obs

    # Q.Value layer (sample x peptide)
    if 'Q.Value' in report_all.columns:
        pepq_pivot = report_all.pivot_table(index='Precursor.Id', columns='Run', values='Q.Value', aggfunc='first', sort=False)
        pep_q_layer = sparse.csr_matrix(pepq_pivot.values).T
    else:
        pep_q_layer = None

    print(f"Peptides: {len(pep_var)}")
    if missing_columns:
        print(
            f"{format_log_prefix('warn')} The following columns are missing: {', '.join(missing_columns)}. "
            "Consider running analysis in the newer version of DIA-NN (1.8.1). "
            "Peptide-protein mapping may differ."
        )

    # -----------------------------
    # RS DATA
    # rs: protein x peptide relational data
    pep_prot_list = report_all.drop_duplicates(subset=['Precursor.Id'])['Protein.Group'].str.split(';')
    rs, mlb = _build_rs_matrix(pep_prot_list, prot_var_names = prot_var_names)

    # -----------------------------
    # ASSERTIONS
    # -----------------------------
    # check if mlb.classes_ has overlap with prot_var
    mlb_classes_set = set(mlb.classes_)
    prot_var_set = set(prot_var_names)

    if mlb_classes_set != prot_var_set:
        print(f"{format_log_prefix('warn')} Master proteins in the peptide matrix do not match proteins in the protein data, please check if files correspond to the same data.")
        print(f"Overlap: {len(mlb_classes_set & prot_var_set)}")
        print(f"Unique to peptide data: {mlb_classes_set - prot_var_set}")
        print(f"Unique to protein data: {prot_var_set - mlb_classes_set}")

    pdata = _create_pAnnData_from_parts(
        prot_X, pep_X, rs,
        prot_obs, prot_var, prot_obs_names, prot_var_names,
        pep_obs, pep_var, pep_obs_names, pep_var_names,
        obs_columns=obs_columns,
        X_qval_prot=prot_pgq_layer,
        X_qval_pep=pep_q_layer,
        X_precursor_pep=precursor_q_layer,
        metadata={
            "source": "diann",
            "file": report_file,
            "protein_metric": prot_value,
            "peptide_metric": pep_value
        },
        history_msg=f"Imported DIA-NN report from {report_file} using {prot_value} (protein) and {pep_value} (peptide)."
    )

    return pdata
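The core reshaping in `_import_diann` is splitting multi-protein groups and pivoting the long-format report into a proteins × runs matrix (transposed to samples × proteins afterwards). A toy sketch with hypothetical run names and values:

```python
import pandas as pd

# Minimal long-format stand-in for a DIA-NN report
report = pd.DataFrame({
    "Run": ["S1", "S1", "S2", "S2"],
    "Protein.Group": ["P1;P2", "P3", "P1;P2", "P3"],
    "PG.MaxLFQ": [10.0, 5.0, 12.0, 6.0],
})

# Split ';'-delimited protein groups so each master protein gets its own row
report["Master.Protein"] = report["Protein.Group"].str.split(";")
report = report.explode("Master.Protein")

# Pivot into a proteins × runs matrix; 'first' keeps one value per (protein, run)
pivot = report.pivot_table(index="Master.Protein", columns="Run",
                           values="PG.MaxLFQ", aggfunc="first", sort=False)
print(pivot.shape)  # (3, 2): proteins P1, P2, P3 × runs S1, S2
```

In the real import this pivot is converted to a sparse matrix and transposed (`sparse.csr_matrix(...).T`) so samples are rows, and the same pattern is reused for the Q-value and precursor-quantity layers.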

_create_pAnnData_from_parts

_create_pAnnData_from_parts(prot_X, pep_X, rs, prot_obs, prot_var, prot_obs_names, prot_var_names, pep_obs=None, pep_var=None, pep_obs_names=None, pep_var_names=None, obs_columns=None, X_mbr_prot=None, X_mbr_pep=None, X_qval_prot=None, X_qval_pep=None, X_precursor_pep=None, found_threshold=0, fdr_threshold=0.01, metadata=None, history_msg='')

Assemble a pAnnData object from processed matrices and metadata.

This function is typically called internally by import functions. It constructs .prot and .pep AnnData objects, assigns optional metadata and MBR layers, adds identifier mappings and sample-level summary metrics, and returns a validated pAnnData object.

Parameters:

- prot_X (csr_matrix, required): Protein-level expression matrix (samples × proteins).
- pep_X (csr_matrix or None, required): Peptide-level expression matrix (samples × peptides).
- rs (csr_matrix or None, required): Binary matrix linking proteins (rows) to peptides (columns).
- prot_obs (DataFrame, required): Sample-level metadata for protein data.
- prot_var (DataFrame, required): Feature-level metadata for proteins.
- prot_obs_names (list-like, required): Sample identifiers for .prot.
- prot_var_names (list-like, required): Protein accession identifiers for .prot.
- pep_obs (DataFrame, default None): Sample metadata for .pep. If not provided, .prot.obs is reused.
- pep_var (DataFrame, default None): Feature metadata for peptides.
- pep_obs_names (list-like, default None): Sample identifiers for .pep.
- pep_var_names (list-like, default None): Peptide identifiers.
- obs_columns (list of str, default None): Columns from filenames to include in .summary and .obs.
- X_mbr_prot (ndarray or DataFrame, default None): Optional protein-level MBR identification info.
- X_mbr_pep (ndarray or DataFrame, default None): Optional peptide-level MBR identification info.
- X_qval_prot (ndarray or DataFrame, default None): Optional protein-level Q-value info.
- X_qval_pep (ndarray or DataFrame, default None): Optional peptide-level Q-value info.
- X_precursor_pep (ndarray or DataFrame, default None): Optional peptide-level precursor quantity info (for directLFQ normalization).
- metadata (dict, default None): Optional dictionary of import metadata (e.g. {'source': 'diann'}).
- history_msg (str, default ''): Operation description to append to the history log.

Returns:

- pAnnData: Initialized object with .prot, .pep, .summary, .rs, and other metadata filled in.

Note

This is a low-level function. In most cases, users should call import_data(), import_proteomeDiscoverer(), or import_diann() instead.

Source code in src/scpviz/pAnnData/io.py
def _create_pAnnData_from_parts(
    prot_X, pep_X, rs,
    prot_obs, prot_var, prot_obs_names, prot_var_names,
    pep_obs=None, pep_var=None, pep_obs_names=None, pep_var_names=None,
    obs_columns=None,
    X_mbr_prot=None,
    X_mbr_pep=None,
    X_qval_prot=None,
    X_qval_pep=None,
    X_precursor_pep=None,
    found_threshold=0,
    fdr_threshold=0.01,
    metadata=None,
    history_msg=""
):
    """
    Assemble a `pAnnData` object from processed matrices and metadata.

    This function is typically called internally by import functions. It constructs
    `.prot` and `.pep` AnnData objects, assigns optional metadata and MBR layers,
    adds identifier mappings and sample-level summary metrics, and returns a
    validated `pAnnData` object.

    Args:
        prot_X (csr_matrix): Protein-level expression matrix (samples × proteins).
        pep_X (csr_matrix or None): Peptide-level expression matrix (samples × peptides).
        rs (csr_matrix or None): Binary matrix linking proteins (rows) to peptides (columns).
        prot_obs (pd.DataFrame): Sample-level metadata for protein data.
        prot_var (pd.DataFrame): Feature-level metadata for proteins.
        prot_obs_names (list-like): Sample identifiers for `.prot`.
        prot_var_names (list-like): Protein accession identifiers for `.prot`.
        pep_obs (pd.DataFrame, optional): Sample metadata for `.pep`. If not provided, `.prot.obs` is reused.
        pep_var (pd.DataFrame, optional): Feature metadata for peptides.
        pep_obs_names (list-like, optional): Sample identifiers for `.pep`.
        pep_var_names (list-like, optional): Peptide identifiers.
        obs_columns (list of str, optional): Columns from filenames to include in `.summary` and `.obs`.
        X_mbr_prot (np.ndarray or DataFrame, optional): Optional protein-level MBR identification info.
        X_mbr_pep (np.ndarray or DataFrame, optional): Optional peptide-level MBR identification info.
        X_qval_prot (np.ndarray or DataFrame, optional): Optional protein-level Q-value info.
        X_qval_pep (np.ndarray or DataFrame, optional): Optional peptide-level Q-value info.
        X_precursor_pep (np.ndarray or DataFrame, optional): Optional peptide-level precursor quantity info. (for directLFQ normalization)
        metadata (dict, optional): Optional dictionary of import metadata (e.g. `{'source': 'diann'}`).
        history_msg (str): Operation description to append to the history log.

    Returns:
        pAnnData: Initialized object with `.prot`, `.pep`, `.summary`, `.rs`, and other metadata filled in.

    Note:
        This is a low-level function. In most cases, users should call `import_data()`, `import_proteomeDiscoverer()`, or `import_diann()` instead.
    """
    from .pAnnData import pAnnData

    print("")
    pdata = pAnnData(prot_X, pep_X, rs)

    # --- PROTEIN ---
    if prot_X is not None:
        prot_var.index = prot_var.index.astype(str)

        pdata.prot.obs = pd.DataFrame(prot_obs) # type: ignore[attr-defined]
        pdata.prot.var = pd.DataFrame(prot_var) # type: ignore[attr-defined]
        pdata.prot.obs_names = list(prot_obs_names) # type: ignore[attr-defined]
        pdata.prot.var_names = list(prot_var_names) # type: ignore[attr-defined]
        pdata.prot.obs.columns = obs_columns if obs_columns else list(range(pdata.prot.obs.shape[1])) # type: ignore[attr-defined]
        pdata.prot.layers['X_raw'] = prot_X # type: ignore[attr-defined]
        pdata.prot.uns['X_raw_obs_names'] = list(prot_obs_names) # type: ignore[attr-defined]
        pdata.prot.uns['X_raw_var_names'] = list(prot_var_names)
        if X_mbr_prot is not None:
            pdata.prot.layers['X_mbr'] = X_mbr_prot # type: ignore[attr-defined]
        if X_qval_prot is not None:
            pdata.prot.layers['X_qval'] = X_qval_prot # type: ignore[attr-defined]

    if pdata.prot is not None and "Genes" in pdata.prot.var.columns and pdata.prot.var["Genes"].isna().any(): # type: ignore[attr-defined]
        pdata.update_missing_genes(gene_col="Genes", verbose=True)

    # --- PEPTIDE ---
    if pep_X is not None:
        pep_var.index = pep_var.index.astype(str)

        pdata.pep.obs = pd.DataFrame(pep_obs) # type: ignore[attr-defined]
        pdata.pep.var = pd.DataFrame(pep_var) # type: ignore[attr-defined]
        pdata.pep.obs_names = list(pep_obs_names) # type: ignore[attr-defined]
        pdata.pep.var_names = list(pep_var_names) # type: ignore[attr-defined]
        pdata.pep.obs.columns = obs_columns if obs_columns else list(range(pdata.pep.obs.shape[1])) # type: ignore[attr-defined]
        pdata.pep.layers['X_raw'] = pep_X # type: ignore[attr-defined]
        pdata.pep.uns['X_raw_obs_names'] = list(pep_obs_names) # type: ignore[attr-defined]
        pdata.pep.uns['X_raw_var_names'] = list(pep_var_names)
        if X_mbr_pep is not None:
            pdata.pep.layers['X_mbr'] = X_mbr_pep # type: ignore[attr-defined]
        if X_qval_pep is not None:
            pdata.pep.layers['X_qval'] = X_qval_pep # type: ignore[attr-defined]
        if X_precursor_pep is not None:
            pdata.pep.layers['X_precursor'] = X_precursor_pep # type: ignore[attr-defined]

    # --- Metadata ---
    metadata = metadata or {}
    metadata.setdefault("imported_at", datetime.datetime.now().isoformat())

    if pdata.prot is not None:
        pdata.prot.uns['metadata'] = metadata
    if pdata.pep is not None:
        pdata.pep.uns['metadata'] = metadata

    # --- Summary + Validation ---
    pdata.update_summary(recompute=True, verbose=False)
    pdata._cleanup_proteins_after_sample_filter(printout=True)
    pdata._annotate_found_samples(threshold=found_threshold)
    pdata._annotate_significant_samples(fdr_threshold=fdr_threshold)

    print("")
    if not pdata.validate():
        print(f"{format_log_prefix('warn')} Validation issues found. Use `pdata.validate()` to inspect.")

    if history_msg:
        pdata._append_history(history_msg)

    print(f"{format_log_prefix('result')} Import complete. Use `print(pdata)` to view the object.")
    print("--------------------------")

    return pdata
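
The layer and metadata bookkeeping performed above can be sketched without the full `pAnnData` machinery. The following is a minimal illustration using plain dicts as stand-ins for AnnData containers (all names and values here are illustrative, not part of the scpviz API):

```python
import datetime
import numpy as np

# Toy protein matrix: 2 samples x 3 proteins (illustrative values).
prot_X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
prot_obs_names = ["S1", "S2"]
prot_var_names = ["P1", "P2", "P3"]

# Stand-in for an AnnData object: a layers dict and an uns dict.
prot = {"layers": {}, "uns": {}}

# Keep an untouched copy of the raw matrix as the 'X_raw' layer, and
# snapshot the axis labels alongside it, as the importer does, so later
# filtering can always be traced back to the original axes.
prot["layers"]["X_raw"] = prot_X.copy()
prot["uns"]["X_raw_obs_names"] = list(prot_obs_names)
prot["uns"]["X_raw_var_names"] = list(prot_var_names)

# Import metadata: the caller-supplied dict is stamped with an import
# timestamp only if one is not already present.
metadata = {"source": "diann"}
metadata.setdefault("imported_at", datetime.datetime.now().isoformat())
prot["uns"]["metadata"] = metadata
```

The `setdefault` call mirrors the function body: a user-provided `imported_at` survives, while a missing one is filled in at import time.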

_build_rs_matrix

_build_rs_matrix(pep_prot_list, prot_var_names=None)

Build a sparse boolean RS (peptide × protein) relational matrix.

Parameters:

Name Type Description Default
pep_prot_list list or Series

List/Series where each entry contains one or more protein accessions (as lists or split strings).

required
prot_var_names list

Ordered list of protein accessions used to order the RS columns. If None, columns follow the order returned by MultiLabelBinarizer.

None

Returns:

Type Description

tuple: (rs, mlb), where rs is a sparse boolean scipy.sparse.csr_matrix (peptides × proteins) and mlb is the fitted MultiLabelBinarizer.

Source code in src/scpviz/pAnnData/io.py
def _build_rs_matrix(pep_prot_list, prot_var_names=None):
    """
    Build a sparse boolean RS (peptide × protein) relational matrix.

    Args:
        pep_prot_list (list or pd.Series): List/Series where each entry
            contains one or more protein accessions (as lists or split strings).
        prot_var_names (list, optional): Ordered list of protein accessions used
            to order the RS columns. If None, columns follow the order returned
            by MultiLabelBinarizer.

    Returns:
        tuple: (rs, mlb), where rs is a sparse boolean csr_matrix
            (peptides × proteins) and mlb is the fitted MultiLabelBinarizer.
    """
    # sparse bool RS matrix to save RAM
    mlb = MultiLabelBinarizer(sparse_output=True)
    rs = mlb.fit_transform(pep_prot_list).astype(bool)

    # reorder columns to match protein order
    if prot_var_names is not None:
        index_dict = {protein: idx for idx, protein in enumerate(mlb.classes_)}
        reorder_indices = [
            index_dict[p] for p in prot_var_names if p in index_dict
        ]
        rs = rs[:, reorder_indices]

    # make csr matrix so downstream operations are faster
    rs = sparse.csr_matrix(rs, dtype=bool)
    rs.eliminate_zeros()

    return rs, mlb

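The construction above can be exercised end to end with toy accessions. This is a hedged sketch following the same steps as `_build_rs_matrix` (the peptide and protein names are made up for illustration):

```python
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer

# Each entry lists the protein accessions one peptide maps to.
pep_prot_list = [["P1"], ["P1", "P2"], ["P3"]]
prot_var_names = ["P3", "P2", "P1"]  # desired column order

# Sparse boolean matrix: rows = peptides, columns = proteins
# (MultiLabelBinarizer orders its classes_ alphabetically by default).
mlb = MultiLabelBinarizer(sparse_output=True)
rs = mlb.fit_transform(pep_prot_list).astype(bool)

# Reorder columns so they match prot_var_names, as the function does.
index_dict = {protein: idx for idx, protein in enumerate(mlb.classes_)}
rs = rs[:, [index_dict[p] for p in prot_var_names if p in index_dict]]
rs = sparse.csr_matrix(rs, dtype=bool)
rs.eliminate_zeros()

print(rs.toarray().astype(int))
```

Note that shared peptides (here the second one, mapping to both `P1` and `P2`) produce multiple True entries in their row, which is what makes the RS matrix useful for protein-group bookkeeping downstream.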
src.scpviz.pAnnData.metrics

src.scpviz.pAnnData.summary

src.scpviz.pAnnData.validation