Importing

Mixins for importing data into pAnnData objects.


IOMixin

Data import utilities for building pAnnData objects from supported proteomics tools.

This module provides functions to parse outputs from common tools such as Proteome Discoverer and DIA-NN, automatically extracting protein and peptide quantification matrices, sample metadata, and relational mappings between peptides and proteins.

Supported tools
  • Proteome Discoverer (PD 3.1, PD 2.4, etc.)
  • DIA-NN (<1.8.1 and >2.0)

Methods:

  • import_data: Main entry point that dispatches to the appropriate import function based on source_type.
  • import_proteomeDiscoverer: Parses PD output files and initializes a pAnnData object.
  • import_diann: Parses DIA-NN report file and initializes a pAnnData object.
  • resolve_obs_columns: Extracts .obs column structure from filenames or metadata.
  • suggest_obs_from_file: Suggests sample-level metadata based on consistent filename tokens.
  • analyze_filename_formats: Analyzes filename structures to identify possible grouping patterns.

Source code in src/scpviz/pAnnData/io.py
class IOMixin:
    """
    Data import utilities for building `pAnnData` objects from supported proteomics tools.

    This module provides functions to parse outputs from common tools such as Proteome Discoverer and DIA-NN,
    automatically extracting protein and peptide quantification matrices, sample metadata, and relational mappings
    between peptides and proteins.

    Supported tools:
        - Proteome Discoverer (PD 3.1, PD 2.4, etc.)
        - DIA-NN (<1.8.1 and >2.0)

    Functions:
        import_data: Main entry point that dispatches to the appropriate import function based on source_type.
        import_proteomeDiscoverer: Parses PD output files and initializes a pAnnData object.
        import_diann: Parses DIA-NN report file and initializes a pAnnData object.
        resolve_obs_columns: Extracts `.obs` column structure from filenames or metadata.
        suggest_obs_from_file: Suggests sample-level metadata based on consistent filename tokens.
        analyze_filename_formats: Analyzes filename structures to identify possible grouping patterns.
    """

    @classmethod
    def import_data(cls, *args, **kwargs):
        """
        Unified wrapper for importing data into a `pAnnData` object.

        This function routes to a specific import handler based on the `source_type`,
        such as Proteome Discoverer or DIA-NN. It parses protein/peptide expression data
        and associated sample metadata, returning a fully initialized `pAnnData` object.

        Args:
            source_type (str): The input tool or data source. Supported values:

                - `'pd'`, `'proteomeDiscoverer'`, `'pd13'`, `'pd24'`:  
                → Uses `import_proteomeDiscoverer()`.  
                Required kwargs:
                    - `prot_file` (str): Path to protein-level report file
                    - `obs_columns` (list of str): Columns to extract for `.obs`
                Optional kwargs:
                    - `pep_file` (str): Path to peptide-level report file

                - `'diann'`, `'dia-nn'`:  
                → Uses `import_diann()`.  
                Required kwargs:
                    - `report_file` (str): Path to DIA-NN report file
                    - `obs_columns` (list of str): Columns to extract for `.obs`

                - `'fragpipe'`, `'fp'`: Not yet implemented  
                - `'spectronaut'`, `'sn'`: Not yet implemented

            **kwargs: Additional keyword arguments forwarded to the relevant import function.

        Returns:
            pAnnData: A populated pAnnData object with `.prot`, `.pep`, `.summary`, and identifier mappings.

        Example:
            Importing Proteome Discoverer output for single-cell data:
                ```python
                obs_columns = ['Sample', 'method', 'duration', 'cell_line']
                pdata_untreated_sc = import_data(
                    source_type='pd',
                    prot_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_prot_Proteins.txt',
                    pep_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_pep_PeptideGroups.txt',
                    obs_columns=obs_columns
                )
                ```

            Importing PD output for bulk data from an Excel file:
                ```python
                obs_columns = ['Sample', 'cell_line']
                pdata_bulk = import_data(
                    source_type='pd',
                    prot_file='HCT116 resistance_20230601_pdoutput.xlsx',
                    obs_columns=obs_columns
                )
                ```

        Note:
            If `obs_columns` is not provided and filename formats are inconsistent,
            fallback parsing is applied with generic columns (`"File"`, `"parsingType"`).
        """
        return import_data(*args, **kwargs)

    @classmethod
    def import_diann(cls, *args, **kwargs):
        """
        Import DIA-NN output into a `pAnnData` object.

        This function parses a DIA-NN report file and separates protein- and peptide-level expression matrices
        using the specified abundance and metadata columns.

        Args:
            report_file (str): Path to the DIA-NN report file (required).
            obs_columns (list of str): List of metadata columns to extract from the filename for `.obs`.
            prot_value (str): Column name in DIA-NN output to use for protein quantification.
                Default: `'PG.MaxLFQ'`.
            pep_value (str): Column name in DIA-NN output to use for peptide quantification.
                Default: `'Precursor.Normalised'`.
            prot_var_columns (list of str): Columns from the protein group table to store in `.prot.var`.
                Default includes gene and master protein annotations.
            pep_var_columns (list of str): Columns from the precursor table to store in `.pep.var`.
                Default includes peptide sequence, precursor ID, and mapping annotations.
            **kwargs: Additional keyword arguments passed to `import_data()`.

        Returns:
            pAnnData: A populated object with `.prot`, `.pep`, `.summary`, and identifier mappings.

        Example:
            To import data from a DIA-NN report file:
                ```python
                obs_columns = ['Sample', 'treatment', 'replicate']
                pdata = import_diann(
                    report_file='data/project_diaNN_output.tsv',
                    obs_columns=obs_columns,
                    prot_value='PG.MaxLFQ',
                    pep_value='Precursor.Normalised'
                )
                ```

        Note:
            - DIA-NN report should contain both protein group and precursor-level information.
            - Metadata columns in filenames must be consistently formatted to extract `.obs`.
        """
        return import_diann(*args, **kwargs)

    @classmethod
    def import_proteomeDiscoverer(cls, *args, **kwargs):
        """
        Import Proteome Discoverer (PD) output into a `pAnnData` object.

        This is a convenience wrapper for `import_data(source_type='pd')`. It loads protein- and optionally peptide-level
        expression data from PD report files and parses sample metadata columns.

        Args:
            prot_file (str): Path to the protein-level report file (required).
            pep_file (str, optional): Path to the peptide-level report file (optional but recommended).
            obs_columns (list of str): List of columns to extract for `.obs`. These should match metadata tokens
                embedded in the filenames (e.g. sample, condition, replicate).
            **kwargs: Additional keyword arguments passed to `import_data()`.

        Returns:
            pAnnData: A populated object with `.prot`, `.pep` (if provided), `.summary`, and identifier mappings.

        Example:
            To import data from Proteome Discoverer:
                ```python
                obs_columns = ['Sample', 'condition', 'cell_line']
                pdata = import_proteomeDiscoverer(
                    prot_file='my_project/proteins.txt',
                    pep_file='my_project/peptides.txt',
                    obs_columns=obs_columns
                )
                ```

        Note:
            - If `pep_file` is omitted, the resulting `pAnnData` will not include `.pep` or an RS matrix.
            - If filename structure is inconsistent and `obs_columns` cannot be inferred, fallback columns are used.
        """
        return import_proteomeDiscoverer(*args, **kwargs)

    @classmethod
    def get_filenames(cls, *args, **kwargs):
        """
        Extract sample filenames from a DIA-NN or Proteome Discoverer report.

        For DIA-NN reports, this extracts the 'Run' column from the table.
        For Proteome Discoverer (PD) output, it collects unique sample identifiers 
        based on column headers (e.g. abundance columns like "Abundances (SampleX)").

        Args:
            source (str or Path): Path to the input report file.
            source_type (str): Tool used to generate the report. Must be one of {'diann', 'pd'}.

        Returns:
            list of str: Extracted list of sample names or run filenames.

        Example:
            Extract DIA-NN run names:
                ```python
                get_filenames("diann_output.tsv", source_type="diann")
                ```
                ```
                ['Sample1.raw', 'Sample2.raw', 'Sample3.raw']
                ```

            Extract PD sample names from abundance columns:
                ```python
                get_filenames("pd_output.xlsx", source_type="pd")
                ```
                ```
                ['SampleA', 'SampleB', 'SampleC']
                ```
        """
        return get_filenames(*args, **kwargs)

    @classmethod
    def suggest_obs_columns(cls, *args, **kwargs):
        """
        Suggest `.obs` column names based on parsed sample names.

        This function analyzes filenames or run names extracted from Proteome Discoverer
        or DIA-NN reports and attempts to identify consistent metadata fields. These fields
        may include `gradient`, `amount`, `cell_line`, or `well_position`, depending on
        naming conventions and regular expression matches.

        Args:
            source (str or Path, optional): Path to a DIA-NN or PD output file.
            source_type (str, optional): Type of the input file. Supports `'diann'` or `'pd'`.
                If not provided, inferred from filename or fallback heuristics.
            filenames (list of str, optional): List of sample file names or run labels to parse.
                If provided, bypasses file loading.
            delimiter (str, optional): Delimiter to use for tokenizing filenames (e.g., `','`, `'_'`).
                If not specified, will be inferred automatically.

        Returns:
            list of str: Suggested list of metadata column names to assign to `.obs`.

        Example:
            To suggest observation columns from a file:
                ```python
                suggest_obs_columns("my_experiment_PD.txt", source_type="pd")
                ```

                ```
                # Suggested columns: ['Sample', 'gradient', 'cell_line', 'duration']
                ['Sample', 'gradient', 'cell_line', 'duration']
                ```

        Note:
            This function is typically used as part of the `.import_data()` flow
            when filenames embed experimental metadata.
        """
        return suggest_obs_columns(*args, **kwargs)

get_filenames classmethod

get_filenames(*args, **kwargs)

Extract sample filenames from a DIA-NN or Proteome Discoverer report.

For DIA-NN reports, this extracts the 'Run' column from the table. For Proteome Discoverer (PD) output, it collects unique sample identifiers based on column headers (e.g. abundance columns like "Abundances (SampleX)").

Parameters:

  • source (str or Path, required): Path to the input report file.
  • source_type (str, required): Tool used to generate the report. Must be one of {'diann', 'pd'}.

Returns:

  • list of str: Extracted list of sample names or run filenames.

Example

Extract DIA-NN run names:

```python
get_filenames("diann_output.tsv", source_type="diann")
```
```
['Sample1.raw', 'Sample2.raw', 'Sample3.raw']
```

Extract PD sample names from abundance columns:

```python
get_filenames("pd_output.xlsx", source_type="pd")
```
```
['SampleA', 'SampleB', 'SampleC']
```

Source code in src/scpviz/pAnnData/io.py
@classmethod
def get_filenames(cls, *args, **kwargs):
    """
    Extract sample filenames from a DIA-NN or Proteome Discoverer report.

    For DIA-NN reports, this extracts the 'Run' column from the table.
    For Proteome Discoverer (PD) output, it collects unique sample identifiers 
    based on column headers (e.g. abundance columns like "Abundances (SampleX)").

    Args:
        source (str or Path): Path to the input report file.
        source_type (str): Tool used to generate the report. Must be one of {'diann', 'pd'}.

    Returns:
        list of str: Extracted list of sample names or run filenames.

    Example:
        Extract DIA-NN run names:
            ```python
            get_filenames("diann_output.tsv", source_type="diann")
            ```
            ```
            ['Sample1.raw', 'Sample2.raw', 'Sample3.raw']
            ```

        Extract PD sample names from abundance columns:
            ```python
            get_filenames("pd_output.xlsx", source_type="pd")
            ```
            ```
            ['SampleA', 'SampleB', 'SampleC']
            ```
    """
    return get_filenames(*args, **kwargs)

import_data classmethod

import_data(*args, **kwargs)

Unified wrapper for importing data into a pAnnData object.

This function routes to a specific import handler based on the source_type, such as Proteome Discoverer or DIA-NN. It parses protein/peptide expression data and associated sample metadata, returning a fully initialized pAnnData object.

Parameters:

  • source_type (str, required): The input tool or data source. Supported values:
      • 'pd', 'proteomeDiscoverer', 'pd13', 'pd24': uses import_proteomeDiscoverer().
        Required kwargs: prot_file (str), path to the protein-level report file; obs_columns (list of str), columns to extract for .obs.
        Optional kwargs: pep_file (str), path to the peptide-level report file.
      • 'diann', 'dia-nn': uses import_diann().
        Required kwargs: report_file (str), path to the DIA-NN report file; obs_columns (list of str), columns to extract for .obs.
      • 'fragpipe', 'fp': not yet implemented.
      • 'spectronaut', 'sn': not yet implemented.
  • **kwargs: Additional keyword arguments forwarded to the relevant import function. Default: {}.

Returns:

  • pAnnData: A populated pAnnData object with .prot, .pep, .summary, and identifier mappings.

Example

Importing Proteome Discoverer output for single-cell data:

```python
obs_columns = ['Sample', 'method', 'duration', 'cell_line']
pdata_untreated_sc = import_data(
    source_type='pd',
    prot_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_prot_Proteins.txt',
    pep_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_pep_PeptideGroups.txt',
    obs_columns=obs_columns
)
```

Importing PD output for bulk data from an Excel file:

```python
obs_columns = ['Sample', 'cell_line']
pdata_bulk = import_data(
    source_type='pd',
    prot_file='HCT116 resistance_20230601_pdoutput.xlsx',
    obs_columns=obs_columns
)
```

Note

If obs_columns is not provided and filename formats are inconsistent, fallback parsing is applied with generic columns ("File", "parsingType").
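The consistent-filename case that makes `obs_columns` work can be pictured with a minimal sketch (the filenames and column names below are hypothetical, and the library's actual parser applies extra heuristics beyond a plain split):

```python
# Hypothetical filenames with a uniform underscore-delimited structure.
filenames = ["A_60min_KD", "B_60min_SC", "C_120min_KD"]
obs_columns = ["Sample", "duration", "condition"]

# One .obs row per file: tokens pair up positionally with obs_columns.
obs_rows = [dict(zip(obs_columns, name.split("_"))) for name in filenames]
print(obs_rows[0])  # {'Sample': 'A', 'duration': '60min', 'condition': 'KD'}
```

If token counts disagree across files, no such positional pairing exists, which is why generic fallback columns are used instead.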

Source code in src/scpviz/pAnnData/io.py
@classmethod
def import_data(cls, *args, **kwargs):
    """
    Unified wrapper for importing data into a `pAnnData` object.

    This function routes to a specific import handler based on the `source_type`,
    such as Proteome Discoverer or DIA-NN. It parses protein/peptide expression data
    and associated sample metadata, returning a fully initialized `pAnnData` object.

    Args:
        source_type (str): The input tool or data source. Supported values:

            - `'pd'`, `'proteomeDiscoverer'`, `'pd13'`, `'pd24'`:  
            → Uses `import_proteomeDiscoverer()`.  
            Required kwargs:
                - `prot_file` (str): Path to protein-level report file
                - `obs_columns` (list of str): Columns to extract for `.obs`
            Optional kwargs:
                - `pep_file` (str): Path to peptide-level report file

            - `'diann'`, `'dia-nn'`:  
            → Uses `import_diann()`.  
            Required kwargs:
                - `report_file` (str): Path to DIA-NN report file
                - `obs_columns` (list of str): Columns to extract for `.obs`

            - `'fragpipe'`, `'fp'`: Not yet implemented  
            - `'spectronaut'`, `'sn'`: Not yet implemented

        **kwargs: Additional keyword arguments forwarded to the relevant import function.

    Returns:
        pAnnData: A populated pAnnData object with `.prot`, `.pep`, `.summary`, and identifier mappings.

    Example:
        Importing Proteome Discoverer output for single-cell data:
            ```python
            obs_columns = ['Sample', 'method', 'duration', 'cell_line']
            pdata_untreated_sc = import_data(
                source_type='pd',
                prot_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_prot_Proteins.txt',
                pep_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_pep_PeptideGroups.txt',
                obs_columns=obs_columns
            )
            ```

        Importing PD output for bulk data from an Excel file:
            ```python
            obs_columns = ['Sample', 'cell_line']
            pdata_bulk = import_data(
                source_type='pd',
                prot_file='HCT116 resistance_20230601_pdoutput.xlsx',
                obs_columns=obs_columns
            )
            ```

    Note:
        If `obs_columns` is not provided and filename formats are inconsistent,
        fallback parsing is applied with generic columns (`"File"`, `"parsingType"`).
    """
    return import_data(*args, **kwargs)

import_diann classmethod

import_diann(*args, **kwargs)

Import DIA-NN output into a pAnnData object.

This function parses a DIA-NN report file and separates protein- and peptide-level expression matrices using the specified abundance and metadata columns.

Parameters:

  • report_file (str, required): Path to the DIA-NN report file.
  • obs_columns (list of str, required): List of metadata columns to extract from the filename for .obs.
  • prot_value (str): Column name in DIA-NN output to use for protein quantification. Default: 'PG.MaxLFQ'.
  • pep_value (str): Column name in DIA-NN output to use for peptide quantification. Default: 'Precursor.Normalised'.
  • prot_var_columns (list of str): Columns from the protein group table to store in .prot.var. Default includes gene and master protein annotations.
  • pep_var_columns (list of str): Columns from the precursor table to store in .pep.var. Default includes peptide sequence, precursor ID, and mapping annotations.
  • **kwargs: Additional keyword arguments passed to import_data(). Default: {}.

Returns:

  • pAnnData: A populated object with .prot, .pep, .summary, and identifier mappings.

Example

To import data from a DIA-NN report file:

```python
obs_columns = ['Sample', 'treatment', 'replicate']
pdata = import_diann(
    report_file='data/project_diaNN_output.tsv',
    obs_columns=obs_columns,
    prot_value='PG.MaxLFQ',
    pep_value='Precursor.Normalised'
)
```

Note

  • The DIA-NN report should contain both protein group and precursor-level information.
  • Metadata columns in filenames must be consistently formatted to extract .obs.

Source code in src/scpviz/pAnnData/io.py
@classmethod
def import_diann(cls, *args, **kwargs):
    """
    Import DIA-NN output into a `pAnnData` object.

    This function parses a DIA-NN report file and separates protein- and peptide-level expression matrices
    using the specified abundance and metadata columns.

    Args:
        report_file (str): Path to the DIA-NN report file (required).
        obs_columns (list of str): List of metadata columns to extract from the filename for `.obs`.
        prot_value (str): Column name in DIA-NN output to use for protein quantification.
            Default: `'PG.MaxLFQ'`.
        pep_value (str): Column name in DIA-NN output to use for peptide quantification.
            Default: `'Precursor.Normalised'`.
        prot_var_columns (list of str): Columns from the protein group table to store in `.prot.var`.
            Default includes gene and master protein annotations.
        pep_var_columns (list of str): Columns from the precursor table to store in `.pep.var`.
            Default includes peptide sequence, precursor ID, and mapping annotations.
        **kwargs: Additional keyword arguments passed to `import_data()`.

    Returns:
        pAnnData: A populated object with `.prot`, `.pep`, `.summary`, and identifier mappings.

    Example:
        To import data from a DIA-NN report file:
            ```python
            obs_columns = ['Sample', 'treatment', 'replicate']
            pdata = import_diann(
                report_file='data/project_diaNN_output.tsv',
                obs_columns=obs_columns,
                prot_value='PG.MaxLFQ',
                pep_value='Precursor.Normalised'
            )
            ```

    Note:
        - DIA-NN report should contain both protein group and precursor-level information.
        - Metadata columns in filenames must be consistently formatted to extract `.obs`.
    """
    return import_diann(*args, **kwargs)

import_proteomeDiscoverer classmethod

import_proteomeDiscoverer(*args, **kwargs)

Import Proteome Discoverer (PD) output into a pAnnData object.

This is a convenience wrapper for import_data(source_type='pd'). It loads protein- and optionally peptide-level expression data from PD report files and parses sample metadata columns.

Parameters:

  • prot_file (str, required): Path to the protein-level report file.
  • pep_file (str, optional): Path to the peptide-level report file (optional but recommended).
  • obs_columns (list of str, required): List of columns to extract for .obs. These should match metadata tokens embedded in the filenames (e.g. sample, condition, replicate).
  • **kwargs: Additional keyword arguments passed to import_data(). Default: {}.

Returns:

  • pAnnData: A populated object with .prot, .pep (if provided), .summary, and identifier mappings.

Example

To import data from Proteome Discoverer:

```python
obs_columns = ['Sample', 'condition', 'cell_line']
pdata = import_proteomeDiscoverer(
    prot_file='my_project/proteins.txt',
    pep_file='my_project/peptides.txt',
    obs_columns=obs_columns
)
```

Note

  • If pep_file is omitted, the resulting pAnnData will not include .pep or an RS matrix.
  • If filename structure is inconsistent and obs_columns cannot be inferred, fallback columns are used.

Source code in src/scpviz/pAnnData/io.py
@classmethod
def import_proteomeDiscoverer(cls, *args, **kwargs):
    """
    Import Proteome Discoverer (PD) output into a `pAnnData` object.

    This is a convenience wrapper for `import_data(source_type='pd')`. It loads protein- and optionally peptide-level
    expression data from PD report files and parses sample metadata columns.

    Args:
        prot_file (str): Path to the protein-level report file (required).
        pep_file (str, optional): Path to the peptide-level report file (optional but recommended).
        obs_columns (list of str): List of columns to extract for `.obs`. These should match metadata tokens
            embedded in the filenames (e.g. sample, condition, replicate).
        **kwargs: Additional keyword arguments passed to `import_data()`.

    Returns:
        pAnnData: A populated object with `.prot`, `.pep` (if provided), `.summary`, and identifier mappings.

    Example:
        To import data from Proteome Discoverer:
            ```python
            obs_columns = ['Sample', 'condition', 'cell_line']
            pdata = import_proteomeDiscoverer(
                prot_file='my_project/proteins.txt',
                pep_file='my_project/peptides.txt',
                obs_columns=obs_columns
            )
            ```

    Note:
        - If `pep_file` is omitted, the resulting `pAnnData` will not include `.pep` or an RS matrix.
        - If filename structure is inconsistent and `obs_columns` cannot be inferred, fallback columns are used.
    """
    return import_proteomeDiscoverer(*args, **kwargs)

suggest_obs_columns classmethod

suggest_obs_columns(*args, **kwargs)

Suggest .obs column names based on parsed sample names.

This function analyzes filenames or run names extracted from Proteome Discoverer or DIA-NN reports and attempts to identify consistent metadata fields. These fields may include gradient, amount, cell_line, or well_position, depending on naming conventions and regular expression matches.

Parameters:

  • source (str or Path, optional): Path to a DIA-NN or PD output file.
  • source_type (str, optional): Type of the input file. Supports 'diann' or 'pd'. If not provided, inferred from the filename or fallback heuristics.
  • filenames (list of str, optional): List of sample file names or run labels to parse. If provided, bypasses file loading.
  • delimiter (str, optional): Delimiter to use for tokenizing filenames (e.g., ',', '_'). If not specified, inferred automatically.

Returns:

  • list of str: Suggested list of metadata column names to assign to .obs.

Example

To suggest observation columns from a file:

```python
suggest_obs_columns("my_experiment_PD.txt", source_type="pd")
```
```
# Suggested columns: ['Sample', 'gradient', 'cell_line', 'duration']
['Sample', 'gradient', 'cell_line', 'duration']
```
Note

This function is typically used as part of the .import_data() flow when filenames embed experimental metadata.

Source code in src/scpviz/pAnnData/io.py
@classmethod
def suggest_obs_columns(cls, *args, **kwargs):
    """
    Suggest `.obs` column names based on parsed sample names.

    This function analyzes filenames or run names extracted from Proteome Discoverer
    or DIA-NN reports and attempts to identify consistent metadata fields. These fields
    may include `gradient`, `amount`, `cell_line`, or `well_position`, depending on
    naming conventions and regular expression matches.

    Args:
        source (str or Path, optional): Path to a DIA-NN or PD output file.
        source_type (str, optional): Type of the input file. Supports `'diann'` or `'pd'`.
            If not provided, inferred from filename or fallback heuristics.
        filenames (list of str, optional): List of sample file names or run labels to parse.
            If provided, bypasses file loading.
        delimiter (str, optional): Delimiter to use for tokenizing filenames (e.g., `','`, `'_'`).
            If not specified, will be inferred automatically.

    Returns:
        list of str: Suggested list of metadata column names to assign to `.obs`.

    Example:
        To suggest observation columns from a file:
            ```python
            suggest_obs_columns("my_experiment_PD.txt", source_type="pd")
            ```

            ```
            # Suggested columns: ['Sample', 'gradient', 'cell_line', 'duration']
            ['Sample', 'gradient', 'cell_line', 'duration']
            ```

    Note:
        This function is typically used as part of the `.import_data()` flow
        when filenames embed experimental metadata.
    """
    return suggest_obs_columns(*args, **kwargs)

analyze_filename_formats

analyze_filename_formats(filenames, delimiter: str = '_', group_labels=None)

Analyze filename structures to detect format consistency.

This function checks if all filenames can be split into the same number of tokens using the provided delimiter. It can optionally group files by token count and assign custom group labels.

Parameters:

  • filenames (list of str, required): List of sample or file names.
  • delimiter (str): Delimiter used to split each filename. Default: '_'.
  • group_labels (list of str, optional): Group labels to assign to each unique token-length group. Default: None (labels like "3-tokens" are generated).

Returns:

  • dict: Format information containing:
      • 'uniform': True if all filenames split into the same number of tokens.
      • 'n_tokens': List of distinct token counts observed across the filenames.
      • 'group_map': Mapping of each filename to its group label.

Example

Check if filenames have a uniform structure:

```python
filenames = ["A_60min_KD", "B_60min_SC", "C_120min_KD"]
analyze_filename_formats(filenames)
```
```
{
    'uniform': True,
    'n_tokens': [3],
    'group_map': {
        'A_60min_KD': '3-tokens',
        'B_60min_SC': '3-tokens',
        'C_120min_KD': '3-tokens'
    }
}
```

With group labels:

```python
analyze_filename_formats(filenames, group_labels=["Group1"])
```
```
{
    'uniform': True,
    'n_tokens': [3],
    'group_map': {
        'A_60min_KD': 'Group1',
        'B_60min_SC': 'Group1',
        'C_120min_KD': 'Group1'
    }
}
```

Source code in src/scpviz/pAnnData/io.py
def analyze_filename_formats(filenames, delimiter: str = "_", group_labels=None):
    """
    Analyze filename structures to detect format consistency.

    This function checks if all filenames can be split into the same number of tokens
    using the provided delimiter. It can optionally group files by token count and assign
    custom group labels.

    Args:
        filenames (list of str): List of sample or file names.
        delimiter (str): Delimiter used to split each filename (default: "_").
        group_labels (list of str, optional): Optional group labels to assign to each unique token length group.

    Returns:
        dict: Format information containing:
            - 'uniform': True if all filenames split into the same number of tokens.
            - 'n_tokens': List of distinct token counts observed across the filenames.
            - 'group_map': Mapping of each filename to its group label (default labels are generated when none are provided).

    Example:
        Check if filenames have a uniform structure:
            ```python
            filenames = ["A_60min_KD", "B_60min_SC", "C_120min_KD"]
            analyze_filename_formats(filenames)
            ```
            ```
            {
                'uniform': True,
                'n_tokens': [3],
                'group_map': {
                    'A_60min_KD': '3-tokens',
                    'B_60min_SC': '3-tokens',
                    'C_120min_KD': '3-tokens'
                }
            }
            ```

        With group labels:
            ```python
            analyze_filename_formats(filenames, group_labels=["Group1"])
            ```
            ```
            {
                'uniform': True,
                'n_tokens': [3],
                'group_map': {
                    'A_60min_KD': 'Group1',
                    'B_60min_SC': 'Group1',
                    'C_120min_KD': 'Group1'
                }
            }
            ```
    """
    group_counts = defaultdict(list)
    for fname in filenames:
        tokens = fname.split(delimiter)
        group_counts[len(tokens)].append(fname)

    token_lengths = list(group_counts.keys())
    uniform = len(token_lengths) == 1

    if group_labels is None:
        group_labels = [f"{n}-tokens" for n in token_lengths]

    group_map = {}
    for label, n_tok in zip(group_labels, token_lengths):
        for fname in group_counts[n_tok]:
            group_map[fname] = label

    return {
        "uniform": uniform,
        "n_tokens": token_lengths,
        "group_map": group_map
    }
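Because default group labels are always generated, the non-uniform case also yields a populated `group_map`. A self-contained re-run of the function above on mixed-format names (the filenames are made up for illustration):

```python
from collections import defaultdict

def analyze_filename_formats(filenames, delimiter="_", group_labels=None):
    # Reproduced from the source above for a self-contained demo.
    group_counts = defaultdict(list)
    for fname in filenames:
        group_counts[len(fname.split(delimiter))].append(fname)
    token_lengths = list(group_counts.keys())
    uniform = len(token_lengths) == 1
    if group_labels is None:
        group_labels = [f"{n}-tokens" for n in token_lengths]
    group_map = {}
    for label, n_tok in zip(group_labels, token_lengths):
        for fname in group_counts[n_tok]:
            group_map[fname] = label
    return {"uniform": uniform, "n_tokens": token_lengths, "group_map": group_map}

# Mixed formats: the second file is missing its condition token.
result = analyze_filename_formats(["A_60min_KD", "B_60min", "C_120min_KD"])
print(result["uniform"])               # False
print(result["n_tokens"])              # [3, 2]
print(result["group_map"]["B_60min"])  # '2-tokens'
```

Note that `n_tokens` holds the distinct token counts in order of first appearance, not one count per file.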

classify_subtokens

classify_subtokens(token, used_labels=None, keyword_map=None)

Classify a token into one or more metadata categories based on keyword or pattern matching.

This function splits a token (e.g. from a filename) into subtokens using character-type transitions (e.g., "Aur60minDIA" → "Aur", "60min", "DIA"), then attempts to classify each subtoken using:

  • Regex patterns (e.g., dates, well positions like A01)
  • Fuzzy substring matching via a user-defined or default keyword map

Parameters:

  • token (str): The input string to classify (e.g., "Aur60minDIA"). Required.
  • used_labels (set, optional): Reserved for future logic to avoid assigning the same label twice. Default: None.
  • keyword_map (dict, optional): A dictionary of metadata categories (e.g., 'gradient') to example substrings. Default: None.

Returns:

  list of str: A list of predicted metadata labels for the token (e.g., ['gradient', 'acquisition']). If no match is found, returns ['unknown??'].

Example

Classify a gradient+time token:

classify_subtokens("Aur60minDIA")
['column', 'gradient', 'acquisition']

Classify a well position:

classify_subtokens("B07")
['well_position']

Source code in src/scpviz/pAnnData/io.py
def classify_subtokens(token, used_labels=None, keyword_map=None):
    """
    Classify a token into one or more metadata categories based on keyword or pattern matching.

    This function splits a token (e.g. from a filename) into subtokens using character-type transitions
    (e.g., "Aur60minDIA" → "Aur", "60min", "DIA"), then attempts to classify each subtoken using:

    - Regex patterns (e.g., dates, well positions like A01)
    - Fuzzy substring matching via a user-defined or default keyword map

    Args:
        token (str): The input string to classify (e.g., "Aur60minDIA").
        used_labels (set, optional): Reserved for future logic to avoid assigning the same label twice.
        keyword_map (dict, optional): A dictionary of metadata categories (e.g., 'gradient') to example substrings.

    Returns:
        list of str: A list of predicted metadata labels for the token (e.g., ['gradient', 'acquisition']).
                     If no match is found, returns ['unknown??'].

    Example:
        Classify a gradient+time token:
            ```python
            classify_subtokens("Aur60minDIA")
            ```
            ```
            ['column', 'gradient', 'acquisition']
            ```

        Classify a well position:
            ```python
            classify_subtokens("B07")
            ```
            ```
            ['well_position']
            ```
    """

    default_map = {
        "gradient": ["min", "hr", "gradient", "short", "long", "fast", "slow"],
        "amount": ["cell", "cells", "sc", "bulk", "ng", "ug", "pg", "fmol"],
        "enzyme": ["trypsin", "lysC", "chymotrypsin", "gluc", "tryp", "lys-c", "glu-c"],
        "condition": ["ctrl", "stim", "wt", "ko", "kd", "scramble", "si", "drug"],
        "sample_type": ["embryo", "brain", "liver", "cellline", "mix", "qc"],
        "instrument": ["tims", "tof", "fusion", "exploris","astral","stellar","eclipse","OA","OE480","OE","QE","qexecutive","OTE"],
        "acquisition": ["dia", "prm", "dda", "srm"],
        "column": ['TS25','TS15','TS8','Aur'],
        "organism": ["human", "mouse", "mus", "homo", "drosophila", "musculus", "sapiens"]
    }

    keyword_map = keyword_map or default_map
    labels = []  # ordered: the first-matched label takes priority for callers using labels[0]

    # Whole-token regex rule: well positions like "B07" would be split apart
    # by the subtoken pattern below, so check the intact token first
    if re.fullmatch(r"[A-Ha-h]\d{1,2}", token):
        labels.append("well_position")

    # Split into subtokens (case preserved), in case one token has multiple labels
    subtokens = re.findall(r'[A-Za-z]+|\d+min|\d+(?:ng|ug|pg|fmol)|\d{6,8}', token)

    for sub in subtokens:
        # Check unmodified for regex-based rules
        if is_date_like(sub):
            if "date" not in labels:
                labels.append("date")
        elif re.match(r"[A-Ha-h]\d{1,2}$", sub):
            if "well_position" not in labels:
                labels.append("well_position")
        else:
            # Lowercase both sides so uppercase keywords (e.g. 'Aur', 'TS25') still match
            sub_lower = sub.lower()
            for label, keywords in keyword_map.items():
                if label not in labels and any(kw.lower() in sub_lower for kw in keywords):
                    labels.append(label)

    if not labels:
        labels.append("unknown??")
    return labels
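
The subtoken pattern can be probed directly: the alternation prefers letter runs, then digit+unit combos, then 6–8 digit runs (dates), so mixed tokens split cleanly. The pattern below is copied verbatim from the function above:

```python
import re

# Subtoken pattern from classify_subtokens: letter runs, digit+unit combos
# ("60min", "500ng"), or 6-8 digit runs (dates like "20240115")
pattern = r'[A-Za-z]+|\d+min|\d+(?:ng|ug|pg|fmol)|\d{6,8}'

print(re.findall(pattern, "Aur60minDIA"))  # ['Aur', '60min', 'DIA']
print(re.findall(pattern, "HeLa500ng"))    # ['HeLa', '500ng']
print(re.findall(pattern, "20240115"))     # ['20240115']
```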

get_filenames

get_filenames(source: Union[str, Path], source_type: str) -> List[str]

Extract sample filenames from a DIA-NN or Proteome Discoverer report.

For DIA-NN reports, this extracts the 'Run' column from the table. For Proteome Discoverer (PD) output, it collects unique sample identifiers from abundance column headers (e.g. "Abundance: F1: SampleA").

Parameters:

  • source (str or Path): Path to the input report file. Required.
  • source_type (str): Tool used to generate the report. Must be one of {'diann', 'pd'}. Required.

Returns:

  list of str: Extracted list of sample names or run filenames.

Example

Extract DIA-NN run names:

get_filenames("diann_output.tsv", source_type="diann")
['Sample1.raw', 'Sample2.raw', 'Sample3.raw']

Extract PD sample names from abundance columns:

get_filenames("pd_output.xlsx", source_type="pd")
['SampleA', 'SampleB', 'SampleC']

Source code in src/scpviz/pAnnData/io.py
def get_filenames(source: Union[str, Path], source_type: str) -> List[str]:
    """
    Extract sample filenames from a DIA-NN or Proteome Discoverer report.

    For DIA-NN reports, this extracts the 'Run' column from the table.
    For Proteome Discoverer (PD) output, it collects unique sample identifiers
    from abundance column headers (e.g. "Abundance: F1: SampleA").

    Args:
        source (str or Path): Path to the input report file.
        source_type (str): Tool used to generate the report. Must be one of {'diann', 'pd'}.

    Returns:
        list of str: Extracted list of sample names or run filenames.

    Example:
        Extract DIA-NN run names:
            ```python
            get_filenames("diann_output.tsv", source_type="diann")
            ```
            ```
            ['Sample1.raw', 'Sample2.raw', 'Sample3.raw']
            ```

        Extract PD sample names from abundance columns:
            ```python
            get_filenames("pd_output.xlsx", source_type="pd")
            ```
            ```
            ['SampleA', 'SampleB', 'SampleC']
            ```
    """
    source = Path(source)
    ext = source.suffix.lower()

    # --- DIA-NN ---
    if source_type == "diann":
        if ext in [".csv", ".tsv"]:
            df = pd.read_csv(source, sep="\t" if ext == ".tsv" else ",", usecols=["Run"], low_memory=False)
        elif ext == ".parquet":
            df = pd.read_parquet(source, columns=["Run"], engine="pyarrow")
        else:
            raise ValueError(f"Unsupported file type for DIA-NN: {ext}")

        filenames = df["Run"].dropna().unique().tolist()

    # --- Proteome Discoverer ---
    elif source_type == "pd":
        if ext in [".txt", ".tsv"]:
            df = pd.read_csv(source, sep="\t", nrows=0)
        elif ext == ".xlsx":
            df = pd.read_excel(source, nrows=0)
        else:
            raise ValueError(f"Unsupported file type for PD: {ext}")

        abundance_cols = [col for col in df.columns if re.search(r"Abundance: F\d+: ", col)]
        if not abundance_cols:
            raise ValueError("No 'Abundance: F#:' columns found in PD file.")

        filenames = []
        for col in abundance_cols:
            match = re.match(r"Abundance: F\d+: (.+)", col)
            if match:
                filenames.append(match.group(1).strip())

    else:
        raise ValueError("source_type must be 'pd' or 'diann'")

    return filenames
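
The PD branch reduces abundance headers to sample names with the `Abundance: F\d+:` pattern used above. A quick standalone check (the column names are illustrative, not from a real report):

```python
import re

# Reduce PD abundance headers to the sample metadata that follows the fraction ID;
# non-abundance columns such as "Accession" are skipped
cols = ["Accession", "Abundance: F1: Sample, A, 60min", "Abundance: F2: Sample, B, 60min"]
names = []
for col in cols:
    match = re.match(r"Abundance: F\d+: (.+)", col)
    if match:
        names.append(match.group(1).strip())

print(names)  # ['Sample, A, 60min', 'Sample, B, 60min']
```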

import_data

import_data(source_type: str, **kwargs)

Unified wrapper for importing data into a pAnnData object.

This function routes to a specific import handler based on the source_type, such as Proteome Discoverer or DIA-NN. It parses protein/peptide expression data and associated sample metadata, returning a fully initialized pAnnData object.

Parameters:

  • source_type (str): The input tool or data source (matched case-insensitively). Required. Supported values:

      • 'pd', 'proteomeDiscoverer', 'proteome_discoverer', 'pd2.5', 'pd24':
        → Uses import_proteomeDiscoverer().
        Required kwargs:
          • prot_file (str): Path to protein-level report file
          • obs_columns (list of str): Columns to extract for .obs
        Optional kwargs:
          • pep_file (str): Path to peptide-level report file

      • 'diann', 'dia-nn':
        → Uses import_diann().
        Required kwargs:
          • report_file (str): Path to DIA-NN report file
          • obs_columns (list of str): Columns to extract for .obs

      • 'fragpipe', 'fp': Not yet implemented
      • 'spectronaut', 'sn': Not yet implemented

  • **kwargs: Additional keyword arguments forwarded to the relevant import function.

Returns:

  pAnnData: A populated pAnnData object with .prot, .pep, .summary, and identifier mappings.

Example

Importing Proteome Discoverer output for single-cell data:

obs_columns = ['Sample', 'method', 'duration', 'cell_line']
pdata_untreated_sc = import_data(
    source_type='pd',
    prot_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_prot_Proteins.txt',
    pep_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_pep_PeptideGroups.txt',
    obs_columns=obs_columns
)

Importing PD output for bulk data from an Excel file:

obs_columns = ['Sample', 'cell_line']
pdata_bulk = import_data(
    source_type='pd',
    prot_file='HCT116 resistance_20230601_pdoutput.xlsx',
    obs_columns=obs_columns
)

Note

If obs_columns is not provided and filename formats are inconsistent, fallback parsing is applied with generic columns ("File", "parsingType").

Source code in src/scpviz/pAnnData/io.py
def import_data(source_type: str, **kwargs):
    """
    Unified wrapper for importing data into a `pAnnData` object.

    This function routes to a specific import handler based on the `source_type`,
    such as Proteome Discoverer or DIA-NN. It parses protein/peptide expression data
    and associated sample metadata, returning a fully initialized `pAnnData` object.

    Args:
        source_type (str): The input tool or data source. Supported values:

            - `'pd'`, `'proteomeDiscoverer'`, `'proteome_discoverer'`, `'pd2.5'`, `'pd24'` (matched case-insensitively):  
              → Uses `import_proteomeDiscoverer()`.  
              Required kwargs:
                - `prot_file` (str): Path to protein-level report file
                - `obs_columns` (list of str): Columns to extract for `.obs`
              Optional kwargs:
                - `pep_file` (str): Path to peptide-level report file

            - `'diann'`, `'dia-nn'`:  
              → Uses `import_diann()`.  
              Required kwargs:
                - `report_file` (str): Path to DIA-NN report file
                - `obs_columns` (list of str): Columns to extract for `.obs`

            - `'fragpipe'`, `'fp'`: Not yet implemented  
            - `'spectronaut'`, `'sn'`: Not yet implemented

        **kwargs: Additional keyword arguments forwarded to the relevant import function.

    Returns:
        pAnnData: A populated pAnnData object with `.prot`, `.pep`, `.summary`, and identifier mappings.

    Example:
        Importing Proteome Discoverer output for single-cell data:
            ```python
            obs_columns = ['Sample', 'method', 'duration', 'cell_line']
            pdata_untreated_sc = import_data(
                source_type='pd',
                prot_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_prot_Proteins.txt',
                pep_file='data/202312_untreated/Marion_20231218_OTE_Aur60min_CBR_pep_PeptideGroups.txt',
                obs_columns=obs_columns
            )
            ```

        Importing PD output for bulk data from an Excel file:
            ```python
            obs_columns = ['Sample', 'cell_line']
            pdata_bulk = import_data(
                source_type='pd',
                prot_file='HCT116 resistance_20230601_pdoutput.xlsx',
                obs_columns=obs_columns
            )
            ```

    Note:
        If `obs_columns` is not provided and filename formats are inconsistent,
        fallback parsing is applied with generic columns (`"File"`, `"parsingType"`).
    """

    print(f"{format_log_prefix('user')} Importing data of type [{source_type}]")

    source_type = source_type.lower()
    obs_columns = kwargs.get('obs_columns', None)
    if obs_columns is None:
        source = kwargs.get('report_file', kwargs.get('prot_file'))
        delimiter = kwargs.get('delimiter')
        format_info, fallback_columns, fallback_obs = resolve_obs_columns(source, source_type, delimiter=delimiter)

        if format_info["uniform"]:
            # Prompt user to rerun with obs_columns
            return None
        else:
            # non-uniform format, use fallback obs
            kwargs["obs_columns"] = fallback_columns
            kwargs["obs"] = fallback_obs

    if source_type in ['diann', 'dia-nn']:
        return _import_diann(**kwargs)

    elif source_type in ['pd', 'proteomediscoverer', 'proteome_discoverer', 'pd2.5', 'pd24']:
        return _import_proteomeDiscoverer(**kwargs)

    elif source_type in ['fragpipe', 'fp']:
        raise NotImplementedError("FragPipe import is not yet implemented. Stay tuned!")

    elif source_type in ['spectronaut', 'sn']:
        raise NotImplementedError("Spectronaut import is not yet implemented. Stay tuned!")

    else:
        raise ValueError(f"{format_log_prefix('error')} Unsupported import source: '{source_type}'. "
                         "Valid options: 'diann', 'proteomeDiscoverer', 'fragpipe', 'spectronaut'.")

import_diann

import_diann(report_file: Optional[str] = None, obs_columns: Optional[List[str]] = None, delimiter: Optional[str] = '_', prot_value: str = 'PG.MaxLFQ', pep_value: str = 'Precursor.Normalised', prot_var_columns: List[str] = ['Genes', 'Master.Protein'], pep_var_columns: List[str] = ['Genes', 'Protein.Group', 'Precursor.Charge', 'Modified.Sequence', 'Stripped.Sequence', 'Precursor.Id', 'All Mapped Proteins', 'All Mapped Genes'], **kwargs)

Import DIA-NN output into a pAnnData object.

This function parses a DIA-NN report file and separates protein- and peptide-level expression matrices using the specified abundance and metadata columns.

Parameters:

  • report_file (str): Path to the DIA-NN report file. Required.
  • obs_columns (list of str, optional): List of metadata columns to extract from the filename for .obs. Default: None.
  • delimiter (str, optional): Character used to split filenames into metadata tokens for .obs. Default: '_'.
  • prot_value (str): Column name in DIA-NN output to use for protein quantification. Default: 'PG.MaxLFQ'.
  • pep_value (str): Column name in DIA-NN output to use for peptide quantification. Default: 'Precursor.Normalised'.
  • prot_var_columns (list of str): Columns from the protein group table to store in .prot.var. Default includes gene and master protein annotations.
  • pep_var_columns (list of str): Columns from the precursor table to store in .pep.var. Default includes peptide sequence, precursor ID, and mapping annotations.
  • **kwargs: Additional keyword arguments passed to import_data().

Returns:

  pAnnData: A populated object with .prot, .pep, .summary, and identifier mappings.

Example

To import data from a DIA-NN report file:

obs_columns = ['Sample', 'treatment', 'replicate']
pdata = import_diann(
    report_file='data/project_diaNN_output.tsv',
    obs_columns=obs_columns,
    prot_value='PG.MaxLFQ',
    pep_value='Precursor.Normalised'
)

Note
  • DIA-NN report should contain both protein group and precursor-level information.
  • Metadata columns in filenames must be consistently formatted to extract .obs.
Source code in src/scpviz/pAnnData/io.py
def import_diann(report_file: Optional[str] = None, obs_columns: Optional[List[str]] = None, delimiter: Optional[str] = '_', prot_value: str = 'PG.MaxLFQ', pep_value: str = 'Precursor.Normalised', prot_var_columns: List[str] = ['Genes', 'Master.Protein'], pep_var_columns: List[str] = ['Genes', 'Protein.Group', 'Precursor.Charge', 'Modified.Sequence', 'Stripped.Sequence', 'Precursor.Id', 'All Mapped Proteins', 'All Mapped Genes'], **kwargs):
    """
    Import DIA-NN output into a `pAnnData` object.

    This function parses a DIA-NN report file and separates protein- and peptide-level expression matrices
    using the specified abundance and metadata columns.

    Args:
        report_file (str): Path to the DIA-NN report file (required).
        obs_columns (list of str): List of metadata columns to extract from the filename for `.obs`.
        delimiter (str): Character used to split filenames into metadata tokens for `.obs`. Default: `'_'`.
        prot_value (str): Column name in DIA-NN output to use for protein quantification.
            Default: `'PG.MaxLFQ'`.
        pep_value (str): Column name in DIA-NN output to use for peptide quantification.
            Default: `'Precursor.Normalised'`.
        prot_var_columns (list of str): Columns from the protein group table to store in `.prot.var`.
            Default includes gene and master protein annotations.
        pep_var_columns (list of str): Columns from the precursor table to store in `.pep.var`.
            Default includes peptide sequence, precursor ID, and mapping annotations.
        **kwargs: Additional keyword arguments passed to `import_data()`.

    Returns:
        pAnnData: A populated object with `.prot`, `.pep`, `.summary`, and identifier mappings.

    Example:
        To import data from a DIA-NN report file:
            ```python
            obs_columns = ['Sample', 'treatment', 'replicate']
            pdata = import_diann(
                report_file='data/project_diaNN_output.tsv',
                obs_columns=obs_columns,
                prot_value='PG.MaxLFQ',
                pep_value='Precursor.Normalised'
            )
            ```

    Note:
        - DIA-NN report should contain both protein group and precursor-level information.
        - Metadata columns in filenames must be consistently formatted to extract `.obs`.
    """
    return import_data(source_type='diann', report_file=report_file, obs_columns=obs_columns, delimiter=delimiter, prot_value=prot_value, pep_value=pep_value, prot_var_columns=prot_var_columns, pep_var_columns=pep_var_columns, **kwargs)

import_proteomeDiscoverer

import_proteomeDiscoverer(prot_file: Optional[str] = None, pep_file: Optional[str] = None, obs_columns: Optional[List[str]] = ['sample'], **kwargs)

Import Proteome Discoverer (PD) output into a pAnnData object.

This is a convenience wrapper for import_data(source_type='pd'). It loads protein- and optionally peptide-level expression data from PD report files and parses sample metadata columns.

Parameters:

  • prot_file (str): Path to the protein-level report file. Required.
  • pep_file (str, optional): Path to the peptide-level report file. Optional but recommended. Default: None.
  • obs_columns (list of str): List of columns to extract for .obs. These should match metadata tokens embedded in the filenames (e.g. sample, condition, replicate). Default: ['sample'].
  • **kwargs: Additional keyword arguments passed to import_data().

Returns:

  pAnnData: A populated object with .prot, .pep (if provided), .summary, and identifier mappings.

Example

To import data from Proteome Discoverer:

obs_columns = ['Sample', 'condition', 'cell_line']
pdata = import_proteomeDiscoverer(
    prot_file='my_project/proteins.txt',
    pep_file='my_project/peptides.txt',
    obs_columns=obs_columns
)

Note
  • If pep_file is omitted, the resulting pAnnData will not include .pep or an RS matrix.
  • If filename structure is inconsistent and obs_columns cannot be inferred, fallback columns are used.
Source code in src/scpviz/pAnnData/io.py
def import_proteomeDiscoverer(prot_file: Optional[str] = None, pep_file: Optional[str] = None, obs_columns: Optional[List[str]] = ['sample'], **kwargs):
    """
    Import Proteome Discoverer (PD) output into a `pAnnData` object.

    This is a convenience wrapper for `import_data(source_type='pd')`. It loads protein- and optionally peptide-level
    expression data from PD report files and parses sample metadata columns.

    Args:
        prot_file (str): Path to the protein-level report file (required).
        pep_file (str, optional): Path to the peptide-level report file (optional but recommended).
        obs_columns (list of str): List of columns to extract for `.obs`. These should match metadata tokens
            embedded in the filenames (e.g. sample, condition, replicate).
        **kwargs: Additional keyword arguments passed to `import_data()`.

    Returns:
        pAnnData: A populated object with `.prot`, `.pep` (if provided), `.summary`, and identifier mappings.

    Example:
        To import data from Proteome Discoverer:
            ```python
            obs_columns = ['Sample', 'condition', 'cell_line']
            pdata = import_proteomeDiscoverer(
                prot_file='my_project/proteins.txt',
                pep_file='my_project/peptides.txt',
                obs_columns=obs_columns
            )
            ```

    Note:
        - If `pep_file` is omitted, the resulting `pAnnData` will not include `.pep` or an RS matrix.
        - If filename structure is inconsistent and `obs_columns` cannot be inferred, fallback columns are used.
    """
    return import_data(source_type='pd', prot_file=prot_file, pep_file=pep_file, obs_columns=obs_columns, **kwargs)

resolve_obs_columns

resolve_obs_columns(source: str, source_type: str, delimiter: Optional[str] = None) -> Tuple[Dict[str, Any], Optional[List[str]], Optional[pd.DataFrame]]

Resolve observation columns from sample filenames or metadata fields.

This function attempts to infer sample-level metadata (.obs) from filenames or a report file (DIA-NN or Proteome Discoverer). It classifies tokens using regex patterns and known metadata heuristics.

Parameters:

  • source (str): Path to the report file (DIA-NN or PD). Required.
  • source_type (str): Source type — one of {'diann', 'pd'}. Required.
  • delimiter (str, optional): Delimiter used to split filename tokens. If None, auto-inferred. Default: None.

Returns:

Tuple[dict, list of str or None, pd.DataFrame or None]: A tuple of:

  • metadata (dict): Metadata extracted during parsing, including fallback flags.
  • suggested_obs (list of str or None): Suggested observation column names, or None if inconsistent format.
  • obs_df (pd.DataFrame or None): Parsed observation DataFrame.

Note

If filename formats are inconsistent across samples, the fallback .obs will include:

  • A generic 'File' column numbering the samples
  • A 'parsingType' column indicating each sample's filename structure

Example

Inferring observation columns from a PD file:

resolve_obs_columns('filepaths/pd_report.xlsx', source_type='pd')

Inferring from a DIA-NN report with custom delimiter:

resolve_obs_columns('filepaths/diann.tsv', source_type='diann', delimiter='_')

Source code in src/scpviz/pAnnData/io.py
def resolve_obs_columns(source: str, source_type: str, delimiter: Optional[str] = None) -> Tuple[Dict[str, Any], Optional[List[str]], Optional[pd.DataFrame]]:
    """
    Resolve observation columns from sample filenames or metadata fields.

    This function attempts to infer sample-level metadata (`.obs`) from filenames
    or a report file (DIA-NN or Proteome Discoverer). It classifies tokens using 
    regex patterns and known metadata heuristics.

    Args:
        source (str): Path to the report file (DIA-NN or PD).
        source_type (str): Source type — one of {'diann', 'pd'}.
        delimiter (str, optional): Delimiter used to split filename tokens. If None, auto-inferred.

    Returns:
        Tuple[dict, list[str] or None, pd.DataFrame or None]: A tuple of:

        - **metadata** (dict): Metadata extracted during parsing, including fallback flags.
        - **suggested_obs** (list of str or None): Suggested observation column names, or None if inconsistent format.
        - **obs_df** (pd.DataFrame or None): Parsed observation DataFrame.

    Note:
        If filename formats are inconsistent across samples, the fallback `.obs` will include:
        - A generic 'File' column numbering the samples
        - A 'parsingType' column indicating parsing structure

    Example:
        Inferring observation columns from a PD file:
            ```python
            resolve_obs_columns('filepaths/pd_report.xlsx', source_type='pd')
            ```

        Inferring from a DIA-NN report with custom delimiter:
            ```python
            resolve_obs_columns('filepaths/diann.tsv', source_type='diann', delimiter='_')
            ```
    """

    filenames = get_filenames(source, source_type=source_type)
    if not filenames:
        raise ValueError(f"{format_log_prefix('error')} No sample filenames could be extracted from the provided source: {source}.")

    if delimiter is None:
        first_fname = filenames[0]
        all_delims = re.findall(r'[^A-Za-z0-9]', first_fname)
        delimiter = Counter(all_delims).most_common(1)[0][0] if all_delims else '_'
        print(f"      Auto-detecting '{delimiter}' as delimiter from first filename.")

    format_info = analyze_filename_formats(filenames, delimiter=delimiter)

    if format_info["uniform"]:
        # Uniform format — suggest obs_columns using classification
        print(f"{format_log_prefix('info_only')} Filenames are uniform. Using `suggest_obs_columns()` to recommend obs_columns...")
        obs_columns = suggest_obs_columns(filenames=filenames, source_type=source_type, delimiter=delimiter)
        print(f"{format_log_prefix('warn')} Please review the suggested `obs_columns` above.")
        print("   → If acceptable, rerun `import_data(..., obs_columns=...)` with this list.\n")
        return format_info, obs_columns, None
    else:
        # Non-uniform format — return fallback DataFrame
        print(f"{format_log_prefix('warn',indent=2)} {len(format_info['n_tokens'])} different filename formats detected. Proceeding with fallback `.obs` structure... (File Number, Parsing Type)")

        obs = pd.DataFrame({
            "File": list(range(1, len(filenames) + 1)),
            "parsingType": [format_info['group_map'][fname] for fname in filenames]
        })
        obs_columns = ["File", "parsingType"]
        return format_info, obs_columns, obs
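
The delimiter auto-detection above simply counts non-alphanumeric characters in the first filename and picks the most common one:

```python
import re
from collections import Counter

# Most frequent non-alphanumeric character wins; '_' beats '-' here (3 vs 2)
fname = "2024-01-15_HeLa_60min_DIA"
all_delims = re.findall(r'[^A-Za-z0-9]', fname)
delimiter = Counter(all_delims).most_common(1)[0][0] if all_delims else '_'
print(delimiter)  # '_'
```

Because only the first filename is inspected, runs whose first file uses an atypical separator may need an explicit `delimiter=` argument.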

suggest_obs_columns

suggest_obs_columns(source=None, source_type=None, filenames=None, delimiter=None)

Suggest .obs column names based on parsed sample names.

This function analyzes filenames or run names extracted from Proteome Discoverer or DIA-NN reports and attempts to identify consistent metadata fields. These fields may include gradient, amount, cell_line, or well_position, depending on naming conventions and regular expression matches.

Parameters:

  • source (str or Path, optional): Path to a DIA-NN or PD output file. Default: None.
  • source_type (str, optional): Type of the input file. Supports 'diann' or 'pd'. If not provided, inferred from filename or fallback heuristics. Default: None.
  • filenames (list of str, optional): List of sample file names or run labels to parse. If provided, bypasses file loading. Default: None.
  • delimiter (str, optional): Delimiter to use for tokenizing filenames (e.g., ',', '_'). If not specified, inferred automatically. Default: None.

Returns:

  list of str: Suggested list of metadata column names to assign to .obs.

Example

To suggest observation columns from a file:

suggest_obs_columns("my_experiment_PD.txt", source_type="pd")

Returns the suggested column list:

['Sample', 'gradient', 'cell_line', 'duration']

Note

This function is typically used as part of the .import_data() flow when filenames embed experimental metadata.

Source code in src/scpviz/pAnnData/io.py
def suggest_obs_columns(source=None, source_type=None, filenames=None, delimiter=None):
    """
    Suggest `.obs` column names based on parsed sample names.

    This function analyzes filenames or run names extracted from Proteome Discoverer
    or DIA-NN reports and attempts to identify consistent metadata fields. These fields
    may include `gradient`, `amount`, `cell_line`, or `well_position`, depending on
    naming conventions and regular expression matches.

    Args:
        source (str or Path, optional): Path to a DIA-NN or PD output file.
        source_type (str, optional): Type of the input file. Supports `'diann'` or `'pd'`.
            If not provided, inferred from filename or fallback heuristics.
        filenames (list of str, optional): List of sample file names or run labels to parse.
            If provided, bypasses file loading.
        delimiter (str, optional): Delimiter to use for tokenizing filenames (e.g., `','`, `'_'`).
            If not specified, will be inferred automatically.

    Returns:
        list of str: Suggested list of metadata column names to assign to `.obs`.

    Example:
        To suggest observation columns from a file:
            ```python
            suggest_obs_columns("my_experiment_PD.txt", source_type="pd")
            ```

        Returns the suggested column list:
            ```
            ['Sample', 'gradient', 'cell_line', 'duration']
            ```

    Note:
        This function is typically used as part of the `.import_data()` flow
        when filenames embed experimental metadata.
    """
    from pathlib import Path
    from collections import Counter

    if filenames is None:
        if source is None or source_type is None:
            raise ValueError("If `filenames` is not provided, both `source` and `source_type` must be specified.")
        source = Path(source)
        filenames = get_filenames(source, source_type=source_type)

    if not filenames:
        raise ValueError("No sample filenames could be extracted from the provided source.")

    # Pick the first filename for token analysis
    fname = filenames[0]

    # Infer delimiter if not provided
    if delimiter is None:
        all_delims = re.findall(r'[^A-Za-z0-9]', fname)
        delimiter = Counter(all_delims).most_common(1)[0][0] if all_delims else '_'
        print(f"Auto-detecting '{delimiter}' as delimiter.")

    if source_type == 'pd':
        # Custom comma-based parsing for PD
        match = re.match(r'Abundance: (F\d+): (.+)', f"Abundance: F1: {fname}")
        if match:
            _, meta = match.groups()
            raw_tokens = [t.strip() for t in meta.split(',') if t.strip().lower() != 'n/a']
            fname = ', '.join(raw_tokens)  # for clean display later
            tokens = raw_tokens
            delimiter = ','
        else:
            raise ValueError(f"Could not parse metadata from PD filename: {fname}")

    else:
        # --- Generic tokenization for DIA-NN or other formats ---
        tokens = [t.strip() for t in fname.split(delimiter) if t.strip()]

    # --- Classify tokens ---
    suggestion = {}
    obs_columns = []
    token_label_map = []
    multi_matched_tokens = []
    unrecognized_tokens = []

    for tok in tokens:
        labels = classify_subtokens(tok)
        label = labels[0]
        if label == "unknown??":
            obs_columns.append(f"<{tok}?>")
        else:
            obs_columns.append(label)
        token_label_map.append((tok, labels))
        if label != "unknown??" and label not in suggestion:
            suggestion[label] = tok
        if "unknown??" in labels:
            unrecognized_tokens.append(tok)
        elif len(labels) > 1:
            multi_matched_tokens.append((labels, tok))

    # --- Print suggestions ---
    print(f"\nFrom filename: {fname}")
    print("Suggested .obs columns:")
    for tok, labels in token_label_map:
        print(f"  {' OR '.join(labels):<26}: {tok}")
    if multi_matched_tokens:
        print(f"\nMultiple matched token(s): {[t for _, t in multi_matched_tokens]}")
    if unrecognized_tokens:
        print(f"Unrecognized token(s): {unrecognized_tokens}")
    if multi_matched_tokens or unrecognized_tokens:
        print("Please manually label these.")

    print(f"\n{format_log_prefix('info_only')} Suggested obs:\nobs_columns = {obs_columns}")

    return obs_columns
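
How unrecognized tokens surface in the suggestion can be sketched with a stubbed classifier (the real `classify_subtokens` does keyword and regex matching; the stub here exists only so the snippet runs standalone):

```python
# Sketch of the placeholder logic above: unknown tokens become "<token?>" columns.
def classify_subtokens(tok):
    # Hypothetical stub standing in for the real classifier
    known = {"60min": ["gradient"], "DIA": ["acquisition"]}
    return known.get(tok, ["unknown??"])

tokens = ["HeLa", "60min", "DIA"]
obs_columns = []
for tok in tokens:
    label = classify_subtokens(tok)[0]
    obs_columns.append(f"<{tok}?>" if label == "unknown??" else label)

print(obs_columns)  # ['<HeLa?>', 'gradient', 'acquisition']
```

Placeholder entries like `<HeLa?>` are the cue to replace them with real column names before rerunning `import_data(..., obs_columns=...)`.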