Data Inputs¶

Champollion expects a paired bridge during fit and unpaired modality-specific data during transport.

The model is modular with respect to the single-cell representations used as input. Champollion does not impose a preprocessing pipeline: the main representations can be raw feature matrices, normalized features, PCA or LSI embeddings, or any low-dimensional cell embedding learned by another model. This lets users choose the representation that best matches the biological question and the dataset scale.

Bridge Cells¶

The bridge is passed as a MuData object. The two selected modalities must contain the same observations. If observation names match but are ordered differently, Champollion reorders the second modality to match the first.

model.fit(
    mdata_bridge,
    modality_1="rna",
    modality_2="atac",
    x_1_rep="X_pca",
    x_2_rep="X_lsi",
)
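The reordering rule for bridge cells can be illustrated with plain NumPy. This is a minimal sketch of the idea, not Champollion's internal code: `align_to_first` is a hypothetical helper, and raising an error on duplicated or mismatched observation names is an assumption.

```python
import numpy as np


def align_to_first(names_1, names_2, x_2):
    """Reorder the rows of x_2 so they follow the order of names_1.

    Hypothetical helper for illustration; not part of Champollion's API.
    """
    names_1 = np.asarray(names_1).tolist()
    names_2 = np.asarray(names_2).tolist()
    if len(set(names_1)) != len(names_1) or len(set(names_2)) != len(names_2):
        raise ValueError("observation names must be unique")
    if set(names_1) != set(names_2):
        raise ValueError("both modalities must contain the same observations")
    # Map each observation name to its row in the second modality.
    row_of = {name: i for i, name in enumerate(names_2)}
    order = np.array([row_of[name] for name in names_1])
    return np.asarray(x_2)[order]


rna_names = ["cell_a", "cell_b", "cell_c"]
atac_names = ["cell_c", "cell_a", "cell_b"]
atac_x = np.array([[30.0, 31.0], [10.0, 11.0], [20.0, 21.0]])

# Rows of atac_aligned now follow the order of rna_names.
atac_aligned = align_to_first(rna_names, atac_names, atac_x)
```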

Unpaired Cells¶

After fitting, transport inputs are passed as a dictionary keyed by the same modality names used during fit.

result = model.transport(
    {"rna": adata_rna, "atac": adata_atac},
)

Champollion deliberately requires modality names at transport time to avoid accidentally swapping modalities.

Representations¶

Representations can be specified with:

"X" for adata.X
"layers/counts" for adata.layers["counts"]
"obsm/X_pca" for adata.obsm["X_pca"]
a shorthand key such as "X_pca" when it is unambiguous

For X and layers, feature names are read from adata.var_names. For obsm, feature names are generated as <rep>_<idx> unless explicit names are provided with feature_names.
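The lookup rules above can be sketched as a small resolver. Everything here is illustrative: `resolve_rep` is a hypothetical function, the shorthand rule (accept a bare key only if exactly one of `layers` or `obsm` contains it) is an assumption about what "unambiguous" means, and the zero-based index in the generated names is also an assumption. A `SimpleNamespace` stands in for an AnnData object so the example has no dependencies beyond NumPy.

```python
from types import SimpleNamespace

import numpy as np


def resolve_rep(adata, rep, feature_names=None):
    """Return (matrix, feature_names) for a representation key.

    Hypothetical resolver for illustration; Champollion's own lookup may differ.
    """
    if rep == "X":
        return adata.X, list(adata.var_names)
    if "/" in rep:
        slot, key = rep.split("/", 1)
    else:
        # Assumed shorthand rule: exactly one slot may contain the key.
        slots = [s for s in ("layers", "obsm") if rep in getattr(adata, s)]
        if len(slots) != 1:
            raise KeyError(f"{rep!r} is missing or ambiguous")
        slot, key = slots[0], rep
    if slot == "layers":
        return adata.layers[key], list(adata.var_names)
    if slot == "obsm":
        matrix = adata.obsm[key]
        if feature_names is None:
            # Assumed convention: <rep>_<idx> with a zero-based index.
            feature_names = [f"{key}_{i}" for i in range(matrix.shape[1])]
        return matrix, list(feature_names)
    raise KeyError(f"unknown slot {slot!r}")


# Stand-in exposing the same attribute names as an AnnData object.
adata = SimpleNamespace(
    X=np.zeros((3, 2)),
    var_names=["gene_a", "gene_b"],
    layers={"counts": np.ones((3, 2))},
    obsm={"X_pca": np.zeros((3, 4))},
)

_, layer_names = resolve_rep(adata, "layers/counts")
_, pca_names = resolve_rep(adata, "X_pca")
```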

Preprocessing¶

Champollion accepts any representation that can be stored in an AnnData object, so preprocessing can be adapted to the dataset, modality, and downstream interpretation goals. In the experiments reported in the paper, we used the following choices.

For RNA/ATAC integration, both modalities were log-normalized, scaled with scanpy.pp.scale, and embedded with PCA, because ATAC data is extremely high-dimensional. In a separate case study, we used DRVI embeddings for both modalities instead of PCA-based representations.
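For readers who want to see the three steps spelled out, here is a NumPy-only sketch of log-normalization, feature scaling, and PCA. It is not the pipeline used in the paper: the target sum of 1e4, the sample standard deviation in the scaling step, the SVD-based PCA, and the toy Poisson counts are all assumptions made for illustration.

```python
import numpy as np


def log_normalize(counts, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # empty cells stay all-zero
    return np.log1p(counts / totals * target_sum)


def scale_features(x):
    """Center each feature and divide by its sample standard deviation."""
    x = x - x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # constant features stay at zero
    return x / std


def pca_embed(x, n_components):
    """Project centered data onto its leading principal components via SVD."""
    x = x - x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 200))  # toy cell-by-feature counts
x_pca = pca_embed(scale_features(log_normalize(counts)), n_components=10)
```

The resulting matrix could then be stored in `adata.obsm` and referenced through a key such as "obsm/X_pca".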

For RNA/ADT integration with a CITE-seq bridge, ADT counts were normalized with Muon’s implementation of centered log-ratio (CLR) normalization. RNA was log-normalized and scaled, then represented either with PCA or with 4,000 highly variable genes when direct feature-level interpretability was preferred.
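As a point of reference, one widely used log1p-based form of the CLR transformation can be written in a few lines of NumPy. This is a sketch of that variant only; the axis along which the reference mean is taken (per cell here) is an assumption, so results may differ from the Muon call used in the experiments.

```python
import numpy as np


def clr(counts, axis=1):
    """CLR variant: log1p(x / exp(mean(log1p(x)))) taken along `axis`."""
    counts = np.asarray(counts, dtype=float)
    log_reference = np.log1p(counts).mean(axis=axis, keepdims=True)
    return np.log1p(counts / np.exp(log_reference))


# Toy cell-by-protein ADT counts.
adt_counts = np.array([[10.0, 0.0, 5.0], [2.0, 2.0, 2.0]])
adt_clr = clr(adt_counts)  # axis=1: reference computed within each cell
```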

Prior Representations¶

Champollion can add a prior cost term based on prior representations for the two modalities. Prior representations provide a common ground in which cells from different assays can be compared directly, complementing the learned cross-modal cost. This external knowledge helps guide the matching and improves robustness, particularly when the bridge data alone is insufficient for a reliable integration. In practice, priors often come from sparse known connections between features across modalities: ATAC peaks can be mapped to nearby genes through gene activities, transcripts can be paired with their encoded proteins, or other feature-level links can be defined from biological knowledge.

model.fit(
    mdata_bridge,
    modality_1="rna",
    modality_2="atac",
    x_1_rep="X_pca",
    x_2_rep="X_lsi",
    y_prior_1_rep="X_prior",
    y_prior_2_rep="X_prior",
)

If priors are used during fit, transport expects matching prior representations. Pass them through y_prior_reps, keyed by modality name, unless the representation names given during fit should be reused:

result = model.transport(
    {"rna": adata_rna, "atac": adata_atac},
    y_prior_reps={"rna": "X_prior", "atac": "X_prior"},
)

For RNA/ATAC integration, we first used Signac to compute gene activities from ATAC profiles. To build the prior representation, we subset both the RNA profiles and the ATAC gene activities to their common genes, log-normalized them, concatenated the two cell-by-gene matrices, and ran PCA on the concatenated matrix. The gene subsetting applies only to the prior representations, not to the main representations.
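The construction of the RNA/ATAC prior can be sketched in NumPy, starting from an RNA count matrix and a precomputed gene-activity matrix. `shared_gene_prior` is a hypothetical helper, not part of Champollion; stacking the two matrices along the cell axis, the 1e4 target sum, the SVD-based PCA, and the toy data are assumptions made for illustration.

```python
import numpy as np


def shared_gene_prior(rna, rna_genes, activity, activity_genes, n_components=20):
    """Joint PCA of RNA and gene-activity profiles restricted to common genes."""
    _, idx_rna, idx_act = np.intersect1d(rna_genes, activity_genes, return_indices=True)
    blocks = []
    for x in (np.asarray(rna, float)[:, idx_rna], np.asarray(activity, float)[:, idx_act]):
        # Log-normalize each matrix on the common genes only.
        totals = x.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0
        blocks.append(np.log1p(x / totals * 1e4))
    stacked = np.vstack(blocks)  # cells from both assays, identical gene columns
    stacked = stacked - stacked.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    embedding = u[:, :n_components] * s[:n_components]
    n_rna = blocks[0].shape[0]
    return embedding[:n_rna], embedding[n_rna:]


rng = np.random.default_rng(0)
rna_genes = ["CD3E", "CD4", "CD8A", "MS4A1", "NKG7"]
activity_genes = ["MS4A1", "CD4", "GATA1", "CD3E"]
rna = rng.poisson(3.0, size=(30, 5))       # toy RNA counts
activity = rng.poisson(1.0, size=(40, 4))  # toy gene activities

prior_rna, prior_atac = shared_gene_prior(
    rna, rna_genes, activity, activity_genes, n_components=2
)
```

Each output could then be stored in the corresponding object's `obsm`, for example under the "X_prior" key used in the fit call above.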

For RNA/ADT integration, we manually mapped each surface protein to its coding gene using GeneCards, then restricted both modalities to the resulting gene-protein pairs. The prior cost was the correlation distance, 1 - r where r is Pearson’s correlation, between every cell in modality 1 and every cell in modality 2, computed with scipy.spatial.distance.cdist (Virtanen et al., 2020).

Prior representations are centered and scaled cell-wise before computing the correlation-based prior cost. Cells with zero norm are left unnormalized to avoid introducing missing values. The prior contribution is controlled by lambda_prior.
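The cell-wise standardization and the resulting cost can be sketched as follows. This is an illustration rather than Champollion's implementation: both function names are hypothetical, and taking the norm after centering is an assumption. For cells with nonzero variance, the result equals the correlation distance returned by scipy.spatial.distance.cdist with metric="correlation"; constant cells, which would yield NaN there, receive a finite cost of 1 here.

```python
import numpy as np


def standardize_cells(y):
    """Center each cell (row) and scale it to unit norm; zero-norm rows are kept as is."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(y, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # skip the division that would create NaNs
    return y / norms


def correlation_prior_cost(y_1, y_2):
    """Pairwise correlation distance 1 - r between cells of two modalities."""
    return 1.0 - standardize_cells(y_1) @ standardize_cells(y_2).T


y_rna = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [5.0, 5.0, 5.0]])
y_adt = np.array([[2.0, 4.0, 6.0], [1.0, 0.0, 1.0]])

# cost[i, j] compares RNA cell i with ADT cell j; the constant third RNA cell stays finite.
cost = correlation_prior_cost(y_rna, y_adt)
```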