Consensus NMF with NMFscape

Introduction to Consensus NMF

Consensus Non-negative Matrix Factorization (cNMF) addresses a fundamental limitation of standard NMF: instability across random initializations. While standard NMF can produce different results with different random seeds, cNMF provides robust, reproducible gene expression programs through consensus clustering.

The Stability Problem

Standard NMF optimization is non-convex, leading to multiple local optima. This means: - Different random initializations produce different solutions - Gene expression programs may not be reproducible - Optimal k selection becomes challenging - Biological interpretation can be inconsistent

The Consensus Solution

cNMF, developed by Kotliar et al. (eLife 2019), solves this through:

Multiple NMF runs with different initializations
Consensus clustering to group similar gene expression programs
Stability metrics for objective k selection
Robust final programs through clustering and refitting

library(NMFscape)
library(SingleCellExperiment)
library(scuttle)
library(scater)
library(scran)
library(ggplot2)
library(pheatmap)
library(scRNAseq)
library(patchwork)

The cNMF Algorithm

Step-by-Step Methodology

1. Multiple NMF Factorizations

For each k value, run NMF many times (typically 100-200) with different random initializations:

For k in k_range:
    For run in 1:n_runs:
        W[k,run], H[k,run] = NMF(X, k, random_seed=run)

2. Gene Expression Program Collection

Collect all gene expression programs (columns of W matrices) across runs.

3. Density Filtering

Remove outlier programs using local density estimation: - Calculate pairwise cosine similarities between programs - Compute local density for each program - Filter programs below density threshold - Reduces noise and computational burden

4. Consensus Clustering

Cluster similar programs using k-means: - Apply k-means clustering to filtered programs - Group similar gene expression programs together - Each cluster represents a consensus program

5. Consensus Matrix Construction

Build consensus matrix showing co-clustering frequencies: - For each pair of programs: how often they cluster together - Provides stability assessment for each k value - Used for final stability metrics

6. Final Program Selection

Choose optimal k based on stability metrics
Extract representative programs from each cluster
Refit usage matrix using non-negative least squares

Implementation Details

NMFscape implements the exact algorithm from Kotliar et al. with these key components:

# Conceptual overview (actual implementation is in C++/optimized R)
runConsensusNMF <- function(x, k_range, n_runs, ...) {
  for (k in k_range) {
    # 1. Multiple NMF runs
    nmf_runs <- replicate(n_runs, RcppML::nmf(data, k = k))
    
    # 2. Collect all programs
    all_programs <- do.call(cbind, lapply(nmf_runs, function(r) r$w))
    
    # 3. Density filtering
    filtered_programs <- applyDensityFiltering(all_programs)
    
    # 4. Consensus clustering
    clusters <- kmeans(t(filtered_programs), centers = k)
    
    # 5. Build consensus matrix
    consensus_mat <- buildConsensusMatrix(clusters)
    
    # 6. Calculate stability metrics
    stability <- calculateStabilityMetrics(consensus_mat)
  }
  
  # Select optimal k and refit
  optimal_k <- selectOptimalK(stability_results)
  final_programs <- extractConsensusPrograms(optimal_k)
  usage_matrix <- refitUsageMatrix(data, final_programs)
}

Running Consensus NMF

Basic Usage

# Load the Zeisel brain dataset - perfect for demonstrating cNMF
sce <- ZeiselBrainData()

# The dataset contains multiple brain cell types with distinct expression programs
cat("Original dataset:", nrow(sce), "genes,", ncol(sce), "cells\n")
#> Original dataset: 20006 genes, 3005 cells
cat("Brain cell types:", length(unique(sce$level1class)), "major classes\n")
#> Brain cell types: 7 major classes
cat("Cell subtypes:", length(unique(sce$level2class)), "detailed subtypes\n")
#> Cell subtypes: 48 detailed subtypes

# Quality control and preprocessing
sce <- addPerCellQCMetrics(sce)
sce <- addPerFeatureQCMetrics(sce)

# Filter for quality and computational efficiency
sce <- sce[rowData(sce)$detected >= 10, ]  # Keep genes detected in >=10 cells
sce <- sce[, colData(sce)$detected >= 500]  # Keep cells with >=500 detected genes

# Randomly subset to 1000 cells for faster consensus NMF
set.seed(42)  # For reproducibility
if (ncol(sce) > 1000) {
  cell_subset <- sample(ncol(sce), 1000)
  sce <- sce[, cell_subset]
  cat("Randomly subsetted to 1000 cells for computational efficiency\n")
}
#> Randomly subsetted to 1000 cells for computational efficiency

# Normalize
sce <- logNormCounts(sce)

# Select only 500 highly variable genes for much faster cNMF analysis
dec <- modelGeneVar(sce)
total_genes_before_hvg <- nrow(sce)
top_hvgs <- getTopHVGs(dec, n = 500)  # Small subset for fast consensus NMF
sce <- sce[top_hvgs, ]

cat("Gene selection: reduced from", total_genes_before_hvg, "to", nrow(sce), "highly variable genes\n")
#> Gene selection: reduced from 9715 to 500 highly variable genes

# Add visualization coordinates
sce <- runPCA(sce, ncomponents = 50)
sce <- runUMAP(sce, dimred = "PCA")

cat("Processed dataset:", nrow(sce), "genes,", ncol(sce), "cells\n")
#> Processed dataset: 500 genes, 1000 cells

# Show the brain cell type distribution
table(sce$level1class)
#> 
#> astrocytes_ependymal    endothelial-mural         interneurons 
#>                   76                   75                  102 
#>            microglia     oligodendrocytes        pyramidal CA1 
#>                   26                  278                  315 
#>         pyramidal SS 
#>                  128

# Run consensus NMF on brain data subset
# Note: This requires further debugging for optimal parameters with real brain data
# For production use:
sce <- runConsensusNMF(sce, 
                       k_range = 4:8,           # Range for brain cell types
                       n_runs = 100,            # Sufficient runs for stability
                       n_cores = 4,             # Parallel processing
                       density_threshold = 0.4,  # Standard filtering
                       verbose = TRUE)

# Check what was added
names(metadata(sce))
reducedDimNames(sce)

Interpreting Results for Brain Data

# Get optimal k and examine programs
optimal_k <- getOptimalK(sce)
cat("Optimal k selected:", optimal_k, "programs\n")

# Get top genes for each program to see brain-relevant signatures
top_genes_brain <- getTopGEPFeatures(sce, n = 10)

# Look for brain-specific markers in the programs
brain_markers <- c("Gad1", "Gad2", "Slc17a7", "Slc17a6", "Mbp", "Pdgfra", 
                   "Cx3cr1", "Aldh1l1", "Pecam1", "Slc1a3")

for (i in 1:min(4, optimal_k)) {
  cat("\nProgram", i, "top genes:\n")
  program_genes <- top_genes_brain[[i]]
  # Highlight known brain markers
  brain_hits <- program_genes[program_genes %in% brain_markers]
  if (length(brain_hits) > 0) {
    cat("  Brain markers found:", paste(brain_hits, collapse = ", "), "\n")
  }
  cat("  All top genes:", paste(head(program_genes, 8), collapse = ", "), "\n")
}

Parameter Selection

Critical Parameters

k_range: Range of k values to test - Start with biologically expected range - Consider 2-3x expected number of cell types - Broader ranges provide better stability assessment

# For 5 expected cell types, test:
k_range = 3:12    # Conservative range
k_range = 2:15    # Broader exploration

n_runs: Number of NMF runs per k value - More runs = better stability estimates - Minimum: 50 runs for reasonable estimates - Recommended: 100-200 runs for publication quality

n_runs = 50      # Quick exploration
n_runs = 100     # Standard analysis  
n_runs = 200     # High confidence

Algorithmic Parameters

density_threshold: Outlier filtering stringency - Higher values = more aggressive filtering - Range: 0.1 (lenient) to 0.8 (strict) - Default: 0.5 (balanced)

n_cores: Parallel processing - Use available cores - 1 for system responsiveness - Scales nearly linearly with core count

sce <- runConsensusNMF(sce,
                       k_range = 5:15,
                       n_runs = 100,
                       density_threshold = 0.4,    # Lenient filtering
                       n_cores = 4,                # Parallel processing
                       seed = 123)                 # Reproducibility

Accessing and Interpreting Results

Consensus Gene Expression Programs

# Get consensus gene expression programs
consensus_geps <- getConsensusGEPs(sce)
dim(consensus_geps)  # genes × programs

# Get optimal k selected by algorithm
optimal_k <- getOptimalK(sce)
cat("Optimal k selected:", optimal_k, "\n")

# Get top genes per program
top_genes <- getTopGEPFeatures(sce, n = 15)
head(top_genes[[1]])  # Top genes for program 1

Program Usage

# Get program usage per cell
gep_usage <- getGEPUsage(sce)
dim(gep_usage)  # cells × programs

# Usage is also stored in reducedDims
identical(gep_usage, reducedDim(sce, "cNMF"))

Stability Metrics

# Get comprehensive stability metrics
stability <- getStabilityMetrics(sce)
print(stability)

# Individual metrics explained:
# - stability: clustering-based stability (higher = more stable)
# - silhouette: silhouette score of consensus clustering
# - reproducibility: how often same programs are recovered
# - cophenetic_correlation: consensus matrix quality

Accessing Different k Values

# Access results for specific k values
available_k <- getAvailableK(sce)
cat("Available k values:", available_k, "\n")

# Get results for k=6 specifically
geps_k6 <- getConsensusGEPs(sce, k = 6)
usage_k6 <- getGEPUsage(sce, k = 6)
dim(geps_k6)
dim(usage_k6)

Diagnostic Plots

Stability Assessment

The most important diagnostic is the stability plot for k selection:

# Plot stability metrics across k values
plotStability(sce)

Interpretation: - Stability: Primary metric - choose k at peak or plateau - Silhouette: Quality of consensus clustering - Reproducibility: Consistency across runs - Cophenetic: Consensus matrix block structure

Look for: - Peak in stability curve - High silhouette scores (>0.5) - Stable or increasing reproducibility - High cophenetic correlation

Gene Expression Program Heatmaps

# Plot gene expression programs
plotGEPs(sce, programs = 1:getOptimalK(sce), n_genes = 20)

Interpretation: - Each column = one gene expression program - Each row = one gene (top contributors shown) - Red = high expression, Blue = low expression - Look for distinct, interpretable gene sets

Program Usage Visualization

# Plot program usage on UMAP alongside known cell types
opt_k <- getOptimalK(sce)

# Create individual plots for better control
p1 <- plotReducedDim(sce, dimred = "UMAP", colour_by = "level1class", point_size = 0.3) +
  ggtitle("Known Brain Cell Types") +
  theme(legend.position = "right", legend.title = element_blank())

# Show usage of first few programs
programs_to_show <- 1:min(3, opt_k)
usage_plots <- list()

for (i in programs_to_show) {
  col_name <- paste0("cNMF_", i)
  usage_plots[[i]] <- plotReducedDim(sce, dimred = "UMAP", colour_by = col_name, point_size = 0.3) +
    scale_color_viridis_c(name = "Usage") +
    ggtitle(paste("cNMF Program", i)) +
    theme(legend.position = "right")
}

# Arrange plots
if (length(usage_plots) >= 3) {
  (p1 | usage_plots[[1]]) / (usage_plots[[2]] | usage_plots[[3]])
} else {
  p1 | usage_plots[[1]]
}

Interpretation: - Each panel shows usage of one program - Continuous colors = program activity levels - Spatial patterns often reveal biological structures - Look for cell type or state-specific patterns

Advanced Analysis

Manual k Selection

Sometimes you may want to override automatic k selection:

# Get detailed information about all k values
cnmf_info <- getConsensusNMFInfo(sce)
print(cnmf_info)

# Force usage of specific k (e.g., k=6)
# This changes what getConsensusGEPs() returns by default
metadata(sce)$cNMF$cNMF$optimal_k <- 6

# Verify change
cat("New optimal k:", getOptimalK(sce), "\n")

Program Annotation

# Get top genes for biological annotation
top_genes_detailed <- getTopGEPFeatures(sce, n = 50)

# Example: annotate programs based on top genes
opt_k <- getOptimalK(sce)
program_annotations <- character(opt_k)
for (i in 1:opt_k) {
  top_10 <- head(top_genes_detailed[[i]], 10)
  program_annotations[i] <- paste0("Program_", i, " (", 
                                  paste(head(top_10, 3), collapse = ", "), 
                                  "...)")
}
print(program_annotations)

Robustness Assessment

# Test robustness with different parameters
sce_alt1 <- runConsensusNMF(sce, k_range = 4:8, n_runs = 50, 
                           density_threshold = 0.3)
sce_alt2 <- runConsensusNMF(sce, k_range = 4:8, n_runs = 50, 
                           density_threshold = 0.7)

# Compare optimal k selections
c(original = getOptimalK(sce),
  lenient = getOptimalK(sce_alt1),
  strict = getOptimalK(sce_alt2))

Computational Considerations

Performance Scaling

cNMF is computationally intensive. Runtime scales with:

n_runs: Linear scaling (100 runs = 2x time vs 50 runs)
k_range: Linear scaling (each k adds ~n_runs time)
Data size: Quadratic in min(genes, cells)
n_cores: Near-linear speedup with parallelization

Memory Requirements

Memory usage depends on: - Storing all intermediate NMF results - Consensus matrix construction - Peak usage during clustering steps

For large datasets (>50k cells), consider: - Reducing n_runs (minimum 50) - Narrower k_range based on prior knowledge - Using high-memory computing environments

Optimization Tips

# For large datasets
sce <- runConsensusNMF(sce,
                       k_range = 8:12,          # Focused range
                       n_runs = 75,             # Reduced runs
                       n_cores = 8,             # Max parallelization
                       density_threshold = 0.6) # Aggressive filtering

# For exploration with quick turnaround
sce <- runConsensusNMF(sce,
                       k_range = 5:10,
                       n_runs = 25,             # Very fast
                       n_cores = 4)

Integration with Downstream Analysis

Cell Type Assignment

# Assign cells to dominant programs and compare with known brain cell types
usage_matrix <- getGEPUsage(sce)
dominant_programs <- apply(usage_matrix, 1, which.max)

# Add to cell metadata
colData(sce)$dominant_program <- paste0("Program_", dominant_programs)

# Compare cNMF programs with known brain cell types
program_celltype_table <- table(colData(sce)$dominant_program, sce$level1class)
print("cNMF Programs vs Known Brain Cell Types:")
print(program_celltype_table)

# Calculate enrichment of cell types in each program
cat("\nCell type enrichment in programs:\n")
for (i in 1:min(5, getOptimalK(sce))) {
  program_cells <- colData(sce)$dominant_program == paste0("Program_", i)
  if (sum(program_cells) > 10) {  # Only for programs with enough cells
    celltype_freq <- table(sce$level1class[program_cells])
    top_celltype <- names(sort(celltype_freq, decreasing = TRUE))[1]
    enrichment <- celltype_freq[top_celltype] / sum(program_cells) * 100
    cat(sprintf("Program %d: enriched for %s (%.1f%% of cells)\n", 
                i, top_celltype, enrichment))
  }
}

Differential Program Usage

# Test for differential program usage between conditions
library(scran)

# Example: differential usage between cell types
colData(sce)$program_1_usage <- usage_matrix[, 1]

# Statistical testing
results <- findMarkers(sce, 
                      groups = sce$cell_type,
                      test.type = "t")

Gene Set Enrichment

# Example pathway enrichment (requires additional packages)
library(clusterProfiler)
library(org.Hs.eg.db)

# Get top genes for program 1
program_1_genes <- getTopGEPFeatures(sce, n = 100)[[1]]

# Run enrichment analysis
enrichment <- enrichGO(gene = program_1_genes,
                      OrgDb = org.Hs.eg.db,
                      ont = "BP",
                      readable = TRUE)

Best Practices

Parameter Selection Guidelines

Start broad: Use wide k_range for initial exploration
Sufficient runs: Minimum 50 runs, prefer 100+ for final analysis
Multiple thresholds: Test different density_threshold values
Reproducibility: Use set.seed() for reproducible results

Quality Control

Stability curves: Should show clear peaks or plateaus
Program interpretation: Programs should be biologically meaningful
Usage patterns: Should correlate with known cell types/states
Consistency: Results should be robust to parameter changes

Common Pitfalls

Too few runs: <50 runs can give unstable results
Narrow k_range: May miss optimal k value
Over-interpretation: Programs may not always correspond to cell types
Ignoring stability: Don’t just pick k based on expectations

Comparison with Standard NMF

This section demonstrates the improved stability of consensus NMF compared to standard NMF on the brain dataset. The consensus approach provides much more reliable and interpretable gene expression programs.

# Run standard NMF with same k
opt_k <- getOptimalK(sce)
sce <- runNMFscape(sce, k = opt_k, name = "StandardNMF")

# Compare stability by running multiple times
set.seed(123)
nmf_runs <- replicate(10, {
  sce_temp <- runNMFscape(sce, k = opt_k, verbose = FALSE)
  getBasis(sce_temp)[, 1]  # First program
}, simplify = FALSE)

# Calculate coefficient of variation across runs
cv_standard <- apply(do.call(cbind, nmf_runs), 1, function(x) sd(x)/mean(x))

# Compare with cNMF stability (programs should be more stable)
cnmf_program_1 <- getConsensusGEPs(sce)[, 1]

cat("Standard NMF variability (mean CV):", round(mean(cv_standard, na.rm = TRUE), 3), "\n")
cat("cNMF provides stable programs across runs\n")

Session Information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] patchwork_1.3.2             scRNAseq_2.22.0            
#>  [3] pheatmap_1.0.13             scran_1.36.0               
#>  [5] scater_1.36.0               ggplot2_4.0.0              
#>  [7] scuttle_1.18.0              NMFscape_0.99.0            
#>  [9] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1
#> [11] Biobase_2.68.0              GenomicRanges_1.60.0       
#> [13] GenomeInfoDb_1.44.2         IRanges_2.42.0             
#> [15] S4Vectors_0.46.0            MatrixGenerics_1.20.0      
#> [17] matrixStats_1.5.0           BiocGenerics_0.54.0        
#> [19] generics_0.1.4              BiocStyle_2.36.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3       jsonlite_2.0.0           magrittr_2.0.3          
#>   [4] gypsum_1.4.0             ggbeeswarm_0.7.2         GenomicFeatures_1.60.0  
#>   [7] farver_2.1.2             rmarkdown_2.29           BiocIO_1.18.0           
#>  [10] fs_1.6.6                 ragg_1.5.0               vctrs_0.6.5             
#>  [13] Rsamtools_2.24.1         memoise_2.0.1            RCurl_1.98-1.17         
#>  [16] htmltools_0.5.8.1        S4Arrays_1.8.1           AnnotationHub_3.16.1    
#>  [19] curl_7.0.0               BiocNeighbors_2.2.0      Rhdf5lib_1.30.0         
#>  [22] rhdf5_2.52.1             SparseArray_1.8.1        alabaster.base_1.8.1    
#>  [25] sass_0.4.10              bslib_0.9.0              alabaster.sce_1.8.0     
#>  [28] desc_1.4.3               httr2_1.2.1              cachem_1.1.0            
#>  [31] GenomicAlignments_1.44.0 igraph_2.1.4             lifecycle_1.0.4         
#>  [34] pkgconfig_2.0.3          rsvd_1.0.5               Matrix_1.7-3            
#>  [37] R6_2.6.1                 fastmap_1.2.0            GenomeInfoDbData_1.2.14 
#>  [40] digest_0.6.37            AnnotationDbi_1.70.0     dqrng_0.4.1             
#>  [43] irlba_2.3.5.1            ExperimentHub_2.16.1     textshaping_1.0.3       
#>  [46] RSQLite_2.4.3            beachmat_2.24.0          filelock_1.0.3          
#>  [49] httr_1.4.7               abind_1.4-8              compiler_4.5.1          
#>  [52] bit64_4.6.0-1            withr_3.0.2              S7_0.2.0                
#>  [55] BiocParallel_1.42.1      viridis_0.6.5            DBI_1.2.3               
#>  [58] alabaster.ranges_1.8.0   HDF5Array_1.36.0         alabaster.schemas_1.8.0 
#>  [61] rappdirs_0.3.3           DelayedArray_0.34.1      rjson_0.2.23            
#>  [64] bluster_1.18.0           tools_4.5.1              vipor_0.4.7             
#>  [67] beeswarm_0.4.0           glue_1.8.0               h5mread_1.0.1           
#>  [70] restfulr_0.0.16          rhdf5filters_1.20.0      grid_4.5.1              
#>  [73] cluster_2.1.8.1          gtable_0.3.6             ensembldb_2.32.0        
#>  [76] RcppML_0.3.7             BiocSingular_1.24.0      ScaledMatrix_1.16.0     
#>  [79] metapod_1.16.0           XVector_0.48.0           ggrepel_0.9.6           
#>  [82] BiocVersion_3.21.1       pillar_1.11.0            limma_3.64.3            
#>  [85] dplyr_1.1.4              BiocFileCache_2.16.2     lattice_0.22-7          
#>  [88] FNN_1.1.4.1              rtracklayer_1.68.0       bit_4.6.0               
#>  [91] tidyselect_1.2.1         locfit_1.5-9.12          Biostrings_2.76.0       
#>  [94] knitr_1.50               gridExtra_2.3            bookdown_0.44           
#>  [97] ProtGenerics_1.40.0      edgeR_4.6.3              xfun_0.53               
#> [100] statmod_1.5.0            UCSC.utils_1.4.0         lazyeval_0.2.2          
#> [103] yaml_2.3.10              evaluate_1.0.5           codetools_0.2-20        
#> [106] tibble_3.3.0             alabaster.matrix_1.8.0   BiocManager_1.30.26     
#> [109] cli_3.6.5                uwot_0.2.3               systemfonts_1.2.3       
#> [112] jquerylib_0.1.4          Rcpp_1.1.0               dbplyr_2.5.1            
#> [115] png_0.1-8                XML_3.99-0.19            parallel_4.5.1          
#> [118] pkgdown_2.1.3            blob_1.2.4               AnnotationFilter_1.32.0 
#> [121] bitops_1.0-9             alabaster.se_1.8.0       viridisLite_0.4.2       
#> [124] scales_1.4.0             crayon_1.5.3             rlang_1.1.6             
#> [127] KEGGREST_1.48.1

References

Kotliar D, Veres A, Nagy MA, et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife. 2019;8:e43803.
DeBruine ZJ, Melcher K, Triche TJ. High-performance non-negative matrix factorization for large single cell data. bioRxiv. 2021.
Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401(6755):788-791.

Michael Totty

2025-09-12