Published

June 13, 2025

Lattice data analysis – multivariate methods for imaging-based data

In this vignette we will show:

Multivariate lattice data analysis methods for imaging-based approaches.
This includes global metrics on the entire field of view and local variants thereof.
The use case is a CosMx data set from He et al. (2022).
The R implementations rely on the Voyager package. The data is represented as SpatialFeatureExperiment (Moses et al. 2023). Complementary resources using this data and methods are found in the Voyager CosMx vignette, Voyager bivariate vignette and Voyager multivariate vignette.
Python implementations rely on the packages esda, pysal and squidpy (Rey and Anselin 2010; Palla et al. 2022). Data representation relies on the anndata structure (Virshup et al. 2024).

Show the code

source("utils.R")

roma_colors <- data.frame(roma_colors = scico::scico(256, palette = 'roma'))
write.csv(roma_colors , "../misc/roma_colors.csv")

theme_set(theme_light())

Show the code

import numpy as np
import scanpy as sc
import squidpy as sq
from esda.moran import Moran_BV, Moran_Local_BV
from esda.lee import Spatial_Pearson, Spatial_Pearson_Local
from esda.geary_local_mv import Geary_Local_MV
from scipy.stats import false_discovery_control
from libpysal.cg import KDTree
from libpysal.weights import W, KNN
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
import warnings

warnings.filterwarnings("ignore")

df_cmap_continuous = pd.read_csv("../misc/roma_colors.csv", index_col=0)
cmap_continuous = LinearSegmentedColormap.from_list("roma", list(df_cmap_continuous["roma_colors"])).reversed()

For this representation of cells, we will rely on the SpatialFeatureExperiment package. For preprocessing of the dataset, we refer the reader to the vignette of the Voyager package.

Setup and Preprocessing

R
Python

Show the code

#taken from https://pachterlab.github.io/voyager/articles/vig4_cosmx.html
(sfe <- HeNSCLCData())

class: SpatialFeatureExperiment 
dim: 980 100290 
metadata(0):
assays(1): counts
rownames(980): AATK ABL1 ... NegPrb22 NegPrb23
rowData names(3): means vars cv2
colnames(100290): 1_1 1_2 ... 30_4759 30_4760
colData names(17): Area AspectRatio ... nCounts nGenes
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : CenterX_global_px CenterY_global_px
imgData names(1): sample_id

unit: full_res_image_pixels
Geometries:
colGeometries: centroids (POINT), cellSeg (POLYGON) 

Graphs:
sample01:

Show the code

# Empty cells
colData(sfe)$is_empty <- colData(sfe)$nCounts < 1
# Select, sum negative control probes
(neg_inds <- str_detect(rownames(sfe), "^NegPrb")) %>% sum

[1] 20

Show the code

colData(sfe)$prop_neg <- colSums(counts(sfe)[neg_inds,])/colData(sfe)$nCounts
# Remove low quality cells
sfe <- sfe[,!sfe$is_empty & sfe$prop_neg < 0.1]
# Re-calculate stats
rowData(sfe)$is_neg <- neg_inds
# log Counts
sfe <- logNormCounts(sfe)

# save for python
colData(sfe)$CenterX_global_px <- spatialCoords(sfe)[,1]
colData(sfe)$CenterY_global_px <- spatialCoords(sfe)[,2]

Show the code

sfe <- sfe[,st_intersects(colGeometries(sfe)$centroids, bbox_use, sparse = FALSE)]
ann <-  zellkonverter::SCE2AnnData(sfe, X_name = "counts")
anndata::write_h5ad(ann, "../data/imaging_sfe.h5ad")

Show the code

adata = sc.read_h5ad("../data/imaging_sfe.h5ad")
adata.obsm["spatial"] = np.column_stack([adata.obs["CenterX_global_px"], adata.obs["CenterY_global_px"]])
# normalise counts
adata.raw = adata.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
# invert the spatial coordinates
adata.obsm["spatial"][:, 1] = adata.obsm["spatial"][:, 1].max() - adata.obsm["spatial"][:, 1]
adata

AnnData object with n_obs × n_vars = 27205 × 980
    obs: 'Area', 'AspectRatio', 'Width', 'Height', 'Mean.MembraneStain', 'Max.MembraneStain', 'Mean.PanCK', 'Max.PanCK', 'Mean.CD45', 'Max.CD45', 'Mean.CD3', 'Max.CD3', 'Mean.DAPI', 'Max.DAPI', 'sample_id', 'nCounts', 'nGenes', 'is_empty', 'prop_neg', 'sizeFactor', 'CenterX_global_px', 'CenterY_global_px'
    var: 'means', 'vars', 'cv2', 'is_neg'
    uns: 'X_name', 'log1p'
    obsm: 'spatial'
    layers: 'logcounts'

This vignette highlights lattice data analysis approaches for multivariate observations. We will show the metrics for the genes KRT17 (basal cells) and TAGLN (smooth muscle cells) (He et al. 2022).

Show the code

plotSpatialFeature(sfe, c("KRT17"),
                   colGeometryName = "centroids", 
                   ncol = 2, size = 1, scattermore = FALSE) + 
  theme_void()

Show the code

plotSpatialFeature(sfe, c("TAGLN"),
                   colGeometryName = "centroids", 
                   ncol = 2, size = 1, scattermore = FALSE) + 
  theme_void()

Here we set the arguments for the examples below.

R
Python

Show the code

features <- c("KRT17", "TAGLN")
colGraphName <- "knn5"
colGeometryName <- "centroids"
segmentation <- "cellSeg"
plotsize = 1.5

Show the code

# predefine genes
features = ["KRT17", "TAGLN"]
figsize = (10, 7)
pointsize = 12

Lattice data

A lattice consists of individual spatial units \(D = \{A_1, A_2,...,A_n\}\) where the units do not overlap. The data is then a realisation of a random variable along the lattice \(Y_i = Y (A_i)\) (Zuur, Ieno, and Smith 2007). The lattice is irregular if the units have variable size and are not spaced regularly, as is the case with cells in tissue.

More details about lattices can be found here.

Spatial weight matrix

One of the challenges when working with (irregular) lattice data is the construction of a neighbourhood graph (Pebesma and Bivand 2023). The main question is, what to consider as neighbours, as this will affect downstream analyses. Various methods exist to define neighbours, such as contiguity-based neighbours (neighbours in direct contact), graph-based neighbours (e.g., \(k\)-nearest neighbours), distance-based neighbours or higher order neighbours (Getis 2009; Zuur, Ieno, and Smith 2007; Pebesma and Bivand 2023). The documentation of the package spdep provides an overview of the different methods (Bivand 2022).

We consider first contiguity-based neighbours. As cell segmentation is notoriously imperfect, we add a snap value, which means that we consider all cells with distance \(20\) or less as contiguous (Pebesma and Bivand 2023; Wang 2019).

Show the code

colGraph(sfe, "poly2nb") <-
  findSpatialNeighbors(sfe,
    type = "cellSeg",
    method = "poly2nb", # wraps the spdep function with the same name
    style = "W",
    snap = 20 # cells that are at most this distance apart are considered contiguous
  )

p1 <- plotColGraph(sfe,
  colGraphName = "poly2nb",
  colGeometryName = "cellSeg",
  bbox =  c(xmin = 3500, xmax = 10000, ymin = 157200, ymax = 162200)
) + theme_void()

Alternatively, we can use a \(k\)-nearest neighbours approach. The choice of the number \(k\) is somewhat arbitrary.

R
Python

Show the code

colGraph(sfe, "knn5") <-
  findSpatialNeighbors(sfe,
    method = "knearneigh", # wraps the spdep function with the same name
    k = 5,
    zero.policy = TRUE
  )

p2 <- plotColGraph(sfe,
  colGraphName = "knn5",
  colGeometryName = "cellSeg",
  bbox = c(xmin = 3500, xmax = 10000, ymin = 157200, ymax = 162200)
) + theme_void()

#calculate binary nearest neighbour weight matrix too
colGraph(sfe, "binary") <-
  findSpatialNeighbors(sfe,
    method = "knearneigh", # wraps the spdep function with the same name
    k = 5,
    zero.policy = TRUE,
    style = "B"
  )

Show the code

spatial_weights = KNN(KDTree(adata.obsm['spatial']), 5)
adata.obsp['spatial_connectivities'] = spatial_weights.sparse
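
To make the \(k\)-nearest neighbours construction concrete, the following is a minimal NumPy sketch on five toy coordinates. The function name `knn_weights` and the toy data are illustrative assumptions; the code is a dense stand-in for the idea, not the implementation used by spdep or libpysal.

```python
import numpy as np

def knn_weights(coords, k):
    """Row-standardised k-nearest-neighbour weight matrix (dense, for illustration)."""
    n = coords.shape[0]
    # pairwise Euclidean distances between all units
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a unit is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    w = np.zeros((n, n))
    w[np.arange(n)[:, None], nearest] = 1.0   # binary weights
    return w / w.sum(axis=1, keepdims=True)   # row-standardised weights

# hypothetical toy coordinates: three units close together, two further away
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
w = knn_weights(coords, k=2)
```

Every unit receives exactly \(k\) neighbours, so no unit is isolated, but the relation is not symmetric: the distant units pick neighbours in the dense group without being picked in return.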

The graphs below show noticeable differences. In the contiguous neighbour graph on the left (neighbours in direct contact), we can see the formation of distinct patches that are not connected to the rest of the tissue. In addition, some cells do not have any direct neighbours. In contrast, the \(k\)-nearest neighbours (\(k\)NN) graph on the right reveals that these patches tend to be connected to the rest of the structure.

R
Python

Show the code

p1 + p2

Show the code

fig, ax = plt.subplots(1, 1, figsize=(10, 10))
sq.pl.spatial_scatter(
    adata[adata.obsp["spatial_connectivities"].nonzero()[0], :],
    connectivity_key="spatial_connectivities",
    size=0.1,
    na_color="black",
    edges_color="black",
    edges_width=0.1,
    shape=None,
    library_id="spatial",
    ax=ax,
    fig=fig,
    crop_coord = (3500, 3999, 10000, 8999)
)

With a defined spatial weight matrix, one can calculate multivariate spatial metrics. We will consider both global and local bivariate observations as well as local multivariate spatial metrics.

Global Measures for Bivariate Data

Global Bivariate Moran’s \(I\)

For two continuous variables the global bivariate Moran’s \(I\) is defined as (Wartenberg 1985; Bivand 2022)

\[I_B = \frac{\sum_i(\sum_j{w_{ij}y_j\times x_i})}{\sum_i{x_i^2}}\]

where \(x_i\) and \(y_j\) are the two variables of interest and \(w_{ij}\) is the value of the spatial weights matrix for positions \(i\) and \(j\).

The global bivariate Moran’s \(I\) is a measure of correlation between the variables \(x\) and \(y\) where \(y\) has a spatial lag. The result might overestimate the spatial autocorrelation of the variables due to the non-spatial correlation of \(x\) and \(y\) (Bivand 2022).
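
The formula can be traced by hand on a small example. The sketch below assumes z-scored variables (analogous to `scale = TRUE` in the R call) and a row-standardised weight matrix; the function name `bivariate_moran` and the four-unit toy lattice are illustrative assumptions.

```python
import numpy as np

def bivariate_moran(x, y, w):
    """I_B = sum_i x_i * (sum_j w_ij y_j) / sum_i x_i^2 on z-scored inputs."""
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    lag_y = w @ zy  # spatial lag of y
    return (zx * lag_y).sum() / (zx ** 2).sum()

# toy lattice: 4 units on a line, neighbours are the adjacent units
w = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.0, 3.0, 2.0, 1.0])

i_xy = bivariate_moran(x, y, w)  # x against the spatial lag of y
i_xx = bivariate_moran(x, x, w)  # reduces to the univariate Moran's I of x
```

In this toy case \(y\) is the mirror image of \(x\), so the bivariate value is the negative of the univariate Moran’s \(I\) of \(x\). This illustrates how the non-spatial correlation of the two variables feeds directly into the statistic.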

R
Python

Show the code

res <- spdep::moran_bv(x = logcounts(sfe)[features[1],],
         y = logcounts(sfe)[features[2],],
         listw =  colGraph(sfe, colGraphName),
         nsim = 499,
         scale = TRUE)
res


DATA PERMUTATION


Call:
boot(data = xx, statistic = bvm_boot, R = nsim, sim = "permutation", 
    listw = listw, parallel = parallel, ncpus = ncpus, cl = cl)


Bootstrap Statistics :
      original    bias    std. error
t1* -0.1577906 0.1578076 0.002761044

Show the code

plot(res)

Show the code

ci <- boot::boot.ci(res, conf = c(0.99, 0.95, 0.9), type = "basic")
ci

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 499 bootstrap replicates

CALL : 
boot::boot.ci(boot.out = res, conf = c(0.99, 0.95, 0.9), type = "basic")

Intervals : 
Level      Basic         
99%   (-0.3231, -0.3087 )   
95%   (-0.3212, -0.3102 )   
90%   (-0.3201, -0.3110 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable

Show the code

np.random.seed(3407)
moran_bv = Moran_BV(
    adata[:, features[0]].X.toarray(),
    adata[:, features[1]].X.toarray(),
    spatial_weights,
    transformation="r",
    permutations=499,
)
adata.uns[f"moran_bv_{features[0]}_{features[1]}"] = moran_bv.I
adata.uns[f"moran_bv_{features[0]}_{features[1]}_p_sim"] = moran_bv.p_sim

for key in filter(lambda x: x.startswith("moran_"), adata.uns.keys()):
    print(f"{key}: {adata.uns[key].round(4)}")

moran_bv_KRT17_TAGLN: -0.1577
moran_bv_KRT17_TAGLN_p_sim: 0.002

Show the code

    
plt.figure(figsize=(5, 4), dpi=100)
hist = plt.hist(moran_bv.sim, bins=30, color="lightgrey", edgecolor="black")
plt.axvline(moran_bv.I, color="red", linestyle="--")
plt.xlabel("Simulated bivariate Moran's I")
plt.ylabel("Simulated frequency")
plt.title(f"Bivariate Moran's I: {features[0]} vs {features[1]} (I: {moran_bv.I.round(4)}, p: {moran_bv.p_sim.round(4)})")
sns.despine()

The value t0 of the returned object (printed in the column original of the bootstrap statistics) is the test statistic of the global bivariate Moran’s \(I\). For the genes KRT17 and TAGLN it is -0.1577906. Significance can be assessed by comparing the test statistic with the confidence interval obtained from the permutations.
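
The simulated p-value of 0.002 reported by the Python code is the smallest value attainable with 499 permutations. A minimal sketch of a folded pseudo p-value illustrates this; it assumes the common convention \((m + 1)/(n_{sim} + 1)\), where \(m\) counts the simulated values at least as extreme as the observed one. The function name and the simulated values are hypothetical stand-ins, and the exact convention used by a given package may differ in detail.

```python
import numpy as np

def pseudo_p_value(observed, simulated):
    """Folded one-sided pseudo p-value: (extreme + 1) / (n_sim + 1)."""
    simulated = np.asarray(simulated)
    n_sim = simulated.size
    larger = int((simulated >= observed).sum())
    extreme = min(larger, n_sim - larger)  # fold so that either tail counts as extreme
    return (extreme + 1) / (n_sim + 1)

rng = np.random.default_rng(3407)
# stand-in for 499 permuted statistics scattered around zero
simulated = rng.normal(loc=0.0, scale=0.003, size=499)
p_min = pseudo_p_value(-0.1578, simulated)
```

When none of the permuted statistics is as extreme as the observed one, the result is \(1/500 = 0.002\); a smaller p-value requires more permutations.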

Global Bivariate Lee’s \(L\)

Lee’s \(L\) is a bivariate measure that combines non-spatial Pearson correlation with spatial autocorrelation via Moran’s \(I\) (Lee 2001). This enables us to assess the spatial dependence of two continuous variables in a single measure. The measure is defined as

\[L(x,y) = \frac{n}{\sum_{i=1}^n\left(\sum_{j=1}^nw_{ij}\right)^2}\frac{\sum_{i=1}^n\left[\sum_{j=1}^nw_{ij}(x_j-\bar{x})\right]\left[\sum_{j=1}^nw_{ij}(y_j-\bar{y})\right]}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\]

where \(w_{ij}\) is the value of the spatial weights matrix for positions \(i\) and \(j\), \(x\) and \(y\) are the two variables of interest, and \(\bar{x}\) and \(\bar{y}\) are their means (Lee 2001; Bivand 2022).
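
As a sanity check of the definition, the following NumPy sketch evaluates the formula term by term on a four-unit toy lattice. The function name `lees_l` and the data are illustrative assumptions; this is not the code path of spdep or esda.

```python
import numpy as np

def lees_l(x, y, w):
    """Global Lee's L evaluated term by term (dense weights, for illustration)."""
    n = x.size
    dx, dy = x - x.mean(), y - y.mean()
    lag_dx, lag_dy = w @ dx, w @ dy          # spatially lagged deviations
    scale = n / (w.sum(axis=1) ** 2).sum()   # n / sum_i (sum_j w_ij)^2
    denom = np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())
    return scale * (lag_dx * lag_dy).sum() / denom

# toy lattice: 4 units on a line with row-standardised weights
w = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.0, 3.0, 2.0, 1.0])

l_xy = lees_l(x, y, w)
l_xx = lees_l(x, x, w)
```

Because both variables are spatially lagged, Lee’s \(L\) is symmetric in \(x\) and \(y\), unlike the bivariate Moran’s \(I\). With row-standardised weights the leading factor equals one.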

R
Python

Show the code

res_lee <- calculateBivariate(sfe, type = "lee.mc", 
                   feature1 = features[1], feature2 = features[2],
                   colGraphName = colGraphName,
                   nsim = 499)
res_lee$lee.mc_statistic

statistic 
  -0.1528

Show the code

res_lee$lee.mc_p.value

[1] 0.998

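The effect size of bivariate Lee’s \(L\) for the genes KRT17 and TAGLN is -0.1528 and the associated p-value is 0.998. Note that lee.mc tests a one-sided alternative (by default, that the statistic is greater than expected under permutation), so a p-value close to 1 for a negative statistic does not by itself indicate the absence of spatial association.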

Show the code

np.random.seed(3407)
lees_l_estimator = Spatial_Pearson(connectivity=spatial_weights.to_sparse(), permutations=499)
lees_l_estimator.fit(
    adata[:, features[0]].X.toarray(),
    adata[:, features[1]].X.toarray(),
)

Spatial_Pearson(connectivity=<COOrdinate sparse array of dtype 'float64'
    with 136025 stored elements and shape (27205, 27205)>,
                permutations=499)


Show the code

adata.uns[f"lees_l_{features[0]}_{features[1]}"] = lees_l_estimator.association_
adata.uns[f"lees_l_{features[0]}_{features[1]}_p_sim"] = lees_l_estimator.significance_

fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(
    lees_l_estimator.association_,
    cmap=cmap_continuous,
    annot=True,
    cbar=False,
    ax=ax,
    mask=np.triu(np.ones_like(lees_l_estimator.association_)) - np.eye(2),
)
ax.set_title("Lee's L")
ax.set_xticklabels(features[:2])
ax.set_yticklabels(features[:2], rotation=0)

for i, row in enumerate(lees_l_estimator.significance_):
    for j, p in enumerate(row):
        ax.text(j + 0.5, i + 0.75, f"p = {p:.3f}", ha="center", va="center", color="white", fontsize=8)

Local Measures for Bivariate Data

Local Bivariate Moran’s \(I\)

Similar to the global bivariate Moran’s \(I\) statistic, there is a local analogue. The formula is given by:

\[ I_i^B = x_i\sum_jw_{ij}y_j \]

(Anselin 2024; Bivand 2022).

This can be interesting in the context of detection of coexpressed ligand-receptor pairs. A method that is based on local bivariate Moran’s \(I\) and tries to detect such pairs is SpatialDM (Li et al. 2023).
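
A minimal sketch of the local statistic, assuming z-scored variables and row-standardised weights (the function name and toy data are illustrative assumptions). It also shows how the local values relate to the global statistic.

```python
import numpy as np

def local_bivariate_moran(x, y, w):
    """I_i^B = x_i * sum_j w_ij y_j on z-scored inputs, one value per unit."""
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return zx * (w @ zy)

# toy lattice: 4 units on a line with row-standardised weights
w = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.0, 3.0, 2.0, 1.0])

local_i = local_bivariate_moran(x, y, w)
# for z-scored x, sum_i x_i^2 = n, so the global value is the mean of the local values
global_i = local_i.sum() / x.size
```

Each unit contributes one value, and their mean recovers the global bivariate Moran’s \(I\) in this setting. The statistic is not symmetric: exchanging \(x\) and \(y\) changes which variable is lagged.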

R
Python

Show the code

sfe <- runBivariate(sfe, type = "localmoran_bv",
                    feature1 = features[1], feature2 = features[2],
                    colGraphName = colGraphName,
                    nsim = 499)

plotLocalResult(sfe, "localmoran_bv", 
                 features = localResultFeatures(sfe, "localmoran_bv"),
                ncol = 2, divergent = TRUE, diverge_center = 0,
                colGeometryName = colGeometryName, size = plotsize)

Show the code

local_moran_bv = Moran_Local_BV(
    adata[:, features[0]].X.toarray().astype(np.float64),
    adata[:, features[1]].X.toarray().astype(np.float64),
    spatial_weights,
    transformation="r",
    permutations=499,
    seed=3407,
)
adata.obs[f"local_moran_bv_{features[0]}_{features[1]}"] = local_moran_bv.Is
adata.obs[f"local_moran_bv_{features[0]}_{features[1]}_p_sim"] = local_moran_bv.p_sim

fig, ax = plt.subplots(1, 1, figsize=figsize, layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color=f"local_moran_bv_{features[0]}_{features[1]}",
    cmap=cmap_continuous,
    vmin=-adata.obs[f"local_moran_bv_{features[0]}_{features[1]}"].abs().max(),
    vmax=adata.obs[f"local_moran_bv_{features[0]}_{features[1]}"].abs().max(),
    vcenter=0,
    shape=None,
    library_id="spatial",
    title=f"Local bivariate Moran's I: {features[0]}"+ r"$\rightarrow$" + f"{features[1]}",
    ax=ax,
    fig=fig,
    size=pointsize,
)
ax.set_axis_off()

Local Bivariate Lee’s \(L\)

Similar to the global variant of Lee’s \(L\), the local variant (Lee 2001; Bivand 2022) is defined as

\[L_i(x,y) = \frac{n(\sum_{j=1}^nw_{ij}(x_j-\bar{x}))(\sum_{j=1}^nw_{ij}(y_j-\bar{y}))}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\]

Local Lee’s \(L\) is a measure of spatial co-expression when the variables of interest are gene expression measurements. Unlike in the global version, the values are not aggregated over all locations, so each value shows the local contribution to the metric. Positive values indicate colocalization, negative values indicate segregation (Lee 2001; Bivand 2022).
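
The following sketch evaluates the local statistic on a toy lattice where local contributions of opposite sign cancel each other. The function name `local_lees_l` and the data are illustrative assumptions.

```python
import numpy as np

def local_lees_l(x, y, w):
    """Local Lee's L: one value per unit (dense weights, for illustration)."""
    n = x.size
    dx, dy = x - x.mean(), y - y.mean()
    denom = np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())
    return n * (w @ dx) * (w @ dy) / denom

# toy lattice: 4 units on a line with row-standardised weights
w = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 2.0, 3.0, 4.0])  # gradient along the line
y = np.array([1.0, 2.0, 2.0, 1.0])  # high in the centre, low at the ends

local_l = local_lees_l(x, y, w)
```

Here the first unit shows segregation (negative value) and the last unit colocalization (positive value). With row-standardised weights the global value is the mean of the local values, which is zero in this example, so a global statistic near zero can hide local structure.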

R
Python

Show the code

sfe <- runBivariate(sfe, type = "locallee",
                    feature1 = features[1], feature2 = features[2],
                    colGraphName = colGraphName)

plotLocalResult(sfe, "locallee", 
                 features = localResultFeatures(sfe, "locallee"),
                ncol = 2, divergent = TRUE, diverge_center = 0,
                colGeometryName = colGeometryName, size = plotsize)

Show the code

np.random.seed(3407)
local_lees_l_estimator = Spatial_Pearson_Local(connectivity=spatial_weights.to_sparse(), permutations=499)
local_lees_l_estimator.fit(
    adata[:, features[0]].X.toarray().astype(np.float64),
    adata[:, features[1]].X.toarray().astype(np.float64),
)

Spatial_Pearson_Local(connectivity=<COOrdinate sparse array of dtype 'float64'
    with 136025 stored elements and shape (27205, 27205)>,
                      permutations=499)

Show the code

adata.obs[f"local_lees_l_{features[0]}_{features[1]}"] = local_lees_l_estimator.associations_

fig, ax = plt.subplots(1, 1, figsize=figsize, layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color=f"local_lees_l_{features[0]}_{features[1]}",
    cmap=cmap_continuous,
    vmin=-adata.obs[f"local_lees_l_{features[0]}_{features[1]}"].abs().max(),
    vmax=adata.obs[f"local_lees_l_{features[0]}_{features[1]}"].abs().max(),
    vcenter=0,
    shape=None,
    library_id="spatial",
    title=f"Local Lee's L: {features[0]}"+ r"$\leftrightarrow$" + f"{features[1]}",
    ax=ax,
    fig=fig,
    size=pointsize,
)
ax.set_axis_off()

Local Measures for Multivariate Data

Multivariate local Geary’s \(C\)

Geary’s \(C\) is a measure of spatial autocorrelation that is based on the difference between a variable and its neighbours. Anselin (2019, 1995) defines the local version as

\[c_i = \sum_{j=1}^n w_{ij}(x_i-x_j)^2\]

which can be generalized to \(k\) features (in our case genes) by summing over the features

\[c_{k,i} = \sum_{v=1}^k c_{v,i}\]

where \(c_{v,i}\) is the local Geary’s \(C\) for the \(v\)th variable at location \(i\). The number of variables that can be used is not fixed, which makes the interpretation a bit more difficult. In general, the metric summarizes similarity in the “multivariate attribute space” (i.e. the gene expression) to its geographic neighbours. The common difficulty in these analyses is the interpretation of the mixture of similarity in the physical space and similarity in the attribute space (Anselin 2019, 1995).
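
A minimal NumPy sketch of the multivariate statistic: the squared differences to the neighbours are computed per feature, weighted and summed. Standardising each feature first is an assumption made here so that features are comparable; the function name and toy data are illustrative.

```python
import numpy as np

def local_geary_multivariate(features, w):
    """Sum over features of sum_j w_ij (x_i - x_j)^2; features has shape (n_units, k)."""
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    # squared differences between all pairs of units, per feature: shape (n, n, k)
    sq_diff = (z[:, None, :] - z[None, :, :]) ** 2
    # weight by w_ij, then sum over neighbours j and features v
    return np.einsum("ij,ijv->i", w, sq_diff)

# toy lattice: 4 units on a line with row-standardised weights
w = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.0, 3.0, 2.0, 1.0])

c_single = local_geary_multivariate(x[:, None], w)             # one feature
c_multi = local_geary_multivariate(np.column_stack([x, y]), w) # two features
```

Small values indicate that a unit resembles its neighbours across the features. Because the statistic is a sum of squares, it cannot distinguish which features drive the similarity, which is the interpretation difficulty mentioned above.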

R
Python

To speed up computation we will use highly variable genes.

Show the code

hvgs <- getTopHVGs(sfe, n = 100)

# Subset of the tissue
sfe <- runMultivariate(sfe, type = "localC_multi",
                    subset_row = hvgs,
                    colGraphName = colGraphName)

# # Local C multi is stored in colData so this is a workaround to plot it
# plotSpatialFeature(sfe, "localC_multi", size = plotsize, scattermore = FALSE)

Show the code

sc.pp.highly_variable_genes(adata, n_top_genes=100, flavor="seurat_v3", inplace=True)

local_geary_mv_estimator = Geary_Local_MV(connectivity=spatial_weights, permutations=100)
local_geary_mv_estimator.fit(
    [
        adata[:, highly_variable_gene].X.toarray()[:, 0]
        for highly_variable_gene in adata.var_names[adata.var["highly_variable"]].to_list()
    ]
)

Geary_Local_MV(connectivity=<libpysal.weights.distance.KNN object at 0x7f33e9debaf0>,
               permutations=100)

Show the code


adata.obs["local_geary_mv"] = local_geary_mv_estimator.localG
adata.obs["local_geary_mv_p"] = -np.log10(false_discovery_control(local_geary_mv_estimator.p_sim))

We can further plot the results of the permutation test. Significant values indicate interesting regions, but should be interpreted with care for various reasons. For example, we are looking for similarity in a combination of multiple features but the exact combination is not known. Anselin (2019) writes “Overall, however, the statistic indicates a combination of the notion of distance in multi-attribute space with that of geographic neighbors. This is the essence of any spatial autocorrelation statistic. It is also the trade-off encountered in spatially constrained multivariate clustering methods (for a recent discussion, see, e.g., Grubesic, Wei, and Murray 2014).” Multi-attribute space refers here to the highly variable genes. The problem can be summarised as the question of where the similarity comes from: the gene expression or the physical space (Anselin 2019). The same problem is common in spatial domain detection methods.

R
Python

Show the code

sfe <- runMultivariate(sfe, type = "localC_perm_multi",
                    subset_row = hvgs,
                    nsim = 100,
                    colGraphName= colGraphName)

# stored as spatially reduced dim; plot it in this way
spatialReducedDim(sfe, "localC_perm_multi",  c(1, 11), size = plotsize/2)

Show the code

fig, ax = plt.subplots(1, 2, figsize=(len(features)*7, 7), layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color=["local_geary_mv", "local_geary_mv_p"],
    cmap="Blues",
    vmin=0,
    shape=None,
    library_id="spatial",
    title=["Geary's local multivariate C", "Simulated $-log_{10}(p_{adjusted})$"],
    ax=ax,
    fig=fig,
    size=pointsize
)
# use a separate loop variable so the array of axes is not shadowed
for a in ax:
    a.set_axis_off()

Plotted are the effect size and the adjusted p-values in space.

Local Neighbour Match Test

This test is useful to assess the overlap of the \(k\)-nearest neighbours from physical distances (tissue space) with the \(k\)-nearest neighbours from the gene expression measurements (attribute space). \(k\)-nearest neighbour matrices are computed for both physical and attribute space. In a second step the probability of overlap between the two matrices is computed (Anselin and Li 2020).
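
The two steps can be sketched in NumPy as follows. The probability is computed here as the hypergeometric probability of observing exactly \(v\) shared neighbours among the \(n - 1\) other units; this is an assumption based on the description in Anselin and Li (2020), and the function names and toy data are hypothetical.

```python
import numpy as np
from math import comb

def knn_sets(points, k):
    """Set of the k nearest neighbours for every row of points."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude the unit itself
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [set(row.tolist()) for row in nearest]

def neighbor_match(coords, attributes, k):
    """Cardinality of shared neighbours and its probability under random matching."""
    n = len(coords)
    spatial, attribute = knn_sets(coords, k), knn_sets(attributes, k)
    cardinality = np.array([len(s & a) for s, a in zip(spatial, attribute)])
    total = comb(n - 1, k)
    probability = np.array(
        [comb(k, int(v)) * comb(n - 1 - k, k - int(v)) / total for v in cardinality]
    )
    return cardinality, probability

rng = np.random.default_rng(0)
coords = rng.uniform(size=(6, 2))
attributes = coords.copy()  # extreme case: attribute space mirrors physical space
cardinality, probability = neighbor_match(coords, attributes, k=2)
```

In this extreme case every unit shares all \(k = 2\) neighbours, which has probability \(1/\binom{5}{2} = 0.1\) under random matching. A high cardinality combined with a low probability marks units whose neighbours agree in both spaces.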

R
Python

Show the code

sf <- colGeometries(sfe)[[colGeometryName]]
sf <- cbind(sf,  t(as.matrix(logcounts(sfe)[hvgs,])))
# "-" gets replaced by "." so harmonise here
hvgs <- gsub("-", ".", hvgs)

nbr_test <- neighbor_match_test(sf[c(hvgs)], k = 6)

sf$Probability <- nbr_test$Probability
sf$Cardinality <- nbr_test$Cardinality

p <- ggplot() +
  geom_sf(data = sf, aes(fill = Cardinality, color = Cardinality), size = plotsize * 0.6) +
  theme_void() +
  scale_color_viridis_b() +
  scale_fill_viridis_b()
q <- ggplot() +
  geom_sf(data = sf, aes(fill = Probability, color = Probability), size = plotsize * 0.6)+
  theme_void() +
  scale_color_viridis_c(option = "C", direction = -1) +
  scale_fill_viridis_c(option = "C", direction = -1)

p + q

Show the code

k = 20
# Spatial grid neighbors
df_neighbors_spatial = spatial_weights.to_adjlist()
df_neighbors_features = KNN(KDTree(adata[:, adata.var["highly_variable"]].X.toarray(), distance_metric="euclidean"), k=k).to_adjlist()

focal_points = sorted(set(df_neighbors_spatial.focal).intersection(df_neighbors_features.focal))
focal_points_names = adata.obs_names[focal_points]

df_neighborhood_match_test = pd.DataFrame(columns=["neighbors_match_count", "neighbors_match_fraction"], index=focal_points_names)

for focal_point, focal_name in zip(focal_points, focal_points_names):
    neighbors_spatial = set(df_neighbors_spatial[df_neighbors_spatial.focal == focal_point].neighbor)
    neighbors_features = set(df_neighbors_features[df_neighbors_features.focal == focal_point].neighbor)
    neighbors_match_count = len(neighbors_spatial.intersection(neighbors_features))
    neighbors_match_fraction = neighbors_match_count / len(neighbors_spatial)
    df_neighborhood_match_test.loc[focal_name] = [neighbors_match_count, neighbors_match_fraction]

adata.obs["neighbors_match_count"] = df_neighborhood_match_test.loc[adata.obs_names, "neighbors_match_count"]
adata.obs["neighbors_match_fraction"] = df_neighborhood_match_test.loc[adata.obs_names, "neighbors_match_fraction"]

# plot the results
neighborhood_match_test_features = ["neighbors_match_count", "neighbors_match_fraction"]
fig, axes = plt.subplots(1, len(neighborhood_match_test_features), figsize=(len(neighborhood_match_test_features)*5, 5), layout = "tight")
for feature, ax in zip(neighborhood_match_test_features, axes):
    title = feature.replace("_", " ").capitalize()
    sq.pl.spatial_scatter(adata,
    color=feature, 
    shape=None, 
    size=pointsize, 
    cmap="YlOrBr", 
    title=title, 
    ax=ax, 
    fig=fig,
    use_raw=False)
    ax.set_axis_off()

Cardinality is a measure of how many neighbours the two matrices have in common. Some regions show high cardinality with low probability and therefore share similarity in both attribute and physical space. In contrast to multivariate local Geary’s \(C\), this metric focuses directly on the distances and not on a weighted average. A limitation of this approach is the empty space problem: as the number of dimensions of the feature set increases, the empty space between observations also increases (Anselin and Li 2020).

Measures for binary and categorical data

Join count statistic

In addition to measures of spatial autocorrelation for continuous data as seen above, the join count statistic applies the same concept to binary and categorical data. In essence, the join count statistic compares the distribution of categorical marks in a lattice with the frequencies that would occur randomly. These random occurrences can be computed using a theoretical approximation or random permutations. The same concept was also extended to a multivariate setting with more than two categories. The corresponding spdep functions are called joincount.test and joincount.multi (Dale and Fortin 2014; Bivand 2022; Cliff and Ord 1981).

First, we need to get categorical marks for each data point. We do so by running (non-spatial) PCA on the data followed by Leiden clustering (Traag, Waltman, and Van Eck 2019).

R
Python

Show the code

library(BiocNeighbors)
library(BiocSingular)

set.seed(123)
# Run PCA on the sample
sfe <- runPCA(sfe, exprs_values = "logcounts", ncomponents = 50, BSPARAM = IrlbaParam())
# Cluster based on the first 10 PCs using Leiden
colData(sfe)$cluster <- clusterRows(reducedDim(sfe, "PCA")[,1:10],
                                    BLUSPARAM = KNNGraphParam(
                                      k = 20,
                                      BNPARAM=AnnoyParam(ntrees=50),
                                      cluster.fun = "leiden",
                                      cluster.args = list(
                                          resolution = 0.3,
                                          objective_function = "modularity")))

plotSpatialFeature(sfe,
  "cluster",
  colGeometryName = colGeometryName, size = plotsize
)

Show the code

np.random.seed(123)
# compute a PCA
sc.pp.pca(adata, n_comps = 50, zero_center = True, svd_solver = "arpack")
#compute the neighbours
sc.pp.neighbors(adata, use_rep = "X_pca", knn = True, n_pcs = 10)
#compute leiden clustering
sc.tl.leiden(adata, resolution = 0.3, flavor = "igraph", objective_function = "modularity")

fig, ax = plt.subplots(1, 1, figsize=figsize, layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color="leiden",
    shape=None,
    library_id="spatial",
    title="Clusters",
    ax=ax,
    fig=fig,
    size=pointsize
)
ax.set_axis_off()

The join count statistics of this example are:

R
Python

Show the code

joincount.multi(colData(sfe)$cluster,
             colGraph(sfe, 'binary'))

     Joincount  Expected  Variance   z-value
1:1   1892.500   713.551   540.754   50.6985
2:2   2855.000   895.435   662.289   76.1441
3:3   4643.000  2019.249  1327.454   72.0133
4:4  10916.000  4893.904  2598.194  118.1441
5:5    494.500    85.489    73.577   47.6829
6:6    861.000    84.077    72.401   91.3073
7:7  16931.500  5015.361  2641.979  231.8307
2:1   1891.500  1599.216  1210.675    8.4002
3:1   2518.500  2401.385  1725.298    2.8196
3:2   2871.000  2690.034  1913.794    4.1367
4:1   4317.000  3738.334  2445.447   11.7017
4:2   2432.500  4187.685  2717.079  -33.6722
4:3   5747.000  6288.234  3923.351   -8.6409
5:1    513.500   494.312   400.486    0.9588
5:2    493.000   553.729   443.400   -2.8840
5:3   1129.000   831.481   629.135   11.8616
5:4   1635.500  1294.400   884.247   11.4708
6:1    494.500   490.214   397.262    0.2150
6:2    472.000   549.139   439.829   -3.6782
6:3    594.500   824.588   624.058   -9.2104
6:4    897.000  1283.669   877.079  -13.0563
6:5     65.500   169.737   146.267   -8.6188
7:1    223.500  3784.435  2467.213  -71.6904
7:2   1390.500  4239.328  2741.434  -54.4098
7:3   1219.500  6365.782  3959.727  -81.7826
7:4    129.000  9909.871  5757.405 -128.9033
7:5     43.500  1310.362   891.830  -42.4217
7:6    341.000  1299.499   884.600  -32.2269
Jtot 29419.000 54305.434  9321.576 -257.7615

Show the code

sq.gr.interaction_matrix(adata, "leiden", normalized = False, connectivity_key="spatial", weights = False)
df_interactions = pd.DataFrame(adata.uns["leiden_interactions"], columns=np.unique(adata.obs["leiden"]), index=np.unique(adata.obs["leiden"]))
# add lower triangular matrix (w/o diagonal) to the dataframe and divide by 2
array_join_counts = (df_interactions + np.tril(df_interactions, k = -1).T)/2
#only print the upper triangular matrix
np.triu(array_join_counts)

array([[  943.5,   521. ,  2316.5,   429.5,  1160.5,   503. ,   309.5,
         1729.5],
       [    0. ,  2356.5,  3244.5,   249.5,  2792.5,   826. ,   501.5,
           46. ],
       [    0. ,     0. ,  9595.5,  1245. ,  6264.5,  2192. ,   942.5,
          489.5],
       [    0. ,     0. ,     0. ,  1568.5,   892.5,   230. ,   268. ,
         1176.5],
       [    0. ,     0. ,     0. ,     0. ,  5212. ,  2007. ,   989.5,
          849. ],
       [    0. ,     0. ,     0. ,     0. ,     0. ,   943.5,   276. ,
          127. ],
       [    0. ,     0. ,     0. ,     0. ,     0. ,     0. ,   365. ,
          104. ],
       [    0. ,     0. ,     0. ,     0. ,     0. ,     0. ,     0. ,
        14345.5]])

The Python function sq.gr.interaction_matrix counts the interaction for each pair twice, while the R function joincount.multi counts each interaction only once. Therefore, in Python we add the lower triangular matrix to the upper triangle (without the diagonal) and divide the resulting interaction matrix by 2. Since there are differences in the implementation of the principal component calculation (namely in the singular value decomposition of the sparse logcounts matrix), the results do not correspond perfectly, cf. Rich et al. (2024).

The rows show different combinations of clusters that are in physical contact. E.g. \(1:1\) means the cluster \(1\) with itself. The column Joincount is the observed statistic whereas the column Expected is the expected value of the statistic for this combination. Like this, we can compare whether contacts among cluster combinations occur more frequently than expected at random (Cliff and Ord 1981).
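
The counting itself, including the halving step described above, can be sketched in a few lines of NumPy. The function name `join_counts` and the toy lattice are illustrative assumptions; expected values and variances are not computed here.

```python
import numpy as np

def join_counts(labels, w_binary):
    """Observed join counts per pair of categories, each join counted once."""
    cats = np.unique(labels)
    onehot = (labels[:, None] == cats[None, :]).astype(float)  # shape (n, n_categories)
    directed = onehot.T @ w_binary @ onehot  # counts every directed link i -> j
    # fold the lower triangle onto the upper one, then halve, because the links
    # i -> j and j -> i describe the same join
    counts = np.triu(directed + np.tril(directed, k=-1).T) / 2.0
    return cats, counts

# toy lattice: 4 units on a line, binary and symmetric adjacency
w_binary = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
labels = np.array(["a", "a", "b", "b"])
cats, counts = join_counts(labels, w_binary)
```

The toy lattice has three joins: one within category a, one within category b and one between them. With a \(k\)-nearest neighbours graph the links are not necessarily reciprocated, so a one-directional link contributes half a join, which is why the tables above contain values ending in .5.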

A note of caution

The local methods presented above should always be interpreted with care, since we face the problem of multiple testing when calculating them for each cell. Moreover, the presented methods should mainly serve as exploratory measures to identify interesting regions in the data. Multiple processes can lead to the same pattern, thus the underlying process cannot be inferred from characterising the pattern. Indication of clustering does not explain why this occurs. On one hand, clustering can be the result of spatial interaction between the variables of interest. This is the case if a gene of interest is highly expressed in a tissue region. On the other hand, clustering can be the result of spatial heterogeneity, when local similarity is created by structural heterogeneity in the tissue, e.g., when cells with uniform expression of a gene of interest are grouped together which then creates the apparent clustering of the gene expression measurement.
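
One common way to address the multiple testing problem is to control the false discovery rate, as done above with false_discovery_control from scipy. The following is a minimal NumPy sketch of the Benjamini-Hochberg adjustment, which is the default method of that function; the function name and the example p-values are illustrative.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # enforce monotonicity, starting from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

p_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Adjusted p-values are never smaller than the raw ones. With tens of thousands of cells and a limited number of permutations, the smallest attainable raw p-value may not survive the adjustment, which is another reason to treat local results as exploratory.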

Appendix

Session info

Show the code

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] BiocSingular_1.24.0             BiocNeighbors_2.2.0            
 [3] dixon_0.0-10                    splancs_2.01-45                
 [5] sp_2.2-0                        bluster_1.18.0                 
 [7] magrittr_2.0.3                  stringr_1.5.1                  
 [9] spdep_1.3-11                    spData_2.3.4                   
[11] tmap_4.0                        scater_1.36.0                  
[13] scran_1.36.0                    scuttle_1.18.0                 
[15] SFEData_1.10.0                  Voyager_1.10.0                 
[17] SpatialFeatureExperiment_1.10.0 rgeoda_0.1.0                   
[19] digest_0.6.37                   sf_1.0-20                      
[21] reshape2_1.4.4                  patchwork_1.3.0                
[23] STexampleData_1.16.0            ExperimentHub_2.16.0           
[25] AnnotationHub_3.16.0            BiocFileCache_2.16.0           
[27] dbplyr_2.5.0                    rlang_1.1.6                    
[29] ggplot2_3.5.2                   dplyr_1.1.4                    
[31] spatstat_3.3-2                  spatstat.linnet_3.2-5          
[33] spatstat.model_3.3-5            rpart_4.1.24                   
[35] spatstat.explore_3.4-2          nlme_3.1-168                   
[37] spatstat.random_3.3-3           spatstat.geom_3.3-6            
[39] spatstat.univar_3.1-2           spatstat.data_3.1-6            
[41] SpatialExperiment_1.18.0        SingleCellExperiment_1.30.0    
[43] SummarizedExperiment_1.38.0     Biobase_2.68.0                 
[45] GenomicRanges_1.60.0            GenomeInfoDb_1.44.0            
[47] IRanges_2.42.0                  S4Vectors_0.46.0               
[49] BiocGenerics_0.54.0             generics_0.1.3                 
[51] MatrixGenerics_1.20.0           matrixStats_1.5.0              

loaded via a namespace (and not attached):
  [1] spatialreg_1.3-6          spatstat.sparse_3.1-0    
  [3] bitops_1.0-9              EBImage_4.50.0           
  [5] httr_1.4.7                RColorBrewer_1.1-3       
  [7] tools_4.5.0               R6_2.6.1                 
  [9] HDF5Array_1.36.0          mgcv_1.9-3               
 [11] anndata_0.7.5.6           rhdf5filters_1.20.0      
 [13] withr_3.0.2               gridExtra_2.3            
 [15] leaflet_2.2.2             leafem_0.2.3             
 [17] cli_3.6.5                 sandwich_3.1-1           
 [19] labeling_0.4.3            mvtnorm_1.3-3            
 [21] proxy_0.4-27              R.utils_2.13.0           
 [23] spacesXYZ_1.5-1           dichromat_2.0-0.1        
 [25] scico_1.5.0               limma_3.64.0             
 [27] RSQLite_2.3.11            crosstalk_1.2.1          
 [29] Matrix_1.7-3              ggbeeswarm_0.7.2         
 [31] logger_0.4.0              abind_1.4-8              
 [33] R.methodsS3_1.8.2         terra_1.8-42             
 [35] lifecycle_1.0.4           multcomp_1.4-28          
 [37] yaml_2.3.10               edgeR_4.6.1              
 [39] tmaptools_3.2             rhdf5_2.52.0             
 [41] SparseArray_1.8.0         grid_4.5.0               
 [43] blob_1.2.4                dqrng_0.4.1              
 [45] crayon_1.5.3              dir.expiry_1.16.0        
 [47] lattice_0.22-7            beachmat_2.24.0          
 [49] KEGGREST_1.48.0           magick_2.8.6             
 [51] zeallot_0.1.0             pillar_1.10.2            
 [53] knitr_1.50                metapod_1.16.0           
 [55] rjson_0.2.23              boot_1.3-31              
 [57] codetools_0.2-20          wk_0.9.4                 
 [59] glue_1.8.0                data.table_1.17.0        
 [61] memuse_4.2-3              vctrs_0.6.5              
 [63] png_0.1-8                 gtable_0.3.6             
 [65] assertthat_0.2.1          cachem_1.1.0             
 [67] xfun_0.52                 mime_0.13                
 [69] S4Arrays_1.8.0            DropletUtils_1.28.0      
 [71] cols4all_0.8              coda_0.19-4.1            
 [73] survival_3.8-3            sfheaders_0.4.4          
 [75] units_0.8-7               statmod_1.5.0            
 [77] TH.data_1.1-3             bit64_4.6.0-1            
 [79] filelock_1.0.3            irlba_2.3.5.1            
 [81] vipor_0.4.7               KernSmooth_2.23-26       
 [83] colorspace_2.1-1          DBI_1.2.3                
 [85] zellkonverter_1.18.0      leaflegend_1.2.1         
 [87] raster_3.6-32             tidyselect_1.2.1         
 [89] bit_4.6.0                 compiler_4.5.0           
 [91] curl_6.2.2                h5mread_1.0.0            
 [93] basilisk.utils_1.20.0     DelayedArray_0.34.1      
 [95] scales_1.4.0              classInt_0.4-11          
 [97] rappdirs_0.3.3            tiff_0.1-12              
 [99] goftest_1.2-3             fftwtools_0.9-11         
[101] spatstat.utils_3.1-3      rmarkdown_2.29           
[103] basilisk_1.20.0           XVector_0.48.0           
[105] htmltools_0.5.8.1         pkgconfig_2.0.3          
[107] jpeg_0.1-11               base64enc_0.1-3          
[109] sparseMatrixStats_1.20.0  fastmap_1.2.0            
[111] htmlwidgets_1.6.4         UCSC.utils_1.4.0         
[113] DelayedMatrixStats_1.30.0 farver_2.1.2             
[115] zoo_1.8-14                jsonlite_2.0.0           
[117] BiocParallel_1.42.0       R.oo_1.27.0              
[119] RCurl_1.98-1.17           GenomeInfoDbData_1.2.14  
[121] s2_1.1.7                  Rhdf5lib_1.30.0          
[123] Rcpp_1.0.14               reticulate_1.42.0        
[125] ggnewscale_0.5.1          viridis_0.6.5            
[127] leafsync_0.1.0            stringi_1.8.7            
[129] MASS_7.3-65               plyr_1.8.9               
[131] parallel_4.5.0            ggrepel_0.9.6            
[133] deldir_2.0-4              stars_0.6-8              
[135] Biostrings_2.76.0         splines_4.5.0            
[137] tensor_1.5                locfit_1.5-9.12          
[139] igraph_2.1.4              ScaledMatrix_1.16.0      
[141] LearnBayes_2.15.1         XML_3.99-0.18            
[143] BiocVersion_3.21.1        evaluate_1.0.3           
[145] leaflet.providers_2.0.0   renv_1.1.4               
[147] BiocManager_1.30.25       purrr_1.0.4              
[149] polyclip_1.10-7           rsvd_1.0.5               
[151] lwgeom_0.2-14             e1071_1.7-16             
[153] RSpectra_0.16-2           viridisLite_0.4.2        
[155] class_7.3-23              tibble_3.2.1             
[157] memoise_2.0.1             beeswarm_0.4.0           
[159] AnnotationDbi_1.70.0      cluster_2.1.8.1          
[161] BiocStyle_2.36.0

References

Anselin, Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2): 93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x.

———. 2019. “A Local Indicator of Multivariate Spatial Association: Extending Geary’s c.” Geographical Analysis 51 (2): 133–50. https://doi.org/10.1111/gean.12164.

———. 2024. An Introduction to Spatial Data Science with GeoDa: Volume 1: Exploring Spatial Data. CRC Press.

Anselin, Luc, and Xun Li. 2020. “Tobler’s Law in a Multivariate World.” Geographical Analysis 52 (4): 494–510. https://doi.org/10.1111/gean.12237.

Bivand, Roger. 2022. “R Packages for Analyzing Spatial Data: A Comparative Case Study with Areal Data.” Geographical Analysis 54 (3): 488–518. https://doi.org/10.1111/gean.12319.

Cliff, Andrew David, and J Keith Ord. 1981. Spatial Processes: Models & Applications. Pion, London.

Dale, Mark R. T., and Marie-Josée Fortin. 2014. Spatial Analysis: A Guide for Ecologists. Second Edition. Cambridge ; New York: Cambridge University Press.

Getis, Arthur. 2009. “Spatial Weights Matrices.” Geographical Analysis 41 (4): 404–10. https://doi.org/10.1111/j.1538-4632.2009.00768.x.

He, Shanshan, Ruchir Bhatt, Carl Brown, Emily A. Brown, Derek L. Buhr, Kan Chantranuvatana, Patrick Danaher, et al. 2022. “High-Plex Imaging of RNA and Proteins at Subcellular Resolution in Fixed Tissue by Spatial Molecular Imaging.” Nature Biotechnology 40 (12): 1794–1806. https://doi.org/10.1038/s41587-022-01483-z.

Lee, Sang-Il. 2001. “Developing a Bivariate Spatial Association Measure: An Integration of Pearson’s r and Moran’s I.” Journal of Geographical Systems 3 (4): 369–85. https://doi.org/10.1007/s101090100064.

Li, Zhuoxuan, Tianjie Wang, Pentao Liu, and Yuanhua Huang. 2023. “SpatialDM for Rapid Identification of Spatially Co-Expressed Ligand–Receptor and Revealing Cell–Cell Communication Patterns.” Nature Communications 14 (1): 3995. https://doi.org/10.1038/s41467-023-39608-w.

Moses, Lambda, Pétur Helgi Einarsson, Kayla C. Jackson, Laura Luebbert, Ali Sina Booeshaghi, Sindri Emmanúel Antonsson, Nicolas Bray, Páll Melsted, and Lior Pachter. 2023. “Voyager: Exploratory Single-Cell Genomics Data Analysis with Geospatial Statistics.” bioRxiv. https://doi.org/10.1101/2023.07.20.549945.

Palla, Giovanni, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, et al. 2022. “Squidpy: A Scalable Framework for Spatial Omics Analysis.” Nature Methods 19 (2): 171–78. https://doi.org/10.1038/s41592-021-01358-2.

Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. 1st ed. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.

Rey, Sergio J., and Luc Anselin. 2010. “PySAL: A Python Library of Spatial Analytical Methods.” In Handbook of Applied Spatial Analysis, 175–93. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03647-7_11.

Rich, Joseph M, Lambda Moses, Pétur Helgi Einarsson, Kayla Jackson, Laura Luebbert, A. Sina Booeshaghi, Sindri Antonsson, et al. 2024. “The Impact of Package Selection and Versioning on Single-Cell RNA-seq Analysis.” bioRxiv, April, 2024.04.04.588111. https://doi.org/10.1101/2024.04.04.588111.

Traag, Vincent A, Ludo Waltman, and Nees Jan Van Eck. 2019. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports 9 (1): 1–12. https://doi.org/10.1038/s41598-019-41695-z.

Virshup, Isaac, Sergei Rybakov, Fabian J Theis, Philipp Angerer, and F Alexander Wolf. 2024. “Anndata: Access and Store Annotated Data Matrices.” Journal of Open Source Software 9 (101): 4371. https://doi.org/10.21105/joss.04371.

Wang, Zhenzhou. 2019. “Cell Segmentation for Image Cytometry: Advances, Insufficiencies, and Challenges.” Cytometry Part A 95 (7): 708–11. https://doi.org/10.1002/cyto.a.23686.

Wartenberg, Daniel. 1985. “Multivariate Spatial Correlation: A Method for Exploratory Geographical Analysis.” Geographical Analysis 17 (4): 263–83. https://doi.org/10.1111/j.1538-4632.1985.tb00849.x.

Zuur, Alain F., Elena N. Ieno, and Graham M. Smith. 2007. Analysing Ecological Data. Statistics for Biology and Health. New York: Springer.