Published

February 4, 2025

Lattice data analysis – multivaraite methods for HTS-based data

In this vignette we will show:

Multivariate lattice data analysis methods for HTS-based approaches.
This includes global metrics on the entire field of view and local variants thereof.
The use case is a 10x Visium data set from 10x Genomics (2021) downloaded from the squidpy visium datasets (Palla et al. 2022).
The R implementations rely on the Voyager package. The data is represented as SpatialFeatureExperiment (Moses et al. 2023). Complementary resources using these methods are found in the Voyager bivariate vignette and Voyager multivariate vignette.
Python implementations rely on the the packages esda, pysal and squidpy (Rey and Anselin 2010; Palla et al. 2022). Data representation rely on the anndata structure (Virshup et al. 2024).

R
Python

Show the code

source("utils.R")
theme_set(theme_light())

Show the code

import numpy as np
import scanpy as sc
import squidpy as sq
from esda.moran import Moran_BV, Moran_Local_BV
from esda.lee import Spatial_Pearson, Spatial_Pearson_Local
from esda.geary_local_mv import Geary_Local_MV
from scipy.stats import false_discovery_control
from libpysal.cg import KDTree
from libpysal.weights import W, KNN
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
import warnings

warnings.filterwarnings("ignore")

plt.rcParams['font.family'] = 'Futura'
df_cmap_discrete = pd.read_csv("../misc/ditto_colors.csv", index_col=0)
df_cmap_continuous = pd.read_csv("../misc/roma_colors.csv", index_col=0)
cmap_discrete = ListedColormap(df_cmap_discrete["ditto_colors"].values, "ditto")
cmap_continuous = LinearSegmentedColormap.from_list("roma", df_cmap_continuous["roma_colors"]).reversed()

#load the dataset
adata = sq.datasets.visium("Visium_FFPE_Mouse_Kidney")
# normalise counts already here that both R and python have same values
adata.raw = adata.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
# store data as h5ad object
adata.write("../data/adata_neighbors.h5ad")

Show the code

library(SpatialExperiment)

sce <- zellkonverter::readH5AD("../data/adata_neighbors.h5ad")
colnames(reducedDim(sce, "spatial")) <- c("x", "y")
spe <- toSpatialExperiment(sce)

spatialCoords(spe) <- reducedDim(sce, "spatial")

sfe <- SpatialFeatureExperiment::toSpatialFeatureExperiment(spe)

# the counts are already normalised
assayNames(sfe) <- "logcounts"

sfe

class: SpatialFeatureExperiment 
dim: 19465 3124 
metadata(2): log1p spatial
assays(1): logcounts
rownames(19465): Xkr4 Rp1 ... Gm3376 Gm20830
rowData names(3): gene_ids feature_types genome
colnames(3124): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
  TTGTTTCCATACAACT-1 TTGTTTGTGTAAATTC-1
colData names(4): in_tissue array_row array_col sample_id
reducedDimNames(1): spatial
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : x y
imgData names(0):

unit:
Geometries:
colGeometries: centroids (POINT) 

Graphs:
sample01:

Show the code

# invert the spatial coordinates
adata.obsm["spatial"][:, 1] = adata.obsm["spatial"][:, 1].max() - adata.obsm["spatial"][:, 1]
adata

AnnData object with n_obs × n_vars = 3124 × 19465
    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial', 'log1p'
    obsm: 'spatial'

Given this data from 10x we choose two genes to analyse henceforth, named Slc5a12 (solute carrier family 5, member 12) and Calr (calreticulin) (10x Genomics 2021). For further information on the preprocessing of visium datasets, please refer to the Voyager 10X Visium vignette.

Show the code

# there is an empty spot we have to exclude as it has no neighbours
sfe <- sfe[, colnames(sfe) != "GAGATCTGCTTGGCAT-1"]
# construct the weight matrix using the Voyager function
colGraph(sfe, "visium") <- findVisiumGraph(sfe)
colGraph(sfe, "binary") <- findVisiumGraph(sfe, style = "B")

Here we set the arguments for the examples below.

R
Python

Show the code

features <- c("Slc5a12", "Calr")
colGraphName <- "visium"
colGeometryName <- "centroids"
plotsize = 3

Show the code

# predefine genes
features = ["Slc5a12", "Calr"]
figsize = (len(features)*5, 7)
pointsize = 45

Regular lattice and spatial weight matrix

Spot based data is collected along a regular spaced grid where all sample areas have the same size. Such a grid is also called a regular lattice. In more rigorous terms the data \(Y\) is a realisation of a random variable fixed along a lattice \(D\). The lattice \(D\) does not have to be regular but in the scope of spot based data it is. Spot-based data is generated by a defined sampling strategy, whereas point pattern data is the result of a stochastic process (Zuur, Ieno, and Smith 2007).

The lattice is composed of individual spatial units

\[D = \{A_1, A_2,...,A_n\}\]

where these units do not overlap

\[A_i \cap A_j = \emptyset \forall i \neq j\]

The data is then a random variable of the spatial unit along the lattice

\[Y_i = Y(A_i)\]

Most lattice data analysis technique build on the concept of neighbours. Therefore, the spatial relationship can be modelled with e.g. a spatial weight matrix \(W\). There are a lot of ways to define a spatial weight matrix \(W\). For example, the spatial weight matrix can be defined as a binary contiguity matrix where adjacent units are indicated by one and non-adjacent units as zero.

\[w_{ij} = \begin{cases} 1 \text{ if } A_i \text{ and } A_j \text{ are adjacent}\\ 0 \text{ otw} \end{cases}\]

other options to specify the weight matrix \(W\) are mentioned in Zuur, Ieno, and Smith (2007).

Voyager has a special function for the construction of the weight matrix in Visium data findVisiumGraph. This function defines the neigbours to be the surrounding spots.

R
Python

Show the code

# there is an empty spot we have to exclude as it has no neighbours
sfe <- sfe[, colnames(sfe) != "GAGATCTGCTTGGCAT-1"]

# construct the weight matrix using the Voyager function
colGraph(sfe, "visium") <- findVisiumGraph(sfe)
colGraph(sfe, "binary") <- findVisiumGraph(sfe, style = "B")

plotColGraph(sfe,
  colGraphName = colGraphName
) + theme_void()

Show the code

sq.gr.spatial_neighbors(adata, n_neighs=6, coord_type="grid")
# remove unconnected spots
adata = adata[adata.obsp["spatial_connectivities"].sum(1).A1 > 0, :]
# calculate spatial weights
spatial_weights = W.from_sparse(adata.obsp["spatial_connectivities"])

# define a wider neighbourhood graph - result is more smoothed
sq.gr.spatial_neighbors(adata, n_neighs=50, coord_type="grid", key_added="wide_spatial")
spatial_weights_wide = W.from_sparse(adata.obsp["wide_spatial_connectivities"])

fig, ax = plt.subplots(1, 1, figsize=(7, 7), layout = "tight")
sq.pl.spatial_scatter(
    adata[adata.obsp["spatial_connectivities"].nonzero()[0], :],
    connectivity_key="spatial_connectivities",
    size=0.1,
    na_color="black",
    edges_color="black",
    edges_width=0.1,
    shape=None,
    library_id="spatial",
    ax=ax,
    fig=fig,
)

ax.set_axis_off()
fig.suptitle("Connectivity grid", size=10, y=0.9)
fig.set_dpi(200)

Global Measures for Bivariate Data

Global Bivariate Moran’s \(I\)

For two continous variables the global bivariate Moran’s \(I\) is defined as (Wartenberg 1985; Bivand 2022)

\[I_B = \frac{\Sigma_i(\Sigma_j{w_{ij}y_j\times x_i})}{\Sigma_i{x_i^2}}\]

where \(x_i\) and \(y_j\) are the two variables of interest and \(w_{ij}\) is the value of the spatial weights matrix for positions \(i\) and \(j\).

The global bivariate Moran’s \(I\) is a measure of correlation between the variables \(x\) and \(y\) where \(y\) has a spatial lag. The result might overestimate the spatial autocorrelation of the variables due to the non-spatial correlation of \(x\) and \(y\) (Bivand 2022).

R
Python

Show the code

res <- spdep::moran_bv(x = logcounts(sfe)[features[1],],
         y = logcounts(sfe)[features[2],],
         listw =  colGraph(sfe, colGraphName),
         nsim = 499,
         scale = TRUE)
res


DATA PERMUTATION


Call:
boot(data = xx, statistic = bvm_boot, R = nsim, sim = "permutation", 
    listw = listw, parallel = parallel, ncpus = ncpus, cl = cl)


Bootstrap Statistics :
     original     bias    std. error
t1* 0.2663111 -0.2657355 0.007882321

Show the code

plot(res)

Show the code

ci <- boot::boot.ci(res, conf = c(0.99, 0.95, 0.9), type = "basic")
ci

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 499 bootstrap replicates

CALL : 
boot::boot.ci(boot.out = res, conf = c(0.99, 0.95, 0.9), type = "basic")

Intervals : 
Level      Basic         
99%   ( 0.5116,  0.5538 )   
95%   ( 0.5152,  0.5473 )   
90%   ( 0.5188,  0.5452 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable

Show the code

np.random.seed(3407)
moran_bv = Moran_BV(
    adata[:, features[0]].X.toarray(),
    adata[:, features[1]].X.toarray(),
    spatial_weights,
    transformation="r",
    permutations=499,
)
adata.uns[f"moran_bv_{features[0]}_{features[1]}"] = moran_bv.I
adata.uns[f"moran_bv_{features[0]}_{features[1]}_p_sim"] = moran_bv.p_sim

for key in filter(lambda x: x.startswith("moran_"), adata.uns.keys()):
    print(f"{key}: {adata.uns[key].round(4)}")

moran_bv_Slc5a12_Calr: 0.2663
moran_bv_Slc5a12_Calr_p_sim: 0.002

Show the code

    
plt.figure(figsize=(5, 4), dpi=100)
hist = plt.hist(moran_bv.sim, bins=30, color="lightgrey", edgecolor="black")
plt.axvline(moran_bv.I, color="red", linestyle="--")
plt.xlabel("Simulated bivariate Moran's I")
plt.ylabel("Simulated frequency")
plt.title(f"Bivariate Moran's I: {features[0]} vs {features[1]} (I: {moran_bv.I.round(4)}, p: {moran_bv.p_sim.round(4)})")
sns.despine()

The value t0 indicates the test statistic of global bivariate Moran’s \(I\). The global bivariate Moran’s \(I\) value for the genes Slc5a12, Calr is 0.2663111. Significance can be assessed by comparing the permuted confidence interval with the test statistic.

Global Bivariate Lee’s \(L\)

Lee’s \(L\) is a bivariate measure that combines non-spatial pearson correlation with spatial autocorrelation via Moran’s \(I\) (Lee 2001). This enables us to asses the spatial dependence of two continuous variables in a single measure. The measure is defined as

\[L(x,y) = \frac{n}{\sum_{i=1}^n(\sum_{j=1}^nw_{ij})^2}\frac{\sum_{i=1}^n[\sum_{j=1}^nw_{ij}(x_j-\bar{x})](\sum_{j=1}^nw_{ij}(y_j-\bar{y}))}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\]

where \(w_{ij}\) is the value of the spatial weights matrix for positions \(i\) and \(j\), \(x\) and \(y\), the two variables of interest and \(\bar{x}\) and \(\bar{y}\) their means (Lee 2001; Bivand 2022).

R
Python

Show the code

res_lee <- calculateBivariate(sfe, type = "lee.mc", 
                   feature1 = features[1], feature2 = features[2],
                   colGraphName = colGraphName,
                   nsim = 499)
res_lee$lee.mc_statistic

statistic 
  0.26808

Show the code

res_lee$lee.mc_p.value

[1] 0.002

The effect size of bivariate Lee’s \(L\) for the genes Slc5a12, Calr is 0.26808 and the associated p-value is 0.002

Show the code

np.random.seed(3407)
lees_l_estimator = Spatial_Pearson(connectivity=spatial_weights.to_sparse(), permutations=499)
lees_l_estimator.fit(
    adata[:, features[0]].X.toarray(),
    adata[:, features[1]].X.toarray(),
)

Spatial_Pearson(connectivity=<COOrdinate sparse array of dtype 'float64'
    with 18018 stored elements and shape (3123, 3123)>,
                permutations=499)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show the code

adata.uns[f"lees_l_{features[0]}_{features[1]}"] = lees_l_estimator.association_
adata.uns[f"lees_l_{features[0]}_{features[1]}_p_sim"] = lees_l_estimator.significance_

fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(
    lees_l_estimator.association_,
    cmap=cmap_continuous,
    annot=True,
    cbar=False,
    ax=ax,
    mask=np.triu(np.ones_like(lees_l_estimator.association_)) - np.eye(2),
)
ax.set_title("Lee's L")
ax.set_xticklabels(features[:2])
ax.set_yticklabels(features[:2], rotation=0)

for i, row in enumerate(lees_l_estimator.significance_):
    for j, p in enumerate(row):
        ax.text(j + 0.5, i + 0.75, f"p = {p:.3f}", ha="center", va="center", color="white", fontsize=8)

Local Measures for Bivariate Data

Local Bivariate Moran’s \(I\)

Similar to the global bivariate Moran’s \(I\) statistic, there is a local analogue. The formula is given by:

\[ I_i^B = x_i\sum_jw_{ij}y_j \]

(Anselin 2024; Bivand 2022).

This can be interesting in the context of detection of coexpressed ligand-receptor pairs. A method that is based on local bivariate Moran’s \(I\) and tries to detect such pairs is SpatialDM (Li et al. 2023).

R
Python

Show the code

sfe_tissue <- runBivariate(sfe, type = "localmoran_bv",
                    feature1 = features[1], feature2 = features[2],
                    colGraphName = colGraphName,
                    nsim = 499)

plotLocalResult(sfe_tissue, "localmoran_bv", 
                 features = localResultFeatures(sfe_tissue, "localmoran_bv"),
                ncol = 2, divergent = TRUE, diverge_center = 0,
                colGeometryName = colGeometryName, size = plotsize)

Show the code

local_moran_bv = Moran_Local_BV(
    adata[:, features[0]].X.toarray().astype(np.float64),
    adata[:, features[1]].X.toarray().astype(np.float64),
    spatial_weights,
    transformation="r",
    permutations=499,
    seed=3407,
)
adata.obs[f"local_moran_bv_{features[0]}_{features[1]}"] = local_moran_bv.Is
adata.obs[f"local_moran_bv_{features[0]}_{features[1]}_p_sim"] = local_moran_bv.p_sim

fig, ax = plt.subplots(1, 1, figsize=figsize, layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color=f"local_moran_bv_{features[0]}_{features[1]}",
    cmap=cmap_continuous,
    vmin=-adata.obs[f"local_moran_bv_{features[0]}_{features[1]}"].abs().max(),
    vmax=adata.obs[f"local_moran_bv_{features[0]}_{features[1]}"].abs().max(),
    vcenter=0,
    shape=None,
    library_id="spatial",
    title=f"Local bivariate Moran's I: {features[0]}"+ r"$\rightarrow$" + f"{features[1]}",
    ax=ax,
    fig=fig,
    size=pointsize,
)
ax.set_axis_off()

Local Bivariate Lee’s \(L\)

Similar to the global variant of Lee’s \(L\) the local variant (Lee 2001; Bivand 2022) is defined as

\[L_i(x,y) = \frac{n(\sum_{j=1}^nw_{ij}(x_j-\bar{x}))(\sum_{j=1}^nw_{ij}(y_j-\bar{y}))}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\] Local Lee’s \(L\) is a measure of spatial co-expression, when the variables of interest are gene expression measurements. Unlike the gobal version, the variables are not averaged and show the local contribution to the metric. Positive values indicate colocalization, negative values indicate segregation (Lee 2001; Bivand 2022).

R
Python

Show the code

sfe_tissue <- runBivariate(sfe, type = "locallee",
                    feature1 = features[1], feature2 = features[2],
                    colGraphName = colGraphName)

plotLocalResult(sfe_tissue, "locallee", 
                 features = localResultFeatures(sfe_tissue, "locallee"),
                ncol = 2, divergent = TRUE, diverge_center = 0,
                colGeometryName = colGeometryName, size = plotsize)

Show the code

np.random.seed(3407)
local_lees_l_estimator = Spatial_Pearson_Local(connectivity=spatial_weights.to_sparse(), permutations=499)
local_lees_l_estimator.fit(
    adata[:, features[0]].X.toarray().astype(np.float64),
    adata[:, features[1]].X.toarray().astype(np.float64),
)

Spatial_Pearson_Local(connectivity=<COOrdinate sparse array of dtype 'float64'
    with 18018 stored elements and shape (3123, 3123)>,
                      permutations=499)

Show the code

adata.obs[f"local_lees_l_{features[0]}_{features[1]}"] = local_lees_l_estimator.associations_

fig, ax = plt.subplots(1, 1, figsize=figsize, layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color=f"local_lees_l_{features[0]}_{features[1]}",
    cmap=cmap_continuous,
    vmin=-adata.obs[f"local_lees_l_{features[0]}_{features[1]}"].abs().max(),
    vmax=adata.obs[f"local_lees_l_{features[0]}_{features[1]}"].abs().max(),
    vcenter=0,
    shape=None,
    library_id="spatial",
    title=f"Local Lees's Ls: {features[0]}"+ r"$\leftrightarrow$" + f"{features[1]}",
    ax=ax,
    fig=fig,
    size=pointsize,
)
ax.set_axis_off()

Local Measures for Multivariate Data

Multivariate local Geary’s \(C\)

Geary’s \(C\) is a measure of spatial autocorrelation that is based on the difference between a variable and its neighbours. (Anselin 2019, 1995) defines it as

\[c_i = \sum_{j=1}^n w_{ij}(x_i-y_j)^2\]

and can be generalized to \(k\) features (in our case genes) by expanding

\[c_{k,i} = \sum_{v=1}^k c_{v,i}\]

where \(c_{v,i}\) is the local Geary’s \(C\) for the \(v\)th variable at location \(i\). The number of variables that can be used is not fixed, which makes the interpretation a bit more difficult. In general, the metric summarizes similarity in the “multivariate attribute space” (i.e. the gene expression) to its geographic neighbours. The common difficulty in these analyses is the interpretation of the mixture of similarity in the physical space and similarity in the attribute space (Anselin 2019, 1995).

R
Python

To speed up computation we will use highly variable genes.

Show the code

hvgs <- getTopHVGs(sfe, n = 100)

# Subset of the tissue
sfe_tissue <- runMultivariate(sfe, type = "localC_multi",
                    subset_row = hvgs,
                    colGraphName = colGraphName)

# # Local C mutli is stored in colData so this is a workaround to plot it
# plotSpatialFeature(sfe_tissue, "localC_multi", size = plotsize, scattermore = FALSE)

Show the code

sc.pp.highly_variable_genes(adata, n_top_genes=100, flavor="seurat_v3", inplace=True)

local_geary_mv_estimator = Geary_Local_MV(connectivity=spatial_weights, permutations=100)
local_geary_mv_estimator.fit(
    [
        adata[:, highly_variable_gene].X.toarray()[:, 0]
        for highly_variable_gene in adata.var_names[adata.var["highly_variable"]].to_list()
    ]
)

Geary_Local_MV(connectivity=<libpysal.weights.weights.W object at 0x34781e860>,
               permutations=100)

Show the code


adata.obs["local_geary_mv"] = local_geary_mv_estimator.localG
adata.obs["local_geary_mv_p"] = -np.log10(false_discovery_control(local_geary_mv_estimator.p_sim))

We can further plot the results of the permutation test. Significant values indicate interesting regions, but should be interpreted with care for various reasons. For example, we are looking for similarity in a combination of multiple features but the exact combination is not known. Anselin (2019) write “Overall, however, the statistic indicates a combination of the notion of distance in multi-attribute space with that of geographic neighbors. This is the essence of any spatial autocorrelation statistic. It is also the trade-off encountered in spatially constrained multivariate clustering methods (for a recent discussion, see, e.g., Grubesic, Wei, and Murray 2014).”. Multi-attribute space refers here to the highly variable genes. The problem can be summarised to where the similarity comes from, the gene expression or the physical space (Anselin 2019). The same problem is common in spatial domain detection methods.

R
Python

Show the code

sfe <- runMultivariate(sfe, type = "localC_perm_multi",
                    subset_row = hvgs,
                    nsim = 100,
                    colGraphName= colGraphName)

# stored as spatially reduced dim; plot it in this way
spatialReducedDim(sfe, "localC_perm_multi",  c(1, 11), size = plotsize/2)

Show the code

fig, ax = plt.subplots(1, 2, figsize=(len(features)*7, 7), layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color=["local_geary_mv", "local_geary_mv_p"],
    cmap="Blues",
    vmin=0,
    shape=None,
    library_id="spatial",
    title=["Geary's local multivariate C", "Simulated $-log_{10}(p_{adjusted})$"],
    ax=ax,
    fig=fig,
    size=pointsize
)
for ax in ax:
    ax.set_axis_off()

plotted are the effect size and the adjusted p-values in space.

Local Neighbour Match Test

This test is useful to assess the overlap of the \(k\)-nearest neighbours from physical distances (tissue space) with the \(k\)-nearest neighbours from the gene expression measurements (attribute space). \(k\)-nearest neighbour matrices are computed for both physical and attribute space. In a second step the probability of overlap between the two matrices is computed (Anselin and Li 2020).

R
Python

Show the code

sf <- colGeometries(sfe)[[colGeometryName]]
sf <- cbind(sf,  t(as.matrix(logcounts(sfe)[hvgs,])))
# "-" gets replaced by "." so harmonise here
hvgs <- gsub("-", ".", hvgs)

nbr_test <- neighbor_match_test(sf[c(hvgs)], k = 6)

sf$Probability <- nbr_test$Probability
sf$Cardinality <- nbr_test$Cardinality

p <- tm_shape(sf) + tm_dots(col = 'Cardinality', size = 0.06 * plotsize)
q <- tm_shape(sf) + tm_dots(col = 'Probability', size = 0.06 * plotsize)  

tmap_arrange(p,q)

Show the code

k = 20
# Spatial grid neighbors
df_neighbors_spatial = spatial_weights.to_adjlist()
df_neighbors_features = KNN(KDTree(adata[:, adata.var["highly_variable"]].X.toarray(), distance_metric="euclidean"), k=k).to_adjlist()

focal_points = sorted(set(df_neighbors_spatial.focal).intersection(df_neighbors_features.focal))
focal_points_names = adata.obs_names[focal_points]

df_neighborhood_match_test = pd.DataFrame(columns=["neighbors_match_count", "neighbors_match_fraction"], index=focal_points_names)

for focal_point, focal_name in zip(focal_points, focal_points_names):
    neighbors_spatial = set(df_neighbors_spatial[df_neighbors_spatial.focal == focal_point].neighbor)
    neighbors_features = set(df_neighbors_features[df_neighbors_features.focal == focal_point].neighbor)
    neighbors_match_count = len(neighbors_spatial.intersection(neighbors_features))
    neighbors_match_fraction = neighbors_match_count / len(neighbors_spatial)
    df_neighborhood_match_test.loc[focal_name] = [neighbors_match_count, neighbors_match_fraction]

adata.obs["neighbors_match_count"] = df_neighborhood_match_test.loc[adata.obs_names, "neighbors_match_count"]
adata.obs["neighbors_match_fraction"] = df_neighborhood_match_test.loc[adata.obs_names, "neighbors_match_fraction"]

# plot the results
neighborhood_match_test_features = ["neighbors_match_count", "neighbors_match_fraction"]
fig, axes = plt.subplots(1, len(neighborhood_match_test_features), figsize=(len(neighborhood_match_test_features)*5, 5), layout = "tight")
for feature, ax in zip(neighborhood_match_test_features, axes):
    title = feature.replace("_", " ").capitalize()
    sq.pl.spatial_scatter(adata,
    color=feature, 
    shape=None, 
    size=pointsize, 
    cmap="YlOrBr", 
    title=title, 
    ax=ax, 
    fig=fig,
    use_raw=False)
    ax.set_axis_off()

Cardinality is a measure of how many neighbours of the two matrices are common. Some regions show high cardinality with low probability and therefore share similarity on both attribute and physical space. In contrast to multivariate local Geary’s \(C\) this metric focuses directly on the distances and not on a weighted average. A problem of this approach is called the empty space problem which states that as the number of dimensions of the feature sets increase, the empty space between observations also increases (Anselin and Li 2020).

Measures for binary and categorical data

Join count statistic

In addition to measures of spatial autocorrelation of continuous data as seen above, the join count statistic method applies the same concept to binary and categorical data. In essence, the joint count statistic compares the distribution of categorical marks in a lattice with frequencies that would occur randomly. These random occurrences can be computed using a theoretical approximation or random permutations. The same concept was also extended in a multivariate setting with more than two categories. The corresponding spdep functions are called joincount.test and joincount.multi (Dale and Fortin 2014; Bivand 2022; Cliff and Ord 1981).

First, we need to get categorical marks for each data point. We do so by running (non-spatial) PCA on the data followed by Leiden clustering (Traag, Waltman, and Van Eck 2019).

R
Python

Show the code

library(BiocNeighbors)
library(BiocSingular)

set.seed(123)
# Run PCA on the sample
sfe <- runPCA(sfe, exprs_values = "logcounts", ncomponents = 50, BSPARAM = IrlbaParam())
# Cluster based on first 20 PC's and using leiden
colData(sfe)$cluster <- clusterRows(reducedDim(sfe, "PCA")[,1:10],
                                    BLUSPARAM = KNNGraphParam(
                                      k = 20,
                                      BNPARAM=AnnoyParam(ntrees=50),
                                      cluster.fun = "leiden",
                                      cluster.args = list(
                                          resolution = 0.3,
                                          objective_function = "modularity")))

plotSpatialFeature(sfe,
  "cluster",
  colGeometryName = colGeometryName, size = plotsize
)

Show the code

from sklearn_ann.kneighbors.annoy import AnnoyTransformer

np.random.seed(123)
#compute a PCA on the 
sc.pp.pca(adata, n_comps = 50, zero_center = True, svd_solver = "arpack")
#compute the neighbours
sc.pp.neighbors(adata, use_rep = "X_pca", knn = True, n_pcs = 10, transformer=AnnoyTransformer(n_neighbors=20, n_trees=50))
#compute leiden clustering
sc.tl.leiden(adata, resolution = 0.3, flavor = "igraph", objective_function = "modularity")

fig, ax = plt.subplots(1, 1, figsize=figsize, layout = "tight")
sq.pl.spatial_scatter(
    adata,
    color="leiden",
    palette=ListedColormap(cmap_discrete.colors[:len(np.unique(adata.obs["leiden"]))]),
    shape=None,
    library_id="spatial",
    title="Clusters",
    ax=ax,
    fig=fig,
    size=pointsize
)
ax.set_axis_off()

The join count statistics of this example are:

R
Python

Show the code

joincount.multi(colData(sfe)$cluster,
             colGraph(sfe, 'binary'))

     Joincount Expected Variance  z-value
1:1   1595.000  342.131  230.086   82.596
2:2   2964.000 1210.328  529.827   76.187
3:3   1555.000  390.990  255.053   72.885
4:4    666.000   62.703   53.387   82.568
5:5   1215.000  192.554  144.048   85.189
2:1      3.000 1288.619  776.179  -46.146
3:1    283.000  732.656  509.910  -19.913
3:2    463.000 1377.489  824.087  -31.856
4:1    132.000  293.738  226.149  -10.755
4:2      0.000  552.265  351.323  -29.464
4:3      0.000  313.995  238.454  -20.334
5:1     24.000  514.322  376.868  -25.257
5:2     51.000  966.993  596.735  -37.497
5:3      4.000  549.793  397.877  -27.362
5:4     54.000  220.424  178.065  -12.472
Jtot  1014.000 6810.293 1497.669 -149.776

Show the code

sq.gr.interaction_matrix(adata, "leiden", normalized = False, connectivity_key="spatial", weights = False)
df_interactions = pd.DataFrame(adata.uns["leiden_interactions"], columns=np.unique(adata.obs["leiden"]), index=np.unique(adata.obs["leiden"]))
# add lower triangular matrix (w/o diagonal) to the dataframe and divide by 2
array_join_counts = (df_interactions + np.tril(df_interactions, k = -1).T)/2
#only print the upper triangular matrix
np.triu(array_join_counts)

array([[1.254e+03, 2.000e+00, 1.310e+02, 3.100e+02, 0.000e+00, 6.000e+00],
       [0.000e+00, 2.938e+03, 0.000e+00, 4.530e+02, 1.900e+01, 2.500e+01],
       [0.000e+00, 0.000e+00, 9.920e+02, 6.000e+00, 0.000e+00, 7.300e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, 1.552e+03, 0.000e+00, 1.600e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 3.240e+02, 5.300e+01],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 8.550e+02]])

The Python function sq.gr.interaction_matrix counts the interaction for each pair twice, while the R function joincount.multi counts each interaction only once. Therefore, in Python we add the lower triangular matrix to the upper triangle (without the diagonal) and divide the resulting interaction matrix by 2. Since there are differences in the implementation of the principal component calculcation (namely in the SVD decomposition of the sparse logcounts matrix), the results are not perfectly corresponding, c.f. Rich et al. (2024).

The rows show different combinations of clusters that are in physical contact. E.g. \(1:1\) means the cluster \(1\) with itself. The column Joincount is the observed statistic whereas the column Expected is the expected value of the statistic for this combination. Like this, we can compare whether contacts among cluster combinations occur more frequently than expected at random (Cliff and Ord 1981).

A note of caution

The local methods presented above should always be interpreted with care, since we face the problem of multiple testing when calculating them for each cell. Moreover, the presented methods should mainly serve as exploratory measures to identify interesting regions in the data. Multiple processes can lead to the same pattern, thus the underlying process cannot be inferred from characterising the pattern. Indication of clustering does not explain why this occurs. On one hand, clustering can be the result of spatial interaction between the variables of interest. This is the case if a gene of interest is highly expressed in a tissue region. On the other hand, clustering can be the result of spatial heterogeneity, when local similarity is created by structural heterogeneity in the tissue, e.g., when cells with uniform expression of a gene of interest are grouped together which then creates the apparent clustering of the gene expression measurement.

Appendix

Session info

Show the code

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.7.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BiocSingular_1.16.0            BiocNeighbors_1.18.0          
 [3] dixon_0.0-9                    splancs_2.01-45               
 [5] sp_2.1-4                       bluster_1.10.0                
 [7] magrittr_2.0.3                 stringr_1.5.1                 
 [9] spdep_1.3-8                    spData_2.3.3                  
[11] tmap_3.3-4                     scater_1.28.0                 
[13] scran_1.28.2                   scuttle_1.10.3                
[15] SFEData_1.2.0                  SpatialFeatureExperiment_1.2.3
[17] Voyager_1.2.7                  rgeoda_0.0.11-1               
[19] digest_0.6.37                  sf_1.0-19                     
[21] reshape2_1.4.4                 patchwork_1.3.0               
[23] STexampleData_1.8.0            ExperimentHub_2.8.1           
[25] AnnotationHub_3.8.0            BiocFileCache_2.8.0           
[27] dbplyr_2.3.4                   rlang_1.1.4                   
[29] ggplot2_3.5.1                  dplyr_1.1.4                   
[31] spatstat_3.3-0                 spatstat.linnet_3.2-3         
[33] spatstat.model_3.3-3           rpart_4.1.24                  
[35] spatstat.explore_3.3-3         nlme_3.1-166                  
[37] spatstat.random_3.3-2          spatstat.geom_3.3-4           
[39] spatstat.univar_3.1-1          spatstat.data_3.1-4           
[41] SpatialExperiment_1.10.0       SingleCellExperiment_1.22.0   
[43] SummarizedExperiment_1.30.2    Biobase_2.60.0                
[45] GenomicRanges_1.52.1           GenomeInfoDb_1.36.4           
[47] IRanges_2.34.1                 S4Vectors_0.38.2              
[49] BiocGenerics_0.46.0            MatrixGenerics_1.12.3         
[51] matrixStats_1.5.0             

loaded via a namespace (and not attached):
  [1] spatstat.sparse_3.1-0         bitops_1.0-9                 
  [3] httr_1.4.7                    RColorBrewer_1.1-3           
  [5] tools_4.3.1                   R6_2.5.1                     
  [7] HDF5Array_1.28.1              mgcv_1.9-1                   
  [9] rhdf5filters_1.12.1           withr_3.0.2                  
 [11] gridExtra_2.3                 leaflet_2.2.2                
 [13] leafem_0.2.3                  cli_3.6.3                    
 [15] labeling_0.4.3                proxy_0.4-27                 
 [17] dbscan_1.2-0                  R.utils_2.12.3               
 [19] dichromat_2.0-0.1             scico_1.5.0                  
 [21] limma_3.56.2                  rstudioapi_0.17.1            
 [23] RSQLite_2.3.9                 generics_0.1.3               
 [25] crosstalk_1.2.1               Matrix_1.5-4.1               
 [27] ggbeeswarm_0.7.2              abind_1.4-8                  
 [29] R.methodsS3_1.8.2             terra_1.8-5                  
 [31] lifecycle_1.0.4               yaml_2.3.10                  
 [33] edgeR_3.42.4                  rhdf5_2.44.0                 
 [35] tmaptools_3.1-1               grid_4.3.1                   
 [37] blob_1.2.4                    promises_1.3.2               
 [39] dqrng_0.4.1                   crayon_1.5.3                 
 [41] dir.expiry_1.8.0              lattice_0.22-6               
 [43] beachmat_2.16.0               KEGGREST_1.40.1              
 [45] magick_2.8.5                  pillar_1.10.1                
 [47] knitr_1.49                    metapod_1.7.0                
 [49] rjson_0.2.23                  boot_1.3-28.1                
 [51] codetools_0.2-19              wk_0.9.4                     
 [53] glue_1.8.0                    vctrs_0.6.5                  
 [55] png_0.1-8                     gtable_0.3.6                 
 [57] cachem_1.1.0                  xfun_0.50                    
 [59] S4Arrays_1.0.6                mime_0.12                    
 [61] DropletUtils_1.20.0           units_0.8-5                  
 [63] statmod_1.5.0                 interactiveDisplayBase_1.38.0
 [65] bit64_4.5.2                   filelock_1.0.3               
 [67] irlba_2.3.5.1                 vipor_0.4.7                  
 [69] KernSmooth_2.23-26            colorspace_2.1-1             
 [71] DBI_1.2.3                     zellkonverter_1.9.0          
 [73] raster_3.6-30                 tidyselect_1.2.1             
 [75] bit_4.5.0.1                   compiler_4.3.1               
 [77] curl_6.1.0                    basilisk.utils_1.12.1        
 [79] DelayedArray_0.26.7           scales_1.3.0                 
 [81] classInt_0.4-10               rappdirs_0.3.3               
 [83] goftest_1.2-3                 spatstat.utils_3.1-2         
 [85] rmarkdown_2.29                basilisk_1.11.2              
 [87] XVector_0.40.0                htmltools_0.5.8.1            
 [89] pkgconfig_2.0.3               base64enc_0.1-3              
 [91] sparseMatrixStats_1.12.2      fastmap_1.2.0                
 [93] htmlwidgets_1.6.4             shiny_1.10.0                 
 [95] DelayedMatrixStats_1.22.6     farver_2.1.2                 
 [97] jsonlite_1.8.9                BiocParallel_1.34.2          
 [99] R.oo_1.27.0                   RCurl_1.98-1.16              
[101] GenomeInfoDbData_1.2.10       s2_1.1.7                     
[103] Rhdf5lib_1.22.1               munsell_0.5.1                
[105] Rcpp_1.0.13-1                 ggnewscale_0.5.0             
[107] viridis_0.6.5                 reticulate_1.40.0            
[109] stringi_1.8.4                 leafsync_0.1.0               
[111] zlibbioc_1.46.0               plyr_1.8.9                   
[113] parallel_4.3.1                ggrepel_0.9.6                
[115] deldir_2.0-4                  Biostrings_2.68.1            
[117] stars_0.6-7                   splines_4.3.1                
[119] tensor_1.5                    locfit_1.5-9.10              
[121] igraph_2.1.2                  ScaledMatrix_1.8.1           
[123] BiocVersion_3.17.1            XML_3.99-0.18                
[125] evaluate_1.0.1                BiocManager_1.30.25          
[127] httpuv_1.6.15                 purrr_1.0.2                  
[129] polyclip_1.10-7               rsvd_1.0.5                   
[131] lwgeom_0.2-14                 xtable_1.8-4                 
[133] e1071_1.7-16                  RSpectra_0.16-2              
[135] later_1.4.1                   viridisLite_0.4.2            
[137] class_7.3-22                  tibble_3.2.1                 
[139] memoise_2.0.1                 beeswarm_0.4.0               
[141] AnnotationDbi_1.62.2          cluster_2.1.4                
[143] BiocStyle_2.28.1

References

10x Genomics. 2021. “Adult Mouse Kidney (FFPE), Spatial Gene Expression Dataset Analyzed Using Space Ranger 1.3.0.”

Anselin, Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2): 93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x.

———. 2019. “A Local Indicator of Multivariate Spatial Association: Extending Geary’s c.” Geographical Analysis 51 (2): 133–50. https://doi.org/10.1111/gean.12164.

———. 2024. An Introduction to Spatial Data Science with GeoDa: Volume 1: Exploring Spatial Data. CRC Press.

Anselin, Luc, and Xun Li. 2020. “Tobler’s Law in a Multivariate World.” Geographical Analysis 52 (4): 494–510. https://doi.org/10.1111/gean.12237.

Bivand, Roger. 2022. “R Packages for Analyzing Spatial Data: A Comparative Case Study with Areal Data.” Geographical Analysis 54 (3): 488–518. https://doi.org/10.1111/gean.12319.

Cliff, Andrew David, and J Keith Ord. 1981. Spatial Processes: Models & Applications. Pion, London.

Dale, Mark R. T., and Marie-Josée Fortin. 2014. Spatial Analysis: A Guide for Ecologists. Second Edition. Cambridge ; New York: Cambridge University Press.

Lee, Sang-Il. 2001. “Developing a Bivariate Spatial Association Measure: An Integration of Pearson’s r and Moran’s I.” Journal of Geographical Systems 3 (4): 369–85. https://doi.org/10.1007/s101090100064.

Li, Zhuoxuan, Tianjie Wang, Pentao Liu, and Yuanhua Huang. 2023. “SpatialDM for Rapid Identification of Spatially Co-Expressed Ligand–Receptor and Revealing Cell–Cell Communication Patterns.” Nature Communications 14 (1): 3995. https://doi.org/10.1038/s41467-023-39608-w.

Moses, Lambda, Pétur Helgi Einarsson, Kayla C. Jackson, Laura Luebbert, Ali Sina Booeshaghi, Sindri Emmanúel Antonsson, Nicolas Bray, Páll Melsted, and Lior Pachter. 2023. “Voyager: Exploratory Single-Cell Genomics Data Analysis with Geospatial Statistics.” bioRxiv. https://doi.org/10.1101/2023.07.20.549945.

Palla, Giovanni, Hannah Spitzer, Michal Klein, David Fischer, Anna Christina Schaar, Louis Benedikt Kuemmerle, Sergei Rybakov, et al. 2022. “Squidpy: A Scalable Framework for Spatial Omics Analysis.” Nature Methods 19 (2): 171–78. https://doi.org/10.1038/s41592-021-01358-2.

Rey, Sergio J., and Luc Anselin. 2010. “PySAL: A Python Library of Spatial Analytical Methods.” In Handbook of Applied Spatial Analysis, 175–93. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03647-7_11.

Rich, Joseph M, Lambda Moses, Pétur Helgi Einarsson, Kayla Jackson, Laura Luebbert, A. Sina Booeshaghi, Sindri Antonsson, et al. 2024. “The Impact of Package Selection and Versioning on Single-Cell RNA-seq Analysis.” bioRxiv, April, 2024.04.04.588111. https://doi.org/10.1101/2024.04.04.588111.

Traag, Vincent A, Ludo Waltman, and Nees Jan Van Eck. 2019. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports 9 (1): 1–12. https://doi.org/10.1038/s41598-019-41695-z.

Virshup, Isaac, Sergei Rybakov, Fabian J Theis, Philipp Angerer, and F Alexander Wolf. 2024. “Anndata: Access and Store Annotated Data Matrices.” Journal of Open Source Software 9 (101): 4371. https://doi.org/10.21105/joss.04371.

Wartenberg, Daniel. 1985. “Multivariate Spatial Correlation: A Method for Exploratory Geographical Analysis.” Geographical Analysis 17 (4): 263–83. https://doi.org/10.1111/j.1538-4632.1985.tb00849.x.

Zuur, Alain F., Elena N. Ieno, and Graham M. Smith. 2007. Analysing Ecological Data. Statistics for Biology and Health. New York: Springer.