Show the code
source("utils.R")June 13, 2025
In this vignette we will show:
Univariate point pattern data analysis methods for imaging-based approaches.
This includes global metrics on the entire field of view and local variants thereof.
The use case is a MERFISH data set from Moffitt et al. (2018).
For more background on point pattern analysis we refer the reader to Baddeley, Rubak, and Turner (2015).
For the implementations we use R. Data is represented as SpatialExperiment (Righelli et al. 2022). For calculation of spatial functions we will use the package spatialFDA.
The script that is sourced contains functions that can be found in the setup.
spe <- readRDS("../data/spe.rds")
#define the Z-stacks that you want to compare
zstack_list <- list("-0.09", "0.01", "0.21")
#define the celltype that you want to compare across the stacks - hereby we assume independence across the z-stacks which is an assumption that can be challenged
celltype_ls <- "OD Mature"
selectZstacks <- function(zstack, spe){
  sub <- spe[, spe$sample_id == zstack]
  pp <- .ppp(sub, marks = "cluster_id")
  return(pp)
}
pp_ls <- lapply(zstack_list, selectZstacks, spe)
names(pp_ls) <- zstack_listThe theory of spatial point patterns is discussed in great detail in (Baddeley, Rubak, and Turner 2015). The book has an accompanying package called spatstat which offers great functionality to the theoretical concepts described in the book (Baddeley and Turner 2005). This chapter relies heavily on both publications.
In point pattern analysis we assume that the patterns we observe are a realisation of a stochastic process called a point process. The inferences we make about the point pattern are based on the point process. For example, if a pattern is created by a Poisson point process it will be evenly distributed in the observation window (Baddeley, Rubak, and Turner 2015, 127).
When considering a pattern with \(m\) multiple types, as we do in the (Moffitt et al. 2018) dataset, there are two distinct ways of looking at the data. One can view the pattern as a multitype point pattern, where all the points are sampled from the same point process. The other option is to consider the pattern as a multivariate point pattern, where the points come from \(m\) distinct point processes. The distinction between these two views lies in the assumption that in the multitype framework all points stem from a single point process, while in the multivariate framework, types stem from distinct point processes, allowing us to consider dependencies of individual types. Whether or not the patterns stem from the same point process depends on the biological question. If we analyse two cell types within one slice of a tissue, we should consider them as being sampled from one point process. However, if we consider the distribution of a cell type in two slices of the same tissue we have grounds to consider them as distinct processes (Baddeley, Rubak, and Turner 2015, 564–65).
Most often, one does not observe the entire point process, but only a subset of the process. This is called window sampling. Instead of observing the entire pattern we examine a subset of this pattern in the so called window. For example, small microscopy windows provide a view of portions of a larger tissue slice (Baddeley, Rubak, and Turner 2015, 143–45).
There is another concept called the small world model. It assumes that points can only be observed in a finite small world and not beyond these boundaries. When thinking of an entire tissue, this is a very common scenario. Cells can only be observed within the tissue and not beyond. In this case, it would be correct to not assume a rectangular observation window but to use methods to estimate an unknown sampling window such as the Ripley-Rasson estimate of a spatial region (Baddeley, Rubak, and Turner 2015, 144–45; Ripley and Rasson 1977).
The Ripley-Rasson method estimates convex observation windows. In our sample an ideal approximation would be non-convex, due to the tissue fold at the bottom. However, as non-convex estimation requires more complicated methods we will estimate the observation window using a Ripley-Rasson estimator.
setRiprasWindows <- function(pp){
  Window(pp) <- ripras(pp)
  return(pp)
}
#the entire point patterns with the ripras windows
pp <- lapply(pp_ls, setRiprasWindows)
separateMarks <- function(pp){
  #split the multitype point process into several single type processes
  ppls <- split(pp)
  return (ppls)
}
#the point patterns separated by their marks
pp_ls <- lapply(pp, separateMarks)An option to define arbitrary observation windows is to base them directly on the intensity of points. We do this using sosta. This package calculates an observation window based on an intensity threshold of points in space.
library('sosta')
### code adapted from https://bioconductor.org/packages/release/bioc/vignettes/sosta/inst/doc/StructureReconstructionVignette.html ###
#subset to three images
spe_sub <- subset(spe, ,sample_id %in% c('-0.09', '0.01', '0.21'))
# first we select the intensity thresholds for the binarisation of the image
n <- estimateReconstructionParametersSPE(
    spe_sub,
    marks = "cluster_id",
    imageCol = "sample_id",
    markSelect = NULL,
    plotHist = FALSE
)
thresSPE <- mean(n$thres)
bndwSPE <- mean(n$bndw)
# calculate the observation window based on the density
allStructs <- reconstructShapeDensitySPE(
    spe_sub,
    marks = "cluster_id",
    imageCol = "sample_id",
    markSelect = NULL,
    bndw = bndwSPE,
    thres = thresSPE,
    nCores = 1
)
p_ls <- lapply(list('-0.09', '0.01', '0.21'), function(elem){
    spe_sub <- subset(spe,,sample_id == elem)
    allStructs_sub <- subset(allStructs, sample_id == elem)
    p <- cbind(colData(spe_sub), spatialCoords(spe_sub)) |>
    as.data.frame() |>
    ggplot(aes(x = x, y = y)) +
    geom_point(size = 0.25) +
    geom_sf(
      data = allStructs_sub,
      fill = NA,
      color = "red",
      inherit.aes = FALSE, # this is important
      linewidth = 0.8
    ) +
    facet_grid(~sample_id)+
    theme_light()
    return(p)
})
patchwork::wrap_plots(p_ls)Alternatively, one can lend from the concept of inhomogeneity correction (discussed later). Here, we calculate the density of the unmarked point pattern. We then scale this density by the step size of the image. This allows us to use inhomogeneous functions such as Kinhom with a custom intensity based on the unmarked point pattern as lambdaX = dens_im_scaled. Both the restriction of the observation window based on the density of points and the inhomogeneity correction via the scaled density of the unmarked point pattern are discussed in more detail in Emons et al. (2024)
ppDens <- SPE2ppp(spe,
marks = "cluster_id", imageCol = "sample_id",
imageId = "-0.09"
)
#dens_im <- density()
dens_im <- density(unmark(ppDens))
# calculate the scale as we have less points compared to unmarked
scale <- sum(dens_im$v) / (dens_im$xstep * dens_im$ystep)
# scale the density
dens_im_scaled <- dens_im * scale
plot(dens_im_scaled)Complete spatial randomness (CSR) is often used as the null model for various point patterns. It is the result of a Poisson process. A completely spatial random process is characterised by two properties, homogeneity and independence, as discussed below (Baddeley, Rubak, and Turner 2015, 132).
“Homogeneity […] means that the expected number of points falling in a region \(B\) should be proportional to its area \(|B|\)” (Baddeley, Rubak, and Turner 2015, 132) given a proportionality constant \(\lambda\). The constant \(\lambda\) represents the intensity of the process, i.e., the average number of points in a unit area (Baddeley, Rubak, and Turner 2015, 132–33)
\[ \mathbb{E}[X\cap B] = \lambda |B|. \label{eq:expected_number_points} \]
Independence implies that in two (non-overlapping) regions \(A\) and \(B\), the number of points \(n(X\cap A)\) and \(n(X\cap B)\) are independent random variables. In other words, the number of points in region \(A\) does not affect the number of points in region \(B\). (Baddeley, Rubak, and Turner 2015, 133).
A Poisson process that is spatially varying in its average density of points is called inhomogeneous. Here, the average density, \(\lambda(u)\), sometimes known as the intensity function (see below), is a function of the spatial location \(u\). In this case, the expected number of points within a region \(B\), denoted by \(\mu = n(X\cap B)\), is given by the integral of the intensity function over that region (Baddeley, Rubak, and Turner 2015, 138)
\[ \mu = \int_{B} \lambda(u) du. \label{eq:expected_number_inhomogeneous} \]
Often, it is hard to distinguish inhomogeneous intensity from genuine clustering of points. Patterns can arise from an inhomogeneous process or a clustered process and it is not straightforward to distinguish them: one is governed by different (inhomogeoenous) intensities whereas the other is due to interactions (clustering) between points (Baddeley, Rubak, and Turner 2015, 151 ff.). One way this can occur is if cells of different sizes are treated as points instead of considering their entire volume (Baddeley, Rubak, and Turner 2015, 210 ff.).
A point process is called isotropic, if its statistical properties are invariant to rotations; a CSR process is both stationary and isotropic (Baddeley, Rubak, and Turner 2015, 147).
“A point process is called stationary if, when we view the process through a window \(W\) , its statistical properties do not depend on the location of the window in two-dimensional space.” (Baddeley, Rubak, and Turner 2015, 146). This is the case for any homogeneous point process, where the statistical properties of the pattern are unchanged given shifting of the observation window. This means it is stationary in all statistical properties; first-order properties (e.g. intensity) and second-order properties (e.g. correlation) (Baddeley, Rubak, and Turner 2015, 218). Not all metrics assume stationarity in its full sense. Inhomogeneous metrics only assume second-order / correlation stationarity. This means that while the intensity function can vary spatially (first-order stationarity is not given), the estimates of correlation functions (e.g. the inhomogeneous \(K\)-function) are expected to be consistent across different parts of the window (Baddeley, Rubak, and Turner 2015, 689 ff.).
If a process is not correlation stationary, so the estimates of the inhomogeneous metric vary between locations, locally-scaled versions of the metric are applicable. This means that in subregions, the process is still stationary and isotropic, but there is a rescaling factor that can vary across the total process (Baddeley, Rubak, and Turner 2015, 246–47).
To test the assumption of inhomogeneity, we can use a permutation test. In this scenario, we split the patterns into quadrats and compare the estimated functions between the quadrats. It should be noted that this test depends on the arbitrary definition of the quadrats. The chosen patterns are not independent but result as marks from an overall point-pattern. Therefore, the permutation approach is questionable (Baddeley, Rubak, and Turner 2015, 689–93).
As the interpretation of the permutation test is highly dependent on the quadrats, the results should be interpreted with care. Both inhomogeneous and locally scaled versions of the summary functions have support and both offer interesting insights into the spatial pattern. Therefore, we will compare all versions and show what the choice of metrics means for their interpretation.
Intensity is the expected density of points per unit area. It can be interpreted as the rate of occurrence or the abundance of events. The intensity represents a first order property because it is related to the expected number of points. More formally the average intensity of a point process is defined as:
\[ \bar{\lambda} = \frac{n(x)}{|W|} \label{eq:average_intensity} \]
As this is an average over the entire window, it is a good estimate for a homogeneous point process (Baddeley, Rubak, and Turner 2015, 157–60).
For a homogeneous point process, the intensity can be estimated in a simplistic way: summing the individual intensities of the marks (Baddeley, Rubak, and Turner 2015, 161).
[1] 0.001909Otherwise, we can compute the intensity for each mark individually.
In kernel estimation, we try to estimate the intensity function \(\lambda(u)\) of the point process. There is a wide variety of kernel estimators, see (Baddeley, Rubak, and Turner 2015, 168), but a popular choice is the isotropic Gaussian kernel where the standard deviation corresponds to the smoothing bandwidth (Baddeley, Rubak, and Turner 2015, 168).
In quadrat counting, all points falling into a given quadrat are counted. This gives an overview on the characteristics of the point pattern, such as correlation stationarity (Baddeley, Rubak, and Turner 2015, 163).
Under independence assumptions, the quadrat counts can be used for testing homogeneity, i.e., whether the points are distributed evenly across the quadrats (Baddeley, Rubak, and Turner 2015, 164–65).
    Conditional Monte Carlo test of CSR using quadrat counts
    Test statistic: Pearson X2 statistic
data:  pp_ls[["0.01"]]$`OD Mature`
X2 = 635.09, p-value = 1
alternative hypothesis: regular
Quadrats: 25 tiles (irregular windows)A p-value of 1 indicates that the null hypothesis of irregularity can’t be rejected at \(\alpha = 0.05\). Thus, the point pattern of oligodendrocytes is strongly irregular.
Alternatively, we can inspect departures from the hypothesis that points were generated by a Poisson process. We can identify hotspots and coldspots by comparing the standard error of the relrisk function, which computes nonparamatric estimates of the relative risk by kernel smoothing, to the theoretical null distribution of points. The relative risk is the ratio of spatially varying probablilities of different types (Buller 2020).
Whether or not a point process is completely spatially random (CSR) depends on two characteristics: points need to be distributed homogeneously and they have to be independent of each other (see definitions above). There are various ways to test for CSR, here we show the use-case of the clark-evans test (Baddeley, Rubak, and Turner 2015, 165–66).
In the following document we will often compare the distribution of mature oligodendrocytes (OD mature cells) across different \(z\)-slices of the same tissue. We assume these slices to be enough apart to be considered as generated by different point processes. Since we consider the dependence of one mark among itself, we are in a within cell type setting per slice. We compare several curves along different \(z\)-slices, which is in turn a multivariate comparison (Baddeley, Rubak, and Turner 2015, 564 ff.).
In our example dataset we analyse the mouse preoptic hypothalamus (Moffitt et al. 2018). The lower boundary is the border of the tissue whereas the other three boundaries are technical. Therefore, our example is a mixture between window sampling and the small world model. In order to decrease the bias of the tissue border, we use the Ripley-Rasson estimate of a spatial domain to estimate the sampling window (Baddeley, Rubak, and Turner 2015, 55; Ripley and Rasson 1977).
Correlation, or more generally covariance, represents a second-order summary statistic and measures dependence between data points (Baddeley, Rubak, and Turner 2015, 199 ff.).
With Ripley’s \(K\) we measure “the average number of \(r\)-neighbours of a typical random point” (Baddeley, Rubak, and Turner 2015, 204). This number is still dependent on the size of the observation window so we can normalise it by the number of points \(n\) and the window size, \(|W|\). We then obtain the empirical Ripley’s \(K\) function (Baddeley, Rubak, and Turner 2015, 204; Ripley 1977, 1976):
\[ \hat{K}(r) = \frac{|W|}{n(n-1)}\sum_{i=1}^n\sum_{j=1 \\j \neq i}^n\{d_{ij}\leq r\} e_{ij}(r) \]
The standardisation makes it possible to compare point patterns with different observation windows and with different numbers of points. However, using the empirical \(K\) function assumes that the point process has homogeneous intensity, which is often not the case for biological tissue (Baddeley, Rubak, and Turner 2015, 204–5). We will return to this issue below in the Correcting for Inhomogeneity. The factor \(e_{ij}(r)\) is an edge correction. We will briefly cover this in the following section.
Edge effects describe the phenomenon that not the entire point process is observed, but rather only the part within the window \(W\). This means the value of various statistics could be biased along the edges (Baddeley, Rubak, and Turner 2015, 213 ff.).
There are many corrections for edge effects that are briefly listed here (Baddeley, Rubak, and Turner 2015, 214–19):
In border correction the summation of data points is restricted to \(x_i\) for which \(b(x_i,r)\) is completely in the window \(W\).
We can regard edge effect as a sampling bias. Larger distances (e.g. close to the edges) are less likely to be observed. This can be corrected for.
A stationary point process \(X\) is invariant to translations. So the entire point process can be shifted by a vector \(s\) to be at the position \(X+s\).
The \(K\)-function can be ‘’centered’’, which is then referred to as Besag’s \(L\)-function. The \(L\)-function is a variance-stabilising version of the \(K\)-function (Canete et al. 2022; Besag 1977):
\[ L(r) = \sqrt{\frac{K(r)}{\pi}}. \]
By taking the square root the variance is approximately constant across the domain (Bartlett 1947).
We have seen above that the \(K\)-function is cumulative. That is, the contributions of all distances smaller equal to \(r\) are considered. An alternative is to take the derivative of the \(K\)-function in order to obtain contributions of distances between points equal to \(r\), according to:
\[ g(r) = \frac{K'(r)}{2\pi r}, \]
where the derivative of the \(K\) function divided by the probability of a Poisson process at this radius (Baddeley, Rubak, and Turner 2015, 225 ff.).
library('spatialFDA')
res_ls <- lapply(list('Kest', 'Lest', 'pcf'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id')
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    theo = TRUE,
    correction = "iso",
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})
The plot above shows the \(K\)-function (right), \(L\)-function (middle) and pair-correlation function for the three different slices.
In the homogeneous case we see that all slices are above the Poisson line, indicating positive association for all slices. The association is strongest for slice \(-0.09\) followed by \(0.01\) and \(0.21\). Interestingly, at radii \(r>300µm\) the \(K\) curve for slice \(0.21\) crosses the other two curves.
In the case that a spatial pattern is known or suspected to be inhomogeneous, we have to take this into account in the analysis. Inhomogeneous alternatives can be estimated via:
\[ \hat{K}_{inhom}(r) = \frac{1}{D^p|W|}\sum_i\sum_{j \neq i} \frac{\mathbb{1}\{||u-x_j||\leq r\}}{\hat{\lambda}(x_j)\hat{\lambda}(x_i)}e(x_j,x_i;r), \]
where \(e(u,v;r)\) is an edge correction weight and \(\hat{\lambda}(x_i)\) is an estimator of the intensity at point \(x_i\). The estimation of the local intensities can happen in a data-dependent manner via kernel-smoothing. As this is the same data to then calculate the metric on, this can lead to biases. However, if the local intensities are known, the inhomogeneous \(K\) function is an unbiased estimator (Baddeley, Rubak, and Turner 2015, 242–46).
res_ls <- lapply(list('Kinhom', 'Linhom', 'pcfinhom'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id'),
    correction = 'isotropic'
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    correction = "iso",
    theo = TRUE,
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})The inhomogeneous \(K\)-function indicates that the oligodendrocytes of slice \(0.21\) are close to a Poisson process (dashed line) and can therefore be assumed to be randomly distributed and not clustered. The two other slices show a slightly stronger association among the oligodendrocytes than the slice \(0.21\) at \(r<250 µm\).
The \(L\) function is complementary to the \(K\) function in this example.
The pair correlation function is the derivative of the \(K\)-function. The pcf plot gives similar information as before: Oligodendrocytes show the strongest association at \(\sim r = 25\) whereas the association is weaker in slice 0.21.
Interestingly, the curves for the inhomogeneous functions of slices \(-0.09\) and \(0.01\) cross the Poisson line at \(r\sim250\). This means that the inhomogeneous functions find repulsion of slices past a radius of \(250\).
In the inhomogeneous \(K\)-function approach above, we assume that the intensities can vary locally but the scale of the point process is not changed. This means that while the intensities might vary in the parts of the point pattern, the pattern in one subquadrat is not just a scaled version of another subquadrat (Baddeley, Rubak, and Turner 2015, 246–47; Prokešová, Hahn, and Jensen 2006).
To account for this local scaling, we can assume that the process is subdivided into small regions. In these small regions, the point process is a scaled version of a template process. This template process needs to be both stationary and isotropic (Baddeley, Rubak, and Turner 2015, 246–47).
Since the \(L\)-function is simply a transformation of the \(K\)-function, the same local scaling framework can be applied to the \(L\)-function (Baddeley, Rubak, and Turner 2015, 246–47).
res_ls <- lapply(list('Kscaled', 'Lscaled'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id')
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    correction = "iso",
    theo = TRUE,
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})We see, that in slice \(0.01\) oligodendrocytes are far from the Poisson process line, indicating a strong association. The other two slices show a less strong association.
In the plot above we see all variants of the correlation metrics. The assumptions of either homogeneity (first row), inhomogeneity (second row) and local scaling (third row) change the interpretation of these example point patterns. In summary, in this example homogeneous variants show a positive association for all slices. Furthermore, homogeneous variants estimated clustering for all radii, whereas clustering in slice \(0.21\) became stronger than in the other two slices past \(r>250\). Inhomogeneity inferred a Poisson distribution for slice \(0.21\). In addition, the inhomogeneous variant estimated repulsion for slices \(-0.09\) and \(0.01\). The locally scaled version showed positive associations for all slices and no crossing of curves over the radii.
Deciding whether a pattern is homogeneous or inhomogeneous can be based on metrics as seen above. However, as the result of these metrics depend on the parameter choice (e.g., size of quadrants), the biological question can also be used to decide whether a pattern is assumed to be homogeneous or inhomogeneous. We provide some recommendations in the “Overview” vignettes.
It is worth noting that the \(K\)- and \(L\)-functions described above are summary statistics over the entire pattern (i.e., averaged over all points). However, if we know that there are different regions in our point pattern, an alternative strategy is to compute ‘’local’’ contributions to these patterns, i.e., local \(K\)- ,\(L\)- or pair-correlation functions (Anselin 1995). Baddeley, Rubak, and Turner (2015) propose to compare these \(n\) functions with so-called functional principal component analysis (see below). We will show here the example of the LISA version of the \(L\)-function (Baddeley, Rubak, and Turner 2015, 247–48).
L_odmature_lisa <- localL(pp_ls[['0.01']]$`OD Mature`)
df <- as.data.frame(L_odmature_lisa)
dfm <- reshape2::melt(df, "r")
get_sel <- dfm %>% filter(r > 200.5630 &
                            r < 201.4388, variable != "theo") %>%
  mutate(sel = value) %>% select(variable, sel)
dfm <- dfm %>% left_join(get_sel)
p <- ggplot(dfm, aes(
  x = r,
  y = value,
  group = variable,
  colour = sel
)) +
  geom_line() +
  scale_color_continuous(type = "viridis") +
  geom_vline(xintercept = 200) +
  theme(legend.position = "none") +
  theme_light() +
  ggtitle("LISA curves of slice 0.01")
ppdf <- as.data.frame(pp[['0.01']]) %>% filter(marks == "OD Mature")
ppdf$sel <- get_sel$sel # assume they are in same order
q <- ggplot(ppdf, aes(x = x, y = y, colour = sel)) +
  geom_point(size = 0.75) +
  scale_color_continuous(type = "viridis") +
  theme(legend.position = "none") +
  theme_light() +
  ggtitle("Points coloured by LISA value at r ~ 200")In the case of the OD mature cells, we obtain further information with this plot. We note that there are two distinct populations of curves: those that are clearly above the mean LISA curve in grey and others that are around/underneath. This indicates that there are two different kinds of interactions in the OD mature cells. Stronger and less clustered regions.
There are inhomogeneous versions of these (e.g. localLinhom) that are not shown here for brevity.
We apply functional PCA to retrieve the main trends in these individual curves. The idea of functional PCA is the same as for ordinary PCA but applied to functional data (i.e., each observation is a function). For the \(n\) functions above, functional PCA will recover the main trends in the data (Ramsay and Silverman 2005). We use the R package refund to perform functional PCA (Goldsmith et al. 2024; Xiao et al. 2016).
#normalise the data
df_fdob <- asinh(df %>% as.matrix / 50) %>% as.data.frame()
# extract the functional response matrix
mat <- df_fdob %>%
  select(!c(r, theo)) %>%
  t()
# create a dataframe as required by pffr
dat <- data.frame(ID = rownames(mat))
dat$Y <- mat
dat$sel <- get_sel$sel
# perform functional PCA
res <- functionalPCA(
  dat = dat,
  r = df_fdob$r |> unique()
)
# extract the scores
scores_df <- res$scores %>% as.data.frame()
# plot a biplot
p_biplot <- ggplot(scores_df, aes(scores_df[, 1], scores_df[, 2], colour = (dat[['sel']]))) +
  geom_point() +
  coord_equal() +
  theme_light() +
  scale_color_continuous(type = "viridis") +
  xlab('PC1') +
  ylab('PC2')p1 <- ggplot(ppdf, aes(
  x = x,
  y = y,
  colour = res$scores[, 1]
)) +
  scale_color_continuous(type = "viridis", name = 'loading PC1') +
  theme_light() +
  geom_point(size = 0.75)
p2 <- ggplot(ppdf, aes(
  x = x,
  y = y,
  colour = res$scores[, 2]
)) +
  scale_color_continuous(type = "viridis", name = 'loading PC2') +
  theme_light() +
  geom_point(size = 0.75)The biplot shows the distribution of the first two loadings of the functional PCA. The points are coloured as they were in the plots of the LISA \(L\)-curves. The first principal component clearly separates the two populations. In the last plot we project the loadings of the fPCs back onto the biological slices and find the same separation.
So far we have considered first- and second-order summary statistics and local (or inhomogeneous) adaptations of them. In the second order, one considers (counts of) pairs (e.g., \(K\) function). In a third-order setting, we would count triplets of points. A triplet is counted as the normalised expected value of triangles where all edges are smaller than the radius \(r\) (Baddeley, Rubak, and Turner 2015, 249).
So far, most approaches considered intensity and correlation as measures to assess a point pattern. Next, we will look at measures of spacing and shortest-distances to assess spatial arrangements (Baddeley, Rubak, and Turner 2015, 255).
Baddeley et al. summarises three basic distances:
Note also that there are tests of CSR that are based on spacing, including the Clark-Evans and Hopkins-Skellam Index tests that were discussed above ‘’Testing for CSR’’.
Nearest-neighbour (NN) methods are based on the notion of “nearness”. In particular, we introduce nndist from spatstat, a method to calculate the distances until \(k\) NN are found. This function returns a density for each specified \(k\) for the \(k\) neighbour distances. We can for instance collapse the \(k\) curves into a mean curve per point pattern. This information of the mean nearest neighbour distance (MMND) can be summarised as a density. Note, that these distances are “raw” nearest-neighbour distances which are not corrected for edge effects. Edge correction for the nearest neighbour distance (\(k = 1\)) is implemented in the function Gest, see next paragraph (Baddeley, Rubak, and Turner 2015, 256; Baddeley and Turner 2005).
nndistance <- function(pp, nk) {
  xy <- cbind(pp$x, pp$y)
  nndistances_k15 <- nndist(xy, k = nk)
  nndistances_mean <- rowMeans(nndistances_k15)
  return(nndistances_mean)
}
#PRE: list of point pattern, corresponding celltypes of interest, functions to evaluate
#POST: result of the metric
metricRes_nndist <- function(ppls, celltype, fun) {
  metric.res <- list(res = do.call(fun,
                                   args = list(pp = ppls[[celltype]],
                                               nk = seq(1:15))))
  metric.res$type = celltype
  return(metric.res)
}
celltypes <- c("Ependymal", "OD Mature", "Microglia")
#go through all defined celltypes and calculate the nearest-neighbour distance
res_ls <- lapply(celltypes, metricRes_nndist, fun = nndistance, ppls = pp_ls[['0.01']])
#initialise a dataframe for the metric values and the type information
res_df <- data.frame(metric = numeric(0), type = character(0))
# Loop through the res_ls list and combine the metric values with their corresponding type
for (i in 1:length(res_ls)) {
  metric_values <- res_ls[[i]]$res
  metric_type <- rep(res_ls[[i]]$type, length(metric_values))
  df <- data.frame(metric = metric_values, type = metric_type)
  res_df <- rbind(res_df, df)
}
#plot the densities
p <- ggplot(res_df, aes(x = metric, col = type)) +
  geom_density(linewidth = 1) +
  scale_x_sqrt() +
  theme_light() +
  ggtitle('Sqrt of the Mean Nearest-Neighbour Distance')
pIn the MNND empirical distribution, the ependymal cells show the shortest NN distances, a reflection of their clustering. The OD mature cells have larger NN distances as well as a bimodal distribution, indicating a mix of shorter and longer distances (as visible in the LISA \(L\)-functions). Microglia cells show the widest distances and the symmetry of the curve indicates similar distances throughout the field of view.
In a stationary spatial point process, the empty-space distance is defined as:
\[ d(u,X) = \min\{||u-x_i||: x_i \in X\} \]
Note that this is an edge-corrected distribution function of the nearest-neighbour distance above.
The empty space function is then the cumulative distribution function of the empty-space distances defined above:
\[ F(r) = \mathbb{P}\{d(u,X)\leq r\}. \]
The NN distance is defined as:
\[ d_i = \min_{j\neq i}||x_j-x_i||. \]
The NN distance distribution function \(G(r)\) is then defined as:
\[ G(r) = \mathbb{P}\{d(x,X\backslash u \leq r |X\ has\ a\ point\ at\ u\}. \]
For a homogeneous Poisson process, the NN distance distribution is identical to the empty-space function of the same process:
\[ G_{pois} \equiv F_{pois}. \]
For a general point process, the \(F\) and \(G\) functions are different (Baddeley, Rubak, and Turner 2015, 261–67).
The \(F\) and \(G\) functions are, like the \(K\) function, cumulative. Therefore, an analogue to the pair-correlation function would make sense to consider. For practical reasons, this is no longer the derivative of the \(F\) function but rather a hazard rate:
\[ h(r) = \frac{f(r)}{1-F(r)}. \]
The concepts of the empty-space function \(F\) and the NN function \(G\) are complementary. If one decreases, the other increases.
Thus, the \(J\) function is a combination of both functions:
\[ J(r) = \frac{1-G(r)}{1-F(r)}. \]
For a CSR process, \(J_{pois} \equiv 1\), whereas values of \(J(r) > 1\) are consistent with a regular (e.g., repelling) pattern, and \(J(r) < 1\) represents a clustered process (Baddeley, Rubak, and Turner 2015, 275–77).
[1] "Calculating Gest of OD Mature"
[1] "Calculating Fest of OD Mature"
[1] "Calculating Jest of OD Mature"There are inhomogeneous variants of the spacing functions explained above (Baddeley, Rubak, and Turner 2015, 277–78).
[1] "Calculating Ginhom of OD Mature"
[1] "Calculating Finhom of OD Mature"
[1] "Calculating Jinhom of OD Mature"The inhomogeneous curves look different to their homogeneous counterparts but the relative ordering of the curves per plot is the same.
Next to the NN distance, we can estimate the orientation of the neighbours, which gives an indication of the orientation of the spacing. It works by taking the angle between each point and its \(k^{th}\) nearest neighbour. The angle is anticlockwise from the x-axis (Baddeley, Rubak, and Turner 2015, 278–79) (Baddeley and Turner 2005).
res <- calcMetricPerFov(
  spe,
  'OD Mature',
  subsetby = 'sample_id',
  fun = 'nnorient',
  marks = 'cluster_id',
  rSeq = NULL,
  by = c('Animal_ID', 'sample_id')
)
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
p <- plotMetricPerFov(
  res,
  correction = "bordm",
  theo = TRUE,
  x = "phi",
  imageId = 'sample_id',
  legend.position = "right"
)The values of \(\phi\) correspond to the orientation of the original point pattern. The horizontal axis goes from \(180\) to \(0\) (left to right) and the vertical from \(90\) to \(270\) (top to bottom).
An easier representation of the metric above can be plotted as a rose diagram.
[[1]]
window: rectangle = [-0.00484648, 0.00484648] x [-0.00484648, 0.00484648] units
[[2]]
window: rectangle = [-0.005196989, 0.005196989] x [-0.005196989, 0.005196989] 
units
[[3]]
window: rectangle = [-0.00439373, 0.00439373] x [-0.00439373, 0.00439373] unitsThe two plots are complementary and show which are the preferred orientations of the point patterns. Furthermore, they show whether or not the assumption of isotropy (no change in the statistical properties of a point pattern after rotations) is justified or not (Baddeley, Rubak, and Turner 2015, 236 ff.). Isotropy is an assumption that a lot of spatial metrics make and in our example we note, that the point patterns are in fact anisotropic as we can see a trend for the preferred orientation in the rose diagrams. An option for analysing anisotropic stationary point patterns is to not calculate the metric on the actual point pattern but rather calculating it on the fry plot of the point pattern. This generalises e.g. Ripley’s \(K\) function from circles to arbitrary shapes (Baddeley, Rubak, and Turner 2015, 239 ff.).
Note also that the concepts of spacing are not only usable in point pattern analysis but also more broadly in other spatial contexts (e.g., spacing between shapes instead of points) (Baddeley, Rubak, and Turner 2015, 279 ff.).
The same consideration about edge effects as for the \(K\) (and related) functions need to be made for the spacing functions; uncorrected estimates are negatively biased estimators. The easiest approach is to draw an artificial border and consider NNs within it. Other approaches are based on sampling. Yet another approach relates to survival analysis, with the idea is that a circle of a point to grows homogeneously with increasing radius until it hits the frame of the window and “dies”. This gives survival distributions similar to censored data, where the Kaplan-Meier estimator is a good choice (Baddeley, Rubak, and Turner 2015, 285–92).
In spatstat, a mark can basically take any value, categorical (e.g. cell types, as we have seen above) or continuous (e.g., gene expression) (Baddeley, Rubak, and Turner 2015, 637 ff.). In our example, we take the gene expression of some marker genes from Fig. 6 of the original publication (Moffitt et al. 2018). This is a typical numerical mark for points in a biological dataset. Here, we will use all cell types for the analysis of numerical marks.
Here we see spatial distribution of the counts of the three genes Slc18a2, Esr1 and Pgr.
The function pairs from spatstat generates a scatterplot of the counts of the marks (in our case the three genes) against each other and against the \(x\) and \(y\) coordinates. We can add a non-linear smoothing curve to make the general trends a bit more obvious (Baddeley, Rubak, and Turner 2015, 641).
We find that the counts of the three genes are very evenly distributed along the \(x\) and \(y\) coordinate, indicating a homogeneous distribution. The counts of Esr1 and Pgr are positively associated, indicating a dependence of these two marks.
NN interpolations uses the nearest mark to measure the intensity at each spatial location. This is conceptually similar to taking a very small bandwidth for Gaussian kernel smoothing (Baddeley, Rubak, and Turner 2015, 642).
We see that there is e.g. a clear spatial structure in the expression of e.g. Esr1. It shows a half moon shape.
As in the discrete case, summary functions assume that the point process is stationary.
The mark correlation function measures the dependence between two marks for two points at distance \(r\). It is applicable to stationary point processes with marks. The generalized mark correlation function is given by:
\[ k_f(r) = \frac{\mathbb{E}[f(m(u),m(v)) \mid \| u - v \| = r]}{\mathbb{E}[f(M,M')]},\]
where \(f(m(u),m(v))\) is a test function with two arguments (representing the two marks at locations \(u\) and \(v\) with distance \(r\) apart) and returns a non-negative value. For continuous non-negative marks, the canonical choice for \(f\) is typically \(f(m(u),m(v))= m(u)m(v)\). \(M\) and \(M′\) represent independent, identically distributed random points with the same distribution as the mark of a randomly chosen point. This denominator is chosen such that random marks have a mark correlation of 1 (Baddeley, Rubak, and Turner 2015, 644–45).
[1] "Calculating markcorr of OD Mature"From this plot we show that all genes show a positive correlation at small distances which decline with increasing radius \(r\). The association is strongest for the Slc18a2 gene. We can calculate simulation envelopes to estimate the significance of this association. This is not shown for brevity.
The mark-weigthed \(K\)-function is a generalization of the \(K\)-function in which the contribution from each pair of points is weighted by a function of their respective marks. It is given by:
\[K_f(r) = \frac 1 \lambda \frac{C_f(r)}{E[ f(M_1, M_2) ]},\] where
\[ C_f(r) = E \left[ \sum_{x_j \in X} f(m(u), m(x_j)) \mathbf{1} \{0 < ||u - x_j|| \le r\} \; \mid \; u \in X \right], \]
is equivalent to the unnormalized mark-weighted \(K\)-function. For every point \(u\), we sum the euclidean distance \(||u - x||\) of all other points \(x\) that are within a distance \(r\). This sum is weighted by the function \(f(.,.)\) of the marks of \(u\) and \(x\). The function is standardized by the expected value of \(f(M_1, M_2)\) where \(M_1, M_2\) represent independent, identically distributed random points with the same distribution as the mark of a randomly chosen point (Baddeley, Rubak, and Turner 2015, 646–47).
In the scenario of random labeling, so where the marks are distributed randomly, the mark-weighted \(K\)-function corresponds to the standard Ripley’s \(K\)-function.
Also here, the canonical function is: \(f(m_1, m_2) = m_1 m_2\). This means we weigh each interaction between points by the product of the continuous marks of both points.
[1] "Calculating Kmark of OD Mature"It is important to note that the theoretical value of the \(K\)-function is not very informative since it represents the \(K\)-function of a Poisson point process and the underlying point process might not be Poisson. Therefore we compare the mark-weighted with its unmarked analogue. Like this, we can assess whether the points weighted by a continuous mark are more or less correlated than their unmarked analogues (Baddeley, Rubak, and Turner 2015, 647).
Here we will compare the \(L\)-functions weighted by the mark of the gene Esr1 and the unmarked \(L\)-function.
We note that the difference between \(L\)-function weighted by the expression of Esr1 minus the unmarked \(L\)-function is positively different to the Poisson difference, meaning that the expression of the continuous mark Esr1 is correlated among itself.
In this chapter we have looked at point pattern methods that can be applied to univariate marks. These univariate marks can either be categorical (in a molecular biological context that could be celltypes) or continuous (e.g. expression of a gene). There are approaches that summarise correlative dependencies among points or that consider spacing functions.
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     
other attached packages:
 [1] spatialFDA_1.0.0                sosta_1.0.0                    
 [3] dixon_0.0-10                    splancs_2.01-45                
 [5] sp_2.2-0                        bluster_1.18.0                 
 [7] magrittr_2.0.3                  stringr_1.5.1                  
 [9] spdep_1.3-11                    spData_2.3.4                   
[11] tmap_4.0                        scater_1.36.0                  
[13] scran_1.36.0                    scuttle_1.18.0                 
[15] SFEData_1.10.0                  Voyager_1.10.0                 
[17] SpatialFeatureExperiment_1.10.0 rgeoda_0.1.0                   
[19] digest_0.6.37                   sf_1.0-20                      
[21] reshape2_1.4.4                  patchwork_1.3.0                
[23] STexampleData_1.16.0            ExperimentHub_2.16.0           
[25] AnnotationHub_3.16.0            BiocFileCache_2.16.0           
[27] dbplyr_2.5.0                    rlang_1.1.6                    
[29] ggplot2_3.5.2                   dplyr_1.1.4                    
[31] spatstat_3.3-2                  spatstat.linnet_3.2-5          
[33] spatstat.model_3.3-5            rpart_4.1.24                   
[35] spatstat.explore_3.4-2          nlme_3.1-168                   
[37] spatstat.random_3.3-3           spatstat.geom_3.3-6            
[39] spatstat.univar_3.1-2           spatstat.data_3.1-6            
[41] SpatialExperiment_1.18.0        SingleCellExperiment_1.30.0    
[43] SummarizedExperiment_1.38.0     Biobase_2.68.0                 
[45] GenomicRanges_1.60.0            GenomeInfoDb_1.44.0            
[47] IRanges_2.42.0                  S4Vectors_0.46.0               
[49] BiocGenerics_0.54.0             generics_0.1.3                 
[51] MatrixGenerics_1.20.0           matrixStats_1.5.0              
loaded via a namespace (and not attached):
  [1] spatialreg_1.3-6          spatstat.sparse_3.1-0    
  [3] bitops_1.0-9              EBImage_4.50.0           
  [5] httr_1.4.7                RColorBrewer_1.1-3       
  [7] grpreg_3.5.0              tools_4.5.0              
  [9] R6_2.6.1                  HDF5Array_1.36.0         
 [11] mgcv_1.9-3                rhdf5filters_1.20.0      
 [13] withr_3.0.2               gridExtra_2.3            
 [15] leaflet_2.2.2             leafem_0.2.3             
 [17] cli_3.6.5                 sandwich_3.1-1           
 [19] labeling_0.4.3            mvtnorm_1.3-3            
 [21] proxy_0.4-27              R.utils_2.13.0           
 [23] spacesXYZ_1.5-1           dichromat_2.0-0.1        
 [25] scico_1.5.0               limma_3.64.0             
 [27] RSQLite_2.3.11            crosstalk_1.2.1          
 [29] Matrix_1.7-3              ggbeeswarm_0.7.2         
 [31] logger_0.4.0              abind_1.4-8              
 [33] R.methodsS3_1.8.2         terra_1.8-42             
 [35] lifecycle_1.0.4           multcomp_1.4-28          
 [37] yaml_2.3.10               edgeR_4.6.1              
 [39] tmaptools_3.2             rhdf5_2.52.0             
 [41] SparseArray_1.8.0         grid_4.5.0               
 [43] blob_1.2.4                dqrng_0.4.1              
 [45] crayon_1.5.3              lattice_0.22-7           
 [47] beachmat_2.24.0           KEGGREST_1.48.0          
 [49] magick_2.8.6              refund_0.1-37            
 [51] zeallot_0.1.0             pillar_1.10.2            
 [53] knitr_1.50                metapod_1.16.0           
 [55] rjson_0.2.23              boot_1.3-31              
 [57] fda_6.2.0                 codetools_0.2-20         
 [59] wk_0.9.4                  glue_1.8.0               
 [61] data.table_1.17.0         memuse_4.2-3             
 [63] Rdpack_2.6.4              vctrs_0.6.5              
 [65] png_0.1-8                 gtable_0.3.6             
 [67] ks_1.14.3                 cachem_1.1.0             
 [69] xfun_0.52                 rbibutils_2.3            
 [71] S4Arrays_1.8.0            DropletUtils_1.28.0      
 [73] pracma_2.4.4              fds_1.8                  
 [75] cols4all_0.8              reformulas_0.4.0         
 [77] coda_0.19-4.1             pcaPP_2.0-5              
 [79] survival_3.8-3            pbs_1.1                  
 [81] sfheaders_0.4.4           units_0.8-7              
 [83] statmod_1.5.0             TH.data_1.1-3            
 [85] smoothr_1.0.1             bit64_4.6.0-1            
 [87] filelock_1.0.3            irlba_2.3.5.1            
 [89] vipor_0.4.7               KernSmooth_2.23-26       
 [91] colorspace_2.1-1          DBI_1.2.3                
 [93] leaflegend_1.2.1          raster_3.6-32            
 [95] tidyselect_1.2.1          bit_4.6.0                
 [97] compiler_4.5.0            curl_6.2.2               
 [99] BiocNeighbors_2.2.0       h5mread_1.0.0            
[101] DelayedArray_0.34.1       scales_1.4.0             
[103] classInt_0.4-11           rappdirs_0.3.3           
[105] tiff_0.1-12               goftest_1.2-3            
[107] minqa_1.2.8               rainbow_3.8              
[109] fftwtools_0.9-11          spatstat.utils_3.1-3     
[111] rmarkdown_2.29            XVector_0.48.0           
[113] htmltools_0.5.8.1         pkgconfig_2.0.3          
[115] jpeg_0.1-11               base64enc_0.1-3          
[117] lme4_1.1-37               sparseMatrixStats_1.20.0 
[119] fastmap_1.2.0             htmlwidgets_1.6.4        
[121] UCSC.utils_1.4.0          RLRsim_3.1-8             
[123] DelayedMatrixStats_1.30.0 farver_2.1.2             
[125] zoo_1.8-14                jsonlite_2.0.0           
[127] mclust_6.1.1              BiocParallel_1.42.0      
[129] R.oo_1.27.0               BiocSingular_1.24.0      
[131] RCurl_1.98-1.17           GenomeInfoDbData_1.2.14  
[133] s2_1.1.7                  Rhdf5lib_1.30.0          
[135] Rcpp_1.0.14               ggnewscale_0.5.1         
[137] viridis_0.6.5             leafsync_0.1.0           
[139] stringi_1.8.7             MASS_7.3-65              
[141] plyr_1.8.9                parallel_4.5.0           
[143] ggrepel_0.9.6             deldir_2.0-4             
[145] stars_0.6-8               Biostrings_2.76.0        
[147] splines_4.5.0             tensor_1.5               
[149] locfit_1.5-9.12           igraph_2.1.4             
[151] ScaledMatrix_1.16.0       magic_1.6-1              
[153] LearnBayes_2.15.1         XML_3.99-0.18            
[155] BiocVersion_3.21.1        evaluate_1.0.3           
[157] leaflet.providers_2.0.0   renv_1.1.4               
[159] BiocManager_1.30.25       deSolve_1.40             
[161] nloptr_2.2.1              purrr_1.0.4              
[163] tidyr_1.3.1               polyclip_1.10-7          
[165] rsvd_1.0.5                lwgeom_0.2-14            
[167] e1071_1.7-16              RSpectra_0.16-2          
[169] viridisLite_0.4.2         class_7.3-23             
[171] tibble_3.2.1              memoise_2.0.1            
[173] beeswarm_0.4.0            AnnotationDbi_1.70.0     
[175] gamm4_0.2-7               cluster_2.1.8.1          
[177] hdrcde_3.4                BiocStyle_2.36.0         ©2024 The pasta authors. Content is published under Creative Commons CC-BY-4.0 License for the text and GPL-3 License for any code.
# Point pattern analysis -- univariate methods for imaging-based data
In this vignette we will show:
- Univariate point pattern data analysis methods for imaging-based approaches.
- This includes global metrics on the entire field of view and local variants thereof. 
- The use case is a MERFISH data set from @moffittMolecularSpatialFunctional2018.
- For more background on point pattern analysis we refer the reader to @baddeleySpatialPointPatterns2015. 
- For the implementations we use `R`. Data is represented as `r BiocStyle::Biocpkg('SpatialExperiment')` [@righelliSpatialExperimentInfrastructureSpatiallyresolved2022]. For calculation of spatial functions we will use the package `r BiocStyle::Biocpkg('spatialFDA')`.
# categorical Marks
{{< include theory/01-theory-point.qmd >}}
## Univariate viewpoint
In the following document we will often compare the distribution of mature oligodendrocytes (OD mature cells) across different $z$-slices of the same tissue. We assume these slices to be enough apart to be considered as generated by different point processes. Since we consider the dependence of one mark among itself, we are in a within cell type setting per slice. We compare several curves along different $z$-slices, which is in turn a multivariate comparison [@baddeleySpatialPointPatterns2015, pp. 564 ff.].
In our example dataset we analyse the mouse preoptic hypothalamus [@moffittMolecularSpatialFunctional2018]. The lower boundary is the border of the tissue whereas the other three boundaries are technical. Therefore, our example is a mixture between window sampling and the small world model. In order to decrease the bias of the tissue border, we use the Ripley-Rasson estimate of a spatial domain to estimate the sampling window [@baddeleySpatialPointPatterns2015, pp. 55; @ripleyFindingEdgePoisson1977].
```{r, fig.width=12, fig.height=6, cache=FALSE, message= FALSE, include = FALSE, eval = FALSE}
par(mfrow=c(1,3))
#Plot the marks separately 
lapply(zstack_list, function(zstack){
  plot(pp_ls[[zstack]][[celltype_ls]], main = zstack, legend = FALSE)
})
dev.off()
pls <- lapply(zstack_list, function(zstack){
  pp_sel <- pp_ls[[zstack]][[celltype_ls]]
  p <- pp_sel |> as.data.frame() |> 
  ggplot(aes(x = x, y = y, alpha = 0.3)) +
  geom_point(size=0.75) +
  theme_minimal() +
  coord_fixed() +
  ggtitle(zstack)
  return(p)
})
wrap_plots(pls, guides = 'collect')
```
## Correlation
Correlation, or more generally covariance, represents a second-order summary statistic and measures dependence between data points [@baddeleySpatialPointPatterns2015, pp. 199 ff.].
### Ripley's $K$
#### Empirical Ripley's $K$
With Ripley's $K$ we measure "the average number of $r$-neighbours of a typical random point" [@baddeleySpatialPointPatterns2015, pp. 204]. This number is still dependent on the size of the observation window so we can normalise it by the number of points $n$ and the window size, $|W|$. We then obtain the empirical Ripley's $K$ function [@baddeleySpatialPointPatterns2015, pp. 204; @ripleyModellingSpatialPatterns1977; @ripleySecondOrderAnalysisStationary1976]:
$$
\hat{K}(r) = \frac{|W|}{n(n-1)}\sum_{i=1}^n\sum_{j=1 \\j \neq i}^n\{d_{ij}\leq r\} e_{ij}(r)
$$
The standardisation makes it possible to compare point patterns with different observation windows and with different numbers of points. However, using the empirical $K$ function assumes that the point process has homogeneous intensity, which is often not the case for biological tissue [@baddeleySpatialPointPatterns2015, pp. 204-205]. We will return to this issue below in the `Correcting for Inhomogeneity`. The factor $e_{ij}(r)$ is an edge correction. We will briefly cover this in the following section.
#### Edge effects and their corrections for spatial metrics
Edge effects describe the phenomenon that not the entire point process is observed, but rather only the part within the window $W$. This means the value of various statistics could be biased along the edges [@baddeleySpatialPointPatterns2015, pp. 213 ff.].
There are many corrections for edge effects that are briefly listed here [@baddeleySpatialPointPatterns2015, pp. 214-219]:
##### Border correction
In border correction the summation of data points is restricted to $x_i$ for which $b(x_i,r)$ is completely in the window $W$.
##### Isotropic correction
We can regard edge effect as a sampling bias. Larger distances (e.g. close to the edges) are less likely to be observed. This can be corrected for.
##### Translation correction
A stationary point process $X$ is invariant to translations. So the entire point process can be shifted by a vector $s$ to be at the position $X+s$.
### The $L$-function
The $K$-function can be ''centered'', which is then referred to as Besag's $L$-function. The $L$-function is a variance-stabilising version of the $K$-function [@caneteSpicyRSpatialAnalysis2022; @besag1977contribution]:
$$
L(r) = \sqrt{\frac{K(r)}{\pi}}.
$$
By taking the square root the variance is approximately constant across the domain [@bartlettUseTransformations1947].
### Pair Correlation function
We have seen above that the $K$-function is cumulative. That is, the contributions of all distances smaller equal to $r$ are considered. An alternative is to take the derivative of the $K$-function in order to obtain contributions of distances between points equal to $r$, according to:
$$
g(r) = \frac{K'(r)}{2\pi r},
$$
where the derivative of the $K$ function divided by the probability of a Poisson process at this radius [@baddeleySpatialPointPatterns2015, pp. 225 ff.].
```{r, error=FALSE, message=FALSE, warning=FALSE, results='hide', include=TRUE}
#| message: false
#| warning: false
library('spatialFDA')
res_ls <- lapply(list('Kest', 'Lest', 'pcf'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id')
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    theo = TRUE,
    correction = "iso",
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})
```
\
```{r, cache=FALSE, fig.height = 6, fig.width = 16, include=TRUE}
p_homo <- wrap_plots(p_ls, guides = 'collect')
p_homo
```
The plot above shows the $K$-function (right), $L$-function (middle) and pair-correlation function for the three different slices.
In the homogeneous case we see that all slices are above the Poisson line, indicating positive association for all slices. The association is strongest for slice $-0.09$ followed by $0.01$ and $0.21$. Interestingly, at radii $r>300µm$ the $K$ curve for slice $0.21$ crosses the other two curves.
### Correcting for Inhomogeneity
#### Inhomogeneous $K$-function
In the case that a spatial pattern is known or suspected to be inhomogeneous, we have to take this into account in the analysis. Inhomogeneous alternatives can be estimated via:
$$
\hat{K}_{inhom}(r) = \frac{1}{D^p|W|}\sum_i\sum_{j \neq i} \frac{\mathbb{1}\{||u-x_j||\leq r\}}{\hat{\lambda}(x_j)\hat{\lambda}(x_i)}e(x_j,x_i;r),
$$
where $e(u,v;r)$ is an edge correction weight and $\hat{\lambda}(x_i)$ is an estimator of the intensity at point $x_i$. The estimation of the local intensities can happen in a data-dependent manner via kernel-smoothing. As this is the same data to then calculate the metric on, this can lead to biases. However, if the local intensities are known, the inhomogeneous $K$ function is an unbiased estimator [@baddeleySpatialPointPatterns2015, pp. 242-246].
```{r, error=FALSE, message=FALSE, warning=FALSE, results='hide'}
res_ls <- lapply(list('Kinhom', 'Linhom', 'pcfinhom'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id'),
    correction = 'isotropic'
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    correction = "iso",
    theo = TRUE,
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})
```
```{r, cache=FALSE, fig.height = 6, fig.width = 16}
p_inhomo <- wrap_plots(p_ls, guides = 'collect')
p_inhomo
```
The inhomogeneous $K$-function indicates that the oligodendrocytes of slice $0.21$ are close to a Poisson process (dashed line) and can therefore be assumed to be randomly distributed and not clustered. The two other slices show a slightly stronger association among the oligodendrocytes than the slice $0.21$ at $r<250 µm$.
The $L$ function is complementary to the $K$ function in this example.
The pair correlation function is the derivative of the $K$-function. The pcf plot gives similar information as before: Oligodendrocytes show the strongest association at $\sim r = 25$ whereas the association is weaker in slice 0.21.
Interestingly, the curves for the inhomogeneous functions of slices $-0.09$ and $0.01$ cross the Poisson line at $r\sim250$. This means that the inhomogeneous functions find repulsion of slices past a radius of $250$.
### Local Scaling
#### Locally-scaled $K$-function
In the inhomogeneous $K$-function approach above, we assume that the intensities can vary locally but the scale of the point process is not changed. This means that while the intensities might vary in the parts of the point pattern, the pattern in one subquadrat is not just a scaled version of another subquadrat [@baddeleySpatialPointPatterns2015 pp. 246-247; @prokesovaStatisticsLocallyScaled2006].
To account for this local scaling, we can assume that the process is subdivided into small regions. In these small regions, the point process is a scaled version of a template process. This template process needs to be both stationary and isotropic [@baddeleySpatialPointPatterns2015 pp. 246-247].
#### Locally-scaled $L$-function
Since the $L$-function is simply a transformation of the $K$-function, the same local scaling framework can be applied to the $L$-function [@baddeleySpatialPointPatterns2015 pp. 246-247].
```{r, error=FALSE, message=FALSE, warning=FALSE, results='hide', include=TRUE}
res_ls <- lapply(list('Kscaled', 'Lscaled'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id')
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    correction = "iso",
    theo = TRUE,
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})
```
```{r, cache=FALSE, fig.height = 6, fig.width = 16, include=TRUE}
p_scaled <- wrap_plots(p_ls, guides = 'collect')
p_scaled
```
We see, that in slice $0.01$ oligodendrocytes are far from the Poisson process line, indicating a strong association. The other two slices show a less strong association.
```{r, cache=FALSE, fig.height = 10, fig.width = 16, include=TRUE}
p <- wrap_plots(list(p_homo, p_inhomo, p_scaled),
                nrow = 3,
                guides = 'collect') + plot_annotation(tag_levels = 'A')
p
```
In the plot above we see all variants of the correlation metrics. The assumptions of either homogeneity (first row), inhomogeneity (second row) and local scaling (third row) change the interpretation of these example point patterns. In summary, in this example homogeneous variants show a positive association for all slices. Furthermore, homogeneous variants estimated clustering for all radii, whereas clustering in slice $0.21$ became stronger than in the other two slices past $r>250$. Inhomogeneity inferred a Poisson distribution for slice $0.21$. In addition, the inhomogeneous variant estimated repulsion for slices $-0.09$ and $0.01$. The locally scaled version showed positive associations for all slices and no crossing of curves over the radii.
Deciding whether a pattern is homogeneous or inhomogeneous can be based on metrics as seen above. However, as the result of these metrics depend on the parameter choice (e.g., size of quadrants), the biological question can also be used to decide whether a pattern is assumed to be homogeneous or inhomogeneous. We provide some recommendations in the "Overview" vignettes.
#### Local Indicators of Spatial Association
It is worth noting that the $K$- and $L$-functions described above are summary statistics over the entire pattern (i.e., averaged over all points). However, if we know that there are different regions in our point pattern, an alternative strategy is to compute ''local'' contributions to these patterns, i.e., local $K$- ,$L$- or pair-correlation functions [@anselinLocalIndicatorsSpatial1995]. @baddeleySpatialPointPatterns2015 propose to compare these $n$ functions with so-called functional principal component analysis (see below). We will show here the example of the LISA version of the $L$-function [@baddeleySpatialPointPatterns2015 pp. 247-248].
##### Local $L$ function
```{r, error=FALSE, message=FALSE, warning=FALSE, results='hide'}
L_odmature_lisa <- localL(pp_ls[['0.01']]$`OD Mature`)
df <- as.data.frame(L_odmature_lisa)
dfm <- reshape2::melt(df, "r")
get_sel <- dfm %>% filter(r > 200.5630 &
                            r < 201.4388, variable != "theo") %>%
  mutate(sel = value) %>% select(variable, sel)
dfm <- dfm %>% left_join(get_sel)
p <- ggplot(dfm, aes(
  x = r,
  y = value,
  group = variable,
  colour = sel
)) +
  geom_line() +
  scale_color_continuous(type = "viridis") +
  geom_vline(xintercept = 200) +
  theme(legend.position = "none") +
  theme_light() +
  ggtitle("LISA curves of slice 0.01")
ppdf <- as.data.frame(pp[['0.01']]) %>% filter(marks == "OD Mature")
ppdf$sel <- get_sel$sel # assume they are in same order
q <- ggplot(ppdf, aes(x = x, y = y, colour = sel)) +
  geom_point(size = 0.75) +
  scale_color_continuous(type = "viridis") +
  theme(legend.position = "none") +
  theme_light() +
  ggtitle("Points coloured by LISA value at r ~ 200")
```
```{r, fig.height=5, fig.width=10}
p|q
```
In the case of the OD mature cells, we obtain further information with this plot. We note that there are two distinct populations of curves: those that are clearly above the mean LISA curve in grey and others that are around/underneath. This indicates that there are two different kinds of interactions in the OD mature cells. Stronger and less clustered regions.
There are inhomogeneous versions of these (e.g. `localLinhom`) that are not shown here for brevity.
##### Functional PCA for the $n$ Curves
We apply functional PCA to retrieve the main trends in these individual curves. The idea of functional PCA is the same as for ordinary PCA but applied to functional data (i.e., each observation is a function). For the $n$ functions above, functional PCA will recover the main trends in the data [@ramsayPrincipalComponentsAnalysis2005]. We use the `R` package `r BiocStyle::CRANpkg('refund')` to perform functional PCA [@goldsmithRefundRegressionFunctional2024; @xiaoFastCovarianceEstimation2016].
```{r, error=FALSE, message=FALSE, warning=FALSE, results='hide'}
#normalise the data
df_fdob <- asinh(df %>% as.matrix / 50) %>% as.data.frame()
# extract the functional response matrix
mat <- df_fdob %>%
  select(!c(r, theo)) %>%
  t()
# create a dataframe as required by pffr
dat <- data.frame(ID = rownames(mat))
dat$Y <- mat
dat$sel <- get_sel$sel
# perform functional PCA
res <- functionalPCA(
  dat = dat,
  r = df_fdob$r |> unique()
)
# extract the scores
scores_df <- res$scores %>% as.data.frame()
# plot a biplot
p_biplot <- ggplot(scores_df, aes(scores_df[, 1], scores_df[, 2], colour = (dat[['sel']]))) +
  geom_point() +
  coord_equal() +
  theme_light() +
  scale_color_continuous(type = "viridis") +
  xlab('PC1') +
  ylab('PC2')
```
```{r, fig.height=5, fig.width=10}
p_biplot
```
```{r, error=FALSE, message=FALSE, warning=FALSE, results='hide'}
p1 <- ggplot(ppdf, aes(
  x = x,
  y = y,
  colour = res$scores[, 1]
)) +
  scale_color_continuous(type = "viridis", name = 'loading PC1') +
  theme_light() +
  geom_point(size = 0.75)
p2 <- ggplot(ppdf, aes(
  x = x,
  y = y,
  colour = res$scores[, 2]
)) +
  scale_color_continuous(type = "viridis", name = 'loading PC2') +
  theme_light() +
  geom_point(size = 0.75)
```
```{r, fig.height=5, fig.width=10}
p1|p2
```
The biplot shows the distribution of the first two loadings of the functional PCA. The points are coloured as they were in the plots of the LISA $L$-curves. The first principal component clearly separates the two populations. In the last plot we project the loadings of the fPCs back onto the biological slices and find the same separation.
### Third-Order Summary Statistics
So far we have considered first- and second-order summary statistics and local (or inhomogeneous) adaptations of them. In the second order, one considers (counts of) pairs (e.g., $K$ function). In a third-order setting, we would count triplets of points. A triplet is counted as the normalised expected value of triangles where all edges are smaller than the radius $r$ [@baddeleySpatialPointPatterns2015 pp. 249].
````{=html}
<!--
$$
T(r) = \frac{1}{\lambda^3}\mathbb{E}\left[\sum_{i=1}^n\sum_{j=1\\j\neq i}^nm(x_i,x_j,u) | u \in X\right]
$$
here m is the maximum side of the triangle
$$
m(a,b,c) = \max(||a-b||,||a-c||,||b-c||)
$$
```{r, include=FALSE, eval=FALSE}
p <- plotMetric(zstack_list, pp_ls, celltype_ls, 'r', 'Tstat', 'border', bootstrap = FALSE)
p
```
[MR: should we skip the plot in this section?]
-->
````
## Spacing
So far, most approaches considered intensity and correlation as measures to assess a point pattern. Next, we will look at measures of spacing and shortest-distances to assess spatial arrangements [@baddeleySpatialPointPatterns2015 pp. 255].
Baddeley et al. summarises three basic distances:
-   pairwise distance: $d_{i,j} = ||x_i-x_j||$
-   NN distances: $d_i = \min_{j \neq i}d_{ij}$
-   empty-space distance: $d(u) = \min_j||u-x_j||$
Note also that there are tests of CSR that are based on spacing, including the Clark-Evans and Hopkins-Skellam Index tests that were discussed above ''Testing for CSR''.
### Nearest-neighbour approaches
Nearest-neighbour (NN) methods are based on the notion of "nearness". In particular, we introduce `nndist` from `r BiocStyle::CRANpkg('spatstat')`, a method to calculate the distances until $k$ NN are found. This function returns a density for each specified $k$ for the $k$ neighbour distances. We can for instance collapse the $k$ curves into a mean curve per point pattern. This information of the mean nearest neighbour distance (MMND) can be summarised as a density. Note, that these distances are "raw" nearest-neighbour distances which are not corrected for edge effects. Edge correction for the nearest neighbour distance ($k = 1$) is implemented in the function `Gest`, see next paragraph [@baddeleySpatialPointPatterns2015 pp. 256; @baddeleySpatstatPackageAnalyzing2005].
```{r, cache=FALSE}
nndistance <- function(pp, nk) {
  xy <- cbind(pp$x, pp$y)
  nndistances_k15 <- nndist(xy, k = nk)
  nndistances_mean <- rowMeans(nndistances_k15)
  return(nndistances_mean)
}
#PRE: list of point pattern, corresponding celltypes of interest, functions to evaluate
#POST: result of the metric
metricRes_nndist <- function(ppls, celltype, fun) {
  metric.res <- list(res = do.call(fun,
                                   args = list(pp = ppls[[celltype]],
                                               nk = seq(1:15))))
  metric.res$type = celltype
  return(metric.res)
}
celltypes <- c("Ependymal", "OD Mature", "Microglia")
#go through all defined celltypes and calculate the nearest-neighbour distance
res_ls <- lapply(celltypes, metricRes_nndist, fun = nndistance, ppls = pp_ls[['0.01']])
#initialise a dataframe for the metric values and the type information
res_df <- data.frame(metric = numeric(0), type = character(0))
# Loop through the res_ls list and combine the metric values with their corresponding type
for (i in 1:length(res_ls)) {
  metric_values <- res_ls[[i]]$res
  metric_type <- rep(res_ls[[i]]$type, length(metric_values))
  df <- data.frame(metric = metric_values, type = metric_type)
  res_df <- rbind(res_df, df)
}
#plot the densities
p <- ggplot(res_df, aes(x = metric, col = type)) +
  geom_density(linewidth = 1) +
  scale_x_sqrt() +
  theme_light() +
  ggtitle('Sqrt of the Mean Nearest-Neighbour Distance')
p
```
In the MNND empirical distribution, the ependymal cells show the shortest NN distances, a reflection of their clustering. The OD mature cells have larger NN distances as well as a bimodal distribution, indicating a mix of shorter and longer distances (as visible in the LISA $L$-functions). Microglia cells show the widest distances and the symmetry of the curve indicates similar distances throughout the field of view.
### Nearest-neighbour function $G$ and empty-space function $F$
#### Definitions of $F$ and $G$ function
In a stationary spatial point process, the empty-space distance is defined as:
$$
d(u,X) = \min\{||u-x_i||: x_i \in X\}
$$
Note that this is an edge-corrected distribution function of the nearest-neighbour distance above.
The empty space function is then the cumulative distribution function of the empty-space distances defined above:
$$
F(r) = \mathbb{P}\{d(u,X)\leq r\}.
$$
The NN distance is defined as:
$$
d_i = \min_{j\neq i}||x_j-x_i||.
$$
The NN distance distribution function $G(r)$ is then defined as:
$$
G(r) = \mathbb{P}\{d(x,X\backslash u \leq r |X\ has\ a\ point\ at\ u\}.
$$
For a homogeneous Poisson process, the NN distance distribution is identical to the empty-space function of the same process:
$$
G_{pois} \equiv F_{pois}.
$$
For a general point process, the $F$ and $G$ functions are different [@baddeleySpatialPointPatterns2015 pp. 261-267].
### Empty-space hazard
The $F$ and $G$ functions are, like the $K$ function, cumulative. Therefore, an analogue to the pair-correlation function would make sense to consider. For practical reasons, this is no longer the derivative of the $F$ function but rather a hazard rate:
$$
h(r) = \frac{f(r)}{1-F(r)}.
$$
[@baddeleySpatialPointPatterns2015 pp. 271-274].
### $J$-Function
The concepts of the empty-space function $F$ and the NN function $G$ are complementary. If one decreases, the other increases.
Thus, the $J$ function is a combination of both functions:
$$
J(r) = \frac{1-G(r)}{1-F(r)}.
$$
For a CSR process, $J_{pois} \equiv 1$, whereas values of $J(r) > 1$ are consistent with a regular (e.g., repelling) pattern, and $J(r) < 1$ represents a clustered process [@baddeleySpatialPointPatterns2015 pp. 275-277].
```{r, fig.height = 10, fig.width = 7}
res_ls <- lapply(list('Gest', 'Fest', 'Jest'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id')
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    correction = "rs",
    theo = TRUE,
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})
```
```{r, cache=FALSE, fig.height = 6, fig.width = 16}
wrap_plots(p_ls, guides = 'collect')
```
### Accounting for Inhomogeneity in Spacing Functions
There are inhomogeneous variants of the spacing functions explained above [@baddeleySpatialPointPatterns2015 pp. 277-278].
```{r}
res_ls <- lapply(list('Ginhom', 'Finhom', 'Jinhom'), function(fun) {
  res <- calcMetricPerFov(
    spe,
    'OD Mature',
    subsetby = 'sample_id',
    fun = fun,
    marks = 'cluster_id',
    rSeq = NULL,
    by = c('Animal_ID', 'sample_id')
  )
  res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
  return(res)
})
p_ls <- lapply(res_ls, function(res) {
  plotMetricPerFov(
    res,
    correction = "bord",
    theo = TRUE,
    x = "r",
    imageId = 'sample_id',
    legend.position = "right"
  )
})
```
```{r, cache=FALSE, fig.height = 6, fig.width = 16}
wrap_plots(p_ls, guides = 'collect')
```
The inhomogeneous curves look different to their homogeneous counterparts but the relative ordering of the curves per plot is the same.
### Nearest-Neighbour Orientation
Next to the NN distance, we can estimate the *orientation* of the neighbours, which gives an indication of the orientation of the spacing. It works by taking the angle between each point and its $k^{th}$ nearest neighbour. The angle is anticlockwise from the x-axis [@baddeleySpatialPointPatterns2015 pp. 278-279] [@baddeleySpatstatPackageAnalyzing2005].
```{r, error=FALSE, message=FALSE, warning=FALSE, results='hide', include=TRUE}
res <- calcMetricPerFov(
  spe,
  'OD Mature',
  subsetby = 'sample_id',
  fun = 'nnorient',
  marks = 'cluster_id',
  rSeq = NULL,
  by = c('Animal_ID', 'sample_id')
)
res <- subset(res, sample_id %in% c('-0.09', '0.01', '0.21'))
p <- plotMetricPerFov(
  res,
  correction = "bordm",
  theo = TRUE,
  x = "phi",
  imageId = 'sample_id',
  legend.position = "right"
)
```
```{r}
p
```
The values of $\phi$ correspond to the orientation of the original point pattern. The horizontal axis goes from $180$ to $0$ (left to right) and the vertical from $90$ to $270$ (top to bottom).
An easier representation of the metric above can be plotted as a rose diagram.
```{r, message=FALSE, warning=FALSE, fig.width=6, fig.height=6}
lapply(seq_along(pp), function(x){rose(nnorient(pp[[x]]), main = names(pp)[x])})
```
The two plots are complementary and show which are the preferred orientations of the point patterns. Furthermore, they show whether or not the assumption of isotropy (no change in the statistical properties of a point pattern after rotations) is justified or not [@baddeleySpatialPointPatterns2015 pp. 236 ff.]. Isotropy is an assumption that a lot of spatial metrics make and in our example we note, that the point patterns are in fact anisotropic as we can see a trend for the preferred orientation in the rose diagrams. An option for analysing anisotropic stationary point patterns is to not calculate the metric on the actual point pattern but rather calculating it on the fry plot of the point pattern. This generalises e.g. Ripley's $K$ function from circles to arbitrary shapes [@baddeleySpatialPointPatterns2015 pp. 239 ff.].
Note also that the concepts of spacing are not only usable in *point* pattern analysis but also more broadly in other spatial contexts (e.g., spacing between shapes instead of points) [@baddeleySpatialPointPatterns2015 pp. 279 ff.].
### Edge Corrections
The same consideration about edge effects as for the $K$ (and related) functions need to be made for the spacing functions; uncorrected estimates are negatively biased estimators. The easiest approach is to draw an artificial border and consider NNs within it. Other approaches are based on sampling. Yet another approach relates to survival analysis, with the idea is that a circle of a point to grows homogeneously with increasing radius until it hits the frame of the window and "dies". This gives survival distributions similar to censored data, where the Kaplan-Meier estimator is a good choice [@baddeleySpatialPointPatterns2015 pp. 285-292].
# Continuous Marks
## Setup
```{r, message=FALSE}
#| label: load-data 2
# redefine the pp here to be zstack 0.01
pp <- pp[['0.01']]
sub <- spe[, spe$sample_id == '0.01']
```
In `r BiocStyle::CRANpkg('spatstat')`, a `mark` can basically take any value, categorical (e.g. cell types, as we have seen above) or continuous (e.g., gene expression) [@baddeleySpatialPointPatterns2015 pp. 637 ff.]. In our example, we take the gene expression of some marker genes from Fig. 6 of the original publication [@moffittMolecularSpatialFunctional2018]. This is a typical numerical mark for points in a biological dataset. Here, we will use all cell types for the analysis of numerical marks.
```{r}
#  Genes from Fig. 6 of Moffitt et al. (2018)
genes <- c('Slc18a2', 'Esr1', 'Pgr')
gex <- assay(sub)[genes,] %>% t %>% as.matrix %>% 
  data.frame %>% set_rownames(NULL)
marks(pp) <- gex
```
```{r, fig.width = 10, fig.height=6}
# create a dataframe in long format for plotting
pp_df <- pp |>
  as.data.frame() |>
  tidyr::pivot_longer(cols = 3:5)
ggplot(pp_df, aes(x, y, colour = log(value + 1))) +
  geom_point(size = 0.5) +
  facet_wrap(~name) +
  coord_equal() +
  scale_color_continuous(type = "viridis") + 
  theme_light()
```
Here we see spatial distribution of the counts of the three genes *Slc18a2*, *Esr1* and *Pgr*.
The function `pairs` from `r BiocStyle::CRANpkg('spatstat')` generates a scatterplot of the counts of the marks (in our case the three genes) against each other and against the $x$ and $y$ coordinates. We can add a non-linear smoothing curve to make the general trends a bit more obvious [@baddeleySpatialPointPatterns2015 pp.641].
```{r, fig.width=8, fig.height=8}
pairs(as.data.frame(pp), panel = panel.smooth, pch=".")
```
We find that the counts of the three genes are very evenly distributed along the $x$ and $y$ coordinate, indicating a homogeneous distribution. The counts of *Esr1* and *Pgr* are positively associated, indicating a dependence of these two marks.
NN interpolations uses the nearest mark to measure the intensity at each spatial location. This is conceptually similar to taking a very small bandwidth for Gaussian kernel smoothing [@baddeleySpatialPointPatterns2015 pp. 642].
```{r, fig.width=8, fig.height=8}
plot(nnmark(pp))
```
We see that there is e.g. a clear spatial structure in the expression of e.g. *Esr1*. It shows a half moon shape.
## Summary functions for continuous marks
As in the discrete case, summary functions assume that the point process is stationary.
### Mark correlation function
The mark correlation function measures the dependence between two marks for two points at distance $r$. It is applicable to stationary point processes with marks. The generalized mark correlation function is given by:
$$ k_f(r) = \frac{\mathbb{E}[f(m(u),m(v)) \mid \| u - v \| = r]}{\mathbb{E}[f(M,M')]},$$
where $f(m(u),m(v))$ is a test function with two arguments (representing the two marks at locations $u$ and $v$ with distance $r$ apart) and returns a non-negative value. For continuous non-negative marks, the canonical choice for $f$ is typically $f(m(u),m(v))= m(u)m(v)$. $M$ and $M′$ represent independent, identically distributed random points with the same distribution as the mark of a randomly chosen point. This denominator is chosen such that random marks have a mark correlation of 1 [@baddeleySpatialPointPatterns2015 pp. 644-645].
```{r}
res <- calcMetricPerFov(
  sub,
  'OD Mature',
  subsetby = genes,
  fun = 'markcorr',
  marks = genes,
  rSeq = NULL,
  by = c('Animal_ID', 'sample_id'),
  continuous = TRUE
)
p <- plotMetricPerFov(
  res,
  correction = "iso",
  theo = TRUE,
  x = "r",
  imageId = 'gene',
  legend.position = "right"
)
p
```
From this plot we show that all genes show a positive correlation at small distances which decline with increasing radius $r$. The association is strongest for the *Slc18a2* gene. We can calculate simulation envelopes to estimate the significance of this association. This is not shown for brevity.
### Mark-weighted $K$-function
The mark-weigthed $K$-function is a generalization of the $K$-function in which the contribution from each pair of points is weighted by a function of their respective marks. It is given by:
$$K_f(r) = \frac 1  \lambda \frac{C_f(r)}{E[ f(M_1, M_2) ]},$$ where
$$ C_f(r) = E \left[ \sum_{x_j \in X} f(m(u), m(x_j)) \mathbf{1} \{0 < ||u - x_j|| \le r\} \;  \mid \; u \in X \right], $$
is equivalent to the unnormalized mark-weighted $K$-function. For every point $u$, we sum the euclidean distance $||u - x||$ of all other points $x$ that are within a distance $r$. This sum is weighted by the function $f(.,.)$ of the marks of $u$ and $x$. The function is standardized by the expected value of $f(M_1, M_2)$ where $M_1, M_2$ represent independent, identically distributed random points with the same distribution as the mark of a randomly chosen point [@baddeleySpatialPointPatterns2015 pp. 646-647].
In the scenario of random labeling, so where the marks are distributed randomly, the mark-weighted $K$-function corresponds to the standard Ripley's $K$-function.
Also here, the canonical function is: $f(m_1, m_2) = m_1 m_2$. This means we weigh each interaction between points by the product of the continuous marks of both points.
```{r}
res <- calcMetricPerFov(
  sub,
  'OD Mature',
  subsetby = genes,
  fun = 'Kmark',
  marks = genes,
  rSeq = NULL,
  by = c('Animal_ID', 'sample_id'),
  continuous = TRUE
)
p <- plotMetricPerFov(
  res,
  correction = "iso",
  theo = TRUE,
  x = "r",
  imageId = 'gene',
  legend.position = "right"
)
p
```
It is important to note that the theoretical value of the $K$-function is not very informative since it represents the $K$-function of a Poisson point process and the underlying point process might not be Poisson. Therefore we compare the mark-weighted with its unmarked analogue. Like this, we can assess whether the points weighted by a continuous mark are more or less correlated than their unmarked analogues [@baddeleySpatialPointPatterns2015 pp. 647].
Here we will compare the $L$-functions weighted by the mark of the gene *Esr1* and the unmarked $L$-function.
```{r}
ppEsr1 <- subset(pp, select = 'Esr1')
L.Esr1L <- Kmark(ppEsr1, function(m1,m2) {m1*m2}, returnL = TRUE)
Lest.ppEsr1 <- Lest(ppEsr1, nlarge=7000)
plot(eval.fv(L.Esr1L - Lest.ppEsr1))
```
We note that the difference between $L$-function weighted by the expression of *Esr1* minus the unmarked $L$-function is positively different to the Poisson difference, meaning that the expression of the continuous mark *Esr1* is correlated among itself.
# Summary
In this chapter we have looked at point pattern methods that can be applied to univariate marks. These univariate marks can either be categorical (in a molecular biological context that could be celltypes) or continuous (e.g. expression of a gene). There are approaches that summarise correlative dependencies among points or that consider spacing functions.
# Appendix
## Session info
```{r, cache=FALSE}
#| label: session-info
sessionInfo()
```