Encoding spatial patterns as variables

Principal Coordinate Analysis & Moran Eigenvectors

spatial
r
Author

Josiah Parry

Published

May 17, 2024

I’ve begun reading “Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM)” which describes the process of making “Moran Eigenvector Maps.”

I haven’t finished reading the paper yet, but I’m quite thrilled by the prospect of it. One of the biggest problems in ecological and social science modelling is that space is often a confounder. By this I mean that a lot of the phenomena we observe are spatially dependent.

Reductively, spatial dependence means that variables or outcomes are strongly linked to where things are. For example, income tends to be spatially dependent: high income areas are typically surrounded by other high income areas.

The problem

When modelling data that exhibit spatial dependence, spatial relationships need to be accounted for. Otherwise, you will often find that the model residuals (errors) also exhibit spatial dependence. So how can you control for this?

There are a number of techniques people use, ranging from more statistically sound ones to tricks favored by ML engineers. For example, you might introduce the spatial lag of a variable (its neighborhood average) to account for some of the spatial association.
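As a minimal sketch of that trick, sfdep’s st_lag() computes the spatial lag directly. This snippet builds its own neighbors and weights (named nb_lag and wt_lag to keep them separate from the objects used later in the post):

library(sfdep)

# contiguity neighbors and row-standardized weights
nb_lag <- st_contiguity(guerry$geometry)
wt_lag <- st_weights(nb_lag)

# the spatial lag: the weighted average of each observation's neighbors
crime_lag <- st_lag(guerry$crime_pers, nb_lag, wt_lag)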

Principal Coordinate Analysis (PCoA)

One interesting idea is using principal components analysis to encode geography into numeric variables. Conceptually, the idea is actually rather simple!

When we do spatial statistics, we create what are called spatial weights matrices. These define which features are related to each other.

For example, we can identify the neighbors in the famous guerry dataset based on contiguity, that is, whether the geometries touch. We create an nb and a wt object. The nb object holds the neighbors and wt weights them with a Gaussian kernel. The Gaussian kernel assigns more weight to locations that are closer and less weight to those that are further away, essentially following the shape of the normal distribution.

library(sfdep)
library(dplyr)

# extract the geometries and compute centroids for the kernel
geoms <- guerry$geometry
centroids <- sf::st_centroid(geoms)

# contiguity neighbors and gaussian kernel weights
nb <- st_contiguity(geoms)
wt <- st_kernel_weights(nb, centroids, "gaussian")

Visually, this is what the neighborhood relationship looks like:

library(ggplot2)
sfn <- st_as_graph(geoms, nb, wt)

autoplot(sfn) +
  geom_sf(data = geoms, fill = NA, color = "black", lwd = 0.2) +
  theme_void()

The weights object is a ragged array, which serves as a sparse matrix representation of the spatial weights.

head(wt)
[[1]]
[1] 1.553402 1.857660 2.062100 1.676694

[[2]]
[1] 1.801787 1.717777 1.439955 1.721547 1.260566 1.429496

[[3]]
[1] 1.599532 1.527097 1.376795 1.722723 1.865664 1.350771

[[4]]
[1] 2.040754 1.356645 1.871658 1.685343

[[5]]
[1] 2.040754 1.674375 1.689488

[[6]]
[1] 2.075805 1.679763 1.357435 1.308397 2.009760 1.812262 1.432539

The full spatial weights form an n x n square matrix. The idea behind the paper above is that we can encode the spatial relationships in this neighborhood matrix using principal components.

We can take the weights matrix and create a dense matrix from it:

m <- wt_as_matrix(nb, wt)
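As a quick check, the result should be square, with one row and one column for each of the 85 departments in guerry:

# one row and column per department (85 x 85)
dim(m)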

We can then perform PCA on this matrix.

pca_res <- prcomp(m)
summary(pca_res)
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     0.95957 0.89585 0.85836 0.81484 0.79239 0.72065 0.66135
Proportion of Variance 0.06928 0.06038 0.05543 0.04996 0.04724 0.03907 0.03291
Cumulative Proportion  0.06928 0.12966 0.18510 0.23505 0.28229 0.32137 0.35428
                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     0.65347 0.60391 0.58993 0.54551 0.51607 0.51048 0.50266
Proportion of Variance 0.03213 0.02744 0.02618 0.02239 0.02004 0.01961 0.01901
Cumulative Proportion  0.38640 0.41385 0.44003 0.46242 0.48246 0.50206 0.52107
                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
Standard deviation     0.50008 0.49651 0.48004 0.47334 0.46571 0.46447 0.45886
Proportion of Variance 0.01882 0.01855 0.01734 0.01686 0.01632 0.01623 0.01584
Cumulative Proportion  0.53989 0.55844 0.57578 0.59263 0.60895 0.62518 0.64103
                          PC22   PC23    PC24    PC25    PC26    PC27    PC28
Standard deviation     0.45371 0.4495 0.43495 0.43208 0.42533 0.42265 0.40912
Proportion of Variance 0.01549 0.0152 0.01423 0.01405 0.01361 0.01344 0.01259
Cumulative Proportion  0.65651 0.6717 0.68595 0.70000 0.71361 0.72705 0.73964
                          PC29    PC30    PC31    PC32    PC33    PC34    PC35
Standard deviation     0.40662 0.40248 0.39657 0.38949 0.38172 0.37648 0.36612
Proportion of Variance 0.01244 0.01219 0.01183 0.01141 0.01096 0.01066 0.01009
Cumulative Proportion  0.75208 0.76427 0.77610 0.78752 0.79848 0.80915 0.81923
                          PC36    PC37    PC38    PC39    PC40    PC41    PC42
Standard deviation     0.35885 0.35324 0.35042 0.34655 0.33906 0.33458 0.32477
Proportion of Variance 0.00969 0.00939 0.00924 0.00904 0.00865 0.00842 0.00794
Cumulative Proportion  0.82892 0.83831 0.84755 0.85658 0.86523 0.87366 0.88159
                          PC43    PC44    PC45    PC46    PC47    PC48    PC49
Standard deviation     0.32182 0.30859 0.30426 0.30100 0.29700 0.28072 0.27493
Proportion of Variance 0.00779 0.00716 0.00697 0.00682 0.00664 0.00593 0.00569
Cumulative Proportion  0.88938 0.89655 0.90351 0.91033 0.91697 0.92290 0.92858
                          PC50    PC51    PC52    PC53    PC54    PC55    PC56
Standard deviation     0.26620 0.25927 0.25817 0.25373 0.25203 0.23148 0.22505
Proportion of Variance 0.00533 0.00506 0.00501 0.00484 0.00478 0.00403 0.00381
Cumulative Proportion  0.93391 0.93897 0.94399 0.94883 0.95361 0.95764 0.96145
                          PC57   PC58    PC59    PC60    PC61    PC62    PC63
Standard deviation     0.21925 0.2124 0.20738 0.20542 0.20426 0.17969 0.17415
Proportion of Variance 0.00362 0.0034 0.00324 0.00318 0.00314 0.00243 0.00228
Cumulative Proportion  0.96507 0.9685 0.97170 0.97488 0.97801 0.98044 0.98273
                          PC64    PC65    PC66    PC67    PC68    PC69    PC70
Standard deviation     0.17330 0.16078 0.15374 0.14641 0.14201 0.13335 0.13088
Proportion of Variance 0.00226 0.00194 0.00178 0.00161 0.00152 0.00134 0.00129
Cumulative Proportion  0.98499 0.98693 0.98871 0.99032 0.99184 0.99318 0.99447
                          PC71    PC72    PC73    PC74    PC75    PC76    PC77
Standard deviation     0.12526 0.10810 0.10181 0.08943 0.08425 0.07410 0.07172
Proportion of Variance 0.00118 0.00088 0.00078 0.00060 0.00053 0.00041 0.00039
Cumulative Proportion  0.99565 0.99653 0.99731 0.99791 0.99844 0.99885 0.99924
                          PC78    PC79    PC80    PC81    PC82    PC83     PC84
Standard deviation     0.06809 0.04608 0.03760 0.03481 0.02250 0.01105 0.008143
Proportion of Variance 0.00035 0.00016 0.00011 0.00009 0.00004 0.00001 0.000000
Cumulative Proportion  0.99959 0.99975 0.99986 0.99995 0.99999 1.00000 1.000000
                            PC85
Standard deviation     3.921e-16
Proportion of Variance 0.000e+00
Cumulative Proportion  1.000e+00

The spatial relationships embedded in the spatial weights matrix are now encoded as components from a PCA. This means we can use each of these components as a univariate measure of space. And they also exhibit quite interesting patterns of spatial dependence.
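To use the components downstream, they can be attached to the data as ordinary columns. A minimal sketch (the names ev1 and ev2 are just illustrative):

# attach the first two eigenvectors as candidate model covariates
guerry_ev <- guerry |>
  mutate(
    ev1 = pca_res$rotation[, 1],
    ev2 = pca_res$rotation[, 2]
  )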

Exploring PCoA

These components essentially capture spatial autocorrelation. For example, we can look at the first component.

# extract the first component
comp1 <- pca_res$rotation[, 1]

ggplot(guerry, aes(fill = comp1)) +
  geom_sf(color = "black", lwd = 0.2) +
  scale_fill_viridis_c() +
  theme_void() +
  labs(fill = "Eigenvector")

It displays a gradient from the area around Paris (the dark purple, negative eigenvector values) to the area around Aveyron (the positive eigenvector values). Clearly, this displays some interesting global spatial autocorrelation. But how much?

We can measure the global spatial autocorrelation of this component using Moran’s I.

global_moran_perm(comp1, nb, wt)

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 500 

statistic = 1.0698, observed rank = 500, p-value < 2.2e-16
alternative hypothesis: two.sided

The result is 1.0698, which is greater than 1. The familiar [-1, 1] range for Moran’s I only holds approximately, and only for row-standardized weights; with unstandardized kernel weights like ours the statistic can exceed it. Either way, there is a ridiculous amount of spatial autocorrelation here.
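If you want the statistic on the familiar scale, a sketch of a sanity check (not part of the original workflow) is to recompute it with row-standardized weights:

# row-standardized contiguity weights keep Moran's I near [-1, 1]
wt_std <- st_weights(nb, style = "W")
global_moran_perm(comp1, nb, wt_std)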

Using PCoA Eigenvectors to reduce spatial confounding

Predicting personal crime from population and prostitution levels in 1830s France leaves a lot of spatial autocorrelation in the residuals. This means the model does not appropriately account for spatial dependence.

mod <- lm(crime_pers ~ pop1831 + prostitutes, data = guerry)
summary(mod)

Call:
lm(formula = crime_pers ~ pop1831 + prostitutes, data = guerry)

Residuals:
     Min       1Q   Median       3Q      Max 
-13762.2  -4592.1   -974.6   4892.4  18672.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13688.932   2265.393   6.043 4.27e-08 ***
pop1831        17.676      5.881   3.006  0.00352 ** 
prostitutes    -3.197      1.665  -1.920  0.05833 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7001 on 82 degrees of freedom
Multiple R-squared:  0.1021,    Adjusted R-squared:  0.08016 
F-statistic:  4.66 on 2 and 82 DF,  p-value: 0.01211
global_moran_test(resid(mod), nb, wt)

    Moran I test under randomisation

data:  x  
weights: listw    

Moran I statistic standard deviate = 5.0424, p-value = 2.298e-07
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.328493924      -0.011904762       0.004557168 

If you include the first eigenvector component, the spatial autocorrelation of the residuals decreases dramatically.

mod <- lm(crime_pers ~ pop1831 + prostitutes + comp1, data = guerry)
summary(mod)

Call:
lm(formula = crime_pers ~ pop1831 + prostitutes + comp1, data = guerry)

Residuals:
     Min       1Q   Median       3Q      Max 
-14228.8  -3822.7   -893.4   4232.5  19718.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14847.292   2080.269   7.137 3.66e-10 ***
pop1831         15.179      5.386   2.818  0.00607 ** 
prostitutes     -4.597      1.551  -2.964  0.00399 ** 
comp1       -28422.828   6708.123  -4.237 5.95e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6373 on 81 degrees of freedom
Multiple R-squared:  0.265, Adjusted R-squared:  0.2378 
F-statistic: 9.733 on 3 and 81 DF,  p-value: 1.482e-05
global_moran_test(resid(mod), nb, wt)

    Moran I test under randomisation

data:  x  
weights: listw    

Moran I statistic standard deviate = 2.666, p-value = 0.003838
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.167439656      -0.011904762       0.004525529 

Interestingly, this raises the \(R^2\) from roughly 0.10 to 0.27, which is nothing to scoff at. The prostitutes variable becomes clearly significant and its coefficient grows in magnitude, the pop1831 coefficient shrinks, and the first component accounts for pretty much everything else lol!

What about another component?

We can plot the relationship that is captured by the second component.

# extract the second component
comp2 <- pca_res$rotation[, 2]

ggplot(guerry, aes(fill = comp2)) +
  geom_sf(color = "black", lwd = 0.2) +
  scale_fill_viridis_c() +
  theme_void() +
  labs(fill = "Eigenvector")

This component captures a west to east relationship rather than a north to south one. Is the second component spatially autocorrelated?

global_moran_perm(comp2, nb, wt)

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 500 

statistic = 0.99864, observed rank = 500, p-value < 2.2e-16
alternative hypothesis: two.sided

Oh hell yeah it is.

If this component is included in the model instead of the first one, we see something interesting.

mod <- lm(crime_pers ~ pop1831 + prostitutes + comp2, data = guerry)
summary(mod)

Call:
lm(formula = crime_pers ~ pop1831 + prostitutes + comp2, data = guerry)

Residuals:
   Min     1Q Median     3Q    Max 
-13617  -4584  -1150   4831  18360 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13730.388   2278.027   6.027 4.71e-08 ***
pop1831        17.518      5.919   2.960  0.00404 ** 
prostitutes    -3.098      1.686  -1.837  0.06989 .  
comp2        3303.053   7091.459   0.466  0.64262    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7034 on 81 degrees of freedom
Multiple R-squared:  0.1045,    Adjusted R-squared:  0.0713 
F-statistic:  3.15 on 3 and 81 DF,  p-value: 0.02941
global_moran_test(resid(mod), nb, wt)

    Moran I test under randomisation

data:  x  
weights: listw    

Moran I statistic standard deviate = 5.0189, p-value = 2.598e-07
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.326953780      -0.011904762       0.004558477 

Neither the model nor the spatial autocorrelation of the residuals is meaningfully impacted. So the pattern captured by the second component is not confounding our variables the way the first one is.

What does this mean?

If the outcome you’re predicting is spatially dependent, you should consider using these eigenvectors as input features in your models. I have a hunch that they would work insanely well with computer vision tasks and with models like Random Forests and XGBoost.
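As a rough sketch of that idea, assuming the randomForest package and the guerry_ev data frame built earlier (keeping two eigenvectors is an arbitrary choice):

library(randomForest)

# eigenvectors as plain numeric features alongside the original predictors
rf <- randomForest(
  crime_pers ~ pop1831 + prostitutes + ev1 + ev2,
  data = sf::st_drop_geometry(guerry_ev)
)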