-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
337 lines (254 loc) · 13.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "Partitioned Local Depth"
author: "Katherine Moore, Kenneth Berenhaut, and Lucy D'Agostino McGowan"
date: "January 2022"
bibliography: pald_package.bib
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
<!-- badges: start -->
[![R-CMD-check](https://github.com/LucyMcGowan/pald/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/LucyMcGowan/pald/actions/workflows/check-standard.yaml)
<!-- badges: end -->
This package provides an implementation of the Partitioned Local Depth (PaLD) approach which consists of a measure of *local depth* and the *cohesion* of a point to another which, together with a universal threshold for distinguishing strong and weak ties, may be used to reveal local and global structure in data. No extraneous inputs, distributional assumptions, iterative procedures nor optimization criteria are employed. This package includes functions for computing local depths and cohesion matrices as well as flexible functions for plotting community networks and displays of cohesion against distance.
For further discussion of the perspective, including some theoretical results and applications, see:
Berenhaut, Kenneth S., Katherine E. Moore, and Ryan L. Melvin. 2022. “A Social Perspective on Perceived Distances Reveals Deep Community Structure.” *Proceedings of the National Academy of Sciences,* 119 (3).
## Installation
Install the CRAN version:
```{r, eval = FALSE}
install.packages("pald")
```
Or you can install the development version of pald from GitHub with:
``` r
# install.packages("devtools")
devtools::install_github("LucyMcGowan/pald")
```
```{r example, results='hide'}
library(pald)
```
## Input
The input for the Partitioned Local Depth (PaLD) approach is a distance matrix (or `dist` object) associated with a finite collection of data points. Throughout, no distributional assumptions, iterative procedures nor optimization criteria are employed.
The only information extracted from the distance matrix are within-triplet dissimilarity comparisons. As a result, outputs are unaffected by monotone transformations of the collection of distances (e.g., $\log_2$). Further, one may transform any measure of similarity, $s(x, y)$, to a measure of dissimilarity, $d(x,y)$, via any order-reversing monotone transformation, for instance by taking $d(x, y) = 1/(1 + s(x, y))$. This provides the user some flexibility in the choice of dissimilarity (e.g., triangle inequality is not required) and care should be taken at this stage.
The function `dist()` from the `stats` package converts an input data frame (with $n$ rows) into an $n \times n$ distance matrix. In Euclidean examples here, we will use the default Euclidean distance.
## A Small Example
For demonstration purposes, let's begin with the small example from Figure 1 in [@bmm22].
```{r fig-1, echo=FALSE}
par(pty = "s")
plot(exdata1,
pch = 16,
xlim = c(-2.5, 2.25),
ylim = c(-1.5, 3.25),
xlab = "",
ylab = "")
text(exdata1 + .2,
lab = 1:8,
xlim = c(-2.5, 2.25),
ylim=c(-1.5, 3.25),
xlab = "",
ylab = "")
```
The wrapper function `pald()` computes the cohesion matrix from which local depths are determined and community networks may be formed. In the plots of the community networks, strongly cohesive pairs are colored according to connected component. Such connected components may be considered "(community) clusters." Note that the Fruchterman Reingold (FR) force-directed network drawing algorithm employed here will provide somewhat different graph layouts each time it is run.
```{r pald}
par(mfrow = c(1, 2), pty = "s")
D <- dist(exdata1)
pald_results <- pald(D, emph_strong = 1, vertex.label.cex = 3)
###
plot(exdata1,
pch = 16,
xlim = c(-2.5, 2.25),
ylim=c(-1.5, 3.25),
xlab = "",
ylab = "",
main = "Local Depths")
text(exdata1 + .23,
lab = round(pald_results$local_depths, 2),
xlim = c(-2.5, 2.25),
ylim = c(-1.5, 3.25),
xlab = "",
ylab = "",
cex = .8)
```
The wrapper function `pald()` returns a list containing: the cohesion matrix, local depths, (community) clusters, the threshold for identifying strong ties, the thresholded and symmetrized cohesion matrix, the community graph whose edges are weighted by mutual cohesion, the weighted graph of strong ties, and the layout provided by the FR network drawing algorithm applied to the community graph.
Each time the function `pald()` is called, the matrix of cohesion values is re-computed. To avoid unnecessary computation, the following functions are included: `local_depths()`,`strong_threshold()`, `cohesion_strong()`, `community_graphs()`, and `plot_community_graphs()`. We will now walk through each function in turn.
## Cohesion Matrix
Cohesion reflects relationship strength from the perspective of relative position, see [@bmm22]. To begin PaLD analysis, we must first compute the matrix of cohesion values from the input distance matrix or `dist` object. Note that cohesion is not symmetric. The values, $C[x, w]$, in the cohesion matrix are interpretable probabilities which capture the strength of the alignment of $w$ to $x$. The sum of the cohesion matrix is always equal to $n/2$ (where $n$ is the number of data points).
```{r cohesion}
D <- dist(exdata1)
C <- cohesion_matrix(D)
round(C, 4)
# A heat-map of the cohesion matrix
image(t(apply(C, 2, rev)), main = "Cohesion Matrix Heatmap")
```
## Local Depths
Local depth is a probability which describes the support for each point in data-determined local foci. Cohesion is obtained by partitioning local depth, and thus the values of local depth can be computed as the row sums of the cohesion matrix. See also Figure 2(b), above. The average of the values of local depth is always equal to 1/2.
```{r}
# local depths are obtained by computing: rowSums(C)
local_depths(C)
mean(local_depths(C))
```
## Threshold and Strong Ties
The threshold provided in [@bmm22] for distinguishing between strongly and weakly cohesive pairs is equal to of half the average of the diagonal of the cohesion matrix. A function for computing this is provided.
```{r}
# the threshold is obtained by computing: mean(diag(C))/2
strong_threshold(C)
```
Pairs of points for which mutual cohesion (i.e., $\min\{C_{x, w}, C_{w, x}$}) is greater than the above threshold are considered to be ``strongly cohesive." The thresholded and symmetrized cohesion matrix can be obtained using the function 'cohesion_strong.'
```{r}
round(cohesion_strong(C), 4)
```
## Community Structure and Display
The overall structure of the data can be observed via the networks obtained from cohesion (referred to here as "community graphs"). The community graph is a symmetric, weighted graph which is obtained from symmetrizing the cohesion matrix (using $\min\{C_{x, w}, C_{w, x}\}$) and removing self-loops. The "community cluster graph" is the subgraph consisting of only the edges for which mutual cohesion greater than the above threshold.
The connected components of the community cluster graph, `G_strong`, are referred to the (community) clusters of the data. Note that no additional inputs (e.g., number of clusters, neighborhood size) nor optimization criteria are employed in (community) cluster identification.
Strong ties between points will not be severed; if further partitioning of the community graph is desired, one could employ other methods (for instance, the Louvain method) to the resulting community graphs.
```{r cluster, message=FALSE}
graphs <- community_graphs(C)
community_clusters(C)
```
A function for plotting the community graphs, in which the edges and vertices are colored according to cluster membership, is provided. The default layout is obtained via the Fruchterman Reingold (FR) force-directed graph drawing algorithm. Note that the FR force-directed algorithm will provide somewhat different layouts each time it is run.
```{r comm}
plot_community_graphs(C)
```
You can save a particular network layout using `community_graphs(C)$layout`.The function `plot_community_graphs` can take a given layout as an argument. Below, let's overlay the community graph on the original data, that is, plot the network using the layout provided by the original data.
```{r comm-2}
par(pty = "s")
plot(exdata1,
xlim = c(0, 1),
ylim = c(0, 1),
col = "white",
xlab = "",
ylab = "")
par(new = TRUE)
plot_community_graphs(C, layout = as.matrix(exdata1), show_labels = FALSE)
```
## Cohesion Against Distance
Observe that cohesion is not a direct transformation of distance. The `dist_cohesion_plot()` function provides a plot of pairwise distances and associated value(s) of cohesion; the horizontal line indicates the threshold. Within-cluster edges are colored, and weak ties are plotted as open circles. See [@bmm22] for more on the interpretation of these plots.
Let's re-create Figure 2 in [@bmm22].
```{r fig-2}
D <- dist(exdata2)
C <- cohesion_matrix(D)
par(mfrow = c(1, 2))
par(pty = "s")
plot(exdata2,
col = "white",
xlab = "",
ylab = "")
par(new = TRUE)
plot_community_graphs(C,
layout = as.matrix(exdata2),
show_labels = FALSE,
vertex.size = 3)
dist_cohesion_plot(D, cex = .8, weak_gray = TRUE)
```
Rather than showing both pairs, (d(x, y), C(x, w)) and (d(x, y), C(w, x)) as distinct points in the plot, setting `mutual = TRUE` will only plot mutual cohesion, that is the set of points with x-coordinate d(x, y) and y-coordinate min{C(x, w), C(w, x)}.
```{r mutual}
par(pty = "s")
D <- dist(exdata2)
dist_cohesion_plot(D, mutual = TRUE)
```
## Randomly-Generated Data
In this example, we consider a randomly-generated data set of 15 points. Points are selected according to the uniform distribution on the unit square. We first compute the distance matrix and from this calculate the cohesion matrix. We overlay the community graphs on the data set using `plot_community_graphs` and use our original data as the `layout`.
```{r rand, results = 'hide'}
ex_data <- matrix(runif(30), ncol = 2)
D <- dist(ex_data)
C <- cohesion_matrix(D)
par(pty = "s")
plot(
ex_data,
xlim = c(0, 1),
ylim = c(0, 1),
col = "white",
xlab = "",
ylab = ""
)
par(new = TRUE)
plot_community_graphs(
C,
layout = ex_data,
emph_strong = 1,
show_labels = FALSE,
edge_width_factor = 50,
vertex.size = 8
)
```
## Cognate-based Language Families
Let's perform PaLD analysis on the cognate data set from [@dyen92]. For clarity of the display, we show how to only include (a random set of the) 37 vertex labels. The remaining arguments are for aesthetics of the plot. Note that the network layout is somewhat different each time the FR network drawing algorithm is called.
```{r lang}
C_lang <- cohesion_matrix(cognate_dist)
lang_lab_subset <- rownames(C_lang)
lang_lab_subset[sample(1:87, 50)] <- ''
plot_community_graphs(
C_lang,
edge_width_factor = 30,
emph_strong = 3,
vertex.label = lang_lab_subset,
vertex.label.cex = .65,
vertex.size = 3
)
```
One could alternatively use the wrapper function:
$\texttt{pald(cognate_dist, emph_strong = 3, edge_width_factor = 30, vertex.label = lang_lab_subset, vertex.label.cex = .65, vertex.size = 3)}$. It will return a list containing: the cohesion matrix, local depths, (community) clusters, the threshold for identifying strong ties, the thresholded and symmetrized cohesion matrix, the community graph whose edges are weighted by mutual cohesion, the weighted graph of strong ties, and the layout provided by the FR network drawing algorithm applied to the community graph.
One can determine the (strongly cohesive) neighbors using igraph's `neighbor` function. Edge-weights are given by cohesion (or mutual cohesion) and can be found directly from the cohesion matrix.
```{r strong, message = FALSE, warning = FALSE}
library(igraph)
G_strong_lang <- community_graphs(C_lang)$G_strong
neighbors(G_strong_lang, "French")
#And print associated neighborhood weights
C_lang["French", neighbors(G_strong_lang, "French")]
```
## Clustering in the Presence of Varying Density
Cohesion is particularly useful when considering data with varying local density, see discussion in [@bmm22]. Note that PaLD was able to detect the eight natural groups within the data without the use of any additional inputs (e.g., number of clusters) nor optimization criteria. Despite providing the "correct" number of clusters (i.e., $k = 8$) both *k*-means and hierarchical clustering did not give the desired result.
```{r vary-d}
D3 <- dist(exdata3)
C3 <- cohesion_matrix(D3)
par(pty = "s")
plot(
exdata3,
col = "white",
xlab = "",
ylab = "",
main = "PaLD Community Graphs"
)
par(new = TRUE)
plot_community_graphs(
C3,
layout = as.matrix(exdata3),
show_labels = FALSE,
emph_strong = 25,
edge_width_factor = 2,
vertex.size = 5
)
### The cluster vector is provided by `pald' and also may be computed via:
library(igraph)
cluster_graph <- community_graphs(C3)$G_strong
#pald_cluster_vector<- clusters(cluster_graph)$membership
table(clusters(cluster_graph)$membership)
```
Here are the results for the data obtained from *k*-means and hierarchical clustering when $k = 8$.
```{r k-mean}
par(mfrow = c(1, 2), pty = "s")
km_clusters <- kmeans(exdata3, 8)$cluster
plot(
exdata3,
pch = 16,
col = pald_colors[km_clusters],
xlab = "",
ylab = "",
main = "K-Means Clusters (k = 8)"
)
h_clusters <- cutree(hclust(dist(exdata3)), k = 8)
plot(
exdata3,
pch = 16,
col = pald_colors[h_clusters],
xlab = "",
ylab = "",
main = "Hiearchical Clusters (k = 8)"
)
```