cNMF/Tutorials/R_vignette.Rmd at main · dylkot/cNMF · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "R_cnmf_vignette"
output: html_notebook
---

This notebook provides example code for running cNMF on the standard PBMC example seurat counts matrix. Similar methodology can be used for other R object types, such as SingleCellExperiment. Generally, the counts matrix should be converted into a .mtx or .h5ad file (as shown below) for use in cnmf.


```{r}

suppressPackageStartupMessages({
    library(data.table)
    library(Matrix)
    library(Seurat)
})


```

First we download the standard pbmc3k example data used for Seura to a directory called ./R_Example_Data


```{r}

data_url <- 'https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz'
data_dir = './R_Example_Data/'

if (!dir.exists(data_dir)) {
  dir.create(data_dir, recursive = TRUE)
}

# Path where the tar.gz file will be saved
file_path <- file.path(data_dir, "pbmc3k_filtered_gene_bc_matrices.tar.gz")

# Download the file
download.file(data_url, file_path, mode="wb")

```

Then we extract the tar.gz file

```{r}
extract_command <- sprintf("tar -xzf %s -C %s", shQuote(file_path), shQuote(data_dir))
system(extract_command)

rm_command <- sprintf("rm %s", file_path)
system(rm_command)
```

Now the data should be stored locally

```{r}
system("ls ./R_Example_Data/filtered_gene_bc_matrices/hg19", intern=TRUE)
```

This data can actually be passed directly to cnmf as below. However, we recommend to filter genes detected in few cells and cells with few counts before running cnmf. We therefore do some filtering here first.

```{r}
pbmc.data <- Read10X(data.dir = "./R_Example_Data/filtered_gene_bc_matrices/hg19")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nCount_RNA > 200)
```

```{r}
dim(pbmc)
```


```{r}
counts <- pbmc@assays$RNA$counts
barcodes <- colnames(counts)
gene_names <- rownames(counts)
counts[1:5, 1:5]
```
```{r}
barcodes[1:5]
```


Now lets output the filtered matrix to disk.

```{r}

filtered_dir = './R_Example_Data/filtered/'

if (!dir.exists(filtered_dir)) {
  dir.create(filtered_dir, recursive = TRUE)
}

# Output counts matrix
writeMM(counts, paste0(filtered_dir, 'matrix.mtx'))

# Output cell barcodes
barcodes <- colnames(counts)
write.table(as.data.frame(barcodes), paste0(filtered_dir, 'barcodes.tsv'),
           col.names = FALSE, row.names = FALSE, sep = "\t")


# Output feature names
gene_names <- rownames(counts)
features <- data.frame("gene_id" = gene_names,"gene_name" = gene_names,type = "Gene Expression")
write.table(as.data.frame(features), sep = "\t", paste0(filtered_dir, 'genes.tsv'),
           col.names = FALSE, row.names = FALSE)
```

Now we run cnmf by passing the command line commands to the system() function

First is the prepare step which normalizes the count matrix and prepares the factorization step

```{r}

runname = "example_cNMF"
cmd = paste("cnmf prepare --output-dir", data_dir,
            "--name", runname,
            "-c", paste0(filtered_dir, 'matrix.mtx'),
            "--max-nmf-iter 2000",
            "-k 5 6 7 8 9 10 --n-iter 20", sep=" ")
print(cmd)
system(cmd)
```
Next is the factorization step which runs NMF --n-iter times (in this case 20) for each value of K. In this tutorial we run all these jobs sequentially on a single worker but in theory this can be distributed to multiple cores or nodes with separate commands like so:

cnmf factorize --output-dir ./R_Example_Data/ --name example_cNMF --worker-index 0 --total-workers 3
cnmf factorize --output-dir ./R_Example_Data/ --name example_cNMF --worker-index 1 --total-workers 3
cnmf factorize --output-dir ./R_Example_Data/ --name example_cNMF --worker-index 2 --total-workers 3


```{r}
cmd = paste("cnmf factorize --output-dir", data_dir,
            "--name", runname,
            "--worker-index 0 --total-workers 1", sep=" ")
print(cmd)
system(cmd)
```

Next we concatenate the results for each value of K into a single file

```{r}
cmd = paste("cnmf combine --output-dir", data_dir,
            "--name", runname, sep=" ")
print(cmd)
system(cmd)
```
And make a plot estimating the trade-off between higher values of K and stability and error

```{r}
cmd = paste("cnmf k_selection_plot --output-dir", data_dir,
            "--name", runname, sep=" ")
print(cmd)
system(cmd)
```

We can load the saved png file to see the results in the Rmd notebook.

![Alt text](./R_Example_Data/example_cNMF/example_cNMF.k_selection.png)

This plot suggests K=7 might be a local optimum in stability so lets try that solution.

```{r}
cmd = paste("cnmf consensus --output-dir", data_dir,
            "--name", runname,
            '--components', 7,
            '--local-density-threshold', 0.1,
            '--show-clustering', sep=" ")
print(cmd)
system(cmd)
```

This step creates a plot to help visualize the consensus clustering

![Alt text](./R_Example_Data/example_cNMF/example_cNMF.clustering.k_7.dt_0_1.png)
Looks like the clustering is very clean. Lets load in the resulting files.

```{r}
usage_file <- paste(data_dir[1:length(data_dir)], runname, paste(runname, "usages", "k_7.dt_0_1", 'consensus', 'txt', sep="."), sep="/")
spectra_score_file <- paste(data_dir[1:length(data_dir)], runname, paste(runname, "gene_spectra_score", "k_7.dt_0_1", 'txt', sep="."), sep="/")
spectra_tpm_file <- paste(data_dir[1:length(data_dir)], runname, paste(runname, "gene_spectra_tpm", "k_7.dt_0_1", 'txt', sep="."), sep="/")

usage <- read.table(usage_file, sep='\t', row.names=1, header=TRUE)
spectra_score <- read.table(spectra_score_file, sep='\t', row.names=1, header=TRUE)
spectra_tpm <- read.table(spectra_tpm_file, sep='\t', row.names=1, header=TRUE)
head(usage)
```

For most analyses we normalize the resulting usage file output so that each cell sums to 1. We do that below

```{r}
usage_norm <- as.data.frame(t(apply(usage, 1, function(x) x / sum(x))))
```

Now lets concatenate usage_norm into the metadata of the Seurat object to make it easier to plot later

```{r}
library(dplyr)
new_metadata <- merge(pbmc@meta.data, usage_norm, by = "row.names", all.x = TRUE)
rownames(new_metadata) <- new_metadata$Row.names
pbmc@meta.data <- new_metadata
```

Now lets run the standard Seurat UMAP pipeline so we can plot the GEP usages over the UMAP

```{r}
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:15)
pbmc <- RunUMAP(pbmc, dims = 1:15)
```

Now we can plot the usages of all of the GEPs on the UMAP


```{r}
library(ggplot2)
p <- FeaturePlot(pbmc, features = colnames(usage_norm), combine=F)
p
```
To help make sense of the learned GEPs, we extract the top 20 most highly weighted genes for each GEP as below.

```{r}

get_top_colnames <- function(row) {
  # Orders the values in descending order and gets the names of the top 20
  print(row[1:5])
  top_indices <- order(row, decreasing = TRUE)[1:20]
  return(colnames(spectra_score)[top_indices])
}

top_colnames <- apply(spectra_score, 1, get_top_colnames)
top_colnames <- as.data.frame(top_colnames)

top_colnames
```

GEP 1 reflects T-cells, 2 reflects CD14 monocytes, 3 reflects B cells, 4 reflects NK cells, 5 reflects CD16 monocytes, 6 reflects cell cycle, 7 is less clear