The analyses in this document are primarily taken from a notebook provided by Alejandro in which he sought to leverage WGCNA (Langfelder and Horvath (2008)) to see if/how well it corresponds to results provided by STRING (Szklarczyk et al. (2011)) and/or our DE analyses. Since it seemed likely that at least some of the figures/analyses he produced might end up in the paper, I emailed him for a copy of his notebook and plopped his work here (with minor modifications for style and readability).
I am using Alejandro’s document to get a feeling for how WGCNA is getting modules relevant to cure/fail. I have been relatively skeptical of WGCNA; I watched Keith, who is freaky brilliant, struggle with it and get very frustrated how even miniscule changes to the input data would result in profound changes in the resulting modules. Nonetheless, this seemed like a good opportunity for me to get acquainted with it.
I am reasonably certain Alejandro is using this document as his input:
https://bioinformaticsworkbook.org/tutorials/wgcna.html#gsc.tab=0
threads <- WGCNA::enableWGCNAThreads(nThreads = 8)## Allowing parallel execution with up to 8 working processes.
The input used is the cell-type specific, tmaco-only, rpkm, modified by sva.
In the following block I am breaking down what Alejandro did into smaller pieces so that I can make certain I understand what happened.
input <- normalize_expt(t_monocytes, filter = TRUE, batch = "svaseq") %>%
normalize_expt(convert = "rpkm", column = "mean_cds_len", na_to_zero = TRUE)## Removing 9018 low-count genes (10840 remaining).
## Setting 280 entries to zero.
wgcna_input <- as.data.frame(exprs(input))
wgcna_input[["ENSEMBLID"]] <- rownames(wgcna_input)
wgcna_melted <- wgcna_input %>%
gather(key = "samples", value = "counts", -ENSEMBLID)
wgcna_with_meta <- wgcna_melted %>%
inner_join(., pData(input), by = c("samples" = "tmrcidentifier"))
wgcna_selected <- wgcna_with_meta %>%
select("ENSEMBLID", "samples", "counts") %>%
spread(key = "samples", value = "counts")Unless I am mistaken, the above just converted the matrix of counts into a merged/melted copy of same with the metadata, then removed the metadata and returned us back to the original state? hmmm…
good_samples_genes <- WGCNA::goodSamplesGenes(t(exprs(input)))## Flagging genes and samples with too many missing values...
## ..step 1
summary(good_samples_genes)## Length Class Mode
## goodGenes 10840 -none- logical
## goodSamples 42 -none- logical
## allOK 1 -none- logical
good_samples_genes[["allOK"]]## [1] TRUE
l2input <- normalize_expt(input, transform = "log2")## transform_counts: Found 280 values equal to 0, adding 1 to the matrix.
power_test <- c(c(1:10), seq(from = 12, to = 20, by = 2))
threshold_search <- WGCNA::pickSoftThreshold(
t(exprs(l2input)), powerVector = power_test,
networkType = "signed", verbose = 5)## pickSoftThreshold: will use block size 4127.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 4127 of 10840
## ..working on genes 4128 through 8254 of 10840
## ..working on genes 8255 through 10840 of 10840
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.2300 7.300 0.898 5610.0 5630.00 6300
## 2 2 0.1370 2.420 0.894 3170.0 3120.00 4030
## 3 3 0.0175 -0.441 0.821 1910.0 1860.00 2870
## 4 4 0.2560 -1.200 0.818 1220.0 1160.00 2180
## 5 5 0.5440 -1.560 0.884 812.0 747.00 1740
## 6 6 0.6960 -1.670 0.918 561.0 497.00 1420
## 7 7 0.7760 -1.740 0.935 400.0 338.00 1190
## 8 8 0.8080 -1.760 0.936 293.0 235.00 1010
## 9 9 0.8420 -1.760 0.946 219.0 166.00 876
## 10 10 0.8670 -1.760 0.958 168.0 119.00 764
## 11 12 0.8680 -1.770 0.947 103.0 63.60 596
## 12 14 0.8590 -1.790 0.939 66.8 35.60 477
## 13 16 0.8650 -1.770 0.943 45.3 20.80 390
## 14 18 0.8760 -1.750 0.953 31.8 12.50 324
## 15 20 0.8730 -1.740 0.951 22.9 7.72 272
a1 <- ggplot(threshold_search[["fitIndices"]], aes(Power, SFT.R.sq, label = Power)) +
geom_point() +
geom_text(nudge_y = 0.1) +
geom_hline(yintercept = 0.8, color = 'red') +
labs(x = 'Power', y = 'Scale free topology model fit, signed R^2')
a1a2 <- ggplot(threshold_search[["fitIndices"]], aes(Power, mean.k., label = Power)) +
geom_point() +
geom_text(nudge_y = 0.1) +
labs(x = 'Power', y = 'Mean Connectivity')
a2chosen_power <- 8
## WGCNA calls cor() without specifying its own namespace, so overwrite cor for the moment.
cor <- WGCNA::cor
initial_modules <- WGCNA::blockwiseModules(
t(exprs(l2input)), maxBlockSize = 11000, TOMType = "signed",
power = chosen_power, mergeCutHeight = 0.25, numericLabels = FALSE,
verbose = 3)## Calculating module eigengenes block-wise from all genes
## Flagging genes and samples with too many missing values...
## ..step 1
## ..Working on block 1 .
## TOM calculation: adjacency..
## ..will use 8 parallel threads.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 23 genes from module 1 because their KME is too low.
## ..removing 9 genes from module 2 because their KME is too low.
## ..removing 6 genes from module 3 because their KME is too low.
## ..reassigning 20 genes from module 1 to modules with higher KME.
## ..reassigning 17 genes from module 2 to modules with higher KME.
## ..reassigning 11 genes from module 3 to modules with higher KME.
## ..reassigning 2 genes from module 4 to modules with higher KME.
## ..reassigning 3 genes from module 7 to modules with higher KME.
## ..reassigning 2 genes from module 8 to modules with higher KME.
## ..reassigning 1 genes from module 11 to modules with higher KME.
## ..reassigning 1 genes from module 12 to modules with higher KME.
## ..reassigning 1 genes from module 13 to modules with higher KME.
## ..reassigning 1 genes from module 15 to modules with higher KME.
## ..reassigning 1 genes from module 21 to modules with higher KME.
## ..reassigning 1 genes from module 30 to modules with higher KME.
## ..reassigning 2 genes from module 39 to modules with higher KME.
## ..merging modules that are too close..
## mergeCloseModules: Merging modules whose distance is less than 0.25
## Calculating new MEs...
cor <- stats::cor
initial_eigen <- initial_modules[["MEs"]]network_colors <- WGCNA::labels2colors(initial_modules[["colors"]])
WGCNA::plotDendroAndColors(
initial_modules[["dendrograms"]][[1]],
network_colors[initial_modules[["blockGenes"]][[1]]],
"Modules",
dendroLabels = FALSE,
hang = 0.03,
addGuide = TRUE,
guideHang = 0.05)WGCNA::plotDendroAndColors(
initial_modules[["dendrograms"]][[1]],
cbind(initial_modules[["unmergedColors"]], initial_modules[["colors"]]),
c("unmerged", "merged"),
dendroLabels = FALSE,
addGuide = TRUE,
hang = 0.03,
guideHang = 0.05)WGCNA::plotDendroAndColors(
initial_modules[["dendrograms"]][[1]],
initial_modules[["colors"]],
"ME",
dendroLabels = FALSE,
addGuide = TRUE,
hang= 0.03,
cex.colorLabels = 2,
marAll = c(1, 5, 3, 1),
main = ("WGCNA Cluster Dendrogram"),
guideHang = 0.05)This does not appear to work, FIXME
initial_reorder <- WGCNA::consensusOrderMEs(
initial_eigen, useAbs = FALSE,
useSets = NULL, greyLast = TRUE,
greyName = paste(WGCNA::moduleColor.getMEprefix(), "grey", sep = ""),
method = "consensus")meta_numeric <- data.frame(
"cf_numeric" = as.numeric(as.factor(pData(l2input)[["finaloutcome"]])),
"visit_numeric" = as.numeric(as.factor(pData(l2input)[["visitnumber"]])))
rownames(meta_numeric) <- rownames(pData(l2input))
meta_factors <- pData(l2input)[, c("finaloutcome", "visitnumber")]
meta_eigen <- merge(initial_eigen, meta_factors, by = "row.names")
rownames(meta_eigen) <- meta_eigen[["Row.names"]]
meta_eigen[["Row.names"]] <- NULL
kappa <- irr::kappam.fleiss(meta_eigen)
module_trait_corr <- stats::cor(initial_eigen, meta_numeric, use = "p")
module_trait_corr## cf_numeric visit_numeric
## MEpurple -0.16229 -0.09231
## MEsalmon -0.17636 -0.14937
## MEcyan -0.14906 0.16435
## MEgreenyellow 0.55583 -0.10184
## MEbrown 0.27988 -0.05919
## MEturquoise 0.43850 0.01520
## MEmidnightblue 0.07250 0.05253
## MElightcyan 0.31647 -0.02097
## MEyellow 0.31978 -0.03496
## MEblack 0.07970 0.06500
## MEmagenta -0.02517 -0.01685
## MEgreen 0.14295 0.27874
## MEpink -0.21535 0.04562
## MEred -0.36431 0.27486
## MEblue -0.19605 0.19608
## MEtan -0.31286 0.15814
## MEgrey 0.10746 0.12134
module_trait_pvalues <- WGCNA::corPvalueStudent(module_trait_corr, nrow(meta_numeric))
module_trait_pvalues## cf_numeric visit_numeric
## MEpurple 0.3044842 0.56094
## MEsalmon 0.2638981 0.34509
## MEcyan 0.3461291 0.29832
## MEgreenyellow 0.0001327 0.52102
## MEbrown 0.0726243 0.70963
## MEturquoise 0.0036752 0.92391
## MEmidnightblue 0.6481970 0.74108
## MElightcyan 0.0411610 0.89514
## MEyellow 0.0389768 0.82601
## MEblack 0.6158650 0.68255
## MEmagenta 0.8742900 0.91567
## MEgreen 0.3664587 0.07385
## MEpink 0.1707995 0.77419
## MEred 0.0176966 0.07813
## MEblue 0.2133732 0.21330
## MEtan 0.0436624 0.31720
## MEgrey 0.4981795 0.44399
On my computer at least, there seems to be difficulty installing the CorLevelPlot library, so I will just remove this piece of Alejandro’s code for now.
module_trait_merged <- merge(initial_eigen, meta_numeric, by = "row.names")
rownames(module_trait_merged) <- module_trait_merged[["Row.names"]]
module_trait_merged[["Row.names"]] <- NULL
#CorLevelPlot::CorLevelPlot(
# module_trait_merged,
# x = names(module_trait_merged)[1:18],
# rotLabX = 90,
# y = names(module_trait_merged)[19:20],
# posColKey = "top",
# col = c("blue1", "skyblue", "white", "pink", "red"))It appears that the modules ‘turqoise’ and ‘pink’ are likely the most interesting. We can extract the genes from them:
wanted <- initial_modules[["colors"]] == "turqoise" | initial_modules[["colors"]] == "pink"
sum(wanted)## [1] 123
interesting_genes <- names(initial_modules[["colors"]])[wanted]
fData(l2input)[interesting_genes, "hgnc_symbol"]## [1] "GAS7" "LRRC23" "TSPAN9" "ABCC2" "RSF1"
## [6] "PPP1R3F" "POLQ" "SZRD1" "SLC6A16" "CHI3L2"
## [11] "METTL22" "NUCB2" "MAP3K13" "MKRN2" "MCM6"
## [16] "MYH7B" "RABL2B" "MOK" "RBL1" "ITGAE"
## [21] "ZNF324" "ZNF446" "MECOM" "SMOX" "PCDH11Y"
## [26] "PATZ1" "TTLL1" "CGRRF1" "PABPN1" "NKAP"
## [31] "LAPTM4B" "UBXN8" "CEP41" "RAB5C" "ACAD10"
## [36] "COX6A1" "SPARC" "HEMK1" "CENPA" "TANC1"
## [41] "PECR" "ADGRL2" "GLT8D2" "STX16" "EIF2S2"
## [46] "ZNF426" "CKMT2" "FMO5" "DHX30" "EIF5A"
## [51] "ARF3" "SLC9A5" "STXBP1" "SEMA6D" "BBS7"
## [56] "MMAB" "EFCAB11" "ZMYND15" "MIEN1" "PIGK"
## [61] "AMMECR1L" "NBEAL1" "POMGNT2" "TMEM44" "TRIM41"
## [66] "NOPCHAP1" "FRMD4A" "MED19" "DGKI" "PAFAH2"
## [71] "STARD9" "TPPP3" "BDH1" "PPP1R32" "SCNN1D"
## [76] "PLXNB1" "SEPTIN8" "CFAP47" "CNPY4" "ZNF526"
## [81] "ZNF283" "GLYCTK" "ADAL" "DTWD2" "BTD"
## [86] "ZNF692" "CFL1" "GXYLT2" "MTX1" "PELI3"
## [91] "CLVS1" "MAGED1" "HS3ST4" "ZNF623" "CMSS1"
## [96] "PIGP" "GNB1L" "ANKRD46" "SLC35F1" "ZFP62"
## [101] "ABCB8" "FKBP1C" "TXNRD1" "TPM2" "ZNF28"
## [106] "KIFBP" "OOEP" "STK19" "MICA" "HLA-G"
## [111] "PPM1N" "DNASE1" "ZSWIM7" "" "UQCRHL"
## [116] "UBE2L5" "C10orf143" "APOBEC3C" "UBAP1L" ""
## [121] "MAGIX" "RBAK-RBAKDN" ""
written_interesting <- write_xlsx(fData(l2input)[interesting_genes, ],
excel = glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))
## Note that we can do similarity matrices on the samples too in order to get
## dendrograms which may get interesting groups of samples?
not_grey <- initial_modules[["colors"]] != "grey"
not_grey_exprs <- t(exprs(l2input))[, not_grey]
dim(not_grey_exprs)## [1] 42 8073
not_grey_genes <- colnames(not_grey_exprs)
dist_tom <- 1 - WGCNA::TOMsimilarityFromExpr(
not_grey_exprs,
power = chosen_power)## TOM calculation: adjacency..
## ..will use 8 parallel threads.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
colnames(dist_tom) <- not_grey_genes
rownames(dist_tom) <- colnames(dist_tom)
similarity_cluster <- flashClust::flashClust(as.dist(dist_tom), method = "average")
WGCNA::plotDendroAndColors(
similarity_cluster,
colors = initial_modules[["colors"]][not_grey_genes],
"ME",
dendroLabels = FALSE,
hang = 0.03,
addGuide = TRUE,
guideHang = 0.05,
cex.colorLabels = 1.5,
main = ("Cluster Dendrogram (WGCNA)"))## In my initial pass, I did the clustering on the samples instead of genes and got
## two primary groups:
## 26,6,7,12,4,27,3,13,9,30,17,22,18,42,15,31,8,25,5,16,14,20,10,38
## 1,2,23,11,40,34,33,35,37,24,41,19,29,28,36,21,32,39
## I assume that these two groups will have some meaning vis a vis the monocyte samples?
first_group <- c(26,6,7,12,4,27,3,13,9,30,17,22,18,42,15,31,8,25,5,16,14,20,10,38)
second_group <- c(1,2,23,11,40,34,33,35,37,24,41,19,29,28,36,21,32,39)
unique(pData(l2input)[first_group, "tubelabelorigin"])## [1] "su2168" "su2066" "su2071" "su2065" "su2172" "su2068" "su2173" "su2161"
## [9] "su2167" "su2162" "su2201" "su2073" "su2188"
unique(pData(l2input)[second_group, "tubelabelorigin"])## [1] "su2052" "su2167" "su2068" "su2190" "su2183" "su2184" "su2168" "su2162"
## [9] "su2172" "su2173"
## So, they are two distinct groups of donors...
table(pData(l2input)[first_group, "finaloutcome"])##
## cure failure
## 14 10
table(pData(l2input)[second_group, "finaloutcome"])##
## cure failure
## 7 11
table(pData(l2input)[first_group, "visitnumber"])##
## v3 v2 v1
## 5 8 11
table(pData(l2input)[second_group, "visitnumber"])##
## v3 v2 v1
## 8 5 5