The analyses in this document are primarily taken from a notebook provided by Alejandro in which he sought to leverage WGCNA (Langfelder and Horvath (2008)) to see if/how well it corresponds to results provided by STRING (Szklarczyk et al. (2011)) and/or our DE analyses. Since it seemed likely that at least some of the figures/analyses he produced might end up in the paper, I emailed him for a copy of his notebook and plopped his work here (with minor modifications for style and readability).
I am using Alejandro’s document to get a feeling for how WGCNA is getting modules relevant to cure/fail. I have been relatively skeptical of WGCNA; I watched Keith, who is freaky brilliant, struggle with it and get very frustrated how even miniscule changes to the input data would result in profound changes in the resulting modules. Nonetheless, this seemed like a good opportunity for me to get acquainted with it.
I am reasonably certain he is in turn using this document as his input:
https://bioinformaticsworkbook.org/tutorials/wgcna.html#gsc.tab=0
threads <- WGCNA::enableWGCNAThreads(nThreads = 8)## Allowing parallel execution with up to 8 working processes.
The input used is the cell-type specific, tmaco-only, rpkm, modified by sva.
In the following block I am breaking down what Alejandro did into smaller pieces so that I can make certain I understand what happened.
input <- normalize_expt(t_monocytes, filter = TRUE, batch = "svaseq") %>%
normalize_expt(convert = "rpkm", column = "mean_cds_len", na_to_zero = TRUE)## Removing 9090 low-count genes (10862 remaining).
## Setting 295 low elements to zero.
wgcna_input <- as.data.frame(exprs(input))
wgcna_input[["ENSEMBLID"]] <- rownames(wgcna_input)
wgcna_melted <- wgcna_input %>%
gather(key = "samples", value = "counts", -ENSEMBLID)
wgcna_with_meta <- wgcna_melted %>%
inner_join(., pData(input), by = c("samples" = "tmrcidentifier"))
wgcna_selected <- wgcna_with_meta %>%
select("ENSEMBLID", "samples", "counts") %>%
spread(key = "samples", value = "counts")Unless I am mistaken, the above just converted the matrix of counts into a merged/melted copy of same with the metadata, then removed the metadata and returned us back to the original state? hmmm…
good_samples_genes <- WGCNA::goodSamplesGenes(t(exprs(input)))## Flagging genes and samples with too many missing values...
## ..step 1
summary(good_samples_genes)## Length Class Mode
## goodGenes 10862 -none- logical
## goodSamples 42 -none- logical
## allOK 1 -none- logical
good_samples_genes[["allOK"]]## [1] TRUE
l2input <- normalize_expt(input, transform = "log2")## transform_counts: Found 295 values equal to 0, adding 1 to the matrix.
power_test <- c(c(1:10), seq(from = 12, to = 20, by = 2))
threshold_search <- WGCNA::pickSoftThreshold(
t(exprs(l2input)), powerVector = power_test,
networkType = "signed", verbose = 5)## pickSoftThreshold: will use block size 4118.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 4118 of 10862
## ..working on genes 4119 through 8236 of 10862
## ..working on genes 8237 through 10862 of 10862
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.2150 7.550 0.906 5610.0 5630.00 6290
## 2 2 0.1310 2.470 0.912 3170.0 3130.00 4000
## 3 3 0.0158 -0.437 0.844 1910.0 1860.00 2840
## 4 4 0.2360 -1.190 0.834 1210.0 1160.00 2150
## 5 5 0.5280 -1.560 0.892 807.0 748.00 1710
## 6 6 0.6860 -1.680 0.924 556.0 495.00 1400
## 7 7 0.7720 -1.760 0.941 396.0 337.00 1170
## 8 8 0.8120 -1.770 0.945 289.0 233.00 996
## 9 9 0.8410 -1.780 0.949 216.0 164.00 858
## 10 10 0.8670 -1.770 0.961 165.0 118.00 748
## 11 12 0.8660 -1.790 0.948 101.0 62.70 583
## 12 14 0.8610 -1.800 0.942 65.5 35.00 467
## 13 16 0.8770 -1.770 0.951 44.3 20.40 381
## 14 18 0.8860 -1.750 0.959 31.0 12.30 316
## 15 20 0.8750 -1.750 0.952 22.4 7.61 265
a1 <- ggplot(threshold_search[["fitIndices"]], aes(Power, SFT.R.sq, label = Power)) +
geom_point() +
geom_text(nudge_y = 0.1) +
geom_hline(yintercept = 0.8, color = 'red') +
labs(x = 'Power', y = 'Scale free topology model fit, signed R^2')
a1a2 <- ggplot(threshold_search[["fitIndices"]], aes(Power, mean.k., label = Power)) +
geom_point() +
geom_text(nudge_y = 0.1) +
labs(x = 'Power', y = 'Mean Connectivity')
a2chosen_power <- 8
## WGCNA calls cor() without specifying its own namespace, so overwrite cor for the moment.
cor <- WGCNA::cor
initial_modules <- WGCNA::blockwiseModules(
t(exprs(l2input)), maxBlockSize = 11000, TOMType = "signed",
power = chosen_power, mergeCutHeight = 0.25, numericLabels = FALSE,
verbose = 3)## Calculating module eigengenes block-wise from all genes
## Flagging genes and samples with too many missing values...
## ..step 1
## ..Working on block 1 .
## TOM calculation: adjacency..
## ..will use 8 parallel threads.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 15 genes from module 1 because their KME is too low.
## ..removing 5 genes from module 2 because their KME is too low.
## ..removing 9 genes from module 3 because their KME is too low.
## ..removing 1 genes from module 8 because their KME is too low.
## ..removing 1 genes from module 32 because their KME is too low.
## ..reassigning 20 genes from module 1 to modules with higher KME.
## ..reassigning 7 genes from module 2 to modules with higher KME.
## ..reassigning 13 genes from module 3 to modules with higher KME.
## ..reassigning 8 genes from module 4 to modules with higher KME.
## ..reassigning 7 genes from module 5 to modules with higher KME.
## ..reassigning 3 genes from module 7 to modules with higher KME.
## ..reassigning 2 genes from module 8 to modules with higher KME.
## ..reassigning 2 genes from module 10 to modules with higher KME.
## ..reassigning 2 genes from module 18 to modules with higher KME.
## ..reassigning 1 genes from module 20 to modules with higher KME.
## ..reassigning 1 genes from module 32 to modules with higher KME.
## ..reassigning 1 genes from module 42 to modules with higher KME.
## ..merging modules that are too close..
## mergeCloseModules: Merging modules whose distance is less than 0.25
## Calculating new MEs...
cor <- stats::cor
initial_eigen <- initial_modules[["MEs"]]network_colors <- WGCNA::labels2colors(initial_modules[["colors"]])
WGCNA::plotDendroAndColors(
initial_modules[["dendrograms"]][[1]],
network_colors[initial_modules[["blockGenes"]][[1]]],
"Modules",
dendroLabels = FALSE,
hang = 0.03,
addGuide = TRUE,
guideHang = 0.05)WGCNA::plotDendroAndColors(
initial_modules[["dendrograms"]][[1]],
cbind(initial_modules[["unmergedColors"]], initial_modules[["colors"]]),
c("unmerged", "merged"),
dendroLabels = FALSE,
addGuide = TRUE,
hang = 0.03,
guideHang = 0.05)WGCNA::plotDendroAndColors(
initial_modules[["dendrograms"]][[1]],
initial_modules[["colors"]],
"ME",
dendroLabels = FALSE,
addGuide = TRUE,
hang= 0.03,
cex.colorLabels = 2,
marAll = c(1, 5, 3, 1),
main = ("WGCNA Cluster Dendrogram"),
guideHang = 0.05)This does not appear to work, FIXME
initial_reorder <- WGCNA::consensusOrderMEs(
initial_eigen, useAbs = FALSE,
useSets = NULL, greyLast = TRUE,
greyName = paste(WGCNA::moduleColor.getMEprefix(), "grey", sep = ""),
method = "consensus")meta_numeric <- data.frame(
"cf_numeric" = as.numeric(as.factor(pData(l2input)[["finaloutcome"]])),
"visit_numeric" = as.numeric(as.factor(pData(l2input)[["visitnumber"]])))
rownames(meta_numeric) <- rownames(pData(l2input))
meta_factors <- pData(l2input)[, c("finaloutcome", "visitnumber")]
meta_eigen <- merge(initial_eigen, meta_factors, by = "row.names")
rownames(meta_eigen) <- meta_eigen[["Row.names"]]
meta_eigen[["Row.names"]] <- NULL
kappa <- irr::kappam.fleiss(meta_eigen)
module_trait_corr <- stats::cor(initial_eigen, meta_numeric, use = "p")
module_trait_corr## cf_numeric visit_numeric
## MEmagenta -0.250603 0.06262
## MEpink -0.009289 -0.02001
## MEblue -0.225538 0.17650
## MEsalmon -0.224603 0.22226
## MEbrown 0.225532 0.26091
## MEgreenyellow -0.010594 0.12363
## MEyellow 0.204959 -0.03818
## MEcyan -0.174736 -0.04785
## MEgreen -0.360079 0.26368
## MEgrey60 -0.161160 0.16775
## MElightgreen 0.341113 -0.02765
## MEblack 0.510394 -0.02238
## MEred 0.188770 -0.03790
## MEmidnightblue 0.343209 -0.09313
## MEturquoise 0.444315 0.01825
## MEpurple -0.171154 -0.09374
## MElightcyan -0.173734 -0.12381
## MEtan -0.016408 -0.23809
## MEgrey 0.104307 0.12547
module_trait_pvalues <- WGCNA::corPvalueStudent(module_trait_corr, nrow(meta_numeric))
module_trait_pvalues## cf_numeric visit_numeric
## MEmagenta 0.1094349 0.69361
## MEpink 0.9534445 0.89992
## MEblue 0.1509670 0.26350
## MEsalmon 0.1527144 0.15716
## MEbrown 0.1509787 0.09513
## MEgreenyellow 0.9469086 0.43536
## MEyellow 0.1928918 0.81030
## MEcyan 0.2683877 0.76348
## MEgreen 0.0191638 0.09155
## MEgrey60 0.3079190 0.28830
## MElightgreen 0.0270546 0.86202
## MEblack 0.0005538 0.88814
## MEred 0.2312092 0.81166
## MEmidnightblue 0.0260680 0.55745
## MEturquoise 0.0032005 0.90868
## MEpurple 0.2784738 0.55488
## MElightcyan 0.2711828 0.43469
## MEtan 0.9178559 0.12893
## MEgrey 0.5109354 0.42852
On my computer at least, there seems to be difficulty installing the CorLvelPlot library, so I will just remove this piece of Alejandro’s code for now.
module_trait_merged <- merge(initial_eigen, meta_numeric, by = "row.names")
rownames(module_trait_merged) <- module_trait_merged[["Row.names"]]
module_trait_merged[["Row.names"]] <- NULL
#CorLevelPlot::CorLevelPlot(
# module_trait_merged,
# x = names(module_trait_merged)[1:18],
# rotLabX = 90,
# y = names(module_trait_merged)[19:20],
# posColKey = "top",
# col = c("blue1", "skyblue", "white", "pink", "red"))It appears that the modules ‘turqoise’ and ‘pink’ are likely the most interesting. We can extract the genes from them:
wanted <- initial_modules[["colors"]] == "turqoise" | initial_modules[["colors"]] == "pink"
sum(wanted)## [1] 125
interesting_genes <- names(initial_modules[["colors"]])[wanted]
fData(l2input)[interesting_genes, "hgnc_symbol"]## [1] "PITHD1" "PFN2" "MOK" "PCNP" "MECOM"
## [6] "MGST2" "PCDH11X" "NOMO1" "TJP1" "RP1"
## [11] "SIGLEC5" "PLEKHA4" "STEAP1B" "EIF4H" "ASIC2"
## [16] "CPE" "AMBRA1" "SLC35F2" "MLEC" "PTP4A1"
## [21] "GRB14" "MARK1" "COLEC11" "TAF1L" "BCAS4"
## [26] "OVOL2" "GNGT1" "ZNF341" "NDUFA2" "KHDRBS3"
## [31] "SLX1A" "PCBD2" "CCNJL" "HNRNPA1" "GABBR2"
## [36] "SEMA6D" "GAREM1" "RNF157" "PIGK" "RABL2A"
## [41] "PTPRG" "EPHA5" "CDH18" "HTR2C" "GPC3"
## [46] "ASTN2" "DLG2" "AKAP6" "PABPC3" "PTPN14"
## [51] "ACOXL" "CEP112" "KDM8" "FMN2" "RWDD2B"
## [56] "AGAP1" "STARD9" "NAPEPLD" "ZYG11B" "HFM1"
## [61] "H3-3A" "ANKRD30BL" "PCOLCE2" "C4orf45" "MAMDC2"
## [66] "PDZRN4" "JAM3" "PLEKHA7" "NPIPB12" "ADRA1B"
## [71] "LGALS9B" "LONRF2" "FRMD5" "C5orf34" "NEGR1"
## [76] "LSM1" "ZNF683" "ZNF354C" "GRAMD1C" "WASHC1"
## [81] "RGS7" "CCSER1" "ROBO2" "EFCAB10" "RGPD2"
## [86] "LSAMP" "C15orf41" "EDARADD" "PTMA" "C2orf88"
## [91] "H3-5" "HMGB1" "GRM7" "CPNE4" "SLC35F1"
## [96] "SCN8A" "RPS26" "TXNRD1" "RYR2" "TAFA2"
## [101] "HMGN2" "H2AC18" "AGAP6" "C9orf129" "PKHD1L1"
## [106] "TMSB4X" "EIF3CL" "C16orf96" "NEURL1B" "ZC3H11B"
## [111] "HNRNPA1P48" "RPL41" "" "RAB6D" "LILRA4"
## [116] "TMEM150C" "NACA2" "MTRNR2L8" "" "FAM156A"
## [121] "" "NBPF26" "F8A3" "" ""
written_interesting <- write_xlsx(fData(l2input)[interesting_genes, ],
excel = glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))
## Note that we can do similarity matrices on the samples too in order to get
## dendrograms which may get interesting groups of samples?
not_grey <- initial_modules[["colors"]] != "grey"
not_grey_exprs <- t(exprs(l2input))[, not_grey]
dim(not_grey_exprs)## [1] 42 8148
not_grey_genes <- colnames(not_grey_exprs)
dist_tom <- 1 - WGCNA::TOMsimilarityFromExpr(
not_grey_exprs,
power = chosen_power)## TOM calculation: adjacency..
## ..will use 8 parallel threads.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
colnames(dist_tom) <- not_grey_genes
rownames(dist_tom) <- colnames(dist_tom)
similarity_cluster <- flashClust::flashClust(as.dist(dist_tom), method = "average")
WGCNA::plotDendroAndColors(
similarity_cluster,
colors = initial_modules[["colors"]][not_grey_genes],
"ME",
dendroLabels = FALSE,
hang = 0.03,
addGuide = TRUE,
guideHang = 0.05,
cex.colorLabels = 1.5,
main = ("Cluster Dendrogram (WGCNA)"))## In my initial pass, I did the clustering on the samples instead of genes and got
## two primary groups:
## 26,6,7,12,4,27,3,13,9,30,17,22,18,42,15,31,8,25,5,16,14,20,10,38
## 1,2,23,11,40,34,33,35,37,24,41,19,29,28,36,21,32,39
## I assume that these two groups will have some meaning vis a vis the monocyte samples?
first_group <- c(26,6,7,12,4,27,3,13,9,30,17,22,18,42,15,31,8,25,5,16,14,20,10,38)
second_group <- c(1,2,23,11,40,34,33,35,37,24,41,19,29,28,36,21,32,39)
unique(pData(l2input)[first_group, "tubelabelorigin"])## [1] "su2168" "su2066" "su2071" "su2065" "su2172" "su2068" "su2173" "su2161"
## [9] "su2167" "su2162" "su2201" "su2073" "su2188"
unique(pData(l2input)[second_group, "tubelabelorigin"])## [1] "su2052" "su2167" "su2068" "su2190" "su2183" "su2184" "su2168" "su2162"
## [9] "su2172" "su2173"
## So, they are two distinct groups of donors...
table(pData(l2input)[first_group, "finaloutcome"])##
## cure failure
## 14 10
table(pData(l2input)[second_group, "finaloutcome"])##
## cure failure
## 7 11
table(pData(l2input)[first_group, "visitnumber"])##
## v3 v2 v1
## 5 8 11
table(pData(l2input)[second_group, "visitnumber"])##
## v3 v2 v1
## 8 5 5