1 Changelog

  • 202408: Collapsing the various ML documents back into one in the hopes of reducing confusion. All of the exploratory analyses are at the top; the blocks used to generate the final tables are at the bottom, but those are only runnable manually due to time constraints.
  • 202408: Re-enabled glm; it seems to be running faster again?
  • 20240620: Disabled glm training/testing in the container. For reasons I have not yet figured out, it takes ~10x longer (~10 days) to complete in the container. This is new: it used to take ~8 hours in the container without error, so something changed in caret/glmnet/singularity/etc. between 202405 and 202406. It is a problem because my notebooks are auto-regenerated twice a week (when changes occur), so the run does not finish before another round starts…
  • 202309: Adding a version of this with only the Tumaco data. I am a pretty big non-fan of doing this, so I am going to do it in the most cursory fashion possible: copying this document to another copy with a suffix of ‘tumaco’ and changing the initial data structure to the Tumaco-only dataset.

2 Introduction

I had some success in classifying the TMRC2 samples by strain via ML and want to try something more difficult. Thus I will use the normalized gene expression data and try classifying it by cure/fail.

All analyses in this document heavily use caret (Kuhn (2008)) and, as per the sister TMRC2 document, follow pretty closely the path from:

https://topepo.github.io/caret/data-splitting.html#simple-splitting-based-on-the-outcome

and

https://github.com/compgenomr/book/blob/master/05-supervisedLearning.Rmd

3 Starter data

In the strain classifier I used normalized variants. Here I am thinking of using normalized expression instead, which explicitly limits me to ~20k variables (significantly down from the 1.6M variant positions).

In addition, caret expects the data as (rows == samples) and (columns == variables), where each element is one observation. Thus we will need to transpose the expression matrix; a one-line sketch of that follows the setup block below.

input_data <- subset_expt(tc_clinical, subset="batch!='biopsy'") %>%
  normalize_expt(transform = "log2", convert = "cpm")
## The samples excluded are: TMRC30156, TMRC30269, TMRC30253, TMRC30270, TMRC30016, TMRC30017, TMRC30018, TMRC30019, TMRC30020, TMRC30022, TMRC30026, TMRC30044, TMRC30045, TMRC30152, TMRC30177, TMRC30155, TMRC30154, TMRC30241.
## subset_expt(): There were 184, now there are 166 samples.
## transform_counts: Found 932911 values equal to 0, adding 1 to the matrix.
ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(input_data)[[ref_col]]))
comparison_n <- 200

## Setting this up in case Najib and Maria Adelaida request a Tumaco set of exploratory analyses.
t_input_data <- subset_expt(tc_clinical, subset="batch!='biopsy'") %>%
  normalize_expt(transform = "log2", convert = "cpm")
## The samples excluded are: TMRC30156, TMRC30269, TMRC30253, TMRC30270, TMRC30016, TMRC30017, TMRC30018, TMRC30019, TMRC30020, TMRC30022, TMRC30026, TMRC30044, TMRC30045, TMRC30152, TMRC30177, TMRC30155, TMRC30154, TMRC30241.
## subset_expt(): There were 184, now there are 166 samples.
## transform_counts: Found 932911 values equal to 0, adding 1 to the matrix.
t_outcome_factor <- as.factor(as.character(pData(t_input_data)[[ref_col]]))
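
For reference, the transposition itself is a one-liner against the Biobase expression matrix; the real version appears in the filtering section below.

texprs_example <- t(exprs(input_data))  ## (genes x samples) -> (samples x genes)
dim(texprs_example)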

4 Compare TopN important genes to DE genes

Given that we have separated the various analyses, it will take me a minute to figure out where I saved the relevant differential expression analysis. I do not actually save the various DE results to rda files by default, instead opting to send them to xlsx files to share. Recall, if you will, that the data I think might be used for the paper also does not go into the default excel directory, but instead mirrors the box organization scheme.

Thus, I think the most relevant file is: “analyses/3_cali_and_tumaco/DE_Cure_vs_Fail/All_Samples/tc_cf_clinical_table_sva-v202207.xlsx”

and

“analyses/4_tumaco/DE_Cure_vs_Fail/All_Samples/t_cf_clinical_table_sva-v202207.xlsx”

Note: this xlsx file contains three primary worksheets, v1cf, v2cf, and v3cf, which comprise all samples across cell types for each visit and compare cure and fail. The sheets I read in the following block explicitly select only the v1 cure/fail results.

4.1 Opening the relevant DE table

input_xlsx <- glue("analyses/3_cali_and_tumaco/DE_Cure_Fail/Clinical_Samples/tc_clinical_cf_sig_sva-v{ver}.xlsx")
all_de_cf <- openxlsx::readWorkbook(input_xlsx, sheet = 3, startRow = 2)
rownames(all_de_cf) <- all_de_cf[["row.names"]]
all_de_cf[["row.names"]] <- NULL
deseq_de_cf_up <- all_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
all_de_cf <- openxlsx::readWorkbook(input_xlsx, sheet = 4, startRow = 2)
rownames(all_de_cf) <- all_de_cf[["row.names"]]
all_de_cf[["row.names"]] <- NULL
deseq_de_cf_down <- all_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
deseq_de_cf <- rbind.data.frame(deseq_de_cf_up, deseq_de_cf_down)

tumaco_xlsx <- glue("analyses/4_tumaco/DE_Cure_Fail/t_all_visitcf_sig_sva-v{ver}.xlsx")
t_de_cf <- openxlsx::readWorkbook(tumaco_xlsx, sheet = 3, startRow = 2)
rownames(t_de_cf) <- t_de_cf[["row.names"]]
t_de_cf[["row.names"]] <- NULL
t_de_cf_up <- t_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
t_de_cf <- openxlsx::readWorkbook(tumaco_xlsx, sheet = 4, startRow = 2)
rownames(t_de_cf) <- t_de_cf[["row.names"]]
t_de_cf[["row.names"]] <- NULL
t_de_cf_down <- t_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
t_de_cf <- rbind.data.frame(t_de_cf_up, t_de_cf_down)
top_bottom_ids <- rownames(t_de_cf)

4.2 A function to help compare methods to DESeq2

The following block contains a few lists which will hold comparisons between DESeq2 and the various methods which follow, some parameters, and a function to compare the methods and DESeq.

Presumably DESeq and the models should be responding to variance in the data, for which I think the logFC values, p-values, mean values, or standard errors are the most likely proxies to which I have easy access. So, let us pull the top/bottom n genes vis-à-vis each of those categories and see what happens.

Here is a little function to compare a ML result to the above DE table. It should provide the importance plot, a venn of the two methods, and the gene sets (with annotations) for the unique/shared genes between this ML method and the DE table. Note, this is just using the highest/lowest logFC values and ignoring the p-value for now.

comparison_lfc <- list()
comparison_adjp <- list()
comparison_wgcna <- list()
annotations <- fData(t_monocytes)  ## Arbitrarily taken

compare_ml_de <- function(importance, sig_table, annotations, n = comparison_n, plot_n = 15,
                          according_to = "deseq_logfc", column = "Overall") {
  ml_df <- importance[["importance"]] %>%
    arrange(desc(!!sym(column)))
  importance_plot <- plot(importance, top = plot_n)
  top_ml <- head(rownames(ml_df), n = n)
  de_df <- sig_table %>%
    arrange(desc(deseq_logfc))
  top_de <- head(rownames(de_df), n = n)
  bottom_de <- tail(rownames(de_df), n = n)
  top_bottom_ids <- c(top_de, bottom_de)
  comparison <- list("de" = top_bottom_ids, "ml" = top_ml)
  comparison_venn <- Vennerable::Venn(comparison)
  shared_ids <- comparison_venn@IntersectionSets["11"][[1]]
  shared_annot <- annotations[shared_ids, ]
  de_ids <- comparison_venn@IntersectionSets["10"][[1]]
  ml_ids <- comparison_venn@IntersectionSets["01"][[1]]
  only_de <- annotations[de_ids, ]
  only_ml <- annotations[ml_ids, ]
  comparison_plot <- Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
  retlist <- list(
    "importance_plot" = importance_plot,
    "comparison" = comparison,
    "comparison_venn" = comparison_venn,
    "comparison_plot" = comparison_plot,
    "shared" = shared_annot,
    "only_de" = only_de,
    "only_ml" = only_ml)
  return(retlist)
}

5 Filtering

The ML text I am reading provides some neat examples of how one might filter the data to make it more suitable for model creation.

5.1 Near zero variance, or genefilter’s cv

The first filter I was introduced to is quite familiar from our sequencing data: the removal of features with near-zero variance. Indeed, I am pretty certain that normalize_expt() can do this equivalently and significantly faster than caret::preProcess().

nrow(exprs(input_data))
## [1] 19952
system.time({
  equivalent <- normalize_expt(input_data, filter = "cv", cv_min = 0.1)
})
## Removing 3619 low-count genes (16333 remaining).
##    user  system elapsed 
##   0.674   0.100   0.773
dim(exprs(equivalent))
## [1] 16333   166
## Given this large amount of data, this step is slow, taking > 10 minutes.
## Yeah seriously, the following three lines get 16,723 genes in 10 minutes while
## the normalize_expt() call above gets 16,749 genes in 2.4 seconds.
#system.time({
#  nzv <- preProcess(texprs, method="nzv", uniqueCut=15)
#  nzv_texprs <- predict(nzv, texprs)
#  dim(nzv_texprs)
#})
nzv_texprs <- t(exprs(equivalent))

5.2 Filtering to the highest standard deviation variables

I think I am a bit confused by this filter; one would think that the nzv filter above, if applied correctly, would give you exactly this.

For the moment, I am excluding the following block in order to see how much time/memory keeping these variables costs. If I recall properly, the model of the top-2k variant positions cost ~ 1-4G of memory. I hope that this scales linearly, but I am thinking it might not.

standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:3000]
nzv_texprs <- nzv_texprs[, top_predictors]

5.3 Center the data

I think centering may not be needed for this data, but here is how:

nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)

5.4 Drop correlated

This is a filter which does not correspond to any of those we use in sequencing data because genes which are highly correlated are likely to be of immediate interest.

In the same fashion, I want to leave this off because later applications of this model will include low coverage samples which may not have every variant represented.

## This step takes a while...
system.time({
  nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.95)
  nzv_uncorr <- predict(nzv_correlated, nzv_texprs)
})
##    user  system elapsed 
##  14.542   0.128  14.671
dim(nzv_uncorr)
## [1] 166 964

6 Merge the appropriate metadata

There are a few metadata factors which might prove of interest for classification. The most obvious are of course outcome, clinic, donor, visit, celltype. I am, for the moment, only likely to focus on outcome. AFAICT I can only include one of these at a time in the data, which is a shame.

interesting_meta <- pData(input_data)[, c("finaloutcome", "donor", "persistence",
                                          "visitnumber", "selectionmethod",
                                          "typeofcells", "time")]

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
dim(ml_df)
## [1] 166 965

7 Split the data into training/testing

caret provides nice functionality for splitting up the data. I suspect there are many more fun knobs I can play with for instances where I need to exclude some levels of a factor and such. In this case I just want to split by outcome.

7.1 Via data splitting

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))

datasets <- create_partitions(nzv_uncorr, interesting_meta,
                              outcome_factor = outcome_factor)
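
For reference, caret's built-in createDataPartition() performs this sort of stratified split; I assume create_partitions() wraps something along these lines. A minimal sketch (idx_example and friends are illustrative names):

idx_example <- createDataPartition(outcome_factor, p = 0.7, list = FALSE)
train_example <- nzv_uncorr[idx_example, ]
test_example <- nzv_uncorr[-idx_example, ]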

7.2 Via sampling

There are a few likely sampling methods: cross-validation, bootstrapping, and jackknifing. I will try those out later.
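
As a preview, each of these is expressed through caret's trainControl(); a minimal sketch, with LOOCV as the closest built-in analogue of jackknifing:

cv_example <- trainControl(method = "cv", number = 10)      ## 10-fold cross-validation
boot_example <- trainControl(method = "boot", number = 20)  ## 20 bootstrap resamples
loo_example <- trainControl(method = "LOOCV")               ## leave-one-out, akin to jackknifing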

8 Try out training and prediction methods

My goals from here on will be to get the beginnings of a sense of the various methods I can use to create the models from the training data and predict the outcome on the test data. I am hoping also to pick up some idea of what the various arguments mean while I am at it.

8.1 Try out KNN

k-nearest neighbors is somewhat reminiscent of a kmeans estimate: a sample is classified by the majority vote of the k training samples nearest to it. Thus the primary argument is ‘k’.

8.1.1 Model creation and performance

split <- 1
train_all <- datasets[["trainers"]][[split]]
train_df <- datasets[["trainers_stripped"]][[split]]
train_idx <- datasets[["train_idx"]][[split]]
train_outcomes <- datasets[["trainer_outcomes"]][[split]]
test_df <- datasets[["testers"]][[split]]
test_idx <- datasets[["test_idx"]][[split]]
test_outcomes <- datasets[["tester_outcomes"]][[split]]

knn_fit <- knn3(x = train_df,
                y = train_outcomes,
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df, type = "prob")

knn_train_evaluated <- self_evaluate_model(knn_predict_trained, datasets,
                                           which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_train_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical       7      60
## The missed samples are:
## [1] "TMRC30224" "TMRC30071" "TMRC30121" "TMRC30183" "TMRC30181" "TMRC30175"
## [7] "TMRC30144"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      40       4
##    failure    3      20
##                                         
##                Accuracy : 0.896         
##                  95% CI : (0.797, 0.957)
##     No Information Rate : 0.642         
##     P-Value [Acc > NIR] : 2.28e-06      
##                                         
##                   Kappa : 0.771         
##                                         
##  Mcnemar's Test P-Value : 1             
##                                         
##             Sensitivity : 0.930         
##             Specificity : 0.833         
##          Pos Pred Value : 0.909         
##          Neg Pred Value : 0.870         
##               Precision : 0.909         
##                  Recall : 0.930         
##                      F1 : 0.920         
##              Prevalence : 0.642         
##          Detection Rate : 0.597         
##    Detection Prevalence : 0.657         
##       Balanced Accuracy : 0.882         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.949110671936759.

As the confusion matrix shows, this failed for a few samples. Let us change k and see if it improves.

Here is a table of false positives/negatives for a few values of ‘k’; in this context a false positive is calling a known cure a failure, and a false negative is calling a known failure a cure.

| k | fp | fn |
|---|----|----|
| 2 | 0  | 8  |
| 3 | 5  | 5  |
| 4 | 8  | 9  |
| 5 | 11 | 7  |
| 6 | 15 | 8  |

Note: this depends on the luck of rand(), so the above numbers shift moderately from one run to the next. Thus I think I will just use 2 or 3.
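
If stable counts mattered here, fixing the RNG seed before repartitioning would provide them; a minimal sketch, assuming create_partitions() draws from R's global RNG:

set.seed(20240801)  ## an arbitrary fixed seed
datasets_fixed <- create_partitions(nzv_uncorr, interesting_meta,
                                    outcome_factor = outcome_factor)
## With the seed fixed, the fp/fn counts above should be identical across reruns.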

knn_fit2 <- knn3(x = train_df,
                y = train_outcomes,
                k = 5)
knn_predict_trained2 <- predict(knn_fit2, train_df, type = "prob")

knn_train_evaluated2 <- self_evaluate_model(knn_predict_trained2, datasets,
                                            which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_train_evaluated2
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      12      55
## The missed samples are:
##  [1] "TMRC30178" "TMRC30223" "TMRC30224" "TMRC30071" "TMRC30080" "TMRC30121"
##  [7] "TMRC30048" "TMRC30183" "TMRC30181" "TMRC30218" "TMRC30220" "TMRC30144"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      40       4
##    failure    8      15
##                                         
##                Accuracy : 0.821         
##                  95% CI : (0.708, 0.904)
##     No Information Rate : 0.716         
##     P-Value [Acc > NIR] : 0.0348        
##                                         
##                   Kappa : 0.586         
##                                         
##  Mcnemar's Test P-Value : 0.3865        
##                                         
##             Sensitivity : 0.833         
##             Specificity : 0.789         
##          Pos Pred Value : 0.909         
##          Neg Pred Value : 0.652         
##               Precision : 0.909         
##                  Recall : 0.833         
##                      F1 : 0.870         
##              Prevalence : 0.716         
##          Detection Rate : 0.597         
##    Detection Prevalence : 0.657         
##       Balanced Accuracy : 0.811         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.87302371541502.

8.1.2 Predict the rest of the data with this model.

knn_predict_test <- predict(knn_fit, test_df)

knn_test_evaluated <- self_evaluate_model(knn_predict_test, datasets,
                                     which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_test_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      23      76
## The missed samples are:
##  [1] "TMRC30179" "TMRC30222" "TMRC30274" "TMRC30056" "TMRC30105" "TMRC30119"
##  [7] "TMRC30103" "TMRC30169" "TMRC30096" "TMRC30118" "TMRC30030" "TMRC30195"
## [13] "TMRC30070" "TMRC30191" "TMRC30132" "TMRC30133" "TMRC30143" "TMRC30146"
## [19] "TMRC30198" "TMRC30237" "TMRC30238" "TMRC30264" "TMRC30147"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      54      11
##    failure   12      22
##                                         
##                Accuracy : 0.768         
##                  95% CI : (0.672, 0.847)
##     No Information Rate : 0.667         
##     P-Value [Acc > NIR] : 0.0192        
##                                         
##                   Kappa : 0.481         
##                                         
##  Mcnemar's Test P-Value : 1.0000        
##                                         
##             Sensitivity : 0.818         
##             Specificity : 0.667         
##          Pos Pred Value : 0.831         
##          Neg Pred Value : 0.647         
##               Precision : 0.831         
##                  Recall : 0.818         
##                      F1 : 0.824         
##              Prevalence : 0.667         
##          Detection Rate : 0.545         
##    Detection Prevalence : 0.657         
##       Balanced Accuracy : 0.742         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.795022624434389.
knn_predict_test2 <- predict(knn_fit2, test_df)
knn_test_evaluated2 <- self_evaluate_model(knn_predict_test2, datasets,
                                           which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls > cases

knn_test_evaluated2
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      36      63
## The missed samples are:
##  [1] "TMRC30179" "TMRC30221" "TMRC30222" "TMRC30056" "TMRC30105" "TMRC30094"
##  [7] "TMRC30119" "TMRC30082" "TMRC30103" "TMRC30169" "TMRC30096" "TMRC30083"
## [13] "TMRC30118" "TMRC30014" "TMRC30030" "TMRC30037" "TMRC30054" "TMRC30070"
## [19] "TMRC30041" "TMRC30068" "TMRC30132" "TMRC30123" "TMRC30161" "TMRC30143"
## [25] "TMRC30199" "TMRC30198" "TMRC30201" "TMRC30200" "TMRC30202" "TMRC30204"
## [31] "TMRC30237" "TMRC30238" "TMRC30208" "TMRC30264" "TMRC30147" "TMRC30265"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      54      11
##    failure   25       9
##                                         
##                Accuracy : 0.636         
##                  95% CI : (0.534, 0.731)
##     No Information Rate : 0.798         
##     P-Value [Acc > NIR] : 0.9999        
##                                         
##                   Kappa : 0.106         
##                                         
##  Mcnemar's Test P-Value : 0.0303        
##                                         
##             Sensitivity : 0.684         
##             Specificity : 0.450         
##          Pos Pred Value : 0.831         
##          Neg Pred Value : 0.265         
##               Precision : 0.831         
##                  Recall : 0.684         
##                      F1 : 0.750         
##              Prevalence : 0.798         
##          Detection Rate : 0.545         
##    Detection Prevalence : 0.657         
##       Balanced Accuracy : 0.567         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.69683257918552.

8.2 Perform cross-validation to estimate k

The cross-validation method of repeatedly resampling the data is handled entirely within the train() function. With that in mind, here it is operating with the knn method.

8.2.1 CV with knn

When train() is called with trControl and tuneGrid arguments, we can control how the knn training is repeated; in this case it will iterate over k from 1 to 10.

This currently fails due to a stack overflow…

cv_control <- trainControl(method = "cv", number = 10)

knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

8.2.2 Bootstrap with knn

boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_train_fit <- train(outcome ~ ., data = train_all,
                       method = "knn",
                       trControl = boot_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]
##   k
## 1 1
plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")

8.2.3 Explain the important variables

In this instance we will search for genes which were important for the model’s creation.

The DALEX package provides a function, feature_importance(), which uses a series of other methods to extract features (in this case, genes) which do a good job of explaining the result produced by the model. For this dataset, which has thousands of features, this does not appear to end well.

explainer_knn <- DALEX::explain(knn_fit, label = "knn",
                                data = train_df,
                                y = as.numeric(train_outcomes))

## AFAICT the following will take forever unless we drastically reduce the complexity of the model.
## yeah, I let it run for a week.
## features <- feature_importance(explainer_knn, n_sample = 50, type = "difference")

Conversely, we can use the varImp() function…

knn_variable_importance <- varImp(knn_train_fit)
plot(knn_variable_importance, top = 15)

knn_variables <- knn_variable_importance[["importance"]] %>%
  arrange(desc(cure))

Given that, we can ask for the similarity to the DESeq results…

knn_comparison <- compare_ml_de(knn_variable_importance, deseq_de_cf, annotations,
                                n = comparison_n, plot_n = 15, column = "cure")

knn_comparison[["comparison_plot"]]
comparison_lfc[["knn"]] <- knn_comparison[["shared"]]

8.3 Random Forest

I think R/caret provides a few implementations of the forest family of methods; this one uses ranger (Wright and Ziegler (2017)).

The parameter ‘mtry’ is often important; if I read the text correctly, it controls how many variables are sampled at each split of the tree. Thus higher numbers should presumably make it more specific, at the risk of overfitting.

Setting min.node.size sets the minimum size of the terminal nodes in each tree; each increment up speeds the algorithm.
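
Since mtry is the knob that matters most, sweeping it is the natural extension; a hedged sketch of such a tuneGrid (the candidate values are arbitrary):

ranger_grid <- expand.grid(
    mtry = c(50, 100, 200, 400),  ## number of variables sampled per split
    min.node.size = 1,
    splitrule = "gini",
    stringsAsFactors = FALSE)
## Passing tuneGrid = ranger_grid to train() would sweep mtry across those values.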

I am going to use my boot control trainer from above and see how it goes.

rf_train_fit <- train(outcome ~ ., data = train_all,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
rf_train_fit[["finalModel"]][["prediction.error"]]
## [1] 0.2388
rf_variable_importance <- varImp(rf_train_fit)
plot(rf_variable_importance, top = 15)

rf_variables <- rf_variable_importance[["importance"]] %>%
  arrange(desc(Overall))

rf_predict_trained <- predict(rf_train_fit, train_df)
rf_predict_evaluated <- self_evaluate_model(rf_predict_trained, datasets,
                                            which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

rf_predict_evaluated
## The summary of the (in)correct calls is:
##    Mode    TRUE 
## logical      67
## The missed samples are:
## character(0)
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      44       0
##    failure    0      23
##                                     
##                Accuracy : 1         
##                  95% CI : (0.946, 1)
##     No Information Rate : 0.657     
##     P-Value [Acc > NIR] : 5.81e-13  
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##               Precision : 1.000     
##                  Recall : 1.000     
##                      F1 : 1.000     
##              Prevalence : 0.657     
##          Detection Rate : 0.657     
##    Detection Prevalence : 0.657     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : cure      
## 
## The ROC AUC is: 1.

8.3.1 What would be shared between DESeq2 and the ML classifier?

rf_comparison <- compare_ml_de(rf_variable_importance, deseq_de_cf, annotations,
                               n = comparison_n, plot_n = 15)

rf_comparison[["comparison_venn"]]
## A Venn object on 2 sets named
## de,ml 
##  00  10  01  11 
##   0 249 167  33
comparison_lfc[["rf"]] <- rf_comparison[["shared"]]

8.3.2 Compare to the WGCNA results

A couple of months ago I spent a little time attempting to recapitulate Alejandro’s WGCNA results. I think I did so by mostly copy/pasting his work, adding some commentary, and tweaking parts of it so that it was easier for me to read/understand. In the process, I generated a series of modules which looked similar/identical to his. Unfortunately, I did not add sections to record the genes/modules to output files. I am therefore going back to that now and doing so in the hopes that I can compare those modules to the results produced by the classifiers.

It seems that the interest in this comparison has waned, so I am just going to disable it.

wgcna_result <- openxlsx::readWorkbook(glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))
rownames(wgcna_result) <- wgcna_result[["row.names"]]
wgcna_result[["row.names"]] <- NULL

top_ml <- rownames(head(rf_variables, n = comparison_n))
comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_ml)
comparison_venn <- Vennerable::Venn(comparison)
comparison_wgcna[["rf"]] <- annotations[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

8.3.2.1 Digression: do the genes provided by varImp mean anything?

Let us take a moment and see if the top-n genes returned by varImp() have some meaning which jumps out. One might assume, given our extant Differential Expression results, that the interleukin response will be a likely candidate.

importance_gp <- simple_gprofiler(rownames(head(rf_variables, n = comparison_n)))
importance_gp
## A set of ontologies produced by gprofiler using 200
## genes against the hsapiens annotations and significance cutoff 0.05.
## There are: 
## 5 MF
## 49 BP
## 2 KEGG
## 4 REAC
## 1 WP
## 9 TF
## 0 MIRNA
## 0 HPA
## 2 CORUM
## 0 HP hits.

8.3.3 Now the random forest testers!

rf_predict_test <- predict(rf_train_fit, test_df)

rf_predict_test_evaluated <- self_evaluate_model(rf_predict_test, datasets,
                                     which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

rf_predict_test_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      14      85
## The missed samples are:
##  [1] "TMRC30179" "TMRC30222" "TMRC30105" "TMRC30058" "TMRC30094" "TMRC30119"
##  [7] "TMRC30103" "TMRC30068" "TMRC30198" "TMRC30200" "TMRC30202" "TMRC30204"
## [13] "TMRC30238" "TMRC30265"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      64       1
##    failure   13      21
##                                        
##                Accuracy : 0.859        
##                  95% CI : (0.774, 0.92)
##     No Information Rate : 0.778        
##     P-Value [Acc > NIR] : 0.03028      
##                                        
##                   Kappa : 0.658        
##                                        
##  Mcnemar's Test P-Value : 0.00328      
##                                        
##             Sensitivity : 0.831        
##             Specificity : 0.955        
##          Pos Pred Value : 0.985        
##          Neg Pred Value : 0.618        
##               Precision : 0.985        
##                  Recall : 0.831        
##                      F1 : 0.901        
##              Prevalence : 0.778        
##          Detection Rate : 0.646        
##    Detection Prevalence : 0.657        
##       Balanced Accuracy : 0.893        
##                                        
##        'Positive' Class : cure         
## 
## The ROC AUC is: 0.801131221719457.

8.4 GLM, or Logistic regression and regularization

Logistic regression is a statistical method for binary responses, though it is able to work with multiple classes as well. The general idea of this method is to find parameters which increase the likelihood that the observed data was sampled from a statistical distribution of interest. The transformations and linear-regression-esque tasks performed are confusing; but once they are done, the task becomes setting the model’s (fitting) parameters to values which increase the probability that the statistical model looks like the actual dataset given the training data and, when sampled, returns similar values. The most likely statistical distribution one will want to fit is the Gaussian, in which case we want to transform/normalize the mean/variance of our variables so they look like whatever normal distribution we are using. Conversely, logistic regression uses a binomial distribution (like our raw sequencing data!) but bounded between 0 and 1.
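
As a reminder of the shape being fit, the linear predictor is pushed through the logistic (sigmoid) function, which maps (-Inf, Inf) onto (0, 1); a minimal sketch with made-up coefficients:

b0 <- -2   ## hypothetical intercept
b1 <- 1.5  ## hypothetical slope for one gene's expression
x <- seq(-4, 4, length.out = 200)
p_fail <- plogis(b0 + b1 * x)  ## equivalent to 1 / (1 + exp(-(b0 + b1 * x)))
plot(x, p_fail, type = "l", xlab = "expression", ylab = "P(outcome)")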

8.4.1 Using a single gene

Let us take the most important gene observed in one of our previous training sets: ENSG00000248405 PRR5-ARHGAP8

gene_id <- "ENSG00000248405"
single_fit <- train(
    outcome ~ ENSG00000248405, data = train_all,
    method = "glm", family = "binomial", trControl = trainControl("none"))

tt <- data.frame("ENSG00000248405" = seq(min(train_df[[gene_id]]),
                                         max(train_df[[gene_id]]), len = 100))
## predict probabilities for the simulated data
tt$subtype <- predict(single_fit, newdata = tt, type = "prob")[, 1]
## plot the sigmoid curve and the training data
plot(ifelse(outcome == "cure", 1, 0) ~ ENSG00000248405,
     data = train_all, col = "red4",
     ylab = "CF as 0 or 1", xlab = "favorite gene expression")
lines(subtype ~ ENSG00000248405, tt, col = "green4", lwd = 2)

plot_df <- train_all[, c("outcome", "ENSG00000248405")]
ggbetweenstats(plot_df, "outcome", "ENSG00000248405")

Having tried with 1 gene, let us extend this to all genes. In my first try of this, it took a long time.

glm_train_fit <- train(outcome ~ ., data = train_all,
                 trControl = boot_control,
                 method = "glm", family = "binomial")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## (the same warning is emitted once per bootstrap resample, 20 times in total)

8.5 Compare GLM and WGCNA/DE

The previous block does not take into account our parameter sweep, so it isn’t theoretically very useful for this series of methodological tests. As a result, the following block is also not particularly helpful, though I suspect the results are identical to what follows.

glm_variable_importance <- varImp(glm_train_fit)
## Oh, this only produces 100 entries -- so me getting the top 400 is silly.
glm_variables <- glm_variable_importance[["importance"]] %>%
  arrange(desc(Overall))
tt <- plot(glm_variable_importance, top = 15)
simple_gprofiler(rownames(head(glm_variables, n = comparison_n)))
## A set of ontologies produced by gprofiler using 66
## genes against the hsapiens annotations and significance cutoff 0.05.
## There are: 
## 2 MF
## 123 BP
## 1 KEGG
## 3 REAC
## 5 WP
## 1 TF
## 0 MIRNA
## 3 HPA
## 0 CORUM
## 1 HP hits.
top_glm <- rownames(glm_variables)
comparison <- list("de" = top_bottom_ids, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
comparison_lfc[["glm"]] <- fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

I am dropping the adjp and wgcna comparisons, as they show little overlap.

comparison <- list("de" = lowest_adjp, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
comparison_adjp[["glm"]] <- fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = highest_exprs, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

In this block, I repeat the above with the tuneGrid option set so that it may do the training using a sweep of putatively helpful parameters; as a result it should provide a more appropriate (but quite possibly identical) trainer.

##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
## train_method <- trainControl(method = "cv", number = 10)
glm_fit <- train(outcome ~ ., data = train_all, method = "glmnet",
                 trControl = boot_control, importance = "permutation",
                 tuneGrid = data.frame(
                   alpha = 0.5,
                   lambda = seq(0.1, 0.7, 0.05)),
                 verbose = TRUE)
glm_fit
## glmnet 
## 
##  67 samples
## 964 predictors
##   2 classes: 'cure', 'failure' 
## 
## No pre-processing
## Resampling: Bootstrapped (20 reps) 
## Summary of sample sizes: 67, 67, 67, 67, 67, 67, ... 
## Resampling results across tuning parameters:
## 
##   lambda  Accuracy  Kappa   
##   0.10    0.7686    0.477806
##   0.15    0.7264    0.367329
##   0.20    0.6833    0.243122
##   0.25    0.6592    0.152945
##   0.30    0.6479    0.087507
##   0.35    0.6314    0.015878
##   0.40    0.6313    0.002206
##   0.45    0.6313    0.000000
##   0.50    0.6313    0.000000
##   0.55    0.6313    0.000000
##   0.60    0.6313    0.000000
##   0.65    0.6313    0.000000
##   0.70    0.6313    0.000000
## 
## Tuning parameter 'alpha' was held constant at a value of 0.5
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.1.
glm_predict_trained <- predict(glm_fit, train_df)

glm_train_eval <- self_evaluate_model(glm_predict_trained, datasets,
                                      which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

glm_train_eval
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical       1      66
## The missed samples are:
## [1] "TMRC30220"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      44       0
##    failure    1      22
##                                    
##                Accuracy : 0.985    
##                  95% CI : (0.92, 1)
##     No Information Rate : 0.672    
##     P-Value [Acc > NIR] : 8.84e-11 
##                                    
##                   Kappa : 0.967    
##                                    
##  Mcnemar's Test P-Value : 1        
##                                    
##             Sensitivity : 0.978    
##             Specificity : 1.000    
##          Pos Pred Value : 1.000    
##          Neg Pred Value : 0.957    
##               Precision : 1.000    
##                  Recall : 0.978    
##                      F1 : 0.989    
##              Prevalence : 0.672    
##          Detection Rate : 0.657    
##    Detection Prevalence : 0.657    
##       Balanced Accuracy : 0.989    
##                                    
##        'Positive' Class : cure     
## 
## The ROC AUC is: 0.978260869565217.

8.5.1 Now the GLM testers!

glm_predict_test <- predict(glm_fit, test_df)

glm_fit_eval_test <- self_evaluate_model(glm_predict_test, datasets,
                                         which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

glm_fit_eval_test
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      15      84
## The missed samples are:
##  [1] "TMRC30179" "TMRC30221" "TMRC30222" "TMRC30058" "TMRC30094" "TMRC30143"
##  [7] "TMRC30146" "TMRC30199" "TMRC30198" "TMRC30201" "TMRC30200" "TMRC30202"
## [13] "TMRC30238" "TMRC30264" "TMRC30265"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      63       2
##    failure   13      21
##                                         
##                Accuracy : 0.848         
##                  95% CI : (0.762, 0.913)
##     No Information Rate : 0.768         
##     P-Value [Acc > NIR] : 0.03273       
##                                         
##                   Kappa : 0.636         
##                                         
##  Mcnemar's Test P-Value : 0.00982       
##                                         
##             Sensitivity : 0.829         
##             Specificity : 0.913         
##          Pos Pred Value : 0.969         
##          Neg Pred Value : 0.618         
##               Precision : 0.969         
##                  Recall : 0.829         
##                      F1 : 0.894         
##              Prevalence : 0.768         
##          Detection Rate : 0.636         
##    Detection Prevalence : 0.657         
##       Balanced Accuracy : 0.871         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.793438914027149.

8.5.2 Compare again vs the DE table

In the following block I extract the top and bottom DE genes vs. the top set of genes in the glm_variables ordered by the ‘Overall’ column.

I should make this a function; there is no way this will be kept correct across multiple iterations of the ml/wgcna/etc. comparisons.

I am going to stop trying to compare these results across WGCNA/adjp/expression; those comparisons are all basically null.

glm_variable_importance <- varImp(glm_fit)
glm_comparison <- compare_ml_de(glm_variable_importance, deseq_de_cf, annotations,
                                n = comparison_n, plot_n = 15)

comparison_lfc[["glm"]] <- glm_comparison[["shared"]]

8.6 Gradient Booster

At this point I basically have a feeling for how one may use tuneGrid to apply a parameter sweep along with the various (in this case cv) sampling methods. Thus, the following is basically identical to the previous blocks, except that it uses xgbTree (the R implementation of extreme gradient boosting; Chen and Guestrin (2016)).

##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

gb_fit <- train(outcome ~ ., data = train_all,
                method = "xgbTree", trControl = train_method,
                tuneGrid = data.frame(
                    nrounds = 200,
                    eta = c(0.05, 0.1, 0.3),
                    max_depth = 4,
                    gamma = 0,
                    colsample_bytree = 1,
                    subsample = 0.5,
                    min_child_weight = 1),
                verbose = TRUE)

gb_predict_trained <- predict(gb_fit, train_df)
gb_predict_trained
##  [1] cure    failure failure failure cure    cure    cure    cure    cure   
## [10] cure    cure    cure    cure    cure    cure    cure    cure    cure   
## [19] cure    cure    cure    cure    cure    cure    failure cure    failure
## [28] cure    cure    failure failure cure    cure    cure    cure    cure   
## [37] failure failure failure failure failure failure cure    cure    cure   
## [46] cure    cure    failure cure    failure cure    cure    failure cure   
## [55] cure    cure    cure    failure cure    failure failure failure failure
## [64] failure failure cure    cure   
## Levels: cure failure
gb_train_eval <- self_evaluate_model(gb_predict_trained, datasets,
                                     which = split, type = "train")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

gb_train_eval
## The summary of the (in)correct calls is:
##    Mode    TRUE 
## logical      67
## The missed samples are:
## character(0)
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      44       0
##    failure    0      23
##                                     
##                Accuracy : 1         
##                  95% CI : (0.946, 1)
##     No Information Rate : 0.657     
##     P-Value [Acc > NIR] : 5.81e-13  
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##               Precision : 1.000     
##                  Recall : 1.000     
##                      F1 : 1.000     
##              Prevalence : 0.657     
##          Detection Rate : 0.657     
##    Detection Prevalence : 0.657     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : cure      
## 
## The ROC AUC is: 1.

8.6.1 Now the GB testers!

gb_predict_test <- predict(gb_fit, test_df)

gb_predict_test_evaluated <- self_evaluate_model(gb_predict_test, datasets,
                                                 which = split, type = "test")
## Setting levels: control = cure, case = failure
## Setting direction: controls < cases

gb_predict_test_evaluated
## The summary of the (in)correct calls is:
##    Mode   FALSE    TRUE 
## logical      13      86
## The missed samples are:
##  [1] "TMRC30179" "TMRC30222" "TMRC30105" "TMRC30058" "TMRC30094" "TMRC30119"
##  [7] "TMRC30197" "TMRC30201" "TMRC30200" "TMRC30204" "TMRC30238" "TMRC30264"
## [13] "TMRC30265"
## The confusion matrix is:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction cure failure
##    cure      65       0
##    failure   13      21
##                                         
##                Accuracy : 0.869         
##                  95% CI : (0.786, 0.928)
##     No Information Rate : 0.788         
##     P-Value [Acc > NIR] : 0.027769      
##                                         
##                   Kappa : 0.68          
##                                         
##  Mcnemar's Test P-Value : 0.000874      
##                                         
##             Sensitivity : 0.833         
##             Specificity : 1.000         
##          Pos Pred Value : 1.000         
##          Neg Pred Value : 0.618         
##               Precision : 1.000         
##                  Recall : 0.833         
##                      F1 : 0.909         
##              Prevalence : 0.788         
##          Detection Rate : 0.657         
##    Detection Prevalence : 0.657         
##       Balanced Accuracy : 0.917         
##                                         
##        'Positive' Class : cure          
## 
## The ROC AUC is: 0.808823529411765.
gb_variable_importance <- varImp(gb_fit)
gb_variables <- gb_variable_importance[["importance"]] %>%
  arrange(desc(Overall))
plot(gb_variable_importance, top = 15)

gb_comparison <- compare_ml_de(gb_variable_importance, deseq_de_cf, annotations, n = comparison_n, plot_n = 15)

comparison_lfc[["gb"]] <- gb_comparison[["shared"]]

9 Shared importance

upset_lfc <- list()
upset_adjp <- list()
upset_wgcna <- list()
for (d in 1:length(comparison_lfc)) {
  name <- names(comparison_lfc)[d]
  upset_lfc[[name]] <- rownames(comparison_lfc[[name]])
  upset_adjp[[name]] <- rownames(comparison_adjp[[name]])
  ##upset_wgcna[[name]] <- rownames(comparison_wgcna[[name]])
}

start_lfc <- UpSetR::fromList(upset_lfc)
lfc_vs_ml <- UpSetR::upset(start_lfc)

9.1 A different way to query shared/unique genes

The various gene sets observed by these methods may also be compared directly by an upset plot using the deseq_de_cf with (gb|glm|rf|knn)_variable_importance…

comp_number <- 300
de_ml_sig_list <- list(
  "DESeq2" = rownames(deseq_de_cf),
  "knn" = rownames(head(knn_variables, n = comp_number)),
  "rf" = rownames(head(rf_variables, n = comp_number)),
  "glm" = rownames(head(glm_variables, n = comp_number)),
  "gb" = rownames(head(gb_variables, n = comp_number)))
important_genes_upset_input <- UpSetR::fromList(de_ml_sig_list)
important_genes_upset <- UpSetR::upset(important_genes_upset_input)
print(important_genes_upset)

important_genes_overlap <- overlap_groups(de_ml_sig_list)
shared_gene_ids <- overlap_geneids(important_genes_overlap, "DESeq2:knn")
shared_annotations <- fData(input_data)[as.character(shared_gene_ids), ]

shared_gene_ids <- overlap_geneids(important_genes_overlap, "DESeq2:rf")
shared_annotations <- fData(input_data)[as.character(shared_gene_ids), ]

shared_gene_ids <- overlap_geneids(important_genes_overlap, "DESeq2:gb")
shared_annotations <- fData(input_data)[as.character(shared_gene_ids), ]

shared_gene_ids <- overlap_geneids(important_genes_overlap, "DESeq2:glm")
shared_annotations <- fData(input_data)[as.character(shared_gene_ids), ]

10 Collate shared genes by method and write them

Over the course of this document I recorded the limited agreement between the DESeq results and the ML classifiers in the data structure ‘comparison_lfc’. Let us write that out to an xlsx file…

de_ml_xlsx <- glue("excel/tc_compare_de_ml_top3kvariant-v{ver}.xlsx")
xlsx <- init_xlsx(de_ml_xlsx)
wb <- xlsx[["wb"]]
excel_basename <- xlsx[["basename"]]
written <- write_xlsx(data = comparison_lfc[["knn"]], wb = wb, sheet = "knn_vs_deseq")
written <- write_xlsx(data = comparison_lfc[["rf"]], wb = wb, sheet = "rf_vs_deseq")
written <- write_xlsx(data = comparison_lfc[["glm"]], wb = wb, sheet = "glm_vs_deseq")
written <- write_xlsx(data = comparison_lfc[["gb"]], wb = wb, sheet = "gb_vs_deseq")
excel_ret <- openxlsx::saveWorkbook(wb, de_ml_xlsx, overwrite = TRUE)

I realized something important in this block: at no point in any of my analyses do I really record the classifications; I just record their (dis)concordance with our known annotations. In the following blocks I iterate over multiple rounds of the various classifiers, so let us modify that function to record this information in some useful fashion.
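
A minimal sketch of what that recording might look like, reusing the create_partitions()/knn3() pieces from earlier; prediction_log and the iteration count are hypothetical illustration rather than the final implementation:

## Accumulate the per-sample calls from each iteration into one data frame, so the
## actual classifications survive alongside the accuracy statistics.
prediction_log <- data.frame(row.names = rownames(nzv_uncorr))
for (i in seq_len(3)) {  ## three iterations purely for illustration
  datasets_i <- create_partitions(nzv_uncorr, interesting_meta,
                                  outcome_factor = outcome_factor)
  fit_i <- knn3(x = datasets_i[["trainers_stripped"]][[1]],
                y = datasets_i[["trainer_outcomes"]][[1]], k = 3)
  calls_i <- predict(fit_i, datasets_i[["testers"]][[1]], type = "class")
  prediction_log[rownames(datasets_i[["testers"]][[1]]), paste0("iter_", i)] <-
    as.character(calls_i)
}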

11 Performing multiple train/test rounds in one call

Doing all of the above for each method and collecting the results is super fiddly, so I wrote a little function to try to make that easier.

WARNING: Turning on the following block takes ~ 2 days on our pretty nice server. Your mileage may vary.

11.1 Setting up input data for multiple ML iterations

There are four versions of the data that Maria Adelaida and Najib are interested in:

  • Tumaco+Cali all visits
  • Tumaco+Cali visit 1
  • Tumaco all visits
  • Tumaco visit 1

Ergo, let us set up an appropriate naming convention for them to make these iterative runs more coherent. I will continue using ‘tc’ and ‘t’ for the clinics as a prefix and follow it with ‘vall’ vs. ‘v1’.

tc_vall_input <- subset_expt(tc_clinical, subset = "batch!='biopsy'")
## The samples excluded are: TMRC30156, TMRC30269, TMRC30253, TMRC30270, TMRC30016, TMRC30017, TMRC30018, TMRC30019, TMRC30020, TMRC30022, TMRC30026, TMRC30044, TMRC30045, TMRC30152, TMRC30177, TMRC30155, TMRC30154, TMRC30241.
## subset_expt(): There were 184, now there are 166 samples.
t_vall_input <- subset_expt(tc_vall_input, subset = "clinic=='tumaco'")
## The samples excluded are: TMRC30185, TMRC30186, TMRC30178, TMRC30179, TMRC30221, TMRC30222, TMRC30223, TMRC30224, TMRC30148, TMRC30149, TMRC30150, TMRC30140, TMRC30138, TMRC30176, TMRC30153, TMRC30151, TMRC30234, TMRC30235, TMRC30225, TMRC30226, TMRC30227, TMRC30228, TMRC30229, TMRC30230, TMRC30231, TMRC30232, TMRC30233, TMRC30209, TMRC30210, TMRC30211, TMRC30212, TMRC30213, TMRC30216, TMRC30214, TMRC30215, TMRC30271, TMRC30273, TMRC30275, TMRC30272, TMRC30274, TMRC30276, TMRC30254, TMRC30255, TMRC30256, TMRC30277, TMRC30239, TMRC30240, TMRC30278, TMRC30279, TMRC30280, TMRC30257, TMRC30258, TMRC30281, TMRC30283, TMRC30284, TMRC30282, TMRC30285.
## subset_expt(): There were 166, now there are 109 samples.
tc_v1_input <- subset_expt(tc_vall_input, subset = "visitbipart=='v1'")
## The samples excluded are: TMRC30178, TMRC30179, TMRC30223, TMRC30224, TMRC30150, TMRC30140, TMRC30176, TMRC30153, TMRC30151, TMRC30228, TMRC30229, TMRC30230, TMRC30231, TMRC30232, TMRC30233, TMRC30212, TMRC30213, TMRC30216, TMRC30214, TMRC30215, TMRC30272, TMRC30274, TMRC30276, TMRC30254, TMRC30255, TMRC30256, TMRC30277, TMRC30278, TMRC30279, TMRC30280, TMRC30282, TMRC30285, TMRC30056, TMRC30113, TMRC30105, TMRC30058, TMRC30164, TMRC30094, TMRC30119, TMRC30082, TMRC30122, TMRC30169, TMRC30093, TMRC30170, TMRC30032, TMRC30096, TMRC30028, TMRC30115, TMRC30118, TMRC30121, TMRC30196, TMRC30030, TMRC30037, TMRC30031, TMRC30027, TMRC30194, TMRC30195, TMRC30054, TMRC30070, TMRC30049, TMRC30055, TMRC30053, TMRC30068, TMRC30171, TMRC30139, TMRC30158, TMRC30160, TMRC30157, TMRC30183, TMRC30181, TMRC30072, TMRC30133, TMRC30078, TMRC30076, TMRC30159, TMRC30129, TMRC30088, TMRC30172, TMRC30137, TMRC30161, TMRC30142, TMRC30145, TMRC30143, TMRC30146, TMRC30182, TMRC30199, TMRC30201, TMRC30200, TMRC30202, TMRC30205, TMRC30206, TMRC30136, TMRC30217, TMRC30077, TMRC30219, TMRC30218, TMRC30079, TMRC30220, TMRC30173, TMRC30144, TMRC30147.
## subset_expt(): There were 166, now there are 65 samples.
t_v1_input <- subset_expt(t_vall_input, subset = "visitbipart=='v1'")
## The samples excluded are: TMRC30056, TMRC30113, TMRC30105, TMRC30058, TMRC30164, TMRC30094, TMRC30119, TMRC30082, TMRC30122, TMRC30169, TMRC30093, TMRC30170, TMRC30032, TMRC30096, TMRC30028, TMRC30115, TMRC30118, TMRC30121, TMRC30196, TMRC30030, TMRC30037, TMRC30031, TMRC30027, TMRC30194, TMRC30195, TMRC30054, TMRC30070, TMRC30049, TMRC30055, TMRC30053, TMRC30068, TMRC30171, TMRC30139, TMRC30158, TMRC30160, TMRC30157, TMRC30183, TMRC30181, TMRC30072, TMRC30133, TMRC30078, TMRC30076, TMRC30159, TMRC30129, TMRC30088, TMRC30172, TMRC30137, TMRC30161, TMRC30142, TMRC30145, TMRC30143, TMRC30146, TMRC30182, TMRC30199, TMRC30201, TMRC30200, TMRC30202, TMRC30205, TMRC30206, TMRC30136, TMRC30217, TMRC30077, TMRC30219, TMRC30218, TMRC30079, TMRC30220, TMRC30173, TMRC30144, TMRC30147.
## subset_expt(): There were 109, now there are 40 samples.
tc_vall_input <- normalize_expt(tc_vall_input, transform = "log2", convert = "cpm")
## transform_counts: Found 932911 values equal to 0, adding 1 to the matrix.
t_vall_input <- normalize_expt(t_vall_input, transform = "log2", convert = "cpm")
## transform_counts: Found 587592 values equal to 0, adding 1 to the matrix.
tc_v1_input <- normalize_expt(tc_v1_input, transform = "log2", convert = "cpm")
## transform_counts: Found 369543 values equal to 0, adding 1 to the matrix.
t_v1_input <- normalize_expt(t_v1_input, transform = "log2", convert = "cpm")
## transform_counts: Found 216947 values equal to 0, adding 1 to the matrix.
tc_vall_outcome_factor <- as.factor(as.character(pData(tc_vall_input)[[ref_col]]))
t_vall_outcome_factor <- as.factor(as.character(pData(t_vall_input)[[ref_col]]))
tc_v1_outcome_factor <- as.factor(as.character(pData(tc_v1_input)[[ref_col]]))
t_v1_outcome_factor <- as.factor(as.character(pData(t_v1_input)[[ref_col]]))

11.2 Perform normalization/filtering operations as per the exploratory document

In the following block I will use the same normalizations and filters as above on each of the four inputs.

tc_vall_norm <- normalize_expt(tc_vall_input, filter = "cv", cv_min = 0.1)
## Removing 3619 low-count genes (16333 remaining).
tc_v1_norm <- normalize_expt(tc_v1_input, filter = "cv", cv_min = 0.1)
## Removing 3968 low-count genes (15984 remaining).
t_vall_norm <- normalize_expt(t_vall_input, filter = "cv", cv_min = 0.1)
## Removing 3785 low-count genes (16167 remaining).
t_v1_norm <- normalize_expt(t_v1_input, filter = "cv", cv_min = 0.1)
## Removing 4109 low-count genes (15843 remaining).
tc_vall_texprs <- t(exprs(tc_vall_norm))
tc_v1_texprs <- t(exprs(tc_v1_norm))
t_vall_texprs <- t(exprs(t_vall_norm))
t_v1_texprs <- t(exprs(t_v1_norm))

sd_genes <- 3000
tc_vall_stdev <- apply(tc_vall_texprs, 2, sd)
tc_vall_pred <- order(tc_vall_stdev, decreasing = TRUE)[1:sd_genes]
tc_vall_texprs <- tc_vall_texprs[, tc_vall_pred]
tc_v1_stdev <- apply(tc_v1_texprs, 2, sd)
tc_v1_pred <- order(tc_v1_stdev, decreasing = TRUE)[1:sd_genes]
tc_v1_texprs <- tc_v1_texprs[, tc_v1_pred]
t_vall_stdev <- apply(t_vall_texprs, 2, sd)
t_vall_pred <- order(t_vall_stdev, decreasing = TRUE)[1:sd_genes]
t_vall_texprs <- t_vall_texprs[, t_vall_pred]
t_v1_stdev <- apply(t_v1_texprs, 2, sd)
t_v1_pred <- order(t_v1_stdev, decreasing = TRUE)[1:sd_genes]
t_v1_texprs <- t_v1_texprs[, t_v1_pred]

tc_vall_preproc <- preProcess(tc_vall_texprs, method = "center")
tc_vall_texprs <- predict(tc_vall_preproc, tc_vall_texprs)
tc_v1_preproc <- preProcess(tc_v1_texprs, method = "center")
tc_v1_texprs <- predict(tc_v1_preproc, tc_v1_texprs)
t_vall_preproc <- preProcess(t_vall_texprs, method = "center")
t_vall_texprs <- predict(t_vall_preproc, t_vall_texprs)
t_v1_preproc <- preProcess(t_v1_texprs, method = "center")
t_v1_texprs <- predict(t_v1_preproc, t_v1_texprs)

tc_vall_preproc <- preProcess(tc_vall_texprs, method = "corr", cutoff = 0.95)
tc_vall_texprs <- predict(tc_vall_preproc, tc_vall_texprs)
tc_v1_preproc <- preProcess(tc_v1_texprs, method = "corr", cutoff = 0.95)
tc_v1_texprs <- predict(tc_v1_preproc, tc_v1_texprs)
t_vall_preproc <- preProcess(t_vall_texprs, method = "corr", cutoff = 0.95)
t_vall_texprs <- predict(t_vall_preproc, t_vall_texprs)
t_v1_preproc <- preProcess(t_v1_texprs, method = "corr", cutoff = 0.95)
t_v1_texprs <- predict(t_v1_preproc, t_v1_texprs)

chosen_factors <- c("finaloutcome", "donor", "persistence", "visitnumber",
                    "selectionmethod", "typeofcells", "time")
tc_vall_meta <- pData(tc_vall_norm)[, chosen_factors]
tc_v1_meta <- pData(tc_v1_norm)[, chosen_factors]
t_vall_meta <- pData(t_vall_norm)[, chosen_factors]
t_v1_meta <- pData(t_v1_norm)[, chosen_factors]

ref_col <- "finaloutcome"
tc_vall_outcome <- as.factor(as.character(pData(tc_vall_norm)[[ref_col]]))
tc_v1_outcome <- as.factor(as.character(pData(tc_v1_norm)[[ref_col]]))
t_vall_outcome <- as.factor(as.character(pData(t_vall_norm)[[ref_col]]))
t_v1_outcome <- as.factor(as.character(pData(t_v1_norm)[[ref_col]]))

tc_vall_ml_df <- cbind.data.frame(tc_vall_outcome, as.data.frame(tc_vall_texprs))
tc_vall_ml_df[[ref_col]] <- as.factor(tc_vall_ml_df[["tc_vall_outcome"]])
tc_v1_ml_df <- cbind.data.frame(tc_v1_outcome, as.data.frame(tc_v1_texprs))
tc_v1_ml_df[[ref_col]] <- as.factor(tc_v1_ml_df[["tc_v1_outcome"]])
t_vall_ml_df <- cbind.data.frame(t_vall_outcome, as.data.frame(t_vall_texprs))
t_vall_ml_df[[ref_col]] <- as.factor(t_vall_ml_df[["t_vall_outcome"]])
t_v1_ml_df <- cbind.data.frame(t_v1_outcome, as.data.frame(t_v1_texprs))
t_v1_ml_df[[ref_col]] <- as.factor(t_v1_ml_df[["t_v1_outcome"]])

11.3 Iterate Tumaco+Cali all visits

tc_vall_summary_xlsx <- glue("excel/tc_vall_ml_summary-v{ver}.xlsx")
tc_vall_knn <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(tc_vall_knn, excel = tc_vall_summary_xlsx)
tc_vall_gb <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(tc_vall_gb, excel = written[["wb"]])
tc_vall_glm <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(tc_vall_glm, excel = written[["wb"]])
tc_vall_rf <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(tc_vall_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = tc_vall_summary_xlsx)

11.4 Iterate Tumaco+Cali Visit 1

tc_v1_summary_xlsx <- glue("excel/tc_v1_ml_summary-v{ver}.xlsx")
tc_v1_knn <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(tc_v1_knn, excel = tc_v1_summary_xlsx)
tc_v1_gb <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(tc_v1_gb, excel = written[["wb"]])
tc_v1_glm <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(tc_v1_glm, excel = written[["wb"]])
tc_v1_rf <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(tc_v1_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = tc_v1_summary_xlsx)

11.5 Tumaco all visits

t_vall_summary_xlsx <- glue("excel/t_vall_ml_summary-v{ver}.xlsx")
t_vall_knn <- classify_n_times(t_vall_texprs, t_vall_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(t_vall_knn, excel = t_vall_summary_xlsx)
t_vall_gb <- classify_n_times(t_vall_texprs, t_vall_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(t_vall_gb, excel = written[["wb"]])
t_vall_glm <- classify_n_times(t_vall_texprs, t_vall_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(t_vall_glm, excel = written[["wb"]])
t_vall_rf <- classify_n_times(t_vall_texprs, t_vall_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(t_vall_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = t_vall_summary_xlsx)

11.6 Tumaco Visit 1

t_v1_summary_xlsx <- glue("excel/t_v1_ml_summary-v{ver}.xlsx")
t_v1_knn <- classify_n_times(t_v1_texprs, t_v1_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(t_v1_knn, excel = t_v1_summary_xlsx)
t_v1_gb <- classify_n_times(t_v1_texprs, t_v1_meta,
                            outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(t_v1_gb, excel = written[["wb"]])
t_v1_glm <- classify_n_times(t_v1_texprs, t_v1_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(t_v1_glm, excel = written[["wb"]])
t_v1_rf <- classify_n_times(t_v1_texprs, t_v1_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(t_v1_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = t_v1_summary_xlsx)

Bibliography

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD ’16. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785.
Kuhn, Max. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Wright, Marvin N., and Andreas Ziegler. 2017. “Ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (March): 1–17. https://doi.org/10.18637/jss.v077.i01.
---
title: "TMRC3 ML Classification of outcome: `r Sys.getenv('VERSION')`"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
bibliography: atb.bib
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
---

<style type="text/css">
body .main-container {
  max-width: 1600px;
}
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
  font-size: 16px;
}
</style>

```{r, include=FALSE}
library(hpgltools)
library(caret)
library(dplyr)
library(pROC)
library(DALEX)
library(glmnet)
library(glue)
library(kernlab)
library(ranger)
library(xgboost)
library(ggstatsplot)

knitr::opts_knit$set(progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- Sys.getenv("VERSION")
rundate <- format(Sys.Date(), format = "%Y%m%d")

rmd_file <- "06classifier_highvar.Rmd"
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
loaded <- load(file = glue("rda/tmrc3_data_structures-v{ver}.rda"))
```

# Changelog

* 202408: Collapsing the various ML documents back into one in the
  hopes that it eases the degree of confusion.  All of the exploratory
  analyses are at the top, at the bottom are the blocks which are used
  to generate the final tables, but they are only runnable manually
  due to time constraints.
* 202408: Re-enabled glm, it seems to be running faster again?
* 20240620: Disabled glm training/testing in the container.  For
  reasons I have not yet figured out it takes ~ 10x longer (~10 days)
  in the container (note, this is a new thing, it used to take ~ 8
  hours in the container without error, so something changed in
  202405-202406 caret/glmnet/singularity/etc) to complete; which is a
  problem because my notebooks are auto-regenerated on a 2x/week basis
  (if changes occur) and so it doesn't finish before another round
  starts...
* 202309: Adding a version of this with only the Tumaco data.  I am a
  pretty big non-fan of doing this; so I am going to do so in the most
  cursory fashion possible by just copying this document to another copy with
  a suffix of 'tumaco' and changing the initial datastructure to the Tumaco-only
  dataset.


# Introduction

I had some success in classifying the TMRC2 samples by strain via ML
and want to try something more difficult.  Thus I will use the
normalized gene expression data and try classifying it by cure/fail.

All analyses in this document heavily use caret
(@kuhnBuildingPredictiveModels2008) and, as per the sister TMRC2
document, follow pretty closely the path from:

https://topepo.github.io/caret/data-splitting.html#simple-splitting-based-on-the-outcome

and

https://github.com/compgenomr/book/blob/master/05-supervisedLearning.Rmd

# Starter data

In the strain classifier I used normalized variants.  I am thinking to
use normalized expression here and therefore explicitly limit myself
to ~ 20k variables (significantly down from the 1.6M).

In addition, caret expects the data as (rows == samples) and
(columns == variables) where each element is one observation.
Thus we will need to transpose the expression matrix.

```{r}
input_data <- subset_expt(tc_clinical, subset="batch!='biopsy'") %>%
  normalize_expt(transform = "log2", convert = "cpm")

ref_col <- "finaloutcome"
outcome_factor <- as.factor(as.character(pData(input_data)[[ref_col]]))
comparison_n <- 200

## Setting this up in case Najib and Maria Adelaida request a Tumaco set of exploratory analyses.
t_input_data <- subset_expt(tc_clinical, subset="batch!='biopsy'") %>%
  normalize_expt(transform = "log2", convert = "cpm")
t_outcome_factor <- as.factor(as.character(pData(t_input_data)[[ref_col]]))
```

# Compare TopN important genes to DE genes

Given that we have separated the various analyses, it will take me a
minute to figure out where I saved the relevant differential
expression analysis.  I do not actually save the various DE results to
rda files by default, instead opting to send them to xlsx files to
share.  Recall if you will, that the data that I think might be used
for the paper also does not go into the default excel directory but
instead mirrors the box organization scheme.

Thus, I think the most relevant file is:
"analyses/3_cali_and_tumaco/DE_Cure_vs_Fail/All_Samples/tc_cf_clinical_table_sva-v202207.xlsx"

and

"analyses/4_tumaco/DE_Cure_vs_Fail/All_Samples/t_cf_clinical_table_sva-v202207.xlsx"

Note: this xlsx file contains three primary worksheets, v1cf, v2cf,
and v3cf, which comprise all samples across cell types for each visit
and compare cure and fail.  When I choose sheet #2 in the following
block I am explicitly asking for only the v1 cure/fail result.

## Opening the relevant DE table

```{r}
input_xlsx <- glue("analyses/3_cali_and_tumaco/DE_Cure_Fail/Clinical_Samples/tc_clinical_cf_sig_sva-v{ver}.xlsx")
all_de_cf <- openxlsx::readWorkbook(input_xlsx, sheet = 3, startRow = 2)
rownames(all_de_cf) <- all_de_cf[["row.names"]]
all_de_cf[["row.names"]] <- NULL
deseq_de_cf_up <- all_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
all_de_cf <- openxlsx::readWorkbook(input_xlsx, sheet = 4, startRow = 2)
rownames(all_de_cf) <- all_de_cf[["row.names"]]
all_de_cf[["row.names"]] <- NULL
deseq_de_cf_down <- all_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
deseq_de_cf <- rbind.data.frame(deseq_de_cf_up, deseq_de_cf_down)

tumaco_xlsx <- glue("analyses/4_tumaco/DE_Cure_Fail/t_all_visitcf_sig_sva-v{ver}.xlsx")
t_de_cf <- openxlsx::readWorkbook(tumaco_xlsx, sheet = 3, startRow = 2)
rownames(t_de_cf) <- t_de_cf[["row.names"]]
t_de_cf[["row.names"]] <- NULL
t_de_cf_up <- t_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
t_de_cf <- openxlsx::readWorkbook(tumaco_xlsx, sheet = 4, startRow = 2)
rownames(t_de_cf) <- t_de_cf[["row.names"]]
t_de_cf[["row.names"]] <- NULL
t_de_cf_down <- t_de_cf[, c("deseq_logfc", "deseq_adjp", "deseq_basemean", "deseq_lfcse")]
t_de_cf <- rbind.data.frame(t_de_cf_up, t_de_cf_down)
top_bottom_ids <- rownames(t_de_cf)
```

## A function to help compare methods to DESeq2

The following block contains a few lists which will hold comparisons
between DESeq2 and the various methods which follow, some parameters,
and a function to compare the methods and DESeq.

Presumably DESeq and the models should be responding to variance in
the data, for which I think the logFC values, p-values, mean values,
or standard errors are the most likely proxies to which I have easy
access.  So, let us pull the top/bottom n genes vis-à-vis each of
those categories and see what happens?

Here is a little function to compare an ML result to the above DE
table.  It should provide the importance plot, a venn of the two
methods, and the gene sets (with annotations) for the unique/shared
genes between this ML method and the DE table.  Note, this is just
using the highest/lowest logFC values and ignoring the p-value for now.

```{r}
comparison_lfc <- list()
comparison_adjp <- list()
comparison_wgcna <- list()
annotations <- fData(t_monocytes)  ## Arbitrarily taken

compare_ml_de <- function(importance, sig_table, annotations, n = comparison_n, plot_n = 15,
                          according_to = "deseq_logfc", column = "Overall") {
  ml_df <- importance[["importance"]] %>%
    arrange(desc(!!sym(column)))
  importance_plot <- plot(importance, top = plot_n)
  top_ml <- head(rownames(ml_df), n = n)
  de_df <- sig_table %>%
    arrange(desc(deseq_logfc))
  top_de <- head(rownames(de_df), n = n)
  bottom_de <- tail(rownames(de_df), n = n)
  top_bottom_ids <- c(top_de, bottom_de)
  comparison <- list("de" = top_bottom_ids, "ml" = top_ml)
  comparison_venn <- Vennerable::Venn(comparison)
  shared_ids <- comparison_venn@IntersectionSets["11"][[1]]
  shared_annot <- annotations[shared_ids, ]
  de_ids <- comparison_venn@IntersectionSets["10"][[1]]
  ml_ids <- comparison_venn@IntersectionSets["01"][[1]]
  only_de <- annotations[de_ids, ]
  only_ml <- annotations[ml_ids, ]
  comparison_plot <- Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
  retlist <- list(
    "importance_plot" = importance_plot,
    "comparison" = comparison,
    "comparison_venn" = comparison_venn,
    "comparison_plot" = comparison_plot,
    "shared" = shared_annot,
    "only_de" = only_de,
    "only_ml" = only_ml)
  return(retlist)
}
```

# Filtering

The ML text I am reading provides some neat examples of how one might
filter the data to make it more suitable for model creation.

## Near zero variance, or genefilter's cv

The first filter I was introduced to is quite familiar from our
sequencing data, the removal of features with near-zero-variance.
Indeed, I am pretty certain that normalize_expt() could do this
equivalently and significantly faster than caret::preProcess().

```{r}
nrow(exprs(input_data))
system.time({
  equivalent <- normalize_expt(input_data, filter = "cv", cv_min = 0.1)
})
dim(exprs(equivalent))

## Given this large amount of data, this step is slow, taking > 10 minutes.
## Yeah seriously, the following three lines get 16,723 genes in 10 minutes while
## the normalize_expt() call above gets 16,749 genes in 2.4 seconds.
#system.time({
#  nzv <- preProcess(texprs, method="nzv", uniqueCut=15)
#  nzv_texprs <- predict(nzv, texprs)
#  dim(nzv_texprs)
#})
nzv_texprs <- t(exprs(equivalent))
```

## Filtering to the highest standard deviation variables

I think I am a bit confused by this filter; one would think that the
nzv filter above, if applied correctly, should give you exactly this.

For the moment, I am excluding the following block in order to see how
much time/memory keeping these variables costs.  If I recall properly,
the model of the top-2k variant positions cost ~ 1-4G of memory.  I
hope that this scales linearly, but I am thinking it might not.

```{r}
standard_devs <- apply(nzv_texprs, 2, sd)
top_predictors <- order(standard_devs, decreasing = TRUE)[1:3000]
nzv_texprs <- nzv_texprs[, top_predictors]
```

## Center the data

I think centering may not be needed for this data, but here is how:

```{r}
nzv_center <- preProcess(nzv_texprs, method = "center")
nzv_texprs <- predict(nzv_center, nzv_texprs)
```

## Drop correlated

This is a filter which does not correspond to any of those we use in
sequencing data because genes which are highly correlated are
likely to be of immediate interest.

In the same fashion, I am tempted to leave this off because later
applications of this model will include low-coverage samples which may
not have every variant represented; for now it remains enabled.

```{r}
## This step takes a while...
system.time({
  nzv_correlated <- preProcess(nzv_texprs, method = "corr", cutoff = 0.95)
  nzv_uncorr <- predict(nzv_correlated, nzv_texprs)
})
dim(nzv_uncorr)
```

# Merge the appropriate metadata

There are a few metadata factors which might prove of interest for
classification.  The most obvious are of course outcome, clinic,
donor, visit, celltype.  I am, for the moment, only likely to focus on
outcome.  AFAICT I can only include one of these at a time in the
data, which is a shame.

```{r}
interesting_meta <- pData(input_data)[, c("finaloutcome", "donor", "persistence",
                                          "visitnumber", "selectionmethod",
                                          "typeofcells", "time")]

ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))
ml_df[["outcome_factor"]] <- as.factor(ml_df[["outcome_factor"]])
dim(ml_df)
```

# Split the data into training/testing

caret provides nice functionality for splitting up the data.  I
suspect there are many more fun knobs I can play with for instances
where I need to exclude some levels of a factor and such.  In this
case I just want to split by outcome.

## Via data splitting

```{r}
ml_df <- as.data.frame(cbind(outcome_factor, as.data.frame(nzv_uncorr)))

datasets <- create_partitions(nzv_uncorr, interesting_meta,
                              outcome_factor = outcome_factor)
```

## Via sampling

There are a few likely sampling methods: cross-validation,
bootstrapping, and jackknifing.  I will try those out later.
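
For reference, here is a minimal sketch of the trainControl()
configurations corresponding to those samplers; cross-validation and
bootstrap reappear below, and caret's leave-one-out CV ("LOOCV") is
the closest analogue to a jackknife:

```{r, eval=FALSE}
## Candidate resampling schemes; only cv and boot are actually used below.
cv_control <- trainControl(method = "cv", number = 10)
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")
loocv_control <- trainControl(method = "LOOCV")
```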

# Try out training and prediction methods

My goals from here on will be to get the beginnings of a sense of the
various methods I can use to create the models from the training data
and predict the outcome on the test data.  I am hoping also to pick up
some idea of what the various arguments mean while I am at it.

## Try out KNN

k-nearest neighbors is somewhat similar to a kmeans estimate.  Thus
the primary argument is 'k'.

### Model creation and performance

```{r}
split <- 1
train_all <- datasets[["trainers"]][[split]]
train_df <- datasets[["trainers_stripped"]][[split]]
train_idx <- datasets[["train_idx"]][[split]]
train_outcomes <- datasets[["trainer_outcomes"]][[split]]
test_df <- datasets[["testers"]][[split]]
test_idx <- datasets[["test_idx"]][[split]]
test_outcomes <- datasets[["tester_outcomes"]][[split]]

knn_fit <- knn3(x = train_df,
                y = train_outcomes,
                k = 3)
knn_predict_trained <- predict(knn_fit, train_df, type = "prob")

knn_train_evaluated <- self_evaluate_model(knn_predict_trained, datasets,
                                           which = split, type = "train")
knn_train_evaluated
```

As the confusion matrix shows, this failed for a few samples.  Perhaps
let us change k and see if it improves.

Here is a table of false positives/negatives for a few values of 'k';
in this context a false positive is calling a known cure as a failure
and a false negative is calling a known failure as a cure.

| k | fp | fn |
|---|----|----|
| 2 | 0  | 8  |
| 3 | 5  | 5  |
| 4 | 8  | 9  |
| 5 | 11 | 7  |
| 6 | 15 | 8  |

Note: this depends on the luck of the random number generator, so the
above numbers shift moderately from one run to the next.  Thus I think
I will just use 2 or 3.
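
One way to stabilize those counts between runs, assuming the variation
does come from the random train/test split, is to fix the RNG seed
before partitioning; a minimal sketch using caret's
createDataPartition():

```{r, eval=FALSE}
## With a fixed seed the partition, and thus the fp/fn table above, is
## reproducible from run to run.
set.seed(2023)
train_idx <- createDataPartition(outcome_factor, p = 0.8, list = FALSE)
```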

```{r}
knn_fit2 <- knn3(x = train_df,
                 y = train_outcomes,
                 k = 5)
knn_predict_trained2 <- predict(knn_fit2, train_df, type = "prob")

knn_train_evaluated2 <- self_evaluate_model(knn_predict_trained2, datasets,
                                            which = split, type = "train")
knn_train_evaluated2
```

### Predict the rest of the data with this model.

```{r}
knn_predict_test <- predict(knn_fit, test_df)

knn_test_evaluated <- self_evaluate_model(knn_predict_test, datasets,
                                     which = split, type = "test")
knn_test_evaluated

knn_predict_test2 <- predict(knn_fit2, test_df)
knn_test_evaluated2 <- self_evaluate_model(knn_predict_test2, datasets,
                                           which = split, type = "test")
knn_test_evaluated2
```

## Perform cross-validation to estimate k

The cross-validation method of repeatedly sampling the data is handled
entirely within the train() function.  With that in mind, here it is
operating with the knn method.

### CV with knn

When train() is called with trControl and tuneGrid arguments, we can
control how the knn training is repeated; in this case it will iterate
over k from 1 to 10.

This currently fails due to a stack overflow...

```{r, eval=FALSE}
cv_control <- trainControl(method = "cv", number = 10)

knn_train_fit <- train(outcome_factor ~ ., data = train_df,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
```
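
One possible workaround, untested here: with thousands of predictors
the formula interface expands a very large terms object, while caret's
x/y interface sidesteps that machinery entirely:

```{r, eval=FALSE}
## The same sweep over k, via the x/y interface instead of a formula.
knn_train_fit <- train(x = train_df, y = train_outcomes,
                       method = "knn",
                       trControl = cv_control,
                       tuneGrid = data.frame(k = 1:10))
```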

### Bootstrap with knn

```{r}
boot_control <- trainControl(method = "boot", number = 20,
                             returnResamp = "all")

knn_train_fit <- train(outcome ~ ., data = train_all,
                       method = "knn",
                       trControl = boot_control,
                       tuneGrid = data.frame(k = 1:10))
knn_train_fit[["bestTune"]]

plot(x = 1:10, 1 - knn_train_fit$results[, 2], pch = 19,
     ylab = "prediction error", xlab = "k")
lines(loess.smooth(x = 1:10, 1 - knn_train_fit$results[, 2], degree = 2),
      col = "#CC0000")
```

### Explain the important variables

In this instance we will search for genes which were important for the
model's creation.

The DALEX package provides a function: feature_importance() which
seeks to use a series of other methods to extract (in this case,
genes) features which do a good job of explaining the result produced
by the model.  In the case of this dataset, which has thousands of
features, this does not appear to end well.

```{r, eval=FALSE}
explainer_knn <- DALEX::explain(knn_fit, label = "knn",
                                data = train_df,
                                y = as.numeric(train_outcomes))

## AFAICT the following will take forever unless we drastically reduce the complexity of the model.
## yeah, I let it run for a week.
## features <- feature_importance(explainer_knn, n_sample = 50, type = "difference")
```

Conversely, we can use the varImp() function...

```{r}
knn_variable_importance <- varImp(knn_train_fit)
plot(knn_variable_importance, top = 15)
knn_variables <- knn_variable_importance[["importance"]] %>%
  arrange(desc(cure))
```

Given that, we can ask for the similarity to the DESeq results...

```{r}
knn_comparison <- compare_ml_de(knn_variable_importance, deseq_de_cf, annotations,
                                n = comparison_n, plot_n = 15, column = "cure")
knn_comparison[["comparison_plot"]]
comparison_lfc[["knn"]] <- knn_comparison[["shared"]]
```

## Random Forest

I think R/caret provides a few implementations of the forest family of
methods; this one uses ranger (@wrightRangerFastImplementation2017a).

The parameter 'mtry' is often important; if I read the text correctly,
it controls how many variables to sample at each split of the tree.
Thus higher numbers should presumably make it more specific at the
risk of overfitting.

min.node.size sets the minimum size of the terminal nodes in each
tree.  Each increment up speeds the algorithm.

I am going to use my boot control trainer from above and see how it goes.

```{r}
rf_train_fit <- train(outcome ~ ., data = train_all,
                method = "ranger", trControl = boot_control,
                importance = "permutation",
                tuneGrid = data.frame(
                    mtry = 200,
                    min.node.size = 1,
                    splitrule = "gini"),
                verbose = TRUE)
rf_train_fit[["finalModel"]][["prediction.error"]]

rf_variable_importance <- varImp(rf_train_fit)
plot(rf_variable_importance, top = 15)
rf_variables <- rf_variable_importance[["importance"]] %>%
  arrange(desc(Overall))

rf_predict_trained <- predict(rf_train_fit, train_df)
rf_predict_evaluated <- self_evaluate_model(rf_predict_trained, datasets,
                                            which = split, type = "train")
rf_predict_evaluated
```
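
To actually probe the mtry/min.node.size trade-off described above,
rather than fixing mtry at 200, a sweep might look like the following
sketch (the values are illustrative, not tuned):

```{r, eval=FALSE}
## expand.grid() crosses the candidate values; caret's ranger method
## tunes over mtry, splitrule, and min.node.size.
rf_sweep <- train(outcome ~ ., data = train_all,
                  method = "ranger", trControl = boot_control,
                  importance = "permutation",
                  tuneGrid = expand.grid(
                    mtry = c(50, 200, 800),
                    min.node.size = c(1, 5, 10),
                    splitrule = "gini"),
                  verbose = TRUE)
rf_sweep[["bestTune"]]
```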

### What would be shared between DESeq2 and the ML classifier?

```{r}
rf_comparison <- compare_ml_de(rf_variable_importance, deseq_de_cf, annotations,
                               n = comparison_n, plot_n = 15)
rf_comparison[["comparison_venn"]]
comparison_lfc[["rf"]] <- rf_comparison[["shared"]]
```

### Compare to the WGCNA results

A couple months ago I spent a little time attempting to recapitulate
Alejandro's WGCNA results.  I think I did so by mostly copy/pasting
his work and adding some commentary and tweaking parts of it so that
it was easier for me to read/understand.  In the process, I generated
a series of modules which looked similar/identical to his.
Unfortunately, I did not add some sections to record the genes/modules
to some output files.  I am therefore going back to that now and doing
so in the hopes that I can compare those modules to the results
produced by the classifiers.

It seems that the interest in this comparison has waned, so I am just
going to disable it.

```{r, eval=FALSE}
wgcna_result <- openxlsx::readWorkbook(glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))
rownames(wgcna_result) <- wgcna_result[["row.names"]]
wgcna_result[["row.names"]] <- NULL

top_ml <- rownames(head(rf_variables, n = comparison_n))
comparison <- list("wgcna" = rownames(wgcna_result), "ml" = top_ml)
comparison_venn <- Vennerable::Venn(comparison)
comparison_wgcna[["rf"]] <- annotations[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
```

#### Digression: do the genes provided by varImp() mean anything?

Let us take a moment and see if the top-n genes returned by varImp()
have some meaning which jumps out.  One might assume, given our extant
Differential Expression results, that the interleukin response will be
a likely candidate.

```{r}
importance_gp <- simple_gprofiler(rownames(head(rf_variables, n = comparison_n)))
importance_gp
```

### Now the random forest testers!

```{r}
rf_predict_test <- predict(rf_train_fit, test_df)

rf_predict_test_evaluated <- self_evaluate_model(rf_predict_test, datasets,
                                     which = split, type = "test")
rf_predict_test_evaluated
```

## GLM, or Logistic regression and regularization

Logistic regression is a statistical method for binary responses,
though it is able to handle multiple classes as well.  The general
idea of this method is to find parameters which maximize the
likelihood that the observed data was sampled from a statistical
distribution of interest.  The transformations and linear
regression-esque tasks involved are confusing, but once they are
performed, the task becomes setting the model's (fitting) parameters
to values which increase the probability that the statistical model
resembles the actual training data and, when sampled, will return
similar values.  The most likely statistical distribution one will
want to fit is the Gaussian, in which case we want to
transform/normalize the mean/variance of our variables so they look
like whatever normal distribution we are using.  Conversely, logistic
regression uses a binomial distribution (like our raw sequencing
data!) whose response is bounded between 0 and 1.
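
As a minimal illustration of that 0-1 mapping, the logistic
(inverse-logit) function squashes any linear predictor into a
probability, which is what the family = "binomial" fits below rely on:

```{r, eval=FALSE}
## plogis(x) = 1 / (1 + exp(-x)), the canonical sigmoid.
curve(plogis, from = -6, to = 6,
      xlab = "linear predictor", ylab = "predicted probability")
```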

### Using a single gene

Let us take the most important gene observed in one of our previous
training sets: ENSG00000248405 PRR5-ARHGAP8

```{r, eval=FALSE}
gene_id <- "ENSG00000248405"
single_fit <- train(
    outcome ~ ENSG00000248405, data = train_all,
    method = "glm", family = "binomial", trControl = trainControl("none"))

tt <- data.frame("ENSG00000248405" = seq(min(train_df[[gene_id]]),
                                         max(train_df[[gene_id]]), len = 100))
## predict probabilities for the simulated data
tt$subtype <- predict(single_fit, newdata = tt, type = "prob")[, 1]
## plot the sigmoid curve and the training data
plot(ifelse(outcome == "cure", 1, 0) ~ ENSG00000248405,
     data = train_all, col = "red4",
     ylab = "CF as 0 or 1", xlab = "favorite gene expression")
lines(subtype ~ ENSG00000248405, tt, col = "green4", lwd = 2)

plot_df <- train_all[, c("outcome", "ENSG00000248405")]
ggbetweenstats(plot_df, "outcome", "ENSG00000248405")
```

Having tried with 1 gene, let us extend this to all genes.  In my
first try of this, it took a long time.

```{r}
glm_train_fit <- train(outcome ~ ., data = train_all,
                 trControl = boot_control,
                 method = "glm", family = "binomial")
```

## Compare GLM and WGCNA/DE

The previous block does not take into account our parameter sweep, so
it isn't theoretically very useful for this series of methodological
tests.  As a result, the following block is also not particularly
helpful -- though I suspect the results are identical to what follows.

```{r}
glm_variable_importance <- varImp(glm_train_fit)
## Oh, this only produces 100 entries -- so me getting the top 400 is silly.
glm_variables <- glm_variable_importance[["importance"]] %>%
  arrange(desc(Overall))
tt <- plot(glm_variable_importance, top = 15)
simple_gprofiler(rownames(head(glm_variables, n = comparison_n)))

top_glm <- rownames(glm_variables)
comparison <- list("de" = top_bottom_ids, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
comparison_lfc[["glm"]] <- fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
```

Dropping the adjp and wgcna comparisons, as they show little overlap.

```{r, eval=FALSE}
comparison <- list("de" = lowest_adjp, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
comparison_adjp[["glm"]] <- fData(c_monocytes)[comparison_venn@IntersectionSets["11"][[1]], ]
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")

comparison <- list("de" = highest_exprs, "ml" = top_glm)
comparison_venn <- Vennerable::Venn(comparison)
Vennerable::plot(comparison_venn, doWeights = FALSE, type = "circles")
```

In this block, I repeat the above with the tuneGrid option set so that
it may do the training using a sweep of putatively helpful parameters;
as a result it should provide a more appropriate (but quite possibly
identical) trainer.

```{r}
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
## train_method <- trainControl(method = "cv", number = 10)
glm_fit <- train(outcome ~ ., data = train_all, method = "glmnet",
                 trControl = boot_control, importance = "permutation",
                 tuneGrid = data.frame(
                   alpha = 0.5,
                   lambda = seq(0.1, 0.7, 0.05)),
                 verbose = TRUE)
glm_fit

glm_predict_trained <- predict(glm_fit, train_df)

glm_train_eval <- self_evaluate_model(glm_predict_trained, datasets,
                                      which = split, type = "train")
glm_train_eval
```

### Now the GLM testers!

```{r}
glm_predict_test <- predict(glm_fit, test_df)

glm_fit_eval_test <- self_evaluate_model(glm_predict_test, datasets,
                                         which = split, type = "test")
glm_fit_eval_test
```

### Compare again vs the DE table

In the following block I extract the top and bottom DE genes vs. the
top set of genes in the glm_variables ordered by the 'Overall' column.

I should make this a function; there is no way this will be kept
correct across multiple iterations of the ml/wgcna/etc. comparisons.

I am going to stop trying to compare these results across
WGCNA/adjp/expression, those comparisons are all basically null.

```{r}
glm_variable_importance <- varImp(glm_fit)
glm_comparison <- compare_ml_de(glm_variable_importance, deseq_de_cf, annotations,
                                n = comparison_n, plot_n = 15)
comparison_lfc[["glm"]] <- glm_comparison[["shared"]]
```

## Gradient Booster

At this point I basically have a feeling for how one may use the
tuneGrid to apply a parameter sweep along with the various (in this
case cv) sampling methods.  Thus, the following is basically identical
to the previous blocks except that it uses xgbTree, the R
implementation of extreme gradient boosting
(@chenXGBoostScalableTree2016a).

```{r}
##rf_method <- trainControl(method = "ranger", number = 10, verbose = TRUE)
train_method <- trainControl(method = "cv", number = 10)

gb_fit <- train(outcome ~ ., data = train_all,
                method = "xgbTree", trControl = train_method,
                tuneGrid = data.frame(
                    nrounds = 200,
                    eta = c(0.05, 0.1, 0.3),
                    max_depth = 4,
                    gamma = 0,
                    colsample_bytree = 1,
                    subsample = 0.5,
                    min_child_weight = 1),
                verbose = TRUE)

gb_predict_trained <- predict(gb_fit, train_df)
gb_predict_trained

gb_train_eval <- self_evaluate_model(gb_predict_trained, datasets,
                                     which = split, type = "train")
gb_train_eval
```

### Now the GB testers!

```{r}
gb_predict_test <- predict(gb_fit, test_df)

gb_predict_test_evaluated <- self_evaluate_model(gb_predict_test, datasets,
                                                 which = split, type = "test")
gb_predict_test_evaluated
```

```{r}
gb_variable_importance <- varImp(gb_fit)
gb_variables <- gb_variable_importance[["importance"]] %>%
  arrange(desc(Overall))
plot(gb_variable_importance, top = 15)
gb_comparison <- compare_ml_de(gb_variable_importance, deseq_de_cf, annotations, n = comparison_n, plot_n = 15)
comparison_lfc[["gb"]] <- gb_comparison[["shared"]]
```

# Shared importance

```{r}
upset_lfc <- list()
upset_adjp <- list()
upset_wgcna <- list()
for (d in seq_along(comparison_lfc)) {
  name <- names(comparison_lfc)[d]
  upset_lfc[[name]] <- rownames(comparison_lfc[[name]])
  upset_adjp[[name]] <- rownames(comparison_adjp[[name]])
  ##upset_wgcna[[name]] <- rownames(comparison_wgcna[[name]])
}

start_lfc <- UpSetR::fromList(upset_lfc)
lfc_vs_ml <- UpSetR::upset(start_lfc)
```

## A different way to query shared/unique genes

The various gene sets observed by these methods may also be compared
directly by an upset plot using the deseq_de_cf with
(gb|glm|rf|knn)_variable_importance...

```{r}
comp_number <- 300
de_ml_sig_list <- list(
  "DESeq2" = rownames(deseq_de_cf),
  "knn" = rownames(head(knn_variables, n = comp_number)),
  "rf" = rownames(head(rf_variables, n = comp_number)),
  "glm" = rownames(head(glm_variables, n = comp_number)),
  "gb" = rownames(head(gb_variables, n = comp_number)))
important_genes_upset_input <- UpSetR::fromList(de_ml_sig_list)
important_genes_upset <- UpSetR::upset(important_genes_upset_input)
print(important_genes_upset)

important_genes_overlap <- overlap_groups(de_ml_sig_list)
## Keep each pairwise DESeq2/ML shared set in its own variable rather
## than overwriting a single shared_annotations object four times.
knn_shared_annot <- fData(input_data)[
  as.character(overlap_geneids(important_genes_overlap, "DESeq2:knn")), ]
rf_shared_annot <- fData(input_data)[
  as.character(overlap_geneids(important_genes_overlap, "DESeq2:rf")), ]
gb_shared_annot <- fData(input_data)[
  as.character(overlap_geneids(important_genes_overlap, "DESeq2:gb")), ]
glm_shared_annot <- fData(input_data)[
  as.character(overlap_geneids(important_genes_overlap, "DESeq2:glm")), ]
```

# Collate shared genes by method and write them

Over the course of this document I recorded the limited agreement
between the DESeq results and the ML classifiers in the data structure
'comparison_lfc'.  Let us write that out to an xlsx file...

```{r}
de_ml_xlsx <- glue("excel/tc_compare_de_ml_top3kvariant-v{ver}.xlsx")
xlsx <- init_xlsx(de_ml_xlsx)
wb <- xlsx[["wb"]]
excel_basename <- xlsx[["basename"]]
written <- write_xlsx(data = comparison_lfc[["knn"]], wb = wb, sheet = "knn_vs_deseq")
written <- write_xlsx(data = comparison_lfc[["rf"]], wb = wb, sheet = "rf_vs_deseq")
written <- write_xlsx(data = comparison_lfc[["glm"]], wb = wb, sheet = "glm_vs_deseq")
written <- write_xlsx(data = comparison_lfc[["gb"]], wb = wb, sheet = "gb_vs_deseq")
excel_ret <- openxlsx::saveWorkbook(wb, de_ml_xlsx, overwrite = TRUE)
```

I realized something important in this block: at no point in any of my
analyses do I really record the classifications; I just record their
(dis)concordance with our known annotations.  In the following blocks I
iterate over multiple rounds of the various classifiers; let us modify
that function to record this information in some useful fashion.

# Performing multiple train/test rounds in one call

Doing all of the above for each method and collecting the results is
super fiddly, so I wrote a little function to try to make that easier.

WARNING: Turning on the following blocks takes ~ 2 days on our pretty
nice server.  Your mileage may vary.
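
For orientation, the rough shape of such a helper is sketched below;
iterate_classifier() is purely illustrative (it is not the
classify_n_times() actually used in the following blocks), but it
shows the idea: repeat the split/train/test cycle and keep the
per-sample predictions so the classifications themselves get recorded.

```{r, eval=FALSE}
## Hypothetical sketch of an n-iteration classifier wrapper.
iterate_classifier <- function(texprs, meta, outcome_column,
                               method = "knn", n = 10) {
  outcomes <- as.factor(meta[[outcome_column]])
  results <- list()
  for (i in seq_len(n)) {
    idx <- createDataPartition(outcomes, p = 0.8, list = FALSE)
    fit <- train(x = texprs[idx, ], y = outcomes[idx], method = method,
                 trControl = trainControl(method = "cv", number = 10))
    ## Record the actual test-set calls, not just their concordance.
    results[[i]] <- data.frame(
      iteration = i,
      sample = rownames(texprs)[-idx],
      observed = outcomes[-idx],
      predicted = predict(fit, texprs[-idx, ]))
  }
  do.call(rbind, results)
}
```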

## Setting up input data for multiple ML iterations

There are four versions of the data that Maria Adelaida and Najib are
interested in:

* Tumaco+Cali all visits
* Tumaco+Cali visit 1
* Tumaco all visits
* Tumaco visit 1

Ergo, let us set up an appropriate naming convention for them to make
these iterative runs more coherent.  I will continue using 'tc' and
't' for the clinics as a prefix and follow it with 'vall' vs. 'v1'.

```{r}
tc_vall_input <- subset_expt(tc_clinical, subset = "batch!='biopsy'")
t_vall_input <- subset_expt(tc_vall_input, subset = "clinic=='tumaco'")
tc_v1_input <- subset_expt(tc_vall_input, subset = "visitbipart=='v1'")
t_v1_input <- subset_expt(t_vall_input, subset = "visitbipart=='v1'")

tc_vall_input <- normalize_expt(tc_vall_input, transform = "log2", convert = "cpm")
t_vall_input <- normalize_expt(t_vall_input, transform = "log2", convert = "cpm")
tc_v1_input <- normalize_expt(tc_v1_input, transform = "log2", convert = "cpm")
t_v1_input <- normalize_expt(t_v1_input, transform = "log2", convert = "cpm")

tc_vall_outcome_factor <- as.factor(as.character(pData(tc_vall_input)[[ref_col]]))
t_vall_outcome_factor <- as.factor(as.character(pData(t_vall_input)[[ref_col]]))
tc_v1_outcome_factor <- as.factor(as.character(pData(tc_v1_input)[[ref_col]]))
t_v1_outcome_factor <- as.factor(as.character(pData(t_v1_input)[[ref_col]]))
```

## Perform normalization/filtering operations as per the exploratory document

In the following block I will use the same normalizations and filters
as above on each of the four inputs.

```{r}
tc_vall_norm <- normalize_expt(tc_vall_input, filter = "cv", cv_min = 0.1)
tc_v1_norm <- normalize_expt(tc_v1_input, filter = "cv", cv_min = 0.1)
t_vall_norm <- normalize_expt(t_vall_input, filter = "cv", cv_min = 0.1)
t_v1_norm <- normalize_expt(t_v1_input, filter = "cv", cv_min = 0.1)

tc_vall_texprs <- t(exprs(tc_vall_norm))
tc_v1_texprs <- t(exprs(tc_v1_norm))
t_vall_texprs <- t(exprs(t_vall_norm))
t_v1_texprs <- t(exprs(t_v1_norm))

sd_genes <- 3000
tc_vall_stdev <- apply(tc_vall_texprs, 2, sd)
tc_vall_pred <- order(tc_vall_stdev, decreasing = TRUE)[1:sd_genes]
tc_vall_texprs <- tc_vall_texprs[, tc_vall_pred]
tc_v1_stdev <- apply(tc_v1_texprs, 2, sd)
tc_v1_pred <- order(tc_v1_stdev, decreasing = TRUE)[1:sd_genes]
tc_v1_texprs <- tc_v1_texprs[, tc_v1_pred]
t_vall_stdev <- apply(t_vall_texprs, 2, sd)
t_vall_pred <- order(t_vall_stdev, decreasing = TRUE)[1:sd_genes]
t_vall_texprs <- t_vall_texprs[, t_vall_pred]
t_v1_stdev <- apply(t_v1_texprs, 2, sd)
t_v1_pred <- order(t_v1_stdev, decreasing = TRUE)[1:sd_genes]
t_v1_texprs <- t_v1_texprs[, t_v1_pred]

tc_vall_preproc <- preProcess(tc_vall_texprs, method = "center")
tc_vall_texprs <- predict(tc_vall_preproc, tc_vall_texprs)
tc_v1_preproc <- preProcess(tc_v1_texprs, method = "center")
tc_v1_texprs <- predict(tc_v1_preproc, tc_v1_texprs)
t_vall_preproc <- preProcess(t_vall_texprs, method = "center")
t_vall_texprs <- predict(t_vall_preproc, t_vall_texprs)
t_v1_preproc <- preProcess(t_v1_texprs, method = "center")
t_v1_texprs <- predict(t_v1_preproc, t_v1_texprs)

tc_vall_preproc <- preProcess(tc_vall_texprs, method = "corr", cutoff = 0.95)
tc_vall_texprs <- predict(tc_vall_preproc, tc_vall_texprs)
tc_v1_preproc <- preProcess(tc_v1_texprs, method = "corr", cutoff = 0.95)
tc_v1_texprs <- predict(tc_v1_preproc, tc_v1_texprs)
t_vall_preproc <- preProcess(t_vall_texprs, method = "corr", cutoff = 0.95)
t_vall_texprs <- predict(t_vall_preproc, t_vall_texprs)
t_v1_preproc <- preProcess(t_v1_texprs, method = "corr", cutoff = 0.95)
t_v1_texprs <- predict(t_v1_preproc, t_v1_texprs)

chosen_factors <- c("finaloutcome", "donor", "persistence", "visitnumber",
                    "selectionmethod", "typeofcells", "time")
tc_vall_meta <- pData(tc_vall_norm)[, chosen_factors]
tc_v1_meta <- pData(tc_v1_norm)[, chosen_factors]
t_vall_meta <- pData(t_vall_norm)[, chosen_factors]
t_v1_meta <- pData(t_v1_norm)[, chosen_factors]

ref_col <- "finaloutcome"
tc_vall_outcome <- as.factor(as.character(pData(tc_vall_norm)[[ref_col]]))
tc_v1_outcome <- as.factor(as.character(pData(tc_v1_norm)[[ref_col]]))
t_vall_outcome <- as.factor(as.character(pData(t_vall_norm)[[ref_col]]))
t_v1_outcome <- as.factor(as.character(pData(t_v1_norm)[[ref_col]]))

tc_vall_ml_df <- cbind.data.frame(tc_vall_outcome, as.data.frame(tc_vall_texprs))
tc_vall_ml_df[[ref_col]] <- as.factor(tc_vall_ml_df[["tc_vall_outcome"]])
tc_v1_ml_df <- cbind.data.frame(tc_v1_outcome, as.data.frame(tc_v1_texprs))
tc_v1_ml_df[[ref_col]] <- as.factor(tc_v1_ml_df[["tc_v1_outcome"]])
t_vall_ml_df <- cbind.data.frame(t_vall_outcome, as.data.frame(t_vall_texprs))
t_vall_ml_df[[ref_col]] <- as.factor(t_vall_ml_df[["t_vall_outcome"]])
t_v1_ml_df <- cbind.data.frame(t_v1_outcome, as.data.frame(t_v1_texprs))
t_v1_ml_df[[ref_col]] <- as.factor(t_v1_ml_df[["t_v1_outcome"]])
```

## Iterate Tumaco+Cali all visits

```{r, eval=FALSE}
tc_vall_summary_xlsx <- glue("excel/tc_vall_ml_summary-v{ver}.xlsx")
tc_vall_knn <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(tc_vall_knn, excel = tc_vall_summary_xlsx)
tc_vall_gb <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(tc_vall_gb, excel = written[["wb"]])
tc_vall_glm <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(tc_vall_glm, excel = written[["wb"]])
tc_vall_rf <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(tc_vall_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = tc_vall_summary_xlsx)
```

## Iterate Tumaco+Cali Visit 1

```{r, eval=FALSE}
tc_v1_summary_xlsx <- glue("excel/tc_v1_ml_summary-v{ver}.xlsx")
tc_v1_knn <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(tc_v1_knn, excel = tc_v1_summary_xlsx)
tc_v1_gb <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(tc_v1_gb, excel = written[["wb"]])
tc_v1_glm <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(tc_v1_glm, excel = written[["wb"]])
tc_v1_rf <- classify_n_times(tc_v1_texprs, tc_v1_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(tc_v1_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = tc_v1_summary_xlsx)
```

## Tumaco all visits

```{r, eval=FALSE}
t_vall_summary_xlsx <- glue("excel/t_vall_ml_summary-v{ver}.xlsx")
t_vall_knn <- classify_n_times(t_vall_texprs, t_vall_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(t_vall_knn, excel = t_vall_summary_xlsx)
t_vall_gb <- classify_n_times(t_vall_texprs, t_vall_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(t_vall_gb, excel = written[["wb"]])
t_vall_glm <- classify_n_times(t_vall_texprs, t_vall_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(t_vall_glm, excel = written[["wb"]])
t_vall_rf <- classify_n_times(t_vall_texprs, t_vall_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(t_vall_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = t_vall_summary_xlsx)
```

## Tumaco Visit 1

```{r, eval=FALSE}
t_v1_summary_xlsx <- glue("excel/t_v1_ml_summary-v{ver}.xlsx")
t_v1_knn <- classify_n_times(t_v1_texprs, t_v1_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(t_v1_knn, excel = t_v1_summary_xlsx)
t_v1_gb <- classify_n_times(t_v1_texprs, t_v1_meta,
                            outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(t_v1_gb, excel = written[["wb"]])
t_v1_glm <- classify_n_times(t_v1_texprs, t_v1_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(t_v1_glm, excel = written[["wb"]])
t_v1_rf <- classify_n_times(t_v1_texprs, t_v1_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(t_v1_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = t_v1_summary_xlsx)
```

# Bibliography
