DOI

1 Introduction

The R markdown documents in this directory are intended to provide a complete accounting of the analyses performed in the preparation of

“Innate biosignature of treatment failure in patients with cutaneous leishmaniasis.”

I assume that if anyone ever reads this, s/he is looking for the source of the data, figures, and tables from that paper and to see if the methods employed to generate them are good, bad, indifferent, or garbage. These documents are being generated by a singularity container which provides the input count tables (Once we have accessions, I will make a companion which is able to create them), sample sheets, input documents, and all the software used to process them. The overal configuration of the container is contained in the toplevel .yml file and provides the base OS (Debian stable) and the scripts used to install the software. ‘local/bin/setup_debian.sh’ handles the base container; ‘local/bin/setup_hpgltools.sh’ sets up R, bioconductor, and my package ‘hpgltools’, and ‘local/bin/runscript’ is the script run if one invokes the container without any arguments; when run it uses the various installed R packages to ‘render’ the Rmd files in /data to freshly created html. This README is the first document rendered.

2 Figure/Table locations

As of 20240919, Maria Adelaida kindly sent me the putatively final set of figures/tables. I am going to lay out the location in the html logs and their Rmd parents where these may be found. If you wish to play along, make sure you run the entire 01datastructures.Rmd.

Also, this container is automatically regenerated and rerun when I make changes; so one may just go here to play along (that is where I am going to hunt down the figures/tables):

https://containerbook.umiacs.io/

2.1 Figure 1

These numbers are coming from 01datastructures: when considering the samples from the perspective of how many people fall into each category, that is coming directly from section 4.1 “Metadata Sources” and the demographics xlsx file. This provides the number of people who fall into each category of panel A/B.

2.2 Figure 2

2.2.1 Panel A

TODO: Clarify v1/v2/v3 vs. Pre-Tx, Mid-Tx, End-Tx in the notebook

This is also derived from 01datastructures, but from the perspective of the numbers of samples which survive our various filters. Thus, to arrive at these numbers, one should start at section 6.1 “Create Expressionset.” and wander through the document; however doing so will likely make most people sad because it is a long journey. Instead, you may skip down to section 14 “Summarize: Tabulate sample numbers”. The section labeled ‘Both’ provides these numbers. Note, this table used to be just Tumaco, which follows.

The numbers of samples in each group are restated in the summaries. The only real caveat is that I wrote them as ‘visit1’ or ‘v1’ for Pre-Tx, ‘visit2’ for Mid-Tx, and ‘visit3’ for End-Tx.

Another way to recapitulate these numbers is to check out the sankey plots in section 8 ‘Visualize the sample breakdown’. An example invocation looks like:

clinic_type_outcome_sankey <- plot_meta_sankey(
  tc_valid, factors = c("clinic", "typeofcells", "finaloutcome"),
  drill_down = TRUE, color_choices = color_choices)
clinic_type_outcome_sankey

clinic_ethnicity_outcome_sankey <- plot_meta_sankey(
  tc_valid, factors = c("clinic", "etnia", "finaloutcome"),
  drill_down = TRUE, color_choices = color_choices)
clinic_ethnicity_outcome_sankey

clinic_sex_outcome_sankey <- plot_meta_sankey(
  tc_valid, factors = c("clinic", "sex", "finaloutcome"),
  drill_down = TRUE, color_choices = color_choices)
clinic_sex_outcome_sankey

2.2.2 Panel B

Found in 02visualization, section 8 “Global views of all cell types” Hey, check it, different types of immune cells are different!

One invocation looks like this (there are a few versions):

tc_pca <- plot_pca(tc_norm, plot_labels = FALSE,
                    plot_title = "PCA - Cell type", size_column = "visitnumber")
tc_pca

2.2.3 Panel C

Ibid. Both panels are actually a little further down in section 8.1.

Here is the invocation:

tc_cf_corheat <- plot_corheat(tc_cf_norm, plot_title = "Heirarchical clustering:
         cell types")
tc_cf_corheat

2.3 Figure 3

Maria Adelaida did a tremendous amount of work to move the various ggplot legends around so that this fits in a single panel in a coherent fashion.

2.3.1 Panel A

Section 9.8 of 02visualization: “PCA: Compare clinics”

tc_clinic_type_nb_pca <- plot_pca(tc_clinic_type_nb)
tc_clinic_type_nb_pca

2.3.2 Panel B

Section 9.5 of 02visualization: “Eosinophils by clinic”

tc_eosinophils_pca <- plot_pca(tc_eosinophils_norm, plot_labels = FALSE)
tc_eosinophils_pca

2.3.3 Panel C

Section 9.6 of 02visualization: “Monocytes by clnic”

tc_monocytes_pca <- plot_pca(tc_monocytes_norm, plot_labels = FALSE)
tc_monocytes_pca

2.3.4 Panel D

Section 9.7 of 02visualization: “Neutrophils by clnic”

tc_neutrophils_pca <- plot_pca(tc_neutrophils_norm, plot_labels = FALSE)
tc_neutrophils_pca

2.3.5 Panel E

Section 9.1 of 02visualization: “Biopsies”

tc_biopsies_pca <- plot_pca(tc_biopsies_norm)
tc_biopsies_pca

2.4 Figure 4

2.4.1 Panel A, left

Section 10.2.4 “Figure 4A: Monocytes”

t_monocyte_pca <- plot_pca(t_monocyte_norm,
  plot_labels = FALSE)
t_monocyte_pca

2.4.2 Panel A, middle

Section 10.2.8 “Figure 4A: Eosinophils”

t_eosinophil_pca <- plot_pca(t_eosinophil_norm,
                             plot_labels = FALSE)
t_eosinophil_pca

2.4.3 Panel A, right

Section 10.2.6 “Figure 4A: Neutrophils”

t_neutrophil_pca <- plot_pca(t_neutrophil_norm,
                             plot_labels = FALSE)
t_neutrophil_pca

2.5 Panel B

These move to the document 04differential_expression_tumaco.

2.5.1 Panel B, left, middle, and right

Section 7.0.4 “Figure 4B: Volcano plots” For some reason I did not have separate sections for each cell type.

num_color <- color_choices[["clinic_cf"]][["tumaco_failure"]]
den_color <- color_choices[["clinic_cf"]][["tumaco_cure"]]
wanted_genes <- c("FI44L", "IFI27", "PRR5", "PRR5-ARHGAP8", "RHCE",
                  "FBXO39", "RSAD2", "SMTNL1", "USP18", "AFAP1")

cf_monocyte_table <- t_cf_monocyte_table_sva[["data"]][["outcome"]]
cf_monocyte_volcano <- plot_volcano_condition_de(
  cf_monocyte_table, "outcome", label = wanted_genes,
  fc_col = "deseq_logfc", p_col = "deseq_adjp", line_position = NULL,
  color_high = num_color, color_low = den_color, label_size = 6)
pp(file = glue("images/cf_monocyte_volcano_labeled-v{ver}.svg"))
cf_monocyte_volcano[["plot"]]
dev.off()

2.5.2 Panel C

TODO: Add similar venn ~ section 8

I played with versions of this using AUCC and UpSet plots throughout the 04differential_expression_tumaco document; but did not generate this specific image. (Maybe I should make a version of it?)

2.6 Figure 5

TODO: Clarify section names

I did not perform any of the STRING analyses and therefore am unable to comment on panels A,C,E,G. I should bug Alejandro.

The lower panels are all coming from 05enrichment and are using the treeplot() result from gotermsim() of the gProfiler2 results.

2.6.1 Panel B

05enrichment, section 2.10.1 “gProfiler” (I am going to change these headings to be a bit more informative…)

t_cf_eosinophil_sig_sva_up_gp <- simple_gprofiler(
  t_cf_eosinophil_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_eosinophil_up_gp-v{ver}.xlsx"))
t_cf_eosinophil_sig_sva_up_gp

2.6.2 Panel D

05enrichment, Section 2.12.1 “gProfiler”.

t_cf_neutrophil_sig_sva_up_gp <- simple_gprofiler(
  t_cf_neutrophil_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_neutrophil_up_gp-v{ver}.xlsx"))
t_cf_neutrophil_sig_sva_up_gp

2.6.3 Panel F

Ibid, Section 2.11.1 “gProfiler”

t_cf_monocyte_sig_sva_up_gp <- simple_gprofiler(
  t_cf_monocyte_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_monocyte_up_gp-v{ver}.xlsx"))
t_cf_monocyte_sig_sva_up_gp

2.6.4 Panel H

Ibid, Section 2.4.1 “gProfiler” but it is pretty far down; just before the clusterProfiler analyses.

t_cf_clinicalnb_gp_down <- simple_gprofiler(
  t_clinicalnb_cf_sig_sva_down,
  excel = glue("{xlsx_prefix}/Gene_Set_Overrepresentation/clinicalnb_fail_up_gp-v{ver}.xlsx"))
t_cf_clinicalnb_gp_down

## and

go_termsim <- enrichplot::pairwise_termsim(t_cf_clinicalnb_gp_down[["BP_enrich"]])
t_cf_clinicalnb_gp_go_down_tree <- sm(enrichplot::treeplot(go_termsim))
pp(file = "images/overrepresentation/t_cf_clinicalnb_gp_down_tree.pdf",
   width = treeplot_width, height = treeplot_height)
t_cf_clinicalnb_gp_go_down_tree
dev.off()

2.7 Figure S1

TODO: Clarify sections

Head back to 02visualization; this is where Theresa taught me regression analysis!

2.7.1 Panel A

02visualization, Near the bottom of section 7.2 “Regression analyses vs outcome” (I need to rename these sections too, these are cross correlations, not regression; I used to have them all together in a big pile and split them up, but didn’t properly rename the headings.)

t_regression_queries <- c("Weight", "Sex", "Ethnicity", "Age")
t_cross_df <- t_regression_numeric[, t_regression_queries]
t_regression_cross <- corr_cross(t_cross_df)
pp(file = "figures/tumaco_weight_sex_ethnicity_age_numeric_crosscor.svg")
t_regression_cross
dev.off()

2.7.2 Panel B

02visualization, 7.2.1 “Copy these with only the Tumaco people”

regression_tests <- c("Age", "Clinic", "Ethnicity", "Sex", "Weight")
lm_regression_demographics <- extract_linear_regression(
  regression_numeric, query = "Therapeutic_Outcome_Final", factors = regression_tests,
  excel = glue("excel/numeric_demographics_regression_final_sex_clinic_ethnicity_age-v{ver}.xlsx"))

## and

pp(file = "images/demographics_only_linear_regression.svg")
lm_regression_demographics[["forest"]]
dev.off()

2.7.3 Panel C

Ibid, Section 7.2.2

tc_log_iterative_regression_demographics <- iterate_logistic_regression(
  regression_df, query = "Therapeutic_Outcome_Final", factors = regression_tests,
  excel = glue("excel/tc_simple_logistic_regression-v{ver}.xlsx"))

## and

pp(file = glue("figures/tc_simple_logistic_regression-v{ver}.svg"))
tc_log_iterative_regression_demographics[["forest"]]
dev.off()

2.7.4 Panel D

Ibid.

t_log_regression_demographics <- extract_logistic_regression(
  t_regression_df, query = "Therapeutic_Outcome_Final", factors = t_regression_tests,
  excel = glue("excel/t_multivariable_logistic_regression-v{ver}.xlsx"))

## and

pp(file = glue("figures/t_multivariable_logistic_regression-v{ver}.svg"))
t_log_regression_demographics[["forest"]]
dev.off()

2.7.5 Panels E and F

TODO: readd

Hmm I deleted these from my notebook, I did not think we were going to use them but instead only include C/D. Adding them back momentarily.

2.8 Figure S2

2.8.1 Panel B/C

These are waay down in Section 16 “Parasite distribution” of 02visualization.

lp_cf_norm_pca <- plot_pca(lp_cf_norm)

2.9 Figure S4

TODO: rename section

One must return to 01datastructures for this image. It was used to define the coverage cutoff for filtering samples and is found at section 7.2 “Figure S2: Non-zero genes before sample filtering” (I guess I should rename this too)

all_nz <- plot_nonzero(hs_expt)

2.10 Figure S5

These are all scattered through 02visualization.

2.10.1 Panel A

Section 9.8 “Compare clinics”

tc_clinic_type_pca <- plot_pca(tc_clinic_type_norm)
tc_clinic_type_pca

2.10.2 Panel B

Section 9.5 “Eosinophils by clinic” No fails from Cali for Eosinophils!

tc_eosinophils_pca <- plot_pca(tc_eosinophils_norm, plot_labels = FALSE)
tc_eosinophils_pca

2.10.3 Panel C

Section 9.6 “Monocytes by clinic”

tc_monocytes_pca <- plot_pca(tc_monocytes_norm, plot_labels = FALSE)
tc_monocytes_pca

2.10.4 Panel D

Section 9.7 “Neutrophils by clinic”

tc_neutrophils_pca <- plot_pca(tc_neutrophils_norm, plot_labels = FALSE)
tc_neutrophils_pca

2.10.5 Panel E

Section 9.1 “Biopsies by clinic”

tc_biopsies_pca <- plot_pca(tc_biopsies_norm)
tc_biopsies_pca

2.10.6 Panel F

At the bottom of section 9.2.3 “Eosinophil samples, both clinics”

plot_pca(etnia_eo_nb)

2.10.7 Panel G

At the bottom of section 9.2.4 “Monocyte samples, both clinics”

plot_pca(etnia_mo_nb)

2.10.8 Panel H

At the bottom of section 9.2.5 “Neutrophil samples, both clinics”

plot_pca(etnia_ne_nb)

2.11 Figure S6

TODO: Someone asked me to changes these colors to this vicious set of greens. The original black/grey/white is nicer.

2.11.1 Panel A/B/C

These images are scattered through section 11.4.1 “Comparing 3 visits by cell type” (I changed the colors back just now 20240919, those greens are a horror show, where was April to yell at me!?)

t_visit_monocyte_norm_pca <- plot_pca(t_visit_monocyte_norm, plot_labels = FALSE)
pp(file = "images/t_monocyte_visit_norm_pca.pdf")
t_visit_monocyte_norm_pca$plot
dev.off()

## replace monocyte with the relevant celltypes for the others.

2.11.2 Panel D

These bottom panels are comparing my simplified model to a more thorough visit-in-model analysis. This first, monocyte comparison is in section 6.0.2 of 04differential_expression_tumaco.

Note, hpgltools is now able to handle arbitrarily complex models across all methods (deseq/edger/etc…) though I didn’t finish it until a few days ago and it has not yet been fully tested.

## The original pairwise invocation with sva:
##t_cf_monocyte_de_sva <- all_pairwise(t_monocyte, model_batch = "svaseq",
##                                     filter = TRUE, parallel = FALSE,
##                                     methods = methods)
test_monocytes <- normalize_expt(t_monocytes, filter = "simple")

This continues for a few blocks of relatively dense code invoking sva with a model containing visit number.

2.11.3 Panel E

Ibid, section 7.0.3

This section is nearly identically dense, and starts with the following:

## The original pairwise invocation with sva:
#t_cf_eosinophil_de_sva <- all_pairwise(t_eosinophils, model_batch = "svaseq",
#                                       filter = TRUE, parallel=FALSE, methods = methods)
test_eosinophils <- normalize_expt(t_eosinophils, filter = "simple")

2.11.4 Panel F

Ibid, section 7.0.2

TODO: This portion of the 04differential_expression_tumaco is poorly organized.

## The original pairwise invocation with sva:
## t_cf_neutrophil_de_sva <- all_pairwise(t_neutrophils, model_batch = "svaseq",
##                                        parallel = parallel, filter = TRUE,
##                                        methods = methods)
test_neutrophils <- normalize_expt(t_neutrophils, filter = "simple")

2.12 Figure S7

This images harken back to 02visualization

2.12.1 Panel A

Section 10.2.1 of 02visualization

t_clinical_nobiop_pca <- plot_pca(t_clinical_nobiop_norm, plot_labels = FALSE)
pp(file = "figures/t_clinical_nobiop_figxxa.pdf")
t_clinical_nobiop_pca[["plot"]]
dev.off()

2.12.2 Panel B

The next image in 10.2

TODO: Check that the device is properly creating pdfs in this section; it looks like it might be creating pngs for some of these?

t_clinical_nobiop_nb_pca <- plot_pca(t_clinical_nobiop_nb, plot_labels = FALSE)
pp(file = "figures/t_clinical_nobiop_sva_figxxb.pdf")
t_clinical_nobiop_nb_pca[["plot"]]
dev.off()

2.13 Figure S8, S9, S10

These are not in my notebooks.

2.14 Figure S11

2.14.1 Panel A

Section 10.2.1

tc_clinical_nobiop_pca <- plot_pca(tc_clinical_nobiop_norm, plot_labels = FALSE)
pp(file = "figures/tc_clinical_nobiop_figxxc.pdf")
tc_clinical_nobiop_pca[["plot"]]

2.14.2 Panel B

and the immediately following image

tc_clinical_nobiop_nb_pca <- plot_pca(tc_clinical_nobiop_nb, plot_labels = FALSE)
pp(file = "figures/tc_clinical_nobiop_sva_figxxd.pdf")
tc_clinical_nobiop_nb_pca[["plot"]]
dev.off()

2.14.3 Panel C

I think this is the only image taken from 03differential_expression_both; it comes from section 5.2.1; in my last iteration of the notebook it failed because I forgot to tell it to use the color_choices (fixed)

tc_clinical_cf_de_batch <- all_pairwise(tc_clinical_nobiop, filter = TRUE,
                                        parallel = parallel, model_batch = TRUE,
                                        methods = methods)

2.15 Table S1

This was manually collated by Maria Adelaida; I think all the numbers do correspond to the regression f-values/p-values.

2.16 Table S2

This is a very slightly modified copy of the sample sheet in sample_sheets/tmrc3_samples_pruned.xlsx One noteworthy thing: if one reprocesses the raw data and runs ‘gather_preprocessing_metadata()’, it will suck up all the logs from any tools for which I have written regexes and append the results as columns at the right of this. I am more than a little proud of that.

2.17 Table S3

This is a slightly modified copy of the result of the ‘svpc_fstats()’ invocations found at the end of Section 17 in 02visualization. (my little function adds a funky heatmap to the output)

queries <- c("typeofcells", "visitnumber", "clinic", "donor")
tc_clinical_fpstats <- svpc_fstats(tc_clinical, num_pcs = 5, queries = queries)

2.18 Table S4

This is a slightly modified copy of two results:

2.18.1 Biopsies DE

This is a modified version of the result when one invokes combine_de_tables(), as per section 6.0.2 of 04differential_expression_tumaco; this container should therefore provide those numbers if one looks in ‘analyses/4_tumaco/DE_Cure_Fail/Biopsies/t_biopsy_cf_table_sva-v202409.xlsx’ That xlsx workbook has a worksheet named ‘outcome’ (the second); this worksheet comprises columns Q-V from it. (also yay! the numbers are the same between my container (which changes constantly and gets rebuilt with each change) and the copy that was used for the paper!)

t_cf_biopsy_table_sva <- combine_de_tables(
  t_cf_biopsy_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Biopsies/t_biopsy_cf_table_sva-v{ver}.xlsx"))
t_cf_biopsy_table_sva

2.18.2 Bipsies GO

Looking more carefully at this, I think this is not from my overrepresentation analysis; but came from STRING?

The invocation for my version of this resides in 05enrichment, section 2.9.1 and the xlsx file in the container should be found in ‘analyses/Gene_Set_Enrichment/t_cf_biopsy_sig_sva_up_gp-v202409.xlsx’ for gProfiler, and with a similar name containing ‘cp’ for clusterProfiler.

t_cf_biopsy_sig_sva_gp_up <- simple_gprofiler(
  t_cf_biopsy_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_biopsy_sig_sva_up_gp-v{ver}.xlsx"))
t_cf_biopsy_sig_sva_gp_up

## or perhaps if you prefer clusterProfiler

t_cf_biopsy_sig_sva_cp_up <- simple_cprofiler(
  t_cf_biopsy_sig_sva_up, de_table = t_cf_biopsy_table_sva,
  orgdb = "org.Hs.eg.db",
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_biopsy_sig_sva_up_cp-v{ver}.xlsx"))

2.19 Table S5

This is a redacted combination of a few files:

2.19.1 worksheet Monocytes

Section 6.0.2 provides the invocation of combine_de_tables() which produces the ‘analysis/4_Tumaco/DE_Cure_Fail/Monocytes/t_monocyte_cf_table-sva-v202409.xlsx’ that comprises the logFC etc values in this worksheet. Interestingly, the top 5 genes all shifted down in logFC by ~ 0.002 in my container-derived worksheet vs this table. I think that is an acceptable difference? I did spot check a few genes in the top 5000 and it looks like the order is maintained.

t_cf_monocyte_table_sva <- combine_de_tables(
  t_cf_monocyte_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Monocytes/t_monocyte_cf_table_sva-v{ver}.xlsx"))
t_cf_monocyte_table_sva

2.19.2 worksheet Neutrophils

Same logic, but Section 7.0.2 with the analagously named xlsx file in the Neutrophils directory; these numbers also appear to have shifted slightly (but this time by ~ 0.001 up to 0.1 logFC) and I do see a couple of genes out of order… I am going to check in with Maria Adelaida.

t_cf_neutrophil_table_sva <- combine_de_tables(
  t_cf_neutrophil_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Neutrophils/t_neutrophil_cf_table_sva-v{ver}.xlsx"))
t_cf_neutrophil_table_sva

2.19.3 worksheet Eosinophils

Ibid, section 7.0.3. Once again the values look to have shifted down by ~ 0.001 to 0.01 logFC.

t_cf_eosinophil_table_sva <- combine_de_tables(
  t_cf_eosinophil_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Eosinophils/t_eosinophil_cf_table_sva-v{ver}.xlsx"))
t_cf_eosinophil_table_sva

2.19.4 worksheet All_innate

This is actually way earlier in the differential expression document, section 3.1 when the variable ‘t_cf_clinicalnb_table_sva’ is created to produce the xlsx file ‘analyses/4_tumaco/DE_Cure_Fail/All_Samples/t_clinical_nobiop_cf_table_sva-v202409.xlsx’

These numbers match up pretty much exactly (I set the number of significant digits to 4 I think, but this worksheet is 3?)

t_cf_clinical_table_sva <- combine_de_tables(
  t_cf_clinical_de_sva, keepers = cf_contrast,
  excel = glue("{cf_prefix}/All_Samples/t_clinical_cf_table_sva-v{ver}.xlsx"))
t_cf_clinical_table_sva

2.20 Table S6

I do not think they used my gProfiler/clusterProfiler results for this.

2.21 Table S7

This comes from 06lrt_gsva, section 3.

with the caveat that I think the C2 and C7 results are more interesting. Also, the results produced by the container are sparser with respect to the annotations because I did not want to steal the gmt/xml annotations from broad. If you happen to have the xml/gmt files, I left the original invocations there so you can get the full table with the paper titles etc (also, note that later versions of mSigDB changed the xml format so that it no longer parses properly in R; so those functions are now smrt enough to handle the new gmt/json files).

tc_celltype_gsva_h <- simple_gsva(
    tc_valid,
    signatures = broad_h,
    msig_xml = "reference/msigdb/msigdb_v7.5.1.xml",
    signature_category = "h")
tc_celltype_gsva_h_sig <- get_sig_gsva_categories(
    tc_celltype_gsva_h,
    excel = "analyses/3_cali_and_tumaco/GSVA/tc_valid_gsva_h.xlsx")

2.22 Table S8

I am reasonably certain that Alejandro produced the inputs for this; however the container produces nearly identical versions of the pieces in 07wgcna, section 9.

written_interesting <- write_xlsx(fData(l2input)[interesting_genes, ],
                                  excel = glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))

## Note that we can do similarity matrices on the samples too in order to get
## dendrograms which may get interesting groups of samples?

2.23 Table S9

This is not run automatically by the container because it will run most computers out of memory and/or take a really long time. If your computer has ~ 512G ram, open 08classifier_highvar.Rmd and run section 11.2; the resulting xlsx output files should look like this (except cooler because I add some plots).

tc_vall_summary_xlsx <- glue("excel/tc_vall_ml_summary-v{ver}.xlsx")
tc_vall_knn <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(tc_vall_knn, excel = tc_vall_summary_xlsx)
tc_vall_gb <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(tc_vall_gb, excel = written[["wb"]])
tc_vall_glm <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(tc_vall_glm, excel = written[["wb"]])
tc_vall_rf <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(tc_vall_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = tc_vall_summary_xlsx)

3 Structure

The Rmd files may be rendered in any order with one very important exception: the 01datastructures.Rmd file must be run first. It reads all of the sample sheets and count tables and writes out a series of data files used by all the other documents.

Note: All the references are in the documents, not here.

3.1 01datastructures

This is responsible for setting the stage for everything that follows. It collects annotations from the 2020 ensembl human database, the experimental metadata from the xlsx files in sample_sheets/, and the counts found in preprocessing/ and combines them into an initial, large expressionSet. It tallies up the samples according to many/most of the likely factors of interest, filters the data, and extracts the various subsets into separate datastructures.

3.2 02visualization

Dominated by a long series of PCA explorations. It is basically a playground for me to poke at the various data subsets in order to try to get a feeling for what is the most appropriate way to think about the data. It was where we eventually decided that we cannot use the Cali data, for example.

In later iterations it included a series of regression analyses and further examinations into the sensitivities of the PCA and surrogate variable analyses. It should be noted that these analyses were entirely Theresa’s idea, I copied them into this document and made some minor changes.

3.3 03differential_expression_both

This document performs some differential expression analyses using both the samples from Cali and Tumaco. Partially this was done to assuage my interest, and partially to reinforce the idea that the Cali samples really do not play well with others.

3.4 04differential_expression_tumaco

Ibid, but only Tumaco. This is the real meat of the analysis and seeks to try out various ways of performing differential expression using the Tumaco data in order to see which provide the most robust ways of examining the potential questions in the data.

3.5 05enrichment

Read the xlsx files from 03 and 04 above and pass the resulting gene sets and/or DE tables to gProfiler2 and clusterProfiler. gProfiler was used to perform over-representation analyses of the significantly increased/decreased genes (or both together) of the various contrasts vs. the databases provided by gProfiler. The same task was performed using clusterProfiler with two exceptions: clusterProfiler uses the orgdb annotations for its gene groups and is therefore (by default, but not exclusively) limited to GO/KEGG/DAVID; conversely, one may trivially pass the full DE table to clusterProfiler and thus perform the full GSEA analysis. (I have been playing with passing reactome (via ReactomePA) and mSigDB to these methods).

3.6 06lrt_gsva

Theresa also suggested we try a series of Likelihood Ratio Tests (LRT) to examine trends in the data; thus making DESeq2/EdgeR sensitive to not only increased/decreased gene expression across two conditions, but shared/divergent trends across multiple factors in the metadata. This document provides a version of this.

At the same time we were doing that, she and I put together a simplified method for querying the mSigDB via GSVA. These queries are in this document too.

3.7 07wgcna

Alejandro performed a set of WGCNA analyses using the normalized expression values. He kindly sent me a copy of his R script and I reformatted it somewhat into this document.

3.8 08classifier_highvar

I wanted to explore machine learning classifiers. I had this dataset (and the parasite transcriptomes) open, so I decided to play with them. It was never intended for publication by me but as a fun learning experience; but here it is…

4 Introduction to the singularity container

Of the various options, I found singularity to be the most attractive. I therefore wrote a default Makefile with a few targets to create and manipulate images because I cannot be bothered to remember all of the various commands.

5 The images

I am hoping to create 2 image templates for the TMRC analyses: preprocessing and analyses. I will therefore have a few locally maintained shell scripts which contain the base image setup tasks, the tasks required to setup the tools used, download the data, and perform the work.

6 Setting up the base image

I intend to use Debian stable. I am copying the required setup files into the current working directory and invoking make to create the container. I have a few targets which are intended to make this easier to remember.

6.1 Creating the base image

The following command runs singularity with options suitable to create the image. This to me is a little unnerving, because a bunch of stuff gets run as root.

make

6.2 Testing stuff out and making changes

The overlay target drops the user into the container with R/W permissions. It creates a directory ‘cure_Fail_analyses_overlay’ which may be modified at will.

make cure_fail_analyses.overlay

6.3 Using the container for arbitrary Rmd/md files

The container’s runscript has some logic built in which can process arbitrary markdown documents into html via knitr and pandoc.

cd someplace_with_Rmd
/locationofcontainers/cure_fail_host_analyses.sif -i markdown01.Rmd:markdown02.Rmd
---
title: "TMRC3 `r Sys.getenv('VERSION')`: README/Introduction"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---


[![DOI](https://zenodo.org/badge/712104518.svg)](https://zenodo.org/doi/10.5281/zenodo.13798788)

# Introduction

The R markdown documents in this directory are intended to provide a
complete accounting of the analyses performed in the preparation of

"Innate biosignature of treatment failure in patients with cutaneous
leishmaniasis."

I assume that if anyone ever reads this, s/he is looking for the
source of the data, figures, and tables from that paper and to see if
the methods employed to generate them are good, bad, indifferent, or
garbage.  These documents are being generated by a singularity
container which provides the input count tables (Once we have
accessions, I will make a companion which is able to create them),
sample sheets, input documents, and all the software used to process
them.  The overal configuration of the container is contained in the
toplevel .yml file and provides the base OS (Debian stable) and the
scripts used to install the software.  'local/bin/setup_debian.sh'
handles the base container; 'local/bin/setup_hpgltools.sh' sets up R,
bioconductor, and my package 'hpgltools', and 'local/bin/runscript' is
the script run if one invokes the container without any arguments;
when run it uses the various installed R packages to 'render' the Rmd
files in /data to freshly created html.  This README is the first
document rendered.

# Figure/Table locations

As of 20240919, Maria Adelaida kindly sent me the putatively final set
of figures/tables.  I am going to lay out the location in the html
logs and their Rmd parents where these may be found.  If you wish to
play along, make sure you run the entire 01datastructures.Rmd.

Also, this container is automatically regenerated and rerun when I
make changes; so one may just go here to play along (that is where I
am going to hunt down the figures/tables):

https://containerbook.umiacs.io/

## Figure 1

These numbers are coming from 01datastructures:  when considering the
samples from the perspective of how many people fall into each
category, that is coming directly from section 4.1 "Metadata Sources"
and the demographics xlsx file.  This provides the number of people
who fall into each category of panel A/B.

## Figure 2

### Panel A

TODO: Clarify v1/v2/v3 vs. Pre-Tx, Mid-Tx, End-Tx in the notebook

This is also derived from 01datastructures, but from the perspective
of the numbers of samples which survive our various filters.  Thus, to
arrive at these numbers, one should start at section 6.1 "Create
Expressionset." and wander through the document; however doing so will
likely make most people sad because it is a long journey.  Instead,
you may skip down to section 14 "Summarize: Tabulate sample numbers".
The section labeled 'Both' provides these numbers.  Note, this table
used to be just Tumaco, which follows.

The numbers of samples in each group are restated in the summaries.
The only real caveat is that I wrote them as 'visit1' or 'v1' for
Pre-Tx, 'visit2' for Mid-Tx, and 'visit3' for End-Tx.

Another way to recapitulate these numbers is to check out the sankey
plots in section 8 'Visualize the sample breakdown'.  An example invocation looks like:

```{r, eval=FALSE}
clinic_type_outcome_sankey <- plot_meta_sankey(
  tc_valid, factors = c("clinic", "typeofcells", "finaloutcome"),
  drill_down = TRUE, color_choices = color_choices)
clinic_type_outcome_sankey

clinic_ethnicity_outcome_sankey <- plot_meta_sankey(
  tc_valid, factors = c("clinic", "etnia", "finaloutcome"),
  drill_down = TRUE, color_choices = color_choices)
clinic_ethnicity_outcome_sankey

clinic_sex_outcome_sankey <- plot_meta_sankey(
  tc_valid, factors = c("clinic", "sex", "finaloutcome"),
  drill_down = TRUE, color_choices = color_choices)
clinic_sex_outcome_sankey
```

### Panel B

Found in 02visualization, section 8 "Global views of all cell types"
Hey, check it, different types of immune cells are different!

One invocation looks like this (there are a few versions):

```{r, eval=FALSE}
tc_pca <- plot_pca(tc_norm, plot_labels = FALSE,
                    plot_title = "PCA - Cell type", size_column = "visitnumber")
tc_pca
```

### Panel C

Ibid.  Both panels are actually a little further down in section 8.1.

Here is the invocation:

```{r, eval=FALSE}
tc_cf_corheat <- plot_corheat(tc_cf_norm, plot_title = "Heirarchical clustering:
         cell types")
tc_cf_corheat
```

## Figure 3

Maria Adelaida did a tremendous amount of work to move the various
ggplot legends around so that this fits in a single panel in a
coherent fashion.

### Panel A

Section 9.8 of 02visualization: "PCA: Compare clinics"

```{r, eval=FALSE}
tc_clinic_type_nb_pca <- plot_pca(tc_clinic_type_nb)
tc_clinic_type_nb_pca
```

### Panel B

Section 9.5 of 02visualization: "Eosinophils by clinic"

```{r, eval=FALSE}
tc_eosinophils_pca <- plot_pca(tc_eosinophils_norm, plot_labels = FALSE)
tc_eosinophils_pca
```

### Panel C

Section 9.6 of 02visualization: "Monocytes by clnic"

```{r, eval=FALSE}
tc_monocytes_pca <- plot_pca(tc_monocytes_norm, plot_labels = FALSE)
tc_monocytes_pca
```

### Panel D

Section 9.7 of 02visualization: "Neutrophils by clnic"

```{r, eval=FALSE}
tc_neutrophils_pca <- plot_pca(tc_neutrophils_norm, plot_labels = FALSE)
tc_neutrophils_pca
```

### Panel E

Section 9.1 of 02visualization: "Biopsies"

```{r, eval=FALSE}
tc_biopsies_pca <- plot_pca(tc_biopsies_norm)
tc_biopsies_pca
```

## Figure 4

### Panel A, left

Section 10.2.4 "Figure 4A: Monocytes"

```{r, eval=FALSE}
t_monocyte_pca <- plot_pca(t_monocyte_norm,
  plot_labels = FALSE)
t_monocyte_pca
```

### Panel A, middle

Section 10.2.8 "Figure 4A: Eosinophils"

```{r, eval=FALSE}
t_eosinophil_pca <- plot_pca(t_eosinophil_norm,
                             plot_labels = FALSE)
t_eosinophil_pca
```

### Panel A, right

Section 10.2.6 "Figure 4A: Neutrophils"

```{r, eval=FALSE}
t_neutrophil_pca <- plot_pca(t_neutrophil_norm,
                             plot_labels = FALSE)
t_neutrophil_pca
```

## Panel B

These move to the document 04differential_expression_tumaco.

### Panel B, left, middle, and right

Section 7.0.4 "Figure 4B: Volcano plots"  For some reason I did not
have separate sections for each cell type.

```{r, eval=FALSE}
num_color <- color_choices[["clinic_cf"]][["tumaco_failure"]]
den_color <- color_choices[["clinic_cf"]][["tumaco_cure"]]
wanted_genes <- c("FI44L", "IFI27", "PRR5", "PRR5-ARHGAP8", "RHCE",
                  "FBXO39", "RSAD2", "SMTNL1", "USP18", "AFAP1")

cf_monocyte_table <- t_cf_monocyte_table_sva[["data"]][["outcome"]]
cf_monocyte_volcano <- plot_volcano_condition_de(
  cf_monocyte_table, "outcome", label = wanted_genes,
  fc_col = "deseq_logfc", p_col = "deseq_adjp", line_position = NULL,
  color_high = num_color, color_low = den_color, label_size = 6)
pp(file = glue("images/cf_monocyte_volcano_labeled-v{ver}.svg"))
cf_monocyte_volcano[["plot"]]
dev.off()
```

### Panel C

TODO: Add similar venn ~ section 8

I played with versions of this using AUCC and UpSet plots throughout
the 04differential_expression_tumaco document; but did not generate
this specific image.  (Maybe I should make a version of it?)

## Figure 5

TODO: Clarify section names

I did not perform any of the STRING analyses and therefore am unable
to comment on panels A,C,E,G.  I should bug Alejandro.

The lower panels are all coming from 05enrichment and are using the
treeplot() result from gotermsim() of the gProfiler2 results.

### Panel B

05enrichment, section 2.10.1 "gProfiler" (I am going to change these
headings to be a bit more informative...)

```{r, eval=FALSE}
t_cf_eosinophil_sig_sva_up_gp <- simple_gprofiler(
  t_cf_eosinophil_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_eosinophil_up_gp-v{ver}.xlsx"))
t_cf_eosinophil_sig_sva_up_gp
```

### Panel D

05enrichment, Section 2.12.1 "gProfiler".

```{r, eval=FALSE}
t_cf_neutrophil_sig_sva_up_gp <- simple_gprofiler(
  t_cf_neutrophil_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_neutrophil_up_gp-v{ver}.xlsx"))
t_cf_neutrophil_sig_sva_up_gp
```

### Panel F

Ibid, Section 2.11.1 "gProfiler"

```{r, eval=FALSE}
t_cf_monocyte_sig_sva_up_gp <- simple_gprofiler(
  t_cf_monocyte_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_monocyte_up_gp-v{ver}.xlsx"))
t_cf_monocyte_sig_sva_up_gp
```

### Panel H

Ibid, Section 2.4.1 "gProfiler" but it is pretty far down; just before
the clusterProfiler analyses.

```{r, eval=FALSE}
t_cf_clinicalnb_gp_down <- simple_gprofiler(
  t_clinicalnb_cf_sig_sva_down,
  excel = glue("{xlsx_prefix}/Gene_Set_Overrepresentation/clinicalnb_fail_up_gp-v{ver}.xlsx"))
t_cf_clinicalnb_gp_down

## and

go_termsim <- enrichplot::pairwise_termsim(t_cf_clinicalnb_gp_down[["BP_enrich"]])
t_cf_clinicalnb_gp_go_down_tree <- sm(enrichplot::treeplot(go_termsim))
pp(file = "images/overrepresentation/t_cf_clinicalnb_gp_down_tree.pdf",
   width = treeplot_width, height = treeplot_height)
t_cf_clinicalnb_gp_go_down_tree
dev.off()
```

## Figure S1

TODO: Clarify sections

Head back to 02visualization; this is where Theresa taught me
regression analysis!

### Panel A

02visualization, Near the bottom of section 7.2 "Regression analyses
vs outcome"  (I need to rename these sections too, these are cross
correlations, not regression; I used to have them all together in a
big pile and split them up, but didn't properly rename the headings.)

```{r, eval=FALSE}
t_regression_queries <- c("Weight", "Sex", "Ethnicity", "Age")
t_cross_df <- t_regression_numeric[, t_regression_queries]
t_regression_cross <- corr_cross(t_cross_df)
pp(file = "figures/tumaco_weight_sex_ethnicity_age_numeric_crosscor.svg")
t_regression_cross
dev.off()
```

### Panel B

02visualization, 7.2.1 "Copy these with only the Tumaco people"

```{r, eval=FALSE}
regression_tests <- c("Age", "Clinic", "Ethnicity", "Sex", "Weight")
lm_regression_demographics <- extract_linear_regression(
  regression_numeric, query = "Therapeutic_Outcome_Final", factors = regression_tests,
  excel = glue("excel/numeric_demographics_regression_final_sex_clinic_ethnicity_age-v{ver}.xlsx"))

## and

pp(file = "images/demographics_only_linear_regression.svg")
lm_regression_demographics[["forest"]]
dev.off()
```

### Panel C

Ibid, Section 7.2.2

```{r, eval=FALSE}
tc_log_iterative_regression_demographics <- iterate_logistic_regression(
  regression_df, query = "Therapeutic_Outcome_Final", factors = regression_tests,
  excel = glue("excel/tc_simple_logistic_regression-v{ver}.xlsx"))

## and

pp(file = glue("figures/tc_simple_logistic_regression-v{ver}.svg"))
tc_log_iterative_regression_demographics[["forest"]]
dev.off()
```

### Panel D

Ibid.

```{r, eval=FALSE}
t_log_regression_demographics <- extract_logistic_regression(
  t_regression_df, query = "Therapeutic_Outcome_Final", factors = t_regression_tests,
  excel = glue("excel/t_multivariable_logistic_regression-v{ver}.xlsx"))

## and

pp(file = glue("figures/t_multivariable_logistic_regression-v{ver}.svg"))
t_log_regression_demographics[["forest"]]
dev.off()
```

### Panels E and F

TODO: readd

Hmm I deleted these from my notebook, I did not think we were going to
use them but instead only include C/D.  Adding them back momentarily.

## Figure S2

### Panel B/C

These are waay down in Section 16 "Parasite distribution" of 02visualization.

```{r, eval=FALSE}
lp_cf_norm_pca <- plot_pca(lp_cf_norm)
```

## Figure S4

TODO: rename section

One must return to 01datastructures for this image.  It was used to
define the coverage cutoff for filtering samples and is found at
section 7.2 "Figure S2: Non-zero genes before sample filtering"  (I
guess I should rename this too)

```{r, eval=FALSE}
all_nz <- plot_nonzero(hs_expt)
```

## Figure S5

These are all scattered through 02visualization.

### Panel A

Section 9.8 "Compare clinics"

```{r, eval=FALSE}
tc_clinic_type_pca <- plot_pca(tc_clinic_type_norm)
tc_clinic_type_pca
```

### Panel B

Section 9.5 "Eosinophils by clinic"  No fails from Cali for
Eosinophils!

```{r, eval=FALSE}
tc_eosinophils_pca <- plot_pca(tc_eosinophils_norm, plot_labels = FALSE)
tc_eosinophils_pca
```

### Panel C

Section 9.6 "Monocytes by clinic"

```{r, eval=FALSE}
tc_monocytes_pca <- plot_pca(tc_monocytes_norm, plot_labels = FALSE)
tc_monocytes_pca
```

### Panel D

Section 9.7 "Neutrophils by clinic"

```{r, eval=FALSE}
tc_neutrophils_pca <- plot_pca(tc_neutrophils_norm, plot_labels = FALSE)
tc_neutrophils_pca
```

### Panel E

Section 9.1 "Biopsies by clinic"

```{r, eval=FALSE}
tc_biopsies_pca <- plot_pca(tc_biopsies_norm)
tc_biopsies_pca
```

### Panel F

At the bottom of section 9.2.3 "Eosinophil samples, both clinics"

```{r, eval=FALSE}
plot_pca(etnia_eo_nb)
```

### Panel G

At the bottom of section 9.2.4 "Monocyte samples, both clinics"

```{r, eval=FALSE}
plot_pca(etnia_mo_nb)
```

### Panel H

At the bottom of section 9.2.5 "Neutrophil samples, both clinics"

```{r, eval=FALSE}
plot_pca(etnia_ne_nb)
```

## Figure S6

TODO: Someone asked me to changes these colors to this vicious set of
greens.  The original black/grey/white is nicer.

### Panel A/B/C

These images are scattered through section 11.4.1 "Comparing 3 visits
by cell type" (I changed the colors back just now 20240919, those
greens are a horror show, where was April to yell at me!?)

```{r, eval=FALSE}
t_visit_monocyte_norm_pca <- plot_pca(t_visit_monocyte_norm, plot_labels = FALSE)
pp(file = "images/t_monocyte_visit_norm_pca.pdf")
t_visit_monocyte_norm_pca$plot
dev.off()

## replace monocyte with the relevant celltypes for the others.
```

### Panel D

These bottom panels are comparing my simplified model to a more
thorough visit-in-model analysis.  This first, monocyte comparison is
in section 6.0.2 of 04differential_expression_tumaco.

Note, hpgltools is now able to handle arbitrarily complex models
across all methods (deseq/edger/etc...) though I didn't finish it
until a few days ago and it has not yet been fully tested.

```{r, eval=FALSE}
## The original pairwise invocation with sva:
##t_cf_monocyte_de_sva <- all_pairwise(t_monocyte, model_batch = "svaseq",
##                                     filter = TRUE, parallel = FALSE,
##                                     methods = methods)
test_monocytes <- normalize_expt(t_monocytes, filter = "simple")
```

This continues for a few blocks of relatively dense code invoking sva
with a model containing visit number.

### Panel E

Ibid, section 7.0.3

This section is nearly identically dense, and starts with the following:

```{r, eval=FALSE}
## The original pairwise invocation with sva:
#t_cf_eosinophil_de_sva <- all_pairwise(t_eosinophils, model_batch = "svaseq",
#                                       filter = TRUE, parallel=FALSE, methods = methods)
test_eosinophils <- normalize_expt(t_eosinophils, filter = "simple")
```

### Panel F

Ibid, section 7.0.2

TODO: This portion of the 04differential_expression_tumaco is poorly
organized.

```{r, eval=FALSE}
## The original pairwise invocation with sva:
## t_cf_neutrophil_de_sva <- all_pairwise(t_neutrophils, model_batch = "svaseq",
##                                        parallel = parallel, filter = TRUE,
##                                        methods = methods)
test_neutrophils <- normalize_expt(t_neutrophils, filter = "simple")
```

## Figure S7

This images harken back to 02visualization

### Panel A

Section 10.2.1 of 02visualization

```{r, eval=FALSE}
t_clinical_nobiop_pca <- plot_pca(t_clinical_nobiop_norm, plot_labels = FALSE)
pp(file = "figures/t_clinical_nobiop_figxxa.pdf")
t_clinical_nobiop_pca[["plot"]]
dev.off()
```

### Panel B

The next image in 10.2

TODO: Check that the device is properly creating pdfs in this section;
it looks like it might be creating pngs for some of these?

```{r, eval=FALSE}
t_clinical_nobiop_nb_pca <- plot_pca(t_clinical_nobiop_nb, plot_labels = FALSE)
pp(file = "figures/t_clinical_nobiop_sva_figxxb.pdf")
t_clinical_nobiop_nb_pca[["plot"]]
dev.off()
```

## Figure S8, S9, S10

These are not in my notebooks.

## Figure S11

### Panel A

Section 10.2.1

```{r, eval=FALSE}
tc_clinical_nobiop_pca <- plot_pca(tc_clinical_nobiop_norm, plot_labels = FALSE)
pp(file = "figures/tc_clinical_nobiop_figxxc.pdf")
tc_clinical_nobiop_pca[["plot"]]
```

### Panel B

and the immediately following image

```{r, eval=FALSE}
tc_clinical_nobiop_nb_pca <- plot_pca(tc_clinical_nobiop_nb, plot_labels = FALSE)
pp(file = "figures/tc_clinical_nobiop_sva_figxxd.pdf")
tc_clinical_nobiop_nb_pca[["plot"]]
dev.off()
```

### Panel C

I think this is the only image taken from
03differential_expression_both; it comes from section 5.2.1; in my
last iteration of the notebook it failed because I forgot to tell it
to use the color_choices (fixed)

```{r, eval=FALSE}
tc_clinical_cf_de_batch <- all_pairwise(tc_clinical_nobiop, filter = TRUE,
                                        parallel = parallel, model_batch = TRUE,
                                        methods = methods)
```

## Table S1

This was manually collated by Maria Adelaida; I think all the numbers
do correspond to the regression f-values/p-values.

## Table S2

This is a very slightly modified copy of the sample sheet in
sample_sheets/tmrc3_samples_pruned.xlsx  One noteworthy thing: if one
reprocesses the raw data and runs 'gather_preprocessing_metadata()',
it will suck up all the logs from any tools for which I have written
regexes and append the results as columns at the right of this.  I am
more than a little proud of that.

## Table S3

This is a slightly modified copy of the result of the 'svpc_fstats()'
invocations found at the end of Section 17 in 02visualization.  (my
little function adds a funky heatmap to the output)

```{r, eval=FALSE}
queries <- c("typeofcells", "visitnumber", "clinic", "donor")
tc_clinical_fpstats <- svpc_fstats(tc_clinical, num_pcs = 5, queries = queries)
```

## Table S4

This is a slightly modified copy of two results:

### Biopsies DE

This is a modified version of the result when one invokes
combine_de_tables(), as per section 6.0.2 of
04differential_expression_tumaco; this container should therefore
provide those numbers if one looks in
'analyses/4_tumaco/DE_Cure_Fail/Biopsies/t_biopsy_cf_table_sva-v202409.xlsx'
That xlsx workbook has a worksheet named 'outcome' (the second); this
worksheet comprises columns Q-V from it.  (also yay! the numbers are
the same between my container (which changes constantly and gets
rebuilt with each change) and the copy that was used for the paper!)

```{r, eval=FALSE}
t_cf_biopsy_table_sva <- combine_de_tables(
  t_cf_biopsy_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Biopsies/t_biopsy_cf_table_sva-v{ver}.xlsx"))
t_cf_biopsy_table_sva
```

### Bipsies GO

Looking more carefully at this, I think this is not from my
overrepresentation analysis; but came from STRING?

The invocation for my version of this resides in 05enrichment, section
2.9.1 and the xlsx file in the container should be found in
'analyses/Gene_Set_Enrichment/t_cf_biopsy_sig_sva_up_gp-v202409.xlsx'
for gProfiler, and with a similar name containing 'cp' for
clusterProfiler.

```{r, eval=FALSE}
t_cf_biopsy_sig_sva_gp_up <- simple_gprofiler(
  t_cf_biopsy_sig_sva_up,
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_biopsy_sig_sva_up_gp-v{ver}.xlsx"))
t_cf_biopsy_sig_sva_gp_up

## or perhaps if you prefer clusterProfiler

t_cf_biopsy_sig_sva_cp_up <- simple_cprofiler(
  t_cf_biopsy_sig_sva_up, de_table = t_cf_biopsy_table_sva,
  orgdb = "org.Hs.eg.db",
  excel = glue("{xlsx_prefix}/Gene_Set_Enrichment/t_cf_biopsy_sig_sva_up_cp-v{ver}.xlsx"))
```

## Table S5

This is a redacted combination of a few files:

### worksheet Monocytes

Section 6.0.2 provides the invocation of combine_de_tables() which
produces the
'analysis/4_Tumaco/DE_Cure_Fail/Monocytes/t_monocyte_cf_table-sva-v202409.xlsx'
that comprises the logFC etc values in this worksheet.  Interestingly,
the top 5 genes all shifted down in logFC by ~ 0.002 in my
container-derived worksheet vs this table.  I think that is an
acceptable difference?  I did spot check a few genes in the top 5000
and it looks like the order is maintained.

```{r, eval=FALSE}
t_cf_monocyte_table_sva <- combine_de_tables(
  t_cf_monocyte_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Monocytes/t_monocyte_cf_table_sva-v{ver}.xlsx"))
t_cf_monocyte_table_sva
```

### worksheet Neutrophils

Same logic, but Section 7.0.2 with the analagously named xlsx file in
the Neutrophils directory; these numbers also appear to have
shifted slightly (but this time by ~ 0.001 up to 0.1 logFC) and I do
see a couple of genes out of order...  I am going to check in with
Maria Adelaida.

```{r, eval=FALSE}
t_cf_neutrophil_table_sva <- combine_de_tables(
  t_cf_neutrophil_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Neutrophils/t_neutrophil_cf_table_sva-v{ver}.xlsx"))
t_cf_neutrophil_table_sva
```

### worksheet Eosinophils

Ibid, section 7.0.3.  Once again the values look to have shifted down
by ~ 0.001 to 0.01 logFC.

```{r, eval=FALSE}
t_cf_eosinophil_table_sva <- combine_de_tables(
  t_cf_eosinophil_de_sva, keepers = t_cf_contrast,
  excel = glue("{cf_prefix}/Eosinophils/t_eosinophil_cf_table_sva-v{ver}.xlsx"))
t_cf_eosinophil_table_sva
```

### worksheet All_innate

This is actually way earlier in the differential expression document,
section 3.1 when the variable 't_cf_clinicalnb_table_sva' is created
to produce the xlsx file
'analyses/4_tumaco/DE_Cure_Fail/All_Samples/t_clinical_nobiop_cf_table_sva-v202409.xlsx'

These numbers match up pretty much exactly (I set the number of
significant digits to 4 I think, but this worksheet is 3?)

```{r, eval=FALSE}
t_cf_clinical_table_sva <- combine_de_tables(
  t_cf_clinical_de_sva, keepers = cf_contrast,
  excel = glue("{cf_prefix}/All_Samples/t_clinical_cf_table_sva-v{ver}.xlsx"))
t_cf_clinical_table_sva
```

## Table S6

I do not think they used my gProfiler/clusterProfiler results for
this.

## Table S7

This comes from 06lrt_gsva, section 3.

with the caveat that I think the C2 and C7 results are more
interesting.  Also, the results produced by the container are sparser
with respect to the annotations because I did not want to steal the
gmt/xml annotations from broad.  If you happen to have the xml/gmt
files, I left the original invocations there so you can get the full
table with the paper titles etc (also, note that later versions of
mSigDB changed the xml format so that it no longer parses properly in
R; so those functions are now smrt enough to handle the new gmt/json
files).

```{r, eval=FALSE}
tc_celltype_gsva_h <- simple_gsva(
    tc_valid,
    signatures = broad_h,
    msig_xml = "reference/msigdb/msigdb_v7.5.1.xml",
    signature_category = "h")
tc_celltype_gsva_h_sig <- get_sig_gsva_categories(
    tc_celltype_gsva_h,
    excel = "analyses/3_cali_and_tumaco/GSVA/tc_valid_gsva_h.xlsx")
```

## Table S8

I am reasonably certain that Alejandro produced the inputs for this;
however the container produces nearly identical versions of the pieces
in 07wgcna, section 9.

```{r, eval=FALSE}
written_interesting <- write_xlsx(fData(l2input)[interesting_genes, ],
                                  excel = glue("excel/wgcna_interesting_genes-v{ver}.xlsx"))

## Note that we can do similarity matrices on the samples too in order to get
## dendrograms which may get interesting groups of samples?
```

## Table S9

This is not run automatically by the container because it will run
most computers out of memory and/or take a really long time.  If your
computer has ~ 512G ram, open 08classifier_highvar.Rmd and run section
11.2; the resulting xlsx output files should look like this (except
cooler because I add some plots).

```{r, eval=FALSE}
tc_vall_summary_xlsx <- glue("excel/tc_vall_ml_summary-v{ver}.xlsx")
tc_vall_knn <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "knn", sampler = "cv")
written <- write_classifier_summary(tc_vall_knn, excel = tc_vall_summary_xlsx)
tc_vall_gb <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "xgbTree", sampler = "cv")
written <- write_classifier_summary(tc_vall_gb, excel = written[["wb"]])
tc_vall_glm <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                                outcome_column = ref_col,
                                method = "glmnet", sampler = "cv")
written <- write_classifier_summary(tc_vall_glm, excel = written[["wb"]])
tc_vall_rf <- classify_n_times(tc_vall_texprs, tc_vall_meta,
                               outcome_column = ref_col,
                               method = "ranger", sampler = "cv")
written <- write_classifier_summary(tc_vall_rf, excel = written[["wb"]])
openxlsx::saveWorkbook(written[["wb"]], file = tc_vall_summary_xlsx)
```

# Structure

The Rmd files may be rendered in any order with one very important
exception: the 01datastructures.Rmd file must be run first.  It reads
all of the sample sheets and count tables and writes out a series of
data files used by all the other documents.

Note: All the references are in the documents, not here.

## 01datastructures

This is responsible for setting the stage for everything that follows.
It collects annotations from the 2020 ensembl human database, the
experimental metadata from the xlsx files in sample_sheets/, and the
counts found in preprocessing/ and combines them into an initial,
large expressionSet.  It tallies up the samples according to many/most
of the likely factors of interest, filters the data, and extracts the
various subsets into separate datastructures.

## 02visualization

Dominated by a long series of PCA explorations.  It is basically a
playground for me to poke at the various data subsets in order to try
to get a feeling for what is the most appropriate way to think about
the data.  It was where we eventually decided that we cannot use the
Cali data, for example.

In later iterations it included a series of regression analyses and
further examinations into the sensitivities of the PCA and surrogate
variable analyses.  It should be noted that these analyses were
entirely Theresa's idea, I copied them into this document and made
some minor changes.

## 03differential_expression_both

This document performs some differential expression analyses using
both the samples from Cali and Tumaco.  Partially this was done to
assuage my interest, and partially to reinforce the idea that the Cali
samples really do not play well with others.

## 04differential_expression_tumaco

Ibid, but only Tumaco.  This is the real meat of the analysis and
seeks to try out various ways of performing differential expression
using the Tumaco data in order to see which provide the most robust
ways of examining the potential questions in the data.

## 05enrichment

Read the xlsx files from 03 and 04 above and pass the resulting gene
sets and/or DE tables to gProfiler2 and clusterProfiler.  gProfiler
was used to perform over-representation analyses of the significantly
increased/decreased genes (or both together) of the various contrasts
vs. the databases provided by gProfiler.  The same task was performed
using clusterProfiler with two exceptions: clusterProfiler uses the
orgdb annotations for its gene groups and is therefore (by default,
but not exclusively) limited to GO/KEGG/DAVID; conversely, one may
trivially pass the full DE table to clusterProfiler and thus perform
the full GSEA analysis. (I have been playing with passing reactome
(via ReactomePA) and mSigDB to these methods).

## 06lrt_gsva

Theresa also suggested we try a series of Likelihood Ratio Tests (LRT)
to examine trends in the data; thus making DESeq2/EdgeR sensitive to
not only increased/decreased gene expression across two conditions,
but shared/divergent trends across multiple factors in the metadata.
This document provides a version of this.

At the same time we were doing that, she and I put together a
simplified method for querying the mSigDB via GSVA.  These queries are
in this document too.

## 07wgcna

Alejandro performed a set of WGCNA analyses using the normalized
expression values.  He kindly sent me a copy of his R script and I
reformatted it somewhat into this document.

## 08classifier_highvar

I wanted to explore machine learning classifiers. I had this dataset
(and the parasite transcriptomes) open, so I decided to play with
them.  It was never intended for publication by me but as a fun
learning experience; but here it is...

# Introduction to the singularity container

Of the various options, I found singularity to be the most attractive.
I therefore wrote a default Makefile with a few targets to create and
manipulate images because I cannot be bothered to remember all of the
various commands.

# The images

I am hoping to create 2 image templates for the TMRC analyses:
preprocessing and analyses. I will therefore have a few locally
maintained shell scripts which contain the base image setup tasks, the
tasks required to setup the tools used, download the data, and perform
the work.

# Setting up the base image

I intend to use Debian stable.  I am copying the required setup files
into the current working directory and invoking make to create
the container.  I have a few targets which are intended to make this
easier to remember.

## Creating the base image

The following command runs singularity with options suitable to create
the image.  This to me is a little unnerving, because a bunch of stuff
gets run as root.

```{bash, eval=FALSE}
make
```

## Testing stuff out and making changes

The overlay target drops the user into the container with R/W
permissions.  It creates a directory 'cure_Fail_analyses_overlay' which
may be modified at will.

```{bash, eval=FALSE}
make cure_fail_analyses.overlay
```

## Using the container for arbitrary Rmd/md files

The container's runscript has some logic built in which can process
arbitrary markdown documents into html via knitr and pandoc.

```{bash, eval=FALSE}
cd someplace_with_Rmd
/locationofcontainers/cure_fail_host_analyses.sif -i markdown01.Rmd:markdown02.Rmd
```
