1 Introduction

This document is intended to create the data structures used to evaluate our TMRC2 samples. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.

In all cases the processing performed was:

  1. Default trimming was performed.
  2. Hisat2 was used to map the remaining reads against the Leishmania panamensis genome revision 36.
  3. The alignments from hisat2 were used to count reads/gene against the revision 36 annotations with htseq.
  4. These alignments were also passed to the pileup functionality of samtools and the vcf/bcf utilities in order to make a matrix of all observed differences between each sample with respect to the reference.
  5. The freebayes variant estimation tool was used in addition to #4 to search for variant positions in a more robust fashion.
  6. The trimmed reads were passed to kraken2 using a viral database in order to look for samples with potential LRV sequence.
  7. An explicit, grep-based search for spliced leader reads was used against all human-derived samples. The results from this were copy/pasted into the sample sheet.

1.1 Multiple datasets

In a couple of important ways the TMRC2 data is much more complex than the TMRC3:

  1. It comprises multiple, completely separate queries:
    1. Sequencing the parasite samples
    2. Sequencing a set of human macrophage samples which were infected with specific parasite samples.
  2. The parasite transcriptomic samples comprise multiple different types of queries:
    1. Differential expression to look at strain, susceptibility, and clinical outcomes.
    2. Individual variant searches to look for potentially useful SNPs for classification of parasite samples.
  3. The human macrophage samples may be used to query both the host and parasite transcriptomes because (at least when not drug treated) there is a tremendous population of parasite reads in them.

1.2 Sample sheet(s)

Our shared online sample sheet is nearly static at the time of this writing (202209), I expect at this point the only likely updates will be to annotate some strains as more or less susceptible to drug treatment.

sample_sheet <- "sample_sheets/macrophage_samples.xlsx"

2 Annotations

Everything which follows depends on the Existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and columns are all the annotation types provided by TriTrypDB.

The same database was used to create a matrix of orthologous genes between L.panamensis and all of the other species in the TriTrypDB.

The same database of annotations also provides mappings to the set of annotated GO categories for the L.panamensis genome along with gene lengths.

meta <- download_eupath_metadata(webservice = "tritrypdb", overwrite = FALSE)
panamensis_entry <- get_eupath_entry("MHOM", metadata = meta[["valid"]])
panamensis_db <- make_eupath_orgdb(panamensis_entry)
panamensis_pkg <- panamensis_db[["pkgname"]]
package_name <- panamensis_db[["pkgname"]]
if (is.null(panamensis_pkg)) {
  panamensis_pkg <- panamensis_entry[["OrgdbPkg"]]
  package_name <- panamensis_pkg
}

tt <- library(panamensis_pkg, character.only = TRUE)
panamensis_env <- get0(panamensis_pkg)
all_fields <- columns(panamensis_env)
all_lp_annot <- sm(load_orgdb_annotations(
    panamensis_env,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes
## Testing to see just how big the full database is.
## testing <- load_orgdb_annotations(panamensis_pkg, keytype = "gid", fields = "all")

lp_go <- load_orgdb_go(panamensis_pkg)
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(extract_eupath_orthologs(db = panamensis_pkg))
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
all_installed <- rownames(installed.packages())
candidates <- grepl(pattern = "^org.Lpanamensis.MHOM.*v68.eg.db", x = all_installed)
orgdb_pkg_name <- all_installed[candidates]

tt <- library(orgdb_pkg_name, character.only = TRUE)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following object is masked from 'package:dplyr':
## 
##     explain
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:glue':
## 
##     trim
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
panamensis_pkg <- get0(orgdb_pkg_name)
all_fields <- columns(panamensis_pkg)
all_lp_annot <- sm(load_orgdb_annotations(
    panamensis_pkg,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes

lp_go <- load_orgdb_go(panamensis_pkg)
## The chosen keytype was not available.  Using 'GID'.
## This is an orgdb, good.
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")

3 Load a genome

The following block loads the full genome sequence for panamensis. We may use this later to attempt to estimate PCR primers to discern strains.

I am not sure how to increase the number of open files in a container, as a result this does not work.

testing_panamensis <- make_eupath_bsgenome(entry = panamensis_entry)
library(as.character(testing_panamensis), character.only = TRUE)
lp_genome <- get0(as.character(testing_panamensis))
data_structures <- c(data_structures, "lp_genome")

4 Generate Expressionsets and Sample Estimation

The process of sample estimation takes two primary inputs:

  1. The sample sheet, which contains all the metadata we currently have on hand, including filenames for the outputs of #3 and #4 above.
  2. The gene annotations.

An expressionSet(or summarizedExperiment) is a data structure used in R to examine RNASeq data. It is comprised of annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.

4.1 Define colors

The following list contains the colors we have chosen to use when plotting the various ways of discerning the data.

color_choices <- list(
  "strain" = list(
    ## "z1.0" = "#333333", ## Changed this to 'braz' to make it easier to find them.
    "z2.0" = "#555555",
    "z3.0" = "#777777",
    "z2.1" = "#874400",
    "z2.2" = "#0000cc",
    "z2.3" = "#cc0000",
    "z2.4" = "#df7000",
    "z3.2" = "#888888",
    "z1.0" = "#cc00cc",
    "z1.5" = "#cc00cc",
    "b2904" = "#cc00cc",
    "unknown" = "#cbcbcb"),
  ## "null" = "#000000"),
  "zymo" = list(
    "none" = "#000000",
    "z22" = "#0000cc",
    "z23" = "#cc0000"),
  "cf" = list(
    "cure" = "#006f00",
    "fail" = "#9dffa0",
    "unknown" = "#cbcbcb",
    "notapplicable" = "#000000"),
  "condition" = list(
    "inf" = "#199c75",
    "inf_sb" = "#d65d00",
    "uninf" = "#6e6ea3",
    "uninf_sb" = "#d83956"),
  "significance" = list(
    "lt0" = "#ffe0e0",
    "lt1" = "#ffa0a0",
    "lt2" = "#f94040",
    "lt4" = "#a00000",
    "gt0" = "#eeccf9",
    "gt1" = "#de8bf9",
    "gt2" = "#ad07e3",
    "gt4" = "#410257"),
  "drug" = list(
    "none" = "#989898",
    "antimony" = "#088b64"),
  "oldnew" = list(
    "previous" = "#2233aa",
    "current" =  "#9c0303"),
  "infectedp" = list(
    "uninfected" = "#676767",
    "infected" = "#ac06e2"),
  "treatment_zymo" = list(
    "inf_sb_z23" = "#E7298A",
    "inf_z23" = "#D95F02",
    "uninf_none" = "#66A61E",
    "uninf_sb_none" = "#E6AB02",
    "inf_z22" = "#1B9E77",
    "inf_sb_z22" = "#7570B3"),
  "donor" = list(
    "d01" = "#8d0000",
    "d02" = "#E6AB02",
    "d09" = "#7570B3",
    "d81" = "#2233aa"),
  "susceptibility" = list(
    "resistant" = "#8563a7",
    "sensitive" = "#8d0000",
    "ambiguous" = "#cbcbcb",
    "unknown" = "#555555"))
data_structures <- c(data_structures, "color_choices")

5 Macrophage data

All of the above focused entire on the parasite samples, now let us pull up the macrophage infected samples. This will comprise two datasets, one of the human and one of the parasite.

5.1 Macrophage host data

The metadata for the macrophage samples contains a couple of columns for mapped human and parasite reads. We will therefore use them separately to create two expressionsets, one for each species.

** Note **: I forgot to commit the addition of plot_metadata factors() in the last run of this. In addition, I need to add an explicit month to load_biomart_annotations() or change the function to search a couple more months before it stops trying to find an archive.

hs_annot <- load_biomart_annotations(year = "2020", month = "04", symbol_columns = "hgnc_symbol")
## Using mart: ENSEMBL_MART_ENSEMBL from host: apr2020.archive.ensembl.org.
## Successfully connected to the hsapiens_gene_ensembl database.
## Finished downloading ensembl gene annotations.
## Finished downloading ensembl structure annotations.
## Including symbols, there are 67159 vs the 249740 gene annotations.
## Not dropping haplotype chromosome annotations, set drop_haplotypes = TRUE if this is bad.
## Saving annotations to hsapiens_biomart_annotations.rda.
## Finished save().
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["transcript_version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique = TRUE)
rownames(hs_annot) <- paste0("gene:", rownames(hs_annot))
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]

sanitize_columns <- c("drug", "macrophagetreatment", "macrophagezymodeme")
macr_annot <- hs_annot
rownames(macr_annot) <- gsub(x = rownames(macr_annot),
                             pattern = "^gene:",
                             replacement = "")

hs_macrophage <- create_se(sample_sheet, gene_info = macr_annot,
                           file_column = "hg38100hisatfile") %>%
  set_conditions(fact = "macrophagetreatment") %>%
  set_batches(fact = "macrophagezymodeme") %>%
  sanitize_expt_pData(columns = sanitize_columns) %>%
  subset_se(nonzero = 12000)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': could not find function "set_batches"
fixed_genenames <- gsub(x = rownames(assay(hs_macrophage)), pattern = "^gene:",
                        replacement = "")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'gsub': error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'x' in selecting a method for function 'assay': object 'hs_macrophage' not found
hs_macrophage <- set_se_genenames(hs_macrophage, ids = fixed_genenames)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': error in evaluating the argument 'x' in selecting a method for function 'assay': object 'hs_macrophage' not found
table(colData(hs_macrophage)$condition)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'hs_macrophage' not found
## Added to make a simplified PCA plot.
colData(hs_macrophage)[["experiment"]] <- "macrophage"
## Error: object 'hs_macrophage' not found
## The following 3 lines were copy/pasted to datastructures and should be removed soon.
nostrain <- is.na(colData(hs_macrophage)[["strainid"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'hs_macrophage' not found
colData(hs_macrophage)[nostrain, "strainid"] <- "none"
## Error: object 'hs_macrophage' not found
colData(hs_macrophage)[["strain_zymo"]] <- paste0("s", colData(hs_macrophage)[["strainid"]],
                                                "_", colData(hs_macrophage)[["macrophagezymodeme"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'hs_macrophage' not found
uninfected <- colData(hs_macrophage)[["strain_zymo"]] == "snone_none"
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'hs_macrophage' not found
colData(hs_macrophage)[uninfected, "strain_zymo"] <- "uninfected"
## Error: object 'hs_macrophage' not found
colData(hs_macrophage)[["infectedp"]] <- "infected"
## Error: object 'hs_macrophage' not found
colData(hs_macrophage)[uninfected, "infectedp"] <- "uninfected"
## Error: object 'hs_macrophage' not found
data_structures <- c(data_structures, "hs_macrophage")

5.2 Double-check sample IDs against the sample sheet

1 sample has been excluded from the analysis but is in the sample sheet. I am reasonably certain I know which, but will double-check here.

sample_sheet_ids = c("TMRC30051","TMRC30057","TMRC30059","TMRC30060","TMRC30061","TMRC30062",
                     "TMRC30063","TMRC30064","TMRC30065","TMRC30066","TMRC30067","TMRC30069",
                     "TMRC30117","TMRC30162","TMRC30243","TMRC30244","TMRC30245","TMRC30246",
                     "TMRC30247","TMRC30248","TMRC30249","TMRC30250","TMRC30251","TMRC30252",
                     "TMRC30266","TMRC30267","TMRC30268","TMRC30286","TMRC30326","TMRC30316",
                     "TMRC30317","TMRC30322","TMRC30323","TMRC30328","TMRC30318","TMRC30319",
                     "TMRC30324","TMRC30325","TMRC30320","TMRC30321","TMRC30327","TMRC30312",
                     "TMRC30297","TMRC30298","TMRC30299","TMRC30300","TMRC30295","TMRC30296",
                     "TMRC30303","TMRC30304","TMRC30301","TMRC30302","TMRC30314","TMRC30315",
                     "TMRC30313")
found <- sample_sheet_ids %in% colnames(assay(hs_macrophage))
colnames(assay(hs_macrophage))[!found]

5.3 Subset and create different groupings

all_human <- sanitize_expt_pData(hs_macrophage, columns = "drug") %>%
  set_conditions(fact = "drug") %>%
  set_batches(fact = "typeofcells")
## Error in set_batches(., fact = "typeofcells"): could not find function "set_batches"
data_structures <- c(data_structures, "all_human")

## The following 3 lines were copy/pasted to datastructures and should be removed soon.
no_strain_idx <- colData(all_human)[["strainid"]] == "none"
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'all_human' not found
##colData(all_human)[["strainid"]] <- paste0("s", colData(all_human)[["strainid"]],
##                                         "_", colData(all_human)[["macrophagezymodeme"]])
colData(all_human)[no_strain_idx, "strainid"] <- "none"
## Error: object 'all_human' not found
table(colData(all_human)[["strainid"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'all_human' not found
all_human_types <- set_conditions(all_human, fact = "typeofcells") %>%
  set_batches(fact = "drug")
## Error in set_batches(., fact = "drug"): could not find function "set_batches"
data_structures <- c(data_structures, "all_human_types")

type_zymo_fact <- paste0(colData(all_human_types)[["condition"]], "_",
                         colData(all_human_types)[["macrophagezymodeme"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'all_human_types' not found
type_zymo <- set_conditions(all_human_types, fact = type_zymo_fact)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'se' in selecting a method for function 'set_conditions': object 'all_human_types' not found
data_structures <- c(data_structures, "type_zymo")

type_drug_fact <- paste0(colData(all_human_types)[["condition"]], "_",
                         colData(all_human_types)[["drug"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'all_human_types' not found
type_drug <- set_conditions(all_human_types, fact = type_drug_fact)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'se' in selecting a method for function 'set_conditions': object 'all_human_types' not found
data_structures <- c(data_structures, "type_drug")

strain_fact <- colData(all_human_types)[["strainid"]]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'all_human_types' not found
table(strain_fact)
## Error: object 'strain_fact' not found
new_conditions <- paste0(colData(hs_macrophage)[["macrophagetreatment"]], "_",
                         colData(hs_macrophage)[["macrophagezymodeme"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'hs_macrophage' not found
## Note the sanitize() call is redundant with the addition of sanitize() in the
## datastructures file, but I don't want to wait to rerun that.
hs_macr <- set_conditions(hs_macrophage, fact = new_conditions) %>%
  sanitize_expt_pData(column = "drug") %>%
  set_se_colors(color_choices[["treatment_zymo"]]) %>%
  subset_se(subset = "typeofcells!='U937'")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'se' in selecting a method for function 'set_conditions': object 'hs_macrophage' not found
data_structures <- c(data_structures, "hs_macr")

ggstats_parasite <- plot_metadata_factors(hs_macr, column = "parasitemappingrate",
                                         type = "ggstats", scale = "log2")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hs_macr' not found
pp(file = "images/ggstats_parasiterate_all_macrophage_drug_treatment.png")
ggstats_parasite
## Error: object 'ggstats_parasite' not found
dev.off()
## png 
##   2
ggstats_parasite
## Error: object 'ggstats_parasite' not found
hs_macr_drug_expt <- set_conditions(hs_macr, fact = "drug")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'se' in selecting a method for function 'set_conditions': object 'hs_macr' not found
hs_macr_strain_expt <- set_conditions(hs_macr, fact = "macrophagezymodeme") %>%
  subset_se(subset = "macrophagezymodeme != 'none'")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'se' in selecting a method for function 'set_conditions': object 'hs_macr' not found
data_structures <- c(data_structures, "hs_macr_strain_expt")

table(colData(hs_macr)[["strainid"]])
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colData': object 'hs_macr' not found

Let us see if the sankey plot of these samples looks useful…

ggstats_slreads <- plot_metadata_factors(hs_macrophage, column = "hisatlpsinglemapped",
                                         type = "ggstats", scale = "log2")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hs_macrophage' not found
pp(file = "images/ggstats_slreads_all_macrophage.png")
ggstats_slreads
## Error: object 'ggstats_slreads' not found
dev.off()
## png 
##   2
ggstats_slreads
## Error: object 'ggstats_slreads' not found
ggstats_violin <- plot_metadata_factors(hs_macrophage, column = "hisatlpsinglemapped",
                                        scale = "log2")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'object' in selecting a method for function 'pData': object 'hs_macrophage' not found
ggstats_violin
## Error: object 'ggstats_violin' not found
macr_sankey <- plot_meta_sankey(hs_macrophage, color_choices = color_choices,
                                factors = c("oldnew", "drug", "infectedp", "macrophagezymodeme"))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'design' in selecting a method for function 'plot_meta_sankey': object 'hs_macrophage' not found
macr_sankey
## Error: object 'macr_sankey' not found

Finally, split off the U937 samples.

hs_u937 <- subset_se(hs_macrophage, subset = "typeofcells!='Macrophages'")
## Error: object 'hs_macrophage' not found
data_structures <- c(data_structures, "hs_u937")

5.4 Macrophage parasite data

In the previous block, we used a new invocation of ensembl-derived annotation data, this time we can just use our existing parasite gene annotations.

lp_macrophage <- create_se(
  sample_sheet, file_column = "lpanamensisv36hisatfile", gene_info = all_lp_annot,
  savefile = glue("rda/lp_macrophage-v{ver}.rda"),
  annotation = "org.Lpanamensis.MHOMCOL81L13.v46.eg.db") %>%
  set_conditions(fact = "macrophagezymodeme") %>%
  set_batches(fact = "macrophagetreatment")
## Error in set_batches(., fact = "macrophagetreatment"): could not find function "set_batches"
unfilt_written <- write_expt(
  lp_macrophage,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_unfiltered-v{ver}.xlsx"))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'ncol': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'lp_macrophage' not found
lp_macrophage_filt <- subset_se(lp_macrophage, nonzero = 2500) %>%
  semantic_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
                  semantic_column = "annot_gene_product")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'input' in selecting a method for function 'semantic_filter': object 'lp_macrophage' not found
data_structures <- c(data_structures, "lp_macrophage", "lp_macrophage_filt")

filt_written <- write_expt(lp_macrophage_filt,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_filtered-v{ver}.xlsx"))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'ncol': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'lp_macrophage_filt' not found
lp_macrophage <- lp_macrophage_filt
## Error: object 'lp_macrophage_filt' not found
lp_macrophage_nosb <- subset_se(lp_macrophage, subset="batch!='inf_sb'")
## Error: object 'lp_macrophage' not found
lp_nosb_write <- write_expt(
  lp_macrophage_nosb,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_nosb_reads-v{ver}.xlsx"))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'ncol': error in evaluating the argument 'object' in selecting a method for function 'exprs': object 'lp_macrophage_nosb' not found
data_structures <- c(data_structures, "lp_macrophage_nosb")

spec <- make_rnaseq_spec()
test <- gather_preprocessing_metadata(sample_sheet, specification = spec)
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: trimomatic_input already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: trimomatic_output already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: trimomatic_percent already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: fastqc_pct_gc already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: hisat_genome_single_concordant already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: hisat_genome_multi_concordant already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: hisat_genome_single_all already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: hisat_genome_multi_all already exists, replacing it.
## Warning in gather_preprocessing_metadata(sample_sheet, specification = spec):
## Column: hisat_count_table already exists, replacing it.
## preprocessing/TMRC30051/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30057/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30059/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30060/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30061/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30062/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30063/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30064/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30065/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30066/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30067/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30069/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30117/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30162/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30243/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30244/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30245/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30246/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30247/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30248/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30249/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30250/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30251/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30252/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30266/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30267/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30268/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30286/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30291/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30292/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30293/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30294/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30295/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30296/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30297/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30298/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30299/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30300/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30301/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30302/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30303/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30304/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30305/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30306/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30307/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30308/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30309/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30310/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30311/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30312/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30313/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30314/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30315/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30316/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30317/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30318/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30319/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30320/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30321/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30322/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30323/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30324/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30325/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30326/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30327/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30328/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30330/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30331/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## preprocessing/TMRC30332/outputs/*hisat*_*/*_*genome*_gene_ID_fcounts.csv.xz
## Writing new metadata to: sample_sheets/macrophage_samples_modified.xlsx

6 Save all data structures into one rda

found_idx <- data_structures %in% ls()
if (sum(!found_idx) > 0) {
  not_found <- data_structures[!found_idx]
  warning("Some datastructures were not generated: ", toString(not_found), ".")
  data_structures <- data_structures[found_idx]
}
## Warning: Some datastructures were not generated: hs_macrophage, all_human,
## all_human_types, type_zymo, type_drug, hs_macr, hs_macr_strain_expt, hs_u937,
## lp_macrophage, lp_macrophage_filt, lp_macrophage_nosb.
save(list = data_structures, file = glue("rda/tmrc2_data_structures-v{ver}.rda"))
pander::pander(sessionInfo())
## Warning: Your system is mis-configured: '/etc/localtime' is not a symlink
## Warning: It is strongly recommended to set envionment variable TZ to
## 'America/New_York' (or equivalent)

R version 4.5.0 (2025-04-11)

Platform: x86_64-pc-linux-gnu

locale: C

attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: org.Lpanamensis.MHOMCOL81L13.v68.eg.db(v.2024.09), AnnotationDbi(v.1.70.0), IRanges(v.2.42.0), S4Vectors(v.0.46.0), Biobase(v.2.68.0), BiocGenerics(v.0.54.0), generics(v.0.1.4), dplyr(v.1.1.4), Heatplus(v.3.16.0), ggplot2(v.3.5.2), glue(v.1.8.0) and hpgltools(v.1.2)

loaded via a namespace (and not attached): RColorBrewer(v.1.1-3), jsonlite(v.2.0.0), magrittr(v.2.0.3), GenomicFeatures(v.1.60.0), farver(v.2.1.2), rmarkdown(v.2.29), fs(v.1.6.6), BiocIO(v.1.18.0), vctrs(v.0.6.5), memoise(v.2.0.1), Rsamtools(v.2.24.0), RCurl(v.1.98-1.17), htmltools(v.0.5.8.1), S4Arrays(v.1.8.1), progress(v.1.2.3), curl(v.7.0.0), broom(v.1.0.9), SparseArray(v.1.8.1), sass(v.0.4.10), bslib(v.0.9.0), htmlwidgets(v.1.6.4), httr2(v.1.2.1), plyr(v.1.8.9), plotly(v.4.11.0), cachem(v.1.1.0), GenomicAlignments(v.1.44.0), mime(v.0.13), lifecycle(v.1.0.4), iterators(v.1.0.14), pkgconfig(v.2.0.3), Matrix(v.1.7-3), R6(v.2.6.1), fastmap(v.1.2.0), GenomeInfoDbData(v.1.2.14), MatrixGenerics(v.1.20.0), shiny(v.1.11.1), digest(v.0.6.37), DESeq2(v.1.48.1), GenomicRanges(v.1.60.0), RSQLite(v.2.4.3), filelock(v.1.0.3), httr(v.1.4.7), abind(v.1.4-8), compiler(v.4.5.0), pander(v.0.6.6), bit64(v.4.6.0-1), withr(v.3.0.2), backports(v.1.5.0), BiocParallel(v.1.42.1), DBI(v.1.2.3), R.utils(v.2.13.0), biomaRt(v.2.64.0), rappdirs(v.0.3.3), DelayedArray(v.0.34.1), rjson(v.0.2.23), tools(v.4.5.0), zip(v.2.3.3), httpuv(v.1.6.16), varhandle(v.2.0.6), R.oo(v.1.27.1), restfulr(v.0.0.16), GOSemSim(v.2.34.0), promises(v.1.3.3), grid(v.4.5.0), reshape2(v.1.4.4), fgsea(v.1.34.2), gtable(v.0.3.6), tzdb(v.0.5.0), R.methodsS3(v.1.8.2), tidyr(v.1.3.1), hms(v.1.1.3), data.table(v.1.17.8), xml2(v.1.4.0), XVector(v.0.48.0), foreach(v.1.5.2), pillar(v.1.11.0), stringr(v.1.5.1), vroom(v.1.6.5), yulab.utils(v.0.2.1), limma(v.3.64.3), later(v.1.4.3), splines(v.4.5.0), BiocFileCache(v.2.16.1), lattice(v.0.22-7), rtracklayer(v.1.68.0), bit(v.4.6.0), annotate(v.1.86.1), tidyselect(v.1.2.1), GO.db(v.3.21.0), locfit(v.1.5-9.12), Biostrings(v.2.76.0), knitr(v.1.50), edgeR(v.4.6.3), SummarizedExperiment(v.1.38.1), xfun(v.0.53), statmod(v.1.5.0), matrixStats(v.1.5.0), stringi(v.1.8.7), UCSC.utils(v.1.4.0), lazyeval(v.0.2.2), yaml(v.2.3.10), evaluate(v.1.0.4), codetools(v.0.2-20), tibble(v.3.3.0), qvalue(v.2.40.0), graph(v.1.86.0), cli(v.3.6.5), xtable(v.1.8-4), jquerylib(v.0.1.4), dichromat(v.2.0-0.1), Rcpp(v.1.1.0), GenomeInfoDb(v.1.44.2), dbplyr(v.2.5.0), png(v.0.1-8), XML(v.3.99-0.19), parallel(v.4.5.0), readr(v.2.1.5), blob(v.1.2.4), prettyunits(v.1.2.0), DOSE(v.4.2.0), bitops(v.1.0-9), viridisLite(v.0.4.2), GSEABase(v.1.70.0), scales(v.1.4.0), openxlsx(v.4.2.8), purrr(v.1.1.0), crayon(v.1.5.3), rlang(v.1.1.6), cowplot(v.1.2.0), fastmatch(v.1.1-6) and KEGGREST(v.1.48.1)

message("This is hpgltools commit: ", get_git_commit())
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 7e330419388a441dce04bfb7cce31bb7c8be3ef4
## This is hpgltools commit: Thu Aug 28 22:20:45 2025 -0400: 7e330419388a441dce04bfb7cce31bb7c8be3ef4
message("Saving to ", savefile)
## Saving to 01datasets.rda.xz
# tmp <- sm(saveme(filename = savefile))
tmp <- loadme(filename = savefile)
---
title: "Macrophage response to different strains `r Sys.getenv('VERSION')`: Data Set Creation"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 7
    fig_width: 7
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
---

<style type="text/css">
body .main-container {
  max-width: 1600px;
}
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
  font-size: 16px
}
</style>

```{r options, include=FALSE}
library(hpgltools)
library(glue)
library(ggplot2)
library(Heatplus)
library(dplyr)

knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  out.width = "100%", dev = "png",
  dev.args = list("png" = list(type = "cairo-png")))
old_options <- options(
  digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- Sys.getenv("VERSION")
rundate <- format(Sys.Date(), format = "%Y%m%d")

rmd_file <- "01datasets.Rmd"
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
data_structures <- c()
```

# Introduction

This document is intended to create the data structures used to
evaluate our TMRC2 samples.  In some cases, this includes only those
samples starting in 2019; in other instances I am including our
previous (2015-2016) samples.

In all cases the processing performed was:

1.  Default trimming was performed.
2.  Hisat2 was used to map the remaining reads against the Leishmania
    panamensis genome revision 36.
3.  The alignments from hisat2 were used to count reads/gene against the
    revision 36 annotations with htseq.
4.  These alignments were also passed to the pileup functionality of samtools
    and the vcf/bcf utilities in order to make a matrix of all observed
    differences between each sample with respect to the reference.
5.  The freebayes variant estimation tool was used in addition to #4
    to search for variant positions in a more robust fashion.
6.  The trimmed reads were passed to kraken2 using a viral database in
    order to look for samples with potential LRV sequence.
7.  An explicit, grep-based search for spliced leader reads was used
    against all human-derived samples.  The results from this were
    copy/pasted into the sample sheet.

## Multiple datasets

In a couple of important ways the TMRC2 data is much more complex than the
TMRC3:

1.  It comprises multiple, completely separate queries:
    a.  Sequencing the parasite samples
    b.  Sequencing a set of human macrophage samples which were infected
        with specific parasite samples.
2.  The parasite transcriptomic samples comprise multiple different
    types of queries:
    a.  Differential expression to look at strain, susceptibility, and
    clinical outcomes.
    b.  Individual variant searches to look for potentially useful
    SNPs for classification of parasite samples.
3.  The human macrophage samples may be used to query both the host
    and parasite transcriptomes because (at least when not drug
    treated) there is a tremendous population of parasite reads in
    them.

## Sample sheet(s)

Our shared online sample sheet is nearly static at the time of this
writing (202209), I expect at this point the only likely updates will
be to annotate some strains as more or less susceptible to drug
treatment.

```{r}
sample_sheet <- "sample_sheets/macrophage_samples.xlsx"
```

# Annotations

Everything which follows depends on the Existing TriTrypDB annotations revision
46, circa 2019.  The following block loads a database of these annotations and
turns it into a matrix where the rows are genes and columns are all the
annotation types provided by TriTrypDB.

The same database was used to create a matrix of orthologous genes between
L.panamensis and all of the other species in the TriTrypDB.

The same database of annotations also provides mappings to the set of
annotated GO categories for the L.panamensis genome along with gene
lengths.

```{r, eval=FALSE}
meta <- download_eupath_metadata(webservice = "tritrypdb", overwrite = FALSE)
panamensis_entry <- get_eupath_entry("MHOM", metadata = meta[["valid"]])
panamensis_db <- make_eupath_orgdb(panamensis_entry)
panamensis_pkg <- panamensis_db[["pkgname"]]
package_name <- panamensis_db[["pkgname"]]
if (is.null(panamensis_pkg)) {
  panamensis_pkg <- panamensis_entry[["OrgdbPkg"]]
  package_name <- panamensis_pkg
}

tt <- library(panamensis_pkg, character.only = TRUE)
panamensis_env <- get0(panamensis_pkg)
all_fields <- columns(panamensis_env)
all_lp_annot <- sm(load_orgdb_annotations(
    panamensis_env,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes
## Testing to see just how big the full database is.
## testing <- load_orgdb_annotations(panamensis_pkg, keytype = "gid", fields = "all")

lp_go <- load_orgdb_go(panamensis_pkg)
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(extract_eupath_orthologs(db = panamensis_pkg))
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
```

```{r}
all_installed <- rownames(installed.packages())
candidates <- grepl(pattern = "^org.Lpanamensis.MHOM.*v68.eg.db", x = all_installed)
orgdb_pkg_name <- all_installed[candidates]

tt <- library(orgdb_pkg_name, character.only = TRUE)
panamensis_pkg <- get0(orgdb_pkg_name)
all_fields <- columns(panamensis_pkg)
all_lp_annot <- sm(load_orgdb_annotations(
    panamensis_pkg,
    keytype = "gid",
    fields = c("annot_gene_entrez_id", "annot_gene_name",
               "annot_strand", "annot_chromosome", "annot_cds_length",
               "annot_gene_product")))$genes

lp_go <- load_orgdb_go(panamensis_pkg)
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths)  <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
```

# Load a genome

The following block loads the full genome sequence for panamensis.  We
may use this later to attempt to estimate PCR primers to discern strains.

I am not sure how to increase the number of open files in a container,
as a result this does not work.

```{r genome, eval=FALSE}
testing_panamensis <- make_eupath_bsgenome(entry = panamensis_entry)
library(as.character(testing_panamensis), character.only = TRUE)
lp_genome <- get0(as.character(testing_panamensis))
data_structures <- c(data_structures, "lp_genome")
```

# Generate Expressionsets and Sample Estimation

The process of sample estimation takes two primary inputs:

1.  The sample sheet, which contains all the metadata we currently have on hand,
    including filenames for the outputs of #3 and #4 above.
2.  The gene annotations.

An expressionSet(or summarizedExperiment) is a data structure used in
R to examine RNASeq data.  It is comprised of annotations, metadata,
and expression data.  In the case of our processing pipeline, the
location of the expression data is provided by the filenames in the metadata.

## Define colors

The following list contains the colors we have chosen to use when
plotting the various ways of discerning the data.

```{r}
color_choices <- list(
  "strain" = list(
    ## "z1.0" = "#333333", ## Changed this to 'braz' to make it easier to find them.
    "z2.0" = "#555555",
    "z3.0" = "#777777",
    "z2.1" = "#874400",
    "z2.2" = "#0000cc",
    "z2.3" = "#cc0000",
    "z2.4" = "#df7000",
    "z3.2" = "#888888",
    "z1.0" = "#cc00cc",
    "z1.5" = "#cc00cc",
    "b2904" = "#cc00cc",
    "unknown" = "#cbcbcb"),
  ## "null" = "#000000"),
  "zymo" = list(
    "none" = "#000000",
    "z22" = "#0000cc",
    "z23" = "#cc0000"),
  "cf" = list(
    "cure" = "#006f00",
    "fail" = "#9dffa0",
    "unknown" = "#cbcbcb",
    "notapplicable" = "#000000"),
  "condition" = list(
    "inf" = "#199c75",
    "inf_sb" = "#d65d00",
    "uninf" = "#6e6ea3",
    "uninf_sb" = "#d83956"),
  "significance" = list(
    "lt0" = "#ffe0e0",
    "lt1" = "#ffa0a0",
    "lt2" = "#f94040",
    "lt4" = "#a00000",
    "gt0" = "#eeccf9",
    "gt1" = "#de8bf9",
    "gt2" = "#ad07e3",
    "gt4" = "#410257"),
  "drug" = list(
    "none" = "#989898",
    "antimony" = "#088b64"),
  "oldnew" = list(
    "previous" = "#2233aa",
    "current" =  "#9c0303"),
  "infectedp" = list(
    "uninfected" = "#676767",
    "infected" = "#ac06e2"),
  "treatment_zymo" = list(
    "inf_sb_z23" = "#E7298A",
    "inf_z23" = "#D95F02",
    "uninf_none" = "#66A61E",
    "uninf_sb_none" = "#E6AB02",
    "inf_z22" = "#1B9E77",
    "inf_sb_z22" = "#7570B3"),
  "donor" = list(
    "d01" = "#8d0000",
    "d02" = "#E6AB02",
    "d09" = "#7570B3",
    "d81" = "#2233aa"),
  "susceptibility" = list(
    "resistant" = "#8563a7",
    "sensitive" = "#8d0000",
    "ambiguous" = "#cbcbcb",
    "unknown" = "#555555"))
data_structures <- c(data_structures, "color_choices")
```

# Macrophage data

All of the above focused entire on the parasite samples, now let us
pull up the macrophage infected samples.  This will comprise two
datasets, one of the human and one of the parasite.

## Macrophage host data

The metadata for the macrophage samples contains a couple of columns
for mapped human and parasite reads.  We will therefore use them
separately to create two expressionsets, one for each species.

** Note **: I forgot to commit the addition of plot_metadata factors() in the last run of this.
In addition, I need to add an explicit month to
load_biomart_annotations() _or_ change the function to search a couple
more months before it stops trying to find an archive.

```{r}
hs_annot <- load_biomart_annotations(year = "2020", month = "04", symbol_columns = "hgnc_symbol")
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["transcript_version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique = TRUE)
rownames(hs_annot) <- paste0("gene:", rownames(hs_annot))
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]

sanitize_columns <- c("drug", "macrophagetreatment", "macrophagezymodeme")
macr_annot <- hs_annot
rownames(macr_annot) <- gsub(x = rownames(macr_annot),
                             pattern = "^gene:",
                             replacement = "")

hs_macrophage <- create_se(sample_sheet, gene_info = macr_annot,
                           file_column = "hg38100hisatfile") %>%
  set_conditions(fact = "macrophagetreatment") %>%
  set_batches(fact = "macrophagezymodeme") %>%
  sanitize_expt_pData(columns = sanitize_columns) %>%
  subset_se(nonzero = 12000)
fixed_genenames <- gsub(x = rownames(assay(hs_macrophage)), pattern = "^gene:",
                        replacement = "")
hs_macrophage <- set_se_genenames(hs_macrophage, ids = fixed_genenames)
table(colData(hs_macrophage)$condition)
## Added to make a simplified PCA plot.
colData(hs_macrophage)[["experiment"]] <- "macrophage"

## The following 3 lines were copy/pasted to datastructures and should be removed soon.
nostrain <- is.na(colData(hs_macrophage)[["strainid"]])
colData(hs_macrophage)[nostrain, "strainid"] <- "none"

colData(hs_macrophage)[["strain_zymo"]] <- paste0("s", colData(hs_macrophage)[["strainid"]],
                                                "_", colData(hs_macrophage)[["macrophagezymodeme"]])
uninfected <- colData(hs_macrophage)[["strain_zymo"]] == "snone_none"
colData(hs_macrophage)[uninfected, "strain_zymo"] <- "uninfected"

colData(hs_macrophage)[["infectedp"]] <- "infected"
colData(hs_macrophage)[uninfected, "infectedp"] <- "uninfected"

data_structures <- c(data_structures, "hs_macrophage")
```

## Double-check sample IDs against the sample sheet

1 sample has been excluded from the analysis but is in the sample
sheet.  I am reasonably certain I know which, but will double-check
here.

```{r, eval=FALSE}
sample_sheet_ids = c("TMRC30051","TMRC30057","TMRC30059","TMRC30060","TMRC30061","TMRC30062",
                     "TMRC30063","TMRC30064","TMRC30065","TMRC30066","TMRC30067","TMRC30069",
                     "TMRC30117","TMRC30162","TMRC30243","TMRC30244","TMRC30245","TMRC30246",
                     "TMRC30247","TMRC30248","TMRC30249","TMRC30250","TMRC30251","TMRC30252",
                     "TMRC30266","TMRC30267","TMRC30268","TMRC30286","TMRC30326","TMRC30316",
                     "TMRC30317","TMRC30322","TMRC30323","TMRC30328","TMRC30318","TMRC30319",
                     "TMRC30324","TMRC30325","TMRC30320","TMRC30321","TMRC30327","TMRC30312",
                     "TMRC30297","TMRC30298","TMRC30299","TMRC30300","TMRC30295","TMRC30296",
                     "TMRC30303","TMRC30304","TMRC30301","TMRC30302","TMRC30314","TMRC30315",
                     "TMRC30313")
found <- sample_sheet_ids %in% colnames(assay(hs_macrophage))
colnames(assay(hs_macrophage))[!found]
```

## Subset and create different groupings

```{r}
all_human <- sanitize_expt_pData(hs_macrophage, columns = "drug") %>%
  set_conditions(fact = "drug") %>%
  set_batches(fact = "typeofcells")
data_structures <- c(data_structures, "all_human")

## The following 3 lines were copy/pasted to datastructures and should be removed soon.
no_strain_idx <- colData(all_human)[["strainid"]] == "none"
##colData(all_human)[["strainid"]] <- paste0("s", colData(all_human)[["strainid"]],
##                                         "_", colData(all_human)[["macrophagezymodeme"]])
colData(all_human)[no_strain_idx, "strainid"] <- "none"
table(colData(all_human)[["strainid"]])

all_human_types <- set_conditions(all_human, fact = "typeofcells") %>%
  set_batches(fact = "drug")
data_structures <- c(data_structures, "all_human_types")

type_zymo_fact <- paste0(colData(all_human_types)[["condition"]], "_",
                         colData(all_human_types)[["macrophagezymodeme"]])
type_zymo <- set_conditions(all_human_types, fact = type_zymo_fact)
data_structures <- c(data_structures, "type_zymo")

type_drug_fact <- paste0(colData(all_human_types)[["condition"]], "_",
                         colData(all_human_types)[["drug"]])
type_drug <- set_conditions(all_human_types, fact = type_drug_fact)
data_structures <- c(data_structures, "type_drug")

strain_fact <- colData(all_human_types)[["strainid"]]
table(strain_fact)

new_conditions <- paste0(colData(hs_macrophage)[["macrophagetreatment"]], "_",
                         colData(hs_macrophage)[["macrophagezymodeme"]])
## Note the sanitize() call is redundant with the addition of sanitize() in the
## datastructures file, but I don't want to wait to rerun that.
hs_macr <- set_conditions(hs_macrophage, fact = new_conditions) %>%
  sanitize_expt_pData(column = "drug") %>%
  set_se_colors(color_choices[["treatment_zymo"]]) %>%
  subset_se(subset = "typeofcells!='U937'")
data_structures <- c(data_structures, "hs_macr")

ggstats_parasite <- plot_metadata_factors(hs_macr, column = "parasitemappingrate",
                                         type = "ggstats", scale = "log2")
pp(file = "images/ggstats_parasiterate_all_macrophage_drug_treatment.png")
ggstats_parasite
dev.off()
ggstats_parasite

hs_macr_drug_expt <- set_conditions(hs_macr, fact = "drug")

hs_macr_strain_expt <- set_conditions(hs_macr, fact = "macrophagezymodeme") %>%
  subset_se(subset = "macrophagezymodeme != 'none'")
data_structures <- c(data_structures, "hs_macr_strain_expt")

table(colData(hs_macr)[["strainid"]])
```

Let us see if the sankey plot of these samples looks useful...

```{r}
ggstats_slreads <- plot_metadata_factors(hs_macrophage, column = "hisatlpsinglemapped",
                                         type = "ggstats", scale = "log2")
pp(file = "images/ggstats_slreads_all_macrophage.png")
ggstats_slreads
dev.off()
ggstats_slreads
ggstats_violin <- plot_metadata_factors(hs_macrophage, column = "hisatlpsinglemapped",
                                        scale = "log2")
ggstats_violin

macr_sankey <- plot_meta_sankey(hs_macrophage, color_choices = color_choices,
                                factors = c("oldnew", "drug", "infectedp", "macrophagezymodeme"))
macr_sankey
```

Finally, split off the U937 samples.

```{r}
hs_u937 <- subset_se(hs_macrophage, subset = "typeofcells!='Macrophages'")
data_structures <- c(data_structures, "hs_u937")
```

## Macrophage parasite data

In the previous block, we used a new invocation of ensembl-derived
annotation data, this time we can just use our existing parasite gene
annotations.

```{r}
lp_macrophage <- create_se(
  sample_sheet, file_column = "lpanamensisv36hisatfile", gene_info = all_lp_annot,
  savefile = glue("rda/lp_macrophage-v{ver}.rda"),
  annotation = "org.Lpanamensis.MHOMCOL81L13.v46.eg.db") %>%
  set_conditions(fact = "macrophagezymodeme") %>%
  set_batches(fact = "macrophagetreatment")

unfilt_written <- write_expt(
  lp_macrophage,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_unfiltered-v{ver}.xlsx"))

lp_macrophage_filt <- subset_se(lp_macrophage, nonzero = 2500) %>%
  semantic_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
                  semantic_column = "annot_gene_product")
data_structures <- c(data_structures, "lp_macrophage", "lp_macrophage_filt")

filt_written <- write_expt(lp_macrophage_filt,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_filtered-v{ver}.xlsx"))
lp_macrophage <- lp_macrophage_filt

lp_macrophage_nosb <- subset_se(lp_macrophage, subset="batch!='inf_sb'")
lp_nosb_write <- write_expt(
  lp_macrophage_nosb,
  excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_nosb_reads-v{ver}.xlsx"))
data_structures <- c(data_structures, "lp_macrophage_nosb")

spec <- make_rnaseq_spec()
test <- gather_preprocessing_metadata(sample_sheet, specification = spec)
```

# Save all data structures into one rda

```{r}
found_idx <- data_structures %in% ls()
if (sum(!found_idx) > 0) {
  not_found <- data_structures[!found_idx]
  warning("Some datastructures were not generated: ", toString(not_found), ".")
  data_structures <- data_structures[found_idx]
}
save(list = data_structures, file = glue("rda/tmrc2_data_structures-v{ver}.rda"))
```

```{r}
pander::pander(sessionInfo())
message("This is hpgltools commit: ", get_git_commit())
message("Saving to ", savefile)
# tmp <- sm(saveme(filename = savefile))
```

```{r loadme_after, eval=FALSE}
tmp <- loadme(filename = savefile)
```
