Did the stuff on this morning’s TODO which came out of this morning’s meeting: do a PCA without the oddball strains (already done in the worksheet), highlight reference strains, and add L.major IDs and Descriptions (done by appending a collapsed version of the ortholog data to the all_lp_annot data).
Fixed human IDs for the macrophage data.
Changed input metadata sheets: primarily because I only remembered yesterday to finish the SL search for samples >TMRC20095. They are running now and will be added momentarily (I will have to redownload the sheet).
Setting up to make an hclust/phylogenetic tree of strains; use these as references: 2168 (2.3), 2272 (2.2); for the other 2.x, choose arbitrarily (lower numbers are better).
Added another sanitize columns call for Antimony vs. antimony and None vs. none in the TMRC2 macrophage samples.
This document is intended to create the data structures used to evaluate our TMRC2 samples. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.
In all cases the processing performed was:
I am thinking that this meeting will bring Maria Adelaida fully back into the analyses of the parasite data, and therefore may focus primarily on the goals rather than the analyses?
In a couple of important ways the TMRC2 data is much more complex than the TMRC3:
Our shared online sample sheet is nearly static at the time of this writing (202209); at this point I expect the only likely updates will be to annotate some strains as more or less susceptible to drug treatment.
sample_sheet <- "sample_sheets/ClinicalStrains_TMRC2.xlsx"
macrophage_sheet <- "sample_sheets/tmrc2_macrophage_samples.xlsx"
The following block provides an example invocation of how I automatically extract things like percent reads mapped/trimmed/etc. from the logs produced by trimmomatic/cutadapt/hisat/salmon/etc. The caveat is that this container only has a small portion of the material available in the main working tree; as a result, the new columns added to the sample sheet are relatively sparse compared to what I get on my computer.
In addition, because these samples have gone through ~ 3 different versions of my pipeline, and the code which extracts the numbers explicitly assumes only the most recent version (because it is the best!), it does not get out the data for all the samples.
## Did not find the column: sampleid.
## Setting the ID column to the first column.
## Dropped 11 rows from the sample metadata because the sample ID is blank.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Writing new metadata to: sample_sheets/ClinicalStrains_TMRC2_modified.xlsx
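As a rough illustration of the log extraction described above, here is a minimal sketch in base R. It assumes the log ends with a hisat2-style summary line of the form "NN.NN% overall alignment rate"; the function name `extract_alignment_rate` and the example log are hypothetical and this is not the code used by the pipeline. It also shows why samples processed by an older pipeline version end up with empty columns: when the expected line is absent, the result is NA.

```r
## Hypothetical helper, not the pipeline's own implementation.
## Assumption: the log contains a line like "95.00% overall alignment rate".
extract_alignment_rate <- function(log_lines) {
  pattern <- "^\\s*([0-9.]+)% overall alignment rate.*$"
  hit <- grep(pattern, log_lines, value = TRUE)
  if (length(hit) == 0) {
    ## Logs written in a different format simply yield no value.
    return(NA_real_)
  }
  ## Keep only the captured number from the first matching line.
  as.numeric(sub(pattern, "\\1", hit[1]))
}

## A made-up log used only for illustration.
example_log <- c(
  "1000 reads; of these:",
  "  1000 (100.00%) were unpaired; of these:",
  "    50 (5.00%) aligned 0 times",
  "    900 (90.00%) aligned exactly 1 time",
  "    50 (5.00%) aligned >1 times",
  "95.00% overall alignment rate")

rate <- extract_alignment_rate(example_log)
missing_rate <- extract_alignment_rate(c("a log in some other format"))
```

In practice one would call such a helper on `readLines()` of each log file named in the sample sheet and write the result into a new metadata column.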
Everything which follows depends on the existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and the columns are all the annotation types provided by TriTrypDB.
The same database was used to create a matrix of orthologous genes between L.panamensis and all of the other species in the TriTrypDB.
The same database of annotations also provides mappings to the set of annotated GO categories for the L.panamensis genome along with gene lengths.
## meta <- download_eupath_metadata(webservice = "tritrypdb", eu_version = "v46")
meta <- download_eupath_metadata(webservice = "tritrypdb")
## Loading taxonomy and species database to cross reference against the download.
## Working on 1/88: org.Tvivax.Y486.v68.eg.db.
## Working on 2/88: org.Tcongolense.IL3000.v68.eg.db.
## Working on 3/88: org.Laethiopica.L147.v68.eg.db.
## Working on 4/88: org.Ltropica.L590.v68.eg.db.
## Working on 5/88: org.Tcruzi.Tula.cl2.v68.eg.db.
## Working on 6/88: org.Lpanamensis.MHOMCOL81L13.v68.eg.db.
## Working on 7/88: org.Lbraziliensis.MHOMBR75M2903.v68.eg.db.
## Working on 8/88: org.Tcruzi.Dm28c.2014.v68.eg.db.
## Working on 9/88: org.Tbrucei.brucei.TREU927.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species/strain.
## Found more than one taxonomy ID match, returning the first match.
## Working on 10/88: org.Lmajor.Friedlin.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania major.
## Found more than one taxonomy ID match, returning the first match.
## Working on 11/88: org.Tcruzi.CL.Brener.v68.eg.db.
## Working on 12/88: org.Tcruzi.Esmeraldo.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species/strain.
## Found more than one taxonomy ID match, returning the first match.
## Working on 13/88: org.Lbraziliensis.MHOMBR75M2904.v68.eg.db.
## Working on 14/88: org.Trangeli.SC58.v68.eg.db.
## Working on 15/88: org.Linfantum.JPCM5.v68.eg.db.
## Working on 16/88: org.Tbrucei.gambiense.DAL972.v68.eg.db.
## Working on 17/88: org.Lmajor.LV39c5.v68.eg.db.
## Working on 18/88: org.Lmajor.SD.75.1.v68.eg.db.
## Working on 19/88: org.Tcruzi.JR.cl.4.v68.eg.db.
## Working on 20/88: org.Lmexicana.MHOMGT2001U1103.v68.eg.db.
## Working on 21/88: org.Ldonovani.BPK282A1.v68.eg.db.
## Working on 22/88: org.Adeanei.Cavalho.ATCC.PRA.265.v68.eg.db.
## Found 4 candidate genera matching Angomonas
## Found an exact match for the combination genus/species not strain for Angomonas deanei.
## Found more than one taxonomy ID match, returning the first match.
## Working on 23/88: org.Bayalai.B08.376.v68.eg.db.
## Found 20 candidate genera matching Blechomonas
## Found an exact match for the combination genus/species not strain for Blechomonas ayalai.
## Found more than one taxonomy ID match, returning the first match.
## Working on 24/88: org.Bnonstop.P57.v68.eg.db.
## Found 28 candidate genera matching Blastocrithidia
## Found a genus, but not species for Blastocrithidia nonstop P57, not adding taxon ID number.
## Working on 25/88: org.Bsaltans.Lake.Konstanz.v68.eg.db.
## Found 28 candidate genera matching Bodo
## Found an exact match for the combination genus/species not strain for Bodo saltans.
## Working on 26/88: org.Cfasciculata.Cf.Cl.v68.eg.db.
## Found 53 candidate genera matching Crithidia
## Found an exact match for the combination genus/species not strain for Crithidia fasciculata.
## Working on 27/88: org.Emonterogeii.LV88.v68.eg.db.
## Found 13 candidate genera matching Endotrypanum
## Found an exact match for the combination genus/species not strain for Endotrypanum monterogeii.
## Found more than one taxonomy ID match, returning the first match.
## Working on 28/88: org.Lamazonensis.MHOMBR71973M2269.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania amazonensis.
## Found more than one taxonomy ID match, returning the first match.
## Working on 29/88: org.Lamazonensis.PH8.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania amazonensis.
## Found more than one taxonomy ID match, returning the first match.
## Working on 30/88: org.Larabica.LEM1108.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania arabica.
## Working on 31/88: org.Lbraziliensis.MHOMBR75M2904.2019.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania braziliensis.
## Found more than one taxonomy ID match, returning the first match.
## Working on 32/88: org.Ldonovani.BHU.1220.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania donovani.
## Found more than one taxonomy ID match, returning the first match.
## Working on 33/88: org.Ldonovani.CL.SL.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania donovani.
## Found more than one taxonomy ID match, returning the first match.
## Working on 34/88: org.Ldonovani.HU3.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania donovani.
## Found more than one taxonomy ID match, returning the first match.
## Working on 35/88: org.Ldonovani.LV9.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania donovani.
## Found more than one taxonomy ID match, returning the first match.
## Working on 36/88: org.Lenriettii.LEM3045.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania enriettii.
## Found more than one taxonomy ID match, returning the first match.
## Working on 37/88: org.Lenriettii.MCAVBR2001CUR178.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania enriettii.
## Found more than one taxonomy ID match, returning the first match.
## Working on 38/88: org.Lgerbilli.LEM452.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania gerbilli.
## Found more than one taxonomy ID match, returning the first match.
## Working on 39/88: org.Lmajor.Friedlin.2021.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania major.
## Found more than one taxonomy ID match, returning the first match.
## Working on 40/88: org.Lmartiniquensis.LEM2494.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania martiniquensis.
## Found more than one taxonomy ID match, returning the first match.
## Working on 41/88: org.Lmartiniquensis.MHOMTH2012LSCM1.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania martiniquensis.
## Found more than one taxonomy ID match, returning the first match.
## Working on 42/88: org.Lorientalis.MHOMTH2014LSCM4.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania orientalis.
## Found more than one taxonomy ID match, returning the first match.
## Working on 43/88: org.Lpanamensis.MHOMPA94PSC.1.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania panamensis.
## Found more than one taxonomy ID match, returning the first match.
## Working on 44/88: org.Lpyrrhocoris.H10.v68.eg.db.
## Found 73 candidate genera matching Leptomonas
## Found an exact match for the combination genus/species not strain for Leptomonas pyrrhocoris.
## Found more than one taxonomy ID match, returning the first match.
## Working on 45/88: org.Lseymouri.ATCC.30220.v68.eg.db.
## Found 73 candidate genera matching Leptomonas
## Found an exact match for the combination genus/species not strain for Leptomonas seymouri.
## Working on 46/88: org.Lsp.Ghana.MHOMGH2012GH5.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania sp..
## Working on 47/88: org.Lsp.Namibia.MPRONA1975252LV425.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania sp..
## Working on 48/88: org.Ltarentolae.Parrot.TarII.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania tarentolae.
## Found more than one taxonomy ID match, returning the first match.
## Working on 49/88: org.Ltarentolae.Parrot.Tar.II.2019.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania tarentolae.
## Found more than one taxonomy ID match, returning the first match.
## Working on 50/88: org.Lturanica.LEM423.v68.eg.db.
## Found 279 candidate genera matching Leishmania
## Found an exact match for the combination genus/species not strain for Leishmania turanica.
## Working on 51/88: org.Pconfusum.CUL13.v68.eg.db.
## Found 4 candidate genera matching Paratrypanosoma
## Found an exact match for the combination genus/species not strain for Paratrypanosoma confusum.
## Found more than one taxonomy ID match, returning the first match.
## Working on 52/88: org.Phertigi.MCOEPA1965C119.v68.eg.db.
## Found 3 candidate genera matching Porcisia
## Found an exact match for the combination genus/species not strain for Porcisia hertigi.
## Working on 53/88: org.Tbrucei.EATRO1125.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma brucei.
## Found more than one taxonomy ID match, returning the first match.
## Working on 54/88: org.Tbrucei.Lister.427.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma brucei.
## Found more than one taxonomy ID match, returning the first match.
## Working on 55/88: org.Tbrucei.Lister.427.2018.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma brucei.
## Found more than one taxonomy ID match, returning the first match.
## Working on 56/88: org.Tcongolense.IL3000.2019.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma congolense.
## Working on 57/88: org.Tcongolense.Tc1148.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma congolense.
## Working on 58/88: org.Tcruzi.231.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 59/88: org.Tcruzi.Berenice.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 60/88: org.Tcruzi.Brazil.A4.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 61/88: org.Tcruzi.Bug2148.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 62/88: org.Tcruzi.CL.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 63/88: org.Tcruzi.CL.Brener.Esmeraldo.like.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 64/88: org.Tcruzi.CL.Brener.Non.Esmeraldo.like.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 65/88: org.Tcruzi.Dm28c.2017.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 66/88: org.Tcruzi.Dm28c.2018.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 67/88: org.Tcruzi.G.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 68/88: org.Tcruzi.S11.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 69/88: org.Tcruzi.S15.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 70/88: org.Tcruzi.S154a.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 71/88: org.Tcruzi.S162a.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 72/88: org.Tcruzi.S23b.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 73/88: org.Tcruzi.S44a.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 74/88: org.Tcruzi.S92a.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 75/88: org.Tcruzi.Sylvio.X101.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 76/88: org.Tcruzi.Sylvio.X101.2012.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 77/88: org.Tcruzi.TCC.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 78/88: org.Tcruzi.Y.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 79/88: org.Tcruzi.Y.C6.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 80/88: org.Tcruzi.Ycl2.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 81/88: org.Tcruzi.Ycl4.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 82/88: org.Tcruzi.Ycl6.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 83/88: org.Tcruzi.marinkellei.B7.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma cruzi.
## Working on 84/88: org.Tequiperdum.OVI.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma equiperdum.
## Found more than one taxonomy ID match, returning the first match.
## Working on 85/88: org.Tevansi.STIB.805.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma evansi.
## Found more than one taxonomy ID match, returning the first match.
## Working on 86/88: org.Tgrayi.ANR4.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma grayi.
## Working on 87/88: org.Tmelophagium.St.Kilda.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma melophagium.
## Working on 88/88: org.Ttheileri.isolate.Edinburgh.v68.eg.db.
## Found 549 candidate genera matching Trypanosoma
## Found an exact match for the combination genus/species not strain for Trypanosoma theileri.
## Found the following hits: Leishmania panamensis MHOM/COL/81/L13, Leishmania braziliensis MHOM/BR/75/M2903, Leishmania braziliensis MHOM/BR/75/M2904, Leishmania mexicana MHOM/GT/2001/U1103, Leishmania sp. Ghana MHOM/GH/2012/GH5, choosing the first.
## Using: Leishmania panamensis MHOM/COL/81/L13.
## org.Lpanamensis.MHOMCOL81L13.v68.eg.db is already installed and a copy should be found at: /data/renv/library/R-4.3/x86_64-conda-linux-gnu/org.Lpanamensis.MHOMCOL81L13.v68.eg.db/extdata/org.Lpanamensis.MHOMCOL81L13.v68.eg.sqlite.
panamensis_pkg <- panamensis_db[["pkgname"]]
package_name <- panamensis_db[["pkgname"]]
if (is.null(panamensis_pkg)) {
panamensis_pkg <- panamensis_db[["orgdb_name"]]
package_name <- panamensis_pkg
}
tt <- library(panamensis_pkg, character.only = TRUE)
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
##
panamensis_pkg <- get0(panamensis_pkg)
all_fields <- columns(panamensis_pkg)
all_lp_annot <- sm(load_orgdb_annotations(
panamensis_pkg,
keytype = "gid",
fields = c("annot_gene_entrez_id", "annot_gene_name",
"annot_strand", "annot_chromosome", "annot_cds_length",
"annot_gene_product")))$genes
lp_go <- load_orgdb_go(package_name)
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- all_lp_annot[, c("gid", "annot_cds_length")]
colnames(lp_lengths) <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(extract_eupath_orthologs(db = panamensis_pkg))
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
Recently there was a request to include the Leishmania major gene IDs and descriptions. Thus I will extract them along with the orthologs and append that to the annotations used.
Having spent the time to run the following code, I realized that the orthologs data structure above actually already has the gene IDs and descriptions.
Thus I will leave my query in place to extract the major annotations, but follow it up with a collapse of the major orthologs and appending of that to the panamensis annotations.
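To make the collapse-and-append step concrete, here is a small base R sketch on made-up data. The gene and ortholog identifiers are placeholders and the object names (`toy_orthos`, `toy_annot`) are hypothetical; only the column names GID, ORTHOLOGS_ID and ORTHOLOGS_NAME are taken from this worksheet. It uses `aggregate()` rather than dplyr, so it is an equivalent illustration and not the code run here.

```r
## Made-up ortholog table: geneA has two L. major orthologs, geneB has one.
toy_orthos <- data.frame(
  GID = c("geneA", "geneA", "geneB"),
  ORTHOLOGS_ID = c("lm_1", "lm_2", "lm_3"),
  ORTHOLOGS_NAME = c("kinase", "kinase-like", "hypothetical protein"),
  stringsAsFactors = FALSE)

## Collapse to one row per GID, joining multiple orthologs with " ; ".
toy_collapsed <- aggregate(
  toy_orthos[, c("ORTHOLOGS_ID", "ORTHOLOGS_NAME")],
  by = list(GID = toy_orthos[["GID"]]),
  FUN = paste, collapse = " ; ")

## Made-up annotation table keyed by row name; geneC has no ortholog.
toy_annot <- data.frame(
  annot_gene_product = c("product a", "product b", "product c"),
  row.names = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE)

## all.x = TRUE keeps genes without orthologs; their new columns are NA.
toy_merged <- merge(toy_annot, toy_collapsed, by.x = "row.names",
                    by.y = "GID", all.x = TRUE)
rownames(toy_merged) <- toy_merged[["Row.names"]]
toy_merged[["Row.names"]] <- NULL
```

The point of collapsing first is that the merge then cannot duplicate annotation rows for genes with several orthologs.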
orgdb <- "org.Lmajor.Friedlin.v49.eg.db"
tt <- sm(library(orgdb, character.only = TRUE))
major_db <- org.Lmajor.Friedlin.v49.eg.db
all_fields <- columns(major_db)
all_lm_annot <- sm(load_orgdb_annotations(
major_db,
keytype = "gid",
fields = c("annot_gene_entrez_id", "annot_gene_name",
"annot_strand", "annot_chromosome", "annot_cds_length",
"annot_gene_product")))$genes
wanted_orthos_idx <- orthos[["ORTHOLOGS_SPECIES"]] == "Leishmania major strain Friedlin"
sum(wanted_orthos_idx)
wanted_orthos <- orthos[wanted_orthos_idx, ]
wanted_orthos <- wanted_orthos[, c("GID", "ORTHOLOGS_ID", "ORTHOLOGS_NAME")]
collapsed_orthos <- wanted_orthos %>%
group_by(GID) %>%
summarise(collapsed_id = stringr::str_c(ORTHOLOGS_ID, collapse = " ; "),
collapsed_name = stringr::str_c(ORTHOLOGS_NAME, collapse = " ; "))
all_lp_annot <- merge(all_lp_annot, collapsed_orthos, by.x = "row.names",
by.y = "GID", all.x = TRUE)
rownames(all_lp_annot) <- all_lp_annot[["Row.names"]]
all_lp_annot[["Row.names"]] <- NULL
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")
The following block loads the full genome sequence for panamensis. We may use this later to attempt to design PCR primers to discern strains.
I am not sure how to increase the number of open files in a container; as a result, this does not work.
## testing_panamensis <- make_eupath_bsgenome(entry = panamensis_entry, eu_version = "v46")
testing_panamensis <- make_eupath_bsgenome(entry = panamensis_entry)
library(as.character(testing_panamensis), character.only = TRUE)
lp_genome <- get0(as.character(testing_panamensis))
data_structures <- c(data_structures, "lp_genome", "meta")
The process of sample estimation takes two primary inputs:
An expressionSet (or summarizedExperiment) is a data structure used in R to examine RNASeq data. It comprises annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.
The following samples are much lower coverage:
There is a set of strains which acquired resistance in vitro. These are included in the dataset, but there are not likely enough of them to query that question explicitly.
The following list contains the colors we have chosen to use when plotting the various ways of discerning the data.
color_choices <- list(
"strain" = list(
## "z1.0" = "#333333", ## Changed this to 'braz' to make it easier to find them.
"z2.0" = "#555555",
"z3.0" = "#777777",
"z2.1" = "#874400",
"z2.2" = "#0000cc",
"z2.3" = "#cc0000",
"z2.4" = "#df7000",
"z3.2" = "#888888",
"z1.0" = "#cc00cc",
"z1.5" = "#cc00cc",
"b2904" = "#cc00cc",
"unknown" = "#cbcbcb"),
## "null" = "#000000"),
"zymo" = list(
"z22" = "#0000cc",
"z23" = "#cc0000"),
"cf" = list(
"cure" = "#006f00",
"fail" = "#9dffa0",
"unknown" = "#cbcbcb",
"notapplicable" = "#000000"),
"susceptibility" = list(
"resistant" = "#8563a7",
"sensitive" = "#8d0000",
"ambiguous" = "#cbcbcb",
"unknown" = "#555555"))
data_structures <- c(data_structures, "color_choices")
The data structure ‘lp_expt’ contains the data for all samples which have hisat2 count tables, and which pass a few initial quality tests (e.g. they must have more than 8550 genes with >0 counts and >5e6 reads which mapped to a gene); genes which are annotated with a few key redundant categories (leishmanolysin for example) are also culled.
There are a few metadata columns which we really want to make certain are standardized.
Note: I changed this to print both the number of reads and genes for removed samples.
sanitize_columns <- c("passagenumber", "clinicalresponse", "clinicalcategorical",
"zymodemecategorical", "included")
lp_expt <- create_expt(sample_sheet,
gene_info = all_lp_annot,
annotation_name = package_name,
savefile = glue("rda/tmrc2_lp_expt_all_raw-v{ver}.rda"),
id_column = "hpglidentifier",
annotation = package_name, ## this is redundant
file_column = "lpanamensisv36hisatfile") %>%
set_expt_conditions(fact = "zymodemecategorical", colors = color_choices[["strain"]]) %>%
semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
semantic_column = "annot_gene_product") %>%
sanitize_expt_pData(columns = sanitize_columns) %>%
subset_expt(subset = "included=='yes'") %>%
set_expt_factors(columns = sanitize_columns, class = "factor")
## Reading the sample metadata.
## Dropped 11 rows from the sample metadata because the sample ID is blank.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 98 rows(samples) and 73 columns(metadata fields).
## Warning in create_expt(sample_sheet, gene_info = all_lp_annot, annotation_name
## = package_name, : Some samples were removed when cross referencing the samples
## against the count data.
## Matched 8778 annotations and counts.
## Bringing together the count matrix and gene information.
## The final expressionset has 8778 features and 96 samples.
## The numbers of samples by condition are:
##
## z1.5 z2.1 z2.2 z2.3 z2.4 z3.2
## 1 7 44 41 2 1
## Warning in set_expt_colors(new_expt, colors = colors): Colors for the following
## categories are not being used: z2.0z3.0z1.0b2904unknown.
## semantic_expt_filter(): Removed 68 genes.
## The samples excluded are: TMRC20025, TMRC20061, TMRC20106.
## subset_expt(): There were 96, now there are 93 samples.
data_structures <- c(data_structures, "lp_expt")
save(list = "lp_expt", file = glue("rda/tmrc2_lp_expt_all_sanitized-v{ver}.rda"))
table(pData(lp_expt)[["zymodemecategorical"]])
##
## z21 z22 z23 z24
## 7 43 41 2
##
## cure failure nd
## 41 34 18
##
## cure fail unknown
## 41 34 18
## [1] 93
## [1] "TMRC20002" "TMRC20004" "TMRC20067" "TMRC20068" "TMRC20041" "TMRC20015"
## [7] "TMRC20009" "TMRC20016" "TMRC20011" "TMRC20017" "TMRC20019" "TMRC20024"
## [13] "TMRC20036" "TMRC20069" "TMRC20033" "TMRC20031" "TMRC20055" "TMRC20078"
## [19] "TMRC20094" "TMRC20042" "TMRC20058" "TMRC20072" "TMRC20059" "TMRC20048"
## [25] "TMRC20057" "TMRC20088" "TMRC20056" "TMRC20043" "TMRC20046" "TMRC20093"
## [31] "TMRC20089" "TMRC20047" "TMRC20090" "TMRC20044" "TMRC20045" "TMRC20108"
## [37] "TMRC20096" "TMRC20101" "TMRC20092" "TMRC20091" "TMRC20095"
## [1] "TMRC20001" "TMRC20065" "TMRC20039" "TMRC20010" "TMRC20012" "TMRC20013"
## [7] "TMRC20014" "TMRC20018" "TMRC20070" "TMRC20020" "TMRC20021" "TMRC20022"
## [13] "TMRC20026" "TMRC20076" "TMRC20073" "TMRC20079" "TMRC20071" "TMRC20060"
## [19] "TMRC20083" "TMRC20085" "TMRC20105" "TMRC20109" "TMRC20098" "TMRC20082"
## [25] "TMRC20102" "TMRC20099" "TMRC20100" "TMRC20084" "TMRC20087" "TMRC20103"
## [31] "TMRC20104" "TMRC20086" "TMRC20107" "TMRC20081"
unknown_ids <- pData(lp_expt)[["clinicalcategorical"]] == "unknown"
rownames(pData(lp_expt))[unknown_ids]
## [1] "TMRC20005" "TMRC20066" "TMRC20037" "TMRC20038" "TMRC20077" "TMRC20074"
## [7] "TMRC20063" "TMRC20053" "TMRC20052" "TMRC20064" "TMRC20075" "TMRC20051"
## [13] "TMRC20050" "TMRC20049" "TMRC20062" "TMRC20110" "TMRC20080" "TMRC20054"
All the following data will derive from this starting point.
Here is a table of my current classifier’s interpretation of the strains.
##
## unknown z21 z22 z23 z24
## 2 5 43 41 2
merged_zymo <- lp_expt
pData(merged_zymo)[["zymodeme"]] <- as.character(pData(merged_zymo)[["zymodemecategorical"]])
z21_idx <- pData(merged_zymo)[["zymodeme"]] == "z21"
pData(merged_zymo)[z21_idx, "zymodeme"] <- "z22"
z24_idx <- pData(merged_zymo)[["zymodeme"]] == "z24"
pData(merged_zymo)[z24_idx, "zymodeme"] <- "z23"
keepers <- pData(merged_zymo)[["zymodeme"]] == "z22" |
pData(merged_zymo)[["zymodeme"]] == "z23"
merged_zymo <- merged_zymo[, keepers] %>%
set_expt_conditions(fact = "zymodeme", colors = color_choices[["zymo"]])
## Subsetting on samples.
## The samples excluded are: .
## subset_expt(): There were 93, now there are 93 samples.
## The numbers of samples by condition are:
##
## z22 z23
## 50 43
##
## cure fail unknown
## 41 34 18
unknown_ids <- pData(lp_expt)[["clinicalcategorical"]] == "unknown"
rownames(pData(lp_expt))[unknown_ids]
## [1] "TMRC20005" "TMRC20066" "TMRC20037" "TMRC20038" "TMRC20077" "TMRC20074"
## [7] "TMRC20063" "TMRC20053" "TMRC20052" "TMRC20064" "TMRC20075" "TMRC20051"
## [13] "TMRC20050" "TMRC20049" "TMRC20062" "TMRC20110" "TMRC20080" "TMRC20054"
failed_ids <- pData(lp_expt)[["clinicalcategorical"]] == "fail"
rownames(pData(lp_expt))[failed_ids]
## [1] "TMRC20001" "TMRC20065" "TMRC20039" "TMRC20010" "TMRC20012" "TMRC20013"
## [7] "TMRC20014" "TMRC20018" "TMRC20070" "TMRC20020" "TMRC20021" "TMRC20022"
## [13] "TMRC20026" "TMRC20076" "TMRC20073" "TMRC20079" "TMRC20071" "TMRC20060"
## [19] "TMRC20083" "TMRC20085" "TMRC20105" "TMRC20109" "TMRC20098" "TMRC20082"
## [25] "TMRC20102" "TMRC20099" "TMRC20100" "TMRC20084" "TMRC20087" "TMRC20103"
## [31] "TMRC20104" "TMRC20086" "TMRC20107" "TMRC20081"
## Library sizes of 93 samples,
## ranging from 551,386 to 135,385,347.
pdf(file = "figures/library_size_pre_filter.pdf", width = 24, height = 12)
pre_libsize$plot
dev.off()
## png
## 2
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 93 samples.
## These samples have an average 28.13 CPM coverage and 8625 genes observed, ranging from 8387 to
## 8681.
## Warning: ggrepel: 79 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 82 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## png
## 2
## The samples (and read coverage) removed when filtering 8550 non-zero genes are:
## TMRC20002 TMRC20004
## 11496812 551386
## by number of genes.
## subset_expt(): There were 93, now there are 91 samples.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 91 samples.
## These samples have an average 28.61 CPM coverage and 8629 genes observed, ranging from 8573 to
## 8681.
## Warning: ggrepel: 72 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
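The non-zero gene filter reported in the output above can be sketched in base R as follows. This is only an illustration of the idea, not the subset_expt() implementation: the toy count matrix, the sample names, and the choice of `>=` for the cutoff are all my assumptions.

```r
## Toy count matrix: 4 genes (rows) by 3 samples (columns); all values are made up.
counts <- matrix(c(5, 0, 3, 9,
                   0, 0, 1, 0,
                   2, 4, 6, 8),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4),
                                 c("sampleA", "sampleB", "sampleC")))
## Hypothetical cutoff; the worksheet used 8550 for the real parasite data.
min_genes <- 3
## Count how many genes have at least one read in each sample.
observed <- colSums(counts > 0)
## Keep only the samples which meet the cutoff (assumed to be inclusive).
keepers <- observed >= min_genes
filtered <- counts[, keepers, drop = FALSE]
observed
colnames(filtered)
```

Samples with very few observed genes (sampleB here) are dropped, while the gene rows themselves are left untouched.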
Take column ‘Q’ in the sample sheet and make a categorical version of it with these parameters:
Note that these cutoffs are only valid for the historical data. The newer susceptibility data uses a cutoff of 0.78 for sensitive. I will set ambiguous to 0.5 to 0.78?
max_resist_historical <- 0.35
min_sensitive_historical <- 0.49
## 202305: Removed ambiguous category for the current set.
max_resist_current <- 0.76
min_sensitive_current <- 0.77
The sanitize_percent() function seeks to make the percentage values recorded by excel more reliable. Unfortunately, sometimes excel displays the value ‘49%’ when the information recorded in the worksheet is any one of the following:
Thus, the following block will sanitize these percentage values into a single decimal number and make a categorical variable from it using pre-defined values for resistant/ambiguous/sensitive. This categorical variable will be stored in a new column: ‘sus_category_historical’.
st <- pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvhistoricaldata"]]
starting <- sanitize_percent(st)
st
## [1] "0.45" "0.14" "0.97" "0" "0.97" "0" "0"
## [8] "0.46" "0.45" "0.97" "0.56" "0.99" "0.46" "0.7"
## [15] "0.99" "0.99" "0.45" "0.98" "0.99" "0.49" "No data"
## [22] "No data" "0.99" "0.66" "0.99" "0.99" "1" "1"
## [29] "0.94" "0.94" "No data" "No data" "No data" "No data" "No data"
## [36] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [43] "No data" "No data" "0.99" "0.99" "No data" "0.98" "0.97"
## [50] "0.96" "0.96" "0" "0" "0" "0.06" "0.94"
## [57] "0.94" "0.03" "0.94" "0" "0.25" "0.95" "0.27"
## [64] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [71] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [78] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [85] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [1] 0.45 0.14 0.97 0.00 0.97 0.00 0.00 0.46 0.45 0.97 0.56 0.99 0.46 0.70 0.99
## [16] 0.99 0.45 0.98 0.99 0.49 NA NA 0.99 0.66 0.99 0.99 1.00 1.00 0.94 0.94
## [31] NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0.99
## [46] 0.99 NA 0.98 0.97 0.96 0.96 0.00 0.00 0.00 0.06 0.94 0.94 0.03 0.94 0.00
## [61] 0.25 0.95 0.27 NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [91] NA
## [1] 45
sus_categorical[na_idx] <- "unknown"
resist_idx <- starting <= max_resist_historical
sus_categorical[resist_idx] <- "resistant"
indeterminant_idx <- starting > max_resist_historical &
starting < min_sensitive_historical
sus_categorical[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting >= min_sensitive_historical
sus_categorical[susceptible_idx] <- "sensitive"
sus_categorical <- as.factor(sus_categorical)
pData(lp_expt)[["sus_category_historical"]] <- sus_categorical
table(sus_categorical)
## sus_categorical
## ambiguous resistant sensitive unknown
## 5 12 29 45
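As a rough illustration of the sanitize-then-categorize logic above, here is a minimal sketch in base R. It is not the sanitize_percent() implementation from hpgltools: the helper names sanitize_percent_sketch() and categorize_susceptibility() and the example inputs are hypothetical, and I am assuming that any value written with a percent sign or greater than 1 is on a 0-100 scale.

```r
## Hypothetical stand-in for sanitize_percent(); the real function may behave differently.
sanitize_percent_sketch <- function(x) {
  x <- trimws(as.character(x))
  has_pct <- grepl("%$", x)
  ## Non-numeric entries such as "No data" become NA.
  num <- suppressWarnings(as.numeric(sub("%$", "", x)))
  ## Assumption: values with a '%' or larger than 1 are on a 0-100 scale.
  rescale <- !is.na(num) & (has_pct | num > 1)
  num[rescale] <- num[rescale] / 100
  num
}

## Bin numeric values into the four categories used in this worksheet.
categorize_susceptibility <- function(values, max_resist, min_sensitive) {
  category <- rep("ambiguous", length(values))
  category[is.na(values)] <- "unknown"
  category[!is.na(values) & values <= max_resist] <- "resistant"
  category[!is.na(values) & values >= min_sensitive] <- "sensitive"
  factor(category, levels = c("ambiguous", "resistant", "sensitive", "unknown"))
}

## Made-up inputs mixing the formats one might find in a spreadsheet column.
raw <- c("0.45", "49%", "0.14", "No data", "0.97", "0")
clean <- sanitize_percent_sketch(raw)
cats <- categorize_susceptibility(clean, max_resist = 0.35, min_sensitive = 0.49)
table(cats)
```

Guarding each comparison with `!is.na()` keeps the missing values in the ‘unknown’ bin regardless of the order in which the categories are assigned.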
two_sankey <- plot_meta_sankey(
merged_zymo, factors = c("zymodeme", "clinicalcategorical", "susceptibility"),
drill_down = TRUE, color_choices = color_choices)
## These columns are not in the metadata: susceptibility
## A sankey plot describing the metadata of 93 samples,
## including 8 out of 0 nodes and traversing metadata factors:
## .
The same process will be repeated for the current iteration of the sensitivity assay and stored in the ‘sus_category_current’ column.
starting_current <- sanitize_percent(pData(lp_expt)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]])
sus_categorical_current <- starting_current
na_idx <- is.na(starting_current)
sum(na_idx)
## [1] 0
sus_categorical_current[na_idx] <- "unknown"
resist_idx <- starting_current <= max_resist_current
sus_categorical_current[resist_idx] <- "resistant"
indeterminant_idx <- starting_current > max_resist_current &
starting_current < min_sensitive_current
sus_categorical_current[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting_current >= min_sensitive_current
sus_categorical_current[susceptible_idx] <- "sensitive"
sus_categorical_current <- as.factor(sus_categorical_current)
pData(lp_expt)[["sus_category_current"]] <- sus_categorical_current
pData(lp_expt)[["susceptibility"]] <- sus_categorical_current
table(sus_categorical_current)
## sus_categorical_current
## resistant sensitive
## 45 46
lp_sankey <- plot_meta_sankey(
lp_expt, factors = c("zymodemecategorical", "clinicalcategorical", "susceptibility"),
drill_down = TRUE, color_choices = color_choices)
## Warning: attributes are not identical across measure variables; they will be
## dropped
## A sankey plot describing the metadata of 91 samples,
## including 24 out of 0 nodes and traversing metadata factors:
## .
In many queries, we will seek to compare only the two primary strains, zymodeme 2.2 and 2.3. The following block will extract only those samples.
Note: IMPORTANT Maria Adelaida prefers not to use lp_two_strains. We should not at this time use the merged 2.1/2.2 and 2.4/2.3 categories.
lp_strain <- lp_expt %>%
set_expt_batches(fact = sus_categorical_current) %>%
set_expt_colors(color_choices[["strain"]])
## The number of samples by batch are:
##
## resistant sensitive
## 45 46
## Warning in set_expt_colors(., color_choices[["strain"]]): Colors for the
## following categories are not being used: z2.0z3.0z3.2z1.0z1.5b2904unknown.
##
## z2.1 z2.2 z2.3 z2.4
## 7 41 41 2
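Restricting a metadata table to the two primary zymodemes can be sketched in base R as below. This is a toy example under stated assumptions: the sample names and the data frame are made up, and with a real expressionset one would presumably subset the expression data with the same index.

```r
## Made-up metadata; only the zymodeme labels mirror those in the worksheet.
meta <- data.frame(
  sample = c("s1", "s2", "s3", "s4", "s5"),
  zymodeme = c("z2.1", "z2.2", "z2.3", "z2.4", "z2.2"),
  stringsAsFactors = FALSE)
## The two primary strains of interest.
primary <- c("z2.2", "z2.3")
## Logical index of the samples to keep.
keep_idx <- meta[["zymodeme"]] %in% primary
two_strain_meta <- meta[keep_idx, ]
table(two_strain_meta[["zymodeme"]])
```

Using `%in%` rather than chained `==` comparisons keeps the index free of NA values when a zymodeme is missing.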
Clinical outcome is by far the most problematic comparison in this data, but here is the recategorization of the data using it:
lp_cf <- set_expt_conditions(lp_expt, fact = "clinicalcategorical",
colors = color_choices[["cf"]]) %>%
set_expt_batches(fact = sus_categorical_current)
## The numbers of samples by condition are:
##
## cure fail unknown
## 39 34 18
## Warning in set_expt_colors(new_expt, colors = colors): Colors for the following
## categories are not being used: notapplicable.
## The number of samples by batch are:
##
## resistant sensitive
## 45 46
##
## cure fail unknown
## 39 34 18
data_structures <- c(data_structures, "lp_cf")
save(list = "lp_cf",
file = glue("rda/tmrc2_lp_cf-v{ver}.rda"))
lp_cf_known <- subset_expt(lp_cf, subset = "condition!='unknown'")
## The samples excluded are: TMRC20005, TMRC20066, TMRC20037, TMRC20038, TMRC20077, TMRC20074, TMRC20063, TMRC20053, TMRC20052, TMRC20064, TMRC20075, TMRC20051, TMRC20050, TMRC20049, TMRC20062, TMRC20110, TMRC20080, TMRC20054.
## subset_expt(): There were 91, now there are 73 samples.
Use the factorized version of susceptibility to categorize the samples by the historical data.
lp_susceptibility_historical <- set_expt_conditions(
lp_expt, fact = "sus_category_historical", colors = color_choices[["susceptibility"]]) %>%
set_expt_batches(fact = "clinicalcategorical")
## The numbers of samples by condition are:
##
## ambiguous resistant sensitive unknown
## 5 12 29 45
## The number of samples by batch are:
##
## cure fail unknown
## 39 34 18
Use the factorized version of susceptibility to categorize the samples by the current data.
This will likely be our canonical susceptibility dataset, so I will remove the suffix and just call it ‘lp_susceptibility’.
lp_susceptibility <- set_expt_conditions(
lp_expt, fact = "sus_category_current", colors = color_choices[["susceptibility"]]) %>%
set_expt_batches(fact = "clinicalcategorical")
## The numbers of samples by condition are:
##
## resistant sensitive
## 45 46
## Warning in set_expt_colors(new_expt, colors = colors): Colors for the following
## categories are not being used: ambiguousunknown.
## The number of samples by batch are:
##
## cure fail unknown
## 39 34 18
I think this is redundant with a previous block, but I am leaving it until I am certain that it is not required in a following document.
Note: IMPORTANT This is the set Maria Adelaida prefers to use.
## The samples excluded are: TMRC20057, TMRC20056, TMRC20093, TMRC20047, TMRC20045, TMRC20108, TMRC20091, TMRC20084, TMRC20103.
## subset_expt(): There were 91, now there are 82 samples.
The following section will create some initial data structures of the observed variants in the parasite samples. This will include some of our 2016 samples for some classification queries.
I changed and improved the mapping and variant detection methods from what we used for the 2016 data. So some small changes will be required to merge them.
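The kind of small change needed to merge the two generations of data is mostly a matter of making the gene IDs agree. Here is a minimal sketch of that cleanup, assuming the older IDs carry an `exon_` prefix and a `.1` or `-1` suffix; the IDs themselves are made up for illustration.

```r
## Hypothetical IDs in the style of the older count tables.
old_ids <- c("exon_LPAL13_000005000.1", "exon_LPAL13_000005100-1", "LPAL13_000005200")
## Strip the leading 'exon_' and the trailing '.1' or '-1'.
clean_ids <- gsub(pattern = "^exon_", replacement = "", x = old_ids)
clean_ids <- gsub(pattern = "\\.1$", replacement = "", x = clean_ids)
clean_ids <- gsub(pattern = "\\-1$", replacement = "", x = clean_ids)
## Hypothetical IDs in the style of the newer data.
new_ids <- c("LPAL13_000005000", "LPAL13_000005100", "LPAL13_000005300")
## Only the IDs present in both sets can be merged directly.
shared <- intersect(clean_ids, new_ids)
clean_ids
shared
```

Anchoring each pattern with `^` or `$` avoids touching a `.1` or `-1` that happens to occur in the middle of an ID.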
lp_previous <- create_expt("sample_sheets/tmrc2_samples_20191203.xlsx",
file_column = "tophat2file",
savefile = glue("rda/lp_previous-v{ver}.rda"))
tt <- lp_previous$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\-1$", replacement = "", x = rownames(tt))
lp_previous$expressionset <- tt
rm(tt)
data_structures <- c(data_structures, "lp_previous")
The count_expt_snps() function uses our expressionset data and a metadata column in order to extract the mpileup or freebayes-based variant calls and create matrices of the likelihood that each position-per-sample is in fact a variant.
There is an important caveat here which changed on 202301: I was using the PAIRED tag, which is only set for, unsurprisingly, paired-end samples. A couple of samples are not paired and so were failing silently. The QA tag looks more appropriate and should work across both types. To find out, I am setting it here and will check whether the results make more sense for my test samples (TMRC2001, TMRC2005, TMRC2007).
## The next line drops the samples which are missing the SNP pipeline.
lp_snp <- subset_expt(lp_expt, subset = "!is.na(pData(lp_expt)[['freebayessummary']])")
## The samples excluded are: .
## subset_expt(): There were 91, now there are 91 samples.
new_snps <- count_expt_snps(lp_snp, annot_column = "freebayessummary", snp_column = "QA",
reader = "readr")
## Using the snp column: QA from the sample annotations.
## New names:
## * `DP` -> `DP...3`
## * `RO` -> `RO...8`
## * `AO` -> `AO...9`
## * `QR` -> `QR...12`
## * `QA` -> `QA...13`
## * `DP` -> `DP...42`
## * `RO` -> `RO...43`
## * `QR` -> `QR...44`
## * `AO` -> `AO...45`
## * `QA` -> `QA...46`
## Let's see if we get numbers which make sense.
summary(exprs(new_snps)[["tmrc20001"]]) ## My weirdo sample
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 22.4 0.0 2217.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 102 0 247568
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1101 0 1708458
## Now that we are reasonably confident that things make more sense, let's save and move on...
data_structures <- c(data_structures, "new_snps", "lp_snp")
tt <- normalize_expt(new_snps, transform = "log2")
## transform_counts: Found 80536178 values equal to 0, adding 1 to the matrix.
Now let us pull in the 2016 data.
old_snps <- count_expt_snps(lp_previous, annot_column = "bcftable", snp_column = 2)
data_structures <- c(data_structures, "old_snps")
save(list = "lp_snp",
file = glue("rda/lp_snp-v{ver}.rda"))
data_structures <- c(data_structures, "lp_snp")
save(list = "new_snps",
file = glue("rda/new_snps-v{ver}.rda"))
data_structures <- c(data_structures, "new_snps")
save(list = "old_snps",
file = glue("rda/old_snps-v{ver}.rda"))
data_structures <- c(data_structures, "old_snps")
nonzero_snps <- exprs(new_snps) != 0
colSums(nonzero_snps)
As far as I can tell, freebayes and mpileup are reasonably similar in their sensitivity/specificity; so combining the two datasets like this is expected to work with minimal problems. The most likely problem is that my mpileup-based pipeline is unable to handle indels.
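To make the idea of combining two variant tables concrete, here is a minimal base R sketch. It is not how hpgltools combines them: the position names, sample names, and values are made up, and I am assuming that a position absent from one table should be treated as 0 (no variant observed) in those samples.

```r
## Made-up position-by-sample matrices standing in for the new and old variant calls.
new_mtrx <- matrix(c(10, 0, 5, 0, 7, 0), nrow = 3,
                   dimnames = list(c("chr1_100_A_G", "chr1_250_C_T", "chr2_40_G_A"),
                                   c("new_a", "new_b")))
old_mtrx <- matrix(c(3, 8), nrow = 2,
                   dimnames = list(c("chr1_250_C_T", "chr3_9_T_C"), "old_a"))
## Outer join on the position names so that no position is lost.
both <- merge(as.data.frame(new_mtrx), as.data.frame(old_mtrx),
              by = "row.names", all = TRUE)
rownames(both) <- both[["Row.names"]]
both[["Row.names"]] <- NULL
## Assumption: positions missing from one dataset count as 0 there.
both[is.na(both)] <- 0
both <- as.matrix(both)
## Number of positions with a non-zero score in each sample.
colSums(both != 0)
```

An inner join (`all = FALSE`) would instead keep only the positions called in both datasets, which is stricter but discards anything one of the callers missed.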
I am taking a heatmap from our variant data and manually identifying sample groups.
All of the above focused entirely on the parasite samples; now let us pull up the macrophage-infected samples. This will comprise two datasets, one human and one parasite.
The metadata for the macrophage samples contains a couple of columns for mapped human and parasite reads. We will therefore use them separately to create two expressionsets, one for each species.
## Using mart: ENSEMBL_MART_ENSEMBL from host: apr2020.archive.ensembl.org.
## Successfully connected to the hsapiens_gene_ensembl database.
## Finished downloading ensembl gene annotations.
## Finished downloading ensembl structure annotations.
## symbol columns is null, pattern matching 'symbol'.
## Including symbols, there are 68503 vs the 249740 gene annotations.
## Not dropping haplotype chromosome annotations, set drop_haplotypes = TRUE if this is bad.
## Saving annotations to hsapiens_biomart_annotations.rda.
## Finished save().
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["transcript_version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique = TRUE)
rownames(hs_annot) <- paste0("gene:", rownames(hs_annot))
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]
sanitize_columns <- c("drug", "macrophagetreatment", "macrophagezymodeme")
macr_annot <- hs_annot
rownames(macr_annot) <- gsub(x = rownames(macr_annot),
pattern = "^gene:",
replacement = "")
hs_macrophage <- create_expt(
macrophage_sheet,
gene_info = macr_annot,
file_column = "hg38100hisatfile") %>%
set_expt_conditions(fact = "macrophagetreatment") %>%
set_expt_batches(fact = "macrophagezymodeme") %>%
sanitize_expt_pData(columns = sanitize_columns) %>%
subset_expt(nonzero = 12000)
## Reading the sample metadata.
## Did not find the column: sampleid.
## Setting the ID column to the first column.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 69 rows(samples) and 80 columns(metadata fields).
## Matched 21481 annotations and counts.
## Bringing together the count matrix and gene information.
## Some annotations were lost in merging, setting them to 'undefined'.
## Saving the expressionset to 'expt.rda'.
## The final expressionset has 21481 features and 69 samples.
## The numbers of samples by condition are:
##
## inf inf_sb uninf uninf_sb
## 30 29 5 5
## The number of samples by batch are:
##
## none z2.2 z2.3
## 10 30 29
## The samples (and read coverage) removed when filtering 12000 non-zero genes are:
## TMRC30162
## 10208
## by number of genes.
## subset_expt(): There were 69, now there are 68 samples.
fixed_genenames <- gsub(x = rownames(exprs(hs_macrophage)), pattern = "^gene:",
replacement = "")
hs_macrophage <- set_expt_genenames(hs_macrophage, ids = fixed_genenames)
table(pData(hs_macrophage)$condition)
##
## inf inf_sb uninf uninf_sb
## 29 29 5 5
## The following 3 lines were copy/pasted to datastructures and should be removed soon.
nostrain <- is.na(pData(hs_macrophage)[["strainid"]])
pData(hs_macrophage)[nostrain, "strainid"] <- "none"
pData(hs_macrophage)[["strain_zymo"]] <- paste0("s", pData(hs_macrophage)[["strainid"]],
"_", pData(hs_macrophage)[["macrophagezymodeme"]])
uninfected <- pData(hs_macrophage)[["strain_zymo"]] == "snone_none"
pData(hs_macrophage)[uninfected, "strain_zymo"] <- "uninfected"
data_structures <- c(data_structures, "hs_macrophage")
Finally, split off the U937 samples.
## The samples excluded are: TMRC30051, TMRC30057, TMRC30059, TMRC30060, TMRC30061, TMRC30062, TMRC30063, TMRC30064, TMRC30065, TMRC30066, TMRC30067, TMRC30069, TMRC30117, TMRC30243, TMRC30244, TMRC30245, TMRC30246, TMRC30247, TMRC30248, TMRC30249, TMRC30250, TMRC30251, TMRC30252, TMRC30266, TMRC30267, TMRC30268, TMRC30286, TMRC30326, TMRC30316, TMRC30317, TMRC30322, TMRC30323, TMRC30328, TMRC30318, TMRC30319, TMRC30324, TMRC30325, TMRC30320, TMRC30321, TMRC30327, TMRC30312, TMRC30297, TMRC30298, TMRC30299, TMRC30300, TMRC30295, TMRC30296, TMRC30303, TMRC30304, TMRC30301, TMRC30302, TMRC30314, TMRC30315, TMRC30313.
## subset_expt(): There were 68, now there are 14 samples.
In the previous block, we used a new invocation of ensembl-derived annotation data; this time we can just use our existing parasite gene annotations.
lp_macrophage <- create_expt(macrophage_sheet,
file_column = "lpanamensisv36hisatfile",
gene_info = all_lp_annot,
savefile = glue("rda/lp_macrophage-v{ver}.rda"),
annotation = "org.Lpanamensis.MHOMCOL81L13.v46.eg.db") %>%
set_expt_conditions(fact = "macrophagezymodeme") %>%
set_expt_batches(fact = "macrophagetreatment")
## Reading the sample metadata.
## Did not find the column: sampleid.
## Setting the ID column to the first column.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## The sample definitions comprises: 69 rows(samples) and 80 columns(metadata fields).
## Warning in create_expt(macrophage_sheet, file_column =
## "lpanamensisv36hisatfile", : Some samples were removed when cross referencing
## the samples against the count data.
## Matched 8778 annotations and counts.
## Bringing together the count matrix and gene information.
## The final expressionset has 8778 features and 66 samples.
## The numbers of samples by condition are:
##
## none z2.2 z2.3
## 8 29 29
## The number of samples by batch are:
##
## inf inf_sb uninf uninf_sb
## 29 29 4 4
unfilt_written <- write_expt(
lp_macrophage,
excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_unfiltered-v{ver}.xlsx"))
## Writing the first sheet, containing a legend and some summary data.
## The following samples have less than 5705.7 genes.
## [1] "TMRC30066" "TMRC30117" "TMRC30244" "TMRC30246" "TMRC30249" "TMRC30266"
## [7] "TMRC30268" "TMRC30326" "TMRC30323" "TMRC30319" "TMRC30325" "TMRC30327"
## [13] "TMRC30312" "TMRC30300" "TMRC30304" "TMRC30302" "TMRC30313" "TMRC30309"
## [19] "TMRC30292" "TMRC30331" "TMRC30332" "TMRC30330"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 175550 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 175550 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## This expressionset does not support lmer with condition+batch
## Error in density.default(x, adjust = adj) : 'x' contains missing values
## Error in density.default(x, adjust = adj) : 'x' contains missing values
## `geom_smooth()` using formula = 'y ~ x'
lp_macrophage_filt <- subset_expt(lp_macrophage, nonzero = 2500) %>%
semantic_expt_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
semantic_column = "annot_gene_product")
## The samples (and read coverage) removed when filtering 2500 non-zero genes are:
## TMRC30066 TMRC30117 TMRC30244 TMRC30246 TMRC30266 TMRC30268 TMRC30326 TMRC30323
## 3080 1147 1662 2834 822 3444 375 84
## TMRC30319 TMRC30325 TMRC30327 TMRC30312 TMRC30304 TMRC30313 TMRC30309 TMRC30330
## 374 356 129 76 289 96 188 181
## by number of genes.
## subset_expt(): There were 66, now there are 50 samples.
## semantic_expt_filter(): Removed 68 genes.
data_structures <- c(data_structures, "lp_macrophage", "lp_macrophage_filt")
filt_written <- write_expt(lp_macrophage_filt,
excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_filtered-v{ver}.xlsx"))
## Writing the first sheet, containing a legend and some summary data.
## The following samples have less than 5661.5 genes.
## [1] "TMRC30249" "TMRC30300" "TMRC30302" "TMRC30292" "TMRC30331" "TMRC30332"
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 44583 entries are 0. We are on a log scale, adding 1 to the data.
##
## Changed 44583 zero count features.
##
## Naively calculating coefficient of variation/dispersion with respect to condition.
##
## Finished calculating dispersion estimates.
##
## `geom_smooth()` using formula = 'y ~ x'
## Error in density.default(x, adjust = adj) : 'x' contains missing values
## Error in density.default(x, adjust = adj) : 'x' contains missing values
## `geom_smooth()` using formula = 'y ~ x'
lp_macrophage <- lp_macrophage_filt
lp_macrophage_nosb <- subset_expt(lp_macrophage, subset = "batch!='inf_sb'")
## The samples excluded are: TMRC30051, TMRC30062, TMRC30065, TMRC30069, TMRC30248, TMRC30249, TMRC30251, TMRC30252, TMRC30317, TMRC30321, TMRC30298, TMRC30300, TMRC30296, TMRC30302, TMRC30315, TMRC30294, TMRC30292, TMRC30308, TMRC30331, TMRC30332, TMRC30306.
## subset_expt(): There were 50, now there are 29 samples.
lp_nosb_write <- write_expt(
lp_macrophage_nosb,
excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_nosb_reads-v{ver}.xlsx"))
## Writing the first sheet, containing a legend and some summary data.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## 6396 entries are 0. We are on a log scale, adding 1 to the data.
## Changed 6396 zero count features.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## `geom_smooth()` using formula = 'y ~ x'
## The expressionset has a minimal or missing set of conditions/batches.
## `geom_smooth()` using formula = 'y ~ x'
lp_meta <- pData(lp_macrophage)
lp_meta[["slvsreads_log"]] <- log10(lp_meta[["slvsreads"]])
inf_values <- is.infinite(lp_meta[["slvsreads_log"]])
lp_meta[inf_values, "slvsreads_log"] <- -10
color_vector <- as.character(color_choices[["strain"]])
names(color_vector) <- names(color_choices[["strain"]])
color_vector <- color_vector[c("z2.2", "z2.3", "unknown")]
names(color_vector) <- c("z2.2", "z2.3", "none")
sl_violin <- ggplot(lp_meta,
aes(x = .data[["condition"]], y = .data[["slvsreads_log"]],
fill = .data[["condition"]])) +
geom_violin() +
geom_point() +
scale_fill_manual(values = color_vector)
sl_violin
found_idx <- data_structures %in% ls()
if (sum(!found_idx) > 0) {
not_found <- data_structures[!found_idx]
warning("Some datastructures were not generated: ", toString(not_found), ".")
data_structures <- data_structures[found_idx]
}
save(list = data_structures, file = glue("rda/tmrc2_data_structures-v{ver}.rda"))
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
locale: C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: ruv(v.0.9.7.1), BiocParallel(v.1.36.0), variancePartition(v.1.32.5), org.Lpanamensis.MHOMCOL81L13.v68.eg.db(v.2024.06), AnnotationDbi(v.1.64.1), futile.logger(v.1.4.3), EuPathDB(v.1.6.0), GenomeInfoDbData(v.1.2.11), dplyr(v.1.1.4), Heatplus(v.3.10.0), ggplot2(v.3.5.1), hpgltools(v.1.0), Matrix(v.1.6-5), glue(v.1.7.0), SummarizedExperiment(v.1.32.0), GenomicRanges(v.1.54.1), GenomeInfoDb(v.1.38.8), IRanges(v.2.36.0), S4Vectors(v.0.40.2), MatrixGenerics(v.1.14.0), matrixStats(v.1.3.0), Biobase(v.2.62.0) and BiocGenerics(v.0.48.1)
loaded via a namespace (and not attached): fs(v.1.6.4), bitops(v.1.0-7), insight(v.0.20.0), doParallel(v.1.0.17), HDO.db(v.0.99.1), httr(v.1.4.7), RColorBrewer(v.1.1-3), numDeriv(v.2016.8-1.1), backports(v.1.5.0), tools(v.4.3.3), utf8(v.1.2.4), R6(v.2.5.1), statsExpressions(v.1.5.4), lazyeval(v.0.2.2), mgcv(v.1.9-1), withr(v.3.0.0), gridExtra(v.2.3), prettyunits(v.1.2.0), cli(v.3.6.2), formatR(v.1.14), AnnotationHubData(v.1.35.0), prismatic(v.1.1.2), labeling(v.0.4.3), sass(v.0.4.9), mvtnorm(v.1.2-5), genefilter(v.1.84.0), readr(v.2.1.5), pbapply(v.1.7-2), Rsamtools(v.2.18.0), yulab.utils(v.0.1.4), DOSE(v.3.28.2), stringdist(v.0.9.12), AnnotationForge(v.1.44.0), limma(v.3.58.1), RSQLite(v.2.3.7), generics(v.0.1.3), BiocIO(v.1.12.0), gtools(v.3.9.5), vroom(v.1.6.5), zip(v.2.3.1), GO.db(v.3.18.0), fansi(v.1.0.6), abind(v.1.4-5), lifecycle(v.1.0.4), yaml(v.2.3.8), edgeR(v.4.0.16), gplots(v.3.1.3.1), biocViews(v.1.70.0), qvalue(v.2.34.0), SparseArray(v.1.2.4), BiocFileCache(v.2.10.2), Rtsne(v.0.17), paletteer(v.1.6.0), grid(v.4.3.3), blob(v.1.2.4), promises(v.1.3.0), crayon(v.1.5.2), lattice(v.0.22-6), cowplot(v.1.1.3), GenomicFeatures(v.1.54.4), annotate(v.1.80.0), KEGGREST(v.1.42.0), zeallot(v.0.1.0), pillar(v.1.9.0), knitr(v.1.47), varhandle(v.2.0.6), fgsea(v.1.28.0), rjson(v.0.2.21), boot(v.1.3-30), corpcor(v.1.6.10), codetools(v.0.2-20), fastmatch(v.1.1-4), data.table(v.1.15.4), vctrs(v.0.6.5), png(v.0.1-8), Rdpack(v.2.6), testthat(v.3.2.1.1), gtable(v.0.3.5), rematch2(v.2.1.2), datawizard(v.0.11.0), cachem(v.1.1.0), xfun(v.0.44), openxlsx(v.4.2.5.2), rbibutils(v.2.2.16), S4Arrays(v.1.2.1), mime(v.0.12), correlation(v.0.8.4), coda(v.0.19-4.1), survival(v.3.7-0), iterators(v.1.0.14), statmod(v.1.5.0), directlabels(v.2024.1.21), interactiveDisplayBase(v.1.40.0), nlme(v.3.1-165), pbkrtest(v.0.5.2), bit64(v.4.0.5), EnvStats(v.2.8.1), progress(v.1.2.3), filelock(v.1.0.3), rprojroot(v.2.0.4), bslib(v.0.7.0), KernSmooth(v.2.23-24), colorspace(v.2.1-0), DBI(v.1.2.3), 
tidyselect(v.1.2.1), bit(v.4.0.5), compiler(v.4.3.3), curl(v.5.2.1), rvest(v.1.0.4), httr2(v.1.0.1), graph(v.1.80.0), BiocCheck(v.1.38.2), xml2(v.1.3.6), desc(v.1.4.3), DelayedArray(v.0.28.0), plotly(v.4.10.4), bayestestR(v.0.13.2), rtracklayer(v.1.62.0), scales(v.1.3.0), caTools(v.1.18.2), remaCor(v.0.0.18), quadprog(v.1.5-8), RBGL(v.1.78.0), rappdirs(v.0.3.3), stringr(v.1.5.1), digest(v.0.6.35), ggsankey(v.0.0.99999), minqa(v.1.2.7), rmarkdown(v.2.27), aod(v.1.3.3), XVector(v.0.42.0), RhpcBLASctl(v.0.23-42), htmltools(v.0.5.8.1), pkgconfig(v.2.0.3), lme4(v.1.1-35.3), highr(v.0.11), dbplyr(v.2.5.0), fastmap(v.1.2.0), rlang(v.1.1.4), htmlwidgets(v.1.6.4), shiny(v.1.8.1.1), farver(v.2.1.2), jquerylib(v.0.1.4), jsonlite(v.1.8.8), GOSemSim(v.2.28.1), RCurl(v.1.98-1.14), magrittr(v.2.0.3), patchwork(v.1.2.0), parameters(v.0.21.7), munsell(v.0.5.1), Rcpp(v.1.0.12), stringi(v.1.8.4), brio(v.1.1.5), zlibbioc(v.1.48.2), MASS(v.7.3-60), AnnotationHub(v.3.10.1), plyr(v.1.8.9), parallel(v.4.3.3), ggrepel(v.0.9.5), Biostrings(v.2.70.3), splines(v.4.3.3), pander(v.0.6.5), hms(v.1.1.3), locfit(v.1.5-9.9), RUnit(v.0.4.33), fastcluster(v.1.2.6), effectsize(v.0.8.8), reshape2(v.1.4.4), biomaRt(v.2.58.2), pkgload(v.1.3.4), futile.options(v.1.0.1), BiocVersion(v.3.18.1), XML(v.3.99-0.16.1), evaluate(v.0.23), lambda.r(v.1.2.4), BiocManager(v.1.30.23), nloptr(v.2.0.3), tzdb(v.0.4.0), foreach(v.1.5.2), httpuv(v.1.6.15), MatrixModels(v.0.5-3), BayesFactor(v.0.9.12-4.7), tidyr(v.1.3.1), purrr(v.1.0.2), BiocBaseUtils(v.1.4.0), broom(v.1.0.6), xtable(v.1.8-4), restfulr(v.0.0.15), fANCOVA(v.0.6-1), later(v.1.3.2), viridisLite(v.0.4.2), OrganismDbi(v.1.44.0), tibble(v.3.2.1), ggstatsplot(v.0.12.3), lmerTest(v.3.1-3), memoise(v.2.0.1), GenomicAlignments(v.1.38.2), sva(v.3.50.0) and GSEABase(v.1.64.0)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset
## This is hpgltools commit:
## Saving to 01datasets.rda.xz