Did the stuff on this morning’s TODO which came out of this morning’s meeting: do a PCA without the oddball strains (already done in the worksheet), highlight reference strains, and add L.major IDs and Descriptions (done by appending a collapsed version of the ortholog data to the all_lp_annot data).
Fixed human IDs for the macrophage data.
Changed input metadata sheets: primarily because I only remembered yesterday to finish the SL search for samples >TMRC20095. They are running now and will be added momentarily (I will have to redownload the sheet).
Setting up to make a hclust/phylogenetic tree of strains, use these are reference: 2168(2.3), 2272(2.2), for other 2.x choose arbitrarily (lower numbers are better).
Added another sanitize columns call for Antimony vs. antimony and None vs. none in the TMRC2 macrophage samples.
This document is intended to create the data structures used to evaluate our TMRC2 samples. In some cases, this includes only those samples starting in 2019; in other instances I am including our previous (2015-2016) samples.
In all cases the processing performed was:
I am thinking that this meeting will bring Maria Adelaida fully back into the analyses of the parasite data, and therefore may focus primarily on the goals rather than the analyses?
In a couple of important ways the TMRC2 data is much more complex than the TMRC3:
Our shared online sample sheet is nearly static at the time of this writing (202209), I expect at this point the only likely updates will be to annotate some strains as more or less susceptible to drug treatment.
sample_sheet <- "sample_sheets/ClinicalStrains_TMRC2.xlsx"
macrophage_sheet <- "sample_sheets/tmrc2_macrophage_samples.xlsx"The following block provides an example invocation of how I automatically extract things like percent reads mapped/trimmed/etc from the logs produced by trimomatic/cutadapt/hisat/salmon/etc. The caveat is that this container only has a small portion of the material available in the main working tree, as a result the new columns added to the sample sheet are relatively sparse compared to what I get on my computer.
In addition, because these samples have gone through ~ 3 different versions of my pipeline, and the code which extracts the numbers explicitly assumes only the most recent version (because it is the best!), it does not get out the data for all the samples.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## Writing new metadata to: sample_sheets/ClinicalStrains_TMRC2_modified.xlsx
Everything which follows depends on the Existing TriTrypDB annotations revision 46, circa 2019. The following block loads a database of these annotations and turns it into a matrix where the rows are genes and columns are all the annotation types provided by TriTrypDB.
The same database was used to create a matrix of orthologous genes between L.panamensis and all of the other species in the TriTrypDB.
The same database of annotations also provides mappings to the set of annotated GO categories for the L.panamensis genome along with gene lengths.
The following block assumes one has access to tritrypdb.org, which is not currently guaranteed. Thus I bundled a pre-generated copy of the genome, Txdb, and annotations into this container image.
Here is the code I used to do so.
## meta <- download_eupath_metadata(webservice = "tritrypdb", eu_version = "v46")
eu_meta <- download_eupath_metadata(webservice = "tritrypdb")
panamensis_entry <- get_eupath_entry("MHOM", metadata = eu_meta[["valid"]])
panamensis_db <- make_eupath_orgdb(panamensis_entry)
panamensis_pkg <- panamensis_db[["pkgname"]]
package_name <- panamensis_db[["pkgname"]]
if (is.null(panamensis_pkg)) {
panamensis_pkg <- panamensis_db[["orgdb_name"]]
package_name <- panamensis_pkg
}
tt <- library(panamensis_pkg, character.only = TRUE)
panamensis_pkg <- get0(panamensis_pkg)
all_fields <- columns(panamensis_pkg)
all_lp_annot <- sm(load_orgdb_annotations(panamensis_pkg, keytype = "gid"))
lp_go <- load_orgdb_go(package_name)
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- all_lp_annot[, c("gid", "cds_length")]
colnames(lp_lengths) <- c("ID", "length")
all_lp_annot[["annot_gene_product"]] <- tolower(all_lp_annot[["annot_gene_product"]])
orthos <- sm(extract_eupath_orthologs(db = panamensis_pkg))
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot", "meta")The sqlite data provided by EuPathDB includes tables of GO, orthologs, and a decent swath of comparisons against various annotation databases.
all_installed <- rownames(installed.packages())
candidates <- grepl(pattern = "^org.Lpanamensis.MHOM.*v68.eg.db", x = all_installed)
orgdb_pkg_name <- all_installed[candidates]
tt <- library(orgdb_pkg_name, character.only = TRUE)## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:hpgltools':
##
## notes
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:glue':
##
## trim
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
##
panamensis_pkg <- get0(orgdb_pkg_name)
all_fields <- columns(panamensis_pkg)
annot_fields <- grepl(x = all_fields, pattern = "^ANNOT")
annotation_columns <- all_fields[annot_fields]
all_lp_annot <- sm(load_orgdb_annotations(panamensis_pkg,
fields = annotation_columns, keytype = "gid"))
lp_annotations <- all_lp_annot[["genes"]]
colnames(lp_annotations) <- make.names(gsub(x = colnames(lp_annotations),
pattern = "^annot_", replace = ""), unique = TRUE)
lp_go <- load_orgdb_go(panamensis_pkg)## The chosen keytype was not available. Using 'GID'.
## This is an orgdb, good.
## 'select()' returned 1:many mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
lp_go <- lp_go[, c("GID", "GO")]
lp_lengths <- lp_annotations[, c("gid", "cds_length")]
colnames(lp_lengths) <- c("ID", "length")
lp_annotations[["gene_product"]] <- tolower(lp_annotations[["gene_product"]])
data_structures <- c(data_structures, "lp_lengths", "lp_go", "lp_annotations", "meta")Recently there was a request to include the Leishmania major gene IDs and descriptions. Thus I will extract them along with the orthologs and append that to the annotations used.
Having spent the time to run the following code, I realized that the orthologs data structure above actually already has the gene IDs and descriptions.
Thus I will leave my query in place to extract the major annotations, but follow it up with a collapse of the major orthologs and appending of that to the panamensis annotations.
orgdb <- "org.Lmajor.Friedlin.v49.eg.db"
tt <- sm(library(orgdb, character.only = TRUE))
major_db <- org.Lmajor.Friedlin.v49.eg.db
all_fields <- columns(pan_db)
all_lm_annot <- sm(load_orgdb_annotations(
major_db,
keytype = "gid",
fields = c("gene_entrez_id", "annot_gene_name",
"annot_strand", "annot_chromosome", "annot_cds_length",
"annot_gene_product")))$genes
wanted_orthos_idx <- orthos[["ORTHOLOGS_SPECIES"]] == "Leishmania major strain Friedlin"
sum(wanted_orthos_idx)
wanted_orthos <- orthos[wanted_orthos_idx, ]
wanted_orthos <- wanted_orthos[, c("GID", "ORTHOLOGS_ID", "ORTHOLOGS_NAME")]
collapsed_orthos <- wanted_orthos %>%
group_by(GID) %>%
summarise(collapsed_id = stringr::str_c(ORTHOLOGS_ID, collapse = " ; "),
collapsed_name = stringr::str_c(ORTHOLOGS_NAME, collapse = " ; "))
all_lp_annot <- merge(all_lp_annot, collapsed_orthos, by.x = "row.names",
by.y = "GID", all.x = TRUE)
rownames(all_lp_annot) <- all_lp_annot[["Row.names"]]
all_lp_annot[["Row.names"]] <- NULL
data_structures <- c(data_structures, "lp_lengths", "lp_go", "all_lp_annot")The following block loads the full genome sequence for panamensis. We may use this later to attempt to estimate PCR primers to discern strains.
I am not sure how to increase the number of open files in a container, as a result this does not work.
## testing_panamensis <- make_eupath_bsgenome(entry = panamensis_entry, eu_version = "v46")
pkg_candidates <- grepl(x = all_installed, pattern = "BSGenome\\.Leishmania\\.panamensis.*")
pkg_name <- all_installed[pkg_candidates][1]
pkg_name## [1] "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v68"
## Loading required package: Seqinfo
## Loading required package: BSgenome
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: BiocIO
## Loading required package: rtracklayer
The process of sample estimation takes two primary inputs:
An expressionSet(or summarizedExperiment) is a data structure used in R to examine RNASeq data. It is comprised of annotations, metadata, and expression data. In the case of our processing pipeline, the location of the expression data is provided by the filenames in the metadata.
The following samples are much lower coverage:
There is a set of strains which acquired resistance in vitro. These are included in the dataset, but there are not likely enough of them to query that question explicitly.
The following list contains the colors we have chosen to use when plotting the various ways of discerning the data.
color_choices <- list(
"strain" = list(
## "z1.0" = "#333333", ## Changed this to 'braz' to make it easier to find them.
"z2.0" = "#555555",
"z3.0" = "#777777",
"z2.1" = "#874400",
"z2.2" = "#0000cc",
"z2.3" = "#cc0000",
"z2.4" = "#df7000",
"z3.2" = "#888888",
"z1.0" = "#cc00cc",
"z1.5" = "#cc00cc",
"b2904" = "#cc00cc",
"unknown" = "#cbcbcb"),
## "null" = "#000000"),
"zymo" = list(
"z22" = "#0000cc",
"z23" = "#cc0000"),
"cf" = list(
"cure" = "#006f00",
"fail" = "#9dffa0",
"unknown" = "#cbcbcb",
"notapplicable" = "#000000"),
"susceptibility" = list(
"resistant" = "#8563a7",
"sensitive" = "#8d0000",
"ambiguous" = "#cbcbcb",
"unknown" = "#555555"))
data_structures <- c(data_structures, "color_choices")The data structure ‘lp_se’ contains the data for all samples which have hisat2 count tables, and which pass a few initial quality tests (e.g. they must have more than 8550 genes with >0 counts and >5e6 reads which mapped to a gene); genes which are annotated with a few key redundant categories (leishmanolysin for example) are also culled.
There are a few metadata columns which we really want to make certain are standardized.
Note: I changed this to print both the number of reads and genes for removed samples.
202510: Commenting out the semantic filter and will either not apply it or move it to somewhere after the visualization of the data.
sanitize_columns <- c("passagenumber", "clinicalresponse", "clinicalcategorical",
"zymodemecategorical", "included")
lp_se <- create_se(sample_sheet,
gene_info = lp_annotations,
annotation_name = orgdb_pkg_name,
savefile = glue("rda/tmrc2_lp_se_all_raw-v{ver}.rda"),
id_column = "hpglidentifier",
file_column = "lpanamensisv36hisatfile") %>%
set_conditions(fact = "zymodemecategorical", colors = color_choices[["strain"]]) %>%
## semantic_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
## semantic_column = "annot_gene_product") %>%
sanitize_metadata(columns = sanitize_columns) %>%
subset_se(subset = "included=='yes'") %>%
set_factors(columns = sanitize_columns, class = "factor")## Reading the sample metadata.
## Did not find the column: hpglidentifier.
## Setting the ID column to the first column.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 93 rows(samples) and 74 columns(metadata fields).
## Matched 8778 annotations and counts.
## The final summarized experiment has 8778 rows and 74 columns.
## The numbers of samples by condition are:
##
## z2.1 z2.2 z2.3 z2.4
## 7 43 41 2
## Warning in set_se_colors(new_se, colors = colors): Colors for the following
## categories are not being used: z2.0, z3.0, z3.2, z1.0, z1.5, b2904, unknown.
## Recasting the data.frame to DataFrame.
## rownames sampleid tubelabelorigin included
## Length:93 Length:93 Length:93 yes:93
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## sourcelab expperson pathogen pathogenstrain
## Length:93 Length:93 Length:93 Length:93
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## host parasitestage phase passagenumber
## Length:93 Length:93 Length:93 2:89
## Class :character Class :character Class :character 3: 3
## Mode :character Mode :character Mode :character 4: 1
##
##
##
##
## parasitenumber clinicalresponse clinicalcategorical zymodemeanalysis
## Length:93 cure :41 cure :41 Length:93
## Class :character failure:34 fail :34 Class :character
## Mode :character nd :18 unknown:18 Mode :character
##
##
##
##
## zymodemecategorical phenotypiccharacteristics
## z21: 7 Min. :2.10
## z22:43 1st Qu.:2.20
## z23:41 Median :2.20
## z24: 2 Mean :2.24
## 3rd Qu.:2.30
## Max. :2.40
##
## susceptibilityinfectionreduction32ugmlsbvhistoricaldata
## Length:93
## Class :character
## Mode :character
##
##
##
##
## susceptibilityinfectionreduction32ugmlsbvcurrentdata
## Length:93
## Class :character
## Mode :character
##
##
##
##
## qualitativeclassificationofdrugsusceptibility rnapreservation
## Length:93 Length:93
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## rnaextractiondate rnaqctesteddate rnangul rnaqcpassed
## Min. :20181024 Min. :20181024 Length:93 Length:93
## 1st Qu.:20200918 1st Qu.:20200921 Class :character Class :character
## Median :20210211 Median :20210217 Mode :character Mode :character
## Mean :20202867 Mean :20204528
## 3rd Qu.:20210913 3rd Qu.:20210914
## Max. :20211112 Max. :20211116
## NA's :1
## rnangul1 x260280 x260230 rnavolumeul
## Length:93 Length:93 Length:93 Min. :24.1
## Class :character Class :character Class :character 1st Qu.:30.0
## Mode :character Mode :character Mode :character Median :30.0
## Mean :29.8
## 3rd Qu.:30.0
## Max. :30.0
## NA's :8
## rnaavailableul libraryconstdate libraryqcdate
## Min. :17.1 Min. :20181025 Min. :20181029
## 1st Qu.:26.2 1st Qu.:20200922 1st Qu.:20200564
## Median :27.4 Median :20210303 Median :20201223
## Mean :26.6 Mean :20204976 Mean :20203133
## 3rd Qu.:27.7 3rd Qu.:20210923 3rd Qu.:20210317
## Max. :28.5 Max. :20211223 Max. :20211223
## NA's :8 NA's :22
## rnausedtoconstructlibrariesul rnausedtoconstructlibrariesug libqcpassed
## Min. :0.410 Length:93 Length:93
## 1st Qu.:0.797 Class :character Class :character
## Median :1.060 Mode :character Mode :character
## Mean :1.614
## 3rd Qu.:2.318
## Max. :8.010
## NA's :9
## index indexsequence libraryvolumeul
## Min. : 1.0 Length:93 Min. :27.0
## 1st Qu.: 7.0 Class :character 1st Qu.:28.0
## Median :12.0 Mode :character Median :28.0
## Mean :13.3 Mean :27.9
## 3rd Qu.:20.0 3rd Qu.:28.0
## Max. :27.0 Max. :28.0
##
## libraryvolumesenttonajibu00b4slabul shipmentu00a0date descriptonandremarks
## Min. :15.0 Min. :20200217 Length:93
## 1st Qu.:15.0 1st Qu.:20210104 Class :character
## Median :15.0 Median :20210427 Mode :character
## Mean :15.1 Mean :20209109
## 3rd Qu.:15.0 3rd Qu.:20211012
## Max. :28.0 Max. :20220103
## NA's :8
## librarybioanalyzerprofileelsayedlabfilenamewelllane libraryconcnm
## Length:93 Min. : 1.7
## Class :character 1st Qu.:25.4
## Mode :character Median :36.6
## Mean :38.8
## 3rd Qu.:54.1
## Max. :82.9
## NA's :50
## samplefor100ul2or4nmsequencing waterfor100ul2nmsequencing sequencingorderno
## Min. : 2.51 Min. :-1.76 Length:93
## 1st Qu.: 3.56 1st Qu.:89.33 Class :character
## Median : 7.34 Median :92.55 Mode :character
## Mean : 8.93 Mean :84.40
## 3rd Qu.:10.38 3rd Qu.:95.77
## Max. :41.67 Max. :97.49
## NA's :66 NA's :66
## seqorderdate seqcompletedate totalreads trimmedreads
## Min. :20191107 Length:93 Min. :1.76e+06 Min. :1.44e+06
## 1st Qu.:20191107 Class :character 1st Qu.:2.63e+07 1st Qu.:2.13e+07
## Median :20210427 Mode :character Median :3.13e+07 Median :2.64e+07
## Mean :20201840 Mean :4.25e+07 Mean :3.59e+07
## 3rd Qu.:20210427 3rd Qu.:4.59e+07 3rd Qu.:3.59e+07
## Max. :20210427 Max. :1.60e+08 Max. :1.47e+08
## NA's :84 NA's :22 NA's :22
## percentkept lpanamensisv36salmonfile lpanamensisv36hisatfile
## Min. :0.588 Length:93 Length:93
## 1st Qu.:0.796 Class :character Class :character
## Median :0.854 Mode :character Mode :character
## Mean :0.834
## 3rd Qu.:0.902
## Max. :0.932
## NA's :25
## hisatsinglemappedconcordant hisatmultimappedconcordant
## Min. :1.85e+05 Min. : 19166
## 1st Qu.:1.82e+07 1st Qu.:1080740
## Median :2.22e+07 Median :1354729
## Mean :3.02e+07 Mean :1936290
## 3rd Qu.:3.05e+07 3rd Qu.:2092224
## Max. :1.21e+08 Max. :7648225
## NA's :22 NA's :22
## hisatconcordantmappingrate hisatdiscordantsingle hisatdiscordantmulti
## Min. :0.00659 Min. : 319672 Min. : 47820
## 1st Qu.:0.88420 1st Qu.: 558865 1st Qu.: 67872
## Median :0.90733 Median : 788489 Median : 76361
## Mean :0.89400 Mean : 4281951 Mean : 963850
## 3rd Qu.:0.92829 3rd Qu.: 850646 3rd Qu.: 102498
## Max. :0.97614 Max. :22441366 Max. :5406308
## NA's :22 NA's :87 NA's :87
## hisattotalrate bcftable freebayessummary r1slforward
## Min. :0.9 Length:93 Length:93 Min. : 0.0
## 1st Qu.:0.9 Class :character Class :character 1st Qu.: 13.0
## Median :0.9 Mode :character Mode :character Median : 24.0
## Mean :0.9 Mean : 29.2
## 3rd Qu.:0.9 3rd Qu.: 37.0
## Max. :0.9 Max. :111.0
## NA's :92 NA's :32
## r1slrevcomp r2slforward r2slrevcomp zymodemereference
## Min. : 38 Min. : 101616 Min. : 0.0 Length:93
## 1st Qu.: 226 1st Qu.: 237498 1st Qu.: 2.0 Class :character
## Median : 479 Median : 328730 Median : 4.5 Mode :character
## Mean : 4168 Mean : 453401 Mean : 907.5
## 3rd Qu.: 1360 3rd Qu.: 552578 3rd Qu.: 9.0
## Max. :51878 Max. :1450388 Max. :47191.0
## NA's :32 NA's :39 NA's :39
## knnv2classification knnv2notes hclustclade hclustnotes
## Length:93 Length:93 Length:93 Length:93
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## knnhclusttogethercall condition batch
## Length:93 z2.1: 7 undefined:93
## Class :character z2.2:43
## Mode :character z2.3:41
## z2.4: 2
##
##
##
data_structures <- c(data_structures, "lp_se")
save(list = "lp_se", file = glue("rda/tmrc2_lp_se_all_sanitized-v{ver}.rda"))
table(colData(lp_se)[["zymodemecategorical"]])##
## z21 z22 z23 z24
## 7 43 41 2
##
## cure failure nd
## 41 34 18
##
## cure fail unknown
## 41 34 18
## [1] 93
## [1] "TMRC20002" "TMRC20004" "TMRC20067" "TMRC20068" "TMRC20041" "TMRC20015"
## [7] "TMRC20009" "TMRC20016" "TMRC20011" "TMRC20017" "TMRC20019" "TMRC20024"
## [13] "TMRC20036" "TMRC20069" "TMRC20033" "TMRC20031" "TMRC20055" "TMRC20078"
## [19] "TMRC20094" "TMRC20042" "TMRC20058" "TMRC20072" "TMRC20059" "TMRC20048"
## [25] "TMRC20057" "TMRC20088" "TMRC20056" "TMRC20043" "TMRC20046" "TMRC20093"
## [31] "TMRC20089" "TMRC20047" "TMRC20090" "TMRC20044" "TMRC20045" "TMRC20108"
## [37] "TMRC20096" "TMRC20101" "TMRC20092" "TMRC20091" "TMRC20095"
## [1] "TMRC20001" "TMRC20065" "TMRC20039" "TMRC20010" "TMRC20012" "TMRC20013"
## [7] "TMRC20014" "TMRC20018" "TMRC20070" "TMRC20020" "TMRC20021" "TMRC20022"
## [13] "TMRC20026" "TMRC20076" "TMRC20073" "TMRC20079" "TMRC20071" "TMRC20060"
## [19] "TMRC20083" "TMRC20085" "TMRC20105" "TMRC20109" "TMRC20098" "TMRC20082"
## [25] "TMRC20102" "TMRC20099" "TMRC20100" "TMRC20084" "TMRC20087" "TMRC20103"
## [31] "TMRC20104" "TMRC20086" "TMRC20107" "TMRC20081"
unknown_ids <- colData(lp_se)[["clinicalcategorical"]] == "unknown"
rownames(colData(lp_se))[unknown_ids]## [1] "TMRC20005" "TMRC20066" "TMRC20037" "TMRC20038" "TMRC20077" "TMRC20074"
## [7] "TMRC20063" "TMRC20053" "TMRC20052" "TMRC20064" "TMRC20075" "TMRC20051"
## [13] "TMRC20050" "TMRC20049" "TMRC20062" "TMRC20110" "TMRC20080" "TMRC20054"
all_sensitive_ids <- colData(lp_se)[["qualitativeclassificationofdrugsusceptibility"]] == "Sensitive"
sensitive_ids <- rownames(colData(lp_se))[all_sensitive_ids]
sensitive_ids## [1] "TMRC20002" "TMRC20004" "TMRC20005" "TMRC20039" "TMRC20041" "TMRC20009"
## [7] "TMRC20011" "TMRC20012" "TMRC20017" "TMRC20014" "TMRC20019" "TMRC20020"
## [13] "TMRC20022" "TMRC20024" "TMRC20036" "TMRC20069" "TMRC20033" "TMRC20026"
## [19] "TMRC20031" "TMRC20076" "TMRC20055" "TMRC20078" "TMRC20072" "TMRC20057"
## [25] "TMRC20056" "TMRC20060" "TMRC20077" "TMRC20074" "TMRC20063" "TMRC20053"
## [31] "TMRC20050" "TMRC20049" "TMRC20110" "TMRC20083" "TMRC20046" "TMRC20093"
## [37] "TMRC20044" "TMRC20045" "TMRC20109" "TMRC20096" "TMRC20092" "TMRC20091"
## [43] "TMRC20084" "TMRC20087" "TMRC20103" "TMRC20086" "TMRC20081"
all_resistant_ids <- colData(lp_se)[["qualitativeclassificationofdrugsusceptibility"]] == "Resistant"
resistant_ids <- rownames(colData(lp_se))[all_resistant_ids]
resistant_ids## [1] "TMRC20001" "TMRC20065" "TMRC20066" "TMRC20037" "TMRC20038" "TMRC20067"
## [7] "TMRC20068" "TMRC20015" "TMRC20010" "TMRC20016" "TMRC20013" "TMRC20018"
## [13] "TMRC20070" "TMRC20021" "TMRC20073" "TMRC20079" "TMRC20071" "TMRC20094"
## [19] "TMRC20042" "TMRC20058" "TMRC20059" "TMRC20048" "TMRC20088" "TMRC20052"
## [25] "TMRC20064" "TMRC20075" "TMRC20051" "TMRC20062" "TMRC20080" "TMRC20043"
## [31] "TMRC20054" "TMRC20085" "TMRC20089" "TMRC20047" "TMRC20090" "TMRC20105"
## [37] "TMRC20108" "TMRC20098" "TMRC20101" "TMRC20082" "TMRC20102" "TMRC20099"
## [43] "TMRC20100" "TMRC20104" "TMRC20107" "TMRC20095"
all_z23 <- colData(lp_se)[["zymodemecategorical"]] == "z23"
z23_ids <- rownames(colData(lp_se))[all_z23]
z23_ids## [1] "TMRC20001" "TMRC20065" "TMRC20066" "TMRC20037" "TMRC20038" "TMRC20067"
## [7] "TMRC20068" "TMRC20015" "TMRC20010" "TMRC20016" "TMRC20013" "TMRC20018"
## [13] "TMRC20070" "TMRC20021" "TMRC20073" "TMRC20079" "TMRC20071" "TMRC20094"
## [19] "TMRC20058" "TMRC20059" "TMRC20048" "TMRC20052" "TMRC20064" "TMRC20075"
## [25] "TMRC20051" "TMRC20062" "TMRC20080" "TMRC20043" "TMRC20054" "TMRC20085"
## [31] "TMRC20089" "TMRC20090" "TMRC20105" "TMRC20098" "TMRC20082" "TMRC20102"
## [37] "TMRC20099" "TMRC20100" "TMRC20104" "TMRC20107" "TMRC20095"
all_z22 <- colData(lp_se)[["zymodemecategorical"]] == "z22"
z22_ids <- rownames(colData(lp_se))[all_z22]
z22_ids## [1] "TMRC20002" "TMRC20004" "TMRC20005" "TMRC20039" "TMRC20041" "TMRC20009"
## [7] "TMRC20011" "TMRC20012" "TMRC20017" "TMRC20014" "TMRC20019" "TMRC20020"
## [13] "TMRC20022" "TMRC20024" "TMRC20036" "TMRC20069" "TMRC20033" "TMRC20026"
## [19] "TMRC20031" "TMRC20076" "TMRC20055" "TMRC20078" "TMRC20042" "TMRC20072"
## [25] "TMRC20088" "TMRC20060" "TMRC20077" "TMRC20074" "TMRC20063" "TMRC20053"
## [31] "TMRC20050" "TMRC20049" "TMRC20110" "TMRC20083" "TMRC20046" "TMRC20044"
## [37] "TMRC20109" "TMRC20096" "TMRC20101" "TMRC20092" "TMRC20087" "TMRC20086"
## [43] "TMRC20081"
## Mode FALSE TRUE
## logical 40 3
## [1] "TMRC20019" "TMRC20020" "TMRC20078" "TMRC20085" "TMRC20093" "TMRC20102"
## Mode FALSE
## logical 41
All the following data will derive from this starting point.
Here is a table of my current classifier’s interpretation of the strains.
##
## unknown z21 z22 z23 z24
## 2 5 43 41 2
merged_zymo <- lp_se
colData(merged_zymo)[["zymodeme"]] <- as.character(colData(merged_zymo)[["zymodemecategorical"]])
z21_idx <- colData(merged_zymo)[["zymodeme"]] == "z21"
colData(merged_zymo)[z21_idx, "zymodeme"] <- "z22"
z24_idx <- colData(merged_zymo)[["zymodeme"]] == "z24"
colData(merged_zymo)[z24_idx, "zymodeme"] <- "z23"
keepers <- colData(merged_zymo)[["zymodeme"]] == "z22" |
colData(merged_zymo)[["zymodeme"]] == "z23"
merged_zymo <- merged_zymo[, keepers] %>%
set_conditions(fact = "zymodeme", colors = color_choices[["zymo"]])## The numbers of samples by condition are:
##
## z22 z23
## 50 43
##
## cure fail unknown
## 41 34 18
unknown_ids <- colData(lp_se)[["clinicalcategorical"]] == "unknown"
rownames(colData(lp_se))[unknown_ids]## [1] "TMRC20005" "TMRC20066" "TMRC20037" "TMRC20038" "TMRC20077" "TMRC20074"
## [7] "TMRC20063" "TMRC20053" "TMRC20052" "TMRC20064" "TMRC20075" "TMRC20051"
## [13] "TMRC20050" "TMRC20049" "TMRC20062" "TMRC20110" "TMRC20080" "TMRC20054"
failed_ids <- colData(lp_se)[["clinicalcategorical"]] == "fail"
rownames(colData(lp_se))[failed_ids]## [1] "TMRC20001" "TMRC20065" "TMRC20039" "TMRC20010" "TMRC20012" "TMRC20013"
## [7] "TMRC20014" "TMRC20018" "TMRC20070" "TMRC20020" "TMRC20021" "TMRC20022"
## [13] "TMRC20026" "TMRC20076" "TMRC20073" "TMRC20079" "TMRC20071" "TMRC20060"
## [19] "TMRC20083" "TMRC20085" "TMRC20105" "TMRC20109" "TMRC20098" "TMRC20082"
## [25] "TMRC20102" "TMRC20099" "TMRC20100" "TMRC20084" "TMRC20087" "TMRC20103"
## [31] "TMRC20104" "TMRC20086" "TMRC20107" "TMRC20081"
## Library sizes of 93 samples,
## ranging from 564,812 to 1.37e+08.
pdf(file = "figures/library_size_pre_filter.pdf", width = 24, height = 12)
pre_libsize$plot
dev.off()## png
## 2
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
## i The deprecated feature was likely used in the hpgltools package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## A non-zero genes plot of 93 samples.
## These samples have an average 28.6 CPM coverage and 8691 genes observed, ranging from 8452 to
## 8749.
## png
## 2
## The samples (and read coverage) removed when filtering 8550 non-zero genes are:
## TMRC20002
## 11681227
## TMRC20002
## 8452
## Samples removed: 8452
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## A non-zero genes plot of 92 samples.
## These samples have an average 28.78 CPM coverage and 8694 genes observed, ranging from 8554 to
## 8749.
Column ‘Q’ in the sample sheet, make a categorical version of it with these parameters:
Note that these cutoffs are only valid for the historical data. The newer susceptibility data uses a cutoff of 0.78 for sensitive. I will set ambiguous to 0.5 to 0.78?
max_resist_historical <- 0.35
min_sensitive_historical <- 0.49
## 202305: Removed ambiguous category for the current set.
max_resist_current <- 0.77
min_sensitive_current <- 0.77The sanitize_percent() function seeks to make the percentage values recorded by excel more reliable. Unfortunately, sometimes excel displays the value ‘49%’ when the information recorded in the worksheet is any one of the following:
Thus, the following block will sanitize these percentage values into a single decimal number and make a categorical variable from it using pre-defined values for resistant/ambiguous/sensitive. This categorical variable will be stored in a new column: ‘sus_category_historical’.
st <- colData(lp_se)[["susceptibilityinfectionreduction32ugmlsbvhistoricaldata"]]
starting <- sanitize_percent(st)
st## [1] "0.45" "0.14" "0.99" "0.97" "0" "0.97" "0"
## [8] "0" "0.46" "0.45" "0.97" "0.56" "0.99" "0.46"
## [15] "0.7" "0.99" "0.99" "0.45" "0.98" "0.99" "0.49"
## [22] "No data" "No data" "0.99" "0.66" "0.99" "0.99" "1"
## [29] "1" "0.94" "0.94" "No data" "No data" "No data" "No data"
## [36] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [43] "No data" "No data" "No data" "0.99" "0.99" "No data" "0.98"
## [50] "0.97" "0.96" "0.96" "0" "0" "0" "0.06"
## [57] "0.94" "0.94" "0.03" "0.94" "0" "0.25" "0.95"
## [64] "0.27" "No data" "No data" "No data" "No data" "No data" "No data"
## [71] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [78] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [85] "No data" "No data" "No data" "No data" "No data" "No data" "No data"
## [92] "No data"
## [1] 0.45 0.14 0.99 0.97 0.00 0.97 0.00 0.00 0.46 0.45 0.97 0.56 0.99 0.46 0.70
## [16] 0.99 0.99 0.45 0.98 0.99 0.49 NA NA 0.99 0.66 0.99 0.99 1.00 1.00 0.94
## [31] 0.94 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [46] 0.99 0.99 NA 0.98 0.97 0.96 0.96 0.00 0.00 0.00 0.06 0.94 0.94 0.03 0.94
## [61] 0.00 0.25 0.95 0.27 NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [91] NA NA
## [1] 45
sus_categorical[na_idx] <- "unknown"
resist_idx <- starting <= max_resist_historical
sus_categorical[resist_idx] <- "resistant"
indeterminant_idx <- starting > max_resist_historical &
starting < min_sensitive_historical
sus_categorical[indeterminant_idx] <- "ambiguous"
susceptible_idx <- starting >= min_sensitive_historical
sus_categorical[susceptible_idx] <- "sensitive"
sus_categorical <- as.factor(sus_categorical)
colData(lp_se)[["sus_category_historical"]] <- sus_categorical
table(sus_categorical)## sus_categorical
## ambiguous resistant sensitive unknown
## 5 12 30 45
two_sankey <- plot_meta_sankey(
merged_zymo, factors = c("zymodeme", "clinicalcategorical", "susceptibility"),
drill_down = TRUE, color_choices = color_choices)## These columns are not in the metadata: susceptibility
## Warning: attributes are not identical across measure variables; they will be
## dropped
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## i Please use the `linewidth` argument instead.
## i The deprecated feature was likely used in the ggsankey package.
## Please report the issue at <https://github.com/davidsjoberg/ggsankey/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## A sankey plot describing the metadata of 93 samples,
## including 8 out of 0 nodes and traversing metadata factors:
## zymodeme, clinicalcategorical.
The same process will be repeated for the current iteration of the sensitivity assay and stored in the ‘sus_category_current’ column.
starting_current <- sanitize_percent(colData(lp_se)[["susceptibilityinfectionreduction32ugmlsbvcurrentdata"]])
sus_categorical_current <- starting_current
na_idx <- is.na(starting_current)
sum(na_idx)## [1] 0
sus_categorical_current[na_idx] <- "unknown"
## The following is only valid when we had three categories, resistant/ambiguous/sensitive
## The new cutoffs drop ambiguous.
#resist_idx <- starting_current <= max_resist_current
#sus_categorical_current[resist_idx] <- "resistant"
#indeterminant_idx <- starting_current > max_resist_current &
# starting_current < min_sensitive_current
#sus_categorical_current[indeterminant_idx] <- "ambiguous"
#susceptible_idx <- starting_current >= min_sensitive_current
#sus_categorical_current[susceptible_idx] <- "sensitive"
#sus_categorical_current <- as.factor(sus_categorical_current)
resist_idx <- starting_current <= max_resist_current
sensitive_idx <- !resist_idx
sus_categorical_current[resist_idx] <- "resistant"
sus_categorical_current[sensitive_idx] <- "sensitive"
sus_categorical_current <- as.factor(sus_categorical_current)
colData(lp_se)[["sus_category_current"]] <- sus_categorical_current
colData(lp_se)[["susceptibility"]] <- sus_categorical_current
table(sus_categorical_current)## sus_categorical_current
## resistant sensitive
## 46 46
lp_sankey <- plot_meta_sankey(
lp_se, factors = c("zymodemecategorical", "clinicalcategorical", "susceptibility"),
drill_down = TRUE, color_choices = color_choices)## Warning: attributes are not identical across measure variables; they will be
## dropped
## A sankey plot describing the metadata of 92 samples,
## including 23 out of 0 nodes and traversing metadata factors:
## zymodemecategorical, clinicalcategorical, susceptibility.
In many queries, we will seek to compare only the two primary strains, zymodeme 2.2 and 2.3. The following block will extract only those samples.
Note: IMPORTANT Maria Adelaida prefers not to use lp_two_strains. We should not at this time use the merged 2.1/2.2 and 2.4/2.3 categories.
lp_strain <- lp_se %>%
set_batches(fact = sus_categorical_current) %>%
set_colors(color_choices[["strain"]])## The number of samples by batch are:
##
## resistant sensitive
## 46 46
## Warning in set_se_colors(exp, ...): Colors for the following categories are not
## being used: z2.0, z3.0, z3.2, z1.0, z1.5, b2904, unknown.
##
## z2.1 z2.2 z2.3 z2.4
## 7 42 41 2
Clinical outcome is by far the most problematic comparison in this data, but here is the recategorization of the data using it:
lp_cf <- set_conditions(lp_se, fact = "clinicalcategorical",
colors = color_choices[["cf"]]) %>%
set_batches(fact = sus_categorical_current)## The numbers of samples by condition are:
##
## cure fail unknown
## 40 34 18
## Warning in set_se_colors(new_se, colors = colors): Colors for the following
## categories are not being used: notapplicable.
## The number of samples by batch are:
##
## resistant sensitive
## 46 46
##
## cure fail unknown
## 40 34 18
data_structures <- c(data_structures, "lp_cf")
save(list = "lp_cf", file = glue("rda/tmrc2_lp_cf-v{ver}.rda"))
lp_cf_known <- subset_se(lp_cf, subset = "condition!='unknown'")
data_structures <- c(data_structures, "lp_cf_known")
save(list = "lp_cf_known", file = glue("rda/tmrc2_lp_cf_known-v{ver}.rda"))
data_structures <- c(data_structures, "lp_cf_known")
save(list = "lp_cf_known", file = glue("rda/tmrc2_lp_cf_known-v{ver}.rda"))Use the factorized version of susceptibility to categorize the samples by the historical data.
lp_susceptibility_historical <- set_conditions(
lp_se, fact = "sus_category_historical", colors = color_choices[["susceptibility"]]) %>%
set_batches(fact = "clinicalcategorical")## The numbers of samples by condition are:
##
## ambiguous resistant sensitive unknown
## 5 12 30 45
## The number of samples by batch are:
##
## cure fail unknown
## 40 34 18
Use the factorized version of susceptibility to categorize the samples by the historical data.
This will likely be our canonical susceptibility dataset, so I will remove the suffix and just call it ‘lp_susceptibility’.
lp_susceptibility <- set_conditions(
lp_se, fact = "sus_category_current", colors = color_choices[["susceptibility"]]) %>%
set_batches(fact = "clinicalcategorical")## The numbers of samples by condition are:
##
## resistant sensitive
## 46 46
## Warning in set_se_colors(new_se, colors = colors): Colors for the following
## categories are not being used: ambiguous, unknown.
## The number of samples by batch are:
##
## cure fail unknown
## 40 34 18
I think this is redundant with a previous block, but I am leaving it until I am certain that it is not required in a following document.
Note: IMPORTANT This is the set Maria Adeliada prefers to use.
The following section will create some initial data structures of the observed variants in the parasite samples. This will include some of our 2016 samples for some classification queries.
I changed and improved the mapping and variant detection methods from what we used for the 2016 data. So some small changes will be required to merge them.
lp_previous <- create_se("sample_sheets/tmrc2_samples_20191203.xlsx",
file_column = "tophat2file",
savefile = glue("rda/lp_previous-v{ver}.rda"))
tt <- lp_previous$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\-1$", replacement = "", x = rownames(tt))
lp_previous$expressionset <- tt
rm(tt)
data_structures <- c(data_structures, "lp_previous")The count_se_snps() function uses our expressionset data and a metadata column in order to extract the mpileup or freebayes-based variant calls and create matrices of the likelihood that each position-per-sample is in fact a variant.
There is an important caveat here which changed on 202301: I was interpreting using the PAIRED tag, which is only used for, unsurprisingly, paired-end samples. A couple samples are not paired and so were failing silently. The QA tag looks like it is more appropriate and should work across both types. One way to find out, I am setting it here and will look to see if the results make more sense for my test samples (TMRC2001, TMRC2005, TMRC2007).
## The next line drops the samples which are missing the SNP pipeline.
lp_snp <- subset_se(lp_se, subset = "!is.na(colData(lp_se)[['freebayessummary']])")
lp_snp_sufficient <- subset_se(lp_snp, subset = "rownames!='TMRC20082'")
lp_snp_only22_23_ref <- subset_se(lp_snp, subset = "zymodemereference=='z2.2'|zymodemereference=='z2.3'") %>%
subset_se(subset = "rownames!='TMRC20082'")
lp_snp_22_23_ml <- subset_se(lp_snp, subset = "knnv2classification=='z22'|knnv2classification=='z23'") %>%
subset_se(subset = "rownames!='TMRC20082'")
new_snps_sufficient <- count_snps(lp_snp_sufficient, annot_column = "freebayessummary",
snp_column = "QA", reader = "readr")## Using the snp column: QA from the sample annotations.
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## * `DP` -> `DP...3`
## * `RO` -> `RO...8`
## * `AO` -> `AO...9`
## * `QR` -> `QR...12`
## * `QA` -> `QA...13`
## * `DP` -> `DP...42`
## * `RO` -> `RO...43`
## * `QR` -> `QR...44`
## * `AO` -> `AO...45`
## * `QA` -> `QA...46`
new_snps_only22_23_ref_suf <- count_snps(lp_snp_only22_23_ref, annot_column = "freebayessummary",
snp_column = "QA", reader = "readr")## Using the snp column: QA from the sample annotations.
## New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:
new_snps_22_23_ml_suf <- count_snps(lp_snp_22_23_ml, annot_column = "freebayessummary",
snp_column = "QA", reader = "readr")## Using the snp column: QA from the sample annotations.
## New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:New names:
## Lets see if we get numbers which make sense.
summary(assay(new_snps_sufficient)[["TMRC20001"]]) ## My weirdo sample## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 22.8 0.0 2217.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 104 0 247568
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1121 0 1708458
## Now that we are reasonably confident that things make more sense, lets save and move on...
data_structures <- c(data_structures, "new_snps_sufficient", "lp_snp")
## Most of the time I just run normalize() and assume it will dispatch to normalize_se()
sufficient_norm <- normalize(new_snps_sufficient, transform = "log2")## transform_counts: Found 79143354 values equal to 0, adding 1 to the matrix.
## TMRC20001 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037
## min 0.0000 0.000 0.00000 0.00000 0.0000 0.0000 0.000
## q1 0.0000 0.000 0.00000 0.00000 0.0000 0.0000 0.000
## median 0.0000 0.000 0.00000 0.00000 0.0000 0.0000 0.000
## mean 0.6225 0.993 0.03003 0.09332 0.9328 0.2109 1.084
## q3 0.0000 0.000 0.00000 0.00000 0.0000 0.0000 0.000
## max 11.1150 19.753 10.26796 11.23482 17.7880 18.1461 19.430
## iqr 0.0000 0.000 0.00000 0.00000 0.0000 0.0000 0.000
## iqr_high 0.0000 0.000 0.00000 0.00000 0.0000 0.0000 0.000
## iqr_low 0.0000 0.000 0.00000 0.00000 0.0000 0.0000 0.000
## sd 2.1264 3.024 0.43947 0.81328 2.8365 1.3249 3.215
## var 4.5216 9.148 0.19314 0.66143 8.0457 1.7555 10.336
## stdvar 7.2637 9.212 6.43213 7.08780 8.6252 8.3255 9.536
## TMRC20038 TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010
## min 0.000 0.000 0.000 0.0000 0.000 0.000 0.000
## q1 0.000 0.000 0.000 0.0000 0.000 0.000 0.000
## median 0.000 0.000 0.000 0.0000 0.000 0.000 0.000
## mean 1.094 1.002 1.015 0.5844 1.220 0.161 1.023
## q3 0.000 0.000 0.000 0.0000 0.000 0.000 0.000
## max 20.303 19.730 18.920 19.4191 21.149 18.293 17.511
## iqr 0.000 0.000 0.000 0.0000 0.000 0.000 0.000
## iqr_high 0.000 0.000 0.000 0.0000 0.000 0.000 0.000
## iqr_low 0.000 0.000 0.000 0.0000 0.000 0.000 0.000
## sd 3.260 3.040 3.039 2.4470 3.636 1.262 3.098
## var 10.626 9.241 9.235 5.9880 13.224 1.594 9.597
## stdvar 9.714 9.226 9.103 10.2463 10.840 9.901 9.386
## TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018
## min 0.000 0.0000 0.000000 0.000 0.000 0.0000 0.000
## q1 0.000 0.0000 0.000000 0.000 0.000 0.0000 0.000
## median 0.000 0.0000 0.000000 0.000 0.000 0.0000 0.000
## mean 1.552 0.1189 0.003488 1.072 0.427 0.1545 1.529
## q3 0.000 0.0000 0.000000 0.000 0.000 0.0000 0.000
## max 19.603 16.8590 11.116994 17.648 20.165 18.9166 20.774
## iqr 0.000 0.0000 0.000000 0.000 0.000 0.0000 0.000
## iqr_high 0.000 0.0000 0.000000 0.000 0.000 0.0000 0.000
## iqr_low 0.000 0.0000 0.000000 0.000 0.000 0.0000 0.000
## sd 3.725 1.0066 0.159721 3.228 1.904 1.1808 3.761
## var 13.873 1.0132 0.025511 10.418 3.625 1.3943 14.149
## stdvar 8.938 8.5208 7.313648 9.720 8.489 9.0222 9.251
## TMRC20019 TMRC20070 TMRC20020 TMRC20021 TMRC20022 TMRC20024 TMRC20036
## min 0.0000 0.000 0.0000 0.000 0.0000 0.0000 0.0000
## q1 0.0000 0.000 0.0000 0.000 0.0000 0.0000 0.0000
## median 0.0000 0.000 0.0000 0.000 0.0000 0.0000 0.0000
## mean 0.1597 1.067 0.1713 1.324 0.1656 0.1866 0.5803
## q3 0.0000 0.000 0.0000 0.000 0.0000 0.0000 0.0000
## max 19.8862 19.933 19.8229 20.704 20.0597 20.3021 18.5956
## iqr 0.0000 0.000 0.0000 0.000 0.0000 0.0000 0.0000
## iqr_high 0.0000 0.000 0.0000 0.000 0.0000 0.0000 0.0000
## iqr_low 0.0000 0.000 0.0000 0.000 0.0000 0.0000 0.0000
## sd 1.2994 3.179 1.3632 3.849 1.2195 1.3625 2.2893
## var 1.6884 10.104 1.8582 14.818 1.4871 1.8563 5.2410
## stdvar 10.5751 9.465 10.8482 11.194 8.9784 9.9488 9.0322
## TMRC20069 TMRC20033 TMRC20026 TMRC20031 TMRC20076 TMRC20073 TMRC20055
## min 0.0000 0.0000 0.0000 0.0000 0.000 0.000 0.0000
## q1 0.0000 0.0000 0.0000 0.0000 0.000 0.000 0.0000
## median 0.0000 0.0000 0.0000 0.0000 0.000 0.000 0.0000
## mean 0.1647 0.3053 0.1399 0.1681 0.153 1.052 0.1775
## q3 0.0000 0.0000 0.0000 0.0000 0.000 0.000 0.0000
## max 18.7410 19.6357 18.6644 18.5724 18.516 19.552 17.7426
## iqr 0.0000 0.0000 0.0000 0.0000 0.000 0.000 0.0000
## iqr_high 0.0000 0.0000 0.0000 0.0000 0.000 0.000 0.0000
## iqr_low 0.0000 0.0000 0.0000 0.0000 0.000 0.000 0.0000
## sd 1.1945 1.6495 1.1321 1.2153 1.149 3.155 1.1745
## var 1.4269 2.7209 1.2816 1.4769 1.321 9.951 1.3795
## stdvar 8.6618 8.9123 9.1576 8.7847 8.631 9.455 7.7714
## TMRC20079 TMRC20071 TMRC20078 TMRC20094 TMRC20042 TMRC20058 TMRC20072
## min 0.000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## q1 0.000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## median 0.000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## mean 1.043 0.9965 0.1647 0.8776 0.1623 1.045 0.5082
## q3 0.000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## max 19.764 18.4983 18.1893 18.9351 17.5305 19.184 17.9175
## iqr 0.000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## iqr_high 0.000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## iqr_low 0.000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## sd 3.125 3.0180 1.1912 2.7624 1.1427 3.161 2.1854
## var 9.763 9.1086 1.4188 7.6310 1.3058 9.989 4.7761
## stdvar 9.365 9.1410 8.6170 8.6950 8.0435 9.559 9.3984
## TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056 TMRC20060 TMRC20077
## min 0.000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 0.000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## median 0.000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## mean 1.026 1.041 0.5061 0.1231 0.1745 0.1796 0.1475
## q3 0.000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## max 19.184 18.886 18.4954 16.2692 16.9808 18.8403 18.4811
## iqr 0.000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## iqr_high 0.000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## iqr_low 0.000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## sd 3.110 3.108 2.1993 0.9833 1.1391 1.2111 1.0963
## var 9.672 9.659 4.8368 0.9669 1.2975 1.4668 1.2018
## stdvar 9.426 9.275 9.5577 7.8552 7.4336 8.1662 8.1504
## TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075 TMRC20051
## min 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000
## q1 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000
## median 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000
## mean 0.1679 0.2165 0.1588 1.085 0.9871 0.9196 1.008
## q3 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000
## max 17.9632 17.1481 18.0063 18.877 18.1387 17.9188 19.635
## iqr 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000
## iqr_high 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000
## iqr_low 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000
## sd 1.1498 1.2579 1.1049 3.178 3.0075 2.7572 3.059
## var 1.3219 1.5822 1.2207 10.102 9.0451 7.6024 9.355
## stdvar 7.8715 7.3097 7.6854 9.309 9.1634 8.2671 9.281
## TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083
## min 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
## q1 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
## median 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
## mean 0.1415 0.142 1.000 0.138 1.003 1.031 0.163
## q3 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
## max 16.8623 18.967 18.828 19.540 17.985 19.257 17.632
## iqr 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
## iqr_high 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
## iqr_low 0.0000 0.000 0.000 0.000 0.000 0.000 0.000
## sd 1.0684 1.114 3.042 1.113 3.019 3.098 1.109
## var 1.1415 1.241 9.254 1.239 9.116 9.600 1.230
## stdvar 8.0687 8.741 9.249 8.979 9.088 9.310 7.544
## TMRC20054 TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090
## min 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## median 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## mean 1.022 0.9384 0.5004 0.4997 0.9588 0.9891 0.9696
## q3 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## max 19.378 18.7345 19.3740 18.9472 18.7778 18.5694 19.0313
## iqr 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## iqr_high 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## iqr_low 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## sd 3.109 2.9200 2.1872 2.1884 2.9702 3.0217 2.9818
## var 9.664 8.5265 4.7836 4.7892 8.8221 9.1309 8.8911
## stdvar 9.457 9.0867 9.5591 9.5834 9.2010 9.2319 9.1697
## TMRC20044 TMRC20045 TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096
## min 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## q1 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## median 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## mean 0.1291 0.5069 0.8631 1.093 0.1508 0.9885 0.1495
## q3 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## max 17.7954 18.8505 18.8935 19.077 18.7071 19.2242 17.6576
## iqr 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## iqr_high 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## iqr_low 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000
## sd 1.0531 2.1702 2.7374 3.265 1.1332 3.0192 1.1285
## var 1.1089 4.7097 7.4931 10.660 1.2842 9.1155 1.2735
## stdvar 8.5904 9.2918 8.6816 9.752 8.5174 9.2217 8.5206
## TMRC20101 TMRC20092 TMRC20102 TMRC20099 TMRC20100 TMRC20091 TMRC20084
## min 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## q1 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## median 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## mean 0.1454 0.1351 0.9779 0.9326 1.022 0.1194 0.4668
## q3 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## max 18.2321 17.5991 18.8276 19.1029 19.249 17.0371 18.1771
## iqr 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## iqr_high 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## iqr_low 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## sd 1.0889 1.0471 2.9959 2.8756 3.094 0.9690 2.0822
## var 1.1857 1.0964 8.9756 8.2693 9.575 0.9390 4.3356
## stdvar 8.1561 8.1146 9.1782 8.8670 9.372 7.8657 9.2883
## TMRC20087 TMRC20103 TMRC20104 TMRC20086 TMRC20107 TMRC20081 TMRC20095
## min 0.0000 0.0000 0.000 0.0000 0.000 0.000 0.0000
## q1 0.0000 0.0000 0.000 0.0000 0.000 0.000 0.0000
## median 0.0000 0.0000 0.000 0.0000 0.000 0.000 0.0000
## mean 0.1226 0.5015 1.019 0.1343 1.045 0.175 0.7542
## q3 0.0000 0.0000 0.000 0.0000 0.000 0.000 0.0000
## max 17.8865 18.6149 20.082 19.1059 19.692 18.886 18.9728
## iqr 0.0000 0.0000 0.000 0.0000 0.000 0.000 0.0000
## iqr_high 0.0000 0.0000 0.000 0.0000 0.000 0.000 0.0000
## iqr_low 0.0000 0.0000 0.000 0.0000 0.000 0.000 0.0000
## sd 1.0014 2.1813 3.098 1.0685 3.149 1.246 2.4973
## var 1.0027 4.7579 9.600 1.1416 9.913 1.552 6.2365
## stdvar 8.1787 9.4879 9.420 8.4980 9.491 8.869 8.2692
Now let us pull in the 2016 data.
old_snps <- count_snps(lp_previous, annot_column = "bcftable", snp_column = 2)
data_structures <- c(data_structures, "old_snps")
save(list = "lp_snp", file = glue("rda/lp_snp-v{ver}.rda"))
data_structures <- c(data_structures, "lp_snp")
save(list = "new_snps", file = glue("rda/new_snps-v{ver}.rda"))
data_structures <- c(data_structures, "new_snps")
save(list = "old_snps", file = glue("rda/old_snps-v{ver}.rda"))
data_structures <- c(data_structures, "old_snps")
nonzero_snps <- assay(new_snps) != 0
colSums(nonzero_snps)As far as I can tell, freebayes and mpileup are reasonably similar in their sensitivity/specificity; so combining the two datasets like this is expected to work with minimal problems. The most likely problem is that my mpileup-based pipeline is unable to handle indels.
I am taking a heatmap from our variant data and manually identifying sample groups.
All of the above focused entire on the parasite samples, now let us pull up the macrophage infected samples. This will comprise two datasets, one of the human and one of the parasite.
The metadata for the macrophage samples contains a couple of columns for mapped human and parasite reads. We will therefore use them separately to create two expressionsets, one for each species.
## Using mart: ENSEMBL_MART_ENSEMBL from host: apr2020.archive.ensembl.org.
## Successfully connected to the hsapiens_gene_ensembl database.
## Finished downloading ensembl gene annotations.
## Finished downloading ensembl structure annotations.
## symbol columns is null, pattern matching 'symbol' and taking the first.
## Found 2 potential symbol columns, using the first:hgnc_symbol.
## Including symbols, there are 67159 vs the 249740 gene annotations.
## Not dropping haplotype chromosome annotations, set drop_haplotypes = TRUE if this is bad.
## Saving annotations to hsapiens_biomart_annotations.rda.
## Finished save().
hs_annot <- hs_annot[["annotation"]]
hs_annot[["transcript"]] <- paste0(rownames(hs_annot), ".", hs_annot[["transcript_version"]])
rownames(hs_annot) <- make.names(hs_annot[["ensembl_gene_id"]], unique = TRUE)
rownames(hs_annot) <- paste0("gene:", rownames(hs_annot))
tx_gene_map <- hs_annot[, c("transcript", "ensembl_gene_id")]
sanitize_columns <- c("drug", "macrophagetreatment", "macrophagezymodeme")
macr_annot <- hs_annot
rownames(macr_annot) <- gsub(x = rownames(macr_annot),
pattern = "^gene:",
replacement = "")
hs_macrophage <- create_se(macrophage_sheet, gene_info = macr_annot,
file_column = "hg38100hisatfile") %>%
set_conditions(fact = "macrophagetreatment") %>%
set_batches(fact = "macrophagezymodeme") %>%
sanitize_metadata(columns = sanitize_columns) %>%
subset_se(nonzero = 12000)## Reading the sample metadata.
## Did not find the column: sampleid.
## Setting the ID column to the first column.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 69 rows(samples) and 80 columns(metadata fields).
## Matched 21481 annotations and counts.
## Some annotations were lost in merging, setting them to 'undefined'.
## The final summarized experiment has 21481 rows and 80 columns.
## The numbers of samples by condition are:
##
## inf inf_sb uninf uninf_sb
## 30 29 5 5
## The number of samples by batch are:
##
## none z2.2 z2.3
## 10 30 29
## Recasting the data.frame to DataFrame.
## rownames tubelabelorigin samplename numberofvials
## Length:69 Length:69 Length:69 Min. :1
## Class :character Class :character Class :character 1st Qu.:1
## Mode :character Mode :character Mode :character Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## sourcelab expperson cellssource samplecollectiondate
## Length:69 Length:69 Length:69 Min. :20190629
## Class :character Class :character Class :character 1st Qu.:20210813
## Mode :character Mode :character Mode :character Median :20220827
## Mean :20212685
## 3rd Qu.:20220827
## Max. :20220916
##
## typeofcells donor isolationmethod
## Length:69 Length:69 Length:69
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## cellspurificationmethod selectionmethod rnapreservation
## Length:69 Length:69 Length:69
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## rnaextractiondate rnavolumeul rnaavailableul rnaqctesteddate
## Length:69 Min. :30 Min. : 5.2 Min. : 42382
## Class :character 1st Qu.:30 1st Qu.:22.2 1st Qu.:20200114
## Mode :character Median :30 Median :23.3 Median :20205521
## Mean :30 Mean :23.1 Mean :19485696
## 3rd Qu.:30 3rd Qu.:25.6 3rd Qu.:20211111
## Max. :30 Max. :26.2 Max. :20211221
## NA's :41 NA's :41 NA's :41
## bioanalyzerrnangul rnaqcpassed rin nanodroprnangul
## Length:69 Length:69 Length:69 Length:69
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## x260280 x260230 rnausedtoconstructlibrariesul
## Length:69 Length:69 Min. : 2.20
## Class :character Class :character 1st Qu.: 2.80
## Mode :character Mode :character Median : 3.41
## Mean : 4.61
## 3rd Qu.: 4.83
## Max. :23.30
## NA's :41
## rnausedtoconstructlibrariesng libraryqctesteddate libqcpassed
## Min. : 0.5 Min. :20200115 Length:69
## 1st Qu.: 0.5 1st Qu.:20200115 Class :character
## Median :250.3 Median :20201221 Mode :character
## Mean :275.3 Mean :20205505
## 3rd Qu.:500.0 3rd Qu.:20211219
## Max. :800.0 Max. :20211223
## NA's :41 NA's :42
## index libraryvolumeul libraryvolumesenttonajibslabul
## Min. : 1.0 Min. :28 Min. :15
## 1st Qu.: 7.0 1st Qu.:28 1st Qu.:15
## Median :14.5 Median :28 Median :15
## Mean :13.9 Mean :28 Mean :15
## 3rd Qu.:20.2 3rd Qu.:28 3rd Qu.:15
## Max. :27.0 Max. :28 Max. :15
## NA's :41 NA's :41 NA's :41
## shipmentdate oldnew countersampleatcideimul drug
## Min. :20200217 Length:69 Min. :13 antimony:34
## 1st Qu.:20200217 Class :character 1st Qu.:13 none :35
## Median :20210558 Mode :character Median :13
## Mean :20210188 Mean :13
## 3rd Qu.:20220103 3rd Qu.:13
## Max. :20220103 Max. :13
## NA's :41 NA's :56
## descriptonandremarks observation
## Length:69 Length:69
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## librarybioanalyzerprofileelsayedlabfilenamewelllane libraryconcnm
## Length:69 Min. : 22.1
## Class :character 1st Qu.: 57.8
## Mode :character Median : 98.7
## Mean : 96.2
## 3rd Qu.:112.8
## Max. :218.0
## NA's :55
## samplefor100ul2or4nmsequencing waterfor100ul2or4nmsequencing
## Min. :0.917 Min. :92.0
## 1st Qu.:1.754 1st Qu.:96.9
## Median :1.942 Median :98.1
## Mean :2.995 Mean :97.0
## 3rd Qu.:3.082 3rd Qu.:98.2
## Max. :8.000 Max. :99.1
## NA's :56 NA's :56
## sequencingorderno seqorderdate seqcompletedate totalreads
## Length:69 Min. :20200901 Min. :20200910 Min. :12254196
## Class :character 1st Qu.:20200901 1st Qu.:20200910 1st Qu.:21496945
## Mode :character Median :20200901 Median :20200910 Median :23391483
## Mean :20202287 Mean :20202296 Mean :29119440
## 3rd Qu.:20200901 3rd Qu.:20200910 3rd Qu.:27612540
## Max. :20210601 Max. :20210610 Max. :86048061
## NA's :55 NA's :55
## trimmedreads percentkept hg38100salmonfile hg38100hisatfile
## Min. :10400801 Min. :0.735 Length:69 Length:69
## 1st Qu.:19375511 1st Qu.:0.887 Class :character Class :character
## Median :21401552 Median :0.902 Mode :character Mode :character
## Mean :26367248 Mean :0.903
## 3rd Qu.:25048428 3rd Qu.:0.940
## Max. :79775170 Max. :0.945
##
## hisatsinglemappedhg38 hisatmultimappedhg38 hisatmappingratehg38
## Min. : 501881 Min. : 328180 Min. :0.0418
## 1st Qu.:17620997 1st Qu.: 605198 1st Qu.:0.9114
## Median :18986346 Median : 682134 Median :0.9606
## Mean :23074608 Mean :1339401 Mean :0.9223
## 3rd Qu.:22472518 3rd Qu.: 830271 3rd Qu.:0.9787
## Max. :70100660 Max. :7030835 Max. :0.9885
##
## lpanamensisv36hisatfile hisatlpsinglemapped hisatlpmultimapped
## Length:69 Min. : 199 Min. : 14
## Class :character 1st Qu.: 11504 1st Qu.: 770
## Mode :character Median : 284271 Median : 17386
## Mean : 909574 Mean : 70729
## 3rd Qu.:1006681 3rd Qu.: 81008
## Max. :7086371 Max. :620799
##
## parasitemappingrate parasitehostratio x68 macrophagetreatment
## Min. :0.000011 Min. :0.000204 Length:69 inf :30
## 1st Qu.:0.000487 1st Qu.:0.003393 Class :character inf_sb :29
## Median :0.010078 Median :0.003393 Mode :character uninf : 5
## Mean :0.036938 Mean :0.010336 uninf_sb: 5
## 3rd Qu.:0.048453 3rd Qu.:0.003393
## Max. :0.292395 Max. :0.155351
##
## macrophagezymodeme strainid slr1fwd slr1rc
## none:10 Length:69 Min. : 0.00 Min. : 0
## z22 :30 Class :character 1st Qu.: 0.00 1st Qu.: 0
## z23 :29 Mode :character Median : 0.00 Median : 6
## Mean : 1.16 Mean : 31
## 3rd Qu.: 1.00 3rd Qu.: 27
## Max. :11.00 Max. :641
##
## slr2fwd slr2rc slsum slvsreads
## Min. : 2 Min. : 0.000 Min. : 2 Min. :1.03e-07
## 1st Qu.: 224 1st Qu.: 0.000 1st Qu.: 225 1st Qu.:7.81e-06
## Median : 4249 Median : 0.000 Median : 4300 Median :1.83e-04
## Mean : 17256 Mean : 0.667 Mean : 17289 Mean :6.68e-04
## 3rd Qu.: 22089 3rd Qu.: 0.000 3rd Qu.: 22188 3rd Qu.:8.17e-04
## Max. :168826 Max. :13.000 Max. :169487 Max. :6.57e-03
##
## slvshuman trimomatic_input trimomatic_output trimomatic_percent
## Min. :1.05e-07 Length:69 Length:69 Length:69
## 1st Qu.:1.06e-05 Class :character Class :character Class :character
## Median :1.87e-04 Mode :character Mode :character Mode :character
## Mean :8.21e-04
## 3rd Qu.:8.79e-04
## Max. :1.02e-02
##
## fastqc_pct_gc hisat_genome_single_concordant
## Length:69 Length:69
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## hisat_genome_multi_concordant hisat_genome_single_all hisat_genome_multi_all
## Length:69 Length:69 Length:69
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## hisat_genome_percent hisat_count_table condition batch
## Length:69 Length:69 inf :30 none:10
## Class :character Class :character inf_sb :29 z2.2:30
## Mode :character Mode :character uninf : 5 z2.3:29
## uninf_sb: 5
##
##
##
## The samples (and read coverage) removed when filtering 12000 non-zero genes are:
## TMRC30162
## 521145
## TMRC30162
## 10208
## Samples removed: 10208
fixed_genenames <- gsub(x = rownames(assay(hs_macrophage)), pattern = "^gene:",
replacement = "")
hs_macrophage <- set_genenames(hs_macrophage, ids = fixed_genenames)
table(colData(hs_macrophage)$condition)##
## inf inf_sb uninf uninf_sb
## 29 29 5 5
## The following 3 lines were copy/pasted to datastructures and should be removed soon.
nostrain <- is.na(colData(hs_macrophage)[["strainid"]])
colData(hs_macrophage)[nostrain, "strainid"] <- "none"
colData(hs_macrophage)[["strain_zymo"]] <- paste0("s", colData(hs_macrophage)[["strainid"]],
"_", colData(hs_macrophage)[["macrophagezymodeme"]])
uninfected <- colData(hs_macrophage)[["strain_zymo"]] == "snone_none"
colData(hs_macrophage)[uninfected, "strain_zymo"] <- "uninfected"
data_structures <- c(data_structures, "hs_macrophage")Finally, split off the U937 samples.
In the previous block, we used a new invocation of ensembl-derived annotation data, this time we can just use our existing parasite gene annotations.
lp_macrophage <- create_se(macrophage_sheet, file_column = "lpanamensisv36hisatfile",
gene_info = lp_annotations,
savefile = glue("rda/lp_macrophage-v{ver}.rda"),
annotation = "org.Lpanamensis.MHOMCOL81L13.v46.eg.db") %>%
set_conditions(fact = "macrophagezymodeme") %>%
set_batches(fact = "macrophagetreatment")## Reading the sample metadata.
## Did not find the column: sampleid.
## Setting the ID column to the first column.
## Did not find the condition column in the sample sheet.
## Filling it in as undefined.
## Did not find the batch column in the sample sheet.
## Filling it in as undefined.
## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## The sample definitions comprises: 69 rows(samples) and 80 columns(metadata fields).
## Warning in create_se(macrophage_sheet, file_column = "lpanamensisv36hisatfile",
## : Some samples were removed when cross referencing the samples against the
## count data.
## Matched 8778 annotations and counts.
## The final summarized experiment has 8778 rows and 80 columns.
## The numbers of samples by condition are:
##
## none z2.2 z2.3
## 8 29 29
## The number of samples by batch are:
##
## inf inf_sb uninf uninf_sb
## 29 29 4 4
##unfilt_written <- write_se(
## lp_macrophage,
## excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_unfiltered-v{ver}.xlsx"))
lp_macrophage_filt <- subset_se(lp_macrophage, nonzero = 2500)## The samples (and read coverage) removed when filtering 2500 non-zero genes are:
## TMRC30066 TMRC30117 TMRC30244 TMRC30246 TMRC30266 TMRC30268 TMRC30326 TMRC30323
## 3080 1147 1662 2834 822 3444 375 84
## TMRC30319 TMRC30325 TMRC30327 TMRC30312 TMRC30304 TMRC30313 TMRC30309 TMRC30330
## 374 356 129 76 289 96 188 181
## TMRC30066 TMRC30117 TMRC30244 TMRC30246 TMRC30266 TMRC30268 TMRC30326 TMRC30323
## 1890 888 1135 1796 649 1915 303 74
## TMRC30319 TMRC30325 TMRC30327 TMRC30312 TMRC30304 TMRC30313 TMRC30309 TMRC30330
## 270 279 123 76 207 84 166 135
## Samples removed: 1890, 888, 1135, 1796, 649, 1915, 303, 74, 270, 279, 123, 76, 207, 84, 166, 135
## semantic_filter(semantic = c("amastin", "gp63", "leishmanolysin"),
## semantic_column = "annot_gene_product")
data_structures <- c(data_structures, "lp_macrophage", "lp_macrophage_filt")
##filt_written <- write_se(lp_macrophage_filt,
## excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_reads_filtered-v{ver}.xlsx"))
lp_macrophage <- lp_macrophage_filt
lp_macrophage_nosb <- subset_se(lp_macrophage, subset = "batch!='inf_sb'")
##lp_nosb_write <- write_se(
## lp_macrophage_nosb,
## excel = glue("analyses/macrophage_de/{ver}/read_counts/lp_macrophage_nosb_reads-v{ver}.xlsx"))
data_structures <- c(data_structures, "lp_macrophage_nosb")Invoking gather_preprocessing_metadata() somewhat requires a full working tree of results from trimomatic/cutadapt/fastp, hisat/salmon/etc, …
lp_meta <- as.data.frame(colData(lp_macrophage))
lp_meta[["slvsreads_log"]] <- log10(lp_meta[["slvsreads"]])
inf_values <- is.infinite(lp_meta[["slvsreads_log"]])
lp_meta[inf_values, "slvsreads_log"] <- -10
color_vector <- as.character(color_choices[["strain"]])
names(color_vector) <- names(color_choices[["strain"]])
color_vector <- color_vector[c("z2.2", "z2.3", "unknown")]
names(color_vector) <- c("z2.2", "z2.3", "none")
sl_violin <- ggplot(lp_meta,
aes(x = .data[["condition"]], y = .data[["slvsreads_log"]],
fill = .data[["condition"]])) +
geom_violin() +
geom_point() +
scale_fill_manual(values = color_vector)
sl_violinI want to make an estimate of ploidy using transcriptomic data. This is by definition a foold’s errand, but I think it might work.
lp_rpkm <- normalize(lp_se, convert = "rpkm", filter = TRUE,
length_column = "cds_length", na_to_zero = TRUE)## Removing 149 low-count genes (8629 remaining).
## Exclude scaffolds
unwanted <- grepl(pattern = "SCAF", x = rowData(lp_rpkm)[["gene_location_text"]])
## I think my subset logic is bacwards...
lp_wanted <- lp_rpkm[!unwanted, ]
summary_df <- as.data.frame(assay(lp_wanted))
summary_df[["gene_mean"]] <- rowMeans(summary_df, na.rm = TRUE)
summary_df[["chromosome"]] <- rowData(lp_wanted)[["chromosome"]]
summary_df[["chromosome"]] <- as.factor(summary_df[["chromosome"]])
levels(summary_df[["chromosome"]]) <- c(seq_len(19), "20.1", "20.2", 21:35)
summary_df <- summary_df[, c("gene_mean", "chromosome")] %>%
group_by(chromosome) %>%
summarize(chr_mean = mean(gene_mean, na.rm = TRUE))
min_rpkm <- min(summary_df[["chr_mean"]])
summary_df[["chr_mean"]] <- summary_df[["chr_mean"]] / min_rpkm
ggplot(summary_df, aes(y = chromosome, x = chr_mean)) +
geom_col()wanted <- colData(lp_wanted)[["knnv2classification"]] == "z22" |
colData(lp_wanted)[["knnv2classification"]] == "z23"
lp_z <- lp_wanted[, wanted]
z22_samples <- colData(lp_z)[["knnv2classification"]] == "z22"
z23_samples <- colData(lp_z)[["knnv2classification"]] == "z23"
lp_z_assay <- as.data.frame(assay(lp_z))
lp_z_assay[["z22_gene_mean"]] <- rowMeans(lp_z_assay[, z22_samples], na.rm = TRUE)
lp_z_assay[["z23_gene_mean"]] <- rowMeans(lp_z_assay[, z23_samples], na.rm = TRUE)
lp_z_assay[["chromosome"]] <- rowData(lp_z)[["chromosome"]]
lp_z_means <- lp_z_assay[, c("z22_gene_mean", "z23_gene_mean", "chromosome")] %>%
group_by(chromosome) %>%
summarize(z22_mean = mean(z22_gene_mean, na.rm = TRUE),
z23_mean = mean(z23_gene_mean, na.rm = TRUE))
chr_reshaped <- reshape2::melt(lp_z_means, id.vars = "chromosome")
chr_reshaped[["chromosome"]] <- factor(chr_reshaped[["chromosome"]],
levels = c(as.character(1:19), "20.1", "20.2",
as.character(21:35)))
putative_aneuploid <- ggplot(data = chr_reshaped, aes(x = value, y = chromosome, fill = variable)) +
geom_bar(position = "dodge", stat = "identity")
pp(file = "images/putative_aneuploid.svg")
putative_aneuploid
dev.off()## png
## 2
found_idx <- data_structures %in% ls()
if (sum(!found_idx) > 0) {
not_found <- data_structures[!found_idx]
warning("Some datastructures were not generated: ", toString(not_found), ".")
data_structures <- data_structures[found_idx]
}## Warning: Some datastructures were not generated: meta.
## Warning: Your system is mis-configured: '/etc/localtime' is not a symlink
## Warning: It is strongly recommended to set envionment variable TZ to
## 'America/New_York' (or equivalent)
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
locale: C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: BSGenome.Leishmania.panamensis.MHOMCOL81L13.v68(v.2026.03), BSgenome(v.1.78.0), rtracklayer(v.1.70.1), BiocIO(v.1.20.0), Biostrings(v.2.78.0), XVector(v.0.50.0), GenomicRanges(v.1.62.1), Seqinfo(v.1.0.0), org.Lpanamensis.MHOMCOL81L13.v68.eg.db(v.2024.09), AnnotationDbi(v.1.72.0), IRanges(v.2.44.0), S4Vectors(v.0.48.0), Biobase(v.2.70.0), hpgltools(v.2026.03), Heatplus(v.3.18.0), ggplot2(v.4.0.2), glue(v.1.8.0), dplyr(v.1.2.0), BiocGenerics(v.0.56.0) and generics(v.0.1.4)
loaded via a namespace (and not attached): ggstatsplot(v.0.13.5), splines(v.4.5.0), later(v.1.4.8), prismatic(v.1.1.2), bitops(v.1.0-9), filelock(v.1.0.3), tibble(v.3.3.1), R.oo(v.1.27.1), datawizard(v.1.3.0), XML(v.3.99-0.23), lifecycle(v.1.0.5), httr2(v.1.2.2), edgeR(v.4.8.2), lattice(v.0.22-7), vroom(v.1.7.1), insight(v.1.4.6), backports(v.1.5.0), magrittr(v.2.0.4), limma(v.3.66.0), openxlsx(v.4.2.8.1), plotly(v.4.12.0), sass(v.0.4.10), rmarkdown(v.2.31), jquerylib(v.0.1.4), yaml(v.2.3.12), httpuv(v.1.6.17), otel(v.0.2.0), zip(v.2.3.3), pbapply(v.1.7-4), cowplot(v.1.2.0), DBI(v.1.3.0), RColorBrewer(v.1.1-3), abind(v.1.4-8), purrr(v.1.2.1), R.utils(v.2.13.0), RCurl(v.1.98-1.18), yulab.utils(v.0.2.4), rappdirs(v.0.3.4), ggrepel(v.0.9.8), correlation(v.0.8.8), MatrixModels(v.0.5-4), codetools(v.0.2-20), DelayedArray(v.0.36.1), DOSE(v.4.4.0), xml2(v.1.5.2), tidyselect(v.1.2.1), farver(v.2.1.2), effectsize(v.1.0.2), matrixStats(v.1.5.0), BiocFileCache(v.3.0.0), GenomicAlignments(v.1.46.0), jsonlite(v.2.0.0), ggsankey(v.0.0.99999), iterators(v.1.0.14), foreach(v.1.5.2), tools(v.4.5.0), progress(v.1.2.3), Rcpp(v.1.1.1), SparseArray(v.1.10.10), xfun(v.0.57), qvalue(v.2.42.0), MatrixGenerics(v.1.22.0), withr(v.3.0.2), fastmap(v.1.2.0), boot(v.1.3-31), digest(v.0.6.39), R6(v.2.6.1), mime(v.0.13), GO.db(v.3.22.0), dichromat(v.2.0-0.1), biomaRt(v.2.66.2), RSQLite(v.2.4.6), cigarillo(v.1.0.0), R.methodsS3(v.1.8.2), tidyr(v.1.3.2), data.table(v.1.18.2.1), prettyunits(v.1.2.0), httr(v.1.4.8), htmlwidgets(v.1.6.4), S4Arrays(v.1.10.1), parameters(v.0.28.3), restez(v.2.1.5), pkgconfig(v.2.0.3), gtable(v.0.3.6), blob(v.1.3.0), statsExpressions(v.1.7.3), S7(v.0.2.1), BayesFactor(v.0.9.12-4.8), htmltools(v.0.5.9), fgsea(v.1.36.2), scales(v.1.4.0), png(v.0.1-9), knitr(v.1.51), tzdb(v.0.5.0), reshape2(v.1.4.5), rjson(v.0.2.23), coda(v.0.19-4.1), curl(v.7.0.0), cachem(v.1.1.0), stringr(v.1.6.0), parallel(v.4.5.0), restfulr(v.0.0.16), pillar(v.1.11.1), grid(v.4.5.0), vctrs(v.0.7.2), promises(v.1.5.0), dbplyr(v.2.5.2), xtable(v.1.8-8), paletteer(v.1.7.0), evaluate(v.1.0.5), readr(v.2.2.0), zeallot(v.0.2.0), GenomicFeatures(v.1.62.0), locfit(v.1.5-9.12), mvtnorm(v.1.3-6), cli(v.3.6.5), compiler(v.4.5.0), Rsamtools(v.2.26.0), rlang(v.1.1.7), crayon(v.1.5.3), rstantools(v.2.6.0), labeling(v.0.4.3), rematch2(v.2.1.2), plyr(v.1.8.9), fs(v.2.0.1), pander(v.0.6.6), stringi(v.1.8.7), viridisLite(v.0.4.3), BiocParallel(v.1.44.0), lazyeval(v.0.2.2), bayestestR(v.0.17.0), GOSemSim(v.2.36.0), Matrix(v.1.7-3), hms(v.1.1.4), patchwork(v.1.3.2), bit64(v.4.6.0-1), statmod(v.1.5.1), KEGGREST(v.1.50.0), varhandle(v.2.0.6), shiny(v.1.13.0), SummarizedExperiment(v.1.40.0), broom(v.1.0.12), memoise(v.2.0.1), RcppParallel(v.5.1.11-2), bslib(v.0.10.0), fastmatch(v.1.1-8) and bit(v.4.6.0)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 03a4e43defdc53e7038116087cb006b05404d424
## This is hpgltools commit: Tue Apr 7 15:44:04 2026 -0400: 03a4e43defdc53e7038116087cb006b05404d424
## Saving to 01datasets.rda.xz