Before we begin, a couple of parameters which have given me grief.

## Used by the various functions which cross reference grange data
## The SEs used in this document are getting this from the orgdb
## which includes this information in multiple columns with different
## chromosome ID prefixes.  E.g. sometimes it is just 1,2,3, ... other times
## it is LpaL1, LpaL2, LpaL3, ...
exp_chr_col <- "sequence_id"
## The tritrypdb also puts the start/stop/strand information in multiple places
exp_start_col <- "coding_start"
exp_end_col <- "coding_end"
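
As a minimal sketch of why the chromosome column matters, the following base R snippet strips a prefix so that the two naming styles mentioned above can be compared. The prefix LpaL comes from the comment above; the helper strip_chr_prefix() and the toy vectors are hypothetical and are not part of any package used in this document.

```r
## Hypothetical helper: remove a leading prefix such as "LpaL" so that
## "LpaL1" and "1" can be treated as the same chromosome.
strip_chr_prefix <- function(ids, prefix = "LpaL") {
  sub(paste0("^", prefix), "", ids)
}

## Toy IDs written in the two styles (not taken from the real annotation).
prefixed <- c("LpaL1", "LpaL2", "LpaL13")
bare <- c("1", "2", "13")
stripped <- strip_chr_prefix(prefixed)
all_match <- identical(stripped, bare)
```

IDs that already lack the prefix pass through unchanged, so the helper can be applied to either style.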

1 Introduction

This document will visualize the TMRC2 samples before completing the various differential expression and variant analyses in the hopes of getting an understanding of how the various samples relate to each other.

1.1 Initial library size

Start off with the library sizes of the original dataset. The main thing to note is that we have quite a large variance in coverage. A few of these samples are highly likely to be removed shortly (looking at you, TMRC20001 and TMRC20095).

libsizes <- plot_libsize(lp_se)
## Warning in fortify(data, ...): Arguments in `...` must be used.
## x Problematic argument:
## * colour = colors
## i Did you misspell an argument name?
libsizes
## Library sizes of 92 samples, 
## ranging from 564,812 to 1.37e+08.

dev <- pp("images/lp_se_libsizes.png", width = 18, height = 9)
libsizes$plot
closed <- dev.off()

Library sizes of the protein coding gene counts observed per sample. The samples were mapped against EuPathDB revision 36 of the Leishmania (Viannia) panamensis strain MHOM/COL/81L13 genome; the alignments were sorted, indexed, and counted via htseq using the gene features, and non-protein coding features were excluded. The per-sample sums of the remaining matrix were plotted to check that the relative sample coverage is sufficient and not too divergent across samples. Bars are colored according to strain/zymodeme annotation: zymodeme 2.3 is red; zymodeme 2.2 is blue; the Leishmania braziliensis-like strains b2904, z1.0, and z1.5 are purple; z2.4, the zymodeme most similar to 2.3, is light brown; and the zymodemes most similar to 2.2 (z3.0, z2.0, z2.1, and z3.2) are light gray, dark gray, dark brown, and gray respectively.
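
The per-sample sums described in this legend amount to column sums of the count matrix. A minimal sketch in base R, using a made-up three-gene matrix rather than the real data:

```r
## Toy count matrix: rows are genes, columns are samples (made-up values).
counts <- matrix(c(10, 0, 5,
                   200, 30, 70,
                   1, 0, 0),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2", "s3")))
## Library size is the per-sample (column) sum of counts.
lib_sizes <- colSums(counts)
## Ratio of the largest to the smallest library, a quick divergence check.
size_ratio <- max(lib_sizes) / min(lib_sizes)
```

A large ratio between the biggest and smallest library is the kind of divergence the plot is meant to expose.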

1.2 Non-zero genes with respect to coverage

This plot is usually our primary arbiter for sample removal based on coverage. We pick a semi-arbitrary cutoff based on both coverage and genes observed. In this instance 8,600 genes seems likely?

The cutoff argument prints out samples with gene coverage below that proportion. I think we already dropped the most problematic samples in the sample sheet, so it may not actually print anything.
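
To make the idea of a proportion cutoff concrete, here is a rough sketch in base R on a made-up matrix. It is not how plot_nonzero() is implemented internally; it only illustrates counting observed genes per sample and flagging samples below a proportion.

```r
## Toy count matrix: 4 genes by 3 samples (made-up values).
counts <- matrix(c(5, 0, 3, 9,
                   0, 0, 0, 2,
                   7, 1, 4, 6),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))
## Number of genes with at least one read in each sample.
nonzero_genes <- colSums(counts > 0)
## Proportion of all genes observed per sample.
nonzero_prop <- nonzero_genes / nrow(counts)
## Samples falling below the proportion cutoff (0.7, as in the call below).
cutoff <- 0.7
below_cutoff <- names(nonzero_prop)[nonzero_prop < cutoff]
```

In this toy case only s2 is flagged, since it has reads for a single gene out of four.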

## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_se, cutoff = 0.7, y_intercept = 0.99)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
## i The deprecated feature was likely used in the hpgltools package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
nonzero
## A non-zero genes plot of 92 samples.
## These samples have an average 28.78 CPM coverage and 8694 genes observed, ranging from 8554 to
## 8749.
## Warning: ggrepel: 76 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

dev <- pp(file = "images/lp_nonzero.png", width = 9, height = 9)
nonzero$plot
## Warning: ggrepel: 76 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
closed <- dev.off()

Differences in relative gene content with respect to sequencing coverage. The per-sample number of observed genes was plotted with respect to the relative CPM coverage in order to check that the samples are sufficiently and similarly diverse. Many samples were observed near or at the putative asymptote of likely gene content; no samples were observed with fewer than 65% of the Leishmania panamensis genes included. Note that the range of genes observed is quite small, 8500 <= x < 8700 genes; however, this was plotted after already excluding samples with fewer than 8500 genes observed (of which there were 2) and any samples with fewer than 5 million protein coding mapped reads (there were 2 samples that had more than 8500 genes observed in fewer than 5 million reads).

lp_box <- plot_boxplot(lp_se)
## 7722 entries are 0.  We are on a log scale, adding 1 to the data.
dev <- pp(file = "images/lp_se_boxplot.png", width = 16, height = 9)
lp_box
closed <- dev.off()
lp_box

The distribution of observed counts per gene for all samples was plotted as a boxplot on the log2 scale (it looks like log10, but I checked). In contrast to the host transcriptome distribution, the parasite distribution of reads/gene is log-normal.
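
The message about adding 1 to the data refers to a pseudocount applied before taking logs so that zero counts stay finite. A tiny base R illustration with made-up counts, which also shows how different log2 and log10 values look for the same input:

```r
## Made-up counts including a zero.
counts <- c(0, 1, 3, 7, 1023)
## log2 with a pseudocount of 1 keeps zeros finite (log2(0 + 1) is 0).
log2_counts <- log2(counts + 1)
## The same data on a log10 scale, for comparison with the axis labels.
log10_counts <- log10(counts + 1)
## Five-number summary, the quantities a boxplot is drawn from.
summary_stats <- fivenum(log2_counts)
```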

filter_plot <- plot_libsize_prepost(lp_se)
## Warning in fortify(data, ...): Arguments in `...` must be used.
## x Problematic argument:
## * colour = colors
## i Did you misspell an argument name?
## Arguments in `...` must be used.
## x Problematic argument:
## * colour = colors
## i Did you misspell an argument name?
filter_plot$lowgene_plot
## Warning: Using alpha for a discrete variable is not advised.

filter_plot$count_plot

The number of genes removed by low-count filtering is drastically lower in parasite samples than in human samples. Thus, even though the range of coverage for the parasite samples is from near 0 to ~ 150 CPM, the number of genes removed by the default low-count filter ranges only from 40 to 129, and the number of reads associated with them ranges only from 100 to 3168.
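
For readers unfamiliar with low-count filtering, the following is a simplified sketch in base R. The rule used here (drop genes whose total count across samples is below a threshold) is an assumption chosen for illustration; it is not necessarily the default filter applied by the functions in this document, and the matrix is made up.

```r
## Toy count matrix: 4 genes by 3 samples (made-up values).
counts <- matrix(c(0, 1, 0,
                   50, 60, 40,
                   2, 0, 1,
                   300, 280, 310),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))
## Assumed rule: keep genes whose total count across samples is >= threshold.
threshold <- 10
keep <- rowSums(counts) >= threshold
filtered <- counts[keep, , drop = FALSE]
## Genes removed, and the reads per sample that went with them.
genes_removed <- sum(!keep)
reads_removed <- colSums(counts[!keep, , drop = FALSE])
```

Comparing genes_removed and reads_removed before and after filtering is the same bookkeeping the pre/post plots summarize.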

table(colData(lp_se)[["zymodemecategorical"]])
## 
## z21 z22 z23 z24 
##   7  42  41   2
table(colData(lp_se)[["clinicalresponse"]])
## 
##    cure failure      nd 
##      40      34      18

2 Transcriptome visualizations

2.1 Distribution Visualizations

Najib’s favorite plots are of course the PCA/t-SNE. These are nice to look at in order to get a sense of the relationships between samples. They also provide a good opportunity to see what happens when one applies different normalizations, surrogate analyses, filters, etc. In addition, one may set different experimental factors as the primary ‘condition’ (usually the color of plots) and surrogate ‘batches’.

2.2 By Susceptibility

Take column ‘Q’ in the sample sheet and make a categorical version of it with these parameters:

  • 0 <= x <= 35 is resistant
  • 36 <= x <= 48 is ambiguous
  • 49 <= x is sensitive
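
The thresholds listed above can be sketched with cut() in base R. The helper name categorize_susceptibility() and the input vector are hypothetical; only the breakpoints come from the list.

```r
## Hypothetical helper: bin numeric susceptibility values into categories.
## Intervals are [0, 35], (35, 48], and (48, Inf), which matches the list
## for whole-number inputs; negative values become NA.
categorize_susceptibility <- function(x) {
  cut(x, breaks = c(0, 35, 48, Inf),
      labels = c("resistant", "ambiguous", "sensitive"),
      right = TRUE, include.lowest = TRUE)
}

## Made-up values sitting on each side of the boundaries.
values <- c(0, 35, 36, 48, 49, 100)
categories <- as.character(categorize_susceptibility(values))
```

Note that a fractional value such as 35.5 would land in the ambiguous bin under this sketch; the list above only defines the bins for whole numbers.
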
strain_norm <- normalize(lp_strain, norm = "quant", transform = "log2",
                         convert = "cpm", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
zymo_pca <- plot_pca(strain_norm, plot_title = "PCA of parasite expression values",
                     plot_labels = FALSE)
zymo_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.

dev <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24.pdf")
zymo_pca$plot
closed <- dev.off()

lp_strain_known <- subset_se(lp_strain, subset = "clinicalcategorical!='unknown'")
strain_known_norm <- normalize(lp_strain_known, norm = "quant", transform = "log2",
                               convert = "cpm", filter = TRUE)
## Removing 154 low-count genes (8624 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
zymo_known_pca <- plot_pca(strain_known_norm, plot_title = "PCA of parasite expression values",
                           plot_labels = FALSE)
zymo_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.

dev <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24_only_known_clinical.pdf")
zymo_known_pca$plot
closed <- dev.off()

2.3 Limit to three strains: 2.1/2.2/2.3

only_three_types <- subset_se(lp_strain,
                              subset = "condition=='z2.1'|condition=='z2.3'|condition=='z2.2'")
only_three_norm <- normalize(only_three_types, norm = "quant", transform = "log2",
                             convert = "cpm", batch = FALSE, filter = TRUE) %>%
  set_batches(fact = "phase")
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
## The number of samples by batch are:
## 
## Stationary 
##         90
onlythree_pca <- plot_pca(only_three_norm, plot_labels = FALSE,
                          plot_title = "PCA of z2.1, z2.2 and z2.3 parasite expression values")
pp(file = "images/promastigote_threetypes_zymocol_noshape.png")
onlythree_pca$plot
dev.off()
## png 
##   2
onlythree_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3
## Shapes are defined by Stationary.

2.4 By my ML knn classifier!

I added the result from my kmer classifier to the sample sheet; let us see how that looks.

lp_strain_knn <- set_conditions(lp_strain, fact = "knnv2classification")
## The numbers of samples by condition are:
## 
## unknown     z21     z22     z23     z24 
##       1       5      43      41       2
strain_norm_knn <- normalize(lp_strain_knn, norm = "quant", transform = "log2",
                             convert = "cpm", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
zymo_pca_knn <- plot_pca(strain_norm_knn, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
dev <- pp(file = "images/promastigote_zymocol_sensshape_knnv2.png")
zymo_pca_knn$plot
closed <- dev.off()
zymo_pca_knn
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by unknown, z21, z22, z23, z24
## Shapes are defined by resistant, sensitive.

strain_nobatch <- set_batches(strain_norm, fact = "sourcelab")
## The number of samples by batch are:
## 
## MAG 
##  92
zymo_pcav2 <- plot_pca(strain_nobatch, plot_title = "PCA of parasite expression values",
                       plot_labels = FALSE)
dev <- pp(file = "images/promastigote_zymocol_nobatch.png")
zymo_pcav2$plot
closed <- dev.off()
zymo_pcav2
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by MAG.

strain_nb <- normalize(lp_strain, convert = "cpm", transform = "log2",
                       filter = TRUE, batch = "svaseq")
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 541 values less than 0.
## Warning in transform_counts(count_table, method = transform, design = design, :
## NaNs produced
## Setting 4663 entries to zero.
strain_nb_pca <- plot_pca(strain_nb, plot_title = "PCA of parasite expression values",
                          plot_labels = FALSE)
dev <- pp(file = "images/clinical_nb_pca_sus_shape.png")
strain_nb_pca$plot
closed <- dev.off()
strain_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.

2.4.1 Silly plotly

plotly_plot <- plotly::ggplotly(zymo_pca_knn$plot)
print(plotly_plot)

Add explicit labels for a few reference strains:

  • TMRC20023: Excluded due to coverage (only 7k reads)
  • TMRC20006: This one has 19,815,673 reads, but a weirdly small number of genes and got excluded.
  • TMRC20029: This has 1,946,986 reads and so was excluded.
  • TMRC20034: Not sequenced

** NOTE ** These samples were all removed from examination in the sample_sheet in 202404 and so will not appear in this plot. Thus I am turning off the following block.

samples_to_label <- c("TMRC20023", "TMRC20006", "TMRC20029", "TMRC20007", "TMRC20034",
                      "TMRC20008", "TMRC20027", "TMRC20028", "TMRC20032", "TMRC20040")

label_entries <- zymo_pca$table[samples_to_label, ]
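zymo_pca$plot +
  ## Quoted names inside aes() are treated as constants, so look the
  ## columns up through the .data pronoun instead.
  geom_text(mapping = aes(x = .data[["PC1"]], y = .data[["PC2"]],
                          label = .data[["sampleid"]]),
            data = label_entries)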

Some likely text for a figure legend might include something like the following (paraphrased from Najib’s 2016 dual transcriptome profiling paper (10.1128/mBio.00027-16)):

Expression profiles of the promastigote samples across multiple strains. Each glyph represents one sample; colors delineate the various strains, which fall into two primary clades. Red samples are zymodeme 2.3, blue samples are zymodeme 2.2. The difference between these two primary groups makes up approximately 17% of the variance in the PCA. Purple samples are Leishmania braziliensis or zymodeme 1.0/1.5 samples, orange are z2.4, browns and greys are z2.1, z2.0, z3.0, and z3.2 respectively. This analysis was performed following a low-count filter, cpm conversion, quantile normalization, and a log2 transformation. No batch factor was used, nor was a surrogate variable estimation performed.
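
The processing steps named in this legend (cpm conversion, quantile normalization, log2 transformation, then PCA) can be sketched in base R as follows. This is an illustrative approximation on synthetic counts; the helper quantile_normalize() is hypothetical, and the order and details of the steps inside normalize() may differ.

```r
set.seed(1)
## Synthetic counts: 200 genes by 6 samples (not real data).
counts <- matrix(rpois(200 * 6, lambda = 50), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

## Counts per million: scale each column by its library size.
cpm <- t(t(counts) / colSums(counts)) * 1e6

## Simple quantile normalization: every sample receives the same
## distribution (the mean of the sorted columns), assigned back by rank.
quantile_normalize <- function(mat) {
  ranks <- apply(mat, 2, rank, ties.method = "first")
  sorted_means <- rowMeans(apply(mat, 2, sort))
  out <- apply(ranks, 2, function(r) sorted_means[r])
  dimnames(out) <- dimnames(mat)
  out
}
qn <- quantile_normalize(cpm)

## log2 with a pseudocount of 1, then PCA with samples as rows.
log_qn <- log2(qn + 1)
pca <- prcomp(t(log_qn), center = TRUE, scale. = FALSE)
## Percent of variance carried by each component.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
```

The first entry of pct_var is the analogue of the "approximately 17%" figure quoted in the legend, though the synthetic data here has no real group structure.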

Some interpretation for this figure might include:

When PCA was performed on the promastigote samples, the dominant component (which still explains a relatively small amount of variance) coincided with the two primary strain groups, zymodeme 2.2 and 2.3. With the exception of some Leishmania braziliensis samples, all promastigote samples assayed fell into one of these two categories.

When surrogate variable estimation was performed on the entire set of samples, it increased the apparent strain-dependent variance, but had some potentially problematic effects for a couple of samples (one z2.3 sample now lies with the z2.2 samples). It is assumed that this is because sva attempted to estimate surrogate values for the less-represented strains, with some unintended consequences for sample TMRC20095 (which, along with TMRC20008, is one of the two least covered samples by a significant margin). This hypothesis may be tested by excluding the braziliensis and non-z2.2/2.3 samples and repeating; when this is performed later in the document, the difference between the two primary clades increases to 49.33% of the variance and there are no odd samples.

zymo_tsne <- plot_tsne(strain_norm, plot_title = "TSNE of parasite expression values")
zymo_tsne
## The result of performing a tsne dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.

strain_nb_tsne <- plot_tsne(strain_nb, plot_title = "TSNE of parasite expression values")
strain_nb_tsne
## The result of performing a tsne dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.

corheat <- plot_corheat(strain_norm,
                        plot_title = "Correlation heatmap of parasite expression values")
corheat
## A heatmap of pairwise sample correlations ranging from: 
## 0.642203267746696 to 0.992867624966404.

disheat <- plot_disheat(strain_norm,
                        plot_title = "Distance heatmap of parasite expression values")
disheat$plot

plot_sm(strain_norm)
## When the standard median metric was plotted, the values observed range
## from 0.642203267746696 to 1 with quartiles at 0.932012660770469 and 0.944521289249787.

Potential start for a figure legend:

Global relationships among the promastigote transcriptional profiles. Pairwise Pearson correlations and Euclidean distances were calculated using the normalized expression matrices. Colors along the top row delineate the experimental conditions (same colors as the PCA). Samples were clustered by nearest neighbor clustering and each colored tile describes one correlation value between two samples (red to white delineates Pearson correlation values of the 8,710 normalized gene values between two samples ranging from <= 0.7 to >= 1.0) or the Euclidean distance between two samples (dark blue to white delineates identical to a normalized Euclidean distance of >= 110).
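
The two pairwise metrics named in this legend are available in base R as cor() and dist(). A minimal sketch on a synthetic expression matrix (random values, not the real data):

```r
set.seed(2)
## Synthetic normalized expression: 50 genes by 4 samples.
expr <- matrix(rnorm(50 * 4), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:4)))
## Pairwise Pearson correlation between samples (columns).
sample_cor <- cor(expr, method = "pearson")
## Euclidean distance between samples; dist() works on rows, so transpose.
sample_dist <- as.matrix(dist(t(expr), method = "euclidean"))
```

Both results are symmetric sample-by-sample matrices, which is the shape a heatmap function expects.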

Some interpretation for this figure might include:

When the global relationships among the samples were distilled down to individual Euclidean distances or Pearson correlation coefficients between pairs of samples, the primary clustering among samples observed was according to strain. The primary significant outlier sample (TMRC20095) is explicitly due to low coverage. The other outlier strains are either braziliensis (purple) or a series of strains which, when viewed in IGV, appear to have genetic variants which bridge the differences between the two primary zymodemes, particularly on the known aneuploid chromosomes.

2.5 Limit to just two strains: 2.2/2.3

lp_two_strains_norm <- normalize(lp_zymo, norm = "quant", transform = "log2",
                                 convert = "cpm", batch = FALSE, filter = TRUE)
## Removing 150 low-count genes (8628 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
onlytwo_pca <- plot_pca(lp_two_strains_norm, plot_title = "PCA of z2.2 and z2.3 parasite expression values",
                        plot_labels = FALSE)
dev <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape.pdf")
onlytwo_pca$plot
closed <- dev.off()
onlytwo_pca$plot

lp_two_strains_known <- subset_se(lp_zymo, subset = "clinicalcategorical!='unknown'")
lp_two_strains_known_norm <- normalize(lp_two_strains_known, norm = "quant", transform = "log2",
                                       convert = "cpm", batch = FALSE, filter = TRUE)
## Removing 155 low-count genes (8623 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
onlytwo_known_pca <- plot_pca(lp_two_strains_known_norm, plot_labels = FALSE,
                              plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape_only_known.pdf")
onlytwo_known_pca$plot
closed <- dev.off()
onlytwo_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.2, z2.3
## Shapes are defined by undefined.

lp_two_strains_nb <- normalize(lp_zymo, transform = "log2", convert = "cpm",
                               batch = "svaseq", filter = TRUE)
## Removing 150 low-count genes (8628 remaining).
## transform_counts: Found 512 values less than 0.
## Warning in transform_counts(count_table, method = transform, design = design, :
## NaNs produced
## Setting 4217 entries to zero.
onlytwo_pca_nb <- plot_pca(lp_two_strains_nb, plot_labels = FALSE,
                           plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape_nb.pdf")
onlytwo_pca_nb$plot
closed <- dev.off()
onlytwo_pca_nb$plot

2.6 By Cure/Fail status

This is by far the most problematic comparison. I think the only interpretation of the following images is that the parasite has little effect on the likelihood that a person will successfully end treatment. There does appear to be some variance associated with cure/fail, but only in a few samples (visible in ~10 fail samples and perhaps ~8 cure samples when sva is applied to the data).

cf_norm <- normalize(lp_cf, convert = "cpm", transform = "log2",
                     norm = "quant", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
start_cf <- plot_pca(cf_norm, plot_title = "PCA of parasite expression values",
                     plot_labels = FALSE)
dev <- pp(file = "figures/cure_fail_sus_shape_all.pdf")
start_cf$plot
closed <- dev.off()
start_cf
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail, unknown
## Shapes are defined by resistant, sensitive.

lp_cf_known <- subset_se(lp_cf, subset = "clinicalcategorical!='unknown'")
cf_known_norm <- normalize(lp_cf_known, convert = "cpm", transform = "log2",
                           norm = "quant", filter = TRUE)
## Removing 154 low-count genes (8624 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
start_cf_known <- plot_pca(cf_known_norm, plot_title = "PCA of parasite expression values",
                           plot_labels = FALSE)
dev <- pp(file = "figures/cure_fail_sus_shape_known.pdf")
start_cf_known$plot
closed <- dev.off()
start_cf_known
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail
## Shapes are defined by resistant, sensitive.

only_two_cf <- set_conditions(lp_zymo, fact = "clinicalcategorical",
                              colors = color_choices[["cf"]]) %>%
  set_batches(fact = "sus_category_current")
## The numbers of samples by condition are:
## 
##    cure    fail unknown 
##      33      32      18
## Warning in set_se_colors(new_se, colors = colors): Colors for the following
## categories are not being used: notapplicable.
## The number of samples by batch are:
## 
## resistant sensitive 
##        44        39
only_two_cf_norm <- normalize(only_two_cf, norm = "quant", transform = "log2",
                              convert = "cpm", batch = FALSE, filter = TRUE)
## Removing 150 low-count genes (8628 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
only_two_cf_pca <- plot_pca(only_two_cf_norm, plot_labels = FALSE,
                            plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23.pdf")
only_two_cf_pca$plot
dev.off()
## png 
##   2
only_two_cf_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail, unknown
## Shapes are defined by resistant, sensitive.

only_two_cf_known <- subset_se(only_two_cf, subset = "condition!='unknown'")
only_two_cf_known_norm <- normalize(only_two_cf_known, norm = "quant", transform = "log2",
                                    convert = "cpm", batch = FALSE, filter = TRUE)
## Removing 155 low-count genes (8623 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
only_two_cf_known_pca <- plot_pca(only_two_cf_known_norm, plot_labels = FALSE,
                                  plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23_known.pdf")
only_two_cf_known_pca$plot
dev.off()
## png 
##   2
only_two_cf_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail
## Shapes are defined by resistant, sensitive.

cf_nb <- normalize(lp_cf, convert = "cpm", transform = "log2",
                   filter = TRUE, batch = "svaseq")
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 292 values less than 0.
## Warning in transform_counts(count_table, method = transform, design = design, :
## NaNs produced
## Setting 4218 entries to zero.
cf_nb_pca <- plot_pca(cf_nb, plot_title = "PCA of parasite expression values",
                      plot_labels = FALSE)
dev <- pp(file = "images/cf_sus_share_nb.png")
cf_nb_pca$plot
closed <- dev.off()
cf_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail, unknown
## Shapes are defined by resistant, sensitive.

cf_norm <- normalize(lp_cf, transform = "log2", convert = "cpm",
                     filter = TRUE, norm = "quant")
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
## Getting an error which really does not make sense, I ran it manually and it worked fine.
test <- pca_information(cf_norm, num_components = 6, plot_pcas = TRUE,
                        factors = c("clinicalcategorical", "zymodemecategorical",
                                    "pathogenstrain", "passagenumber"))
test$anova_p
##                           PC1    PC2    PC3      PC4       PC5     PC6
## clinicalcategorical 9.168e-02 0.4286 0.1710 0.185702 7.118e-01 0.18993
## zymodemecategorical 7.306e-29 0.2921 0.5239 0.373609 7.239e-01 0.86261
## pathogenstrain      2.624e-01 0.1768 0.1649 0.004742 3.411e-01 0.98471
## passagenumber       9.328e-01 0.2103 0.4129 0.001469 2.004e-14 0.02395
test$cor_heatmap
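
A table like anova_p can be approximated in base R by regressing each principal component's scores on a factor and taking the ANOVA p-value. The sketch below uses synthetic data with one made-up grouping; it is an assumption about the general approach, not a description of how pca_information() computes its values.

```r
set.seed(3)
## Synthetic expression: 100 genes, 12 samples in two made-up groups.
group <- factor(rep(c("a", "b"), each = 6))
expr <- matrix(rnorm(100 * 12), nrow = 100)
## Shift half the genes in group "b" so that PC1 tracks the grouping.
expr[1:50, group == "b"] <- expr[1:50, group == "b"] + 3

pca <- prcomp(t(expr), center = TRUE)
## One-way ANOVA p-value of each of the first 3 PCs against the factor.
anova_p <- sapply(1:3, function(i) {
  fit <- lm(pca$x[, i] ~ group)
  anova(fit)[["Pr(>F)"]][1]
})
names(anova_p) <- paste0("PC", 1:3)
```

A very small p-value for a component, as seen for zymodemecategorical on PC1 above, indicates that the component's scores differ strongly between the levels of that factor.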

2.7 By Current drug sensitivity assay data

We have two competing metrics of antimonial sensitivity: one historical and one current. In both cases there is a reasonable expectation that resistant strains tend to be zymodeme 2.3 and sensitive strains tend to be zymodeme 2.2. There appear to be more exceptions to this rule of thumb in the current data than in the historical data.
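
One simple way to count exceptions to this rule of thumb is a cross-tabulation of zymodeme against sensitivity call. The labels below are made up for eight hypothetical samples and do not reflect the real sample sheet; with the real data, the two vectors would come from the corresponding colData columns.

```r
## Made-up labels for eight hypothetical samples, for illustration only.
zymodeme <- c("z22", "z22", "z22", "z23", "z23", "z23", "z22", "z23")
sensitivity <- c("sensitive", "sensitive", "resistant", "resistant",
                 "resistant", "sensitive", "sensitive", "resistant")
## Contingency table of zymodeme by sensitivity call.
tab <- table(zymodeme, sensitivity)
## Samples that break the rule of thumb (z2.2 resistant or z2.3 sensitive).
exceptions <- tab["z22", "resistant"] + tab["z23", "sensitive"]
```

Running the same tabulation once per metric would put a number on the difference between the current and historical data.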

dim(assay(lp_susceptibility))
## [1] 8778   92
sus_norm <- normalize(lp_susceptibility, transform = "log2", convert = "cpm",
                      norm = "quant", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
sus_pca <- plot_pca(sus_norm, plot_title = "PCA of parasite expression values",
                    plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_pca.svg")
sus_pca[["plot"]]
closed <- dev.off()
dev <- pp(file = "figures/sus_norm_pca.pdf")
sus_pca[["plot"]]
closed <- dev.off()
sus_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail, unknown.

lp_susceptibility_known <- subset_se(lp_susceptibility, subset = "batch!='unknown'")
sus_known_norm <- normalize(lp_susceptibility_known, transform = "log2", convert = "cpm",
                            norm = "quant", filter = TRUE)
## Removing 154 low-count genes (8624 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
sus_known_pca <- plot_pca(sus_known_norm, plot_title = "PCA of parasite expression values",
                          plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_known_pca.pdf")
sus_known_pca[["plot"]]
closed <- dev.off()
sus_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail.

lp_sus_two <- subset_se(lp_susceptibility, subset = "zymodemecategorical!='z21'") %>%
  subset_se(subset = "zymodemecategorical!='z24'")
sus_two_norm <- normalize(lp_sus_two, transform = "log2", convert = "cpm",
                          norm = "quant", filter = TRUE)
## Removing 150 low-count genes (8628 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
sus_two_pca <- plot_pca(sus_two_norm, plot_title = "PCA of parasite expression values",
                        plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_two_pca.pdf")
sus_two_pca[["plot"]]
closed <- dev.off()
sus_two_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail, unknown.

lp_sus_two_known <- subset_se(lp_sus_two, subset = "clinicalcategorical!='unknown'")
sus_two_known_norm <- normalize(lp_sus_two_known, transform = "log2", convert = "cpm",
                                norm = "quant", filter = TRUE)
## Removing 155 low-count genes (8623 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
sus_two_known_pca <- plot_pca(sus_two_known_norm, plot_title = "PCA of parasite expression values",
                              plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_two_known_pca.pdf")
sus_two_known_pca[["plot"]]
closed <- dev.off()
sus_two_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail.

sus_nb <- normalize(lp_susceptibility, transform = "log2", convert = "cpm",
                    batch = "svaseq", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 563 values less than 0.
## Warning in transform_counts(count_table, method = transform, design = design, :
## NaNs produced
## Setting 4733 entries to zero.
sus_nb_pca <- plot_pca(sus_nb, plot_title = "PCA of parasite expression values",
                       plot_labels = FALSE)
dev <- pp(file = "images/sus_nb_pca.png")
sus_nb_pca[["plot"]]
closed <- dev.off()
sus_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail, unknown.

2.8 By Historical drug sensitivity assay data

sus_hist_norm <- normalize(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                           norm = "quant", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
sus_hist_pca <- plot_pca(sus_hist_norm, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
dev <- pp(file = "images/sus_hist_norm_pca.png")
sus_hist_pca[["plot"]]
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
closed <- dev.off()
sus_hist_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by ambiguous, resistant, sensitive, unknown
## Shapes are defined by cure, fail, unknown.
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure

sus_hist_nb <- normalize(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                         batch = "svaseq", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 298 values less than 0.
## Warning in transform_counts(count_table, method = transform, design = design, :
## NaNs produced
## Setting 4312 entries to zero.
sus_hist_nb_pca <- plot_pca(sus_hist_nb, plot_title = "PCA of parasite expression values",
                            plot_labels = FALSE)
dev <- pp(file = "images/sus_hist_nb_pca.png")
sus_hist_nb_pca[["plot"]]
closed <- dev.off()
sus_hist_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by ambiguous, resistant, sensitive, unknown
## Shapes are defined by cure, fail, unknown.

2.9 Zymodeme enzyme gene IDs

Najib read me an email listing the gene names associated with the zymodeme classification. I took those names, cross referenced them against the Leishmania panamensis gene annotations, and found the following:

  1. ALAT: LPAL13_120010900 – alanine aminotransferase
  2. ASAT: LPAL13_340013000 – aspartate aminotransferase
  3. G6PD: LPAL13_000054100 – glucose-6-phosphate 1-dehydrogenase
  4. NH: LPAL13_140006100, LPAL13_180018500 – inosine-guanine nucleoside hydrolase
  5. MPI: LPAL13_320022300 (maybe) – mannose phosphate isomerase (I chose phosphomannose isomerase)

Given these 6 gene IDs (NH has two gene IDs associated with it), I can look for specific differences among the various samples.

2.9.1 Expression levels of zymodeme genes

The following creates a colorspace (red to green) heatmap showing the observed expression of these genes in every sample.

my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
              "other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")

zymo_se <- exclude_genes(strain_norm, ids = my_genes, method = "keep")
## Note, I renamed this to subset_genes().
## subset_genes(), before removal, there were 8629 genes, now there are 6.
## There are 92 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037 TMRC20038 
##   0.08587   0.08454   0.08368   0.08346   0.08132   0.08408   0.08142   0.08294 
## TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010 TMRC20016 TMRC20011 
##   0.08342   0.08390   0.08245   0.08428   0.08310   0.08372   0.08304   0.08288 
## TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 TMRC20019 TMRC20070 TMRC20020 
##   0.08485   0.08515   0.08275   0.08332   0.08290   0.08304   0.08350   0.08154 
## TMRC20021 TMRC20022 TMRC20024 TMRC20036 TMRC20069 TMRC20033 TMRC20026 TMRC20031 
##   0.08139   0.08476   0.08158   0.08203   0.08201   0.08208   0.08690   0.08142 
## TMRC20076 TMRC20073 TMRC20055 TMRC20079 TMRC20071 TMRC20078 TMRC20094 TMRC20042 
##   0.08260   0.08427   0.08367   0.08462   0.08370   0.08320   0.08349   0.08360 
## TMRC20058 TMRC20072 TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056 TMRC20060 
##   0.08254   0.08334   0.08301   0.08181   0.08540   0.08423   0.08398   0.08254 
## TMRC20077 TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075 TMRC20051 
##   0.08337   0.08304   0.08185   0.08225   0.08206   0.08254   0.08315   0.08381 
## TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083 TMRC20054 
##   0.08196   0.08469   0.08361   0.08451   0.08162   0.08284   0.08379   0.08424 
## TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090 TMRC20044 TMRC20045 
##   0.08369   0.08478   0.08396   0.08296   0.08368   0.08111   0.08464   0.08318 
## TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096 TMRC20101 TMRC20092 TMRC20082 
##   0.08388   0.08252   0.08391   0.08428   0.08292   0.08302   0.08254   0.08219 
## TMRC20102 TMRC20099 TMRC20100 TMRC20091 TMRC20084 TMRC20087 TMRC20103 TMRC20104 
##   0.08278   0.08408   0.08265   0.08430   0.08253   0.08380   0.08376   0.08352 
## TMRC20086 TMRC20107 TMRC20081 TMRC20095 
##   0.08305   0.08097   0.08154   0.07737
zymo_heatmap <- plot_sample_heatmap(zymo_se, row_label = my_names)
zymo_heatmap
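
The row labels above are matched by position, so it is worth checking that each label lands on the intended gene and that every requested ID actually exists. The following is a minimal base-R sketch of that check; the matrix, IDs, and names are all invented for illustration, and it assumes nothing about how plot_sample_heatmap() orders its rows.

```r
## Toy expression matrix; gene IDs and values are made up for illustration.
set.seed(1)
toy_ids <- c("gene_b", "gene_c", "gene_a")
toy_expr <- matrix(rnorm(9), nrow = 3,
                   dimnames = list(toy_ids, paste0("s", 1:3)))

## IDs of interest and their display names; "gene_x" is absent from the matrix.
wanted_ids <- c("gene_a", "gene_b", "gene_x")
wanted_names <- c("A", "B", "X")

## Keep the wanted rows in whatever order the matrix stores them.
kept <- toy_expr[rownames(toy_expr) %in% wanted_ids, , drop = FALSE]
## Look up each kept row's label by ID rather than by position.
kept_labels <- wanted_names[match(rownames(kept), wanted_ids)]
## Report requested IDs which were not found.
missing_ids <- setdiff(wanted_ids, rownames(toy_expr))
```

In this toy case the kept rows come back as gene_b then gene_a, so a positional label vector would have swapped the two names.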

A recent suggestion raised the question of how our amastigote TMRC2 samples, which came from infecting a set of macrophages, relate to these promastigote samples.

So far we have kept these two experiments separate; now let us merge them.

tmrc2_macrophage_norm <- normalize(lp_macrophage, transform = "log2", convert = "cpm",
                                   norm = "quant", filter = TRUE)
## Removing 0 low-count genes (8778 remaining).
## transform_counts: Found 3577 values equal to 0, adding 1 to the matrix.
## Hey you, this annotation call should be made automatic for the container!
annotation(lp_se) <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
annotation(lp_macrophage) <- annotation(lp_se)
all_tmrc2 <- hpgltools:::combine_se(lp_se, lp_macrophage)

Before we can use the combined data, we must reconcile a few aspects of it; notably, we need to specify which samples are amastigotes and which are promastigotes.

all_nosb <- all_tmrc2
colData(all_nosb)[["stage"]] <- "promastigote"
na_idx <- is.na(colData(all_nosb)[["macrophagetreatment"]])
colData(all_nosb)[na_idx, "macrophagetreatment"] <- "undefined"
all_nosb <- subset_se(all_nosb, subset = "macrophagetreatment!='inf_sb'")
ama_idx <- colData(all_nosb)[["macrophagetreatment"]] == "inf"
colData(all_nosb)[ama_idx, "stage" ] <- "amastigote"

## Make sure that the zymodeme does not have the inf_ prefix.
zymodeme_char <- gsub(x = colData(all_nosb)[["condition"]], pattern = "^inf_", replacement = "")
colData(all_nosb)[["condition"]] <- zymodeme_char
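
The same relabeling logic can be tried on a plain data frame before trusting it on the real metadata. This is only a sketch: the sample sheet below is invented, and only the column names and the 'inf', 'inf_sb', and 'inf_' conventions are borrowed from the block above.

```r
## Invented sample sheet; column names mirror the ones used above.
meta <- data.frame(
  row.names = paste0("s", 1:5),
  macrophagetreatment = c(NA, "inf", "inf_sb", NA, "inf"),
  condition = c("z2.2", "inf_z2.3", "inf_z2.2", "z2.3", "inf_z2.2"),
  stringsAsFactors = FALSE)

## Everything starts as a promastigote; missing treatments get an explicit label.
meta[["stage"]] <- "promastigote"
meta[is.na(meta[["macrophagetreatment"]]), "macrophagetreatment"] <- "undefined"
## Drop the inf_sb samples, then flag the remaining infections as amastigotes.
meta <- meta[meta[["macrophagetreatment"]] != "inf_sb", ]
meta[meta[["macrophagetreatment"]] == "inf", "stage"] <- "amastigote"
## Strip the inf_ prefix so both stages share the same zymodeme labels.
meta[["condition"]] <- gsub(pattern = "^inf_", replacement = "", x = meta[["condition"]])
table(meta[["stage"]], meta[["condition"]])
```

Replacing the NA values before subsetting matters here: a comparison against NA returns NA, and subsetting a data frame with NA in the row index produces spurious all-NA rows.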

colData(all_nosb)[["batch"]] <- colData(all_nosb)[["stage"]]
all_nosb <- subset_se(all_nosb, subset = "condition!='none'")
all_norm <- normalize(all_nosb, convert = "cpm", norm = "quant",
                      transform = "log2", filter = TRUE)
## Removing 94 low-count genes (8684 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
pro_ama_pca <- plot_pca(all_norm)
## Potentially check over the experimental design, there appear to be missing values.
## Warning in plot_pca(data = mtrx, design = design, state = state, plot_colors =
## plot_colors, : There are NA values in the component data.  This can lead to
## weird plotting errors.
pro_ama_pca[["plot"]]

I think the above picture is the opposite of what we want for a DE analysis of this data: it uses the zymodeme as the condition and the stage as the batch, whereas we want to compare promastigotes against amastigotes.

two_nosb <- set_batches(all_nosb, fact = "condition") %>%
  set_conditions(fact = "stage") %>%
  subset_se(subset = "batch=='z2.2'|batch=='z2.3'")
## The number of samples by batch are:
## 
## z2.1 z2.2 z2.3 z2.4 
##    7   56   56    2
## The numbers of samples by condition are:
## 
##   amastigote promastigote 
##           29           92
two_norm <- normalize(two_nosb, convert = "cpm", norm = "quant",
                      transform = "log2", filter = TRUE)
## Removing 94 low-count genes (8684 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
pro_ama_two_pca <- plot_pca(two_norm)
## Potentially check over the experimental design, there appear to be missing values.
## Warning in plot_pca(data = mtrx, design = design, state = state, plot_colors =
## plot_colors, : There are NA values in the component data.  This can lead to
## weird plotting errors.
pro_ama_two_pca[["plot"]]

zy_stage_factor <- paste0(colData(two_nosb)[["batch"]], "_",
                          colData(two_nosb)[["stage"]])
colData(two_nosb)[["zystage"]] <- zy_stage_factor
zystage <- set_conditions(two_nosb, fact = "zystage")
## The numbers of samples by condition are:
## 
##   z2.2_amastigote z2.2_promastigote   z2.3_amastigote z2.3_promastigote 
##                14                42                15                41
zystage_norm <- normalize(zystage, filter = TRUE, norm = "quant",
                          convert = "cpm", transform = "log2")
## Removing 94 low-count genes (8684 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
plot_pca(zystage_norm)$plot
## Potentially check over the experimental design, there appear to be missing values.
## Warning in plot_pca(data = mtrx, design = design, state = state, plot_colors =
## plot_colors, : There are NA values in the component data.  This can lead to
## weird plotting errors.

zystage_keepers <- list(
  "z2322_ama" = c("z23_amastigote", "z22_amastigote"),
  "z2322_pro" = c("z23_promastigote", "z22_promastigote"),
  "proama_z23" = c("z23_amastigote", "z23_promastigote"),
  "proama_z22" = c("z22_amastigote", "z22_promastigote"))

zystage_de <- all_pairwise(zystage, filter = TRUE, model_batch = "svaseq",
                           model_fstring = "~ 0 + condition")
##   z2.2_amastigote z2.2_promastigote   z2.3_amastigote z2.3_promastigote 
##                14                42                15                41
## Removing 94 low-count genes (8684 remaining).
## Potentially check over the experimental design, there appear to be missing values.
## Warning in plot_pca(data = mtrx, design = design, state = state, plot_colors =
## plot_colors, : There are NA values in the component data.  This can lead to
## weird plotting errors.
## Potentially check over the experimental design, there appear to be missing values.
## Warning in plot_pca(data = mtrx, design = design, state = state, plot_colors =
## plot_colors, : There are NA values in the component data.  This can lead to
## weird plotting errors.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 12029 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Warning in createContrastL(objFlt$formula, objFlt$data, L): Contrasts with only
## a single non-zero term are already evaluated by default.
## conditions
##   z22_amastigote z22_promastigote   z23_amastigote z23_promastigote 
##               14               42               15               41
## conditions
##   z22_amastigote z22_promastigote   z23_amastigote z23_promastigote 
##               14               42               15               41
## conditions
##   z22_amastigote z22_promastigote   z23_amastigote z23_promastigote 
##               14               42               15               41

zystage_tables <- combine_de_tables(
  zystage_de, keepers = zystage_keepers,
  excel = glue("excel/zymodeme_stage_table-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
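
The keeper names above drop the dot ("z22" rather than "z2.2"), which matches the condition names printed during the pairwise step. A quick way to catch a keeper which refers to a nonexistent condition is sketched below. The sanitizing rule (remove literal dots) is my guess from the printed output rather than a documented behavior, and the 'typo' entry is a deliberately broken example.

```r
## Condition labels as they appear in the metadata.
raw_conditions <- c("z2.2_amastigote", "z2.2_promastigote",
                    "z2.3_amastigote", "z2.3_promastigote")
## Assumed sanitizing rule: remove literal dots.
clean_conditions <- gsub(pattern = "\\.", replacement = "", x = raw_conditions)

toy_keepers <- list(
  "z2322_ama" = c("z23_amastigote", "z22_amastigote"),
  "proama_z22" = c("z22_amastigote", "z22_promastigote"),
  "typo" = c("z2.3_amastigote", "z22_amastigote"))

## TRUE for contrasts where both members are known conditions.
keeper_ok <- vapply(toy_keepers,
                    function(pair) all(pair %in% clean_conditions),
                    logical(1))
keeper_ok
```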

3 Gene expression with respect to chromosome

I want to make a plot where the x-axis is the number of genes on a chromosome and the y-axis is the mean of the expression of those genes.

assay_by_chr_plot <- plot_assay_by_chromosome(lp_zymo, chromosome_column = "chromosome")
assay_by_chr_plot[["plot"]]
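
For reference, the summary behind such a plot can be sketched in base R. This is not the implementation of plot_assay_by_chromosome(); the data frame and its column names are invented, and it only shows the two per-chromosome quantities described above.

```r
## Invented per-gene table: one row per gene with its chromosome and mean expression.
set.seed(2)
toy_genes <- data.frame(
  chromosome = rep(c("chr1", "chr2", "chr3"), times = c(5, 3, 2)),
  mean_expr = runif(10, min = 1, max = 10))

## Number of genes and mean expression for each chromosome.
n_genes <- tapply(toy_genes[["mean_expr"]], toy_genes[["chromosome"]], length)
chr_mean <- tapply(toy_genes[["mean_expr"]], toy_genes[["chromosome"]], mean)
by_chr <- data.frame(chromosome = names(n_genes),
                     genes = as.integer(n_genes),
                     mean_expr = as.numeric(chr_mean))
by_chr
```

The resulting by_chr table has one row per chromosome, with 'genes' for the x-axis and 'mean_expr' for the y-axis.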

4 SNP profiles

One potentially interesting aspect of the variant data: it may be able to help us define the zymodeme state of previous, untested samples.

In order to test this, I am loading some of the 2016 data alongside the new TMRC2 data to see if they fit together.

This uses an older dataset which I am not sure we have permission to include in the container, so I am turning these blocks off for now.

old_se <- create_se("sample_sheets/tmrc2_samples_20191203.xlsx",
                        file_column = "tophat2file")

tt <- old_se$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
old_se$expressionset <- tt
rm(tt)

4.1 Create the SNP expressionset

One other important caveat: we have a group of new samples which have not yet been run through the variant search pipeline, so I need to remove them from consideration. Though it looks like they finished overnight…

In the non-containerized version of this document, the following block combines an older dataset with the current data.

both_norm <- normalize(new_snps_sufficient, transform = "log2", norm = "quant") %>%
  set_conditions(fact = "pathogenstrain")
## transform_counts: Found 79143354 values equal to 0, adding 1 to the matrix.
## The numbers of samples by condition are:
## 
##   10070   10750   10772   10977   11006   11024   11026   11028   11031   11045 
##       1       1       1       1       1       1       1       1       1       1 
##   11071   11075   11090   11108   11109 11126-I   11133   11134   11152    1131 
##       1       1       1       1       1       1       1       1       1       1 
##   12116   12166   12169 12218-I   12251   12309   12312   12355   12367   12371 
##       1       1       1       1       1       1       1       1       1       1 
##   12417   12444   12479   12535   12554   12556   12570   12578   12581   12588 
##       1       1       1       1       1       1       1       1       1       1 
##   13464   13473   13474   13582   13589   13595   13597   13625   13631   13703 
##       1       1       1       1       1       1       1       1       1       1 
##   13720   13740   13787   13794   13978   14016   14056   14096   14103   14111 
##       1       1       1       1       1       1       1       1       1       1 
##   14149    2122    2168    2173    2183    2198    2272    2330    2331    2411 
##       1       1       1       1       1       1       1       1       1       1 
##    2414    2423    2429    2439    2472    2482    2496    2500    3117    4700 
##       1       1       1       1       1       1       1       1       1       1 
##    4745    4810    4829    4830    4876    5986    6957    7011    7105    7158 
##       1       1       1       1       1       1       1       1       1       1 
##    8190 
##       1

The data structure ‘both_norm’ now contains our 2016 data along with the newer data collected since 2019.

4.2 Plot of SNP profiles for zymodemes

The following plot shows the SNP profiles of all samples (old and new) where the colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the previous samples (purple), or the various lab strains (pink etc).

new_variant_heatmap <- plot_disheat(new_snps_sufficient)
dev <- pp(file = "images/raw_snp_disheat.png", height = 12, width = 12)
new_variant_heatmap$plot
closed <- dev.off()
new_variant_heatmap$plot

The function get_snp_sets() takes the provided metadata factor (in this case ‘condition’) and looks for variants which are exclusive to each element in it. In this case, this is looking for differences between 2.2 and 2.3, as well as the set shared among them.

snp_sets <- get_snp_sets(new_snps_sufficient, factor = "condition")
## The samples represent the following categories:
## 
## z2.1 z2.2 z2.3 z2.4 
##    7   42   40    2
## Using a proportion of observed variants, converting the data to binary observations.
## The factor z2.1 has 7 rows.
## The factor z2.2 has 42 rows.
## The factor z2.3 has 40 rows.
## The factor z2.4 has 2 rows.
## Finished iterating over the chromosomes.
snp_sets
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: condition.  4
## categories and 927126 variants were observed with 15
## combinations among them.  725 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.114678899082569.
##Biobase::annotation(old_se$expressionset) = Biobase::annotation(lp_se$expressionset)
##both_se <- combine_ses(lp_se, old_se)
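
To make the idea of 'exclusive' variants concrete, here is a toy base-R sketch. It is not the get_snp_sets() implementation: the call matrix is invented, and the rule that a group has a variant when more than half of its samples carry it is an assumption standing in for whatever proportion the real function uses.

```r
## Rows are variants, columns are samples; 1 means the variant was observed.
toy_calls <- matrix(
  c(1, 1, 0, 0,
    0, 0, 1, 1,
    1, 1, 1, 1,
    1, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("var", 1:4), paste0("s", 1:4)))
toy_groups <- c("z2.2", "z2.2", "z2.3", "z2.3")

## Proportion of samples in each group which carry each variant.
group_prop <- sapply(split(seq_along(toy_groups), toy_groups),
                     function(idx) rowMeans(toy_calls[, idx, drop = FALSE]))
## Assumed rule: a group has a variant when more than half of its samples do.
group_has <- group_prop > 0.5
## Encode membership as a binary string with one digit per group.
membership <- apply(group_has, 1, function(x) paste(as.integer(x), collapse = ""))
split(names(membership), membership)
```

Here "10" collects variants seen only in the first group, "01" only in the second, and "11" the shared set.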

snp_genes <- snps_vs_genes(lp_se, snp_sets, chr_column = exp_chr_col,
                           start_column = exp_start_col, end_column = exp_end_col)
## The snp grange data has 927126 elements.
## The first few snp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000002, LPAL13_SCAF000003, LPAL13_SCAF000004, LPAL13_SCAF000005, LPAL13_SCAF000007
## The first few exp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000003, LPAL13_SCAF000010, LPAL13_SCAF000011, LPAL13_SCAF000017, LPAL13_SCAF000021
## There are 437555 overlapping variants and genes.
## I think we have some metrics here we can plot...
snp_subset <- snp_subset_genes(
  lp_se, new_snps_sufficient, start_column = exp_start_col, end_column = exp_end_col,
  exp_name_column = exp_chr_col,
  genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
            "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))
## subset_genes(), before removal, there were 927126 genes, now there are 85.
## There are 91 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037 TMRC20038 
## 0.0363994 0.0284342 0.0704007 0.0446300 0.0244539 0.0218095 0.0228205 0.0244650 
## TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010 TMRC20016 TMRC20011 
## 0.0259861 0.0275633 0.0084708 0.0249880 0.0000000 0.0278667 0.0232143 0.0243409 
## TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 TMRC20019 TMRC20070 TMRC20020 
## 0.0778398 0.0294979 0.0102837 0.0191370 0.0239034 0.0282985 0.0274939 0.0235432 
## TMRC20021 TMRC20022 TMRC20024 TMRC20036 TMRC20069 TMRC20033 TMRC20026 TMRC20031 
## 0.0286477 0.0000000 0.0212984 0.0089603 0.0270318 0.0019682 0.0352553 0.0199682 
## TMRC20076 TMRC20073 TMRC20055 TMRC20079 TMRC20071 TMRC20078 TMRC20094 TMRC20042 
## 0.0270505 0.0282364 0.0395199 0.0280768 0.0247556 0.0177539 0.0279169 0.0398656 
## TMRC20058 TMRC20072 TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056 TMRC20060 
## 0.0256906 0.0158464 0.0251221 0.0238737 0.0062348 0.0349161 0.0003009 0.0294943 
## TMRC20077 TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075 TMRC20051 
## 0.0340198 0.0282781 0.0017690 0.0202252 0.0274156 0.0280939 0.0236363 0.0297070 
## TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083 TMRC20054 
## 0.0324800 0.0338522 0.0314505 0.0350400 0.0288995 0.0273341 0.0121428 0.0298779 
## TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090 TMRC20044 TMRC20045 
## 0.0251593 0.0054459 0.0065508 0.0269888 0.0299476 0.0273432 0.0316706 0.0052405 
## TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096 TMRC20101 TMRC20092 TMRC20102 
## 0.0303013 0.0267368 0.0184037 0.0265627 0.0202590 0.0282691 0.0024627 0.0250142 
## TMRC20099 TMRC20100 TMRC20091 TMRC20084 TMRC20087 TMRC20103 TMRC20104 TMRC20086 
## 0.0291960 0.0266043 0.0274472 0.0063494 0.0291851 0.0065616 0.0265140 0.0251956 
## TMRC20107 TMRC20081 TMRC20095 
## 0.0200527 0.0133293 0.0136510
tt <- normalize(snp_subset, transform = "log2", filter = TRUE)
## Removing 0 low-count genes (85 remaining).
## transform_counts: Found 7122 values equal to 0, adding 1 to the matrix.
zymo_heat <- plot_sample_heatmap(tt, row_label = rownames(assay(snp_subset)))
zymo_heat

4.3 Compare variants to DE genes

Najib has asked a few times about the relationship between variants and DE genes. In subsequent conversations I figured out that what he really wants to learn about is variants in the UTR (most likely 5’) which might affect the expression of genes. The following explicitly does not answer that question, but asks a parallel one: is there a relationship between variants in the CDS and differential expression?

4.3.1 Collect DE data

In order to do this comparison, we need to reload some of the DE results.

These blocks need to be moved to post-differential analyses

rda <- glue("rda/zymo_tables_sva-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
zy_df <- get0(varname)[["data"]][["zymodeme"]]
vars_df <- data.frame(ID = names(snp_genes$summary_by_gene), variants = as.numeric(snp_genes$summary_by_gene))
vars_df[["variants"]] <- log2(vars_df[["variants"]] + 1)
vars_by_de_gene <- merge(zy_df, vars_df, by.x = "row.names", by.y = "ID")
cor.test(vars_by_de_gene$deseq_logfc, vars_by_de_gene$variants)
variants_wrt_logfc <- plot_linear_scatter(vars_by_de_gene[, c("deseq_logfc", "variants")])
variants_wrt_logfc$scatter
## It looks like there might be some genes of interest, even though this is not actually
## the question of interest.
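
The comparison itself is simple enough to sketch with simulated numbers. Nothing below comes from the real tables; the column names are placeholders and the values are random, so the only point is the shape of the test.

```r
## Simulated stand-ins for the per-gene logFC and variant counts.
set.seed(3)
toy <- data.frame(logfc = rnorm(50),
                  variant_count = rpois(50, lambda = 4))
## Same transformation as above: log2 of the count plus one.
toy[["variants"]] <- log2(toy[["variant_count"]] + 1)

## Signed comparison, as in the block above.
toy_cor <- cor.test(toy[["logfc"]], toy[["variants"]])
toy_cor[["estimate"]]

## Direction-agnostic alternative: absolute logFC with a rank correlation.
toy_abs <- cor.test(abs(toy[["logfc"]]), toy[["variants"]],
                    method = "spearman", exact = FALSE)
toy_abs[["estimate"]]
```

If variants can push expression in either direction, increases and decreases may cancel in the signed test, so the absolute logFC might be the more informative quantity to check.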

Didn’t I create a set of densities by chromosome? Oh I think they come in from get_snp_sets()

4.4 SNPS associated with clinical response in the TMRC samples

clinical_sets <- get_snp_sets(new_snps_sufficient, factor = "clinicalresponse")
## The samples represent the following categories:
## 
##    cure failure      nd 
##      40      33      18
## Using a proportion of observed variants, converting the data to binary observations.
## The factor cure has 40 rows.
## The factor failure has 33 rows.
## The factor nd has 18 rows.
## Finished iterating over the chromosomes.
clinical_sets
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: clinicalresponse.  3
## categories and 927126 variants were observed with 7
## combinations among them.  725 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.114678899082569.
density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
var_den_chr <- ggplot(density_df, aes(x = chr, y = density_vec)) +
  ggplot2::geom_col() +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
                 axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))
var_den_chr

pp(file = "figures/variant_density_by_chromosome.pdf")
var_den_chr
dev.off()
## png 
##   2
## oops, forgot to export write_snps...  fixed.
clinical_written <- write_snps(new_snps_sufficient, output_file = "clinical_variants.aln")

4.4.1 Cross reference these variants by gene

clinical_genes <- snps_vs_genes(lp_se, clinical_sets, chr_column = exp_chr_col,
                                start_column = exp_start_col, end_column = exp_end_col)
## The snp grange data has 927126 elements.
## The first few snp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000002, LPAL13_SCAF000003, LPAL13_SCAF000004, LPAL13_SCAF000005, LPAL13_SCAF000007
## The first few exp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000003, LPAL13_SCAF000010, LPAL13_SCAF000011, LPAL13_SCAF000017, LPAL13_SCAF000021
## There are 437555 overlapping variants and genes.
snp_density <- merge(as.data.frame(clinical_genes[["summary"]]),
                     as.data.frame(rowData(lp_se)),
                     by = "row.names")
snp_density <- snp_density[, c(1, 2, 4, 15)]
colnames(snp_density) <- c("name", "snps", "product", "length")
snp_density[["product"]] <- tolower(snp_density[["product"]])
snp_density[["length"]] <- as.numeric(snp_density[["length"]])
snp_density[["density"]] <- as.numeric(snp_density[["snps"]]) / snp_density[["length"]]
snp_idx <- order(snp_density[["density"]], decreasing = TRUE)
snp_density <- snp_density[snp_idx, ]

removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
  drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
  snp_density <- snp_density[!drop_idx, ]
}
## Filter these for [A|a]mastin gp63 Leishmanolysin
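
The density calculation and the product filter can also be written without the loop. This is a sketch on an invented table: the gene names, counts, lengths, and product strings are all made up, and the three removal terms are the ones used above.

```r
## Invented per-gene table with variant counts, product names, and lengths.
toy_density <- data.frame(
  name = paste0("g", 1:5),
  snps = c(4, 10, 0, 6, 3),
  product = c("Amastin-like protein", "hypothetical protein",
              "GP63, leishmanolysin", "kinesin", "amastin surface glycoprotein"),
  length = c(400, 2000, 1500, 600, 300),
  stringsAsFactors = FALSE)
toy_density[["density"]] <- toy_density[["snps"]] / toy_density[["length"]]

## One case-insensitive pattern covering every family to drop.
drop_pattern <- paste(c("amastin", "gp63", "leishmanolysin"), collapse = "|")
keep_idx <- !grepl(pattern = drop_pattern, x = toy_density[["product"]],
                   ignore.case = TRUE)
filtered <- toy_density[keep_idx, ]
filtered <- filtered[order(filtered[["density"]], decreasing = TRUE), ]
filtered[, c("name", "density")]
```

Using ignore.case = TRUE makes the filter independent of whether the product column was lowercased first.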

Let us grab out the number of variants/gene for the cure/fail samples, merge them into a dataframe, and add that to the gene annotations for the lp_se datastructure.

clinical_snps <- snps_intersections(lp_se, clinical_sets, chr_column = exp_chr_col, start_column = exp_start_col, end_column = exp_end_col)

fail_ref_snps <- as.data.frame(clinical_snps[["inters"]][["failure, reference strain"]])
fail_ref_snps <- rbind(fail_ref_snps,
                       as.data.frame(clinical_snps[["inters"]][["failure"]]))
cure_snps <- as.data.frame(clinical_snps[["inters"]][["cure"]])

head(fail_ref_snps)
##                                                     seqnames  start    end
## chr_LPAL13-SCAF000063_pos_2573_ref_C_alt_A LPAL13-SCAF000063   2573   2574
## chr_LPAL13-SCAF000165_pos_363_ref_G_alt_C  LPAL13-SCAF000165    363    364
## chr_LPAL13-SCAF000627_pos_2267_ref_G_alt_T LPAL13-SCAF000627   2267   2268
## chr_LpaL13-01_pos_26758_ref_T_alt_C                LpaL13-01  26758  26759
## chr_LpaL13-03_pos_164108_ref_A_alt_G               LpaL13-03 164108 164109
## chr_LpaL13-03_pos_236263_ref_T_alt_C               LpaL13-03 236263 236264
##                                            width strand
## chr_LPAL13-SCAF000063_pos_2573_ref_C_alt_A     2      +
## chr_LPAL13-SCAF000165_pos_363_ref_G_alt_C      2      +
## chr_LPAL13-SCAF000627_pos_2267_ref_G_alt_T     2      +
## chr_LpaL13-01_pos_26758_ref_T_alt_C            2      +
## chr_LpaL13-03_pos_164108_ref_A_alt_G           2      +
## chr_LpaL13-03_pos_236263_ref_T_alt_C           2      +
head(cure_snps)
##                                                     seqnames  start    end
## chr_LPAL13-SCAF000397_pos_1312_ref_G_alt_A LPAL13-SCAF000397   1312   1313
## chr_LPAL13-SCAF000627_pos_1869_ref_A_alt_G LPAL13-SCAF000627   1869   1870
## chr_LPAL13-SCAF000791_pos_906_ref_C_alt_T  LPAL13-SCAF000791    906    907
## chr_LpaL13-06_pos_2975_ref_T_alt_C                 LpaL13-06   2975   2976
## chr_LpaL13-09_pos_58219_ref_G_alt_A                LpaL13-09  58219  58220
## chr_LpaL13-10_pos_185578_ref_C_alt_T               LpaL13-10 185578 185579
##                                            width strand
## chr_LPAL13-SCAF000397_pos_1312_ref_G_alt_A     2      +
## chr_LPAL13-SCAF000627_pos_1869_ref_A_alt_G     2      +
## chr_LPAL13-SCAF000791_pos_906_ref_C_alt_T      2      +
## chr_LpaL13-06_pos_2975_ref_T_alt_C             2      +
## chr_LpaL13-09_pos_58219_ref_G_alt_A            2      +
## chr_LpaL13-10_pos_185578_ref_C_alt_T           2      +
write.csv(file = "excel/cure_variants.txt", x = rownames(cure_snps))
write.csv(file = "excel/fail_variants.txt", x = rownames(fail_ref_snps))

annot <- rowData(lp_se)
clinical_interest_cure <- as.data.frame(clinical_snps[["gene_summaries"]][["cure"]])
summary(as.factor(clinical_interest_cure[[1]]))
##    0    1    2    3 
## 8729   41    5    3
clinical_interest_fail <- as.data.frame(clinical_snps[["gene_summaries"]][["failure"]])
summary(as.factor(clinical_interest_fail[[1]]))
##    0    1    2    3 
## 8740   35    1    2
clinical_interest <- merge(clinical_interest_cure,
                           clinical_interest_fail,
                           by = "row.names", all = TRUE)

rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps", "fail_snps")
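## Use the merged table (not the untouched annot) so the cure/fail counts reach rowData.
clinical_annot <- merge(as.data.frame(annot), clinical_interest, by = "row.names", all.x = TRUE)
rownames(clinical_annot) <- clinical_annot[["Row.names"]]
clinical_annot[["Row.names"]] <- NULL
## merge() sorts by gene ID, so restore the original gene order before assigning.
clinical_annot <- clinical_annot[rownames(annot), ]
dim(annot)
## [1] 8778  111
dim(rowData(lp_se))
## [1] 8778  111
rowData(lp_se) <- S4Vectors::DataFrame(clinical_annot, check.names = FALSE)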
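
Merging on row names has two side effects which are easy to miss: the result is sorted by the key, and the key comes back as a 'Row.names' column rather than as row names. A toy sketch of merging two count tables into an annotation table follows; every name and value is invented.

```r
## Invented annotation and per-gene count tables.
toy_annot <- data.frame(row.names = c("g3", "g1", "g2"),
                        product = c("c", "a", "b"))
toy_cure <- data.frame(row.names = c("g1", "g3"), cure_snps = c(2, 0))
toy_fail <- data.frame(row.names = c("g2", "g3"), fail_snps = c(1, 3))

## Outer merge of the two count tables, then restore the row names.
counts <- merge(toy_cure, toy_fail, by = "row.names", all = TRUE)
rownames(counts) <- counts[["Row.names"]]
counts[["Row.names"]] <- NULL

## Left merge onto the annotations so that no gene is lost.
merged <- merge(toy_annot, counts, by = "row.names", all.x = TRUE)
rownames(merged) <- merged[["Row.names"]]
merged[["Row.names"]] <- NULL
## merge() sorted the rows by key; put them back in the annotation order.
merged <- merged[rownames(toy_annot), ]
merged
```

Comparing ncol() before and after the merge is a cheap check that the new columns actually arrived.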

5 Zymodeme for new samples

The heatmap produced here should show the variants only for the zymodeme genes.

5.1 Hunt for snp clusters

I am thinking that if we find clusters of locations which are variant, that might provide some PCR testing possibilities.

## Drop the 2.1, 2.4, unknown, and null
pruned_snps <- subset_se(new_snps_sufficient, subset = "condition=='z2.2'|condition=='z2.3'")
new_sets <- get_snp_sets(pruned_snps, factor = "zymodemecategorical")
## The samples represent the following categories:
## 
## z21 z22 z23 z24 
##   0  42  40   0
## Using a proportion of observed variants, converting the data to binary observations.
## Warning in median_by_factor(snp_exp, fact = factor): The level z21 of the
## factor has no columns.
## The factor z22 has 42 rows.
## The factor z23 has 40 rows.
## Warning in median_by_factor(snp_exp, fact = factor): The level z24 of the
## factor has no columns.
## Finished iterating over the chromosomes.
summary(new_sets)
##               Length Class      Mode     
## factor          1    -none-     character
## values          2    data.frame list     
## observations    3    data.frame list     
## possibilities   2    -none-     character
## intersections   3    -none-     list     
## chr_data      725    -none-     list     
## set_names       4    -none-     list     
## invert_names    4    -none-     list     
## density       725    -none-     numeric
## 1000000: 2.2
## 0100000: 2.3

summary(new_sets[["intersections"]][["10"]])
##    Length     Class      Mode 
##       799 character character
write.csv(file = "excel/variants_22.csv", x = new_sets[["intersections"]][["10"]])
summary(new_sets[["intersections"]][["01"]])
##    Length     Class      Mode 
##     67068 character character
write.csv(file = "excel/variants_23.csv", x = new_sets[["intersections"]][["01"]])

Thus we see that there are 799 variants associated with 2.2 and 67,068 associated with 2.3.

5.1.1 A small function for searching for potential PCR primers

The following function uses the positional data to look for sequential mismatches associated with zymodeme in the hopes that there will be some regions which would provide good potential targets for a PCR-based assay.

sequential_variants <- function(snp_sets, conditions = NULL, minimum = 3, maximum_separation = 3) {
  if (is.null(conditions)) {
    conditions <- 1
  }
  intersection_sets <- snp_sets[["intersections"]]
  intersection_names <- snp_sets[["set_names"]]
  chosen_intersection <- 1
  if (is.numeric(conditions)) {
    chosen_intersection <- conditions
  } else {
    intersection_idx <- intersection_names == conditions
    chosen_intersection <- names(intersection_names)[intersection_idx]
  }

  possible_positions <- intersection_sets[[chosen_intersection]]
  position_table <- data.frame(row.names = possible_positions)
  pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
  position_table[["chr"]] <- gsub(pattern = pat, replacement = "\\1", x = rownames(position_table))
  position_table[["pos"]] <- as.numeric(gsub(pattern = pat, replacement = "\\2", x = rownames(position_table)))
  position_idx <- order(position_table[, "chr"], position_table[, "pos"])
  position_table <- position_table[position_idx, ]
  position_table[["dist"]] <- 0

  last_chr <- ""
  for (r in seq_len(nrow(position_table))) {
    this_chr <- position_table[r, "chr"]
    if (r == 1) {
      position_table[r, "dist"] <- position_table[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      position_table[r, "dist"] <- position_table[r, "pos"] - position_table[r - 1, "pos"]
    } else {
      position_table[r, "dist"] <- position_table[r, "pos"]
    }
    last_chr <- this_chr
  }

  ## Working interactively here.

  doubles <- position_table[["dist"]] == 1
  doubles <- position_table[doubles, ]
  write.csv(doubles, "doubles.csv")

  one_away <- position_table[["dist"]] == 2
  one_away <- position_table[one_away, ]
  write.csv(one_away, "one_away.csv")

  two_away <- position_table[["dist"]] == 3
  two_away <- position_table[two_away, ]
  write.csv(two_away, "two_away.csv")

  combined <- rbind(doubles, one_away)
  combined <- rbind(combined, two_away)
  position_idx <- order(combined[, "chr"], combined[, "pos"])
  combined <- combined[position_idx, ]

  last_chr <- ""
  for (r in seq_len(nrow(combined))) {
    this_chr <- combined[r, "chr"]
    if (r == 1) {
      combined[r, "dist_pair"] <- combined[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      combined[r, "dist_pair"] <- combined[r, "pos"] - combined[r - 1, "pos"]
    } else {
      combined[r, "dist_pair"] <- combined[r, "pos"]
    }
    last_chr <- this_chr
  }

  dist_pair_maximum <- 1000
  dist_pair_minimum <- 200
  dist_pair_idx <- combined[["dist_pair"]] <= dist_pair_maximum &
    combined[["dist_pair"]] >= dist_pair_minimum
  remaining <- combined[dist_pair_idx, ]
  no_weak_idx <- grepl(pattern = "ref_(G|C)", x = rownames(remaining))
  remaining <- remaining[no_weak_idx, ]

  print(head(table(position_table[["dist"]])))
  sequentials <- position_table[["dist"]] <= maximum_separation
  message("There are ", sum(sequentials), " candidate regions.")

  ## The following can tell me how many runs of each length occurred, that is not quite what I want.
  ## Now use run length encoding to find the set of sequential sequentials!
  rle_result <- rle(sequentials)
  rle_values <- rle_result[["values"]]
  ## The following line is equivalent to just leaving values alone:
  ## true_values <- rle_result[["values"]] == TRUE
  rle_lengths <- rle_result[["lengths"]]
  true_sequentials <- rle_lengths[rle_values]
  rle_idx <- cumsum(rle_lengths)[which(rle_values)]

  position_table[["last_sequential"]] <- 0
  count <- 0
  for (r in rle_idx) {
    count <- count + 1
    position_table[r, "last_sequential"] <- true_sequentials[count]
  }
  message("The maximum sequential set is: ", max(position_table[["last_sequential"]]), ".")

  wanted_idx <- position_table[["last_sequential"]] >= minimum
  wanted <- position_table[wanted_idx, c("chr", "pos")]
  return(wanted)
}

zymo22_sequentials <- sequential_variants(new_sets, conditions = "z22",
                                          minimum = 1, maximum_separation = 2)
dim(zymo22_sequentials)
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "z23",
                                          minimum = 2, maximum_separation = 2)
dim(zymo23_sequentials)
## In contrast, there are lots (587) of interesting regions for 2.3!
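
The run-length bookkeeping inside sequential_variants() is easier to follow on toy data. What follows is a minimal sketch, not the function itself: the `variants` table, its positions, and the two cutoffs are made up for illustration, and ave() stands in for the row-by-row distance loop.

```r
## Toy variant table (hypothetical chromosomes and positions).
variants <- data.frame(
  chr = c("chrA", "chrA", "chrA", "chrA", "chrB", "chrB"),
  pos = c(100, 101, 103, 500, 20, 21))
maximum_separation <- 2
minimum <- 2

## Distance to the previous variant on the same chromosome; the first
## variant on each chromosome keeps its own position as its distance.
variants[["dist"]] <- ave(variants[["pos"]], variants[["chr"]],
                          FUN = function(p) c(p[1], diff(p)))
sequentials <- variants[["dist"]] <= maximum_separation

## Run-length encode the TRUE/FALSE vector and keep only the TRUE runs.
rle_result <- rle(sequentials)
rle_lengths <- rle_result[["lengths"]]
rle_values <- rle_result[["values"]]
true_runs <- rle_lengths[rle_values]
## cumsum() of the run lengths gives the row index where each run ends.
run_ends <- cumsum(rle_lengths)[rle_values]

## Record the run length on the last row of each run, 0 elsewhere.
variants[["last_sequential"]] <- 0
variants[run_ends, "last_sequential"] <- true_runs
wanted <- variants[variants[["last_sequential"]] >= minimum, c("chr", "pos")]
wanted
```

The vectorized assignment at the end replaces the counting loop; both write the same values because `run_ends` and `true_runs` are in the same order.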

5.1.2 Extract a promising region from the genome

The first 4 candidate regions from my set of remaining (chromosome, position, distance):

  • LpaL13-15 238433 448
  • LpaL13-18 142844 613
  • LpaL13-29 830342 252
  • LpaL13-33 1331507 843

Let's define a couple of terms:

  • Third: Each of the 4 above positions.
  • Second: Third - Distance
  • End: Third + PrimerLen
  • Start: Second - PrimerLen

In each instance, these are the last positions, so we want to grab three things:

  • The entire region from Start -> End, this way we can have a quick sanity check.
  • Start -> Second.
  • (Third -> End) <- Reverse complemented
## * LpaL13-15 238433 448
first_candidate_chr <- lp_genome[["LpaL13_15"]]
primer_length <- 22
amplicon_length <- 448
first_candidate_third <- 238433
first_candidate_second <- first_candidate_third - amplicon_length
first_candidate_start <- first_candidate_second - primer_length
first_candidate_end <- first_candidate_third + primer_length
first_candidate_region <- subseq(first_candidate_chr, first_candidate_start, first_candidate_end)
first_candidate_region
first_candidate_5p <- subseq(first_candidate_chr, first_candidate_start, first_candidate_second)
as.character(first_candidate_5p)
first_candidate_3p <- spgs::reverseComplement(subseq(first_candidate_chr, first_candidate_third, first_candidate_end))
first_candidate_3p

## * LpaL13-18 142844 613
second_candidate_chr <- lp_genome[["LpaL13_18"]]
primer_length <- 22
amplicon_length <- 613
second_candidate_third <- 142844
second_candidate_second <- second_candidate_third - amplicon_length
second_candidate_start <- second_candidate_second - primer_length
second_candidate_end <- second_candidate_third + primer_length
second_candidate_region <- subseq(second_candidate_chr, second_candidate_start, second_candidate_end)
second_candidate_region
second_candidate_5p <- subseq(second_candidate_chr, second_candidate_start, second_candidate_second)
as.character(second_candidate_5p)
second_candidate_3p <- spgs::reverseComplement(subseq(second_candidate_chr, second_candidate_third, second_candidate_end))
second_candidate_3p


## * LpaL13-29 830342 252
third_candidate_chr <- lp_genome[["LpaL13_29"]]
primer_length <- 22
amplicon_length <- 252
third_candidate_third <- 830342
third_candidate_second <- third_candidate_third - amplicon_length
third_candidate_start <- third_candidate_second - primer_length
third_candidate_end <- third_candidate_third + primer_length
third_candidate_region <- subseq(third_candidate_chr, third_candidate_start, third_candidate_end)
third_candidate_region
third_candidate_5p <- subseq(third_candidate_chr, third_candidate_start, third_candidate_second)
as.character(third_candidate_5p)
third_candidate_3p <- spgs::reverseComplement(subseq(third_candidate_chr, third_candidate_third, third_candidate_end))
third_candidate_3p
## You are a garbage polypyrimidine tract.
## Which is actually interesting if the mutations mess it up.


## * LpaL13-33 1331507 843
fourth_candidate_chr <- lp_genome[["LpaL13_33"]]
primer_length <- 22
amplicon_length <- 843
fourth_candidate_third <- 1331507
fourth_candidate_second <- fourth_candidate_third - amplicon_length
fourth_candidate_start <- fourth_candidate_second - primer_length
fourth_candidate_end <- fourth_candidate_third + primer_length
fourth_candidate_region <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_end)
fourth_candidate_region
fourth_candidate_5p <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_second)
as.character(fourth_candidate_5p)
fourth_candidate_3p <- spgs::reverseComplement(subseq(fourth_candidate_chr, fourth_candidate_third, fourth_candidate_end))
fourth_candidate_3p
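
The four blocks above repeat the same arithmetic, so it may help to see it once in isolation. This is a small sketch in base R only: `candidate_coords()` and `revcomp()` are hypothetical helpers I am introducing for illustration, and the toy sequence is made up; nothing here touches lp_genome or Biostrings.

```r
## Hypothetical helper: derive the four coordinates from the terms defined above.
candidate_coords <- function(third, amplicon_length, primer_length = 22) {
  second <- third - amplicon_length
  list(
    start = second - primer_length,
    second = second,
    third = third,
    end = third + primer_length)
}

## Base R reverse complement for a plain character string (toy data only).
revcomp <- function(x) {
  complemented <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

## Coordinates for the first candidate: LpaL13-15 238433 448
first <- candidate_coords(third = 238433, amplicon_length = 448)
first

## The same Third -> End extraction on a made-up 10 nt sequence.
toy_chr <- "AACGTTGCAT"
toy_3p <- revcomp(substr(toy_chr, 7, 10))
toy_3p
```

With a helper like this, each candidate becomes one call and the primer length only needs to be changed in one place.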

5.2 Go hunting for Sanger sequencing regions

I made a fun little function which should find regions which have lots of variants associated with a given experimental factor.

pheno <- subset_se(lp_se, subset = "condition=='z2.2'|condition=='z2.3'")
pheno <- subset_se(pheno, subset = "!is.na(colData(pheno)[['bcftable']])")
pheno_snps <- count_snps(pheno, annot_column = "freebayessummary", snp_column="PAIRED")
## Using the snp column: PAIRED from the sample annotations.
##pheno_snps <- sm(count_snps(pheno, annot_column = "bcftable"))

5.3 SNP Density Primers

I cannot run the following block in the container unless/until I copy the gff into it…

fun_stuff <- snp_density_primers(
  pheno_snps,
  bsgenome = "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53",
  gff = "reference/TriTrypDB-53_LpanamensisMHOMCOL81L13.gff")
drop_scaffolds <- grepl(x = rownames(fun_stuff$favorites), pattern = "SCAF")
favorite_primer_regions <- fun_stuff[["favorites"]][!drop_scaffolds, ]
favorite_primer_regions[["bin"]] <- rownames(favorite_primer_regions)

favorite_primer_regions <- favorite_primer_regions %>%
  relocate(bin)

5.4 Combine this table with 2.2/2.3 genes

Here is my note from our meeting:

Cross reference primers to DE genes of 2.2/2.3 and/or resistance/susceptible, add a column to the primer spreadsheet with the DE genes (in retrospect I am guessing this actually means to put the logFC as a column).

One nice thing, I did a semantic removal on the lp_se, so the set of logFC/pvalues should not have any of the offending types; thus I should be able to automagically get rid of them in the merge.

This block needs to go after differential expression analyses.

logfc <- zy_table_sva[["data"]][["z23_vs_z22"]]
logfc_columns <- logfc[, c("deseq_logfc", "deseq_adjp")]
colnames(logfc_columns) <- c("z23_logfc", "z23_adjp")
new_table <- merge(favorite_primer_regions, logfc_columns,
                   by.x = "closest_gene_before_id", by.y = "row.names")
sus <- sus_table_sva[["data"]][["sensitive_vs_resistant"]]
sus_columns <- sus[, c("deseq_logfc", "deseq_adjp")]
colnames(sus_columns) <- c("sus_logfc", "sus_adjp")
new_table <- merge(new_table, sus_columns,
                   by.x = "closest_gene_before_id", by.y = "row.names") %>%
  relocate(bin)
written <- write_xlsx(data = new_table,
                      excel = "excel/favorite_primers_xref_zy_sus.xlsx")
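
The merge above silently drops any primer region whose closest gene is absent from the DE table, which is the behavior I am relying on. A toy sketch of that behavior follows; the data frames and gene names are invented, and only base R merge() is used.

```r
## Toy primer table and DE table (all values are made up).
primers <- data.frame(
  bin = c("bin_a", "bin_b", "bin_c"),
  closest_gene_before_id = c("gene1", "gene2", "gene3"))
logfc <- data.frame(
  z23_logfc = c(1.5, -2.0),
  z23_adjp = c(0.01, 0.20),
  row.names = c("gene1", "gene3"))

## Default merge() is an inner join: gene2 has no DE entry, so bin_b disappears.
inner <- merge(primers, logfc,
               by.x = "closest_gene_before_id", by.y = "row.names")
## all.x = TRUE keeps every primer region and fills the gaps with NA.
left <- merge(primers, logfc,
              by.x = "closest_gene_before_id", by.y = "row.names",
              all.x = TRUE)
nrow(inner)
nrow(left)
```

If the spreadsheet should list every favorite region regardless of DE status, all.x = TRUE would be the variant to use.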

5.5 Make a heatmap describing the clustering of variants

We can cross reference the variants against the zymodeme status and plot a heatmap of the results and hopefully see how they separate.

snp_genes <- sm(snps_vs_genes(lp_se, new_sets, chr_column = exp_chr_col,
                              start_column = exp_start_col, end_column = exp_end_col))

clinical_colors_v2 <- list(
  "z22" = "#0000cc",
  "z23" = "#cc0000")
new_zymo_norm <- normalize_se(pruned_snps, norm = "quant") %>%
  set_conditions(fact = "zymodemecategorical", colors = clinical_colors_v2)
## The numbers of samples by condition are:
## 
## z21 z22 z23 z24 
##   0  42  40   0
## Warning in set_se_colors(new_se, colors = colors): Some conditions do not have
## a color: z21z24.
## Warning in set_se_colors(new_se, colors = colors): These samples are: .
#  set_se_colors(clinical_colors_v2)

zymo_heat <- plot_disheat(new_zymo_norm)
dev <- pp(file = "images/onlyz22_z23_snp_heatmap.pdf", width = 12, height = 12)
zymo_heat[["plot"]]
closed <- dev.off()
zymo_heat
## A heatmap of pairwise sample distances ranging from: 
## 405123.161006473 to 2192192.81018837.

5.5.1 Annotated heatmap of variants

Now let us try to make a heatmap which includes some of the annotation data.

des <- colData(both_norm)
undef_idx <- is.na(des[["pathogenstrain"]])
des[undef_idx, "pathogenstrain"] <- "unknown"

##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(assay(both_norm))
na_idx <- is.na(correlations)
correlations[na_idx] <- 0

## Make an initial heatmap via plot_disheat, which may get used as the figure:
initial_snps <- set_conditions(both_norm, fact = "zymodemereference", colors = color_choices[["strain"]])
## The numbers of samples by condition are:
## 
## z2.1 z2.2 z2.3 z2.4 
##    7   42   40    2
## Warning in set_se_colors(new_se, colors = colors): Colors for the following
## categories are not being used: z2.0, z3.0, z3.2, z1.0, z1.5, b2904, unknown.
initial_disheat <- plot_disheat(both_norm)
dev <- pp(file = "figures/initial_snp_heatmap.pdf", width = 20, height = 20)
initial_disheat[["plot"]]
closed <- dev.off()
zymo_heat
## A heatmap of pairwise sample distances ranging from: 
## 405123.161006473 to 2192192.81018837.

zymo_missing_idx <- is.na(des[["zymodemecategorical"]])
des[["zymodemecategorical"]] <- as.character(des[["zymodemecategorical"]])
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
des[zymo_missing_idx, "zymodemecategorical"] <- "unknown"
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("zymodemecategorical")])
unknown_clinical <- is.na(des[["clinicalcategorical"]])
colnames(col_data) <- c("zymodeme")

row_data <- as.data.frame(des[, c("sus_category_current", "clinicalcategorical")])
colnames(row_data) <- c("susceptibility", "outcome")
row_data[unknown_clinical, "outcome"] <- "undefined"

myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(240)
zymo_annot_heat <- annHeatmap2(
  correlations,
  dendrogram = mydendro,
  annotation = myannot,
  cluster = myclust,
  labels = mylabs,
  ## The following controls if the picture is symmetric
  scale = "none",
  col = hmcols)
## Warning in breakColors(breaks, col): more colors than classes: ignoring 35 last
## colors
dev <- pp(file = "images/dendro_heatmap.pdf", height = 20, width = 20)
plot(zymo_annot_heat)
closed <- dev.off()
plot(zymo_annot_heat)

Print the larger heatmap so that all the labels appear. Keep in mind that as we get more samples, this image needs to continue getting bigger.
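
The correlation and clustering steps feeding the annotated heatmap can be sketched with base R. This assumes a plain Pearson cor() as a stand-in for hpgl_cor(), whose defaults I am not restating here, and the toy matrix is invented; the constant column is there to show where the NA values come from.

```r
## Toy matrix: rows are variants, columns are samples (made-up values).
mat <- cbind(
  s1 = c(1, 2, 3, 4),
  s2 = c(2, 4, 6, 9),
  s3 = c(4, 3, 2, 1),
  s4 = c(5, 5, 5, 5))

## A sample with zero variance has an undefined correlation (NA).
correlations <- suppressWarnings(cor(mat))
## Same repair as above: treat undefined correlations as 0.
na_idx <- is.na(correlations)
correlations[na_idx] <- 0

## Cluster samples on 1 - correlation so similar samples sit close together.
tree <- hclust(as.dist(1 - correlations))
tree[["order"]]
```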

5.5.2 CMplot karyogram of variants

I cannot run the following block until/unless I install cmplot in the container. Oh, I did! Let us run it and see what happens.

xref_prop <- table(colData(pheno_snps)[["condition"]])
xref_prop
## 
## z2.2 z2.3 
##   29   27
idx_tbl <- assay(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(assay(pheno_snps)))
for (n in names(xref_prop)) {
  samples <- colData(pheno_snps)[["condition"]] == n
  new_tbl[[n]] <- 0
  prop_col <- rowSums(idx_tbl[, samples]) / xref_prop[n]
  new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1

new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]

simplify <- new_tbl
simplify[["strong22"]] <- NULL

CMplot(new_tbl, bin.size = 10000, threshold = c(0.01, 0.05), plot.type = "d",
       file.name = "variant_density_10k")
##  Marker density plotting.
##  Plots are stored in: /lab/singularity/clinical_strains_analyses/202510181845_outputs
CMplot(new_tbl, bin.size = 1000, threshold = c(0.01, 0.05), plot.type = "d",
       file.name = "variant_density_1k")
##  Marker density plotting.
##  Plots are stored in: /lab/singularity/clinical_strains_analyses/202510181845_outputs
CMplot(new_tbl, bin.size = 100000, threshold = c(0.01, 0.05), plot.type = "d",
       file.name = "variant_density_100k")
##  Marker density plotting.
##  Plots are stored in: /lab/singularity/clinical_strains_analyses/202510181845_outputs
CMplot(new_tbl, plot.type = "m", multracks = TRUE, threshold = c(0.01, 0.05),
       threshold.lwd = c(1,1), threshold.col = c("black","grey"),
       amplify = TRUE, bin.size = 1000,
       chr.den.col = c("darkgreen", "yellow", "red"),
       signal.col = c("red", "green", "blue"),
       signal.cex = 1, file = "jpg", dpi = 300, file.output = TRUE, verbose = TRUE)
##  Multi-tracks Manhattan plotting strong22.
##  Multi-tracks Manhattan plotting strong23.
##  Plots are stored in: /lab/singularity/clinical_strains_analyses/202510181845_outputs
SNP Density
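
The values handed to CMplot are derived per-condition proportions, which is easy to lose in the loop above. Here is a minimal sketch on toy data; the count matrix, sample names, and condition vector are invented, and rowMeans() on a logical matrix is used in place of the explicit rowSums()/n division.

```r
## Toy variant-by-sample count matrix (made-up values).
counts <- matrix(c(10, 0, 8, 0,
                    0, 7, 0, 9,
                    6, 6, 6, 6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("v1", "v2", "v3"),
                                 c("s1", "s2", "s3", "s4")))
condition <- c("z2.2", "z2.3", "z2.2", "z2.3")

## A variant counts as observed in a sample when it has more than 5 reads.
observed <- counts > 5
## Proportion of samples in each condition which have the variant.
prop <- sapply(unique(condition), function(n) {
  rowMeans(observed[, condition == n, drop = FALSE])
})

## Pseudo p-value as above: near 0 when every sample has the variant,
## capped at 1 when no sample does.
strong <- pmin(1.001 - prop, 1)
strong
```

The 1.001 offset keeps the smallest value at 0.001 rather than 0, so the Manhattan track's -log10 scale stays finite.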

6 A different karyogram

I have been a bit frustrated with the clunkiness of CMplot, so I did some reading and found autoplot. It makes use of GRanges/IRanges to plot arbitrary data and as such has the potential to be significantly more generally useful than CMplot. I think I will be able to use it to view a lot of interesting different data types. In this instance I want to plot the density of variants associated with various conditions in the data (z2.3/z2.2, cure/fail, whatever). In addition, it might be nice to have the ORFs displayed in some fashion (space permitting).

I am pretty sure I made a function which makes this less clunky than what follows.

lp_entry <- EuPathDB::get_eupath_entry(species = "MHOM/COL", metadata = eu_meta)

## These lines cannot run in the container because it cannot write
##txdb_pkgname <- make_eupath_txdb(lp_entry)
##grange_name <- make_eupath_granges(lp_entry)
grange_name <- gsub(x = lp_entry[["GrangesPkg"]], pattern = "\\.rda$", replacement = "")
grange_filename <- file.path("build", lp_entry[["GrangesPkg"]])
if (file.exists(grange_filename)) {
  load(grange_filename)
} else {
  created <- dir.create("build/gff", recursive = TRUE)
  grange_build <- make_eupath_granges(lp_entry)
  grange_filename <- grange_build[["rda"]]
  load(grange_filename)
}
grange_data <- get0(grange_name)

scaffold_idx <- grepl(x = as.character(seqnames(grange_data)), pattern = "SCAF")
no_scaffolds <- grange_data[!scaffold_idx]
scaffold_idx <- grepl(x = as.character(names(seqinfo(grange_data))), pattern = "SCAF")
chr_names <- names(seqinfo(grange_data))[!scaffold_idx]
no_scaffolds <- seqinfo(grange_data)[chr_names]

auto_tbl <- new_tbl
auto_tbl[["position2"]] <- auto_tbl[["Position"]]
auto_tbl[["SNP"]] <- NULL
rownames(auto_tbl) <- NULL

tilesize <- 1000
bins_1k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 1000,
                                     cut.last.tile.in.chrom = TRUE)
bins_5k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 5000,
                                     cut.last.tile.in.chrom = TRUE)
bins_10k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 10000,
                                  cut.last.tile.in.chrom = TRUE)
bins_1nt <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 1,
                                      cut.last.tile.in.chrom = TRUE)
auto_tbl[["strand"]] <- "+"
## I want to calculate the number of intersecting positions between my auto_tbl and the 1k bins.
start <- auto_tbl[, c("Chromosome", "Position", "position2", "strand", "strong23")]
colnames(start) <- c("chr", "start", "end", "strand", "z23")
start[["chr"]] <- gsub(x = start[["chr"]], pattern = "-", replacement = "_")
var_grange <- makeGRangesFromDataFrame(start, seqinfo = no_scaffolds, keep.extra.columns = TRUE)
vars_per_bin <- findOverlaps(bins_1k, var_grange)
vars_per_bin_numeric <- as.data.frame(bins_1k)
vars_per_bin_numeric[["bin"]] <- rownames(vars_per_bin_numeric)

count_per_bin <- as.data.frame(vars_per_bin) %>%
  group_by(queryHits) %>%
  dplyr::tally()
colnames(count_per_bin) <- c("bin", "num")
vars_per_bin_numeric <- merge(vars_per_bin_numeric, count_per_bin, by = "bin", all.x = TRUE)
missing_idx <- is.na(vars_per_bin_numeric[["num"]])
vars_per_bin_numeric[missing_idx, "num"] <- 0
vars_per_bin <- vars_per_bin_numeric[, c("seqnames", "start", "end", "width", "strand", "num")]
vpb_grange <- makeGRangesFromDataFrame(vars_per_bin, seqinfo = no_scaffolds, keep.extra.columns = TRUE)

kary <- autoplot(vpb_grange, layout = "karyogram", aes(color = num, fill = num)) +
  scale_color_gradient(low = "blue", high = "red") +
  scale_fill_gradient(low = "blue", high = "red")
## theme_bw(base_size = 10) +
pp(file = "karyogram_by_variants.pdf", height = 24, width = 18)
kary
dev.off()

var_kary <- ggbio() +
  layout_karyogram(vpb_grange, aes(color = num, fill = num)) +
  scale_fill_gradient(low = "blue", high = "white") +
  scale_color_gradient(low = "blue", high = "white") +
  theme_bw(base_size = 10)
var_kary
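
The per-bin counting that feeds the karyogram can be checked by hand on a few positions. This sketch swaps findOverlaps() for plain integer arithmetic in base R; the `variants` table is invented, and I am assuming 1-based closed tiles (1..1000, 1001..2000, ...).

```r
## Toy variant positions (made-up values).
variants <- data.frame(
  chr = c("LpaL13_01", "LpaL13_01", "LpaL13_01", "LpaL13_02"),
  pos = c(150, 999, 1000, 2500))
tile_width <- 1000

## Positions 1..1000 fall in tile 1, 1001..2000 in tile 2, and so on.
variants[["tile"]] <- (variants[["pos"]] - 1) %/% tile_width + 1

## Count the variants in each chromosome/tile combination.
per_bin <- aggregate(pos ~ chr + tile, data = variants, FUN = length)
colnames(per_bin)[3] <- "num"
per_bin
```

Unlike the merge against bins_1k above, this only returns tiles with at least one variant; empty tiles would still need to be filled with 0 before plotting.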

6.1 Try out MatrixEQTL

This tool looks a little opaque, but provides sample data with things that make sense to me and should be pretty easy to recapitulate in our data.

  1. covariates.txt: Columns are samples, rows are things from colData – the most likely ones of interest for our data would be zymodeme, sensitivity
  2. geneloc.txt: columns are ‘geneid’, ‘chr’, ‘left’, ‘right’. I guess I can assume left and right are start/stop; in which case this is trivially acquirable from rowData.
  3. ge.txt: This appears to be a log(rpkm/cpm) table with rows as genes and columns as samples
  4. snpsloc.txt: columns are ‘snpid’, ‘chr’, ‘pos’
  5. snps.txt: columns are samples, rows are the ids from snpsloc, values are 0, 1, 2. I assume 0 is identical and 1..12 are the various A->TGC T->AGC C->AGT G->ACT
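
To make the snpsloc and snps shapes from the list concrete, here is a toy sketch in base R. The variant IDs are invented; I am assuming they follow a chr_<chromosome>_pos_<position>_ref_<base>_alt_<base> layout, and the trailing alt portion in particular is a guess.

```r
## Made-up variant IDs in the assumed naming layout.
snp_ids <- c("chr_LpaL13-15_pos_238433_ref_G_alt_A",
             "chr_LpaL13-18_pos_142844_ref_C_alt_T")
pattern <- "^chr_(.+)_pos_(\\d+)_ref_.*$"

## snpsloc: one row per variant with the columns snpid, chr, pos.
snpsloc <- data.frame(
  snpid = snp_ids,
  chr = gsub(pattern = pattern, replacement = "\\1", x = snp_ids),
  pos = as.numeric(gsub(pattern = pattern, replacement = "\\2", x = snp_ids)))

## snps: variants as rows, samples as columns, collapsed to absent/present.
snps <- matrix(c(0, 12, 7, 0), nrow = 2,
               dimnames = list(snp_ids, c("sample1", "sample2")))
snps[snps > 0] <- 1
snpsloc
snps
```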
## For this, let us use the 'new_snps' data structure.
## Caveat here: these need to be coerced to numbers.
my_covariates <- colData(new_snps)[, c("zymodemecategorical", "clinicalcategorical")]
for (col in colnames(my_covariates)) {
  my_covariates[[col]] <- as.numeric(as.factor(my_covariates[[col]]))
}
my_covariates <- t(my_covariates)

my_geneloc <- rowData(lp_se)[, c("gid", "chromosome", "start", "end")]
colnames(my_geneloc) <- c("geneid", "chr", "left", "right")

my_ge <- assay(normalize_se(lp_se, transform = "log2", filter = TRUE, convert = "cpm"))
used_samples <- tolower(colnames(my_ge)) %in% colnames(assay(new_snps))
my_ge <- my_ge[, used_samples]

## MatrixEQTL wants the columns snpid, chr, pos; keep the variant IDs as a column.
my_snpsloc <- data.frame(snpid = rownames(assay(new_snps)))
## Oh, caveat here: Because of the way I stored the data,
## I could have duplicate rows which presumably will make matrixEQTL sad
my_snpsloc[["chr"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\1",
                            x = my_snpsloc[["snpid"]])
my_snpsloc[["pos"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\2",
                            x = my_snpsloc[["snpid"]])
## Duplicates are defined by chromosome and position, not by the (unique) ID.
test <- duplicated(my_snpsloc[, c("chr", "pos")])
## Each duplicated row would be another variant at that position;
## so in theory we would do a rle to number them I am guessing
## However, I do not have different variants so I think I can ignore this for the moment
## but will need to make my matrix either 0 or 1.
if (sum(test) > 0) {
  message("There are: ", sum(test), " duplicated entries.")
  keep_idx <- ! test
  my_snpsloc <- my_snpsloc[keep_idx, ]
}

my_snps <- assay(new_snps)
one_idx <- my_snps > 0
my_snps[one_idx] <- 1

## Ok, at this point I think I have all the pieces which this method wants...
## Oh, no I guess not; it actually wants the data as a set of filenames...
library(MatrixEQTL)
write.table(my_snps, "eqtl/snps.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snps, "eqtl/snps.tsv", )
write.table(my_snpsloc, "eqtl/snpsloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snpsloc, "eqtl/snpsloc.tsv")
write.table(as.data.frame(my_ge), "eqtl/ge.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_ge), "eqtl/ge.tsv")
write.table(as.data.frame(my_geneloc), "eqtl/geneloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_geneloc), "eqtl/geneloc.tsv")
write.table(as.data.frame(my_covariates), "eqtl/covariates.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_covariates), "eqtl/covariates.tsv")

useModel = modelLINEAR # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = "eqtl/snps.tsv"
snps_location_file_name = "eqtl/snpsloc.tsv"
expression_file_name = "eqtl/ge.tsv"
gene_location_file_name = "eqtl/geneloc.tsv"
covariates_file_name = "eqtl/covariates.tsv"
# Output file name
output_file_name_cis = tempfile()
output_file_name_tra = tempfile()
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 0.1
pvOutputThreshold_tra = 0.1
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric()
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6
## Load genotype data
snps = SlicedData$new()
snps$fileDelimiter = "\t"      # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1          # one row of column labels
snps$fileSkipColumns = 1       # one column of row labels
snps$fileSliceSize = 2000      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
## Load gene expression data
gene = SlicedData$new()
gene$fileDelimiter = "\t"      # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1          # one row of column labels
gene$fileSkipColumns = 1       # one column of row labels
gene$fileSliceSize = 2000      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
## Load covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t"      # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1          # one row of column labels
cvrt$fileSkipColumns = 1       # one column of row labels
if(length(covariates_file_name) > 0) {
  cvrt$LoadFile(covariates_file_name)
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE)
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE)

me = Matrix_eQTL_main(
  snps = snps,
  gene = gene,
  cvrt = cvrt,
  output_file_name = output_file_name_tra,
  pvOutputThreshold = pvOutputThreshold_tra,
  useModel = useModel,
  errorCovariance = errorCovariance,
  verbose = TRUE,
  output_file_name.cis = output_file_name_cis,
  pvOutputThreshold.cis = pvOutputThreshold_cis,
  snpspos = snpspos,
  genepos = genepos,
  cisDist = cisDist,
  pvalue.hist = "qqplot",
  min.pv.by.genesnp = FALSE,
  noFDRsaveMemory = FALSE);
pander::pander(sessionInfo())
## Warning: Your system is mis-configured: '/etc/localtime' is not a symlink
## Warning: It is strongly recommended to set envionment variable TZ to
## 'America/New_York' (or equivalent)

R version 4.5.0 (2025-04-11)

Platform: x86_64-pc-linux-gnu

locale: C

attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: foreach(v.1.5.2), edgeR(v.4.6.3), ruv(v.0.9.7.1), hpgltools(v.1.2), Heatplus(v.3.16.0), glue(v.1.8.0), ggbio(v.1.56.1), ggplot2(v.4.0.0), GenomicRanges(v.1.60.0), GenomeInfoDb(v.1.44.2), IRanges(v.2.42.0), S4Vectors(v.0.46.0), BiocGenerics(v.0.54.0), generics(v.0.1.4), dplyr(v.1.1.4) and CMplot(v.4.5.1)

loaded via a namespace (and not attached): fs(v.1.6.6), ProtGenerics(v.1.40.0), matrixStats(v.1.5.0), bitops(v.1.0-9), blockmodeling(v.1.1.8), doParallel(v.1.0.17), httr(v.1.4.7), RColorBrewer(v.1.1-3), numDeriv(v.2016.8-1.1), tools(v.4.5.0), backports(v.1.5.0), R6(v.2.6.1), lazyeval(v.0.2.2), mgcv(v.1.9-3), withr(v.3.0.2), prettyunits(v.1.2.0), GGally(v.2.4.0), gridExtra(v.2.3), preprocessCore(v.1.70.0), cli(v.3.6.5), Biobase(v.2.68.0), EBSeq(v.2.6.0), labeling(v.0.4.3), sass(v.0.4.10), robustbase(v.0.99-6), mvtnorm(v.1.3-3), S7(v.0.2.0), genefilter(v.1.90.0), Rsamtools(v.2.24.0), yulab.utils(v.0.2.1), txdbmaker(v.1.4.2), foreign(v.0.8-90), DOSE(v.4.2.0), R.utils(v.2.13.0), dichromat(v.2.0-0.1), BSgenome(v.1.76.0), limma(v.3.64.3), rstudioapi(v.0.17.1), RSQLite(v.2.4.3), BiocIO(v.1.18.0), gtools(v.3.9.5), crosstalk(v.1.2.1), zip(v.2.3.3), GO.db(v.3.21.0), Matrix(v.1.7-3), abind(v.1.4-8), R.methodsS3(v.1.8.2), lifecycle(v.1.0.4), yaml(v.2.3.10), SummarizedExperiment(v.1.38.1), gplots(v.3.2.0), qvalue(v.2.40.0), SparseArray(v.1.8.1), BiocFileCache(v.2.16.1), Rtsne(v.0.17), grid(v.4.5.0), blob(v.1.2.4), promises(v.1.3.3), crayon(v.1.5.3), lattice(v.0.22-7), cowplot(v.1.2.0), GenomicFeatures(v.1.60.0), annotate(v.1.86.1), KEGGREST(v.1.48.1), pillar(v.1.11.0), knitr(v.1.50), varhandle(v.2.0.6), fgsea(v.1.34.2), rjson(v.0.2.23), boot(v.1.3-31), corpcor(v.1.6.10), codetools(v.0.2-20), fastmatch(v.1.1-6), Vennerable(v.3.1.0.9000), data.table(v.1.17.8), vctrs(v.0.6.5), png(v.0.1-8), Rdpack(v.2.6.4), testthat(v.3.2.3), gtable(v.0.3.6), cachem(v.1.1.0), openxlsx(v.4.2.8), xfun(v.0.53), rbibutils(v.2.3), S4Arrays(v.1.8.1), mime(v.0.13), RcppEigen(v.0.3.4.0.2), reformulas(v.0.4.1), survival(v.3.8-3), NOISeq(v.2.52.0), iterators(v.1.0.14), statmod(v.1.5.0), nlme(v.3.1-168), pbkrtest(v.0.5.5), bit64(v.4.6.0-1), EnvStats(v.3.1.0), progress(v.1.2.3), filelock(v.1.0.3), rprojroot(v.2.1.1), bslib(v.0.9.0), KernSmooth(v.2.23-26), rpart(v.4.1.24), colorspace(v.2.1-2), DBI(v.1.2.3), 
Hmisc(v.5.2-4), nnet(v.7.3-20), DESeq2(v.1.48.1), tidyselect(v.1.2.1), bit(v.4.6.0), compiler(v.4.5.0), curl(v.7.0.0), httr2(v.1.2.1), graph(v.1.86.0), htmlTable(v.2.4.3), xml2(v.1.4.0), desc(v.1.4.3), DelayedArray(v.0.34.1), plotly(v.4.11.0), rtracklayer(v.1.68.0), checkmate(v.2.3.3), scales(v.1.4.0), caTools(v.1.18.3), DEoptimR(v.1.1-4), remaCor(v.0.0.20), RBGL(v.1.84.0), rappdirs(v.0.3.3), stringr(v.1.5.1), digest(v.0.6.37), minqa(v.1.2.8), variancePartition(v.1.38.1), rmarkdown(v.2.29), aod(v.1.3.3), XVector(v.0.48.0), RhpcBLASctl(v.0.23-42), htmltools(v.0.5.8.1), pkgconfig(v.2.0.3), base64enc(v.0.1-3), lme4(v.1.1-37), MatrixGenerics(v.1.20.0), dbplyr(v.2.5.0), fastmap(v.1.2.0), ensembldb(v.2.32.0), rlang(v.1.1.6), htmlwidgets(v.1.6.4), UCSC.utils(v.1.4.0), shiny(v.1.11.1), farver(v.2.1.2), jquerylib(v.0.1.4), jsonlite(v.2.0.0), BiocParallel(v.1.42.1), GOSemSim(v.2.34.0), R.oo(v.1.27.1), VariantAnnotation(v.1.54.1), RCurl(v.1.98-1.17), magrittr(v.2.0.4), Formula(v.1.2-5), GenomeInfoDbData(v.1.2.14), Rcpp(v.1.1.0), stringi(v.1.8.7), brio(v.1.1.5), MASS(v.7.3-65), plyr(v.1.8.9), ggstats(v.0.11.0), parallel(v.4.5.0), ggrepel(v.0.9.6), doSNOW(v.1.0.20), Biostrings(v.2.76.0), splines(v.4.5.0), pander(v.0.6.6), hms(v.1.1.3), locfit(v.1.5-9.12), fastcluster(v.1.3.0), pkgload(v.1.4.1), reshape2(v.1.4.4), biomaRt(v.2.64.0), XML(v.3.99-0.19), evaluate(v.1.0.4), biovizBase(v.1.56.0), BiocManager(v.1.30.26), nloptr(v.2.2.1), httpuv(v.1.6.16), tidyr(v.1.3.1), purrr(v.1.1.0), broom(v.1.0.10), xtable(v.1.8-4), restfulr(v.0.0.16), AnnotationFilter(v.1.32.0), fANCOVA(v.0.6-1), later(v.1.4.3), viridisLite(v.0.4.2), snow(v.0.4-4), OrganismDbi(v.1.50.0), tibble(v.3.3.0), lmerTest(v.3.1-3), memoise(v.2.0.1), AnnotationDbi(v.1.70.0), GenomicAlignments(v.1.44.0), cluster(v.2.1.8.1), sva(v.3.56.0) and GSEABase(v.1.70.0)

message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 015b6b24225f280c620d26d9f5e7ed2caa6139a7
## This is hpgltools commit: Fri Oct 17 09:49:54 2025 -0400: 015b6b24225f280c620d26d9f5e7ed2caa6139a7
##message(paste0("Saving to ", savefile))
## tmp <- sm(saveme(filename = savefile))
tmp <- loadme(filename = savefile)
---
title: "TMRC2 `r Sys.getenv('VERSION')`: Visualizing Analyses before DE/variant analyses"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
bibliography: atb.bib
output:
  html_document:
    code_download: true
    code_folding: show
    fig_caption: true
    fig_height: 9
    fig_width: 9
    highlight: zenburn
    keep_md: false
    mode: selfcontained
    number_sections: true
    self_contained: true
    theme: readable
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
---

<style type="text/css">
body .main-container {
  max-width: 1600px;
}
body, td {
  font-size: 16px;
}
code.r{
  font-size: 16px;
}
pre {
  font-size: 16px
}
</style>

```{r, include = FALSE}
library(CMplot)
library(dplyr)
library(GenomicRanges)
library(ggbio)
library(ggplot2)
library(glue)
library(Heatplus)
library(hpgltools)

knitr::opts_knit$set(progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 9, fig.height = 9, fig.retina = 2,
  out.width = "100%", dev = "png",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- Sys.getenv("VERSION")
previous_file <- ""
rundate <- format(Sys.Date(), format = "%Y%m%d")

## tmp <- try(sm(loadme(filename = gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = previous_file))))
rmd_file <- "02pre_visualization.Rmd"
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
loaded <- load(file = glue("rda/tmrc2_data_structures-v{ver}.rda"))
```

Before we begin, a couple of parameters which have given me grief.

```{r}
## Used by the various functions which cross reference grange data
## The SEs used in this document are getting this from the orgdb
## which includes this information in multiple columns with different
## chromosome ID prefixes.  E.g. sometimes it is just 1,2,3, ... other times
## it is LpaL1, LpaL2, LpaL3, ...
exp_chr_col <- "sequence_id"
## The tritrypdb also puts the start/stop/strand information in multiple places
exp_start_col <- "coding_start"
exp_end_col <- "coding_end"
```

# Introduction

This document will visualize the TMRC2 samples before completing the various differential
expression and variant analyses in the hopes of getting an understanding of how the various
samples relate to each other.

## Initial library size

Start off with the library sizes of the original dataset.  The main
thing to note is that we have quite a large variance in coverage.  A
few of these samples are highly likely to be removed shortly (looking
at you, TMRC20001 and TMRC20095).

```{r}
libsizes <- plot_libsize(lp_se)
libsizes
dev <- pp("images/lp_se_libsizes.png", width = 18, height = 9)
libsizes$plot
closed <- dev.off()
```

Library sizes of the protein coding gene counts observed per sample.
The samples were mapped with the EuPathDB revision 36 of the
Leishmania (Viannia) panamensis strain MHOM/COL/81L13 genome; the
alignments were sorted, indexed, and counted via htseq using the gene
features, and non-protein coding features were excluded.
The per-sample sums of the remaining matrix were plotted to check that
the relative sample coverage is sufficient and not too divergent
across samples.  Bars are colored according to strain/zymodeme
annotation: zymodeme 2.3 is red; zymodeme 2.2 is blue; the Leishmania
braziliensis-like strains b2904, z1.0, and z1.5 are purple; z2.4, the
zymodeme most similar to 2.3, is light brown; and the zymodemes most
similar to 2.2 (z3.0, z2.0, z2.1, and z3.2) are light gray, dark gray,
dark brown, and gray, respectively.

## Non-zero genes with respect to coverage

This plot is usually our primary arbiter for removing samples based on
coverage.  We pick a semi-arbitrary cutoff based on both coverage and
genes observed.  In this instance 8,600 genes seems likely?

The cutoff argument prints out samples whose gene coverage falls below
that proportion.  I think we already dropped the most problematic
samples in the sample sheet, so it may not actually print anything.

```{r}
## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_se, cutoff = 0.7, y_intercept = 0.99)
nonzero
dev <- pp(file = "images/lp_nonzero.png", width = 9, height = 9)
nonzero$plot
closed <- dev.off()
```

Differences in relative gene content with respect to sequencing
coverage.  The per-sample number of observed genes was plotted with
respect to the relative CPM coverage in order to check that the
samples are sufficiently and similarly diverse.  Many samples were
observed near or at the putative asymptote of likely gene content; no
samples were observed with fewer than 65% of the Leishmania panamensis
genes included.  Note that the range of genes observed is quite small,
8500 <= x < 8700 genes, however this was plotted after already
excluding samples with fewer than 8500 genes observed (of which there
were 2) and any samples with fewer than 5 million protein coding
mapped reads (there were 2 samples that had more than 8500 genes
observed in less than 5 million reads).

```{r}
lp_box <- plot_boxplot(lp_se)
dev <- pp(file = "images/lp_se_boxplot.png", width = 16, height = 9)
lp_box
closed <- dev.off()
lp_box
```

The distribution of observed counts / gene for all samples was plotted
as a boxplot on the log2 (it looks like it is log10, but I checked)
scale.  In contrast to host transcriptome distribution, the parasite
distribution of reads/gene is log-normal.

```{r}
filter_plot <- plot_libsize_prepost(lp_se)
filter_plot$lowgene_plot
filter_plot$count_plot
```

The number of genes removed by low-count filtering is drastically
lower in parasite samples than in human samples.  Thus, even though
the range of coverage for the parasite samples is from near 0 to ~ 150
CPM, the number of genes removed by the default low-count filter
ranges only from 40 to 129, and the number of reads associated with
them ranges only from 100 to 3168.
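To make the two quantities above concrete (genes removed and the reads that go with them), here is a small sketch on toy data.  The filtering rule, the thresholds `min_count` and `min_samples`, and all variable names are assumptions for illustration; the default low-count filter in hpgltools may well use a different rule.

```r
## Toy counts: 6 genes by 3 samples; gene5 and gene6 are barely observed.
toy_counts <- matrix(c(50, 20, 0, 35, 1, 0,
                       60, 25, 3, 40, 0, 1,
                       55, 30, 2, 45, 0, 0),
                     nrow = 6,
                     dimnames = list(paste0("gene", 1:6), c("s1", "s2", "s3")))
## Hypothetical rule: keep genes with at least 2 reads in at least 2 samples.
min_count <- 2
min_samples <- 2
keep <- rowSums(toy_counts >= min_count) >= min_samples
## Number of genes removed, and the per-sample reads discarded along with them.
genes_removed <- sum(!keep)
reads_removed <- colSums(toy_counts[!keep, , drop = FALSE])
genes_removed
reads_removed
```

Here two genes are dropped and they carry at most one read per sample, which mirrors the observation that the filter removes very little from the parasite data.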

```{r}
table(colData(lp_se)[["zymodemecategorical"]])
table(colData(lp_se)[["clinicalresponse"]])
```

# Transcriptome visualizations

## Distribution Visualizations

Najib's favorite plots are of course the PCA/TSNE.  These are nice to look at in
order to get a sense of the relationships between samples.  They also provide a
good opportunity to see what happens when one applies different normalizations,
surrogate analyses, filters, etc.  In addition, one may set different
experimental factors as the primary 'condition' (usually the color of plots) and
surrogate 'batches'.

## By Susceptibility

Take column 'Q' in the sample sheet and make a categorical version of it with these parameters:

* 0 <= x <= 35 is resistant
* 36 <= x <= 48 is ambiguous
* 49 <= x is sensitive
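The binning above can be sketched with cut().  The values in `sus_values` are made up, and the "unknown" label for missing measurements is my assumption rather than something taken from the sample sheet.

```r
## Hypothetical susceptibility values, standing in for column 'Q'.
sus_values <- c(0, 12, 35, 36, 48, 49, 87, NA)
## Right-closed bins: up to 35, 36 to 48, 49 and above.
sus_category <- cut(sus_values,
                    breaks = c(-Inf, 35, 48, Inf),
                    labels = c("resistant", "ambiguous", "sensitive"))
## Samples without a measurement are labeled explicitly.
sus_category <- as.character(sus_category)
sus_category[is.na(sus_category)] <- "unknown"
table(sus_category)
```

Note that with these breaks a non-integer value such as 35.5 would land in 'ambiguous'; if the column only holds integers, that never comes up.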

```{r}
strain_norm <- normalize(lp_strain, norm = "quant", transform = "log2",
                         convert = "cpm", filter = TRUE)
zymo_pca <- plot_pca(strain_norm, plot_title = "PCA of parasite expression values",
                     plot_labels = FALSE)
zymo_pca
dev <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24.pdf")
zymo_pca$plot
closed <- dev.off()

lp_strain_known <- subset_se(lp_strain, subset = "clinicalcategorical!='unknown'")
strain_known_norm <- normalize(lp_strain_known, norm = "quant", transform = "log2",
                               convert = "cpm", filter = TRUE)
zymo_known_pca <- plot_pca(strain_known_norm, plot_title = "PCA of parasite expression values",
                           plot_labels = FALSE)
zymo_known_pca
dev <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24_only_known_clinical.pdf")
zymo_known_pca$plot
closed <- dev.off()
```

## Limit to three strains: 2.1/2.2/2.3

```{r}
only_three_types <- subset_se(lp_strain,
                              subset = "condition=='z2.1'|condition=='z2.3'|condition=='z2.2'")
only_three_norm <- normalize(only_three_types, norm = "quant", transform = "log2",
                             convert = "cpm", batch = FALSE, filter = TRUE) %>%
  set_batches(fact = "phase")
onlythree_pca <- plot_pca(only_three_norm, plot_labels = FALSE,
                          plot_title = "PCA of z2.1, z2.2 and z2.3 parasite expression values")
pp(file = "images/promastigote_threetypes_zymocol_noshape.png")
onlythree_pca$plot
dev.off()
onlythree_pca
```

## By my ML knn classifier!

I added the result from my kmer classifier to the sample sheet, let us
see how that looks.

```{r}
lp_strain_knn <- set_conditions(lp_strain, fact = "knnv2classification")
strain_norm_knn <- normalize(lp_strain_knn, norm = "quant", transform = "log2",
                             convert = "cpm", filter = TRUE)
zymo_pca_knn <- plot_pca(strain_norm_knn, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
dev <- pp(file = "images/promastigote_zymocol_sensshape_knnv2.png")
zymo_pca_knn$plot
closed <- dev.off()
zymo_pca_knn

strain_nobatch <- set_batches(strain_norm, fact = "sourcelab")
zymo_pcav2 <- plot_pca(strain_nobatch, plot_title = "PCA of parasite expression values",
                       plot_labels = FALSE)
dev <- pp(file = "images/promastigote_zymocol_nobatch.png")
zymo_pcav2$plot
closed <- dev.off()
zymo_pcav2

strain_nb <- normalize(lp_strain, convert = "cpm", transform = "log2",
                       filter = TRUE, batch = "svaseq")
strain_nb_pca <- plot_pca(strain_nb, plot_title = "PCA of parasite expression values",
                          plot_labels = FALSE)
dev <- pp(file = "images/clinical_nb_pca_sus_shape.png")
strain_nb_pca$plot
closed <- dev.off()
strain_nb_pca
```

### Silly plotly

```{r}
plotly_plot <- plotly::ggplotly(zymo_pca_knn$plot)
print(plotly_plot)
```

Add explicit labels for a few reference strains:

* TMRC20023: Excluded due to coverage (only 7k reads)
* TMRC20006: This one has 19,815,673 reads, but a weirdly small number of genes and got excluded.
* TMRC20029: This has 1,946,986 reads and so was excluded.
* TMRC20034: Not sequenced

**NOTE** These samples were all removed from examination in the
_sample_sheet_ in 202404 and so will not appear in this plot.  Thus I
am turning off the following block.

```{r, eval=FALSE}
samples_to_label <- c("TMRC20023", "TMRC20006", "TMRC20029", "TMRC20007", "TMRC20034",
                      "TMRC20008", "TMRC20027", "TMRC20028", "TMRC20032", "TMRC20040")

label_entries <- zymo_pca$table[samples_to_label, ]
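## Quoted names inside aes() are treated as constant strings; use the .data
## pronoun to refer to the columns of label_entries instead.
zymo_pca$plot +
  geom_text(mapping = aes(x = .data[["PC1"]], y = .data[["PC2"]],
                          label = .data[["sampleid"]]),
            data = label_entries)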
```

Some likely text for a figure legend might include something like the
following (paraphrased from Najib's 2016 dual transcriptome profiling
paper (10.1128/mBio.00027-16)):

Expression profiles of the promastigote samples across multiple
strains. Each glyph represents one sample, colors delineate the
various strains and fall into two primary clades.  Red samples are
zymodeme 2.3, blue samples are zymodeme 2.2.  The difference between
these two primary groups makes up approximately 17% of the variance in
the PCA.  Purple samples are Leishmania braziliensis or zymodeme
1.0/1.5 samples, orange are z2.4, browns and greys are z2.1, z2.0,
z3.0, and z3.2 respectively.  This analysis was performed following a
low-count filter, cpm conversion, quantile normalization, and a log2
transformation.  No batch factor was used, nor was a surrogate
variable estimation performed.

Some interpretation for this figure might include:

When PCA was performed on the promastigote samples, the dominant
component observed (which still accounts for a relatively small amount
of variance) coincided with the two primary strain groups, zymodeme
2.2 and 2.3.  With the exception of some Leishmania braziliensis
samples, all promastigote samples assayed fell into one of these two
categories.

When surrogate variable estimation was performed on the entire set of
samples, it increased the apparent strain-dependent variance, but had
some potentially problematic effects for a couple of samples (one z2.3
sample now lies with the z2.2 samples).  It is assumed that this
is because sva attempted to estimate surrogate values for the
less-represented strains, with some unintended consequences for sample
TMRC20095 (which, along with TMRC20008, is one of the two least
covered samples by a significant margin).  This hypothesis may be
tested by excluding the braziliensis and non-z2.2/2.3 samples and
repeating (when this is performed later in the document, the
difference between the two primary clades increases to 49.33% of the
variance and there are no odd samples).
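For readers wondering where a statement like "17% of the variance" comes from, the following is a minimal base R sketch on simulated counts.  It skips the quantile normalization and low-count filter used in this document, and every name in it (`toy_counts`, `group`, `pct_var`) is hypothetical; it only shows how the per-component percentage is derived from prcomp().

```r
set.seed(1)
## Toy counts: 200 genes by 6 samples in two groups of three.
group <- rep(c("a", "b"), each = 3)
toy_counts <- matrix(rpois(200 * 6, lambda = 100), nrow = 200,
                     dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
## Give the first 40 genes a higher expression in group b.
toy_counts[1:40, group == "b"] <- toy_counts[1:40, group == "b"] * 4
## Counts per million followed by log2 with a pseudocount.
cpm <- t(t(toy_counts) / colSums(toy_counts)) * 1e6
log_cpm <- log2(cpm + 1)
## prcomp() expects samples in rows.
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)
## Each component's share of the total variance, as a percentage.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(pct_var, 2)
```

The first element of `pct_var` corresponds to the percentage usually printed on the PC1 axis label.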

```{r}
zymo_tsne <- plot_tsne(strain_norm, plot_title = "TSNE of parasite expression values")
zymo_tsne

strain_nb_tsne <- plot_tsne(strain_nb, plot_title = "TSNE of parasite expression values")
strain_nb_tsne

corheat <- plot_corheat(strain_norm,
                        plot_title = "Correlation heatmap of parasite expression values")
corheat

disheat <- plot_disheat(strain_norm,
                        plot_title = "Distance heatmap of parasite expression values")
disheat$plot

plot_sm(strain_norm)
```

Potential start for a figure legend:

Global relationships among the promastigote transcriptional
profiles.  Pairwise Pearson correlations and Euclidean distances were
calculated using the normalized expression matrices.  Colors along the
top row delineate the experimental conditions (same colors as the PCA).
Samples were clustered by nearest neighbor clustering and each colored
tile describes either one correlation value between two samples (red
to white delineates Pearson correlation values of the 8,710 normalized
gene values between two samples ranging from <= 0.7 to >= 1.0) or
the Euclidean distance between two samples (dark blue to white
delineates identical to a normalized Euclidean distance of >= 110).

Some interpretation for this figure might include:

When the global relationships among the samples were distilled down to
individual Euclidean distances or Pearson correlation coefficients
between pairs of samples, the primary clustering among samples
observed was according to strain.  The primary significant outlier
sample (TMRC20095) is explicitly due to low coverage.  The other
outlier strains are either braziliensis (purple) or a series of
strains which, when viewed in IGV, appear to have genetic variants
which bridge the differences between the two primary zymodemes,
particularly on the known aneuploid chromosomes.
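The two metrics behind these heatmaps can be sketched with base R alone.  This uses simulated values rather than the normalized TMRC2 matrix, and the object names are hypothetical; it is only meant to show the orientation each function expects.

```r
set.seed(2)
## Toy normalized expression: 100 genes by 4 samples.
toy_expr <- matrix(rnorm(400, mean = 8, sd = 2), nrow = 100,
                   dimnames = list(paste0("gene", 1:100), paste0("s", 1:4)))
## Make s2 a noisy copy of s1 so that one pair is clearly more similar.
toy_expr[, "s2"] <- toy_expr[, "s1"] + rnorm(100, sd = 0.1)
## cor() compares columns, so samples are already in the right orientation.
sample_cor <- cor(toy_expr, method = "pearson")
## dist() compares rows, so transpose to get sample-by-sample distances.
sample_dist <- as.matrix(dist(t(toy_expr), method = "euclidean"))
round(sample_cor, 2)
round(sample_dist, 1)
```

The s1/s2 pair ends up with the highest correlation and the smallest distance, which is the pattern the heatmap clustering picks up on.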

## Limit to just two strains: 2.2/2.3

```{r}
lp_two_strains_norm <- normalize(lp_zymo, norm = "quant", transform = "log2",
                                 convert = "cpm", batch = FALSE, filter = TRUE)
onlytwo_pca <- plot_pca(lp_two_strains_norm, plot_title = "PCA of z2.2 and z2.3 parasite expression values",
                        plot_labels = FALSE)
dev <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape.pdf")
onlytwo_pca$plot
closed <- dev.off()
onlytwo_pca$plot

lp_two_strains_known <- subset_se(lp_zymo, subset = "clinicalcategorical!='unknown'")
lp_two_strains_known_norm <- normalize(lp_two_strains_known, norm = "quant", transform = "log2",
                                       convert = "cpm", batch = FALSE, filter = TRUE)
onlytwo_known_pca <- plot_pca(lp_two_strains_known_norm, plot_labels = FALSE,
                              plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape_only_known.pdf")
onlytwo_known_pca$plot
closed <- dev.off()
onlytwo_known_pca

lp_two_strains_nb <- normalize(lp_zymo, transform = "log2", convert = "cpm",
                               batch = "svaseq", filter = TRUE)
onlytwo_pca_nb <- plot_pca(lp_two_strains_nb, plot_labels = FALSE,
                           plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape_nb.pdf")
onlytwo_pca_nb$plot
closed <- dev.off()
onlytwo_pca_nb$plot
```

## By Cure/Fail status

This is by far the most problematic comparison.  I think the only
interpretation of the following images is that the parasite has little
effect on the likelihood that a person will successfully end
treatment.  There does appear to be some variance associated with
cure/fail, but only in a few samples (visible in ~10 fail samples and
perhaps ~8 cure samples when sva is applied to the data).

```{r}
cf_norm <- normalize(lp_cf, convert = "cpm", transform = "log2",
                     norm = "quant", filter = TRUE)
start_cf <- plot_pca(cf_norm, plot_title = "PCA of parasite expression values",
                     plot_labels = FALSE)
dev <- pp(file = "figures/cure_fail_sus_shape_all.pdf")
start_cf$plot
closed <- dev.off()
start_cf

lp_cf_known <- subset_se(lp_cf, subset = "clinicalcategorical!='unknown'")
cf_known_norm <- normalize(lp_cf_known, convert = "cpm", transform = "log2",
                           norm = "quant", filter = TRUE)
start_cf_known <- plot_pca(cf_known_norm, plot_title = "PCA of parasite expression values",
                           plot_labels = FALSE)
dev <- pp(file = "figures/cure_fail_sus_shape_known.pdf")
start_cf_known$plot
closed <- dev.off()
start_cf_known

only_two_cf <- set_conditions(lp_zymo, fact = "clinicalcategorical",
                              colors = color_choices[["cf"]]) %>%
  set_batches(fact = "sus_category_current")

only_two_cf_norm <- normalize(only_two_cf, norm = "quant", transform = "log2",
                              convert = "cpm", batch = FALSE, filter = TRUE)
only_two_cf_pca <- plot_pca(only_two_cf_norm, plot_labels = FALSE,
                            plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23.pdf")
only_two_cf_pca$plot
dev.off()
only_two_cf_pca

only_two_cf_known <- subset_se(only_two_cf, subset = "condition!='unknown'")
only_two_cf_known_norm <- normalize(only_two_cf_known, norm = "quant", transform = "log2",
                                    convert = "cpm", batch = FALSE, filter = TRUE)
only_two_cf_known_pca <- plot_pca(only_two_cf_known_norm, plot_labels = FALSE,
                                  plot_title = "PCA of z2.2 and z2.3 parasite expression values")
dev <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23_known.pdf")
only_two_cf_known_pca$plot
dev.off()
only_two_cf_known_pca

cf_nb <- normalize(lp_cf, convert = "cpm", transform = "log2",
                   filter = TRUE, batch = "svaseq")
cf_nb_pca <- plot_pca(cf_nb, plot_title = "PCA of parasite expression values",
                      plot_labels = FALSE)
dev <- pp(file = "images/cf_sus_share_nb.png")
cf_nb_pca$plot
closed <- dev.off()
cf_nb_pca

cf_norm <- normalize(lp_cf, transform = "log2", convert = "cpm",
                     filter = TRUE, norm = "quant")
## Getting an error which really does not make sense, I ran it manually and it worked fine.
test <- pca_information(cf_norm, num_components = 6, plot_pcas = TRUE,
                        factors = c("clinicalcategorical", "zymodemecategorical",
                                    "pathogenstrain", "passagenumber"))
test$anova_p
test$cor_heatmap
```

## By Current drug sensitivity assay data

We have two competing metrics of antimonial sensitivity; one historical
and one current.  In both cases there is a reasonable expectation that
resistant strains tend to be zymodeme 2.3 and sensitive strains tend
to be zymodeme 2.2.  There appear to be more exceptions to this rule
of thumb in the current data than the historical.

```{r}
dim(assay(lp_susceptibility))
sus_norm <- normalize(lp_susceptibility, transform = "log2", convert = "cpm",
                      norm = "quant", filter = TRUE)
sus_pca <- plot_pca(sus_norm, plot_title = "PCA of parasite expression values",
                    plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_pca.svg")
sus_pca[["plot"]]
closed <- dev.off()
dev <- pp(file = "figures/sus_norm_pca.pdf")
sus_pca[["plot"]]
closed <- dev.off()
sus_pca

lp_susceptibility_known <- subset_se(lp_susceptibility, subset = "batch!='unknown'")
sus_known_norm <- normalize(lp_susceptibility_known, transform = "log2", convert = "cpm",
                            norm = "quant", filter = TRUE)
sus_known_pca <- plot_pca(sus_known_norm, plot_title = "PCA of parasite expression values",
                          plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_known_pca.pdf")
sus_known_pca[["plot"]]
closed <- dev.off()
sus_known_pca

lp_sus_two <- subset_se(lp_susceptibility, subset = "zymodemecategorical!='z21'") %>%
  subset_se(subset = "zymodemecategorical!='z24'")
sus_two_norm <- normalize(lp_sus_two, transform = "log2", convert = "cpm",
                          norm = "quant", filter = TRUE)
sus_two_pca <- plot_pca(sus_two_norm, plot_title = "PCA of parasite expression values",
                        plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_two_pca.pdf")
sus_two_pca[["plot"]]
closed <- dev.off()
sus_two_pca

lp_sus_two_known <- subset_se(lp_sus_two, subset = "clinicalcategorical!='unknown'")
sus_two_known_norm <- normalize(lp_sus_two_known, transform = "log2", convert = "cpm",
                                norm = "quant", filter = TRUE)
sus_two_known_pca <- plot_pca(sus_two_known_norm, plot_title = "PCA of parasite expression values",
                              plot_labels = FALSE)
dev <- pp(file = "figures/sus_norm_two_known_pca.pdf")
sus_two_known_pca[["plot"]]
closed <- dev.off()
sus_two_known_pca

sus_nb <- normalize(lp_susceptibility, transform = "log2", convert = "cpm",
                    batch = "svaseq", filter = TRUE)
sus_nb_pca <- plot_pca(sus_nb, plot_title = "PCA of parasite expression values",
                       plot_labels = FALSE)
dev <- pp(file = "images/sus_nb_pca.png")
sus_nb_pca[["plot"]]
closed <- dev.off()
sus_nb_pca
```

## By Historical drug sensitivity assay data

```{r}
sus_hist_norm <- normalize(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                           norm = "quant", filter = TRUE)
sus_hist_pca <- plot_pca(sus_hist_norm, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
dev <- pp(file = "images/sus_hist_norm_pca.png")
sus_hist_pca[["plot"]]
closed <- dev.off()
sus_hist_pca

sus_hist_nb <- normalize(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                         batch = "svaseq", filter = TRUE)
sus_hist_nb_pca <- plot_pca(sus_hist_nb, plot_title = "PCA of parasite expression values",
                            plot_labels = FALSE)
dev <- pp(file = "images/sus_hist_nb_pca.png")
sus_hist_nb_pca[["plot"]]
closed <- dev.off()
sus_hist_nb_pca
```

## Zymodeme enzyme gene IDs

Najib read me an email listing off the gene names associated with the zymodeme
classification.  I took those names and cross referenced them against the
Leishmania panamensis gene annotations and found the following:

They are:

1. ALAT: LPAL13_120010900 -- alanine aminotransferase
2. ASAT: LPAL13_340013000 -- aspartate aminotransferase
3. G6PD: LPAL13_000054100 -- glucose-6-phosphate 1-dehydrogenase
4. NH: LPAL13_140006100, LPAL13_180018500 -- inosine-guanine nucleoside hydrolase
5. MPI: LPAL13_320022300 (maybe) -- mannose phosphate isomerase (I chose phosphomannose isomerase)

Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some
looking for specific differences among the various samples.

### Expression levels of zymodeme genes

The following creates a colorspace (red to green) heatmap showing the observed
expression of these genes in every sample.

```{r}
my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
              "other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")

zymo_se <- exclude_genes(strain_norm, ids = my_genes, method = "keep")
zymo_heatmap <- plot_sample_heatmap(zymo_se, row_label = my_names)
zymo_heatmap
```

A recent suggestion included a query about the relationship of our
amastigote TMRC2 samples which were the result of infecting a set of
macrophages vs. these promastigote samples.

So far, we have kept these two experiments separate; now let us merge them.

```{r}
tmrc2_macrophage_norm <- normalize(lp_macrophage, transform = "log2", convert = "cpm",
                                   norm = "quant", filter = TRUE)

## Hey you, this annotation call should be made automatic for the container!
annotation(lp_se) <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
annotation(lp_macrophage) <- annotation(lp_se)
all_tmrc2 <- hpgltools:::combine_se(lp_se, lp_macrophage)
```

Before we can use the combined data, we must reconcile a few aspects
of it; notably, we need to specify which samples are amastigotes and
which are promastigotes.

```{r}
all_nosb <- all_tmrc2
colData(all_nosb)[["stage"]] <- "promastigote"
na_idx <- is.na(colData(all_nosb)[["macrophagetreatment"]])
colData(all_nosb)[na_idx, "macrophagetreatment"] <- "undefined"
all_nosb <- subset_se(all_nosb, subset = "macrophagetreatment!='inf_sb'")
ama_idx <- colData(all_nosb)[["macrophagetreatment"]] == "inf"
colData(all_nosb)[ama_idx, "stage" ] <- "amastigote"

## Make sure that the zymodeme does not have the inf_ prefix.
zymodeme_char <- gsub(x = colData(all_nosb)[["condition"]], pattern = "^inf_", replacement = "")
colData(all_nosb)[["condition"]] <- zymodeme_char

colData(all_nosb)[["batch"]] <- colData(all_nosb)[["stage"]]
all_nosb <- subset_se(all_nosb, subset = "condition!='none'")
all_norm <- normalize(all_nosb, convert = "cpm", norm = "quant",
                      transform = "log2", filter = TRUE)
pro_ama_pca <- plot_pca(all_norm)
pro_ama_pca[["plot"]]
```

I think the above picture is sort of the opposite of what we want to
compare in a DE analysis for this set of data, e.g. we want to compare
promastigotes to amastigotes?

```{r}
two_nosb <- set_batches(all_nosb, fact = "condition") %>%
  set_conditions(fact = "stage") %>%
  subset_se(subset = "batch=='z2.2'|batch=='z2.3'")

two_norm <- normalize(two_nosb, convert = "cpm", norm = "quant",
                      transform = "log2", filter = TRUE)
pro_ama_two_pca <- plot_pca(two_norm)
pro_ama_two_pca[["plot"]]

zy_stage_factor <- paste0(colData(two_nosb)[["batch"]], "_",
                          colData(two_nosb)[["stage"]])
colData(two_nosb)[["zystage"]] <- zy_stage_factor
zystage <- set_conditions(two_nosb, fact = "zystage")

zystage_norm <- normalize(zystage, filter = TRUE, norm = "quant",
                          convert = "cpm", transform = "log2")
plot_pca(zystage_norm)$plot

zystage_keepers <- list(
  "z2322_ama" = c("z23_amastigote", "z22_amastigote"),
  "z2322_pro" = c("z23_promastigote", "z22_promastigote"),
  "proama_z23" = c("z23_amastigote", "z23_promastigote"),
  "proama_z22" = c("z22_amastigote", "z22_promastigote"))

zystage_de <- all_pairwise(zystage, filter = TRUE, model_batch = "svaseq",
                           model_fstring = "~ 0 + condition")

zystage_tables <- combine_de_tables(
  zystage_de, keepers = zystage_keepers,
  excel = glue("excel/zymodeme_stage_table-v{ver}.xlsx"))
```

# Gene expression with respect to chromosome

I want to make a plot where the x-axis is the number of genes on a chromosome and the
y-axis is the mean of the expression of those genes.
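The summary behind such a plot is just a per-chromosome count and mean.  Here is a minimal sketch with a made-up annotation table; the column names and values are hypothetical and this is not how plot_assay_by_chromosome() is implemented.

```r
## Toy annotation: one chromosome label and one mean expression value per gene.
toy_annot <- data.frame(
  gene = paste0("gene", 1:6),
  chromosome = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr3"),
  mean_expr = c(2, 4, 6, 10, 20, 7))
## Number of genes and mean expression for every chromosome.
genes_per_chr <- table(toy_annot[["chromosome"]])
mean_per_chr <- tapply(toy_annot[["mean_expr"]], toy_annot[["chromosome"]], mean)
by_chr <- data.frame(
  chromosome = names(genes_per_chr),
  num_genes = as.numeric(genes_per_chr),
  mean_expr = as.numeric(mean_per_chr[names(genes_per_chr)]))
by_chr
```

The resulting `by_chr` table has one row per chromosome, so `num_genes` can go on the x-axis and `mean_expr` on the y-axis.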

```{r}
assay_by_chr_plot <- plot_assay_by_chromosome(lp_zymo, chromosome_column = "chromosome")
assay_by_chr_plot[["plot"]]
```

# SNP profiles

One potentially interesting aspect of the variant data: it may be able
to help us define the zymodeme state of previous, untested samples.

In order to test this, I am loading some of the 2016 data alongside
the new TMRC2 data to see if they fit together.

This uses an older dataset which I am not sure we have permission to
include in the container, so I am turning the following block off for
now.

```{r, eval=FALSE}
old_se <- create_se("sample_sheets/tmrc2_samples_20191203.xlsx",
                        file_column = "tophat2file")

tt <- old_se$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
old_se$expressionset <- tt
rm(tt)
```

## Create the SNP expressionset

One other important caveat: we have a group of new samples which have
not yet been run through the variant search pipeline, so I need to
remove them from consideration.  Though it looks like they finished
overnight...

In the non-containerized version of this document, the following block
combines an older dataset with the current data.

```{r}
both_norm <- normalize(new_snps_sufficient, transform = "log2", norm = "quant") %>%
  set_conditions(fact = "pathogenstrain")
```

In the non-containerized version, the data structure 'both_norm'
contains our 2016 data along with the newer data collected since 2019.

## Plot of SNP profiles for zymodemes

The following plot shows the SNP profiles of all samples (old and new) where the
colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the
previous samples (purple), or the various lab strains (pink etc).

```{r}
new_variant_heatmap <- plot_disheat(new_snps_sufficient)
dev <- pp(file = "images/raw_snp_disheat.png", height = 12, width = 12)
new_variant_heatmap$plot
closed <- dev.off()
new_variant_heatmap$plot
```

The function get_snp_sets() takes the provided metadata factor (in
this case 'condition') and looks for variants which are exclusive to
each element in it.  In this case, this is looking for differences
between 2.2 and 2.3, as well as the set shared among them.
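To illustrate the idea of group-exclusive and shared variants, here is a small sketch on a toy presence/absence matrix.  The rule used (a variant belongs to a group when every sample in the group carries it, and is exclusive when no sample of the other group does) is my assumption for the example; get_snp_sets() may apply different criteria, and all names below are hypothetical.

```r
## Toy presence/absence matrix: 5 variants (rows) by 4 samples (columns).
toy_variants <- matrix(c(1, 1, 0, 0,
                         0, 0, 1, 1,
                         1, 1, 1, 1,
                         1, 0, 0, 0,
                         0, 0, 1, 1),
                       nrow = 5, byrow = TRUE,
                       dimnames = list(paste0("var", 1:5), paste0("s", 1:4)))
toy_condition <- c("z2.2", "z2.2", "z2.3", "z2.3")
is_22 <- toy_condition == "z2.2"
is_23 <- toy_condition == "z2.3"
## For each group: is the variant in all of its samples, or in any of them?
all_22 <- rowSums(toy_variants[, is_22, drop = FALSE]) == sum(is_22)
any_22 <- rowSums(toy_variants[, is_22, drop = FALSE]) > 0
all_23 <- rowSums(toy_variants[, is_23, drop = FALSE]) == sum(is_23)
any_23 <- rowSums(toy_variants[, is_23, drop = FALSE]) > 0
exclusive_22 <- names(which(all_22 & !any_23))
exclusive_23 <- names(which(all_23 & !any_22))
shared <- names(which(all_22 & all_23))
exclusive_22
exclusive_23
shared
```

In this toy matrix var4 is seen in only one z2.2 sample, so it is neither exclusive nor shared under the assumed rule.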

```{r}
snp_sets <- get_snp_sets(new_snps_sufficient, factor = "condition")
snp_sets
##Biobase::annotation(old_se$expressionset) = Biobase::annotation(lp_se$expressionset)
##both_se <- combine_ses(lp_se, old_se)

snp_genes <- snps_vs_genes(lp_se, snp_sets, chr_column = exp_chr_col,
                           start_column = exp_start_col, end_column = exp_end_col)
## I think we have some metrics here we can plot...
snp_subset <- snp_subset_genes(
  lp_se, new_snps_sufficient, start_column = exp_start_col, end_column = exp_end_col,
  exp_name_column = exp_chr_col,
  genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
            "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))
tt <- normalize(snp_subset, transform = "log2", filter = TRUE)
zymo_heat <- plot_sample_heatmap(tt, row_label = rownames(assay(snp_subset)))
zymo_heat
```

## Compare variants to DE genes

Najib has asked a few times about the relationship between variants
and DE genes.  In subsequent conversations I figured out that what he
really wants to learn about is variants in the UTR (most likely 5')
which might affect expression of genes.  The following explicitly does
not answer that question, but asks a parallel one: is there a
relationship between variants in the CDS and differential expression?

### Collect DE data

In order to do this comparison, we need to reload some of the DE results.

These blocks need to be moved to the post-differential expression analyses.

```{r reload_de_results, eval=FALSE}
rda <- glue("rda/zymo_tables_sva-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
zy_df <- get0(varname)[["data"]][["zymodeme"]]
```

```{r variants_vs_de, eval=FALSE}
vars_df <- data.frame(ID = names(snp_genes$summary_by_gene), variants = as.numeric(snp_genes$summary_by_gene))
vars_df[["variants"]] <- log2(vars_df[["variants"]] + 1)
vars_by_de_gene <- merge(zy_df, vars_df, by.x = "row.names", by.y = "ID")
cor.test(vars_by_de_gene$deseq_logfc, vars_by_de_gene$variants)
variants_wrt_logfc <- plot_linear_scatter(vars_by_de_gene[, c("deseq_logfc", "variants")])
variants_wrt_logfc$scatter
## It looks like there might be some genes of interest, even though this is not actually
## the question of interest.
```

Didn't I create a set of densities by chromosome?
Oh I think they come in from get_snp_sets()

## SNPs associated with clinical response in the TMRC samples

```{r}
clinical_sets <- get_snp_sets(new_snps_sufficient, factor = "clinicalresponse")
clinical_sets

density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
var_den_chr <- ggplot(density_df, aes(x = chr, y = density_vec)) +
  ggplot2::geom_col() +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
                 axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))
var_den_chr
pp(file = "figures/variant_density_by_chromosome.pdf")
var_den_chr
dev.off()
## oops, forgot to export write_snps...  fixed.
clinical_written <- write_snps(new_snps_sufficient, output_file = "clinical_variants.aln")
```

### Cross reference these variants by gene

```{r}
clinical_genes <- snps_vs_genes(lp_se, clinical_sets, chr_column = exp_chr_col,
                                start_column = exp_start_col, end_column = exp_end_col)

snp_density <- merge(as.data.frame(clinical_genes[["summary"]]),
                     as.data.frame(rowData(lp_se)),
                     by = "row.names")
snp_density <- snp_density[, c(1, 2, 4, 15)]
colnames(snp_density) <- c("name", "snps", "product", "length")
snp_density[["product"]] <- tolower(snp_density[["product"]])
snp_density[["length"]] <- as.numeric(snp_density[["length"]])
snp_density[["density"]] <- as.numeric(snp_density[["snps"]]) / snp_density[["length"]]
snp_idx <- order(snp_density[["density"]], decreasing = TRUE)
snp_density <- snp_density[snp_idx, ]

removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
  drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
  snp_density <- snp_density[!drop_idx, ]
}
## Filter these for [A|a]mastin gp63 Leishmanolysin
```

Let us grab out the number of variants/gene for the cure/fail samples,
merge them into a dataframe, and add that to the gene annotations for
the lp_se datastructure.

```{r}
clinical_snps <- snps_intersections(lp_se, clinical_sets, chr_column = exp_chr_col,
                                    start_column = exp_start_col, end_column = exp_end_col)

fail_ref_snps <- as.data.frame(clinical_snps[["inters"]][["failure, reference strain"]])
fail_ref_snps <- rbind(fail_ref_snps,
                       as.data.frame(clinical_snps[["inters"]][["failure"]]))
cure_snps <- as.data.frame(clinical_snps[["inters"]][["cure"]])

head(fail_ref_snps)
head(cure_snps)
write.csv(file = "excel/cure_variants.txt", x = rownames(cure_snps))
write.csv(file = "excel/fail_variants.txt", x = rownames(fail_ref_snps))

annot <- rowData(lp_se)
clinical_interest_cure <- as.data.frame(clinical_snps[["gene_summaries"]][["cure"]])
summary(as.factor(clinical_interest_cure[[1]]))
clinical_interest_fail <- as.data.frame(clinical_snps[["gene_summaries"]][["failure"]])
summary(as.factor(clinical_interest_fail[[1]]))

clinical_interest <- merge(clinical_interest_cure,
                           clinical_interest_fail,
                           by = "row.names", all = TRUE)

rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps", "fail_snps")
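## Keep every gene (all.x) so that the merged table can be written back to rowData.
clinical_annot <- merge(annot, clinical_interest, by = "row.names", all.x = TRUE)
rownames(clinical_annot) <- as.character(clinical_annot[["Row.names"]])
clinical_annot[["Row.names"]] <- NULL
## merge() reorders the rows; restore the original gene order before writing back.
clinical_annot <- clinical_annot[rownames(annot), ]
dim(clinical_annot)
dim(rowData(lp_se))
rowData(lp_se) <- clinical_annot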
```

# Zymodeme for new samples

The heatmap produced here should show the variants only for the zymodeme genes.

## Hunt for snp clusters

I am thinking that if we find clusters of locations which are variant, that
might provide some PCR testing possibilities.

```{r}
## Drop the 2.1, 2.4, unknown, and null
pruned_snps <- subset_se(new_snps_sufficient, subset = "condition=='z2.2'|condition=='z2.3'")
new_sets <- get_snp_sets(pruned_snps, factor = "zymodemecategorical")
summary(new_sets)
## 1000000: 2.2
## 0100000: 2.3

summary(new_sets[["intersections"]][["10"]])
write.csv(file = "excel/variants_22.csv", x = new_sets[["intersections"]][["10"]])
summary(new_sets[["intersections"]][["01"]])
write.csv(file = "excel/variants_23.csv", x = new_sets[["intersections"]][["01"]])
```

Thus we see that there are 3,553 variants associated with 2.2 and
81,589 associated with 2.3.

### A small function for searching for potential PCR primers

The following function uses the positional data to look for sequential
mismatches associated with zymodeme in the hopes that there will be
some regions which would provide good potential targets for a
PCR-based assay.

```{r sequential_search, eval=FALSE}
sequential_variants <- function(snp_sets, conditions = NULL, minimum = 3, maximum_separation = 3) {
  if (is.null(conditions)) {
    conditions <- 1
  }
  intersection_sets <- snp_sets[["intersections"]]
  intersection_names <- snp_sets[["set_names"]]
  chosen_intersection <- 1
  if (is.numeric(conditions)) {
    chosen_intersection <- conditions
  } else {
    intersection_idx <- intersection_names == conditions
    chosen_intersection <- names(intersection_names)[intersection_idx]
  }

  possible_positions <- intersection_sets[[chosen_intersection]]
  position_table <- data.frame(row.names = possible_positions)
  pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
  position_table[["chr"]] <- gsub(pattern = pat, replacement = "\\1", x = rownames(position_table))
  position_table[["pos"]] <- as.numeric(gsub(pattern = pat, replacement = "\\2", x = rownames(position_table)))
  position_idx <- order(position_table[, "chr"], position_table[, "pos"])
  position_table <- position_table[position_idx, ]
  position_table[["dist"]] <- 0

  last_chr <- ""
  for (r in seq_len(nrow(position_table))) {
    this_chr <- position_table[r, "chr"]
    if (r == 1) {
      position_table[r, "dist"] <- position_table[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      position_table[r, "dist"] <- position_table[r, "pos"] - position_table[r - 1, "pos"]
    } else {
      position_table[r, "dist"] <- position_table[r, "pos"]
    }
    last_chr <- this_chr
  }

  ## Working interactively here.

  doubles <- position_table[["dist"]] == 1
  doubles <- position_table[doubles, ]
  write.csv(doubles, "doubles.csv")

  one_away <- position_table[["dist"]] == 2
  one_away <- position_table[one_away, ]
  write.csv(one_away, "one_away.csv")

  two_away <- position_table[["dist"]] == 3
  two_away <- position_table[two_away, ]
  write.csv(two_away, "two_away.csv")

  combined <- rbind(doubles, one_away)
  combined <- rbind(combined, two_away)
  position_idx <- order(combined[, "chr"], combined[, "pos"])
  combined <- combined[position_idx, ]

  last_chr <- ""
  for (r in seq_len(nrow(combined))) {
    this_chr <- combined[r, "chr"]
    if (r == 1) {
      combined[r, "dist_pair"] <- combined[r, "pos"]
      last_chr <- this_chr
      next
    }
    if (this_chr == last_chr) {
      combined[r, "dist_pair"] <- combined[r, "pos"] - combined[r - 1, "pos"]
    } else {
      combined[r, "dist_pair"] <- combined[r, "pos"]
    }
    last_chr <- this_chr
  }

  dist_pair_maximum <- 1000
  dist_pair_minimum <- 200
  dist_pair_idx <- combined[["dist_pair"]] <= dist_pair_maximum &
    combined[["dist_pair"]] >= dist_pair_minimum
  remaining <- combined[dist_pair_idx, ]
  no_weak_idx <- grepl(pattern = "ref_(G|C)", x = rownames(remaining))
  remaining <- remaining[no_weak_idx, ]

  print(head(table(position_table[["dist"]])))
  sequentials <- position_table[["dist"]] <= maximum_separation
  message("There are ", sum(sequentials), " candidate regions.")

  ## The following can tell me how many runs of each length occurred, that is not quite what I want.
  ## Now use run length encoding to find the set of sequential sequentials!
  rle_result <- rle(sequentials)
  rle_values <- rle_result[["values"]]
  ## The following line is equivalent to just leaving values alone:
  ## true_values <- rle_result[["values"]] == TRUE
  rle_lengths <- rle_result[["lengths"]]
  true_sequentials <- rle_lengths[rle_values]
  rle_idx <- cumsum(rle_lengths)[which(rle_values)]

  position_table[["last_sequential"]] <- 0
  count <- 0
  for (r in rle_idx) {
    count <- count + 1
    position_table[r, "last_sequential"] <- true_sequentials[count]
  }
  message("The maximum sequential set is: ", max(position_table[["last_sequential"]]), ".")

  wanted_idx <- position_table[["last_sequential"]] >= minimum
  wanted <- position_table[wanted_idx, c("chr", "pos")]
  return(wanted)
}

zymo22_sequentials <- sequential_variants(new_sets, conditions = "z22",
                                          minimum = 1, maximum_separation = 2)
dim(zymo22_sequentials)
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "z23",
                                          minimum = 2, maximum_separation = 2)
dim(zymo23_sequentials)
## In contrast, there are lots (587) of interesting regions for 2.3!
```
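
The per-chromosome distance loop and the run-length step in `sequential_variants()` can also be written without explicit loops. What follows is only a sketch on made-up variant names (`toy_ids` and every position in it are invented for illustration); it is not meant to replace the function above.

```{r, eval=FALSE}
## Invented variant names in the chr_<chr>_pos_<pos>_ref_<base>_alt_<base> style.
toy_ids <- c("chr_LpaL13-01_pos_100_ref_A_alt_G",
             "chr_LpaL13-01_pos_101_ref_C_alt_T",
             "chr_LpaL13-01_pos_103_ref_G_alt_A",
             "chr_LpaL13-01_pos_900_ref_T_alt_C",
             "chr_LpaL13-02_pos_50_ref_G_alt_C",
             "chr_LpaL13-02_pos_52_ref_A_alt_T")
pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
toy <- data.frame(
  chr = gsub(pattern = pat, replacement = "\\1", x = toy_ids),
  pos = as.numeric(gsub(pattern = pat, replacement = "\\2", x = toy_ids)),
  row.names = toy_ids)
toy <- toy[order(toy[["chr"]], toy[["pos"]]), ]

## Distance to the previous variant on the same chromosome; the first variant
## of each chromosome gets its own position, as in the loop above.
toy[["dist"]] <- ave(toy[["pos"]], toy[["chr"]],
                     FUN = function(p) c(p[1], diff(p)))

## Runs of consecutive 'close' variants via run length encoding.
close_idx <- toy[["dist"]] <= 3
runs <- rle(close_idx)
run_ends <- cumsum(runs[["lengths"]])[runs[["values"]]]
run_sizes <- runs[["lengths"]][runs[["values"]]]
toy[["dist"]]
run_ends
run_sizes
```

In this toy set the distances are 100, 1, 2, 797, 50, 2, so there are two runs of close variants, ending at rows 3 and 6 with sizes 2 and 1.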

### Extract a promising region from the genome

The first 4 candidate regions from my set of remaining:

| Chr       | Pos.    | Distance |
|:----------|--------:|---------:|
| LpaL13-15 | 238433  | 448      |
| LpaL13-18 | 142844  | 613      |
| LpaL13-29 | 830342  | 252      |
| LpaL13-33 | 1331507 | 843      |

Let's define a couple of terms:

* Third: Each of the 4 above positions.
* Second: Third - Distance
* End: Third + PrimerLen
* Start: Second - PrimerLen

In each instance, the listed position is the last one of the pair, so we want to grab three things:

* The entire region from Start -> End, this way we can have a quick sanity check.
* Start -> Second.
* (Third -> End) <- Reverse complemented
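
As a quick check of the arithmetic, here are the four coordinates for the first candidate, using the 22 nt primer length from the block below. The reverse complement helper (`revcomp()`) is a hypothetical base R stand-in run on a toy string, not the function used on the genome.

```{r, eval=FALSE}
## Values copied from the first candidate (LpaL13-15 238433 448).
third <- 238433
distance <- 448
primer_len <- 22
second <- third - distance
start <- second - primer_len
end <- third + primer_len
c(start = start, second = second, third = third, end = end)

## Base R reverse complement on a toy string, standing in for the genome sequence.
revcomp <- function(s) {
  complemented <- chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}
revcomp("AACG")
```

This gives Start = 237963, Second = 237985, and End = 238455, and the toy string comes back as "CGTT".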

```{r extract_bsgenome, eval=FALSE}
## * LpaL13-15 238433 448
first_candidate_chr <- lp_genome[["LpaL13_15"]]
primer_length <- 22
amplicon_length <- 448
first_candidate_third <- 238433
first_candidate_second <- first_candidate_third - amplicon_length
first_candidate_start <- first_candidate_second - primer_length
first_candidate_end <- first_candidate_third + primer_length
first_candidate_region <- subseq(first_candidate_chr, first_candidate_start, first_candidate_end)
first_candidate_region
first_candidate_5p <- subseq(first_candidate_chr, first_candidate_start, first_candidate_second)
as.character(first_candidate_5p)
first_candidate_3p <- spgs::reverseComplement(subseq(first_candidate_chr, first_candidate_third, first_candidate_end))
first_candidate_3p

## * LpaL13-18 142844 613
second_candidate_chr <- lp_genome[["LpaL13_18"]]
primer_length <- 22
amplicon_length <- 613
second_candidate_third <- 142844
second_candidate_second <- second_candidate_third - amplicon_length
second_candidate_start <- second_candidate_second - primer_length
second_candidate_end <- second_candidate_third + primer_length
second_candidate_region <- subseq(second_candidate_chr, second_candidate_start, second_candidate_end)
second_candidate_region
second_candidate_5p <- subseq(second_candidate_chr, second_candidate_start, second_candidate_second)
as.character(second_candidate_5p)
second_candidate_3p <- spgs::reverseComplement(subseq(second_candidate_chr, second_candidate_third, second_candidate_end))
second_candidate_3p


## * LpaL13-29 830342 252
third_candidate_chr <- lp_genome[["LpaL13_29"]]
primer_length <- 22
amplicon_length <- 252
third_candidate_third <- 830342
third_candidate_second <- third_candidate_third - amplicon_length
third_candidate_start <- third_candidate_second - primer_length
third_candidate_end <- third_candidate_third + primer_length
third_candidate_region <- subseq(third_candidate_chr, third_candidate_start, third_candidate_end)
third_candidate_region
third_candidate_5p <- subseq(third_candidate_chr, third_candidate_start, third_candidate_second)
as.character(third_candidate_5p)
third_candidate_3p <- spgs::reverseComplement(subseq(third_candidate_chr, third_candidate_third, third_candidate_end))
third_candidate_3p
## You are a garbage polypyrimidine tract.
## Which is actually interesting if the mutations mess it up.


## * LpaL13-33 1331507 843
fourth_candidate_chr <- lp_genome[["LpaL13_33"]]
primer_length <- 22
amplicon_length <- 843
fourth_candidate_third <- 1331507
fourth_candidate_second <- fourth_candidate_third - amplicon_length
fourth_candidate_start <- fourth_candidate_second - primer_length
fourth_candidate_end <- fourth_candidate_third + primer_length
fourth_candidate_region <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_end)
fourth_candidate_region
fourth_candidate_5p <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_second)
as.character(fourth_candidate_5p)
fourth_candidate_3p <- spgs::reverseComplement(subseq(fourth_candidate_chr, fourth_candidate_third, fourth_candidate_end))
fourth_candidate_3p
```

## Go hunting for Sanger sequencing regions

I made a fun little function which should find regions which have lots of variants
associated with a given experimental factor.

```{r}
pheno <- subset_se(lp_se, subset = "condition=='z2.2'|condition=='z2.3'")
pheno <- subset_se(pheno, subset = "!is.na(colData(pheno)[['bcftable']])")
pheno_snps <- count_snps(pheno, annot_column = "freebayessummary", snp_column="PAIRED")
##pheno_snps <- sm(count_snps(pheno, annot_column = "bcftable"))
```

## SNP Density Primers

I cannot run the following block in the container unless/until I copy
the gff into it...

```{r, eval=FALSE}
fun_stuff <- snp_density_primers(
  pheno_snps,
  bsgenome = "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53",
  gff = "reference/TriTrypDB-53_LpanamensisMHOMCOL81L13.gff")
drop_scaffolds <- grepl(x = rownames(fun_stuff$favorites), pattern = "SCAF")
favorite_primer_regions <- fun_stuff[["favorites"]][!drop_scaffolds, ]
favorite_primer_regions[["bin"]] <- rownames(favorite_primer_regions)

favorite_primer_regions <- favorite_primer_regions %>%
  relocate(bin)
```

## Combine this table with 2.2/2.3 genes

Here is my note from our meeting:

Cross reference primers to DE genes of 2.2/2.3 and/or resistance/susceptible,
add a column to the primer spreadsheet with the DE genes (in retrospect I am guessing
this actually means to put the logFC as a column).

One nice thing, I did a semantic removal on the lp_se, so the set of logFC/pvalues
should not have any of the offending types; thus I should be able to automagically
get rid of them in the merge.

This block needs to go after differential expression analyses.

```{r, eval=FALSE}
logfc <- zy_table_sva[["data"]][["z23_vs_z22"]]
logfc_columns <- logfc[, c("deseq_logfc", "deseq_adjp")]
colnames(logfc_columns) <- c("z23_logfc", "z23_adjp")
new_table <- merge(favorite_primer_regions, logfc_columns,
                   by.x = "closest_gene_before_id", by.y = "row.names")
sus <- sus_table_sva[["data"]][["sensitive_vs_resistant"]]
sus_columns <- sus[, c("deseq_logfc", "deseq_adjp")]
colnames(sus_columns) <- c("sus_logfc", "sus_adjp")
new_table <- merge(new_table, sus_columns,
                   by.x = "closest_gene_before_id", by.y = "row.names") %>%
  relocate(bin)
written <- write_xlsx(data = new_table,
                      excel = "excel/favorite_primers_xref_zy_sus.xlsx")
```
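
One thing to keep in mind about the merges above: `merge()` is an inner join by default, so any primer region whose closest gene has no row in the DE table is dropped. That is what I want for the removed gene types, but it is worth seeing on a toy example. Everything in this sketch (`toy_primers`, `toy_de`, the gene IDs) is invented.

```{r, eval=FALSE}
## Hypothetical primer regions and DE results; none of these values come from the data.
toy_primers <- data.frame(
  bin = c("bin_1", "bin_2", "bin_3"),
  closest_gene_before_id = c("gene_a", "gene_b", "gene_z"))
toy_de <- data.frame(
  deseq_logfc = c(1.5, -2.0),
  deseq_adjp = c(0.01, 0.20),
  row.names = c("gene_a", "gene_b"))

## Inner join (the default): bin_3 disappears because gene_z has no DE row.
toy_inner <- merge(toy_primers, toy_de,
                   by.x = "closest_gene_before_id", by.y = "row.names")
## Left join: bin_3 is kept and its logFC/adjp are NA.
toy_left <- merge(toy_primers, toy_de,
                  by.x = "closest_gene_before_id", by.y = "row.names",
                  all.x = TRUE)
nrow(toy_inner)
nrow(toy_left)
```

If the dropped regions ever need to be inspected, `all.x = TRUE` keeps them with NA in the logFC columns.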


## Make a heatmap describing the clustering of variants

We can cross reference the variants against the zymodeme status and
plot a heatmap of the results and hopefully see how they separate.

```{r}
snp_genes <- sm(snps_vs_genes(lp_se, new_sets, chr_column = exp_chr_col,
                              start_column = exp_start_col, end_column = exp_end_col))

clinical_colors_v2 <- list(
  "z22" = "#0000cc",
  "z23" = "#cc0000")
new_zymo_norm <- normalize_se(pruned_snps, norm = "quant") %>%
  set_conditions(fact = "zymodemecategorical", colors = clinical_colors_v2)
#  set_se_colors(clinical_colors_v2)

zymo_heat <- plot_disheat(new_zymo_norm)
dev <- pp(file = "images/onlyz22_z23_snp_heatmap.pdf", width = 12, height = 12)
zymo_heat[["plot"]]
closed <- dev.off()
zymo_heat
```

### Annotated heatmap of variants

Now let us try to make a heatmap which includes some of the annotation data.

```{r}
des <- colData(both_norm)
undef_idx <- is.na(des[["pathogenstrain"]])
des[undef_idx, "pathogenstrain"] <- "unknown"

##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(assay(both_norm))
na_idx <- is.na(correlations)
correlations[na_idx] <- 0

## Make an initial heatmap via plot_disheat, which may get used as the figure:
initial_snps <- set_conditions(both_norm, fact = "zymodemereference", colors = color_choices[["strain"]])
initial_disheat <- plot_disheat(initial_snps)
dev <- pp(file = "figures/initial_snp_heatmap.pdf", width = 20, height = 20)
initial_disheat[["plot"]]
closed <- dev.off()
initial_disheat

zymo_missing_idx <- is.na(des[["zymodemecategorical"]])
des[["zymodemecategorical"]] <- as.character(des[["zymodemecategorical"]])
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
des[zymo_missing_idx, "zymodemecategorical"] <- "unknown"
mydendro <- list(
  "clustfun" = hclust,
  "lwd" = 2.0)
col_data <- as.data.frame(des[, c("zymodemecategorical")])
unknown_clinical <- is.na(des[["clinicalcategorical"]])
colnames(col_data) <- c("zymodeme")

row_data <- as.data.frame(des[, c("sus_category_current", "clinicalcategorical")])
colnames(row_data) <- c("susceptibility", "outcome")
row_data[unknown_clinical, "outcome"] <- "undefined"

myannot <- list(
  "Col" = list("data" = col_data),
  "Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
                "col" = BrewerClusterCol)
mylabs <- list(
  "Row" = list("nrow" = 4),
  "Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(240)
zymo_annot_heat <- annHeatmap2(
  correlations,
  dendrogram = mydendro,
  annotation = myannot,
  cluster = myclust,
  labels = mylabs,
  ## The following controls if the picture is symmetric
  scale = "none",
  col = hmcols)

dev <- pp(file = "images/dendro_heatmap.pdf", height = 20, width = 20)
plot(zymo_annot_heat)
closed <- dev.off()
plot(zymo_annot_heat)
```

Print the larger heatmap so that all the labels appear.  Keep in mind
that as we get more samples, this image needs to continue getting
bigger.
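
For reference, the annotated heatmap boils down to three steps: a sample by sample correlation matrix, NA correlations set to 0, and hierarchical clustering of the result. Here is a minimal sketch of those steps on an invented 0/1 variant matrix. It uses Pearson correlation from base R, which is an assumption on my part and may not match what `hpgl_cor()` does by default.

```{r, eval=FALSE}
## Invented variant calls: rows are positions, columns are samples.
toy_variants <- cbind(
  s1 = c(1, 1, 0, 0, 1, 0),
  s2 = c(1, 1, 0, 0, 1, 1),
  s3 = c(0, 0, 1, 1, 0, 1),
  s4 = c(0, 0, 1, 1, 0, 0),
  s5 = c(0, 0, 0, 0, 0, 0))  ## No variants at all, so its correlations are NA.

## cor() warns about the zero-variance column; that is the NA case handled above.
toy_cor <- suppressWarnings(cor(toy_variants, method = "pearson"))
toy_cor[is.na(toy_cor)] <- 0

## Cluster on 1 - correlation and cut the tree into two groups.
toy_tree <- hclust(as.dist(1 - toy_cor))
toy_groups <- cutree(toy_tree, k = 2)
toy_groups
```

Samples s1/s2 and s3/s4 end up in different groups; the all-zero sample has no information and attaches to whichever group the tie-breaking picks.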

### CMplot karyogram of variants

I cannot run the following block until/unless I install CMplot in the
container.  Oh, I did!  Let us run it and see what happens.

```{r}
xref_prop <- table(colData(pheno_snps)[["condition"]])
xref_prop
idx_tbl <- assay(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(assay(pheno_snps)))
for (n in names(xref_prop)) {
  samples <- colData(pheno_snps)[["condition"]] == n
  new_tbl[[n]] <- 0
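  ## drop = FALSE keeps a matrix even when a condition has a single sample;
  ## [[n]] takes the plain count rather than a length-one table.
  prop_col <- rowSums(idx_tbl[, samples, drop = FALSE]) / xref_prop[[n]]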
  new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1

new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- as.numeric(gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1"))
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]

simplify <- new_tbl
simplify[["strong22"]] <- NULL

CMplot(new_tbl, bin.size = 10000, threshold = c(0.01, 0.05), plot.type = "d",
       file.name = "variant_density_10k")
CMplot(new_tbl, bin.size = 1000, threshold = c(0.01, 0.05), plot.type = "d",
       file.name = "variant_density_1k")
CMplot(new_tbl, bin.size = 100000, threshold = c(0.01, 0.05), plot.type = "d",
       file.name = "variant_density_100k")

CMplot(new_tbl, plot.type = "m", multracks = TRUE, threshold = c(0.01, 0.05),
       threshold.lwd = c(1,1), threshold.col = c("black","grey"),
       amplify = TRUE, bin.size = 1000,
       chr.den.col = c("darkgreen", "yellow", "red"),
       signal.col = c("red", "green", "blue"),
       signal.cex = 1, file = "jpg", dpi = 300, file.output = TRUE, verbose = TRUE)
```

![SNP Density](Marker_Density.variant_density_10k.jpg)
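
The `strong22`/`strong23` columns above are 1.001 minus the proportion of samples in a condition which have the variant (more than 5 observations), capped at 1, so that positions seen in every sample of a condition get a very small value. A minimal sketch of that calculation on an invented count matrix (`toy_counts` and `toy_condition` are made up):

```{r, eval=FALSE}
## Invented counts: rows are variants, columns are samples.
toy_counts <- matrix(c(10, 0, 8, 0,
                       0, 9, 0, 7,
                       6, 6, 6, 6),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("v1", "v2", "v3"), c("a", "b", "c", "d")))
toy_condition <- c("z2.2", "z2.3", "z2.2", "z2.3")

## A variant counts as observed in a sample when it has more than 5 observations.
observed <- toy_counts > 5
## Proportion of samples per condition in which each variant is observed.
toy_prop <- sapply(split(seq_along(toy_condition), toy_condition),
                   function(idx) rowMeans(observed[, idx, drop = FALSE]))
## Same transformation as strong22/strong23: small when present in all samples.
toy_strong <- pmin(1.001 - toy_prop, 1)
toy_prop
toy_strong
```

Here v1 is present in every z2.2 sample and no z2.3 sample, so its z2.2 value is 0.001 and its z2.3 value is capped at 1.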

# A different karyogram

I have been a bit frustrated with the clunkiness of CMplot, so I did
some reading and found ggbio's autoplot().  It makes use of
GRanges/IRanges to plot arbitrary data and as such has the potential to
be significantly more generally useful than CMplot.  I think I will be
able to use it to view a lot of interesting different data types.  In
this instance I want to plot density of variants associated with
various conditions in the data (z2.3/z2.2, cure/fail, whatever).  In
addition, it might be nice to have the ORFs displayed in some fashion
(space permitting).

I am pretty sure I made a function which makes this less clunky than what follows.

```{r, eval=FALSE}
lp_entry <- EuPathDB::get_eupath_entry(species = "MHOM/COL", metadata = eu_meta)

## These lines cannot run in the container because it cannot write
##txdb_pkgname <- make_eupath_txdb(lp_entry)
##grange_name <- make_eupath_granges(lp_entry)
grange_name <- gsub(x = lp_entry[["GrangesPkg"]], pattern = "\\.rda$", replacement = "")
grange_filename <- file.path("build", lp_entry[["GrangesPkg"]])
if (file.exists(grange_filename)) {
  load(grange_filename)
} else {
  created <- dir.create("build/gff", recursive = TRUE)
  grange_build <- make_eupath_granges(lp_entry)
  grange_filename <- grange_build[["rda"]]
  load(grange_filename)
}
grange_data <- get0(grange_name)

## Exclude the unplaced scaffolds and keep the Seqinfo of the assembled chromosomes.
scaffold_idx <- grepl(x = as.character(names(seqinfo(grange_data))), pattern = "SCAF")
chr_names <- names(seqinfo(grange_data))[!scaffold_idx]
no_scaffolds <- seqinfo(grange_data)[chr_names]

auto_tbl <- new_tbl
auto_tbl[["position2"]] <- auto_tbl[["Position"]]
auto_tbl[["SNP"]] <- NULL
rownames(auto_tbl) <- NULL

tilesize <- 1000
bins_1k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 1000,
                                     cut.last.tile.in.chrom = TRUE)
bins_5k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 5000,
                                     cut.last.tile.in.chrom = TRUE)
bins_10k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 10000,
                                  cut.last.tile.in.chrom = TRUE)
bins_1nt <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 1,
                                      cut.last.tile.in.chrom = TRUE)
auto_tbl[["strand"]] <- "+"
## I want to calculate the number of intersecting positions between my auto_tbl and the 1k bins.
start <- auto_tbl[, c("Chromosome", "Position", "position2", "strand", "strong23")]
colnames(start) <- c("chr", "start", "end", "strand", "z23")
start[["chr"]] <- gsub(x = start[["chr"]], pattern = "-", replacement = "_")
var_grange <- makeGRangesFromDataFrame(start, seqinfo = no_scaffolds, keep.extra.columns = TRUE)
vars_per_bin <- findOverlaps(bins_1k, var_grange)
vars_per_bin_numeric <- as.data.frame(bins_1k)
vars_per_bin_numeric[["bin"]] <- rownames(vars_per_bin_numeric)

count_per_bin <- as.data.frame(vars_per_bin) %>%
  group_by(queryHits) %>%
  dplyr::tally()
colnames(count_per_bin) <- c("bin", "num")
vars_per_bin_numeric <- merge(vars_per_bin_numeric, count_per_bin, by = "bin", all.x = TRUE)
missing_idx <- is.na(vars_per_bin_numeric[["num"]])
vars_per_bin_numeric[missing_idx, "num"] <- 0
vars_per_bin <- vars_per_bin_numeric[, c("seqnames", "start", "end", "width", "strand", "num")]
vpb_grange <- makeGRangesFromDataFrame(vars_per_bin, seqinfo = no_scaffolds, keep.extra.columns = TRUE)

kary <- autoplot(vpb_grange, layout = "karyogram", aes(color = num, fill = num)) +
  scale_color_gradient(low = "blue", high = "red") +
  scale_fill_gradient(low = "blue", high = "red")
## theme_bw(base_size = 10) +
pp(file = "karyogram_by_variants.pdf", height = 24, width = 18)
kary
dev.off()

var_kary <- ggbio() +
  layout_karyogram(vpb_grange, aes(color = num, fill = num)) +
  scale_fill_gradient(low = "blue", high = "white") +
  scale_color_gradient(low = "blue", high = "white") +
  theme_bw(base_size = 10)
var_kary
```
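
The findOverlaps/tally/merge sequence above is just counting variants per fixed-width tile, with 0 filled in for empty tiles. A base R sketch of the same bookkeeping, on invented chromosome lengths and positions (`toy_lengths`, `toy_vars`), may make the intent easier to follow; it does not use GRanges at all.

```{r, eval=FALSE}
## Invented chromosome lengths and variant positions.
toy_lengths <- c(chrA = 2500, chrB = 1200)
toy_vars <- data.frame(chr = c("chrA", "chrA", "chrA", "chrB"),
                       pos = c(10, 999, 2400, 1100))
tile <- 1000

## Build the tiles; the last tile of each chromosome is cut at the chromosome end.
toy_bins <- do.call(rbind, lapply(names(toy_lengths), function(chr) {
  starts <- seq(1, toy_lengths[[chr]], by = tile)
  data.frame(chr = chr, start = starts,
             end = pmin(starts + tile - 1, toy_lengths[[chr]]))
}))

## Give every tile and every variant a chr:tile_number key, then count matches.
bin_key <- paste(toy_bins[["chr"]], (toy_bins[["start"]] - 1) %/% tile, sep = ":")
var_key <- paste(toy_vars[["chr"]], (toy_vars[["pos"]] - 1) %/% tile, sep = ":")
toy_bins[["num"]] <- tabulate(match(var_key, bin_key), nbins = nrow(toy_bins))
toy_bins
```

Tiles without variants get 0 directly from `tabulate()`, so there is no separate NA replacement step.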

<!---
![SNP Density](SNP-Density.ratio.jpg)
![Circular Manhattan](Circular-Manhattan.ratio.jpg)
![Rectangular Manhattan](Rectangular-Manhattan.ratio.jpg)
![QQ](QQplot.ratio.jpg)
--->

## Try out MatrixEQTL

This tool looks a little opaque, but provides sample data with things
that make sense to me and should be pretty easy to recapitulate in our
data.

1.  covariates.txt: Columns are samples, rows are things from colData -- the
most likely ones of interest for our data would be zymodeme,
sensitivity
2.  geneloc.txt: columns are 'geneid', 'chr', 'left', 'right'.  I
guess I can assume left and right are start/stop; in which case
this is trivially acquirable from rowData.
3.  ge.txt: This appears to be a log(rpkm/cpm) table with rows as genes and
columns as samples
4.  snpsloc.txt: columns are 'snpid', 'chr', 'pos'
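5.  snps.txt: columns are samples, rows are the ids from snpsloc,
values are 0, 1, 2.  I assume 0 is identical and nonzero marks a variant;
in the sample data these look like genotype codes (copies of the minor
allele) rather than one code for each of the 12 possible substitutions
(A->TGC T->AGC C->AGT G->ACT)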

```{r matrixeqtl, eval=FALSE}
## For this, let us use the 'new_snps' data structure.
## Caveat here: these need to be coerced to numbers.
my_covariates <- colData(new_snps)[, c("zymodemecategorical", "clinicalcategorical")]
for (col in colnames(my_covariates)) {
  my_covariates[[col]] <- as.numeric(as.factor(my_covariates[[col]]))
}
my_covariates <- t(my_covariates)

my_geneloc <- rowData(lp_se)[, c("gid", "chromosome", "start", "end")]
colnames(my_geneloc) <- c("geneid", "chr", "left", "right")

my_ge <- assay(normalize_se(lp_se, transform = "log2", filter = TRUE, convert = "cpm"))
used_samples <- tolower(colnames(my_ge)) %in% colnames(assay(new_snps))
my_ge <- my_ge[, used_samples]

my_snpsloc <- data.frame(snpid = rownames(assay(new_snps)))
## Oh, caveat here: Because of the way I stored the data,
## I could have duplicate rows which presumably will make matrixEQTL sad
snp_pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
my_snpsloc[["chr"]] <- gsub(pattern = snp_pat, replacement = "\\1",
                            x = my_snpsloc[["snpid"]])
my_snpsloc[["pos"]] <- as.numeric(gsub(pattern = snp_pat, replacement = "\\2",
                                       x = my_snpsloc[["snpid"]]))
test <- duplicated(my_snpsloc[, c("chr", "pos")])
## Each duplicated row would be another variant at that position;
## so in theory we would do a rle to number them I am guessing
## However, I do not have different variants so I think I can ignore this for the moment
## but will need to make my matrix either 0 or 1.
if (sum(test) > 0) {
  message("There are: ", sum(test), " duplicated entries.")
  keep_idx <- ! test
  my_snpsloc <- my_snpsloc[keep_idx, ]
}

my_snps <- assay(new_snps)
one_idx <- my_snps > 0
my_snps[one_idx] <- 1

## Ok, at this point I think I have all the pieces which this method wants...
## Oh, no I guess not; it actually wants the data as a set of filenames...
library(MatrixEQTL)
write.table(my_snps, "eqtl/snps.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snps, "eqtl/snps.tsv", )
write.table(my_snpsloc, "eqtl/snpsloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snpsloc, "eqtl/snpsloc.tsv")
write.table(as.data.frame(my_ge), "eqtl/ge.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_ge), "eqtl/ge.tsv")
write.table(as.data.frame(my_geneloc), "eqtl/geneloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_geneloc), "eqtl/geneloc.tsv")
write.table(as.data.frame(my_covariates), "eqtl/covariates.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_covariates), "eqtl/covariates.tsv")

useModel = modelLINEAR # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name
SNP_file_name = "eqtl/snps.tsv"
snps_location_file_name = "eqtl/snpsloc.tsv"
expression_file_name = "eqtl/ge.tsv"
gene_location_file_name = "eqtl/geneloc.tsv"
covariates_file_name = "eqtl/covariates.tsv"
# Output file name
output_file_name_cis = tempfile()
output_file_name_tra = tempfile()
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 0.1
pvOutputThreshold_tra = 0.1
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric()
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6
## Load genotype data
snps = SlicedData$new()
snps$fileDelimiter = "\t"      # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1          # one row of column labels
snps$fileSkipColumns = 1       # one column of row labels
snps$fileSliceSize = 2000      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
## Load gene expression data
gene = SlicedData$new()
gene$fileDelimiter = "\t"      # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1          # one row of column labels
gene$fileSkipColumns = 1       # one column of row labels
gene$fileSliceSize = 2000      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
## Load covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t"      # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1          # one row of column labels
cvrt$fileSkipColumns = 1       # one column of row labels
if(length(covariates_file_name) > 0) {
  cvrt$LoadFile(covariates_file_name)
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE)
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE)

me = Matrix_eQTL_main(
  snps = snps,
  gene = gene,
  cvrt = cvrt,
  output_file_name = output_file_name_tra,
  pvOutputThreshold = pvOutputThreshold_tra,
  useModel = useModel,
  errorCovariance = errorCovariance,
  verbose = TRUE,
  output_file_name.cis = output_file_name_cis,
  pvOutputThreshold.cis = pvOutputThreshold_cis,
  snpspos = snpspos,
  genepos = genepos,
  cisDist = cisDist,
  pvalue.hist = "qqplot",
  min.pv.by.genesnp = FALSE,
  noFDRsaveMemory = FALSE);
```
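
As far as I understand it, `modelLINEAR` tests an additive effect of genotype on expression while adjusting for the covariates, once per SNP/gene pair. For a single pair that is roughly an ordinary `lm()` fit. The sketch below uses invented numbers (`geno`, `zymo`, `noise`, and the effect sizes are all made up) and is only meant to show what one such test looks like, not how MatrixEQTL computes it internally.

```{r, eval=FALSE}
## Toy data: genotype coded 0/1 as in my_snps, one numeric covariate, one gene.
geno <- c(0, 0, 0, 0, 1, 1, 1, 1)
zymo <- c(1, 2, 1, 2, 1, 2, 1, 2)
noise <- c(0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1)
## Invented truth: the variant raises expression by 3 log units.
expr <- 2 + 3 * geno + 0.5 * zymo + noise

## One SNP/gene test: expression as a function of genotype plus the covariate.
fit <- lm(expr ~ geno + zymo)
geno_effect <- coef(fit)[["geno"]]
geno_pvalue <- summary(fit)[["coefficients"]]["geno", "Pr(>|t|)"]
geno_effect
geno_pvalue
```

This is also why the covariates have to be coerced to numbers before they are written out: they enter the model as numeric columns.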

```{r saveme}
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))
##message(paste0("Saving to ", savefile))
## tmp <- sm(saveme(filename = savefile))
```

```{r loadme_after, eval = FALSE}
tmp <- loadme(filename = savefile)
```
