This is intended to contain analyses which logically follow our transciptomic analyses. It is therefore a bit of a grab bag, it may eventually comprise the variant search; but currently that is intertwined with the DE results. As a result, this and the ‘pre_visualization’ document are currently very redundant.
I have a few methods of creating (phylogenetic) trees describing the relationship of the strains of interest in this data.
The following block was generated via the following methods:
The hisat2-based alignments were passed to Kim et al. (2019), which generated a set of high-confidence transcriptome-based vcf files for each sample of sufficient coverage. These were sorted, compressed, and indexed via bcftools[ref]; and high-confidence variants (more than 80% with coverage higher than 5 reads per position) were extracted into a table variants as well as used to modify the reference genome to represent these filtered variants. These modified genomes were passed to ape[ref], indexed, and used to create a kmer index and distance matrix; these distances were used by ape to create a neighbor joining tree.
I wrote a little function which in theory should make the above a bit simpler and more robust for future analyses. Lets see if it works. It currently takes a directory containing the fasta files of the sequences to compare and an optional root.
The following block requires copying a group of fasta files into the ‘compare_strains/’ directory, which in our container image is not available, so turning this off.
Ok, so I added the various fasta files used to create these trees to the container in compressed form. I probably should have just made a tarball, but I was concerned that I would need to change which files are included, so did not. Thus, if one wishes to play with the following blocks, one will first need to run:
Note that I included two cd commands because one may go there from outside or inside the copied container output directory.
In order to perform this, I will use the same fasta files, but extract the zymodeme genes from them and write out a set of fasta files containing their sequences. I therefore wrote a function which takes in the annotation data and fasta files in order to extract the data of interest.
Sadly, I will need to read in the annotations for braziliensis/panamensis panama and any other sequences. But the sequences which are directly extracted from panamensis colombia I will be able to use the same annotations.
wanted_ids <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
"LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300")
reference <- write_cds_entries("compare_strains/lpanamensis_v36.fasta", all_lp_annot,
ids = wanted_ids, output = "compare_strains/lpanamensis_cds.fasta")
modified_12588 <- write_cds_entries("compare_strains/strain_12588_modified_z21.fasta", all_lp_annot,
name_prefix = "z21", ids = wanted_ids,
output = "compare_strains/lpanamensis_z21_cds.fasta")
modified_2272 <- write_cds_entries("compare_strains/strain_2272_modified_z22.fasta", all_lp_annot,
name_prefix = "z22", ids = wanted_ids,
output = "compare_strains/lpanamensis_z22_cds.fasta")
modified_2168 <- write_cds_entries("compare_strains/strain_2168_modified_z23.fasta", all_lp_annot,
name_prefix = "z23", ids = wanted_ids,
output = "compare_strains/lpanamensis_z23_cds.fasta")
modified_12444 <- write_cds_entries("compare_strains/strain_12444_modified_z24.fasta", all_lp_annot,
name_prefix = "z24", ids = wanted_ids,
output = "compare_strains/lpanamensis_z24_cds.fasta")Having written these files, I concatenated the zymodeme CDS sequences into individual sequences/strain and performed a MSA and MP tree using clustalo[ref] and PhyML[ref] via seaview[ref]. Sadly, there were only 12 informative sites in the 6 zymodeme defining genes. Happily, the tree generated looks pretty much exactly like my genome-based tree. Also, I didn’t bother to add the other genomes because with only 12 variant positions it did not feel interesting.
Over the last couple of weeks, I redid all the variant searches with a newer, (I think) more sensitive and more specific variant tool. In addition I changed my script which interprets the results so that it is able to extract any tags from it, instead of just the one or two that my previous script handled. In addition, at least in theory it is now able to provide the set of amino acid substitutions for every gene in species without or with introns (not really relevant for Leishmania panamensis).
both_norm <- set_conditions(both_snps, fact = "knnv2classification")
## strains <- both_norm[["design"]][["strain"]]
both_strain <- set_conditions(both_norm, fact = "strain")The data structure ‘both_norm’ now contains our 2016 data along with the newer data collected since 2019.
The following plot shows the SNP profiles of all samples (old and new) where the colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the previous samples (purple), or the various lab strains (pink etc).
new_snps <- new_snps_sufficient
new_variant_heatmap <- plot_corheat(new_snps)
dev <- pp(file = "images/raw_snp_corheat.png", height = 12, width = 12)
new_variant_heatmap$plot
closed <- dev.off()
new_variant_heatmap$plot## A reminder from 02
exp_chr_col <- "sequence_id"
exp_start_col <- "coding_start"
exp_end_col <- "coding_end"The function get_snp_sets() takes the provided metadata factor (in this case ‘condition’) and looks for variants which are exclusive to each element in it. In this case, this is looking for differences between 2.2 and 2.3, as well as the set shared among them.
snp_sets <- get_snp_sets(both_snps, factor = "knnhclusttogethercall")
snp_sets
Biobase::annotation(lp_previous) <- Biobase::annotation(lp_se)
lp_knn <- set_conditions(lp_se, fact = "knnhclusttogethercall")
both_expt <- combine_expts(lp_knn, lp_previous)
snp_genes <- snps_vs_genes(both_se, snp_sets, chr_column = exp_chr_col,
start_column = exp_start_col, end_column = exp_end_col)
## I think we have some metrics here we can plot...s
snp_subset <- snp_subset_genes(
both_expt, both_snps,
genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
"LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))
zymo_heat <- plot_sample_heatmap(
normalize_expt(snp_subset, transform = "log2"),
row_label = rownames(exprs(snp_subset)))
zymo_heat
most_variant <- head(snp_genes$count_by_gene, n = 100)
least_variant <- tail(snp_genes$count_by_gene, n = 100)
test <- simple_goseq(names(most_variant), go_db = lp_go, length_db = lp_lengths)Najib has asked a few times about the relationship between variants and DE genes. In subsequent conversations I figured out what he really wants to learn is variants in the UTR (most likely 5’) which might affect expression of genes. The following explicitly does not help this question, but is a paralog: is there a relationship between variants in the CDS and differential expression?
In order to do this comparison, we need to reload some of the DE results.
rda <- glue("rda/zymo_tables_nobatch-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
zy_df <- get0(varname)[["data"]][["zymodeme"]]
rda <- glue("rda/sus_tables_nobatch-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
sus_df <- get0(varname)[["data"]][["resistant_sensitive"]]In the previous section, I did this process using older and newer data (and disabled it because we do not have all of that data available for the container; but I still want to do some of these comparisons and so will do them using only the current/new tree.
## The samples represent the following categories:
##
## z21 z22 z23
## 9 37 41
## Using a proportion of observed variants, converting the data to binary observations.
## The factor unknown has 4 rows.
## The factor z21 has 9 rows.
## The factor z22 has 37 rows.
## The factor z23 has 41 rows.
## Finished iterating over the chromosomes.
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: knnhclusttogethercall. 4
## categories and 927126 variants were observed with 15
## combinations among them. 725 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.114678899082569.
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': object 'lp_expt' not found
## Error in snps_vs_genes(lp_knn, snp_sets, expt_name_col = "chromosome"): unused argument (expt_name_col = "chromosome")
In my conversation today (20240614) with Maria Adelaida, she asked if I can provide a table of genes with respect to number of variants and differential expression. It turns out that I basically already did; the following block is pretty much exactly this query. I think the only thing I really need to consider is if I want to formalize this idea so that I can use it elsewhere, or if it will be a one-off.
vars_df <- data.frame(ID = names(snp_genes[["count_by_gene"]]),
variants = as.numeric(snp_genes[["count_by_gene"]]))## Error: object 'snp_genes' not found
## Error: object 'vars_df' not found
## Error: object 'vars_df' not found
## Error: object 'vars_df' not found
## Error: object 'vars_df' not found
## Error: object 'vars_by_de_gene' not found
## Error: object 'vars_by_de_gene' not found
## Error: object 'vars_by_de_gene' not found
## Error: object 'vars_by_de_gene' not found
## Error: object 'variants_wrt_logfc' not found
##plotly::ggplotly(scatter, tooltip = c("x", "y", "text"))
## It looks like there might be some genes of interest, even though this is not actually
## the question of interest.vars_by_sig_gene_idx <- abs(vars_by_de_gene[["deseq_logfc"]] >= 1.0) &
vars_by_de_gene[["deseq_adjp"]] <= 0.05## Error: object 'vars_by_de_gene' not found
## Error: object 'vars_by_de_gene' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': object 'vars_by_sig_gene' not found
Ok, I think I can do this on a UTR basis.
## Error in snps_vs_genes_padded(lp_expt, snp_sets, expt_name_col = "chromosome"): unused argument (expt_name_col = "chromosome")
fivep_vars_df <- data.frame(ID = names(snp_utrs[["count_fivep_by_gene"]]),
variants = as.numeric(snp_utrs[["count_fivep_by_gene"]]))## Error: object 'snp_utrs' not found
## Error: object 'fivep_vars_df' not found
## Error: object 'fivep_vars_by_de_gene' not found
## Error: object 'fivep_vars_by_de_gene' not found
## Error: object 'fivep_vars_by_de_gene' not found
fivep_variants_wrt_logfc <- plot_linear_scatter(fivep_vars_by_de_gene, xcol = "deseq_logfc",
ycol = "variants", text_col = "annotgeneproduct")## Error: object 'fivep_vars_by_de_gene' not found
## Error: object 'fivep_variants_wrt_logfc' not found
## Error: object 'scatter' not found
threep_vars_df <- data.frame(ID = names(snp_utrs[["count_threep_by_gene"]]),
variants = as.numeric(snp_utrs[["count_threep_by_gene"]]))## Error: object 'snp_utrs' not found
## Error: object 'threep_vars_df' not found
## Error: object 'threep_vars_by_de_gene' not found
## Error: object 'threep_vars_by_de_gene' not found
## Error: object 'threep_vars_by_de_gene' not found
threep_variants_wrt_logfc <- plot_scatter(threep_vars_by_de_gene, xcol = "deseq_logfc",
ycol = "variants")## Error: object 'threep_vars_by_de_gene' not found
## Error: object 'threep_variants_wrt_logfc' not found
fivep_vars_by_sig_gene_idx <- abs(fivep_vars_by_de_gene[["deseq_logfc"]] >= 1.0) &
fivep_vars_by_de_gene[["deseq_adjp"]] <= 0.05## Error: object 'fivep_vars_by_de_gene' not found
## Error: object 'fivep_vars_by_de_gene' not found
written <- write_xlsx(data = fivep_vars_by_sig_gene, excel = "excel/fivep_variants_by_sig_gene.xlsx")## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': object 'fivep_vars_by_sig_gene' not found
threep_vars_by_sig_gene_idx <- abs(threep_vars_by_de_gene[["deseq_logfc"]] >= 1.0) &
threep_vars_by_de_gene[["deseq_adjp"]] <= 0.05## Error: object 'threep_vars_by_de_gene' not found
## Error: object 'threep_vars_by_de_gene' not found
written <- write_xlsx(data = threep_vars_by_sig_gene, excel = "excel/threep_variants_by_sig_gene.xlsx")## Error in h(simpleError(msg, call)): error in evaluating the argument 'data' in selecting a method for function 'write_xlsx': object 'threep_vars_by_sig_gene' not found
## The samples represent the following categories:
##
## cure failure nd
## 40 33 18
## Using a proportion of observed variants, converting the data to binary observations.
## The factor cure has 40 rows.
## The factor failure has 33 rows.
## The factor nd has 18 rows.
## Finished iterating over the chromosomes.
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: clinicalresponse. 3
## categories and 927126 variants were observed with 7
## combinations among them. 725 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.114678899082569.
density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
## I am bad with quasiquotation and the various non-standard evaluation techniques of tidyr.
## Let us use the following plot to see if I remember how to use bangbang to replace a NSE
## name with a variable, in this case the creatively named 'test'
## I rather think this is gross and generally prefer to just use .data
test <- sym("chr")
ggplot(density_df, aes(x = !!test, y = density_vec)) +
ggplot2::geom_col() +
ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))## clinical_written <- write_snps(new_snps, output_file = "clinical.aln")
## In my sample sheet, I did not record the variants by gene or a few other data which were
## collected
test_spec <- make_dnaseq_spec()
new_meta <- gather_preprocessing_metadata(pData(new_snps), specification = test_spec)## Checking the state of the condition column.
## Checking the state of the batch column.
## Checking the condition factor.
## Error in as.vector(x, mode = "character"): no method for coercing this S4 class to a vector
test <- classify_variants(new_meta[["new_meta"]], coverage_column = NULL,
variants_column = "freebayes_variants_by_gene")## Error: object 'new_meta' not found
column_colors <- pData(new_snps)[["knnv2classification"]]
red_idx <- column_colors == "z22"
column_colors[red_idx] <- "red"
pink_idx <- column_colors == "z21"
column_colors[pink_idx] <- "salmon"
blue_idx <- column_colors == "z23"
column_colors[blue_idx] <- "darkblue"
blue_idx <- column_colors == "z24"
column_colors[blue_idx] <- "cornflowerblue"
gplots::heatmap.2(as.matrix(test$mutations_by_sample_norm), trace = "none",
ColSideColors = column_colors)## Error in test$mutations_by_sample_norm: object of type 'symbol' is not subsettable
## Error in snps_vs_genes(lp_expt, clinical_sets, expt_name_col = "chromosome"): unused argument (expt_name_col = "chromosome")
## Error: object 'clinical_genes' not found
snp_density <- merge(as.data.frame(clinical_genes[["count_by_gene"]]),
as.data.frame(fData(lp_expt)),
by = "row.names")## Error: object 'clinical_genes' not found
## Error: object 'snp_density' not found
## Error: object 'snp_density' not found
## Error: object 'snp_density' not found
## Error: object 'snp_density' not found
## Error: object 'snp_density' not found
## Error: object 'snp_density' not found
## Error: object 'snp_density' not found
removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
snp_density <- snp_density[!drop_idx, ]
}## Error: object 'snp_density' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rowData': object 'lp_expt' not found
## Error: object 'clinical_snps' not found
## Error: object 'fail_ref_snps' not found
## Error: object 'clinical_snps' not found
## Error: object 'fail_ref_snps' not found
## Error: object 'cure_snps' not found
## Error in eval(expr, p): object 'cure_snps' not found
## Error in eval(expr, p): object 'fail_ref_snps' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'fData': object 'lp_expt' not found
## Error: object 'clinical_snps' not found
clinical_interest <- merge(clinical_interest,
as.data.frame(clinical_snps[["gene_summaries"]][["failure, reference strain"]]),
by = "row.names")## Error: object 'clinical_interest' not found
## Error: object 'clinical_interest' not found
## Error: object 'clinical_interest' not found
## Error: object 'clinical_interest' not found
## Error: object 'annot' not found
## Error: object 'annot' not found
## Error: object 'annot' not found
## Error: object 'annot' not found
Didn’t I create a set of densities by chromosome? Oh I think they come in from get_snp_sets()
The heatmap produced here should show the variants only for the zymodeme genes.
I am thinking that if we find clusters of locations which are variant, that might provide some PCR testing possibilities.
## Drop the 2.1, 2.4, unknown, and null
knn_snps <- set_expt_conditions(new_snps, fact = "knnv2classification") %>%
subset_expt(subset="condition=='z22'|condition=='z23'")
knn_sets <- get_snp_sets(knn_snps, factor = "knnv2classification")
knn_sets
summary(knn_sets)
## 1000000: 2.2
## 0100000: 2.3
pruned_snps <- subset_expt(new_snps, subset="condition=='z2.2'|condition=='z2.3'")
new_sets <- get_snp_sets(pruned_snps, factor = "condition")
new_sets
summary(new_sets[["intersections"]][["10"]])
write.csv(file="csv/variants_22.csv", x=new_sets[["intersections"]][["10"]])
summary(new_sets[["intersections"]][["01"]])
write.csv(file="csv/variants_23.csv", x=new_sets[["intersections"]][["01"]])
summary(knn_sets[["intersections"]][["10"]])
write.csv(file="csv/knn_variants_22.csv", x=new_sets[["intersections"]][["10"]])
summary(knn_sets[["intersections"]][["01"]])
write.csv(file="csv/knn_variants_23.csv", x=new_sets[["intersections"]][["01"]])Thus we see that there are 601 variants associated with 2.2 and 67,171 associated with 2.3.
The sequential_variants() function searches for variants which are clustered close together in the hopes that choosing PCR primers focused on this positions will(not) anneal and may be used as a quick way to identify strains.
The current set of strains
zymo22_sequentials <- sequential_variants(new_sets, conditions = "z2.2",
minimum = 1, maximum_separation = 2)## Error in sequential_variants(new_sets, conditions = "z2.2", minimum = 1, : could not find function "sequential_variants"
## Error: object 'zymo22_sequentials' not found
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "z2.3",
minimum = 2, maximum_separation = 2)## Error in sequential_variants(new_sets, conditions = "z2.3", minimum = 2, : could not find function "sequential_variants"
## Error: object 'zymo23_sequentials' not found
## In contrast, there are lots (587) of interesting regions for 2.3!
knn_zymo22_sequentials <- sequential_variants(knn_sets, conditions = "z2.2",
minimum = 1, maximum_separation = 2)## Error in sequential_variants(knn_sets, conditions = "z2.2", minimum = 1, : could not find function "sequential_variants"
## Error: object 'knn_zymo22_sequentials' not found
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
knn_zymo23_sequentials <- sequential_variants(knn_new_sets, conditions = "z2.3",
minimum = 2, maximum_separation = 2)## Error in sequential_variants(knn_new_sets, conditions = "z2.3", minimum = 2, : could not find function "sequential_variants"
## Error: object 'knn_zymo23_sequentials' not found
The first 4 candidate regions from my set of remaining: * Chr Pos. Distance * LpaL13-15 238433 448 * LpaL13-18 142844 613 * LpaL13-29 830342 252 * LpaL13-33 1331507 843
Lets define a couple of terms: * Third: Each of the 4 above positions. * Second: Third - Distance * End: Third + PrimerLen * Start: Second - Primerlen
In each instance, these are the last positions, so we want to grab three things:
## * LpaL13-15 238433 448
first_candidate_chr <- genome[["LpaL13_15"]]
primer_length <- 22
amplicon_length <- 448
first_candidate_third <- 238433
first_candidate_second <- first_candidate_third - amplicon_length
first_candidate_start <- first_candidate_second - primer_length
first_candidate_end <- first_candidate_third + primer_length
first_candidate_region <- subseq(first_candidate_chr, first_candidate_start, first_candidate_end)
first_candidate_region
first_candidate_5p <- subseq(first_candidate_chr, first_candidate_start, first_candidate_second)
as.character(first_candidate_5p)
first_candidate_3p <- spgs::reverseComplement(subseq(first_candidate_chr, first_candidate_third, first_candidate_end))
first_candidate_3p
## * LpaL13-18 142844 613
second_candidate_chr <- genome[["LpaL13_18"]]
primer_length <- 22
amplicon_length <- 613
second_candidate_third <- 142844
second_candidate_second <- second_candidate_third - amplicon_length
second_candidate_start <- second_candidate_second - primer_length
second_candidate_end <- second_candidate_third + primer_length
second_candidate_region <- subseq(second_candidate_chr, second_candidate_start, second_candidate_end)
second_candidate_region
second_candidate_5p <- subseq(second_candidate_chr, second_candidate_start, second_candidate_second)
as.character(second_candidate_5p)
second_candidate_3p <- spgs::reverseComplement(subseq(second_candidate_chr, second_candidate_third, second_candidate_end))
second_candidate_3p
## * LpaL13-29 830342 252
third_candidate_chr <- genome[["LpaL13_29"]]
primer_length <- 22
amplicon_length <- 252
third_candidate_third <- 830342
third_candidate_second <- third_candidate_third - amplicon_length
third_candidate_start <- third_candidate_second - primer_length
third_candidate_end <- third_candidate_third + primer_length
third_candidate_region <- subseq(third_candidate_chr, third_candidate_start, third_candidate_end)
third_candidate_region
third_candidate_5p <- subseq(third_candidate_chr, third_candidate_start, third_candidate_second)
as.character(third_candidate_5p)
third_candidate_3p <- spgs::reverseComplement(subseq(third_candidate_chr, third_candidate_third, third_candidate_end))
third_candidate_3p
## You are a garbage polypyrimidine tract.
## Which is actually interesting if the mutations mess it up.
## * LpaL13-33 1331507 843
fourth_candidate_chr <- genome[["LpaL13_33"]]
primer_length <- 22
amplicon_length <- 843
fourth_candidate_third <- 1331507
fourth_candidate_second <- fourth_candidate_third - amplicon_length
fourth_candidate_start <- fourth_candidate_second - primer_length
fourth_candidate_end <- fourth_candidate_third + primer_length
fourth_candidate_region <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_end)
fourth_candidate_region
fourth_candidate_5p <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_second)
as.character(fourth_candidate_5p)
fourth_candidate_3p <- spgs::reverseComplement(subseq(fourth_candidate_chr, fourth_candidate_third, fourth_candidate_end))
fourth_candidate_3pI made a fun little function which should find regions which have lots of variants associated with a given experimental factor.
pheno <- subset_expt(lp_expt, subset = "condition=='z2.2'|condition=='z2.3'")
pheno <- subset_expt(pheno, subset = "!is.na(pData(pheno)[['bcftable']])")
pheno_snps <- sm(count_expt_snps(pheno, annot_column = "bcftable"))
fun_stuff <- snp_density_primers(
pheno_snps,
bsgenome = "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53",
gff = "reference/TriTrypDB-53_LpanamensisMHOMCOL81L13.gff")
drop_scaffolds <- grepl(x = rownames(fun_stuff$favorites), pattern = "SCAF")
favorite_primer_regions <- fun_stuff[["favorites"]][!drop_scaffolds, ]
favorite_primer_regions[["bin"]] <- rownames(favorite_primer_regions)
library(dplyr)
favorite_primer_regions <- favorite_primer_regions %>%
relocate(bin)Here is my note from our meeting:
Cross reference primers to DE genes of 2.2/2.3 and/or resistance/suscpetible, add a column to the primer spreadsheet with the DE genes (in retrospect I am guessing this actually means to put the logFC as a column.
One nice thing, I did a semantic removal on the lp_expt, so the set of logFC/pvalues should not have any of the offending types; thus I should be able to automagically get rid of them in the merge.
logfc_columns <- zy_df[, c("deseq_logfc", "deseq_adjp")]
colnames(logfc_columns) <- c("z23_logfc", "z23_adjp")
new_table <- merge(favorite_primer_regions, logfc_columns,
by.x = "closest_gene_before_id", by.y = "row.names")
sus_columns <- sus_df[, c("deseq_logfc", "deseq_adjp")]
colnames(sus_columns) <- c("sus_logfc", "sus_adjp")
new_table <- merge(new_table, sus_columns,
by.x = "closest_gene_before_id", by.y = "row.names") %>%
relocate(bin)
written <- write_xlsx(data=new_table,
excel="excel/favorite_primers_xref_zy_sus.xlsx")We can cross reference the variants against the zymodeme status and plot a heatmap of the results and hopefully see how they separate.
## Error in snps_vs_genes(lp_expt, new_sets, expt_name_col = "chromosome"): unused argument (expt_name_col = "chromosome")
clinical_colors_v2 <- list(
"z22" = "#0000cc",
"z23" = "#cc0000")
new_zymo_norm <- normalize_expt(pruned_snps, norm = "quant") %>%
set_expt_conditions(fact = "zymodemecategorical") %>%
set_expt_colors(clinical_colors_v2)## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'object' in selecting a method for function 'pData': error in evaluating the argument 'input' in selecting a method for function 'state': object 'pruned_snps' not found
## Error in h(simpleError(msg, call)): error in evaluating the argument 'input_data' in selecting a method for function 'plot_heatmap': object 'new_zymo_norm' not found
## Error: object 'zymo_heat' not found
## Error: object 'zymo_heat' not found
Now let us try to make a heatmap which includes some of the annotation data.
des <- pData(both_norm)
zymo_column <- "zymodemecategorical"
undef_idx <- is.na(des[[zymo_column]])
des[undef_idx, "strain"] <- "unknown"
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(exprs(both_norm))
na_idx <- is.na(correlations)
correlations[na_idx] <- 0
zymo_missing_idx <- is.na(des[[zymo_column]])
des[[zymo_column]] <- as.character(des[[zymo_column]])
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
des[zymo_missing_idx, zymo_column] <- "unknown"
mydendro <- list(
"clustfun" = hclust,
"lwd" = 2.0)
col_data <- as.data.frame(des[, c(zymo_column, "clinicalcategorical")])
colnames(col_data) <- c("zymodeme", "outcome")
unknown_clinical <- is.na(col_data[["clinicalcategorical"]])
col_data[unknown_clinical, "outcome"] <- "undefined"
row_data <- as.data.frame(des[, c("sus_category_current", "knnv2classification")])
colnames(row_data) <- c("susceptibility", "mlclass")
myannot <- list(
"Col" = list("data" = col_data),
"Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
"col" = BrewerClusterCol)
mylabs <- list(
"Row" = list("nrow" = 4),
"Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(380)
zymo_annot_heat <- annHeatmap2(
correlations,
dendrogram = mydendro,
annotation = myannot,
cluster = myclust,
labels = mylabs,
## The following controls if the picture is symmetric
scale = "none",
col = hmcols)
dev <- pp(file = "images/dendro_heatmap.png", height = 20, width = 20)
plot(zymo_annot_heat)
closed <- dev.off()
plot(zymo_annot_heat)Print the larger heatmap so that all the labels appear. Keep in mind that as we get more samples, this image needs to continue getting bigger.
xref_prop <- table(pheno_snps[["conditions"]])
pheno_snps$conditions
idx_tbl <- exprs(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(exprs(pheno_snps)))
for (n in names(xref_prop)) {
new_tbl[[n]] <- 0
idx_cols <- which(pheno_snps[["conditions"]] == n)
prop_col <- rowSums(idx_tbl[, idx_cols]) / xref_prop[n]
new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1
new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]
library(CMplot)
simplify <- new_tbl
simplify[["strong22"]] <- NULL
CMplot(simplify, bin.size = 100000)
CMplot(new_tbl, plot.type="m", multracks=TRUE, threshold = c(0.01, 0.05),
threshold.lwd=c(1,1), threshold.col=c("black","grey"),
amplify=TRUE, bin.size=10000,
chr.den.col=c("darkgreen", "yellow", "red"),
signal.col=c("red", "green", "blue"),
signal.cex=1, file="jpg", memo="", dpi=300, file.output=TRUE, verbose=TRUE)This tool looks a little opaque, but provides sample data with things that make sense to me and should be pretty easy to recapitulate in our data.
## For this, let us use the 'new_snps' data structure.
## Caveat here: these need to be coerced to numbers.
my_covariates <- pData(new_snps)[, c(zymo_column, "clinicalcategorical")]
for (col in colnames(my_covariates)) {
my_covariates[[col]] <- as.numeric(as.factor(my_covariates[[col]]))
}
my_covariates <- t(my_covariates)
my_geneloc <- fData(lp_expt)[, c("gid", "chromosome", "start", "end")]
colnames(my_geneloc) <- c("geneid", "chr", "left", "right")
my_ge <- exprs(normalize_expt(lp_expt, transform = "log2", filter = TRUE, convert = "cpm"))
used_samples <- tolower(colnames(my_ge)) %in% colnames(exprs(new_snps))
my_ge <- my_ge[, used_samples]
my_snpsloc <- data.frame(rownames = rownames(exprs(new_snps)))
## Oh, caveat here: Because of the way I stored the data,
## I could have duplicate rows which presumably will make matrixEQTL sad
my_snpsloc[["chr"]] <- gsub(pattern = "^chr_(.+)_pos(.+)_ref_.*$", replacement = "\\1",
x = rownames(my_snpsloc))
my_snpsloc[["pos"]] <- gsub(pattern = "^chr_(.+)_pos(.+)_ref_.*$", replacement = "\\2",
x = rownames(my_snpsloc))
test <- duplicated(my_snpsloc)
## Each duplicated row would be another variant at that position;
## so in theory we would do a rle to number them I am guessing
## However, I do not have different variants so I think I can ignore this for the moment
## but will need to make my matrix either 0 or 1.
if (sum(test) > 0) {
message("There are: ", sum(duplicated), " duplicated entries.")
keep_idx <- ! test
my_snpsloc <- my_snpsloc[keep_idx, ]
}
my_snps <- exprs(new_snps)
one_idx <- my_snps > 0
my_snps[one_idx] <- 1
## Ok, at this point I think I have all the pieces which this method wants...
## Oh, no I guess not; it actually wants the data as a set of filenames...
library(MatrixEQTL)
write.table(my_snps, "eqtl/snps.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snps, "eqtl/snps.tsv", )
write.table(my_snpsloc, "eqtl/snpsloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snpsloc, "eqtl/snpsloc.tsv")
write.table(as.data.frame(my_ge), "eqtl/ge.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_ge), "eqtl/ge.tsv")
write.table(as.data.frame(my_geneloc), "eqtl/geneloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_geneloc), "eqtl/geneloc.tsv")
write.table(as.data.frame(my_covariates), "eqtl/covariates.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_covariates), "eqtl/covariates.tsv")
useModel = modelLINEAR # modelANOVA, modelLINEAR, or modelLINEAR_CROSS
# Genotype file name
SNP_file_name = "eqtl/snps.tsv"
snps_location_file_name = "eqtl/snpsloc.tsv"
expression_file_name = "eqtl/ge.tsv"
gene_location_file_name = "eqtl/geneloc.tsv"
covariates_file_name = "eqtl/covariates.tsv"
# Output file name
output_file_name_cis = tempfile()
output_file_name_tra = tempfile()
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 0.1
pvOutputThreshold_tra = 0.1
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric()
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6
## Load genotype data
snps = SlicedData$new()
snps$fileDelimiter = "\t" # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1 # one row of column labels
snps$fileSkipColumns = 1 # one column of row labels
snps$fileSliceSize = 2000 # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
## Load gene expression data
gene = SlicedData$new()
gene$fileDelimiter = "\t" # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1 # one row of column labels
gene$fileSkipColumns = 1 # one column of row labels
gene$fileSliceSize = 2000 # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
## Load covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t" # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1 # one row of column labels
cvrt$fileSkipColumns = 1 # one column of row labels
if(length(covariates_file_name) > 0) {
cvrt$LoadFile(covariates_file_name)
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE)
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE)
me = Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name_tra,
pvOutputThreshold = pvOutputThreshold_tra,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
output_file_name.cis = output_file_name_cis,
pvOutputThreshold.cis = pvOutputThreshold_cis,
snpspos = snpspos,
genepos = genepos,
cisDist = cisDist,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);if (!isTRUE(get0("skip_load"))) {
pander::pander(sessionInfo())
message("This is hpgltools commit: ", get_git_commit())
message("Saving to ", savefile)
## tmp <- sm(saveme(filename = savefile))
}## Warning: Your system is mis-configured: '/etc/localtime' is not a symlink
## Warning: It is strongly recommended to set envionment variable TZ to
## 'America/New_York' (or equivalent)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 015b6b24225f280c620d26d9f5e7ed2caa6139a7
## This is hpgltools commit: Fri Oct 17 09:49:54 2025 -0400: 015b6b24225f280c620d26d9f5e7ed2caa6139a7
## Saving to 04post_visualization_202510.rda.xz