202405: Changed the excel output directory to match the organization scheme in box. Generally this means files go to analyses/transcriptome/{type_of_contrast}/{date}/something_{suffix}.xlsx, where the suffix is _table for the full tables and _sig for the significant genes, and will include information about whether sva etc. was used.
202405: Adding some goseq results.
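A minimal sketch of the path scheme described above, using only base R. The helper `build_excel_path()` and its argument names are hypothetical (the chunks in this document build their paths with `glue()` and `excel_out` instead), and the exact placement of the underscore before the suffix is an assumption.

```r
# Hypothetical helper: assemble
# analyses/transcriptome/{type_of_contrast}/{date}/{name}_{suffix}.xlsx
build_excel_path <- function(contrast_type, date, name, suffix,
                             base = file.path("analyses", "transcriptome")) {
  file.path(base, contrast_type, date, sprintf("%s_%s.xlsx", name, suffix))
}

# Full table and significant-gene outputs for one contrast; the name records
# whether sva was used.
table_path <- build_excel_path("DE_Strain", "202405", "zymo_sva", "table")
sig_path <- build_excel_path("DE_Strain", "202405", "zymo_sva", "sig")
print(table_path)
print(sig_path)
```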
** Note! ** The new definitions of susceptible/resistant are tighter than ever before; as a result, there are no longer any ambiguous samples. Thus I removed the ambiguous contrasts in the following block.
Just a reminder that in data_structures.Rmd I created lp_go and lp_lengths.
Najib read me an email listing off the gene names associated with the zymodeme classification. I took those names and cross referenced them against the Leishmania panamensis gene annotations and found the following:
They are:
Given these 6 gene IDs (NH has two gene IDs associated with it), I can look for specific differences among the various samples.
The following creates a colorspace (red to green) heatmap showing the observed expression of these genes in every sample.
my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
"LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
"other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")
zymo_six_genes <- exclude_genes(lp_two_strains, ids = my_genes, method = "keep")
## Note, I renamed this to subset_genes().
## subset_genes(), before removal, there were 8778 genes, now there are 6.
## There are 93 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20002 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037
## 0.11877 0.08774 0.12694 0.11685 0.13185 0.10680 0.13447 0.11226
## TMRC20038 TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010 TMRC20016
## 0.11215 0.10714 0.11141 0.12526 0.10922 0.11134 0.10190 0.10331
## TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 TMRC20019 TMRC20070
## 0.10780 0.11506 0.11721 0.10369 0.10397 0.11309 0.11818 0.11052
## TMRC20020 TMRC20021 TMRC20022 TMRC20024 TMRC20036 TMRC20069 TMRC20033 TMRC20026
## 0.10912 0.10639 0.12659 0.11237 0.11854 0.11358 0.10945 0.13413
## TMRC20031 TMRC20076 TMRC20073 TMRC20055 TMRC20079 TMRC20071 TMRC20078 TMRC20094
## 0.09875 0.12134 0.12195 0.13396 0.12432 0.11838 0.13113 0.11830
## TMRC20042 TMRC20058 TMRC20072 TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056
## 0.13499 0.11714 0.14327 0.10855 0.10534 0.13253 0.12709 0.13211
## TMRC20060 TMRC20077 TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075
## 0.10701 0.12627 0.12216 0.12028 0.11890 0.11538 0.11880 0.11144
## TMRC20051 TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083
## 0.12924 0.11354 0.14115 0.13212 0.13576 0.11954 0.11182 0.12063
## TMRC20054 TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090 TMRC20044
## 0.12548 0.12165 0.13005 0.13393 0.11746 0.12230 0.11411 0.13446
## TMRC20045 TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096 TMRC20101 TMRC20092
## 0.12583 0.12061 0.11419 0.11505 0.11619 0.11116 0.11628 0.11304
## TMRC20082 TMRC20102 TMRC20099 TMRC20100 TMRC20091 TMRC20084 TMRC20087 TMRC20103
## 0.10253 0.11246 0.11740 0.10696 0.12652 0.11096 0.12368 0.13507
## TMRC20104 TMRC20086 TMRC20107 TMRC20081 TMRC20095
## 0.11464 0.10615 0.09249 0.10096 0.06536
strain_norm <- normalize(zymo_six_genes, convert = "rpkm", filter = TRUE, transform = "log2",
length_column = "cds_length")
## Removing 0 low-count genes (6 remaining).
lp_norm <- normalize(lp_two_strains, filter = TRUE, convert = "cpm",
norm = "quant", transform = "log2", length_column = "cds_length")
## Removing 142 low-count genes (8636 remaining).
## transform_counts: Found 86 values equal to 0, adding 1 to the matrix.
I want to compare the above heatmap with one which is composed of all genes with some ‘significantly high’ expression value and also a non-negligible coefficient of variation.
## Removing 4731 low-count genes (4047 remaining).
high_strain_norm <- normalize(zymo_high_genes, convert = "rpkm",
norm = "quant", transform = "log2", length_column = "cds_length")
## transform_counts: Found 10008 values equal to 0, adding 1 to the matrix.
I think this plot suggests that the difference between the two primary strains is not really one of a few specific genes, but instead a global pattern.
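The filter described above (high expression plus a non-negligible coefficient of variation) can be sketched in base R as follows. This is only an illustration on simulated counts; the thresholds `min_mean` and `min_cv` are hypothetical and are not the values used to produce the 4047 genes reported above.

```r
set.seed(1)
# Toy count matrix: 200 genes x 12 samples (simulated, not TMRC2 data).
counts <- matrix(rnbinom(200 * 12, mu = 100, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))

# Per-gene mean and coefficient of variation (sd / mean).
gene_mean <- rowMeans(counts)
gene_cv <- apply(counts, 1, sd) / gene_mean

min_mean <- 100  # hypothetical 'high expression' threshold
min_cv <- 0.4    # hypothetical minimum coefficient of variation

# Keep genes passing both criteria; which() drops any NA comparisons.
keep <- gene_mean >= min_mean & gene_cv >= min_cv
high_variable <- counts[which(keep), , drop = FALSE]
dim(high_variable)
```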
## z2.2 z2.3
## 42 41
## Removing 150 low-count genes (8628 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 4662 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## Warning: the 'findbars' function has moved to the reformulas package. Please update your imports, or ask an upstream package maintainter to do so.
## This warning is displayed once per session.
## Warning: the 'nobars' function has moved to the reformulas package. Please update your imports, or ask an upstream package maintainter to do so.
## This warning is displayed once per session.
## conditions
## z22 z23
## 42 41
## conditions
## z22 z23
## 42 41
## conditions
## z22 z23
## 42 41
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: Existing surrogate matrix.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## z23_vs_z22
## basic_vs_deseq 0.8999
## basic_vs_dream 0.9294
## basic_vs_ebseq 0.8635
## basic_vs_edger 0.8933
## basic_vs_limma 0.9431
## basic_vs_noiseq 0.9197
## deseq_vs_dream 0.9860
## deseq_vs_ebseq 0.9905
## deseq_vs_edger 0.9990
## deseq_vs_limma 0.9616
## deseq_vs_noiseq 0.9970
## dream_vs_ebseq 0.9689
## dream_vs_edger 0.9833
## dream_vs_limma 0.9817
## dream_vs_noiseq 0.9885
## ebseq_vs_edger 0.9950
## ebseq_vs_limma 0.9402
## ebseq_vs_noiseq 0.9825
## edger_vs_limma 0.9576
## edger_vs_noiseq 0.9954
## limma_vs_noiseq 0.9647
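The agreement table above is presumably a set of pairwise correlations between the logFC estimates of each method. A minimal sketch of that idea on simulated values follows; the data, the noise levels, and the choice of Pearson correlation are assumptions for illustration, not a description of what all_pairwise() does internally.

```r
set.seed(42)
# Simulated 'true' log fold changes for 500 genes, plus method-specific noise.
true_lfc <- rnorm(500)
lfc <- data.frame(
  deseq = true_lfc + rnorm(500, sd = 0.1),
  edger = true_lfc + rnorm(500, sd = 0.1),
  limma = true_lfc + rnorm(500, sd = 0.3))

# Correlation matrix, then one named value per pair of methods.
agreement <- cor(lfc)
method_pairs <- combn(colnames(lfc), 2)
result <- setNames(
  apply(method_pairs, 2, function(p) agreement[p[1], p[2]]),
  apply(method_pairs, 2, function(p) paste(p, collapse = "_vs_")))
print(round(result, 4))
```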
## Including the plots causes the rda file to balloon to 3.4Gb in the following invocation.
## Removing them results in... holy crap 2.1Mb
zymo_table_nobatch <- combine_de_tables(
zymo_de_nobatch, keepers = zymodeme_keeper, label_column = "gene_product",
rda = glue("rda/zymo_tables_nobatch-v{ver}.rda"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_tables_nobatch-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 z23_vs_z22 48 114 48 115 75
## limma_sigdown
## 1 93
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## i Please use tidy evaluation idioms with `aes()`.
## i See also `vignette("ggplot2-in-packages")` for more information.
## i The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## i Please use the `linewidth` argument instead.
## i The deprecated feature was likely used in the UpSetR package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Plot describing unique/shared genes in a differential expression table.
zymo_sig_nobatch <- extract_significant_genes(
zymo_table_nobatch,
according_to = "deseq", current_id = "GID", required_id = "GID",
gmt = glue("excel/zymodeme_nobatch-v{ver}.gmt"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_sig_nobatch_deseq-v{ver}.xlsx"))
## Number of up IDs in contrast zymodeme: 48.
## Number of down IDs in contrast zymodeme: 114.
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## zymodeme 48 114
There are too few genes at our current stringencies for a meaningful result.
increased_z22 <- zymo_sig_nobatch[["deseq"]][["downs"]][["zymodeme"]]
increased_z23 <- zymo_sig_nobatch[["deseq"]][["ups"]][["zymodeme"]]
z22_goseq <- simple_goseq(increased_z22, go_db = lp_go, length_db = lp_lengths, min_xref = 10)
## Found 25 go_db genes and 114 length_db genes out of 114.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## Found 16 go_db genes and 48 length_db genes out of 48.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
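To illustrate why the short lists above (25 and 16 genes with GO annotations) leave little power for an enrichment test, here is a minimal over-representation sketch using the hypergeometric distribution. It is not what goseq does (goseq additionally corrects for gene-length bias), and the universe size, category size, and hit counts are hypothetical.

```r
# P(X >= k): probability of seeing at least k category members among n_sig
# genes drawn from a universe of n_universe genes containing n_cat members.
ora_pvalue <- function(k, n_sig, n_cat, n_universe) {
  phyper(k - 1, n_cat, n_universe - n_cat, n_sig, lower.tail = FALSE)
}

n_universe <- 3000  # hypothetical number of GO-annotated genes
n_cat <- 60         # hypothetical size of one GO category

# A short list: 25 annotated significant genes, 3 in the category.
p_small <- ora_pvalue(3, 25, n_cat, n_universe)
# The same proportion of hits in a list ten times larger.
p_large <- ora_pvalue(30, 250, n_cat, n_universe)
print(c(small_list = p_small, large_list = p_large))
```

With the same fraction of hits, the longer list gives a far smaller p-value, which is one reason relaxing the stringency may be needed before the GO results become informative.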
Log ratio, mean average plot and volcano plot of the comparison of the two primary zymodeme transcriptomes. When the transcriptomes of the two main strains (41 samples of z2.3 and 42 samples of z2.2) were compared with DESeq2 without any attempt at batch/surrogate estimation, 48 and 114 genes were observed as significantly higher in strains z2.3 and z2.2, respectively, using cutoffs of 1.0 logFC and 0.05 FDR-adjusted p-value. There remain a large number of genes which are likely significantly different between the two strains, but fall below the 2-fold difference required for ‘significance.’ This follows prior observations that the parasite transcriptomes are constitutively expressed.
When the same data were plotted via a volcano plot, the relatively small range of fold changes compared to the large range of adjusted p-values becomes visible.
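The cutoffs used throughout (|logFC| >= 1 and adjusted p-value <= 0.05) can be sketched on a toy table as follows. The data are simulated and the column names `logFC` and `adj_p` are hypothetical; the sketch also counts genes that pass the p-value cutoff but fall below the 2-fold threshold, which is the group discussed above.

```r
set.seed(7)
# Simulated differential expression table for 1000 genes.
de <- data.frame(
  logFC = rnorm(1000, sd = 0.8),
  adj_p = runif(1000),
  row.names = paste0("gene", 1:1000))

lfc_cutoff <- 1   # 2-fold on the log2 scale
p_cutoff <- 0.05  # FDR-adjusted p-value

passes_p <- de$adj_p <= p_cutoff
up <- passes_p & de$logFC >= lfc_cutoff
down <- passes_p & de$logFC <= -lfc_cutoff
# Statistically different, but below the fold-change requirement.
sub_threshold <- passes_p & abs(de$logFC) < lfc_cutoff

tally <- c(up = sum(up), down = sum(down), sub_threshold = sum(sub_threshold))
print(tally)
```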
zymo_de_sva <- all_pairwise(lp_zymo, filter = TRUE, model_fstring = "~ 0 + condition",
model_svs = "svaseq")
## z2.2 z2.3
## 42 41
## Removing 150 low-count genes (8628 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 4662 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## z22 z23
## 42 41
## conditions
## z22 z23
## 42 41
## conditions
## z22 z23
## 42 41
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## z23_vs_z22
## basic_vs_deseq 0.9033
## basic_vs_dream 0.9309
## basic_vs_ebseq 0.8635
## basic_vs_edger 0.8961
## basic_vs_limma 0.9432
## basic_vs_noiseq 0.9197
## deseq_vs_dream 0.9899
## deseq_vs_ebseq 0.9873
## deseq_vs_edger 0.9993
## deseq_vs_limma 0.9641
## deseq_vs_noiseq 0.9948
## dream_vs_ebseq 0.9689
## dream_vs_edger 0.9872
## dream_vs_limma 0.9829
## dream_vs_noiseq 0.9887
## ebseq_vs_edger 0.9918
## ebseq_vs_limma 0.9408
## ebseq_vs_noiseq 0.9825
## edger_vs_limma 0.9607
## edger_vs_noiseq 0.9928
## limma_vs_noiseq 0.9654
zymo_table_sva <- combine_de_tables(
zymo_de_sva, keepers = zymodeme_keeper, label_column = "gene_product",
rda = glue("rda/zymo_tables_sva-v{ver}.rda"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_tables_sva-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 z23_vs_z22 48 115 48 115 74
## limma_sigdown
## 1 94
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
zymo_sig_sva <- extract_significant_genes(
zymo_table_sva,
according_to = "deseq",
current_id = "GID", required_id = "GID",
gmt = glue("excel/zymodeme_sva-v{ver}.gmt"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_sig_sva-v{ver}.xlsx"))
## Number of up IDs in contrast zymodeme: 48.
## Number of down IDs in contrast zymodeme: 115.
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## zymodeme 48 115
There are too few genes at our current stringencies for a meaningful result.
increased_z22 <- zymo_sig_sva[["deseq"]][["downs"]][["zymodeme"]]
increased_z23 <- zymo_sig_sva[["deseq"]][["ups"]][["zymodeme"]]
z22_goseq <- simple_goseq(increased_z22, go_db = lp_go, length_db = lp_lengths)
## Found 26 go_db genes and 115 length_db genes out of 115.
## Found 14 go_db genes and 48 length_db genes out of 48.
When estimates from SVA were included in the statistical models used by edgeR, DESeq2, and limma, a nearly identical view of the data emerged: DESeq2 reported 48 up and 115 down genes with sva versus 48 and 114 without. I think this shows, with a high degree of confidence, that sva is not having a significant effect on this dataset.
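One way to put a number on 'nearly identical' is the overlap between the two significant gene lists. The sketch below computes a Jaccard index on hypothetical gene IDs sized to match the 114 and 115 down genes reported above; the real inputs would be the gene IDs from the two significance tables, and this sketch assumes (without checking) that the shorter list is contained in the longer one.

```r
# Jaccard index: shared IDs divided by all IDs seen in either list.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical IDs standing in for the DESeq2 'down' genes.
nobatch_ids <- paste0("gene", 1:114)
sva_ids <- paste0("gene", 1:115)

overlap <- jaccard(nobatch_ids, sva_ids)
print(overlap)
```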
This susceptibility comparison is using the ‘current’ dataset.
Note again: we no longer have any ambiguous samples, so I commented out a portion of the following block.
## resistant sensitive
## 46 46
## Removing 149 low-count genes (8629 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 5262 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## resistant sensitive
## 46 46
## conditions
## resistant sensitive
## 46 46
## conditions
## resistant sensitive
## 46 46
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: Existing surrogate matrix.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## snstv_vs_r
## basic_vs_deseq 0.8906
## basic_vs_dream 0.9306
## basic_vs_ebseq 0.8898
## basic_vs_edger 0.8923
## basic_vs_limma 0.9442
## basic_vs_noiseq 0.9365
## deseq_vs_dream 0.9140
## deseq_vs_ebseq 0.9978
## deseq_vs_edger 0.9991
## deseq_vs_limma 0.8981
## deseq_vs_noiseq 0.9855
## dream_vs_ebseq 0.9116
## dream_vs_edger 0.9141
## dream_vs_limma 0.9818
## dream_vs_noiseq 0.9566
## ebseq_vs_edger 0.9995
## ebseq_vs_limma 0.8947
## ebseq_vs_noiseq 0.9866
## edger_vs_limma 0.8976
## edger_vs_noiseq 0.9872
## limma_vs_noiseq 0.9395
sus_table_nobatch <- combine_de_tables(
sus_de_nobatch, keepers = susceptibility_keepers,
rda = glue("rda/sus_tables_nobatch-v{ver}.rda"),
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_tables_nobatch-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 41 108 42
## edger_sigdown limma_sigup limma_sigdown
## 1 108 66 90
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sus_sig_nobatch <- extract_significant_genes(
sus_table_nobatch,
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_sig_nobatch-v{ver}.xlsx"))
sus_de_sva <- all_pairwise(lp_susceptibility, filter = TRUE, model_fstring = "~ 0 + condition", model_svs = "svaseq")
## resistant sensitive
## 46 46
## Removing 149 low-count genes (8629 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 5262 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## resistant sensitive
## 46 46
## conditions
## resistant sensitive
## 46 46
## conditions
## resistant sensitive
## 46 46
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## snstv_vs_r
## basic_vs_deseq 0.9051
## basic_vs_dream 0.9328
## basic_vs_ebseq 0.8898
## basic_vs_edger 0.9072
## basic_vs_limma 0.9448
## basic_vs_noiseq 0.9365
## deseq_vs_dream 0.9376
## deseq_vs_ebseq 0.9887
## deseq_vs_edger 0.9999
## deseq_vs_limma 0.9195
## deseq_vs_noiseq 0.9870
## dream_vs_ebseq 0.9112
## dream_vs_edger 0.9381
## dream_vs_limma 0.9829
## dream_vs_noiseq 0.9567
## ebseq_vs_edger 0.9898
## ebseq_vs_limma 0.8953
## ebseq_vs_noiseq 0.9866
## edger_vs_limma 0.9205
## edger_vs_noiseq 0.9884
## limma_vs_noiseq 0.9402
sus_table_sva <- combine_de_tables(
sus_de_sva, keepers = susceptibility_keepers,
rda = glue("rda/sus_tables_sva-v{ver}.rda"),
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_tables_sva-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 45 109 45
## edger_sigdown limma_sigup limma_sigdown
## 1 109 65 92
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sus_sig_sva <- extract_significant_genes(
sus_table_sva, according_to = "deseq",
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_sig_sva-v{ver}.xlsx"))
sus_sig_sva
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## resistant_sensitive 45 109
## To get a more true sense of sensitive vs resistant with sva, we kind of need to get rid of the
## unknown samples and perhaps the ambiguous.
## no_ambiguous <- subset_se(lp_susceptibility, subset = "condition!='ambiguous'") %>%
## subset_se(subset = "condition!='unknown'")
## no_ambiguous_de_sva <- all_pairwise(no_ambiguous, filter = TRUE, model_batch = "svaseq")
## no_ambiguous_de_sva
## Let us see if my keeper code will fail hard or soft with extra contrasts...
## no_ambiguous_table_sva <- combine_de_tables(
## no_ambiguous_de_sva, keepers = susceptibility_keepers,
## excel = glue("excel/no_ambiguous_tables_sva-v{ver}.xlsx"))
## no_ambiguous_table_sva
## no_ambiguous_sig_sva <- extract_significant_genes(
## no_ambiguous_table_sva, according_to = "deseq",
## excel = glue("excel/no_ambiguous_sig_sva-v{ver}.xlsx"))
## no_ambiguous_sig_sva
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## resistant_sensitive 45 109
increased_resistant <- sus_sig_sva[["deseq"]][["ups"]][["resistant_sensitive"]]
increased_sensitive <- sus_sig_sva[["deseq"]][["downs"]][["resistant_sensitive"]]
resistant_goseq <- simple_goseq(increased_resistant, go_db = lp_go, length_db = lp_lengths)
## Found 14 go_db genes and 45 length_db genes out of 45.
## Found 24 go_db genes and 109 length_db genes out of 109.
Given that resistance/sensitivity tends to be correlated with strain, one might expect similar results. One caveat in this context though: there are fewer strains with resistance/sensitivity definitions. Thus, when the analysis was repeated without the ambiguous/unknown samples, a few more genes were observed as significant.
## zymo_table_sva[["plots"]][["zymodeme"]][["deseq_ma_plots"]][["plot"]]
zy_df <- zymo_table_sva[["data"]][["zymodeme"]]
sus_df <- sus_table_sva[["data"]][["resistant_sensitive"]]
both_df <- merge(zy_df, sus_df, by = "row.names")
plot_df <- both_df[, c("deseq_logfc.x", "deseq_logfc.y")]
rownames(plot_df) <- both_df[["Row.names"]]
colnames(plot_df) <- c("z23_vs_z22", "sensitive_vs_resistant")
compare <- plot_linear_scatter(plot_df)
pp(file = "images/compare_sus_zy.png")
compare$scatter
dev.off()
## png
## 2
##
## Pearson's product-moment correlation
##
## data: df[[xcol]] and df[[ycol]]
## t = 252, df = 8626, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9358 0.9408
## sample estimates:
## cor
## 0.9383
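The comparison above relies on merge() keyed on row names, which is easy to misread. A self-contained sketch on simulated logFC values shows the mechanics: merging by "row.names" moves the gene IDs into a `Row.names` column and suffixes the duplicated column names with `.x` and `.y`. The data here are simulated, so the correlation is illustrative only.

```r
set.seed(3)
genes <- paste0("gene", 1:200)
shared <- rnorm(200)  # signal common to both simulated contrasts

# Two toy result tables with the same column name, keyed by gene ID.
zy <- data.frame(deseq_logfc = shared + rnorm(200, sd = 0.2), row.names = genes)
sus <- data.frame(deseq_logfc = shared + rnorm(200, sd = 0.2), row.names = genes)

both <- merge(zy, sus, by = "row.names")
print(colnames(both))

# Pearson correlation between the two sets of fold changes.
lfc_test <- cor.test(both[["deseq_logfc.x"]], both[["deseq_logfc.y"]])
print(lfc_test$estimate)
```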
This susceptibility comparison is using the historical dataset.
sushist_de_nobatch <- all_pairwise(lp_susceptibility_historical, model_fstring = "~ 0 + condition",
filter = TRUE)
## ambiguous resistant sensitive unknown
## 5 12 30 45
## Removing 149 low-count genes (8629 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 5262 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## ambiguous resistant sensitive unknown
## 5 12 30 45
## conditions
## ambiguous resistant sensitive unknown
## 5 12 30 45
## conditions
## ambiguous resistant sensitive unknown
## 5 12 30 45
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: Existing surrogate matrix.
## The primary analysis performed 6 comparisons.
sushist_table_nobatch <- combine_de_tables(
sushist_de_nobatch, keepers = susceptibility_keepers,
excel = glue("{excel_out}/DE_Susceptibility/sushist_tables_nobatch-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 41 130 43
## edger_sigdown limma_sigup limma_sigdown
## 1 135 70 89
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sushist_sig_nobatch <- extract_significant_genes(
sushist_table_nobatch,
excel = glue("{excel_out}/DE_Susceptibility/sushist_sig_nobatch-v{ver}.xlsx"))
sushist_sig_nobatch
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down
## resistant_sensitive 70 89 43 135 41 130
## ebseq_up ebseq_down basic_up basic_down
## resistant_sensitive 36 61 49 88
sushist_de_sva <- all_pairwise(lp_susceptibility_historical, filter = TRUE,
model_fstring = "~ 0 + condition", model_svs = "svaseq")
## ambiguous resistant sensitive unknown
## 5 12 30 45
## Removing 149 low-count genes (8629 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 5262 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## ambiguous resistant sensitive unknown
## 5 12 30 45
## conditions
## ambiguous resistant sensitive unknown
## 5 12 30 45
## conditions
## ambiguous resistant sensitive unknown
## 5 12 30 45
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 6 comparisons.
sushist_table_sva <- combine_de_tables(
sushist_de_sva, keepers = susceptibility_keepers,
excel = glue("{excel_out}/DE_Susceptibility/sushist_tables_sva-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 43 129 43
## edger_sigdown limma_sigup limma_sigdown
## 1 124 58 87
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sushist_sig_sva <- extract_significant_genes(
sushist_table_sva, according_to = "deseq",
excel = glue("{excel_out}/DE_Susceptibility/sushist_sig_sva-v{ver}.xlsx"))
sushist_sig_sva
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## resistant_sensitive 43 129
##cf_nb_input <- subset_se(cf_se, subset="condition!='unknown'")
cf_de_nobatch <- all_pairwise(lp_cf_known, filter = TRUE,
model_fstring = "~ 0 + condition", model_svs = FALSE)
## cure fail
## 40 34
## Removing 154 low-count genes (8624 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 3817 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## cure fail
## 40 34
## conditions
## cure fail
## 40 34
## conditions
## cure fail
## 40 34
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: Existing surrogate matrix.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## fail_vs_cr
## basic_vs_deseq 0.7784
## basic_vs_dream 0.9502
## basic_vs_ebseq 0.8554
## basic_vs_edger 0.8383
## basic_vs_limma 0.9610
## basic_vs_noiseq 0.7507
## deseq_vs_dream 0.7651
## deseq_vs_ebseq 0.9072
## deseq_vs_edger 0.9652
## deseq_vs_limma 0.7597
## deseq_vs_noiseq 0.6022
## dream_vs_ebseq 0.8357
## dream_vs_edger 0.8206
## dream_vs_limma 0.9825
## dream_vs_noiseq 0.8075
## ebseq_vs_edger 0.9792
## ebseq_vs_limma 0.8192
## ebseq_vs_noiseq 0.6591
## edger_vs_limma 0.8107
## edger_vs_noiseq 0.6464
## limma_vs_noiseq 0.7853
cf_table_nobatch <- combine_de_tables(
cf_de_nobatch,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_tables_nobatch-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 fail_vs_cure 0 1 1 1 0
## limma_sigdown
## 1 0
## Only fail_vs_cure_down has information, cannot create an UpSet.
## Plot describing unique/shared genes in a differential expression table.
## NULL
cf_sig_nobatch <- extract_significant_genes(
cf_table_nobatch,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_sig_nobatch-v{ver}.xlsx"))
cf_sig_nobatch
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down
## fail_vs_cure 0 0 1 1 0 1
## ebseq_up ebseq_down basic_up basic_down
## fail_vs_cure 0 0 0 0
cf_de <- all_pairwise(lp_cf_known, filter = TRUE,
model_fstring = "~ 0 + condition", model_svs = "svaseq")
## cure fail
## 40 34
## Removing 154 low-count genes (8624 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 3817 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## cure fail
## 40 34
## conditions
## cure fail
## 40 34
## conditions
## cure fail
## 40 34
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 1 comparisons.
## The logFC agreement among the methods follows:
## fail_vs_cr
## basic_vs_deseq 0.8729
## basic_vs_dream 0.9285
## basic_vs_ebseq 0.8554
## basic_vs_edger 0.8703
## basic_vs_limma 0.9388
## basic_vs_noiseq 0.7507
## deseq_vs_dream 0.9051
## deseq_vs_ebseq 0.9064
## deseq_vs_edger 0.9970
## deseq_vs_limma 0.8915
## deseq_vs_noiseq 0.8043
## dream_vs_ebseq 0.8651
## dream_vs_edger 0.8973
## dream_vs_limma 0.9870
## dream_vs_noiseq 0.6988
## ebseq_vs_edger 0.8951
## ebseq_vs_limma 0.8493
## ebseq_vs_noiseq 0.6591
## edger_vs_limma 0.8843
## edger_vs_noiseq 0.8272
## limma_vs_noiseq 0.6904
cf_table <- combine_de_tables(
cf_de,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_tables-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 fail_vs_cure 1 13 2 13 0
## limma_sigdown
## 1 1
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
cf_sig <- extract_significant_genes(
cf_table,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_sig-v{ver}.xlsx"))
cf_sig
## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down
## fail_vs_cure 0 1 2 13 1 13
## ebseq_up ebseq_down basic_up basic_down
## fail_vs_cure 0 0 0 0
I am not going to mess with GO searches for this.
It is not surprising that few or no genes are deemed significantly differentially expressed across samples which were taken from cure or fail patients.
dev <- pp(file = "images/cf_ma.png")
cf_table[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]
closed <- dev.off()
cf_table[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]
One query we have not yet addressed: what are the similarities and differences among the strains used to infect the macrophage samples and the promastigote samples used in the TMRC2 parasite data?
In my container image, this dataset is not currently loaded, so I am turning this off.
## I just fixed this in the datasets Rmd, but until that propagates just set it manually
annotation(lp_se) <- annotation(lp_macrophage)
## Error: unable to find an inherited method for function 'annotation<-' for signature 'object = "SummarizedExperiment", value = "NULL"'
tmrc2_macrophage_norm <- normalize(lp_macrophage, transform="log2", convert="cpm",
norm="quant", filter=TRUE)
## Removing 0 low-count genes (8778 remaining).
## transform_counts: Found 3577 values equal to 0, adding 1 to the matrix.
all_tmrc2 <- hpgltools:::combine_se(lp_se, lp_macrophage)
all_nosb <- all_tmrc2
colData(all_nosb)[["stage"]] <- "promastigote"
na_idx <- is.na(colData(all_nosb)[["macrophagetreatment"]])
colData(all_nosb)[na_idx, "macrophagetreatment"] <- "undefined"
all_nosb <- subset_se(all_nosb, subset = "macrophagetreatment!='inf_sb'")
ama_idx <- colData(all_nosb)[["macrophagetreatment"]] == "inf"
colData(all_nosb)[ama_idx, "stage" ] <- "amastigote"
colData(all_nosb)[["batch"]] <- colData(all_nosb)[["stage"]]
I think the above picture is roughly the opposite of what we want to compare in a DE analysis for this set of data; i.e., do we want to compare promastigotes to amastigotes?
## The number of samples by batch are:
##
## z2.1 z2.2 z2.3 z2.4
## 7 56 56 2
## The numbers of samples by condition are:
##
## amastigote promastigote
## 29 92
two_zymo <- subset_se(
all_nosb,
subset = "zymodemecategorical=='z22'|zymodemecategorical=='z23'|zymodemecategorical=='unknown'")
pro_ama <- all_pairwise(all_nosb, filter = TRUE,
                        model_fstring = "~ 0 + condition", model_svs = "svaseq")
## amastigote promastigote
## 29 92
## Removing 94 low-count genes (8684 remaining).
## Potentially check over the experimental design, there appear to be missing values.
## Warning in plot_pca(data = mtrx, design = design, state = state, plot_colors =
## plot_colors, : There are NA values in the component data. This can lead to
## weird plotting errors.
## Potentially check over the experimental design, there appear to be missing values.
## Warning in plot_pca(data = mtrx, design = design, state = state, plot_colors =
## plot_colors, : There are NA values in the component data. This can lead to
## weird plotting errors.
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 13046 entries to zero.
## This received a matrix of SVs.
## converting counts to integer mode
## the design formula contains one or more numeric variables with integer values,
## specifying a model with increasing fold change for higher values.
## did you mean for this to be a factor? if so, first convert
## this variable to a factor using the factor() function
## the design formula contains one or more numeric variables with integer values,
## specifying a model with increasing fold change for higher values.
## did you mean for this to be a factor? if so, first convert
## this variable to a factor using the factor() function
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## amastigote promastigote
## 29 92
## conditions
## amastigote promastigote
## 29 92
## conditions
## amastigote promastigote
## 29 92
pro_ama_table <- combine_de_tables(
pro_ama,
  excel = glue("{excel_out}/DE_promastigote_amastigote/{ver}/pro_vs_ama_table-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
pro_ama_sig <- extract_significant_genes(
pro_ama_table,
  excel = glue("{excel_out}/DE_promastigote_amastigote/{ver}/pro_vs_ama_sig-v{ver}.xlsx"))
increased_promastigote <- pro_ama_sig[["deseq"]][["ups"]][["promastigote_vs_amastigote"]]
increased_amastigote <- pro_ama_sig[["deseq"]][["downs"]][["promastigote_vs_amastigote"]]
promastigote_goseq <- simple_goseq(increased_promastigote, go_db = lp_go, length_db = lp_lengths)
## Found 226 go_db genes and 576 length_db genes out of 576.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## Ontologies observed by goseq using 576 genes
## with significance cutoff 0.1.
## There are 33 MF hits, 55, BP hits, and 21 CC hits.
## Category bpp_plot_over is the most populated with 30 hits.
amastigote_goseq <- simple_goseq(increased_amastigote, go_db = lp_go,
                                length_db = lp_lengths, min_xref = 30)
## Found 31 go_db genes and 85 length_db genes out of 85.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## Ontologies observed by goseq using 85 genes
## with significance cutoff 0.1.
## There are 51 MF hits, 74, BP hits, and 22 CC hits.
## Category mfp_plot_over is the most populated with 30 hits.
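For intuition about what an over-representation hit means, here is a minimal sketch of a plain hypergeometric test for a single GO category. This is a simplification and not what goseq does: goseq additionally corrects for gene length bias, which this sketch ignores. All four counts are made up for illustration and are not taken from the results above.

```r
## Made-up counts for one GO category (not taken from the goseq results).
universe_size <- 1000   # genes with any GO annotation
in_category <- 50       # annotated genes belonging to the category
de_size <- 100          # annotated genes in the significant set
de_in_category <- 12    # significant genes belonging to the category
## P(X >= de_in_category) when drawing de_size genes at random from the universe.
p_over <- phyper(de_in_category - 1, in_category,
                 universe_size - in_category, de_size,
                 lower.tail = FALSE)
p_over
```

With these numbers one would expect about 5 category members among the 100 significant genes by chance, so observing 12 yields a small p-value.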
## silly, topgo wants the gene id column to be 'ID', I should fix this.
colnames(lp_go) <- c("ID", "GO")
promastigote_topgo <- simple_topgo(increased_promastigote, go_db = lp_go)
## Warning: NAs introduced by coercion
## Getting enrichResult for ontology: bp.
## Getting enrichResult for ontology: mf.
## Getting enrichResult for ontology: cc.
## Warning in topgo_tables(results, godata, limitby = limitby, limit = limit, :
## NAs introduced by coercion
## Getting enrichResult for ontology: bp.
## Getting enrichResult for ontology: mf.
## Getting enrichResult for ontology: cc.
I am a little surprised by this plot; I had somewhat expected that few genes would pass the 2-fold difference demarcation line.
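A quick way to put a number on that impression is to count the genes on each side of the cutoff. The sketch below uses a made-up table; the column names logFC and adj_p are hypothetical placeholders rather than the names used in the combined tables, and the adjusted p-value cutoff of 0.05 is an assumption. The fold-change cutoff of 1 on the log2 scale corresponds to the 2-fold line.

```r
## Toy DE table (made-up values; column names are hypothetical).
de <- data.frame(
  row.names = c("g1", "g2", "g3", "g4", "g5"),
  logFC = c(2.3, -1.5, 0.4, 1.1, -0.2),
  adj_p = c(0.001, 0.02, 0.5, 0.2, 0.9))
lfc_cutoff <- 1     # |log2FC| >= 1 is a 2-fold change
p_cutoff <- 0.05    # assumed adjusted p-value cutoff
## Genes must pass both the fold-change and the p-value cutoff.
up <- de[de[["logFC"]] >= lfc_cutoff & de[["adj_p"]] <= p_cutoff, ]
down <- de[de[["logFC"]] <= -lfc_cutoff & de[["adj_p"]] <= p_cutoff, ]
c(up = nrow(up), down = nrow(down))
```

In the toy table g4 passes the fold-change line but not the p-value cutoff, which is the kind of gene that sits beyond the demarcation line on the plot without being called significant.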
## Warning: Your system is mis-configured: '/etc/localtime' is not a symlink
## Warning: It is strongly recommended to set envionment variable TZ to
## 'America/New_York' (or equivalent)
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
locale: C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: ruv(v.0.9.7.1), edgeR(v.4.6.3), hpgltools(v.1.2), testthat(v.3.2.3), glue(v.1.8.0) and Heatplus(v.3.16.0)
loaded via a namespace (and not attached): fs(v.1.6.6), matrixStats(v.1.5.0), bitops(v.1.0-9), enrichplot(v.1.28.4), blockmodeling(v.1.1.8), devtools(v.2.4.6), doParallel(v.1.0.17), httr(v.1.4.7), RColorBrewer(v.1.1-3), numDeriv(v.2016.8-1.1), tools(v.4.5.0), backports(v.1.5.0), R6(v.2.6.1), lazyeval(v.0.2.2), mgcv(v.1.9-3), withr(v.3.0.2), prettyunits(v.1.2.0), gridExtra(v.2.3), preprocessCore(v.1.70.0), cli(v.3.6.5), Biobase(v.2.68.0), topGO(v.2.60.1), labeling(v.0.4.3), EBSeq(v.2.6.0), sass(v.0.4.10), robustbase(v.0.99-7), mvtnorm(v.1.3-3), S7(v.0.2.1), genefilter(v.1.90.0), goseq(v.1.60.0), Rsamtools(v.2.24.0), systemfonts(v.1.3.1), yulab.utils(v.0.2.4), txdbmaker(v.1.4.2), gson(v.0.1.0), DOSE(v.4.2.0), R.utils(v.2.13.0), dichromat(v.2.0-0.1), sessioninfo(v.1.2.3), limma(v.3.64.3), rstudioapi(v.0.17.1), RSQLite(v.2.4.3), generics(v.0.1.4), gridGraphics(v.0.5-1), BiocIO(v.1.18.0), gtools(v.3.9.5), zip(v.2.3.3), dplyr(v.1.2.0), GO.db(v.3.21.0), Matrix(v.1.7-3), S4Vectors(v.0.46.0), abind(v.1.4-8), R.methodsS3(v.1.8.2), lifecycle(v.1.0.5), yaml(v.2.3.10), SummarizedExperiment(v.1.38.1), BiocFileCache(v.2.16.1), gplots(v.3.3.0), qvalue(v.2.40.0), SparseArray(v.1.8.1), grid(v.4.5.0), blob(v.1.2.4), promises(v.1.3.3), crayon(v.1.5.3), ggtangle(v.0.0.7), lattice(v.0.22-7), cowplot(v.1.2.0), GenomicFeatures(v.1.60.0), annotate(v.1.86.1), KEGGREST(v.1.48.1), pillar(v.1.11.0), knitr(v.1.50), varhandle(v.2.0.6), fgsea(v.1.34.2), GenomicRanges(v.1.60.0), rjson(v.0.2.23), boot(v.1.3-31), corpcor(v.1.6.10), codetools(v.0.2-20), fastmatch(v.1.1-6), ggiraph(v.0.9.5), ggfun(v.0.2.0), fontLiberation(v.0.1.0), Vennerable(v.3.1.0.9000), data.table(v.1.17.8), remotes(v.2.5.0), vctrs(v.0.7.1), png(v.0.1-8), treeio(v.1.32.0), Rdpack(v.2.6.6), gtable(v.0.3.6), cachem(v.1.1.0), openxlsx(v.4.2.8.1), xfun(v.0.53), rbibutils(v.2.4.1), S4Arrays(v.1.8.1), mime(v.0.13), RcppEigen(v.0.3.4.0.2), reformulas(v.0.4.4), survival(v.3.8-3), NOISeq(v.2.52.0), iterators(v.1.0.14), statmod(v.1.5.0), 
ellipsis(v.0.3.2), nlme(v.3.1-168), pbkrtest(v.0.5.5), ggtree(v.4.1.1.004), usethis(v.3.2.1), bit64(v.4.6.0-1), fontquiver(v.0.2.1), filelock(v.1.0.3), progress(v.1.2.3), EnvStats(v.3.1.0), UpSetR(v.1.4.0), GenomeInfoDb(v.1.44.2), rprojroot(v.2.1.1), bslib(v.0.9.0), KernSmooth(v.2.23-26), BiocGenerics(v.0.54.0), DBI(v.1.2.3), DESeq2(v.1.48.1), tidyselect(v.1.2.1), bit(v.4.6.0), compiler(v.4.5.0), curl(v.7.0.0), httr2(v.1.2.1), graph(v.1.86.0), BiasedUrn(v.2.0.12), SparseM(v.1.84-2), xml2(v.1.4.0), desc(v.1.4.3), fontBitstreamVera(v.0.1.1), DelayedArray(v.0.34.1), plotly(v.4.11.0), rtracklayer(v.1.68.0), scales(v.1.4.0), caTools(v.1.18.3), DEoptimR(v.1.1-4), remaCor(v.0.0.20), RBGL(v.1.84.0), rappdirs(v.0.3.3), stringr(v.1.5.1), digest(v.0.6.37), minqa(v.1.2.8), variancePartition(v.1.38.1), rmarkdown(v.2.29), aod(v.1.3.3), XVector(v.0.48.0), RhpcBLASctl(v.0.23-42), htmltools(v.0.5.8.1), pkgconfig(v.2.0.3), lme4(v.1.1-38), MatrixGenerics(v.1.20.0), dbplyr(v.2.5.0), fastmap(v.1.2.0), rlang(v.1.1.7), htmlwidgets(v.1.6.4), UCSC.utils(v.1.4.0), shiny(v.1.11.1), farver(v.2.1.2), jquerylib(v.0.1.4), jsonlite(v.2.0.0), BiocParallel(v.1.42.1), GOSemSim(v.2.34.0), R.oo(v.1.27.1), RCurl(v.1.98-1.17), magrittr(v.2.0.4), GenomeInfoDbData(v.1.2.14), ggplotify(v.0.1.2), patchwork(v.1.3.2), Rcpp(v.1.1.0), ape(v.5.8-1), gdtools(v.0.5.0), stringi(v.1.8.7), brio(v.1.1.5), MASS(v.7.3-65), plyr(v.1.8.9), pkgbuild(v.1.4.8), parallel(v.4.5.0), ggrepel(v.0.9.6), forcats(v.1.0.0), Biostrings(v.2.76.0), splines(v.4.5.0), pander(v.0.6.6), hms(v.1.1.3), geneLenDataBase(v.1.44.0), locfit(v.1.5-9.12), igraph(v.2.1.4), fastcluster(v.1.3.0), biomaRt(v.2.64.0), reshape2(v.1.4.4), stats4(v.4.5.0), pkgload(v.1.5.0), XML(v.3.99-0.19), evaluate(v.1.0.4), BiocManager(v.1.30.26), nloptr(v.2.2.1), PROPER(v.1.40.0), foreach(v.1.5.2), httpuv(v.1.6.16), tidyr(v.1.3.2), purrr(v.1.2.1), ggplot2(v.4.0.2), broom(v.1.0.12), xtable(v.1.8-4), restfulr(v.0.0.16), fANCOVA(v.0.6-1), tidytree(v.0.4.6), later(v.1.4.3), 
viridisLite(v.0.4.2), tibble(v.3.3.0), lmerTest(v.3.2-0), clusterProfiler(v.4.16.0), aplot(v.0.2.8), memoise(v.2.0.1), AnnotationDbi(v.1.70.0), GenomicAlignments(v.1.44.0), IRanges(v.2.42.0), sva(v.3.56.0) and GSEABase(v.1.70.0)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset b5cdc146bf9658d9f49b9563e829002346532804
## This is hpgltools commit: Tue Feb 17 14:35:01 2026 -0500: b5cdc146bf9658d9f49b9563e829002346532804