202405: Changed excel output directory to match organization scheme in box. Generally this means files go to analyses/transcriptome/{type_of_contrast}/{date}/something_{suffix}.xlsx Where suffix is _table for the full tables and _sig for the significant genes and will include information about whether sva etc was used. 202405: Adding some goseq results.
** Note! ** The new definitions of susceptible/resistant are tighter than ever before, as a result there are no longer any ambiguous samples. Thus I removed the ambiguous contrasts in the following block.
Just a reminder that in data_structures.Rmd I created lp_go and lp_lengths
Najib read me an email listing off the gene names associated with the zymodeme classification. I took those names and cross referenced them against the Leishmania panamensis gene annotations and found the following:
They are:
Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some looking for specific differences among the various samples.
The following creates a colorspace (red to green) heatmap showing the observed expression of these genes in every sample.
my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
"LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
"other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")
zymo_six_genes <- exclude_genes_expt(lp_two_strains, ids = my_genes, method = "keep")## Note, I renamed this to subset_genes().
## remove_genes_expt(), before removal, there were 8710 genes, now there are 6.
## There are 93 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20002 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037
## 0.12101 0.08915 0.12835 0.11970 0.13438 0.10804 0.13756 0.11451
## TMRC20038 TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010 TMRC20016
## 0.11446 0.10858 0.11269 0.12840 0.11072 0.11376 0.10292 0.10451
## TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 TMRC20019 TMRC20070
## 0.10967 0.11676 0.11865 0.10594 0.10591 0.11517 0.12027 0.11179
## TMRC20020 TMRC20021 TMRC20022 TMRC20024 TMRC20036 TMRC20069 TMRC20033 TMRC20026
## 0.11100 0.10764 0.12952 0.11481 0.12100 0.11631 0.11188 0.13633
## TMRC20031 TMRC20076 TMRC20073 TMRC20055 TMRC20079 TMRC20071 TMRC20078 TMRC20094
## 0.10013 0.12403 0.12398 0.13783 0.12639 0.12003 0.13420 0.12021
## TMRC20042 TMRC20058 TMRC20072 TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056
## 0.13766 0.11900 0.14675 0.10984 0.10679 0.13443 0.12936 0.13510
## TMRC20060 TMRC20077 TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075
## 0.10882 0.12810 0.12494 0.12239 0.12145 0.11690 0.12051 0.11261
## TMRC20051 TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083
## 0.13149 0.11526 0.14480 0.13429 0.13812 0.12120 0.11306 0.12235
## TMRC20054 TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090 TMRC20044
## 0.12723 0.12316 0.13182 0.13588 0.11884 0.12391 0.11566 0.13691
## TMRC20045 TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096 TMRC20101 TMRC20092
## 0.12812 0.12219 0.11601 0.11684 0.11771 0.11364 0.11784 0.11465
## TMRC20082 TMRC20102 TMRC20099 TMRC20100 TMRC20091 TMRC20084 TMRC20087 TMRC20103
## 0.10358 0.11399 0.11888 0.10820 0.12935 0.11273 0.12564 0.13704
## TMRC20104 TMRC20086 TMRC20107 TMRC20081 TMRC20095
## 0.11723 0.10752 0.09364 0.10335 0.06566
## Removing 0 low-count genes (6 remaining).
lp_norm <- normalize_expt(lp_two_strains, filter = TRUE, convert = "rpkm",
norm = "quant", transform = "log2")## Removing 140 low-count genes (8570 remaining).
## transform_counts: Found 88 values equal to 0, adding 1 to the matrix.
I want to compare the above heatmap with one which is comprised of all genes with some ‘significantly high’ expression value and also a not-negligible coefficient of variance.
## Removing 4712 low-count genes (3998 remaining).
high_strain_norm <- normalize_expt(zymo_high_genes, convert = "rpkm",
norm = "quant", transform = "log2")## transform_counts: Found 476 values equal to 0, adding 1 to the matrix.
I think this plot suggests that the difference between the two primary strains is not really one of a few specific genes, but instead a global pattern.
##
## z2.2 z2.3
## 41 41
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: none.
## The primary analysis performed 15 comparisons.
## The logFC agreement among the methods follows:
## z23_vs_z22
## limma_vs_deseq 0.9648
## limma_vs_edger 0.9608
## limma_vs_ebseq 0.9402
## limma_vs_basic 0.9981
## limma_vs_noiseq -0.8973
## deseq_vs_edger 0.9991
## deseq_vs_ebseq 0.9880
## deseq_vs_basic 0.9656
## deseq_vs_noiseq -0.8748
## edger_vs_ebseq 0.9935
## edger_vs_basic 0.9617
## edger_vs_noiseq -0.8655
## ebseq_vs_basic 0.9416
## ebseq_vs_noiseq -0.8252
## basic_vs_noiseq -0.9024
## Including the plots causes the rda file to balloon to 3.4Gb in the following invocation.
## Removing them results in... holy crap 2.1Mb
zymo_table_nobatch <- combine_de_tables(
zymo_de_nobatch, keepers = zymodeme_keeper,
rda = glue("rda/zymo_tables_nobatch-v{ver}.rda"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_tables_nobatch-v{ver}.xlsx"))
zymo_table_nobatch## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 z23_vs_z22 45 86 46 86 61
## limma_sigdown
## 1 72
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
zymo_sig_nobatch <- extract_significant_genes(
zymo_table_nobatch,
according_to = "deseq", current_id = "GID", required_id = "GID",
gmt = glue("excel/zymodeme_nobatch-v{ver}.gmt"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_sig_nobatch_deseq-v{ver}.xlsx"))## Number of up IDs in contrast zymodeme: 45.
## Number of down IDs in contrast zymodeme: 86.
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## zymodeme 45 86
There are too few genes at our current stringencies for a meaningful result.
increased_z22 <- zymo_sig_nobatch[["deseq"]][["downs"]][["zymodeme"]]
increased_z23 <- zymo_sig_nobatch[["deseq"]][["ups"]][["zymodeme"]]
z22_goseq <- simple_goseq(increased_z22, go_db = lp_go, length_db = lp_lengths)## Found 17 go_db genes and 86 length_db genes out of 86.
## Found 15 go_db genes and 45 length_db genes out of 45.
Log ratio, mean average plot and volcano plot of the comparison of the two primary zymodeme transcriptomes. When the transcriptomes of the two main strains (43 and 41 samples of z2.3 and z2.1) were compared without any attempt at batch/surrogate estimation with DESeq2, 45 and 85 genes were observed as significantly higher in strain z2.3 and z2.2 respectively using a cutoff of 1.0 logFC and 0.05 FDR adjusted p-value. There remain a large number of genes which are likely significantly different between the two strains, but fall below the 2-fold difference required for ‘significance.’ This follows prior observations that the parasite transcriptomes are constituitively expressed.
When the same data was plotted via a volcano plot, the relatively small range of fold changes compared to the large range of adjusted p-values is visible.
##
## z2.2 z2.3
## 41 41
## Removing 0 low-count genes (8557 remaining).
## Setting 443 low elements to zero.
## transform_counts: Found 443 values equal to 0, adding 1 to the matrix.
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 15 comparisons.
## The logFC agreement among the methods follows:
## z23_vs_z22
## limma_vs_deseq 0.9648
## limma_vs_edger 0.9597
## limma_vs_ebseq 0.9401
## limma_vs_basic 0.9977
## limma_vs_noiseq -0.8983
## deseq_vs_edger 0.9990
## deseq_vs_ebseq 0.9857
## deseq_vs_basic 0.9654
## deseq_vs_noiseq -0.8749
## edger_vs_ebseq 0.9916
## edger_vs_basic 0.9605
## edger_vs_noiseq -0.8643
## ebseq_vs_basic 0.9416
## ebseq_vs_noiseq -0.8252
## basic_vs_noiseq -0.9024
zymo_table_sva <- combine_de_tables(
zymo_de_sva, keepers = zymodeme_keeper,
rda = glue("rda/zymo_tables_sva-v{ver}.rda"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_tables_sva-v{ver}.xlsx"))
zymo_table_sva## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 z23_vs_z22 47 85 47 85 59
## limma_sigdown
## 1 72
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
zymo_sig_sva <- extract_significant_genes(
zymo_table_sva,
according_to = "deseq",
current_id = "GID", required_id = "GID",
gmt = glue("excel/zymodeme_sva-v{ver}.gmt"),
excel = glue("{excel_out}/DE_Strain/{ver}/zymo_sig_sva-v{ver}.xlsx"))## Number of up IDs in contrast zymodeme: 47.
## Number of down IDs in contrast zymodeme: 85.
## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## zymodeme 47 85
There are too few genes at our current stringencies for a meaningful result.
increased_z22 <- zymo_sig_sva[["deseq"]][["downs"]][["zymodeme"]]
increased_z23 <- zymo_sig_sva[["deseq"]][["ups"]][["zymodeme"]]
z22_goseq <- simple_goseq(increased_z22, go_db = lp_go, length_db = lp_lengths)## Found 17 go_db genes and 85 length_db genes out of 85.
## Found 14 go_db genes and 47 length_db genes out of 47.
When estimates from SVA were included in the statistical model used by EdgeR, DESeq2, and limma; a nearly identical view of the data emerged. I think this shows with a high degree of confidence, that sva is not having a significant effect on this dataset.
This susceptibility comparison is using the ‘current’ dataset.
Note again: we no longer have any ambiguous samples, so I commented out a portion of the following block.
##
## resistant sensitive
## 46 45
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: none.
## The primary analysis performed 15 comparisons.
## The logFC agreement among the methods follows:
## snstv_vs_r
## limma_vs_deseq 0.9182
## limma_vs_edger 0.9191
## limma_vs_ebseq 0.9186
## limma_vs_basic 0.9977
## limma_vs_noiseq -0.8994
## deseq_vs_edger 1.0000
## deseq_vs_ebseq 0.9999
## deseq_vs_basic 0.9204
## deseq_vs_noiseq -0.8940
## edger_vs_ebseq 0.9999
## edger_vs_basic 0.9213
## edger_vs_noiseq -0.8969
## ebseq_vs_basic 0.9211
## ebseq_vs_noiseq -0.8950
## basic_vs_noiseq -0.9081
sus_table_nobatch <- combine_de_tables(
sus_de_nobatch, keepers = susceptibility_keepers,
rda = glue("rda/sus_tables_nobatch-v{ver}.rda"),
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_tables_nobatch-v{ver}.xlsx"))## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 39 78 40
## edger_sigdown limma_sigup limma_sigdown
## 1 78 54 70
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sus_sig_nobatch <- extract_significant_genes(
sus_table_nobatch,
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_sig_nobatch-v{ver}.xlsx"))
sus_de_sva <- all_pairwise(lp_susceptibility, filter = TRUE, model_batch = "svaseq")##
## resistant sensitive
## 46 45
## Removing 0 low-count genes (8558 remaining).
## Setting 449 low elements to zero.
## transform_counts: Found 449 values equal to 0, adding 1 to the matrix.
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 15 comparisons.
## The logFC agreement among the methods follows:
## snstv_vs_r
## limma_vs_deseq 0.9384
## limma_vs_edger 0.9389
## limma_vs_ebseq 0.9184
## limma_vs_basic 0.9960
## limma_vs_noiseq -0.9000
## deseq_vs_edger 0.9999
## deseq_vs_ebseq 0.9884
## deseq_vs_basic 0.9384
## deseq_vs_noiseq -0.8996
## edger_vs_ebseq 0.9879
## edger_vs_basic 0.9387
## edger_vs_noiseq -0.9031
## ebseq_vs_basic 0.9211
## ebseq_vs_noiseq -0.8950
## basic_vs_noiseq -0.9081
sus_table_sva <- combine_de_tables(
sus_de_sva, keepers = susceptibility_keepers,
rda = glue("rda/sus_tables_sva-v{ver}.rda"),
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_tables_sva-v{ver}.xlsx"))## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 41 80 40
## edger_sigdown limma_sigup limma_sigdown
## 1 80 51 68
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sus_sig_sva <- extract_significant_genes(
sus_table_sva, according_to = "deseq",
excel = glue("{excel_out}/DE_Susceptibility/{ver}/sus_sig_sva-v{ver}.xlsx"))
sus_sig_sva## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## resistant_sensitive 41 80
## To get a more true sense of sensitive vs resistant with sva, we kind of need to get rid of the
## unknown samples and perhaps the ambiguous.
## no_ambiguous <- subset_expt(lp_susceptibility, subset = "condition!='ambiguous'") %>%
## subset_expt(subset = "condition!='unknown'")
## no_ambiguous_de_sva <- all_pairwise(no_ambiguous, filter = TRUE, model_batch = "svaseq")
## no_ambiguous_de_sva
## Let us see if my keeper code will fail hard or soft with extra contrasts...
## no_ambiguous_table_sva <- combine_de_tables(
## no_ambiguous_de_sva, keepers = susceptibility_keepers,
## excel = glue("excel/no_ambiguous_tables_sva-v{ver}.xlsx"))
## no_ambiguous_table_sva
## no_ambiguous_sig_sva <- extract_significant_genes(
## no_ambiguous_table_sva, according_to = "deseq",
## excel = glue("excel/no_ambiguous_sig_sva-v{ver}.xlsx"))
## no_ambiguous_sig_sva## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## resistant_sensitive 41 80
increased_resistant <- sus_sig_sva[["deseq"]][["ups"]][["resistant_sensitive"]]
increased_sensitive <- sus_sig_sva[["deseq"]][["downs"]][["resistant_sensitive"]]
resistant_goseq <- simple_goseq(increased_resistant, go_db = lp_go, length_db = lp_lengths)## Found 12 go_db genes and 41 length_db genes out of 41.
## Found 16 go_db genes and 80 length_db genes out of 80.
## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Given that resistance/sensitivity tends to be correlated with strain, one might expect similar results. One caveat in this context though: there are fewer strains with resistance/sensitivity definitions. This when the analysis was repeated without the ambiguous/unknown samples, a few more genes were observed as significant.
## zymo_table_sva[["plots"]][["zymodeme"]][["deseq_ma_plots"]][["plot"]]
zy_df <- zymo_table_sva[["data"]][["zymodeme"]]
sus_df <- sus_table_sva[["data"]][["resistant_sensitive"]]
both_df <- merge(zy_df, sus_df, by = "row.names")
plot_df <- both_df[, c("deseq_logfc.x", "deseq_logfc.y")]
rownames(plot_df) <- both_df[["Row.names"]]
colnames(plot_df) <- c("z23_vs_z22", "sensitive_vs_resistant")
compare <- plot_linear_scatter(plot_df)
pp(file = "images/compare_sus_zy.png")
compare$scatter
dev.off()## png
## 2
##
## Pearson's product-moment correlation
##
## data: df[[xcol]] and df[[ycol]]
## t = 282, df = 8555, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9482 0.9523
## sample estimates:
## cor
## 0.9503
This susceptibility comparison is using the historical dataset.
sushist_de_nobatch <- all_pairwise(lp_susceptibility_historical, filter = TRUE,
model_batch = FALSE)##
## ambiguous resistant sensitive unknown
## 5 12 29 45
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: none.
## The primary analysis performed 15 comparisons.
sushist_table_nobatch <- combine_de_tables(
sushist_de_nobatch, keepers = susceptibility_keepers,
excel = glue("{excel_out}/DE_Susceptibility/sushist_tables_nobatch-v{ver}.xlsx"))## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 38 104 38
## edger_sigdown limma_sigup limma_sigdown
## 1 107 58 70
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sushist_sig_nobatch <- extract_significant_genes(
sushist_table_nobatch,
excel = glue("{excel_out}/DE_Susceptibility/sushist_sig_nobatch-v{ver}.xlsx"))
sushist_sig_nobatch## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down
## resistant_sensitive 58 70 38 107 38 104
## ebseq_up ebseq_down basic_up basic_down
## resistant_sensitive 34 43 53 75
##
## ambiguous resistant sensitive unknown
## 5 12 29 45
## Removing 0 low-count genes (8558 remaining).
## Setting 348 low elements to zero.
## transform_counts: Found 348 values equal to 0, adding 1 to the matrix.
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 15 comparisons.
sushist_table_sva <- combine_de_tables(
sushist_de_sva, keepers = susceptibility_keepers,
excel = glue("{excel_out}/DE_Susceptibility/sushist_tables_sva-v{ver}.xlsx"))## Inverting column: basic_logfc.
## Inverting column: deseq_logfc.
## Inverting column: ebseq_logfc.
## Inverting column: edger_logfc.
## Inverting column: limma_logfc.
## Inverting column: noiseq_logfc.
## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup
## 1 sensitive_vs_resistant-inverted 40 100 41
## edger_sigdown limma_sigup limma_sigdown
## 1 98 53 75
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
sushist_sig_sva <- extract_significant_genes(
sushist_table_sva, according_to = "deseq",
excel = glue("{excel_out}/DE_Susceptibility/sushist_sig_sva-v{ver}.xlsx"))
sushist_sig_sva## A set of genes deemed significant according to deseq.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## deseq_up deseq_down
## resistant_sensitive 40 100
##cf_nb_input <- subset_expt(cf_expt, subset="condition!='unknown'")
cf_de_nobatch <- all_pairwise(lp_cf_known, filter = TRUE, model_batch = FALSE)##
## cure fail
## 39 34
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: none.
## The primary analysis performed 15 comparisons.
## The logFC agreement among the methods follows:
## fail_vs_cr
## limma_vs_deseq 0.8480
## limma_vs_edger 0.8504
## limma_vs_ebseq 0.8450
## limma_vs_basic 0.9887
## limma_vs_noiseq -0.9361
## deseq_vs_edger 1.0000
## deseq_vs_ebseq 0.9998
## deseq_vs_basic 0.8709
## deseq_vs_noiseq -0.9095
## edger_vs_ebseq 0.9997
## edger_vs_basic 0.8733
## edger_vs_noiseq -0.9116
## ebseq_vs_basic 0.8690
## ebseq_vs_noiseq -0.9087
## basic_vs_noiseq -0.9536
cf_table_nobatch <- combine_de_tables(
cf_de_nobatch,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_tables_nobatch-v{ver}.xlsx"))
cf_table_nobatch## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 fail_vs_cure 0 1 1 1 0
## limma_sigdown
## 1 0
## Only fail_vs_cure_down has information, cannot create an UpSet.
## Plot describing unique/shared genes in a differential expression table.
## NULL
cf_sig_nobatch <- extract_significant_genes(
cf_table_nobatch,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_sig_nobatch-v{ver}.xlsx"))
cf_sig_nobatch## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down
## fail_vs_cure 0 0 1 1 0 1
## ebseq_up ebseq_down basic_up basic_down
## fail_vs_cure 0 0 0 1
##
## cure fail
## 39 34
## Removing 0 low-count genes (8546 remaining).
## Setting 110 low elements to zero.
## transform_counts: Found 110 values equal to 0, adding 1 to the matrix.
## A pairwise differential expression with results from: basic, deseq, ebseq, edger, limma, noiseq.
## This used a surrogate/batch estimate from: svaseq.
## The primary analysis performed 15 comparisons.
## The logFC agreement among the methods follows:
## fail_vs_cr
## limma_vs_deseq 0.9025
## limma_vs_edger 0.8922
## limma_vs_ebseq 0.8679
## limma_vs_basic 0.9405
## limma_vs_noiseq -0.9419
## deseq_vs_edger 0.9974
## deseq_vs_ebseq 0.9371
## deseq_vs_basic 0.9184
## deseq_vs_noiseq -0.9276
## edger_vs_ebseq 0.9228
## edger_vs_basic 0.9170
## edger_vs_noiseq -0.9209
## ebseq_vs_basic 0.8690
## ebseq_vs_noiseq -0.9087
## basic_vs_noiseq -0.9536
cf_table <- combine_de_tables(
cf_de,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_tables-v{ver}.xlsx"))
cf_table## A set of combined differential expression results.
## table deseq_sigup deseq_sigdown edger_sigup edger_sigdown limma_sigup
## 1 fail_vs_cure 2 7 1 7 0
## limma_sigdown
## 1 0
## `geom_line()`: Each group consists of only one observation.
## i Do you need to adjust the group aesthetic?
## Plot describing unique/shared genes in a differential expression table.
cf_sig <- extract_significant_genes(
cf_table,
excel = glue("{excel_out}/DE_Cure_vs_Fail/{ver}/cf_sig-v{ver}.xlsx"))
cf_sig## A set of genes deemed significant according to limma, edger, deseq, ebseq, basic.
## The parameters defining significant were:
## LFC cutoff: 1 adj P cutoff: 0.05
## limma_up limma_down edger_up edger_down deseq_up deseq_down
## fail_vs_cure 0 0 1 7 2 7
## ebseq_up ebseq_down basic_up basic_down
## fail_vs_cure 0 0 0 1
I am not going to mess with GO searches for this.
It is not surprising that few or no genes are deemed significantly differentially expressed across samples which were taken from cure or fail patients.
dev <- pp(file = "images/cf_ma.png")
cf_table[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]
closed <- dev.off()
cf_table[["plots"]][["fail_vs_cure"]][["deseq_ma_plots"]]One query we have not yet addressed: what are the similarities and differences among the strains used to infect the macrophage samples and the promastigote samples used in the TMRC2 parasite data?
In my container image, this dataset is not currently loaded, so turning this off.
## I just fixed this in the datasets Rmd, but until that propagates just set it manually
annotation(lp_expt) <- annotation(lp_macrophage)
tmrc2_macrophage_norm <- normalize_expt(lp_macrophage, transform="log2", convert="cpm",
norm="quant", filter=TRUE)## Removing 0 low-count genes (8710 remaining).
## transform_counts: Found 3735 values equal to 0, adding 1 to the matrix.
## The numbers of samples by condition are:
##
## z2.1 z2.2 z2.3 z2.4
## 7 62 70 2
all_nosb <- all_tmrc2
pData(all_nosb)[["stage"]] <- "promastigote"
na_idx <- is.na(pData(all_nosb)[["macrophagetreatment"]])
pData(all_nosb)[na_idx, "macrophagetreatment"] <- "undefined"
all_nosb <- subset_expt(all_nosb, subset = "macrophagetreatment!='inf_sb'")## The samples excluded are: TMRC30051, TMRC30062, TMRC30065, TMRC30069, TMRC30248, TMRC30249, TMRC30251, TMRC30252, TMRC30317, TMRC30321, TMRC30298, TMRC30300, TMRC30296, TMRC30302, TMRC30315, TMRC30294, TMRC30292, TMRC30308, TMRC30331, TMRC30332, TMRC30306.
## subset_expt(): There were 141, now there are 120 samples.
ama_idx <- pData(all_nosb)[["macrophagetreatment"]] == "inf"
pData(all_nosb)[ama_idx, "stage" ] <- "amastigote"
pData(all_nosb)[["batch"]] <- pData(all_nosb)[["stage"]]I think the above picture is sort of the opposite of what we want to compare in a DE analysis for this set of data, e.g. we want to compare promastigotes from amastigotes?
## The number of samples by batch are:
##
## z2.1 z2.2 z2.3 z2.4
## 7 55 56 2
## The numbers of samples by condition are:
##
## amastigote promastigote
## 29 91
two_zymo <- subset_expt(
all_nosb,
subset = "zymodemecategorical=='z22'|zymodemecategorical=='z23'|zymodemecategorical=='unknown'")## The samples excluded are: TMRC20057, TMRC20056, TMRC20093, TMRC20047, TMRC20045, TMRC20108, TMRC20091, TMRC20084, TMRC20103, TMRC30057, TMRC30061, TMRC30063, TMRC30064, TMRC30067, TMRC30162, TMRC30245, TMRC30247, TMRC30250, TMRC30267, TMRC30286, TMRC30316, TMRC30322, TMRC30328, TMRC30318, TMRC30324, TMRC30320, TMRC30297, TMRC30299, TMRC30295, TMRC30303, TMRC30301, TMRC30314, TMRC30293, TMRC30291, TMRC30307, TMRC30310, TMRC30311, TMRC30305.
## subset_expt(): There were 120, now there are 82 samples.
##
## amastigote promastigote
## 29 91
## Removing 0 low-count genes (8613 remaining).
## Setting 1443 low elements to zero.
## transform_counts: Found 1443 values equal to 0, adding 1 to the matrix.
pro_ama_table <- combine_de_tables(
pro_ama,
excel = glue("{excel_out}/DE_promastigote_amastigote/{ver}/pro_vs_ama_table-v{ver}.xlsx"))
pro_ama_sig <- extract_significant_genes(
pro_ama_table,
excel = glue("{excel_out}/DE_promastigote_amastigote/{ver}/pro_vs_ama_sig-v{ver}.xlsx"))increased_promastigote <- pro_ama_sig[["deseq"]][["ups"]][["promastigote_vs_amastigote"]]
increased_amastigote <- pro_ama_sig[["deseq"]][["downs"]][["promastigote_vs_amastigote"]]
promastigote_goseq <- simple_goseq(increased_promastigote, go_db = lp_go, length_db = lp_lengths)## Found 226 go_db genes and 567 length_db genes out of 567.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## Ontologies observed by goseq using 567 genes
## with significance cutoff 0.1.
## There are 20 MF hits, 20, BP hits, and 8 CC hits.
## Category mfp_plot_over is the most populated with 13 hits.
## Found 32 go_db genes and 71 length_db genes out of 71.
## NULL
## silly, topgo wants the gene id column to be 'ID', I should fix this.
colnames(lp_go) <- c("ID", "GO")
promastigote_topgo <- simple_topgo(increased_promastigote, go_db = lp_go)## Attempting to generate a id2go file in the format expected by topGO.
## simple_topgo(): Set densities = TRUE for ontology density plots.
## Getting enrichResult for ontology: bp.
## Getting enrichResult for ontology: mf.
## Getting enrichResult for ontology: cc.
## simple_topgo(): Set densities = TRUE for ontology density plots.
## Getting enrichResult for ontology: bp.
## Getting enrichResult for ontology: mf.
## Getting enrichResult for ontology: cc.
I am a little surprised by this plot, I somewhat expected there to be few genes which passed the 2-fold difference demarcation line.
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
locale: C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: ruv(v.0.9.7.1), hpgltools(v.1.0), testthat(v.3.2.1.1), Matrix(v.1.6-5), glue(v.1.7.0), SummarizedExperiment(v.1.32.0), GenomicRanges(v.1.54.1), GenomeInfoDb(v.1.38.8), IRanges(v.2.36.0), S4Vectors(v.0.40.2), MatrixGenerics(v.1.14.0), matrixStats(v.1.3.0), Biobase(v.2.62.0), BiocGenerics(v.0.48.1) and Heatplus(v.3.10.0)
loaded via a namespace (and not attached): fs(v.1.6.4), bitops(v.1.0-7), enrichplot(v.1.22.0), devtools(v.2.4.5), doParallel(v.1.0.17), HDO.db(v.0.99.1), httr(v.1.4.7), RColorBrewer(v.1.1-3), numDeriv(v.2016.8-1.1), profvis(v.0.3.8), tools(v.4.3.3), backports(v.1.5.0), utf8(v.1.2.4), R6(v.2.5.1), lazyeval(v.0.2.2), mgcv(v.1.9-1), urlchecker(v.1.0.1), withr(v.3.0.0), prettyunits(v.1.2.0), gridExtra(v.2.3), preprocessCore(v.1.64.0), cli(v.3.6.2), scatterpie(v.0.2.3), topGO(v.2.54.0), labeling(v.0.4.3), sass(v.0.4.9), robustbase(v.0.99-2), mvtnorm(v.1.2-5), genefilter(v.1.84.0), goseq(v.1.54.0), Rsamtools(v.2.18.0), yulab.utils(v.0.1.4), gson(v.0.1.0), R.utils(v.2.12.3), DOSE(v.3.28.2), sessioninfo(v.1.2.2), limma(v.3.58.1), rstudioapi(v.0.16.0), RSQLite(v.2.3.7), generics(v.0.1.3), gridGraphics(v.0.5-1), BiocIO(v.1.12.0), gtools(v.3.9.5), zip(v.2.3.1), dplyr(v.1.1.4), GO.db(v.3.18.0), fansi(v.1.0.6), abind(v.1.4-5), R.methodsS3(v.1.8.2), lifecycle(v.1.0.4), yaml(v.2.3.8), edgeR(v.4.0.16), gplots(v.3.1.3.1), qvalue(v.2.34.0), SparseArray(v.1.2.4), BiocFileCache(v.2.10.2), grid(v.4.3.3), blob(v.1.2.4), promises(v.1.3.0), crayon(v.1.5.3), miniUI(v.0.1.1.1), lattice(v.0.22-6), cowplot(v.1.1.3), GenomicFeatures(v.1.54.4), annotate(v.1.80.0), KEGGREST(v.1.42.0), pillar(v.1.9.0), knitr(v.1.47), varhandle(v.2.0.6), fgsea(v.1.28.0), rjson(v.0.2.21), boot(v.1.3-30), corpcor(v.1.6.10), codetools(v.0.2-20), fastmatch(v.1.1-4), ggfun(v.0.1.5), Vennerable(v.3.1.0.9000), data.table(v.1.15.4), remotes(v.2.5.0), treeio(v.1.26.0), vctrs(v.0.6.5), png(v.0.1-8), Rdpack(v.2.6), gtable(v.0.3.5), cachem(v.1.1.0), openxlsx(v.4.2.5.2), xfun(v.0.45), rbibutils(v.2.2.16), S4Arrays(v.1.2.1), mime(v.0.12), tidygraph(v.1.3.1), survival(v.3.7-0), iterators(v.1.0.14), statmod(v.1.5.0), ellipsis(v.0.3.2), nlme(v.3.1-165), pbkrtest(v.0.5.2), ggtree(v.3.13.0.001), usethis(v.2.2.3), bit64(v.4.0.5), progress(v.1.2.3), EnvStats(v.2.8.1), filelock(v.1.0.3), UpSetR(v.1.4.0), rprojroot(v.2.0.4), bslib(v.0.7.0), KernSmooth(v.2.23-24), colorspace(v.2.1-0), DBI(v.1.2.3), DESeq2(v.1.42.1), tidyselect(v.1.2.1), bit(v.4.0.5), compiler(v.4.3.3), curl(v.5.2.1), graph(v.1.80.0), BiasedUrn(v.2.0.12), SparseM(v.1.83), xml2(v.1.3.6), desc(v.1.4.3), DelayedArray(v.0.28.0), plotly(v.4.10.4), shadowtext(v.0.1.3), rtracklayer(v.1.62.0), scales(v.1.3.0), caTools(v.1.18.2), DEoptimR(v.1.1-3), remaCor(v.0.0.18), RBGL(v.1.78.0), rappdirs(v.0.3.3), stringr(v.1.5.1), digest(v.0.6.35), minqa(v.1.2.7), variancePartition(v.1.32.5), rmarkdown(v.2.27), aod(v.1.3.3), XVector(v.0.42.0), RhpcBLASctl(v.0.23-42), htmltools(v.0.5.8.1), pkgconfig(v.2.0.3), lme4(v.1.1-35.4), highr(v.0.11), dbplyr(v.2.5.0), fastmap(v.1.2.0), rlang(v.1.1.4), htmlwidgets(v.1.6.4), shiny(v.1.8.1.1), farver(v.2.1.2), jquerylib(v.0.1.4), jsonlite(v.1.8.8), BiocParallel(v.1.36.0), R.oo(v.1.26.0), GOSemSim(v.2.28.1), RCurl(v.1.98-1.14), magrittr(v.2.0.3), GenomeInfoDbData(v.1.2.11), ggplotify(v.0.1.2), patchwork(v.1.2.0), munsell(v.0.5.1), Rcpp(v.1.0.12), ape(v.5.8), viridis(v.0.6.5), stringi(v.1.8.4), ggraph(v.2.2.1), brio(v.1.1.5), zlibbioc(v.1.48.2), MASS(v.7.3-60), plyr(v.1.8.9), pkgbuild(v.1.4.4), parallel(v.4.3.3), ggrepel(v.0.9.5), forcats(v.1.0.0), Biostrings(v.2.70.3), graphlayouts(v.1.1.1), splines(v.4.3.3), pander(v.0.6.5), geneLenDataBase(v.1.38.0), hms(v.1.1.3), locfit(v.1.5-9.9), fastcluster(v.1.2.6), igraph(v.2.0.3), reshape2(v.1.4.4), biomaRt(v.2.58.2), pkgload(v.1.3.4), XML(v.3.99-0.16.1), evaluate(v.0.24.0), BiocManager(v.1.30.23), nloptr(v.2.1.0), PROPER(v.1.34.0), foreach(v.1.5.2), tweenr(v.2.0.3), httpuv(v.1.6.15), tidyr(v.1.3.1), purrr(v.1.0.2), polyclip(v.1.10-6), ggplot2(v.3.5.1), ggforce(v.0.4.2), broom(v.1.0.6), xtable(v.1.8-4), restfulr(v.0.0.15), tidytree(v.0.4.6), fANCOVA(v.0.6-1), later(v.1.3.2), viridisLite(v.0.4.2), tibble(v.3.2.1), lmerTest(v.3.1-3), clusterProfiler(v.4.10.1), aplot(v.0.2.3), memoise(v.2.0.1), AnnotationDbi(v.1.64.1), GenomicAlignments(v.1.38.2), sva(v.3.50.0) and GSEABase(v.1.64.0)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset
## This is hpgltools commit: