Before we begin, a couple of parameters which have given me grief.
## Used by the various functions which cross reference grange data
## The SEs used in this document are getting this from the orgdb
## which includes this information in multiple columns with different
## chromosome ID prefixes. E.g. sometimes it is just 1,2,3, ... other times
## it is LpaL1, LpaL2, LpaL3, ...
exp_chr_col <- "sequence_id"
## The tritrypdb also puts the start/stop/strand information in multiple places
exp_start_col <- "coding_start"
exp_end_col <- "coding_end"
This document will visualize the TMRC2 samples before completing the various differential expression and variant analyses in the hopes of getting an understanding of how the various samples relate to each other.
Start off with the library sizes of the original dataset. The main thing to note is that we have quite a large variance in coverage. A few of these samples are highly likely to be removed shortly (looking at you, TMRC20001 and TMRC20095).
## Library sizes of 92 samples,
## ranging from 564,812 to 1.37e+08.
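As a minimal sketch of what a library size is, the following base R computes per-sample column sums on a made-up count matrix. The object names (`toy_counts`, `toy_libsizes`, `toy_spread`) are hypothetical and the values are simulated; this is not the code which produced the numbers above.

```r
## Toy count matrix: rows are genes, columns are samples (simulated values).
set.seed(1)
toy_counts <- matrix(rpois(40, lambda = 50), nrow = 10,
                     dimnames = list(paste0("gene", 1:10),
                                     paste0("sample", 1:4)))
## The library size is the per-sample sum of counts.
toy_libsizes <- colSums(toy_counts)
## Ratio of the largest to the smallest library: a rough gauge of
## how divergent the coverage is across samples.
toy_spread <- max(toy_libsizes) / min(toy_libsizes)
print(toy_libsizes)
print(toy_spread)
```

A large ratio between the biggest and smallest library, as in the range printed above, is the kind of spread which motivates removing the least covered samples.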
png_libsize <- pp(file = "images/lp_se_libsizes.png", image = libsizes[["plot"]],
width = 18, height = 9)
## Warning in pp(file = "images/lp_se_libsizes.png", image = libsizes[["plot"]], :
## There is no device to shut down.
svg_libsize <- pp(file = "images/lp_se_libsizes.svg", image = libsizes[["plot"]],
width = 18, height = 9)
pdf_libsize <- pp(file = "images/lp_se_libsizes.pdf", image = libsizes[["plot"]],
width = 18, height = 9)
Library sizes of the protein coding gene counts observed per sample. The samples were mapped with the EuPathDB revision 36 of the Leishmania (Viannia) panamensis strain MHOM/COL/81L13 genome; the alignments were sorted, indexed, and counted via htseq using the gene features, and non-protein coding features were excluded. The per-sample sums of the remaining matrix were plotted to check that the relative sample coverage is sufficient and not too divergent across samples. Bars are colored according to strain/zymodeme annotation: zymodeme 2.3 is red; zymodeme 2.2 is blue; the Leishmania braziliensis-like strains b2904, z1.0, and z1.5 are purple; z2.4, the zymodeme most similar to 2.3, is light brown; and the zymodemes most similar to 2.2 (z3.0, z2.0, z2.1, and z3.2) are light gray, dark gray, dark brown, and gray, respectively.
This plot is usually our primary arbiter for removing samples based on coverage. We pick a semi-arbitrary cutoff based on both coverage and genes observed. In this instance 8,600 genes seems likely?
The cutoff argument prints out samples whose gene coverage is below that proportion. I think we already dropped the most problematic samples in the sample sheet, so it may not actually print anything.
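The idea of a proportion cutoff on observed genes can be sketched in base R as follows. The data are invented, and the logic is only an assumption about what such a cutoff means (the fraction of genes with at least one read); it is not taken from the plot_nonzero() implementation.

```r
## Hypothetical toy data: 10 genes by 3 samples; "s3" is deliberately sparse.
toy_counts <- cbind(
  s1 = c(5, 3, 0, 8, 2, 9, 1, 4, 6, 7),
  s2 = c(2, 0, 1, 6, 3, 4, 2, 5, 1, 8),
  s3 = c(0, 0, 0, 1, 0, 2, 0, 0, 3, 0))
## Number of genes with at least one read in each sample.
genes_observed <- colSums(toy_counts > 0)
## Proportion of all genes observed in each sample.
prop_observed <- genes_observed / nrow(toy_counts)
## Assumed meaning of the cutoff: flag samples below this proportion.
cutoff <- 0.7
flagged <- names(prop_observed)[prop_observed < cutoff]
print(prop_observed)
print(flagged)
```

If no sample falls below the cutoff, `flagged` is empty, which matches the expectation that nothing may be printed once the worst samples are already gone.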
## I think samples 7,10 should be removed at minimum, probably also 9,11
nonzero <- plot_nonzero(lp_se, cutoff = 0.7, y_intercept = 0.99)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
## i The deprecated feature was likely used in the hpgltools package.
## Please report the issue to the authors.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## A non-zero genes plot of 92 samples.
## These samples have an average 28.78 CPM coverage and 8694 genes observed, ranging from 8554 to
## 8749.
svg_nz <- pp(file = "images/lp_nonzero.svg", image = nonzero[["plot"]], width = 9, height = 9)
pdf_nz <- pp(file = "images/lp_nonzero.pdf", image = nonzero[["plot"]], width = 9, height = 9)
Differences in relative gene content with respect to sequencing coverage. The per-sample number of observed genes was plotted with respect to the relative CPM coverage in order to check that the samples are sufficiently and similarly diverse. Many samples were observed near or at the putative asymptote of likely gene content; no samples were observed with fewer than 65% of the Leishmania panamensis genes included. Note that the range of genes observed is quite small (8500 <= x < 8700 genes); however, this was plotted after already excluding samples with fewer than 8500 genes observed (of which there were 2) and any samples with fewer than 5 million protein coding mapped reads (there were 2 samples that had more than 8500 genes observed in less than 5 million reads).
## 7722 entries are 0. We are on a log scale, adding 1 to the data.
## TMRC20001 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 8.1254 9.5527 4.6439 9.1137 9.0526 9.6221 9.4310
## median 8.9366 10.5196 5.4919 9.9263 9.9571 10.4974 10.2621
## mean 8.6262 10.2353 5.2930 9.6450 9.6716 10.2874 10.0346
## q3 9.6439 11.3971 6.2095 10.6463 10.7624 11.3536 11.0708
## max 18.2820 19.4189 13.8030 17.8164 18.0472 17.9602 18.3070
## iqr 1.5184 1.8445 1.5656 1.5326 1.7098 1.7315 1.6398
## iqr_high 11.9215 14.1638 8.5578 12.9452 13.3271 13.9508 13.5305
## iqr_low -2.2777 -2.7667 -2.3484 -2.2989 -2.5647 -2.5973 -2.4597
## sd 1.8938 2.0949 1.5637 1.9216 1.9473 1.9117 1.9486
## var 3.5865 4.3885 2.4450 3.6926 3.7918 3.6544 3.7969
## stdvar 0.4158 0.4288 0.4619 0.3828 0.3921 0.3552 0.3784
## TMRC20038 TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.5469 9.9622 9.7125 9.9073 11.8918 11.2680 10.2378
## median 10.3993 10.7969 10.6366 10.8576 12.7526 12.1232 11.1370
## mean 10.1845 10.4659 10.3475 10.6827 12.4476 11.8678 10.8502
## q3 11.2324 11.4903 11.4355 11.7842 13.5341 12.9119 11.9354
## max 19.1131 19.4121 18.9980 18.0636 21.9990 20.9345 21.4671
## iqr 1.6855 1.5282 1.7230 1.8770 1.6423 1.6439 1.6976
## iqr_high 13.7607 13.7826 14.0201 14.5997 15.9976 15.3778 14.4819
## iqr_low -2.5283 -2.2923 -2.5845 -2.8154 -2.4635 -2.4659 -2.5465
## sd 1.9580 1.9433 1.9994 1.9266 2.0988 1.9742 2.0191
## var 3.8336 3.7762 3.9978 3.7116 4.4050 3.8974 4.0767
## stdvar 0.3764 0.3608 0.3864 0.3474 0.3539 0.3284 0.3757
## TMRC20016 TMRC20011 TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## q1 10.6475 10.0701 10.1774 10.6268 11.3333 10.905 10.9507
## median 11.5379 10.9270 11.0317 11.5226 12.1957 11.758 11.8498
## mean 11.2610 10.6674 10.7377 11.2396 11.9434 11.504 11.5878
## q3 12.3424 11.7066 11.8222 12.3238 12.9921 12.542 12.6889
## max 20.5137 19.4904 19.7846 20.4443 20.9297 20.307 20.7357
## iqr 1.6950 1.6365 1.6448 1.6970 1.6588 1.636 1.7381
## iqr_high 14.8849 14.1613 14.2893 14.8693 15.4802 14.996 15.2960
## iqr_low -2.5424 -2.4547 -2.4671 -2.5455 -2.4881 -2.455 -2.6072
## sd 2.0175 1.9424 2.0390 2.0383 1.9937 1.969 2.0402
## var 4.0702 3.7728 4.1574 4.1547 3.9750 3.876 4.1625
## stdvar 0.3614 0.3537 0.3872 0.3697 0.3328 0.337 0.3592
## TMRC20019 TMRC20070 TMRC20020 TMRC20021 TMRC20022 TMRC20024 TMRC20036
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 12.1618 10.1589 12.4500 12.5341 10.9341 11.6718 9.0954
## median 13.0218 11.0678 13.3120 13.4197 11.8080 12.5395 9.9233
## mean 12.7333 10.7881 13.0315 13.1096 11.5668 12.3085 9.7180
## q3 13.7830 11.8624 14.0826 14.1917 12.5891 13.3716 10.7089
## max 21.3726 19.6638 21.4005 21.3018 20.4087 21.5897 18.5279
## iqr 1.6211 1.7035 1.6327 1.6576 1.6551 1.6998 1.6135
## iqr_high 16.2147 14.4177 16.5316 16.6781 15.0717 15.9213 13.1291
## iqr_low -2.4317 -2.5553 -2.4490 -2.4864 -2.4826 -2.5497 -2.4202
## sd 1.9919 1.9723 1.9774 2.0605 1.8626 1.9635 1.8184
## var 3.9676 3.8900 3.9102 4.2456 3.4693 3.8555 3.3066
## stdvar 0.3116 0.3606 0.3001 0.3238 0.2999 0.3132 0.3403
## TMRC20069 TMRC20033 TMRC20026 TMRC20031 TMRC20076 TMRC20073 TMRC20055
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 10.2204 11.8863 11.0191 10.8619 9.8906 9.8688 9.4579
## median 11.0765 12.7419 11.9187 11.7349 10.7570 10.8090 10.3038
## mean 10.8261 12.4892 11.6662 11.4706 10.5245 10.5587 10.0871
## q3 11.8619 13.5257 12.7317 12.5416 11.5961 11.6860 11.1253
## max 19.4667 21.7782 20.6152 20.5766 20.0927 19.9017 19.3504
## iqr 1.6415 1.6395 1.7127 1.6798 1.7054 1.8171 1.6674
## iqr_high 14.3241 15.9849 15.3008 15.0613 14.1542 14.4116 13.6263
## iqr_low -2.4622 -2.4592 -2.5690 -2.5197 -2.5581 -2.7257 -2.5010
## sd 1.9064 1.9242 1.9417 1.9792 1.9567 2.0178 1.9361
## var 3.6345 3.7026 3.7703 3.9173 3.8285 4.0716 3.7483
## stdvar 0.3357 0.2965 0.3232 0.3415 0.3638 0.3856 0.3716
## TMRC20079 TMRC20071 TMRC20078 TMRC20094 TMRC20042 TMRC20058 TMRC20072
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.7814 9.6041 10.1306 8.8009 9.7499 10.0324 9.8704
## median 10.6887 10.4625 10.9922 9.6671 10.6073 10.9054 10.7151
## mean 10.4475 10.1702 10.7160 9.4374 10.3742 10.6369 10.4508
## q3 11.5661 11.2490 11.7645 10.5196 11.4506 11.7201 11.4974
## max 20.0670 19.4627 19.7821 18.5989 19.3668 19.9893 19.4477
## iqr 1.7847 1.6449 1.6339 1.7187 1.7007 1.6877 1.6270
## iqr_high 14.2431 13.7163 14.2153 13.0977 14.0016 14.2518 13.9378
## iqr_low -2.6770 -2.4673 -2.4508 -2.5781 -2.5510 -2.5316 -2.4405
## sd 2.0145 1.9832 1.9813 1.9366 1.9737 2.0418 1.9952
## var 4.0582 3.9330 3.9256 3.7503 3.8955 4.1688 3.9807
## stdvar 0.3884 0.3867 0.3663 0.3974 0.3755 0.3919 0.3809
## TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056 TMRC20060 TMRC20077
## min 0.0000 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.8704 9.881 9.9218 8.9425 8.9366 9.9009 9.7781
## median 10.7789 10.749 10.7985 9.8090 9.8041 10.7649 10.6773
## mean 10.5042 10.499 10.5432 9.5452 9.5564 10.5338 10.4624
## q3 11.5929 11.554 11.5928 10.5668 10.6101 11.5688 11.5379
## max 20.1892 18.608 19.4695 18.3197 18.7969 19.0986 19.0149
## iqr 1.7226 1.672 1.6710 1.6243 1.6735 1.6679 1.7599
## iqr_high 14.1768 14.062 14.0993 13.0031 13.1203 14.0707 14.1778
## iqr_low -2.5838 -2.509 -2.5064 -2.4364 -2.5102 -2.5019 -2.6398
## sd 2.0074 1.965 1.9102 1.8646 1.9198 1.9247 1.9483
## var 4.0295 3.863 3.6490 3.4768 3.6856 3.7044 3.7957
## stdvar 0.3836 0.368 0.3461 0.3642 0.3857 0.3517 0.3628
## TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075 TMRC20051
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.4635 9.2644 9.3597 9.9890 9.4553 8.6582 9.7385
## median 10.3236 10.1749 10.2070 10.9425 10.3685 9.5868 10.5803
## mean 10.0886 9.9258 9.9956 10.6635 10.1149 9.3414 10.3294
## q3 11.1424 11.0005 11.0556 11.8037 11.2574 10.4336 11.3918
## max 17.8595 17.3060 19.3199 19.2045 18.5136 17.5756 19.3308
## iqr 1.6789 1.7361 1.6959 1.8147 1.8021 1.7754 1.6533
## iqr_high 13.6608 13.6047 13.5994 14.5258 13.9605 13.0966 13.8717
## iqr_low -2.5184 -2.6041 -2.5438 -2.7220 -2.7031 -2.6631 -2.4799
## sd 1.9334 1.9257 1.9315 2.0696 2.0351 1.9886 2.0110
## var 3.7382 3.7083 3.7309 4.2833 4.1418 3.9545 4.0440
## stdvar 0.3705 0.3736 0.3733 0.4017 0.4095 0.4233 0.3915
## TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.6165 10.3298 9.6817 10.6202 9.5038 9.9189 9.1599
## median 10.4367 11.1618 10.5236 11.4681 10.3750 10.7624 10.0028
## mean 10.1989 10.9229 10.2756 11.2152 10.1137 10.4876 9.7508
## q3 11.1699 11.9541 11.3368 12.2868 11.1971 11.5416 10.7682
## max 18.6003 19.8912 18.6611 21.2189 18.6962 19.8699 18.4172
## iqr 1.5534 1.6243 1.6551 1.6666 1.6932 1.6227 1.6083
## iqr_high 13.5000 14.3906 13.8194 14.7866 13.7369 13.9757 13.1807
## iqr_low -2.3301 -2.4365 -2.4827 -2.4998 -2.5399 -2.4341 -2.4125
## sd 1.7854 1.9919 2.0013 2.0318 2.0201 1.9853 1.8741
## var 3.1877 3.9677 4.0054 4.1282 4.0807 3.9414 3.5123
## stdvar 0.3126 0.3632 0.3898 0.3681 0.4035 0.3758 0.3602
## TMRC20054 TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090
## min 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.923 9.2408 10.0417 9.6600 9.3332 9.6546 9.3382
## median 10.796 10.1357 10.8463 10.5770 10.2574 10.4993 10.2252
## mean 10.523 9.8587 10.5675 10.3295 10.0049 10.2398 9.9736
## q3 11.614 10.9593 11.5558 11.4346 11.1296 11.2946 11.0632
## max 19.643 18.5656 18.9656 19.1285 19.0696 18.5807 18.7419
## iqr 1.691 1.7185 1.5141 1.7746 1.7964 1.6400 1.7250
## iqr_high 14.151 13.5370 13.8270 14.0966 13.8243 13.7546 13.6508
## iqr_low -2.536 -2.5777 -2.2712 -2.6619 -2.6947 -2.4600 -2.5876
## sd 2.028 1.9953 1.9014 1.9962 2.0322 1.9705 2.0165
## var 4.114 3.9814 3.6154 3.9849 4.1300 3.8830 4.0661
## stdvar 0.391 0.4038 0.3421 0.3858 0.4128 0.3792 0.4077
## TMRC20044 TMRC20045 TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096
## min 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.984 9.7381 8.8887 10.3106 10.1602 9.4778 9.5680
## median 10.827 10.5765 9.7398 11.1630 10.9940 10.4019 10.4242
## mean 10.580 10.3210 9.4827 10.8975 10.6923 10.1547 10.1652
## q3 11.610 11.3498 10.5582 11.9761 11.7167 11.2779 11.1874
## max 18.985 18.4698 18.5681 20.0956 20.4154 19.3664 19.5585
## iqr 1.626 1.6117 1.6694 1.6655 1.5565 1.8001 1.6194
## iqr_high 14.048 13.7674 13.0623 14.4744 14.0515 13.9780 13.6164
## iqr_low -2.438 -2.4176 -2.5042 -2.4983 -2.3348 -2.7002 -2.4291
## sd 1.919 1.9263 1.9617 2.0548 1.9357 2.0123 1.9054
## var 3.682 3.7105 3.8481 4.2222 3.7468 4.0492 3.6305
## stdvar 0.348 0.3595 0.4058 0.3874 0.3504 0.3987 0.3572
## TMRC20101 TMRC20092 TMRC20082 TMRC20102 TMRC20099 TMRC20100 TMRC20091
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.5906 9.5162 9.5118 9.3641 9.1497 9.9024 8.8138
## median 10.4954 10.3961 10.4564 10.3106 9.9915 10.7423 9.6165
## mean 10.2282 10.1162 10.1862 10.0469 9.7298 10.4612 9.3844
## q3 11.2829 11.1561 11.2822 11.1693 10.7721 11.5236 10.3869
## max 19.1175 18.9750 20.8048 19.0707 18.9022 19.6087 18.3910
## iqr 1.6924 1.6399 1.7705 1.8052 1.6224 1.6212 1.5732
## iqr_high 13.8215 13.6159 13.9379 13.8770 13.2057 13.9553 12.7467
## iqr_low -2.5385 -2.4598 -2.6557 -2.7077 -2.4335 -2.4318 -2.3597
## sd 1.9196 1.9276 2.0100 2.0504 1.9574 1.9952 1.8832
## var 3.6848 3.7157 4.0400 4.2042 3.8313 3.9807 3.5466
## stdvar 0.3603 0.3673 0.3966 0.4185 0.3938 0.3805 0.3779
## TMRC20084 TMRC20087 TMRC20103 TMRC20104 TMRC20086 TMRC20107 TMRC20081
## min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## q1 9.3242 9.2479 9.7846 9.8392 9.5608 9.7879 9.9381
## median 10.2252 10.1183 10.6799 10.7271 10.4853 10.7474 10.8313
## mean 9.9834 9.8869 10.4520 10.4655 10.2764 10.4811 10.6037
## q3 11.0607 10.9436 11.5578 11.5908 11.3872 11.6208 11.6733
## max 18.6469 18.6005 18.4626 19.7815 20.3414 20.8640 20.1558
## iqr 1.7365 1.6957 1.7732 1.7516 1.8264 1.8329 1.7352
## iqr_high 13.6655 13.4871 14.2176 14.2182 14.1268 14.3701 14.2761
## iqr_low -2.6048 -2.5435 -2.6598 -2.6274 -2.7396 -2.7493 -2.6028
## sd 1.9256 1.9267 1.9969 2.0768 1.9951 2.0700 1.9501
## var 3.7078 3.7122 3.9875 4.3133 3.9803 4.2850 3.8030
## stdvar 0.3714 0.3755 0.3815 0.4121 0.3873 0.4088 0.3586
## TMRC20095
## min 0.0000
## q1 7.8704
## median 8.7912
## mean 8.4484
## q3 9.4979
## max 18.9381
## iqr 1.6275
## iqr_high 11.9391
## iqr_low -2.4412
## sd 1.9317
## var 3.7315
## stdvar 0.4417
## Plot describing the gene distribution from a dataset.
## Error in `xy.coords()`:
## ! 'x' is a list, but does not have components 'x' and 'y'
## Error in `xy.coords()`:
## ! 'x' is a list, but does not have components 'x' and 'y'
The distribution of observed counts / gene for all samples was plotted as a boxplot on the log2 (it looks like it is log10, but I checked) scale. In contrast to host transcriptome distribution, the parasite distribution of reads/gene is log-normal.
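The summary table above appears to follow the usual boxplot conventions; for the samples I checked, the printed iqr_high values are consistent with q3 + 1.5 * iqr and the stdvar values with var / mean. A minimal sketch of those quantities on the log2(x + 1) scale, using invented counts and hypothetical object names, might look like this:

```r
## Invented counts for one sample; the 0 motivates adding 1 before the log.
toy_counts <- c(0, 1, 3, 7, 15, 31, 63, 127)
toy_log2 <- log2(toy_counts + 1)
## Quartiles using R's default quantile definition (type 7).
q <- quantile(toy_log2, probs = c(0.25, 0.5, 0.75), names = FALSE)
toy_iqr <- q[3] - q[1]
toy_summary <- c(
  min = min(toy_log2), q1 = q[1], median = q[2], mean = mean(toy_log2),
  q3 = q[3], max = max(toy_log2), iqr = toy_iqr,
  iqr_high = q[3] + 1.5 * toy_iqr,   # upper outlier fence
  sd = sd(toy_log2), var = var(toy_log2),
  stdvar = var(toy_log2) / mean(toy_log2))
print(round(toy_summary, 4))
```

Values above iqr_high are the ones a boxplot would draw as individual outlier points.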
## Warning: Using alpha for a discrete variable is not advised.
The number of genes removed by low-count filtering is drastically lower in the parasite samples than in the human samples. Thus, even though the coverage of the parasite samples ranges from near 0 to ~ 150 CPM, the number of genes removed by the default low-count filter ranges only from 40 to 129, and the number of reads associated with them ranges only from 100 to 3168.
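To make the two quantities above concrete (genes removed, and reads associated with them), here is a sketch of a simple low-count filter in base R. The rule used (keep genes with at least `threshold` counts in at least `min_samples` samples) and both parameter values are assumptions for illustration; they are not necessarily the default filter used in this document.

```r
## Invented counts: 4 genes by 4 samples.
toy_counts <- rbind(
  geneA = c(120, 98, 143, 110),
  geneB = c(0, 1, 0, 2),
  geneC = c(15, 0, 22, 9),
  geneD = c(1, 0, 0, 0))
colnames(toy_counts) <- paste0("s", 1:4)
threshold <- 2    # assumed minimum count for a gene to be "observed"
min_samples <- 2  # assumed minimum number of samples meeting the threshold
keep <- rowSums(toy_counts >= threshold) >= min_samples
filtered <- toy_counts[keep, , drop = FALSE]
## How many genes were dropped, and how many reads they carried per sample.
genes_removed <- sum(!keep)
reads_removed <- colSums(toy_counts[!keep, , drop = FALSE])
print(genes_removed)
print(reads_removed)
```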
##
## z21 z22 z23 z24
## 7 42 41 2
##
## cure failure nd
## 40 34 18
Najib’s favorite plots are of course the PCA/TSNE. These are nice to look at in order to get a sense of the relationships between samples. They also provide a good opportunity to see what happens when one applies different normalizations, surrogate analyses, filters, etc. In addition, one may set different experimental factors as the primary ‘condition’ (usually the color of plots) and surrogate ‘batches’.
Make a categorical version of column ‘Q’ in the sample sheet with these parameters:
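For orientation, the core of such a PCA can be sketched in base R: convert to CPM, log2-transform, center each gene, and decompose. This is only an illustration on simulated counts; it omits the quantile normalization and low-count filter, uses base `svd()` rather than the fast_svd reported in the output below, and all object names are hypothetical.

```r
set.seed(42)
n_genes <- 200
toy_counts <- matrix(rpois(n_genes * 6, lambda = 100), nrow = n_genes,
                     dimnames = list(paste0("g", seq_len(n_genes)),
                                     paste0("s", 1:6)))
## Mimic a strain effect: samples 4-6 have 4x the counts in the first 50 genes.
toy_counts[1:50, 4:6] <- toy_counts[1:50, 4:6] * 4
## Counts per million, then log2 with a pseudocount of 1.
toy_cpm <- t(t(toy_counts) / colSums(toy_counts)) * 1e6
toy_log <- log2(toy_cpm + 1)
## Center each gene (row) and decompose; samples are the columns.
centered <- toy_log - rowMeans(toy_log)
decomp <- svd(centered)
## Percent of variance captured by each component.
pct_var <- round(100 * decomp$d^2 / sum(decomp$d^2), 2)
pc_table <- data.frame(sampleid = colnames(toy_counts),
                       PC1 = decomp$v[, 1], PC2 = decomp$v[, 2])
print(pct_var[1:2])
print(pc_table)
```

In this toy case the simulated strain effect dominates PC1, so the two groups of samples land on opposite sides of the x-axis.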
strain_norm <- normalize(lp_strain, norm = "quant", transform = "log2",
                         convert = "cpm", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
zymo_pca <- plot_pca(strain_norm, plot_title = "PCA of parasite expression values",
                     plot_labels = FALSE)
zymo_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.
pdf_pca01 <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24.pdf",
                image = zymo_pca[["plot"]])
## Warning in pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24.pdf", :
## There is no device to shut down.
svg_pca01 <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24.svg",
                image = zymo_pca[["plot"]])
## Warning in pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24.svg", :
## There is no device to shut down.
Exclude the unknown samples
lp_strain_known <- subset_se(lp_strain, subset = "clinicalcategorical!='unknown'")
strain_known_norm <- normalize(lp_strain_known, norm = "quant", transform = "log2",
                               convert = "cpm", filter = TRUE)
## Removing 154 low-count genes (8624 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
zymo_known_pca <- plot_pca(strain_known_norm, plot_title = "PCA of parasite expression values",
                           plot_labels = FALSE)
zymo_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.
pdf_pca02 <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24_only_known_clinical.pdf",
                image = zymo_known_pca[["plot"]])
## Warning in pp(file =
## "figures/promastigote_zymocol_sensshape_z21_to_z24_only_known_clinical.pdf", :
## There is no device to shut down.
svg_pca02 <- pp(file = "figures/promastigote_zymocol_sensshape_z21_to_z24_only_known_clinical.svg",
                image = zymo_known_pca[["plot"]])
## Warning in pp(file =
## "figures/promastigote_zymocol_sensshape_z21_to_z24_only_known_clinical.svg", :
## There is no device to shut down.
Now drop down to only the three most represented strains in the dataset.
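Picking the most represented strains can also be done programmatically rather than by naming them. The following base R sketch, on an invented annotation vector with hypothetical names, tabulates the strains and keeps the top three:

```r
## Invented per-sample strain annotation.
toy_strain <- c("z2.2", "z2.3", "z2.2", "z2.1", "z2.4", "z2.3",
                "z2.2", "z2.1", "z2.3", "z2.2")
## Count samples per strain, most frequent first.
strain_counts <- sort(table(toy_strain), decreasing = TRUE)
top_three <- names(strain_counts)[1:3]
## Logical index of the samples belonging to those strains.
keep_idx <- toy_strain %in% top_three
print(strain_counts)
print(top_three)
```

The explicit condition string used in this document has the advantage of being stable if the sample sheet changes; the tabulated version adapts automatically, which may or may not be desired.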
only_three_types <- subset_se(lp_strain,
subset = "condition=='z2.1'|condition=='z2.3'|condition=='z2.2'")
only_three_norm <- normalize(only_three_types, norm = "quant", transform = "log2",
                             convert = "cpm", batch = FALSE, filter = TRUE) %>%
  set_batches(fact = "phase")
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
## The number of samples by batch are:
##
## Stationary
## 90
onlythree_pca <- plot_pca(only_three_norm, plot_labels = FALSE,
                          plot_title = "PCA of z2.1, z2.2 and z2.3 parasite expression values")
onlythree_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3
## Shapes are defined by Stationary.
svg_pca03 <- pp(file = "images/promastigote_threetypes_zymocol_noshape.svg",
                image = onlythree_pca[["plot"]])
## Warning in pp(file = "images/promastigote_threetypes_zymocol_noshape.svg", :
## There is no device to shut down.
pdf_pca03 <- pp(file = "images/promastigote_threetypes_zymocol_noshape.pdf",
                image = onlythree_pca[["plot"]])
## Warning in pp(file = "images/promastigote_threetypes_zymocol_noshape.pdf", :
## There is no device to shut down.
I added the result from my kmer classifier to the sample sheet, let us see how that looks.
## The numbers of samples by condition are:
##
## unknown z21 z22 z23 z24
## 1 5 43 41 2
strain_norm_knn <- normalize(lp_strain_knn, norm = "quant", transform = "log2",
                             convert = "cpm", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
zymo_pca_knn <- plot_pca(strain_norm_knn, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
zymo_pca_knn
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by unknown, z21, z22, z23, z24
## Shapes are defined by resistant, sensitive.
svg_pca04 <- pp(file = "images/promastigote_zymocol_sensshape_knnv2.svg",
                image = zymo_pca_knn[["plot"]])
## Warning in pp(file = "images/promastigote_zymocol_sensshape_knnv2.svg", : There
## is no device to shut down.
pdf_pca04 <- pp(file = "images/promastigote_zymocol_sensshape_knnv2.pdf",
                image = zymo_pca_knn[["plot"]])
## Warning in pp(file = "images/promastigote_zymocol_sensshape_knnv2.pdf", : There
## is no device to shut down.
Repeat with sva modified expression values.
strain_nb <- normalize(lp_strain, convert = "cpm", transform = "log2",
                       filter = TRUE, batch = "svaseq")
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 541 values less than 0.
## transform_counts: Found 541 values equal to 0, adding 1 to the matrix.
strain_nb_pca <- plot_pca(strain_nb, plot_title = "PCA of parasite expression values",
                          plot_labels = FALSE)
strain_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.
## Warning in pp(file = "images/clinical_nb_pca_sus_shape.svg", image =
## strain_nb_pca[["plot"]]): There is no device to shut down.
## Warning in pp(file = "images/clinical_nb_pca_sus_shape.pdf", image =
## strain_nb_pca[["plot"]]): There is no device to shut down.
Add explicit labels for a few reference strains:
** NOTE ** These samples were all removed from examination in the sample_sheet in 202404 and so will not appear in this plot. Thus I am turning off the following block.
samples_to_label <- c("TMRC20023", "TMRC20006", "TMRC20029", "TMRC20007", "TMRC20034",
                      "TMRC20008", "TMRC20027", "TMRC20028", "TMRC20032", "TMRC20040")
label_entries <- zymo_pca$table[samples_to_label, ]
## Map the PC1, PC2, and sampleid columns by name; quoted strings inside
## aes() would be treated as constants rather than as columns.
zymo_pca$plot +
  geom_text(mapping = aes(x = .data[["PC1"]], y = .data[["PC2"]],
                          label = .data[["sampleid"]]),
            data = label_entries)
Some likely text for a figure legend might include something like the following (paraphrased from Najib’s 2016 dual transcriptome profiling paper (10.1128/mBio.00027-16)):
Expression profiles of the promastigote samples across multiple strains. Each glyph represents one sample; colors delineate the various strains, which fall into two primary clades. Red samples are zymodeme 2.3, blue samples are zymodeme 2.2. The difference between these two primary groups makes up approximately 17% of the variance in the PCA. Purple samples are Leishmania braziliensis or zymodeme 1.0/1.5 samples, orange are z2.4, browns and greys are z2.1, z2.0, z3.0, and z3.2 respectively. This analysis was performed following a low-count filter, cpm conversion, quantile normalization, and a log2 transformation. No batch factor was used, nor was a surrogate variable estimation performed.
Some interpretation for this figure might include:
When PCA was performed on the promastigote samples, the dominant component (which still accounts for a relatively small amount of variance) coincided with the two primary strain groups, zymodeme 2.2 and 2.3. With the exception of some Leishmania braziliensis samples, all promastigote samples assayed fell into one of these two categories.
When surrogate variable estimation was performed on the entire set of samples, it increased the apparent strain-dependent variance, but had some potentially problematic effects for a couple of samples (one z2.3 sample now lies with the z2.2 samples). It is assumed that this is because sva attempted to estimate surrogate values for the less-represented strains, with some unintended consequences for sample TMRC20095 (which, along with TMRC20008, is one of the two least covered samples by a significant margin). This hypothesis may be tested by excluding the braziliensis and non-z2.2/2.3 samples and repeating; when this is performed later in the document, the difference between the two primary clades increases to 49.33% of the variance and there are no odd samples.
## The result of performing a tsne dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.
strain_nb_tsne <- plot_tsne(strain_nb, plot_title = "TSNE of parasite expression values")
strain_nb_tsne
## The result of performing a tsne dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.1, z2.2, z2.3, z2.4
## Shapes are defined by resistant, sensitive.
corheat <- plot_corheat(strain_norm, plot_title = "Correlation heatmap of parasite\nexpression values\n")
corheat
## A heatmap of pairwise sample correlations ranging from:
## 0.642203267746696 to 0.992867624966404.
disheat <- plot_disheat(strain_norm, plot_title = "Distance heatmap of parasite\nexpression values\n")
disheat$plot
## When the standard median metric was plotted, the values observed range
## from 0.642203267746696 to 1 with quartiles at 0.932012660770469 and 0.944521289249787.
Potential start for a figure legend:
Global relationships among the promastigote transcriptional profiles. Pairwise Pearson correlations and Euclidean distances were calculated using the normalized expression matrices. Colors along the top row delineate the experimental conditions (same colors as the PCA). Samples were clustered by nearest neighbor clustering and each colored tile describes either one correlation value between two samples (red to white delineates Pearson correlation values of the 8,710 normalized gene values between two samples ranging from <= 0.7 to >= 1.0) or the Euclidean distance between two samples (dark blue to white delineates identical to a normalized Euclidean distance of >= 110).
Some interpretation for this figure might include:
When the global relationships among the samples were distilled down to individual Euclidean distances or Pearson correlation coefficients between pairs of samples, the primary clustering among samples observed was according to strain. The primary significant outlier sample (TMRC20095) is explicitly due to low coverage. The other outlier strains are either braziliensis (purple) or a series of strains which, when viewed in IGV, appear to have genetic variants which bridge the differences between the two primary zymodemes, particularly on the known aneuploid chromosomes.
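The two pairwise metrics behind these heatmaps can be sketched with base R's `cor()` and `dist()`. The matrix below is simulated (samples s1 and s2 are made nearly identical on purpose) and the object names are hypothetical; the clustering and coloring done by the heatmap functions are not reproduced here.

```r
set.seed(7)
## Simulated normalized expression: 100 genes by 5 samples.
toy_expr <- matrix(rnorm(100 * 5, mean = 10), nrow = 100,
                   dimnames = list(NULL, paste0("s", 1:5)))
## Make s2 a slightly noisy copy of s1, as two replicates of one strain might be.
toy_expr[, 2] <- toy_expr[, 1] + rnorm(100, sd = 0.1)
## Pairwise Pearson correlation between samples (columns).
toy_cor <- cor(toy_expr, method = "pearson")
## Pairwise Euclidean distance; dist() works on rows, hence the transpose.
toy_dist <- as.matrix(dist(t(toy_expr), method = "euclidean"))
print(round(toy_cor, 3))
print(round(toy_dist, 2))
```

Similar samples show up as a high correlation and a short distance at the same time, which is why the two heatmaps tend to tell the same story.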
lp_two_strains_norm <- normalize(lp_zymo, norm = "quant", transform = "log2",
                                 convert = "cpm", batch = FALSE, filter = TRUE)
onlytwo_pca <- plot_pca(lp_two_strains_norm, plot_title = "PCA of z2.2 and z2.3 parasite expression values",
                        plot_labels = FALSE)
onlytwo_pca
## Save the plot with pp(), as for the other figures.
svg_pca06 <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape.svg", image = onlytwo_pca[["plot"]])
pdf_pca06 <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape.pdf", image = onlytwo_pca[["plot"]])
Remove the unknown samples.
lp_two_strains_known <- subset_se(lp_zymo, subset = "clinicalcategorical!='unknown'")
lp_two_strains_known_norm <- normalize(lp_two_strains_known, norm = "quant", transform = "log2",
                                       convert = "cpm", batch = FALSE, filter = TRUE)
## Removing 155 low-count genes (8623 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
onlytwo_known_pca <- plot_pca(lp_two_strains_known_norm, plot_labels = FALSE,
                              plot_title = "PCA of z2.2 and z2.3 parasite expression values")
onlytwo_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.2, z2.3
## Shapes are defined by undefined.
pdf_pca07 <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape_only_known.pdf",
                image = onlytwo_known_pca[["plot"]])
## Warning in pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape_only_known.pdf", :
## There is no device to shut down.
svg_pca07 <- pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape_only_known.svg",
                image = onlytwo_known_pca[["plot"]])
## Warning in pp(file = "figures/zymo_z2.2_z2.3_pca_sus_shape_only_known.svg", :
## There is no device to shut down.
And repeat following the application of sva.
lp_two_strains_nb <- normalize(lp_zymo, transform = "log2", convert = "cpm",
                               batch = "svaseq", filter = TRUE)
## Removing 150 low-count genes (8628 remaining).
## transform_counts: Found 512 values less than 0.
## transform_counts: Found 512 values equal to 0, adding 1 to the matrix.
onlytwo_pca_nb <- plot_pca(lp_two_strains_nb, plot_labels = FALSE,
                           plot_title = "PCA of z2.2 and z2.3 parasite expression values")
onlytwo_pca_nb
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by z2.2, z2.3
## Shapes are defined by undefined.
pdf_pca08 <- pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape_nb.pdf",
                image = onlytwo_pca_nb[["plot"]])
## Warning in pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape_nb.pdf", image =
## onlytwo_pca_nb[["plot"]]): There is no device to shut down.
svg_pca08 <- pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape_nb.svg",
                image = onlytwo_pca_nb[["plot"]])
## Warning in pp(file = "images/zymo_z2.2_z2.3_pca_sus_shape_nb.svg", image =
## onlytwo_pca_nb[["plot"]]): There is no device to shut down.
This is by far the most problematic comparison. I think the only interpretation of the following images is that the parasite has little effect on the likelihood that a person will successfully end treatment. There does appear to be some variance associated with cure/fail, but only in a few samples (visible in ~10 fail samples and perhaps ~8 cure samples when sva is applied to the data).
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
start_cf <- plot_pca(cf_norm, plot_title = "PCA of parasite expression values",
plot_labels = FALSE)
start_cf
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail, unknown
## Shapes are defined by resistant, sensitive.
## Warning in pp(file = "figures/cure_fail_sus_shape_all.pdf", image =
## start_cf[["plot"]]): There is no device to shut down.
## Warning in pp(file = "figures/cure_fail_sus_shape_all.svg", image =
## start_cf[["plot"]]): There is no device to shut down.
Once again, remove the unknown samples.
lp_cf_known <- subset_se(lp_cf, subset = "clinicalcategorical!='unknown'")
cf_known_norm <- normalize(lp_cf_known, convert = "cpm", transform = "log2",
                           norm = "quant", filter = TRUE)
## Removing 154 low-count genes (8624 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
start_cf_known <- plot_pca(cf_known_norm, plot_title = "PCA of parasite expression values",
                           plot_labels = FALSE)
start_cf_known
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail
## Shapes are defined by resistant, sensitive.
## Warning in pp(file = "figures/cure_fail_sus_shape_known.pdf", image =
## start_cf_known[["plot"]]): There is no device to shut down.
## Warning in pp(file = "figures/cure_fail_sus_shape_known.svg", image =
## start_cf_known[["plot"]]): There is no device to shut down.
only_two_cf <- set_conditions(lp_zymo, fact = "clinicalcategorical",
                              colors = color_choices[["cf"]]) %>%
  set_batches(fact = "sus_category_current")
## The numbers of samples by condition are:
##
##    cure    fail unknown
##      33      32      18
## Warning in set_se_colors(new_se, colors = colors): Colors for the following
## categories are not being used: notapplicable.
## The number of samples by batch are:
##
## resistant sensitive
##        44        39
only_two_cf_norm <- normalize(only_two_cf, norm = "quant", transform = "log2",
                              convert = "cpm", batch = FALSE, filter = TRUE)
## Removing 150 low-count genes (8628 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
only_two_cf_pca <- plot_pca(only_two_cf_norm, plot_labels = FALSE,
                            plot_title = "PCA of z2.2 and z2.3 parasite expression values")
only_two_cf_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail, unknown
## Shapes are defined by resistant, sensitive.
pdf_pca11 <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23.pdf",
                image = only_two_cf_pca[["plot"]])
## Warning in pp(file = "figures/cure_fail_sus_shape_onlyz22_z23.pdf", image =
## only_two_cf_pca[["plot"]]): There is no device to shut down.
svg_pca11 <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23.svg",
                image = only_two_cf_pca[["plot"]])
## Warning in pp(file = "figures/cure_fail_sus_shape_onlyz22_z23.svg", image =
## only_two_cf_pca[["plot"]]): There is no device to shut down.
Drop down to just the two primary strains.
only_two_cf_known <- subset_se(only_two_cf, subset = "condition!='unknown'")
only_two_cf_known_norm <- normalize(only_two_cf_known, norm = "quant", transform = "log2",
                                    convert = "cpm", batch = FALSE, filter = TRUE)
## Removing 155 low-count genes (8623 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
only_two_cf_known_pca <- plot_pca(only_two_cf_known_norm, plot_labels = FALSE,
                                  plot_title = "PCA of z2.2 and z2.3 parasite expression values")
only_two_cf_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail
## Shapes are defined by resistant, sensitive.
pdf_pca12 <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23_known.pdf",
                image = only_two_cf_known_pca[["plot"]])
## Warning in pp(file = "figures/cure_fail_sus_shape_onlyz22_z23_known.pdf", :
## There is no device to shut down.
svg_pca12 <- pp(file = "figures/cure_fail_sus_shape_onlyz22_z23_known.svg",
                image = only_two_cf_known_pca[["plot"]])
## Warning in pp(file = "figures/cure_fail_sus_shape_onlyz22_z23_known.svg", :
## There is no device to shut down.
Once again, apply sva.
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 292 values less than 0.
## transform_counts: Found 292 values equal to 0, adding 1 to the matrix.
cf_nb_pca <- plot_pca(cf_nb, plot_title = "PCA of parasite expression values",
                      plot_labels = FALSE)
cf_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by cure, fail, unknown
## Shapes are defined by resistant, sensitive.
## Warning in pp(file = "images/cf_sus_share_nb.pdf", image =
## cf_nb_pca[["plot"]]): There is no device to shut down.
## Warning in pp(file = "images/cf_sus_share_nb.svg", image =
## cf_nb_pca[["plot"]]): There is no device to shut down.
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
## Getting an error which really does not make sense, I ran it manually and it worked fine.
test <- pca_information(cf_norm, num_components = 6, plot_pcas = TRUE,
                        factors = c("clinicalcategorical", "zymodemecategorical",
                                    "pathogenstrain", "passagenumber"))
test$anova_p
##                           PC1    PC2    PC3      PC4       PC5     PC6
## clinicalcategorical 9.168e-02 0.4286 0.1710 0.185702 7.118e-01 0.18993
## zymodemecategorical 7.306e-29 0.2921 0.5239 0.373609 7.239e-01 0.86261
## pathogenstrain      2.624e-01 0.1768 0.1649 0.004742 3.411e-01 0.98471
## passagenumber       9.328e-01 0.2103 0.4129 0.001469 2.004e-14 0.02395
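The table above relates each metadata factor to each principal component. A minimal sketch of one way such p-values can be obtained, assuming a one-way ANOVA of the PC scores against the factor; this is an illustration on simulated data with invented names (`expr`, `group`, `pc_pvalue`), and it is not confirmed to be what pca_information() does internally:

```r
set.seed(1)
n_genes <- 200
n_samples <- 12
## Simulated log-scale expression: genes in rows, samples in columns.
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
group <- factor(rep(c("a", "b"), each = 6))
## Shift 50 genes in group b so that the factor drives the first component.
expr[1:50, group == "b"] <- expr[1:50, group == "b"] + 3

## PCA on samples (prcomp expects observations in rows).
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

## One-way ANOVA of a vector of PC scores against a factor; returns the p-value.
pc_pvalue <- function(scores, fact) {
  fit <- aov(scores ~ fact)
  summary(fit)[[1]][["Pr(>F)"]][1]
}

p_values <- apply(pca$x[, 1:3], 2, pc_pvalue, fact = group)
print(p_values)
```

Read this way, the very small zymodemecategorical value for PC1 says that zymodeme explains most of the first component, while clinicalcategorical does not reach a comparable level on any of the six components.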
We have two competing metrics of antimonial sensitivity: one historical and one current. In both cases there is a reasonable expectation that resistant strains tend to be zymodeme 2.3 and sensitive strains tend to be zymodeme 2.2. There appear to be more exceptions to this rule of thumb in the current data than in the historical data.
## [1] 8778 92
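The rule of thumb described above (resistant with z2.3, sensitive with z2.2) can be checked with a simple cross-tabulation. The following is a sketch on an invented metadata table; the column names and values are hypothetical and do not come from the TMRC2 sample sheet:

```r
## Toy sample annotations; every value here is made up for illustration.
meta <- data.frame(
  zymodeme = c("z2.3", "z2.3", "z2.3", "z2.2", "z2.2", "z2.2", "z2.3", "z2.2"),
  susceptibility = c("resistant", "resistant", "resistant", "sensitive",
                     "sensitive", "sensitive", "sensitive", "resistant"))

## Contingency table of zymodeme by susceptibility.
tab <- table(meta$zymodeme, meta$susceptibility)
print(tab)

## Samples which follow the expected pairing, and the exceptions to it.
expected <- (meta$zymodeme == "z2.3" & meta$susceptibility == "resistant") |
  (meta$zymodeme == "z2.2" & meta$susceptibility == "sensitive")
exceptions <- meta[!expected, ]
print(exceptions)
```

Running the same tabulation once with the historical column and once with the current column would make the difference in the number of exceptions explicit.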
sus_norm <- normalize(lp_susceptibility, transform = "log2", convert = "cpm",
                      norm = "quant", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
sus_pca <- plot_pca(sus_norm, plot_title = "PCA of parasite expression values",
                    plot_labels = FALSE)
sus_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail, unknown.
## Warning in pp(file = "figures/sus_norm_pca.pdf", image = sus_pca[["plot"]]):
## There is no device to shut down.
## Warning in pp(file = "figures/sus_norm_pca.svg", image = sus_pca[["plot"]]):
## There is no device to shut down.
lp_susceptibility_known <- subset_se(lp_susceptibility, subset = "batch!='unknown'")
sus_known_norm <- normalize(lp_susceptibility_known, transform = "log2", convert = "cpm",
                            norm = "quant", filter = TRUE)
## Removing 154 low-count genes (8624 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
sus_known_pca <- plot_pca(sus_known_norm, plot_title = "PCA of parasite expression values",
                          plot_labels = FALSE)
sus_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail.
## Warning in pp(file = "figures/sus_norm_known_pca.pdf", image =
## sus_known_pca[["plot"]]): There is no device to shut down.
## Warning in pp(file = "figures/sus_norm_known_pca.svg", image =
## sus_known_pca[["plot"]]): There is no device to shut down.
Once again, drop to our two primary strains…
lp_sus_two <- subset_se(lp_susceptibility, subset = "zymodemecategorical!='z21'") %>%
  subset_se(subset = "zymodemecategorical!='z24'")
sus_two_norm <- normalize(lp_sus_two, transform = "log2", convert = "cpm",
                          norm = "quant", filter = TRUE)
## Removing 150 low-count genes (8628 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
sus_two_pca <- plot_pca(sus_two_norm, plot_title = "PCA of parasite expression values",
                        plot_labels = FALSE)
sus_two_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail, unknown.
## Warning in pp(file = "figures/sus_norm_two_pca.pdf", image =
## sus_two_pca[["plot"]]): There is no device to shut down.
## Warning in pp(file = "figures/sus_norm_two_pca.svg", image =
## sus_two_pca[["plot"]]): There is no device to shut down.
lp_sus_two_known <- subset_se(lp_sus_two, subset = "clinicalcategorical!='unknown'")
sus_two_known_norm <- normalize(lp_sus_two_known, transform = "log2", convert = "cpm",
                                norm = "quant", filter = TRUE)
## Removing 155 low-count genes (8623 remaining).
## transform_counts: Found 32 values equal to 0, adding 1 to the matrix.
sus_two_known_pca <- plot_pca(sus_two_known_norm, plot_title = "PCA of parasite expression values",
                              plot_labels = FALSE)
sus_two_known_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail.
## Warning in pp(file = "figures/sus_norm_two_known_pca.pdf", image =
## sus_two_known_pca[["plot"]]): There is no device to shut down.
## Warning in pp(file = "figures/sus_norm_two_known_pca.svg", image =
## sus_two_known_pca[["plot"]]): There is no device to shut down.
sus_nb <- normalize(lp_susceptibility, transform = "log2", convert = "cpm",
                    batch = "svaseq", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 563 values less than 0.
## transform_counts: Found 563 values equal to 0, adding 1 to the matrix.
sus_nb_pca <- plot_pca(sus_nb, plot_title = "PCA of parasite expression values",
                       plot_labels = FALSE)
sus_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by resistant, sensitive
## Shapes are defined by cure, fail, unknown.
## Warning in pp(file = "images/sus_nb_pca.pdf", image = sus_nb_pca[["plot"]]):
## There is no device to shut down.
## Warning in pp(file = "images/sus_nb_pca.svg", image = sus_nb_pca[["plot"]]):
## There is no device to shut down.
sus_hist_norm <- normalize(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                           norm = "quant", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 96 values equal to 0, adding 1 to the matrix.
sus_hist_pca <- plot_pca(sus_hist_norm, plot_title = "PCA of parasite expression values",
                         plot_labels = FALSE)
sus_hist_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by ambiguous, resistant, sensitive, unknown
## Shapes are defined by cure, fail, unknown.
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in pp(file = "images/sus_hist_norm_pca.pdf", image =
## sus_hist_pca[["plot"]]): There is no device to shut down.
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in MASS::cov.trob(data[, vars], wt = weight * nrow(data)): Probable
## convergence failure
## Warning in pp(file = "images/sus_hist_norm_pca.svg", image =
## sus_hist_pca[["plot"]]): There is no device to shut down.
sus_hist_nb <- normalize(lp_susceptibility_historical, transform = "log2", convert = "cpm",
                         batch = "svaseq", filter = TRUE)
## Removing 149 low-count genes (8629 remaining).
## transform_counts: Found 298 values less than 0.
## transform_counts: Found 298 values equal to 0, adding 1 to the matrix.
sus_hist_nb_pca <- plot_pca(sus_hist_nb, plot_title = "PCA of parasite expression values",
                            plot_labels = FALSE)
sus_hist_nb_pca
## The result of performing a fast_svd dimension reduction.
## The x-axis is PC1 and the y-axis is PC2
## Colors are defined by ambiguous, resistant, sensitive, unknown
## Shapes are defined by cure, fail, unknown.
## Warning in pp(file = "images/sus_hist_nb_pca.pdf", image =
## sus_hist_nb_pca[["plot"]]): There is no device to shut down.
## Warning in pp(file = "images/sus_hist_nb_pca.svg", image =
## sus_hist_nb_pca[["plot"]]): There is no device to shut down.
Najib read me an email listing off the gene names associated with the zymodeme classification. I took those names and cross referenced them against the Leishmania panamensis gene annotations and found the following:
They are:
Given these 6 gene IDs (NH has two gene IDs associated with it), I can do some looking for specific differences among the various samples.
The following creates a colorspace (red to green) heatmap showing the observed expression of these genes in every sample.
my_genes <- c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
              "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300",
              "other")
my_names <- c("ALAT", "ASAT", "G6PD", "NHv1", "NHv2", "MPI", "other")
zymo_se <- exclude_genes(strain_norm, ids = my_genes, method = "keep")
## Note, I renamed this to subset_genes().
## subset_genes(), before removal, there were 8629 genes, now there are 6.
## There are 92 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037 TMRC20038
## 0.08587 0.08454 0.08368 0.08346 0.08132 0.08408 0.08142 0.08294
## TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010 TMRC20016 TMRC20011
## 0.08342 0.08390 0.08245 0.08428 0.08310 0.08372 0.08304 0.08288
## TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 TMRC20019 TMRC20070 TMRC20020
## 0.08485 0.08515 0.08275 0.08332 0.08290 0.08304 0.08350 0.08154
## TMRC20021 TMRC20022 TMRC20024 TMRC20036 TMRC20069 TMRC20033 TMRC20026 TMRC20031
## 0.08139 0.08476 0.08158 0.08203 0.08201 0.08208 0.08690 0.08142
## TMRC20076 TMRC20073 TMRC20055 TMRC20079 TMRC20071 TMRC20078 TMRC20094 TMRC20042
## 0.08260 0.08427 0.08367 0.08462 0.08370 0.08320 0.08349 0.08360
## TMRC20058 TMRC20072 TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056 TMRC20060
## 0.08254 0.08334 0.08301 0.08181 0.08540 0.08423 0.08398 0.08254
## TMRC20077 TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075 TMRC20051
## 0.08337 0.08304 0.08185 0.08225 0.08206 0.08254 0.08315 0.08381
## TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083 TMRC20054
## 0.08196 0.08469 0.08361 0.08451 0.08162 0.08284 0.08379 0.08424
## TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090 TMRC20044 TMRC20045
## 0.08369 0.08478 0.08396 0.08296 0.08368 0.08111 0.08464 0.08318
## TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096 TMRC20101 TMRC20092 TMRC20082
## 0.08388 0.08252 0.08391 0.08428 0.08292 0.08302 0.08254 0.08219
## TMRC20102 TMRC20099 TMRC20100 TMRC20091 TMRC20084 TMRC20087 TMRC20103 TMRC20104
## 0.08278 0.08408 0.08265 0.08430 0.08253 0.08380 0.08376 0.08352
## TMRC20086 TMRC20107 TMRC20081 TMRC20095
## 0.08305 0.08097 0.08154 0.07737
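The gene subsetting step above can be illustrated with base R on a toy matrix. This is only a sketch: the gene IDs, labels, and values are invented, and it is not how exclude_genes()/subset_genes() is implemented. It also shows one way to tolerate a requested ID (such as "other") which is absent from the data:

```r
## Toy expression matrix: 4 genes by 3 samples, filled column-wise with 1..12.
expr <- matrix(1:12, nrow = 4,
               dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                               c("s1", "s2", "s3")))

## Requested IDs and their human-readable labels (hypothetical).
wanted_ids <- c("geneB", "geneD", "other")
wanted_names <- c("enzyme1", "enzyme2", "other")

## Keep only the requested IDs which actually exist in the matrix.
keep <- wanted_ids %in% rownames(expr)
subset_expr <- expr[wanted_ids[keep], , drop = FALSE]

## Relabel the rows so a heatmap would show the friendly names.
rownames(subset_expr) <- wanted_names[keep]
print(subset_expr)
```

The resulting small matrix is the kind of input one would hand to a heatmap function, with one row per gene of interest and one column per sample.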
A recent suggestion asked how our amastigote TMRC2 samples, which were the result of infecting a set of macrophages, relate to these promastigote samples.
So far we have kept these two experiments separate; now let us merge them.
tmrc2_macrophage_norm <- normalize(lp_macrophage, transform = "log2", convert = "cpm",
                                   norm = "quant", filter = TRUE)
## Removing 0 low-count genes (8778 remaining).
## transform_counts: Found 3577 values equal to 0, adding 1 to the matrix.
## Hey you, this annotation call should be made automatic for the container!
annotation(lp_se) <- "org.Lpanamensis.MHOMCOL81L13.v46.eg.db"
annotation(lp_macrophage) <- annotation(lp_se)
all_tmrc2 <- hpgltools:::combine_se(lp_se, lp_macrophage)
missing_ids <- is.na(colData(all_tmrc2)[["sampleid"]])
message("HEY! If you are looking for Error after 20260408, remember me to see if I changed the correct file!")
## HEY! If you are looking for Error after 20260408, remember me to see if I changed the correct file!
Before we can use the combined data, we must reconcile a few aspects of it; notably, we need to specify which samples are amastigotes and which are promastigotes.
all_nosb <- all_tmrc2
colData(all_nosb)[["stage"]] <- "promastigote"
na_idx <- is.na(colData(all_nosb)[["macrophagetreatment"]])
colData(all_nosb)[na_idx, "macrophagetreatment"] <- "undefined"
all_nosb <- subset_se(all_nosb, subset = "macrophagetreatment!='inf_sb'")
ama_idx <- colData(all_nosb)[["macrophagetreatment"]] == "inf"
colData(all_nosb)[ama_idx, "stage" ] <- "amastigote"
## Make sure that the zymodeme does not have the inf_ prefix.
zymodeme_char <- gsub(x = colData(all_nosb)[["condition"]], pattern = "^inf_", replacement = "")
colData(all_nosb)[["condition"]] <- zymodeme_char
colData(all_nosb)[["batch"]] <- colData(all_nosb)[["stage"]]
all_nosb <- subset_se(all_nosb, subset = "condition!='none'")
all_norm <- normalize(all_nosb, convert = "cpm", norm = "quant",
                      transform = "log2", filter = TRUE)
## Removing 94 low-count genes (8684 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
I think the above picture is sort of the opposite of what we want to compare in a DE analysis for this set of data, e.g. do we want to compare promastigotes to amastigotes?
colData(all_nosb)[["sampleid"]] <- rownames(colData(all_nosb))
two_nosb <- set_batches(all_nosb, fact = "condition") %>%
  set_conditions(fact = "stage") %>%
  subset_se(subset = "batch=='z2.2'|batch=='z2.3'")
## The number of samples by batch are:
##
## z2.1 z2.2 z2.3 z2.4
##    7   56   56    2
## The numbers of samples by condition are:
##
##   amastigote promastigote
##           29           92
## It looks like some sampleIDs got messed up in the merge
two_norm <- normalize(two_nosb, convert = "cpm", norm = "quant",
                      transform = "log2", filter = TRUE)
## Removing 94 low-count genes (8684 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
zy_stage_factor <- paste0(colData(two_nosb)[["batch"]], "_",
                          colData(two_nosb)[["stage"]])
colData(two_nosb)[["zystage"]] <- zy_stage_factor
zystage <- set_conditions(two_nosb, fact = "zystage")
## The numbers of samples by condition are:
##
##   z2.2_amastigote z2.2_promastigote   z2.3_amastigote z2.3_promastigote
##                14                42                15                41
zystage_norm <- normalize(zystage, filter = TRUE, norm = "quant",
                          convert = "cpm", transform = "log2")
## Removing 94 low-count genes (8684 remaining).
## transform_counts: Found 81 values equal to 0, adding 1 to the matrix.
zystage_keepers <- list(
  "z2322_ama" = c("z23_amastigote", "z22_amastigote"),
  "z2322_pro" = c("z23_promastigote", "z22_promastigote"),
  "proama_z23" = c("z23_amastigote", "z23_promastigote"),
  "proama_z22" = c("z22_amastigote", "z22_promastigote"))
zystage_de <- all_pairwise(zystage, filter = TRUE, model_batch = "svaseq",
                           model_fstring = "~ 0 + condition")
##   z2.2_amastigote z2.2_promastigote   z2.3_amastigote z2.3_promastigote
##                14                42                15                41
## Removing 94 low-count genes (8684 remaining).
## Basic step 0/3: Normalizing data.
## Basic step 0/3: Converting data.
## I think this is failing? SummarizedExperiment
## Basic step 0/3: Transforming data.
## Setting 12029 entries to zero.
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## conditions
## z22_amastigote z22_promastigote z23_amastigote z23_promastigote
## 14 42 15 41
## conditions
## z22_amastigote z22_promastigote z23_amastigote z23_promastigote
## 14 42 15 41
## conditions
## z22_amastigote z22_promastigote z23_amastigote z23_promastigote
## 14 42 15 41
zystage_tables <- combine_de_tables(
  zystage_de, keepers = zystage_keepers,
  excel = glue("excel/zymodeme_stage_table-v{ver}.xlsx"))
## Looking for subscript invalid names, start of extract_keepers.
## Looking for subscript invalid names, end of extract_keepers.
I want to make a plot where the x-axis is the number of genes on a chromosome and the y-axis is the mean of the expression of those genes.
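A minimal sketch of the aggregation behind such a plot, using a made-up gene table. The column names (`chromosome`, `mean_expression`) and all values are hypothetical and are not taken from the annotation or from plot_assay_by_chromosome():

```r
## Toy per-gene table: which chromosome each gene is on and its mean expression.
genes <- data.frame(
  chromosome = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr3"),
  mean_expression = c(4, 6, 5, 8, 10, 3))

## y-axis: mean expression of the genes on each chromosome.
per_chr <- aggregate(mean_expression ~ chromosome, data = genes, FUN = mean)

## x-axis: number of genes observed on each chromosome.
gene_counts <- table(genes$chromosome)
per_chr$n_genes <- as.vector(gene_counts[per_chr$chromosome])

print(per_chr)
```

Calling `plot(per_chr$n_genes, per_chr$mean_expression)` on this table would then give one point per chromosome, with gene count on the x-axis and mean expression on the y-axis.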
assay_by_chr_plot <- plot_assay_by_chromosome(lp_zymo, chromosome_column = "chromosome")
assay_by_chr_plot[["plot"]]
One potentially interesting aspect of the variant data: it may be able to help us define the zymodeme state of previous, untested samples.
In order to test this, I am loading some of the 2016 data alongside the new TMRC2 data to see if they fit together.
This uses an older dataset which I am not sure we have permission to include in the container, so I am turning these blocks off for now.
old_se <- create_se("sample_sheets/tmrc2_samples_20191203.xlsx",
file_column = "tophat2file")
tt <- old_se$expressionset
rownames(tt) <- gsub(pattern = "^exon_", replacement = "", x = rownames(tt))
rownames(tt) <- gsub(pattern = "\\.1$", replacement = "", x = rownames(tt))
old_se$expressionset <- tt
rm(tt)
One other important caveat: we have a group of new samples which have not yet been run through the variant search pipeline, so I need to remove them from consideration. Though it looks like they finished overnight…
In the non-containerized version of this document, the following block combines an older dataset with the current data.
both_norm <- normalize(new_snps_sufficient, transform = "log2", norm = "quant") %>%
  set_conditions(fact = "pathogenstrain")
## transform_counts: Found 79143354 values equal to 0, adding 1 to the matrix.
## The numbers of samples by condition are:
##
## 10070 10750 10772 10977 11006 11024 11026 11028 11031 11045
## 1 1 1 1 1 1 1 1 1 1
## 11071 11075 11090 11108 11109 11126-I 11133 11134 11152 1131
## 1 1 1 1 1 1 1 1 1 1
## 12116 12166 12169 12218-I 12251 12309 12312 12355 12367 12371
## 1 1 1 1 1 1 1 1 1 1
## 12417 12444 12479 12535 12554 12556 12570 12578 12581 12588
## 1 1 1 1 1 1 1 1 1 1
## 13464 13473 13474 13582 13589 13595 13597 13625 13631 13703
## 1 1 1 1 1 1 1 1 1 1
## 13720 13740 13787 13794 13978 14016 14056 14096 14103 14111
## 1 1 1 1 1 1 1 1 1 1
## 14149 2122 2168 2173 2183 2198 2272 2330 2331 2411
## 1 1 1 1 1 1 1 1 1 1
## 2414 2423 2429 2439 2472 2482 2496 2500 3117 4700
## 1 1 1 1 1 1 1 1 1 1
## 4745 4810 4829 4830 4876 5986 6957 7011 7105 7158
## 1 1 1 1 1 1 1 1 1 1
## 8190
## 1
The data structure ‘both_norm’ now contains our 2016 data along with the newer data collected since 2019.
The following plot shows the SNP profiles of all samples (old and new) where the colors at the top show either the 2.2 strains (orange), 2.3 strains (green), the previous samples (purple), or the various lab strains (pink etc).
pdf_snp_disheat <- pp(file = "images/raw_snp_disheat.pdf", height = 12, width = 12,
                      image = new_variant_heatmap[["plot"]])
## Warning in pp(file = "images/raw_snp_disheat.pdf", height = 12, width = 12, :
## There is no device to shut down.
svg_snp_disheat <- pp(file = "images/raw_snp_disheat.svg", height = 12, width = 12,
                      image = new_variant_heatmap[["plot"]])
## Warning in pp(file = "images/raw_snp_disheat.svg", height = 12, width = 12, :
## There is no device to shut down.
The function get_snp_sets() takes the provided metadata factor (in this case ‘condition’) and looks for variants which are exclusive to each element in it. In this case, this is looking for differences between 2.2 and 2.3, as well as the set shared among them.
## The samples represent the following categories:
##
## z2.1 z2.2 z2.3 z2.4
## 7 42 40 2
## Using a proportion of observed variants, converting the data to binary observations.
## The factor z2.1 has 7 rows.
## The factor z2.2 has 42 rows.
## The factor z2.3 has 40 rows.
## The factor z2.4 has 2 rows.
## Finished iterating over the chromosomes.
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: condition. 4
## categories and 927126 variants were observed with 15
## combinations among them. 725 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.114678899082569.
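The idea of variants which are exclusive to one category, versus shared among categories, can be sketched with plain set operations. This is a toy illustration and not the get_snp_sets() implementation: the matrix, the sample names, and the "observed in more than half of the samples" cutoff (`min_proportion`) are all assumptions made for the example.

```r
## Toy binary calls: variants in rows, samples in columns (1 = variant observed).
calls <- matrix(
  c(1, 1, 0, 0,
    0, 0, 1, 1,
    1, 1, 1, 1,
    1, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("var", 1:4), paste0("s", 1:4)))
condition <- c(s1 = "z2.2", s2 = "z2.2", s3 = "z2.3", s4 = "z2.3")

## Hypothetical cutoff: a variant counts for a category when it is
## observed in more than this proportion of the category's samples.
min_proportion <- 0.5

present_in <- function(cond) {
  samples <- names(condition)[condition == cond]
  prop <- rowMeans(calls[, samples, drop = FALSE])
  names(prop)[prop > min_proportion]
}

z22 <- present_in("z2.2")
z23 <- present_in("z2.3")

only_z22 <- setdiff(z22, z23)   # exclusive to z2.2
only_z23 <- setdiff(z23, z22)   # exclusive to z2.3
shared <- intersect(z22, z23)   # observed in both
print(list(only_z22 = only_z22, only_z23 = only_z23, shared = shared))
```

With four categories, as in the output above, the same logic yields up to 15 non-empty combinations, which matches the count reported there.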
snp_genes <- snps_vs_genes(lp_se, snp_sets, chr_column = exp_chr_col,
                           start_column = exp_start_col, end_column = exp_end_col)
## The snp grange data has 927126 elements.
## The first few snp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000002, LPAL13_SCAF000003, LPAL13_SCAF000004, LPAL13_SCAF000005, LPAL13_SCAF000007
## The first few exp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000003, LPAL13_SCAF000010, LPAL13_SCAF000011, LPAL13_SCAF000017, LPAL13_SCAF000021
## There are 437555 overlapping variants and genes.
## When the variants observed were cross referenced against annotated genes,
## 8633 genes were observed with at least 1 variant.
## LPAL13_250017600 had the most variants, with 889.
## I think we have some metrics here we can plot...
snp_subset <- snp_subset_genes(
  lp_se, new_snps_sufficient, start_column = exp_start_col, end_column = exp_end_col,
  exp_name_column = exp_chr_col,
  genes = c("LPAL13_120010900", "LPAL13_340013000", "LPAL13_000054100",
            "LPAL13_140006100", "LPAL13_180018500", "LPAL13_320022300"))
## subset_genes(), before removal, there were 927126 genes, now there are 85.
## There are 91 samples which kept less than 90 percent counts.
## TMRC20001 TMRC20065 TMRC20004 TMRC20005 TMRC20066 TMRC20039 TMRC20037 TMRC20038
## 0.0363994 0.0284342 0.0704007 0.0446300 0.0244539 0.0218095 0.0228205 0.0244650
## TMRC20067 TMRC20068 TMRC20041 TMRC20015 TMRC20009 TMRC20010 TMRC20016 TMRC20011
## 0.0259861 0.0275633 0.0084708 0.0249880 0.0000000 0.0278667 0.0232143 0.0243409
## TMRC20012 TMRC20013 TMRC20017 TMRC20014 TMRC20018 TMRC20019 TMRC20070 TMRC20020
## 0.0778398 0.0294979 0.0102837 0.0191370 0.0239034 0.0282985 0.0274939 0.0235432
## TMRC20021 TMRC20022 TMRC20024 TMRC20036 TMRC20069 TMRC20033 TMRC20026 TMRC20031
## 0.0286477 0.0000000 0.0212984 0.0089603 0.0270318 0.0019682 0.0352553 0.0199682
## TMRC20076 TMRC20073 TMRC20055 TMRC20079 TMRC20071 TMRC20078 TMRC20094 TMRC20042
## 0.0270505 0.0282364 0.0395199 0.0280768 0.0247556 0.0177539 0.0279169 0.0398656
## TMRC20058 TMRC20072 TMRC20059 TMRC20048 TMRC20057 TMRC20088 TMRC20056 TMRC20060
## 0.0256906 0.0158464 0.0251221 0.0238737 0.0062348 0.0349161 0.0003009 0.0294943
## TMRC20077 TMRC20074 TMRC20063 TMRC20053 TMRC20052 TMRC20064 TMRC20075 TMRC20051
## 0.0340198 0.0282781 0.0017690 0.0202252 0.0274156 0.0280939 0.0236363 0.0297070
## TMRC20050 TMRC20049 TMRC20062 TMRC20110 TMRC20080 TMRC20043 TMRC20083 TMRC20054
## 0.0324800 0.0338522 0.0314505 0.0350400 0.0288995 0.0273341 0.0121428 0.0298779
## TMRC20085 TMRC20046 TMRC20093 TMRC20089 TMRC20047 TMRC20090 TMRC20044 TMRC20045
## 0.0251593 0.0054459 0.0065508 0.0269888 0.0299476 0.0273432 0.0316706 0.0052405
## TMRC20105 TMRC20108 TMRC20109 TMRC20098 TMRC20096 TMRC20101 TMRC20092 TMRC20102
## 0.0303013 0.0267368 0.0184037 0.0265627 0.0202590 0.0282691 0.0024627 0.0250142
## TMRC20099 TMRC20100 TMRC20091 TMRC20084 TMRC20087 TMRC20103 TMRC20104 TMRC20086
## 0.0291960 0.0266043 0.0274472 0.0063494 0.0291851 0.0065616 0.0265140 0.0251956
## TMRC20107 TMRC20081 TMRC20095
## 0.0200527 0.0133293 0.0136510
## Removing 0 low-count genes (85 remaining).
## transform_counts: Found 7122 values equal to 0, adding 1 to the matrix.
Najib has asked a few times about the relationship between variants and DE genes. In subsequent conversations I figured out that what he really wants to learn about is variants in the UTR (most likely 5’) which might affect expression of genes. The following explicitly does not address that question, but asks a parallel one: is there a relationship between variants in the CDS and differential expression?
In order to do this comparison, we need to reload some of the DE results.
These blocks need to be moved to post-differential analyses
rda <- glue("rda/zymo_tables_sva-v{ver}.rda")
varname <- gsub(x = basename(rda), pattern = "\\.rda", replacement = "")
loaded <- load(file = rda)
zy_df <- get0(varname)[["data"]][["zymodeme"]]
vars_df <- data.frame(ID = names(snp_genes$summary_by_gene),
                      variants = as.numeric(snp_genes$summary_by_gene))
vars_df[["variants"]] <- log2(vars_df[["variants"]] + 1)
vars_by_de_gene <- merge(zy_df, vars_df, by.x = "row.names", by.y = "ID")
cor.test(vars_by_de_gene$deseq_logfc, vars_by_de_gene$variants)
variants_wrt_logfc <- plot_linear_scatter(vars_by_de_gene[, c("deseq_logfc", "variants")])
variants_wrt_logfc$scatter
## It looks like there might be some genes of interest, even though this is not actually
## the question of interest.
Didn’t I create a set of densities by chromosome? Oh I think they come in from get_snp_sets()
## The samples represent the following categories:
##
## cure failure nd
## 40 33 18
## Using a proportion of observed variants, converting the data to binary observations.
## The factor cure has 40 rows.
## The factor failure has 33 rows.
## The factor nd has 18 rows.
## Finished iterating over the chromosomes.
## A set of variants observed when cross referencing all variants against
## the samples associated with each metadata factor: clinicalresponse. 3
## categories and 927126 variants were observed with 7
## combinations among them. 725 chromosomes/scaffolds were observed with a
## density of variants ranging from 0.000652315720808871 to 0.114678899082569.
density_vec <- clinical_sets[["density"]]
chromosome_idx <- grep(pattern = "LpaL", x = names(density_vec))
density_df <- as.data.frame(density_vec[chromosome_idx])
density_df[["chr"]] <- rownames(density_df)
colnames(density_df) <- c("density_vec", "chr")
var_den_chr <- ggplot(density_df, aes(x = chr, y = density_vec)) +
  ggplot2::geom_col() +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 10, colour = "black"),
                 axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))
var_den_chr
## Warning in pp(file = "figures/variant_density_by_chromosome.pdf", image =
## var_den_chr): There is no device to shut down.
## Warning in pp(file = "figures/variant_density_by_chromosome.svg", image =
## var_den_chr): There is no device to shut down.
## oops, forgot to export write_snps... fixed.
clinical_written <- write_snps(new_snps_sufficient, output_file = "excel/clinical_variants.aln")
clinical_genes <- snps_vs_genes(lp_se, clinical_sets, chr_column = exp_chr_col,
                                start_column = exp_start_col, end_column = exp_end_col)
## The snp grange data has 927126 elements.
## The first few snp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000002, LPAL13_SCAF000003, LPAL13_SCAF000004, LPAL13_SCAF000005, LPAL13_SCAF000007
## The first few exp chromosomes are: LPAL13_SCAF000001, LPAL13_SCAF000003, LPAL13_SCAF000010, LPAL13_SCAF000011, LPAL13_SCAF000017, LPAL13_SCAF000021
## There are 437555 overlapping variants and genes.
snp_density <- merge(as.data.frame(clinical_genes[["summary"]]),
as.data.frame(rowData(lp_se)),
by = "row.names")
snp_density <- snp_density[, c(1, 2, 4, 15)]
colnames(snp_density) <- c("name", "snps", "product", "length")
snp_density[["product"]] <- tolower(snp_density[["product"]])
snp_density[["length"]] <- as.numeric(snp_density[["length"]])
snp_density[["density"]] <- as.numeric(snp_density[["snps"]]) / snp_density[["length"]]
snp_idx <- order(snp_density[["density"]], decreasing = TRUE)
snp_density <- snp_density[snp_idx, ]
removers <- c("amastin", "gp63", "leishmanolysin")
for (r in removers) {
  drop_idx <- grepl(pattern = r, x = snp_density[["product"]])
  snp_density <- snp_density[!drop_idx, ]
}
## Filter these for [A|a]mastin gp63 Leishmanolysin
Let us extract the number of variants per gene for the cure/fail samples, merge them into a data frame, and add that to the gene annotations for the lp_se data structure.
clinical_snps <- snps_intersections(lp_se, clinical_sets, chr_column = exp_chr_col, start_column = exp_start_col, end_column = exp_end_col)
fail_ref_snps <- as.data.frame(clinical_snps[["inters"]][["failure, reference strain"]])
fail_ref_snps <- rbind(fail_ref_snps,
as.data.frame(clinical_snps[["inters"]][["failure"]]))
cure_snps <- as.data.frame(clinical_snps[["inters"]][["cure"]])
head(fail_ref_snps)
## seqnames start end
## chr_LPAL13-SCAF000063_pos_2573_ref_C_alt_A LPAL13-SCAF000063 2573 2574
## chr_LPAL13-SCAF000165_pos_363_ref_G_alt_C LPAL13-SCAF000165 363 364
## chr_LPAL13-SCAF000627_pos_2267_ref_G_alt_T LPAL13-SCAF000627 2267 2268
## chr_LpaL13-01_pos_26758_ref_T_alt_C LpaL13-01 26758 26759
## chr_LpaL13-03_pos_164108_ref_A_alt_G LpaL13-03 164108 164109
## chr_LpaL13-03_pos_236263_ref_T_alt_C LpaL13-03 236263 236264
## width strand
## chr_LPAL13-SCAF000063_pos_2573_ref_C_alt_A 2 +
## chr_LPAL13-SCAF000165_pos_363_ref_G_alt_C 2 +
## chr_LPAL13-SCAF000627_pos_2267_ref_G_alt_T 2 +
## chr_LpaL13-01_pos_26758_ref_T_alt_C 2 +
## chr_LpaL13-03_pos_164108_ref_A_alt_G 2 +
## chr_LpaL13-03_pos_236263_ref_T_alt_C 2 +
## seqnames start end
## chr_LPAL13-SCAF000397_pos_1312_ref_G_alt_A LPAL13-SCAF000397 1312 1313
## chr_LPAL13-SCAF000627_pos_1869_ref_A_alt_G LPAL13-SCAF000627 1869 1870
## chr_LPAL13-SCAF000791_pos_906_ref_C_alt_T LPAL13-SCAF000791 906 907
## chr_LpaL13-06_pos_2975_ref_T_alt_C LpaL13-06 2975 2976
## chr_LpaL13-09_pos_58219_ref_G_alt_A LpaL13-09 58219 58220
## chr_LpaL13-10_pos_185578_ref_C_alt_T LpaL13-10 185578 185579
## width strand
## chr_LPAL13-SCAF000397_pos_1312_ref_G_alt_A 2 +
## chr_LPAL13-SCAF000627_pos_1869_ref_A_alt_G 2 +
## chr_LPAL13-SCAF000791_pos_906_ref_C_alt_T 2 +
## chr_LpaL13-06_pos_2975_ref_T_alt_C 2 +
## chr_LpaL13-09_pos_58219_ref_G_alt_A 2 +
## chr_LpaL13-10_pos_185578_ref_C_alt_T 2 +
write.csv(file = "excel/cure_variants.txt", x = rownames(cure_snps))
write.csv(file = "excel/fail_variants.txt", x = rownames(fail_ref_snps))
annot <- rowData(lp_se)
clinical_interest_cure <- as.data.frame(clinical_snps[["gene_summaries"]][["cure"]])
summary(as.factor(clinical_interest_cure[[1]]))
## 0 1 2 3
## 8729 41 5 3
clinical_interest_fail <- as.data.frame(clinical_snps[["gene_summaries"]][["failure"]])
summary(as.factor(clinical_interest_fail[[1]]))
## 0 1 2 3
## 8740 35 1 2
clinical_interest <- merge(clinical_interest_cure,
clinical_interest_fail,
by = "row.names", all = TRUE)
rownames(clinical_interest) <- clinical_interest[["Row.names"]]
clinical_interest[["Row.names"]] <- NULL
colnames(clinical_interest) <- c("cure_snps", "fail_snps")
clinical_annot <- merge(annot, clinical_interest, by = "row.names")
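## The 'Row.names' column is created by merge() on the merged object, not on annot.
rownames(clinical_annot) <- clinical_annot[["Row.names"]]
clinical_annot[["Row.names"]] <- NULL
dim(annot)
## [1] 8778 111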
## [1] 8778 111
The heatmap produced here should show the variants only for the zymodeme genes.
I am thinking that if we find clusters of locations which are variant, that might provide some PCR testing possibilities.
## Drop the 2.1, 2.4, unknown, and null
pruned_snps <- subset_se(new_snps_sufficient, subset = "condition=='z2.2'|condition=='z2.3'")
new_sets <- get_snp_sets(pruned_snps, factor = "zymodemecategorical")
## The samples represent the following categories:
##
## z21 z22 z23 z24
## 0 42 40 0
## Using a proportion of observed variants, converting the data to binary observations.
## Warning in median_by_factor(assay(data), colData(data), fact = fact, fun =
## fun): The level z21 of the factor has no columns.
## The factor z22 has 42 rows.
## The factor z23 has 40 rows.
## Warning in median_by_factor(assay(data), colData(data), fact = fact, fun =
## fun): The level z24 of the factor has no columns.
## Finished iterating over the chromosomes.
## Length Class Mode
## factor 1 -none- character
## values 2 data.frame list
## observations 3 data.frame list
## possibilities 2 -none- character
## intersections 3 -none- list
## chr_data 725 -none- list
## set_names 4 -none- list
## invert_names 4 -none- list
## density 725 -none- numeric
## Length Class Mode
## 799 character character
write.csv(file = "excel/variants_22.csv", x = new_sets[["intersections"]][["10"]])
summary(new_sets[["intersections"]][["01"]])
## Length Class Mode
## 67068 character character
Thus we see that, in this run, there are 799 variants associated with 2.2 and 67,068 associated with 2.3.
The following function uses the positional data to look for sequential mismatches associated with zymodeme in the hopes that there will be some regions which would provide good potential targets for a PCR-based assay.
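As a toy illustration of the idea (not the function itself), run length encoding can find stretches of closely spaced variants. The positions below are invented and sit on a single hypothetical chromosome; the separation threshold of 2 is likewise an assumption for this sketch.

```r
## Hypothetical variant positions on one chromosome.
pos <- c(100, 101, 103, 500, 900, 901, 902, 904)
## Distance from each variant to the previous one; the first keeps its own position.
dist <- c(pos[1], diff(pos))
## TRUE when a variant lies within 2 nt of its predecessor.
close <- dist <= 2
runs <- rle(close)
## cumsum() of the run lengths gives the index of the last element of each run.
run_ends <- cumsum(runs[["lengths"]])
is_true <- runs[["values"]]
clusters <- data.frame(
  end_pos = pos[run_ends[is_true]],
  n_close = runs[["lengths"]][is_true])
clusters
```

A run of n close distances spans n + 1 variants, because the first variant of each cluster is far from its predecessor and is therefore not counted in the run.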
sequential_variants <- function(snp_sets, conditions = NULL, minimum = 3, maximum_separation = 3) {
if (is.null(conditions)) {
conditions <- 1
}
intersection_sets <- snp_sets[["intersections"]]
intersection_names <- snp_sets[["set_names"]]
chosen_intersection <- 1
if (is.numeric(conditions)) {
chosen_intersection <- conditions
} else {
intersection_idx <- intersection_names == conditions
chosen_intersection <- names(intersection_names)[intersection_idx]
}
possible_positions <- intersection_sets[[chosen_intersection]]
position_table <- data.frame(row.names = possible_positions)
pat <- "^chr_(.+)_pos_(.+)_ref_.*$"
position_table[["chr"]] <- gsub(pattern = pat, replacement = "\\1", x = rownames(position_table))
position_table[["pos"]] <- as.numeric(gsub(pattern = pat, replacement = "\\2", x = rownames(position_table)))
position_idx <- order(position_table[, "chr"], position_table[, "pos"])
position_table <- position_table[position_idx, ]
position_table[["dist"]] <- 0
last_chr <- ""
for (r in seq_len(nrow(position_table))) {
this_chr <- position_table[r, "chr"]
if (r == 1) {
position_table[r, "dist"] <- position_table[r, "pos"]
last_chr <- this_chr
next
}
if (this_chr == last_chr) {
position_table[r, "dist"] <- position_table[r, "pos"] - position_table[r - 1, "pos"]
} else {
position_table[r, "dist"] <- position_table[r, "pos"]
}
last_chr <- this_chr
}
## Working interactively here.
doubles <- position_table[["dist"]] == 1
doubles <- position_table[doubles, ]
write.csv(doubles, "doubles.csv")
one_away <- position_table[["dist"]] == 2
one_away <- position_table[one_away, ]
write.csv(one_away, "one_away.csv")
two_away <- position_table[["dist"]] == 3
two_away <- position_table[two_away, ]
write.csv(two_away, "two_away.csv")
combined <- rbind(doubles, one_away)
combined <- rbind(combined, two_away)
position_idx <- order(combined[, "chr"], combined[, "pos"])
combined <- combined[position_idx, ]
this_chr <- ""
for (r in seq_len(nrow(combined))) {
this_chr <- combined[r, "chr"]
if (r == 1) {
combined[r, "dist_pair"] <- combined[r, "pos"]
last_chr <- this_chr
next
}
if (this_chr == last_chr) {
combined[r, "dist_pair"] <- combined[r, "pos"] - combined[r - 1, "pos"]
} else {
combined[r, "dist_pair"] <- combined[r, "pos"]
}
last_chr <- this_chr
}
dist_pair_maximum <- 1000
dist_pair_minimum <- 200
dist_pair_idx <- combined[["dist_pair"]] <= dist_pair_maximum &
combined[["dist_pair"]] >= dist_pair_minimum
remaining <- combined[dist_pair_idx, ]
no_weak_idx <- grepl(pattern = "ref_(G|C)", x = rownames(remaining))
remaining <- remaining[no_weak_idx, ]
print(head(table(position_table[["dist"]])))
sequentials <- position_table[["dist"]] <= maximum_separation
message("There are ", sum(sequentials), " candidate regions.")
## The following can tell me how many runs of each length occurred, that is not quite what I want.
## Now use run length encoding to find the set of sequential sequentials!
rle_result <- rle(sequentials)
rle_values <- rle_result[["values"]]
## The following line is equivalent to just leaving values alone:
## true_values <- rle_result[["values"]] == TRUE
rle_lengths <- rle_result[["lengths"]]
true_sequentials <- rle_lengths[rle_values]
rle_idx <- cumsum(rle_lengths)[which(rle_values)]
position_table[["last_sequential"]] <- 0
count <- 0
for (r in rle_idx) {
count <- count + 1
position_table[r, "last_sequential"] <- true_sequentials[count]
}
message("The maximum sequential set is: ", max(position_table[["last_sequential"]]), ".")
wanted_idx <- position_table[["last_sequential"]] >= minimum
wanted <- position_table[wanted_idx, c("chr", "pos")]
return(wanted)
}
zymo22_sequentials <- sequential_variants(new_sets, conditions = "z22",
minimum = 1, maximum_separation = 2)
dim(zymo22_sequentials)
## 7 candidate regions for zymodeme 2.2 -- thus I am betting that the reference strain is a 2.2
zymo23_sequentials <- sequential_variants(new_sets, conditions = "z23",
minimum = 2, maximum_separation = 2)
dim(zymo23_sequentials)
## In contrast, there are lots (587) of interesting regions for 2.3!
The first 4 candidate regions from my set of remaining:
* Chr Pos. Distance
* LpaL13-15 238433 448
* LpaL13-18 142844 613
* LpaL13-29 830342 252
* LpaL13-33 1331507 843
Let us define a couple of terms:
* Third: Each of the 4 above positions.
* Second: Third - Distance
* End: Third + PrimerLen
* Start: Second - PrimerLen
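The terms above reduce to simple arithmetic. The following is a minimal sketch in base R; `primer_coords()` and `revcomp()` are hypothetical helpers written for this illustration and are not part of hpgltools or Biostrings. The numbers are those of the first candidate listed above.

```r
## Hypothetical helper: compute the four coordinates from the terms defined above.
primer_coords <- function(third, distance, primer_length = 22) {
  second <- third - distance
  list(
    start = second - primer_length,
    second = second,
    third = third,
    end = third + primer_length)
}
## Hypothetical helper: reverse complement of a plain DNA string (A/C/G/T only).
revcomp <- function(dna) {
  complemented <- chartr("ACGT", "TGCA", dna)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}
first <- primer_coords(third = 238433, distance = 448)
first
revcomp("AACG")
```

The 5' primer would then cover Start to Second, and the 3' primer would be the reverse complement of Third to End.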
In each instance, these are the last positions, so we want to grab three things:
## * LpaL13-15 238433 448
first_candidate_chr <- lp_genome[["LpaL13_15"]]
primer_length <- 22
amplicon_length <- 448
first_candidate_third <- 238433
first_candidate_second <- first_candidate_third - amplicon_length
first_candidate_start <- first_candidate_second - primer_length
first_candidate_end <- first_candidate_third + primer_length
first_candidate_region <- subseq(first_candidate_chr, first_candidate_start, first_candidate_end)
first_candidate_region
first_candidate_5p <- subseq(first_candidate_chr, first_candidate_start, first_candidate_second)
as.character(first_candidate_5p)
first_candidate_3p <- spgs::reverseComplement(subseq(first_candidate_chr, first_candidate_third, first_candidate_end))
first_candidate_3p
## * LpaL13-18 142844 613
second_candidate_chr <- lp_genome[["LpaL13_18"]]
primer_length <- 22
amplicon_length <- 613
second_candidate_third <- 142844
second_candidate_second <- second_candidate_third - amplicon_length
second_candidate_start <- second_candidate_second - primer_length
second_candidate_end <- second_candidate_third + primer_length
second_candidate_region <- subseq(second_candidate_chr, second_candidate_start, second_candidate_end)
second_candidate_region
second_candidate_5p <- subseq(second_candidate_chr, second_candidate_start, second_candidate_second)
as.character(second_candidate_5p)
second_candidate_3p <- spgs::reverseComplement(subseq(second_candidate_chr, second_candidate_third, second_candidate_end))
second_candidate_3p
## * LpaL13-29 830342 252
third_candidate_chr <- lp_genome[["LpaL13_29"]]
primer_length <- 22
amplicon_length <- 252
third_candidate_third <- 830342
third_candidate_second <- third_candidate_third - amplicon_length
third_candidate_start <- third_candidate_second - primer_length
third_candidate_end <- third_candidate_third + primer_length
third_candidate_region <- subseq(third_candidate_chr, third_candidate_start, third_candidate_end)
third_candidate_region
third_candidate_5p <- subseq(third_candidate_chr, third_candidate_start, third_candidate_second)
as.character(third_candidate_5p)
third_candidate_3p <- spgs::reverseComplement(subseq(third_candidate_chr, third_candidate_third, third_candidate_end))
third_candidate_3p
## You are a garbage polypyrimidine tract.
## Which is actually interesting if the mutations mess it up.
## * LpaL13-33 1331507 843
fourth_candidate_chr <- lp_genome[["LpaL13_33"]]
primer_length <- 22
amplicon_length <- 843
fourth_candidate_third <- 1331507
fourth_candidate_second <- fourth_candidate_third - amplicon_length
fourth_candidate_start <- fourth_candidate_second - primer_length
fourth_candidate_end <- fourth_candidate_third + primer_length
fourth_candidate_region <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_end)
fourth_candidate_region
fourth_candidate_5p <- subseq(fourth_candidate_chr, fourth_candidate_start, fourth_candidate_second)
as.character(fourth_candidate_5p)
fourth_candidate_3p <- spgs::reverseComplement(subseq(fourth_candidate_chr, fourth_candidate_third, fourth_candidate_end))
fourth_candidate_3p
I made a fun little function which should find regions which have lots of variants associated with a given experimental factor.
pheno <- subset_se(lp_se, subset = "condition=='z2.2'|condition=='z2.3'")
pheno <- subset_se(pheno, subset = "!is.na(colData(pheno)[['bcftable']])")
pheno_snps <- count_snps(pheno, annot_column = "freebayessummary", snp_column = "PAIRED")
## Using the snp column: PAIRED from the sample annotations.
I cannot run the following block in the container unless/until I copy the gff into it…
fun_stuff <- snp_density_primers(
pheno_snps,
bsgenome = "BSGenome.Leishmania.panamensis.MHOMCOL81L13.v53",
gff = "reference/TriTrypDB-53_LpanamensisMHOMCOL81L13.gff")
drop_scaffolds <- grepl(x = rownames(fun_stuff$favorites), pattern = "SCAF")
favorite_primer_regions <- fun_stuff[["favorites"]][!drop_scaffolds, ]
favorite_primer_regions[["bin"]] <- rownames(favorite_primer_regions)
favorite_primer_regions <- favorite_primer_regions %>%
relocate(bin)
Here is my note from our meeting:
Cross reference primers to DE genes of 2.2/2.3 and/or resistant/susceptible, and add a column to the primer spreadsheet with the DE genes (in retrospect, I am guessing this actually means to put the logFC in as a column).
One nice thing, I did a semantic removal on the lp_se, so the set of logFC/pvalues should not have any of the offending types; thus I should be able to automagically get rid of them in the merge.
This block needs to go after differential expression analyses.
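The merge described in the note above relies on the default behaviour of base R `merge()`, which is an inner join: rows whose gene has no entry in the DE table are dropped. A minimal sketch on invented data follows; the bin names, gene IDs and values are all hypothetical.

```r
## Hypothetical primer bins and DE results; all IDs and values are invented.
primers <- data.frame(
  bin = c("bin1", "bin2", "bin3"),
  closest_gene_before_id = c("geneA", "geneB", "geneX"))
de <- data.frame(
  z23_logfc = c(1.5, -0.7),
  z23_adjp = c(0.01, 0.20),
  row.names = c("geneA", "geneB"))
## Join a column of the primer table against the row names of the DE table.
## bin3 is dropped because geneX has no row in the DE table.
xref <- merge(primers, de, by.x = "closest_gene_before_id", by.y = "row.names")
xref
```

If the unmatched bins should be kept instead, `all.x = TRUE` would retain them with NA in the logFC columns.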
logfc <- zy_table_sva[["data"]][["z23_vs_z22"]]
logfc_columns <- logfc[, c("deseq_logfc", "deseq_adjp")]
colnames(logfc_columns) <- c("z23_logfc", "z23_adjp")
new_table <- merge(favorite_primer_regions, logfc_columns,
by.x = "closest_gene_before_id", by.y = "row.names")
sus <- sus_table_sva[["data"]][["sensitive_vs_resistant"]]
sus_columns <- sus[, c("deseq_logfc", "deseq_adjp")]
colnames(sus_columns) <- c("sus_logfc", "sus_adjp")
new_table <- merge(new_table, sus_columns,
by.x = "closest_gene_before_id", by.y = "row.names") %>%
relocate(bin)
written <- write_xlsx(data = new_table,
excel = "excel/favorite_primers_xref_zy_sus.xlsx")
We can cross reference the variants against the zymodeme status and plot a heatmap of the results and hopefully see how they separate.
snp_genes <- sm(snps_vs_genes(lp_se, new_sets, chr_column = exp_chr_col,
start_column = exp_start_col, end_column = exp_end_col))
clinical_colors_v2 <- list(
"z22" = "#0000cc",
"z23" = "#cc0000")
new_zymo_norm <- normalize_se(pruned_snps, norm = "quant") %>%
set_conditions(fact = "zymodemecategorical", colors = clinical_colors_v2)
## The numbers of samples by condition are:
##
## z22 z23
## 42 40
## A heatmap of pairwise sample distances ranging from:
## 405123.161006473 to 2192192.81018837.
pdf_zymo_heat <- pp(file = "images/onlyz22_z23_snp_heatmap.pdf", width = 12, height = 12,
image = zymo_heat[["plot"]])
## Warning in pp(file = "images/onlyz22_z23_snp_heatmap.pdf", width = 12, height =
## 12, : There is no device to shut down.
svg_zymo_heat <- pp(file = "images/onlyz22_z23_snp_heatmap.svg", width = 12, height = 12,
image = zymo_heat[["plot"]])
## Warning in pp(file = "images/onlyz22_z23_snp_heatmap.svg", width = 12, height =
## 12, : There is no device to shut down.
Now let us try to make a heatmap which includes some of the annotation data.
des <- colData(both_norm)
undef_idx <- is.na(des[["pathogenstrain"]])
des[undef_idx, "pathogenstrain"] <- "unknown"
##hmcols <- colorRampPalette(c("yellow","black","darkblue"))(256)
correlations <- hpgl_cor(assay(both_norm))
na_idx <- is.na(correlations)
correlations[na_idx] <- 0
## Make an initial heatmap via plot_disheat, which may get used as the figure:
initial_snps <- set_conditions(both_norm, fact = "zymodemereference", colors = color_choices[["strain"]])
## The numbers of samples by condition are:
##
## z2.1 z2.2 z2.3 z2.4
## 7 42 40 2
## Warning in set_se_colors(new_se, colors = colors): Colors for the following
## categories are not being used: z2.0, z3.0, z3.2, z1.0, z1.5, b2904, unknown.
## A heatmap of pairwise sample distances ranging from:
## 759.771680323435 to 3604.85253361071.
pdf_initial_disheat <- pp(file = "figures/initial_snp_heatmap.pdf", width = 20, height = 20,
image = initial_disheat)
## Error in `xy.coords()`:
## ! 'x' is a list, but does not have components 'x' and 'y'
svg_initial_disheat <- pp(file = "figures/initial_snp_heatmap.svg", width = 20, height = 20,
image = initial_disheat)
## Error in `xy.coords()`:
## ! 'x' is a list, but does not have components 'x' and 'y'
zymo_missing_idx <- is.na(des[["zymodemecategorical"]])
des[["zymodemecategorical"]] <- as.character(des[["zymodemecategorical"]])
des[["clinicalcategorical"]] <- as.character(des[["clinicalcategorical"]])
des[zymo_missing_idx, "zymodemecategorical"] <- "unknown"
mydendro <- list(
"clustfun" = hclust,
"lwd" = 2.0)
col_data <- as.data.frame(des[, c("zymodemecategorical")])
unknown_clinical <- is.na(des[["clinicalcategorical"]])
colnames(col_data) <- c("zymodeme")
row_data <- as.data.frame(des[, c("sus_category_current", "clinicalcategorical")])
colnames(row_data) <- c("susceptibility", "outcome")
row_data[unknown_clinical, "outcome"] <- "undefined"
myannot <- list(
"Col" = list("data" = col_data),
"Row" = list("data" = row_data))
myclust <- list("cuth" = 1.0,
"col" = BrewerClusterCol)
mylabs <- list(
"Row" = list("nrow" = 4),
"Col" = list("nrow" = 4))
hmcols <- colorRampPalette(c("darkblue", "beige"))(240)
zymo_annot_heat <- annHeatmap2(
correlations,
dendrogram = mydendro,
annotation = myannot,
cluster = myclust,
labels = mylabs,
## The following controls if the picture is symmetric
scale = "none",
col = hmcols)
## Warning in breakColors(breaks, col): more colors than classes: ignoring 35 last
## colors
plot(zymo_annot_heat)
pdf_dendro_heat <- pp(file = "images/dendro_heatmap.pdf", height = 20, width = 20,
image = zymo_annot_heat)
## Warning in pp(file = "images/dendro_heatmap.pdf", height = 20, width = 20, :
## There is no device to shut down.
svg_dendro_heat <- pp(file = "images/dendro_heatmap.svg", height = 20, width = 20,
image = zymo_annot_heat)
## Warning in pp(file = "images/dendro_heatmap.svg", height = 20, width = 20, :
## There is no device to shut down.
Print the larger heatmap so that all the labels appear. Keep in mind that as we get more samples, this image needs to continue getting bigger.
I cannot run the following block until/unless I install cmplot in the container. Oh, I did! Let us run it and see what happens.
##
## z2.2 z2.3
## 29 27
idx_tbl <- assay(pheno_snps) > 5
new_tbl <- data.frame(row.names = rownames(assay(pheno_snps)))
for (n in names(xref_prop)) {
samples <- colData(pheno_snps)[["condition"]] == n
new_tbl[[n]] <- 0
prop_col <- rowSums(idx_tbl[, samples]) / xref_prop[n]
new_tbl[n] <- prop_col
}
keepers <- grepl(x = rownames(new_tbl), pattern = "LpaL13")
new_tbl <- new_tbl[keepers, ]
new_tbl[["strong22"]] <- 1.001 - new_tbl[["z2.2"]]
new_tbl[["strong23"]] <- 1.001 - new_tbl[["z2.3"]]
s22_na <- new_tbl[["strong22"]] > 1
new_tbl[s22_na, "strong22"] <- 1
s23_na <- new_tbl[["strong23"]] > 1
new_tbl[s23_na, "strong23"] <- 1
new_tbl[["SNP"]] <- rownames(new_tbl)
new_tbl[["Chromosome"]] <- gsub(x = new_tbl[["SNP"]], pattern = "chr_(.*)_pos_.*", replacement = "\\1")
new_tbl[["Position"]] <- gsub(x = new_tbl[["SNP"]], pattern = ".*_pos_(\\d+)_.*", replacement = "\\1")
new_tbl <- new_tbl[, c("SNP", "Chromosome", "Position", "strong22", "strong23")]
simplify <- new_tbl
simplify[["strong22"]] <- NULL
CMplot(new_tbl, bin.size = 10000, threshold = c(0.01, 0.05), plot.type = "d",
file.name = "variant_density_10k")
## Marker density plotting.
## Plots are stored in: /lab/singularity/clinical_strains_analyses/202604081742_outputs
CMplot(new_tbl, bin.size = 1000, threshold = c(0.01, 0.05), plot.type = "d",
file.name = "variant_density_1k")
## Marker density plotting.
## Plots are stored in: /lab/singularity/clinical_strains_analyses/202604081742_outputs
CMplot(new_tbl, bin.size = 100000, threshold = c(0.01, 0.05), plot.type = "d",
file.name = "variant_density_100k")
## Marker density plotting.
## Plots are stored in: /lab/singularity/clinical_strains_analyses/202604081742_outputs
CMplot(new_tbl, plot.type = "m", multracks = TRUE, threshold = c(0.01, 0.05),
threshold.lwd = c(1,1), threshold.col = c("black","grey"),
amplify = TRUE, bin.size = 1000,
chr.den.col = c("darkgreen", "yellow", "red"),
signal.col = c("red", "green", "blue"),
signal.cex = 1, file = "jpg", dpi = 300, file.output = TRUE, verbose = TRUE)
## Multi-tracks Manhattan plotting strong22.
## Multi-tracks Manhattan plotting strong23.
## Plots are stored in: /lab/singularity/clinical_strains_analyses/202604081742_outputs
I have been a bit frustrated with the clunkiness of CMplot, so I did some reading and found autoplot. It makes use of GRanges/IRanges to plot arbitrary data and as such has the potential to be significantly more generally useful than CMplot. I think I will be able to use it to view many different data types. In this instance I want to plot the density of variants associated with various conditions in the data (z2.3/z2.2, cure/fail, whatever). In addition, it might be nice to have the ORFs displayed in some fashion (space permitting).
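The per-bin variant density can be sketched compactly with GenomicRanges. The chromosome names, lengths and variant positions below are invented for illustration; `countOverlaps()` is used here as a shortcut which returns one count per bin, including zeros for empty bins.

```r
library(GenomicRanges)

## Hypothetical chromosome lengths and variant positions, invented for this sketch.
chr_lengths <- c(chrA = 2500, chrB = 1200)
variants <- GRanges(
  seqnames = c("chrA", "chrA", "chrA", "chrB"),
  ranges = IRanges(start = c(10, 20, 2100, 1100), width = 1))
## Tile each chromosome into 1 kb bins; the last bin per chromosome is truncated.
bins <- tileGenome(chr_lengths, tilewidth = 1000, cut.last.tile.in.chrom = TRUE)
## Count the variants which fall in each bin and store them as a metadata column.
mcols(bins)[["num"]] <- countOverlaps(bins, variants)
bins
```

The resulting GRanges carries the count in its `num` column, which is the same shape of object that a karyogram layout colored by `num` expects.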
I am pretty sure I made a function which makes this less clunky than what follows.
lp_entry <- EuPathDB::get_eupath_entry(species = "MHOM/COL", metadata = eu_meta)
## These lines cannot run in the container because it cannot write
##txdb_pkgname <- make_eupath_txdb(lp_entry)
##grange_name <- make_eupath_granges(lp_entry)
grange_name <- gsub(x = lp_entry[["GrangesPkg"]], pattern = "\\.rda$", replacement = "")
grange_filename <- file.path("build", lp_entry[["GrangesPkg"]])
if (file.exists(grange_filename)) {
load(grange_filename)
} else {
created <- dir.create("build/gff", recursive = TRUE)
grange_build <- make_eupath_granges(lp_entry)
grange_filename <- grange_build[["rda"]]
load(grange_filename)
}
grange_data <- get0(grange_name)
scaffold_idx <- grepl(x = as.character(seqnames(grange_data)), pattern = "SCAF")
no_scaffolds <- grange_data[!scaffold_idx]
scaffold_idx <- grepl(x = as.character(names(seqinfo(grange_data))), pattern = "SCAF")
chr_names <- names(seqinfo(grange_data))[!scaffold_idx]
no_scaffolds <- seqinfo(grange_data)[chr_names]
auto_tbl <- new_tbl
auto_tbl[["position2"]] <- auto_tbl[["Position"]]
auto_tbl[["SNP"]] <- NULL
rownames(auto_tbl) <- NULL
tilesize <- 1000
bins_1k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 1000,
cut.last.tile.in.chrom = TRUE)
bins_5k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 5000,
cut.last.tile.in.chrom = TRUE)
bins_10k <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 10000,
cut.last.tile.in.chrom = TRUE)
bins_1nt <- GenomicRanges::tileGenome(seqlengths(no_scaffolds), tilewidth = 1,
cut.last.tile.in.chrom = TRUE)
auto_tbl[["strand"]] <- "+"
## I want to calculate the number of intersecting positions between my auto_tbl and the 1k bins.
start <- auto_tbl[, c("Chromosome", "Position", "position2", "strand", "strong23")]
colnames(start) <- c("chr", "start", "end", "strand", "z23")
start[["chr"]] <- gsub(x = start[["chr"]], pattern = "-", replacement = "_")
var_grange <- makeGRangesFromDataFrame(start, seqinfo = no_scaffolds, keep.extra.columns = TRUE)
vars_per_bin <- findOverlaps(bins_1k, var_grange)
vars_per_bin_numeric <- as.data.frame(bins_1k)
vars_per_bin_numeric[["bin"]] <- rownames(vars_per_bin_numeric)
count_per_bin <- as.data.frame(vars_per_bin) %>%
group_by(queryHits) %>%
dplyr::tally()
colnames(count_per_bin) <- c("bin", "num")
vars_per_bin_numeric <- merge(vars_per_bin_numeric, count_per_bin, by = "bin", all.x = TRUE)
missing_idx <- is.na(vars_per_bin_numeric[["num"]])
vars_per_bin_numeric[missing_idx, "num"] <- 0
vars_per_bin <- vars_per_bin_numeric[, c("seqnames", "start", "end", "width", "strand", "num")]
vpb_grange <- makeGRangesFromDataFrame(vars_per_bin, seqinfo = no_scaffolds, keep.extra.columns = TRUE)
kary <- autoplot(vpb_grange, layout = "karyogram", aes(color = num, fill = num)) +
scale_color_gradient(low = "blue", high = "red") +
scale_fill_gradient(low = "blue", high = "red")
## theme_bw(base_size = 10) +
kary
pdf_kary <- pp(file = "images/karyogram_by_variants.pdf", height = 24, width = 18, image = kary)
svg_kary <- pp(file = "images/karyogram_by_variants.svg", height = 24, width = 18, image = kary)
var_kary <- ggbio() +
layout_karyogram(vpb_grange, aes(color = num, fill = num)) +
scale_fill_gradient(low = "blue", high = "white") +
scale_color_gradient(low = "blue", high = "white") +
theme_bw(base_size = 10)
var_kary
This tool (MatrixEQTL) looks a little opaque, but provides sample data with things that make sense to me and should be pretty easy to recapitulate in our data.
## For this, let us use the 'new_snps' data structure.
## Caveat here: these need to be coerced to numbers.
my_covariates <- colData(new_snps)[, c("zymodemecategorical", "clinicalcategorical")]
for (col in colnames(my_covariates)) {
my_covariates[[col]] <- as.numeric(as.factor(my_covariates[[col]]))
}
my_covariates <- t(my_covariates)
my_geneloc <- rowData(lp_se)[, c("gid", "chromosome", "start", "end")]
colnames(my_geneloc) <- c("geneid", "chr", "left", "right")
my_ge <- assay(normalize_se(lp_se, transform = "log2", filter = TRUE, convert = "cpm"))
used_samples <- tolower(colnames(my_ge)) %in% colnames(assay(new_snps))
my_ge <- my_ge[, used_samples]
## Use row.names (not rownames) so the variant IDs become the row names parsed below.
my_snpsloc <- data.frame(row.names = rownames(assay(new_snps)))
## Oh, caveat here: Because of the way I stored the data,
## I could have duplicate rows which presumably will make matrixEQTL sad
## The IDs look like chr_<chromosome>_pos_<position>_ref_<ref>_alt_<alt>.
my_snpsloc[["chr"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\1",
x = rownames(my_snpsloc))
my_snpsloc[["pos"]] <- gsub(pattern = "^chr_(.+)_pos_(.+)_ref_.*$", replacement = "\\2",
x = rownames(my_snpsloc))
test <- duplicated(my_snpsloc)
## Each duplicated row would be another variant at that position;
## so in theory we would do a rle to number them I am guessing
## However, I do not have different variants so I think I can ignore this for the moment
## but will need to make my matrix either 0 or 1.
if (sum(test) > 0) {
message("There are: ", sum(test), " duplicated entries.")
keep_idx <- ! test
my_snpsloc <- my_snpsloc[keep_idx, ]
}
my_snps <- assay(new_snps)
one_idx <- my_snps > 0
my_snps[one_idx] <- 1
## Ok, at this point I think I have all the pieces which this method wants...
## Oh, no I guess not; it actually wants the data as a set of filenames...
library(MatrixEQTL)
write.table(my_snps, "eqtl/snps.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snps, "eqtl/snps.tsv", )
write.table(my_snpsloc, "eqtl/snpsloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(my_snpsloc, "eqtl/snpsloc.tsv")
write.table(as.data.frame(my_ge), "eqtl/ge.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_ge), "eqtl/ge.tsv")
write.table(as.data.frame(my_geneloc), "eqtl/geneloc.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_geneloc), "eqtl/geneloc.tsv")
write.table(as.data.frame(my_covariates), "eqtl/covariates.tsv", na = "NA", col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)
## readr::write_tsv(as.data.frame(my_covariates), "eqtl/covariates.tsv")
useModel = modelLINEAR # modelANOVA, modelLINEAR, or modelLINEAR_CROSS
# Genotype file name
SNP_file_name = "eqtl/snps.tsv"
snps_location_file_name = "eqtl/snpsloc.tsv"
expression_file_name = "eqtl/ge.tsv"
gene_location_file_name = "eqtl/geneloc.tsv"
covariates_file_name = "eqtl/covariates.tsv"
# Output file name
output_file_name_cis = tempfile()
output_file_name_tra = tempfile()
# Only associations significant at this level will be saved
pvOutputThreshold_cis = 0.1
pvOutputThreshold_tra = 0.1
# Error covariance matrix
# Set to numeric() for identity.
errorCovariance = numeric()
# errorCovariance = read.table("Sample_Data/errorCovariance.txt");
# Distance for local gene-SNP pairs
cisDist = 1e6
## Load genotype data
snps = SlicedData$new()
snps$fileDelimiter = "\t" # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1 # one row of column labels
snps$fileSkipColumns = 1 # one column of row labels
snps$fileSliceSize = 2000 # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
## Load gene expression data
gene = SlicedData$new()
gene$fileDelimiter = "\t" # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1 # one row of column labels
gene$fileSkipColumns = 1 # one column of row labels
gene$fileSliceSize = 2000 # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
## Load covariates
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t" # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1 # one row of column labels
cvrt$fileSkipColumns = 1 # one column of row labels
if(length(covariates_file_name) > 0) {
cvrt$LoadFile(covariates_file_name)
}
## Run the analysis
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE)
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE)
me = Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name_tra,
pvOutputThreshold = pvOutputThreshold_tra,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
output_file_name.cis = output_file_name_cis,
pvOutputThreshold.cis = pvOutputThreshold_cis,
snpspos = snpspos,
genepos = genepos,
cisDist = cisDist,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
## Warning: Your system is mis-configured: '/etc/localtime' is not a symlink
## Warning: It is strongly recommended to set envionment variable TZ to
## 'America/New_York' (or equivalent)
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
locale: C
attached base packages: stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: foreach(v.1.5.2), edgeR(v.4.8.2), ruv(v.0.9.7.1), hpgltools(v.2026.03), Heatplus(v.3.18.0), glue(v.1.8.0), ggbio(v.1.58.0), ggplot2(v.4.0.2), GenomicRanges(v.1.62.1), Seqinfo(v.1.0.0), IRanges(v.2.44.0), S4Vectors(v.0.48.0), BiocGenerics(v.0.56.0), generics(v.0.1.4), dplyr(v.1.2.0) and CMplot(v.4.5.1)
loaded via a namespace (and not attached): fs(v.2.0.1), ProtGenerics(v.1.42.0), matrixStats(v.1.5.0), bitops(v.1.0-9), blockmodeling(v.1.1.8), doParallel(v.1.0.17), httr(v.1.4.8), RColorBrewer(v.1.1-3), numDeriv(v.2016.8-1.1), tools(v.4.5.0), backports(v.1.5.0), R6(v.2.6.1), lazyeval(v.0.2.2), mgcv(v.1.9-3), withr(v.3.0.2), gridExtra(v.2.3), preprocessCore(v.1.70.0), cli(v.3.6.5), Biobase(v.2.70.0), textshaping(v.1.0.5), labeling(v.0.4.3), EBSeq(v.2.8.0), sass(v.0.4.10), robustbase(v.0.99-7), mvtnorm(v.1.3-6), S7(v.0.2.1), genefilter(v.1.92.0), Rsamtools(v.2.26.0), systemfonts(v.1.3.2), yulab.utils(v.0.2.4), foreign(v.0.8-90), DOSE(v.4.4.0), svglite(v.2.2.2), R.utils(v.2.13.0), dichromat(v.2.0-0.1), BSgenome(v.1.78.0), limma(v.3.66.0), rstudioapi(v.0.18.0), RSQLite(v.2.4.6), BiocIO(v.1.20.0), gtools(v.3.9.5), crosstalk(v.1.2.2), zip(v.2.3.3), GO.db(v.3.22.0), Matrix(v.1.7-3), abind(v.1.4-8), R.methodsS3(v.1.8.2), lifecycle(v.1.0.5), yaml(v.2.3.12), SummarizedExperiment(v.1.40.0), gplots(v.3.3.0), qvalue(v.2.42.0), SparseArray(v.1.10.10), Rtsne(v.0.17), grid(v.4.5.0), blob(v.1.3.0), promises(v.1.5.0), crayon(v.1.5.3), lattice(v.0.22-7), cowplot(v.1.2.0), GenomicFeatures(v.1.62.0), cigarillo(v.1.0.0), annotate(v.1.88.0), KEGGREST(v.1.50.0), pillar(v.1.11.1), knitr(v.1.51), varhandle(v.2.0.6), fgsea(v.1.36.2), rjson(v.0.2.23), boot(v.1.3-31), corpcor(v.1.6.10), codetools(v.0.2-20), fastmatch(v.1.1-8), Vennerable(v.3.1.0.9000), data.table(v.1.18.2.1), vctrs(v.0.7.2), png(v.0.1-9), Rdpack(v.2.6.6), testthat(v.3.3.2), gtable(v.0.3.6), cachem(v.1.1.0), openxlsx(v.4.2.8.1), xfun(v.0.57), rbibutils(v.2.4.1), S4Arrays(v.1.10.1), mime(v.0.13), RcppEigen(v.0.3.4.0.2), reformulas(v.0.4.4), survival(v.3.8-3), NOISeq(v.2.54.0), iterators(v.1.0.14), statmod(v.1.5.1), nlme(v.3.1-168), pbkrtest(v.0.5.5), bit64(v.4.6.0-1), EnvStats(v.3.1.0), rprojroot(v.2.1.1), GenomeInfoDb(v.1.46.2), bslib(v.0.10.0), KernSmooth(v.2.23-26), otel(v.0.2.0), rpart(v.4.1.24), colorspace(v.2.1-2), 
DBI(v.1.3.0), Hmisc(v.5.2-5), nnet(v.7.3-20), DESeq2(v.1.50.2), tidyselect(v.1.2.1), bit(v.4.6.0), compiler(v.4.5.0), curl(v.7.0.0), graph(v.1.88.1), htmlTable(v.2.4.3), desc(v.1.4.3), DelayedArray(v.0.36.1), plotly(v.4.12.0), rtracklayer(v.1.70.1), checkmate(v.2.3.4), scales(v.1.4.0), caTools(v.1.18.3), DEoptimR(v.1.1-4), remaCor(v.0.0.20), RBGL(v.1.86.0), rappdirs(v.0.3.4), stringr(v.1.6.0), digest(v.0.6.39), minqa(v.1.2.8), variancePartition(v.1.40.2), rmarkdown(v.2.31), aod(v.1.3.3), XVector(v.0.50.0), RhpcBLASctl(v.0.23-42), htmltools(v.0.5.9), pkgconfig(v.2.0.3), base64enc(v.0.1-6), lme4(v.2.0-1), MatrixGenerics(v.1.22.0), fastmap(v.1.2.0), ensembldb(v.2.34.0), rlang(v.1.1.7), htmlwidgets(v.1.6.4), UCSC.utils(v.1.6.1), shiny(v.1.13.0), farver(v.2.1.2), jquerylib(v.0.1.4), jsonlite(v.2.0.0), BiocParallel(v.1.44.0), GOSemSim(v.2.36.0), R.oo(v.1.27.1), VariantAnnotation(v.1.56.0), RCurl(v.1.98-1.18), magrittr(v.2.0.4), Formula(v.1.2-5), Rcpp(v.1.1.1), stringi(v.1.8.7), brio(v.1.1.5), MASS(v.7.3-65), plyr(v.1.8.9), parallel(v.4.5.0), ggrepel(v.0.9.8), doSNOW(v.1.0.20), Biostrings(v.2.78.0), splines(v.4.5.0), pander(v.0.6.6), locfit(v.1.5-9.12), fastcluster(v.1.3.0), pkgload(v.1.5.1), reshape2(v.1.4.5), restez(v.2.1.5), XML(v.3.99-0.23), evaluate(v.1.0.5), biovizBase(v.1.58.0), BiocManager(v.1.30.27), nloptr(v.2.2.1), httpuv(v.1.6.17), tidyr(v.1.3.2), purrr(v.1.2.1), broom(v.1.0.12), xtable(v.1.8-8), restfulr(v.0.0.16), AnnotationFilter(v.1.34.0), fANCOVA(v.0.6-1), later(v.1.4.8), viridisLite(v.0.4.3), snow(v.0.4-4), OrganismDbi(v.1.52.0), tibble(v.3.3.1), lmerTest(v.3.2-1), memoise(v.2.0.1), AnnotationDbi(v.1.72.0), GenomicAlignments(v.1.46.0), cluster(v.2.1.8.1) and sva(v.3.58.0)
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset 03a4e43defdc53e7038116087cb006b05404d424
## This is hpgltools commit: Tue Apr 7 15:44:04 2026 -0400: 03a4e43defdc53e7038116087cb006b05404d424