Changelog
- 202406: Adding a neat experiment from the Scott lab which I had
sitting in my tree but somehow forgot about.
- 202308: Moved preprocessing material from the datastructures file to
here.
Introduction
This document outlines in some detail the tasks performed in order to
process the ‘raw’ data.
Sequence
preprocessing
All of the preprocessing tasks were performed with CYOA ((ElsayedlabCYOA?)) and included
the following:
- All samples were trimmed with trimomatic
- using options to remove adapters using a supplemented copy of the
default adapter database (ILLUMINACLIP:2:30:10:2:keepBothReads), reads
with fewer than 40 nucleotides remaining were discarded (MINLEN:40), and
a mean quality >= 25 over 4 nucleotide window was used for trimming
(SLIDINGWINDOW:4:25). When paired end reads were available, only the
properly paired were used for the tasks that follow. The generated
scripts for each sample which performed this trimming have the prefix
‘05trim’.
- Reads were passed to fastqc (Andrews
(n.d.)) using the default parameters to query read quality,
remaining adapter content, and as an initial survey for contaminants.
The generated scripts which performed this have the prefix
‘06fastqc’
- Every sample was passed to kraken2 ((luMetagenomeAnalysisUsing2022?))
against its comprehensive ‘standard’ and ‘virus’ databases, updated
2021/06 in order to query for putative contaminants and potentially
interesting viral sequences. The scripts which performed this have the
prefix ‘11kraken’.
- Samples were passed to hisat2 (Kim et al.
(2019)) against the hg38 release 100 human genome ((HomoSapiensEnsembl?)) with the
default options as well as the Leishmania panamensis MHOM/COL release 46
- in serial. The scripts which performed these operations have the
prefixes ‘15hisat2_hg38_100’ and ‘15hisat2_lpanamensis_v46’
respectively.
- The resulting alignments were converted to the binary, sorted,
compressed, and indexed format via samtools ((bonfieldHTSlibLibraryReading2021?)).
Additional filters were performed to extract only the non-variant
alignments . The scripts which performed this have the prefix
‘19s2b_{genome}’.
- These were cross referenced against the relevant feature databases
in order to generate count tables by gene via htseq (Putri et al. (2022)). These scripts have the
prefix ‘21hts’.
- Each sample was passed to salmon (Patro et
al. (2017)) against the transcript databases for hg38 release
100, L. panamensis release 46, and a concatenation of the two. These
scripts have the suffix ‘30sal_{genome}’.
- A regular expression search was performed for the observed set of
spliced leader sequences. The sequence ‘AGTTTCTGTACTTTATTGG’ rather than
the entire SL sequence was searched to avoid potential SL variants. The
relevant scripts have the prefix ‘50slsearch’. I later set up a
fastp-based ((chenFastpUltrafastAllinone2018?))
search, but I do not think it is included.
- When sufficient coverage was observed against the parasite,
freebayes (Garrison and Marth (2012)) was
used to quantify variants vs. the reference genome and infer the most
likely parasite subtype.
(if anyone ever reads this (no one will ever read this), can you tell
me how in the flying hell the fancy single quotes appeared in my text
document? I am writing this in emacs using markdown-mode and when I am
typing I see normal person quotes, but when I save it…. oooohhh I added
a text-mode hook and forgot it, I think?)
Given the fact that these tasks were performed over the course of 4
years, during which time we changed sequencing platforms and
computational infrastructure, the actual prefixes/suffixes of the
various scripts and outputs are not as consistent as I would like.
Trimming raw
reads
With the above caveat in mind, I am copy/pasting the little shell
fragments I used when trimming the samples:
## trim.sh, this assumes one will manually cd into the new sample's directory and run ../tmrc_trim.sh
## In addition, it assumes that the new data files will reside in the directory 'unprocessed/'.
input=$(/bin/ls -d unprocessed | tr '\n' ':' | sed 's/:$//g')
echo "Running trimomatic on ${input}"
cmd="cyoa --method trim --input ${input} 2>>cyoa_trim.out 1>&2 &"
echo "${cmd}" > cyoa_trim.out
${cmd}
The following is a script which is generated by the above,
arbitrarily chosen from sample TMRC30140. If anyone reads it carefully,
one might note that this sample is from before I started depositing new
raw reads into the ‘unprocessed’ directory, as a result the input reads
were in the cwd.
#!/usr/bin/env bash
#SBATCH --export=ALL
#SBATCH --mail-type=NONE
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=01trim_TMRC30140_S3_R1_001
#SBATCH --mem=40G
#SBATCH --cpus-per-task=3
#SBATCH --output=outputs/logs/trim_TMRC30140_S3_R1_001.sbatchout
echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140/scripts/01trim_TMRC30140_S3_R1_001.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt
module add trimomatic
## This call to trimomatic removes illumina and epicentre adapters from TMRC30140_S3_R1_001.fastq.gz:TMRC30140_S3_R2_001.fastq.gz.
## It also performs a sliding window removal of anything with quality <25;
## cutadapt provides an alternative to this tool.
## The original sequence data is recompressed and saved in the sequences/ directory.
mkdir -p outputs/01trimomatic
## Note that trimomatic prints all output and errors to STDERR, so send both to output
trimmomatic PE \
-threads 1 \
-phred33 \
TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
ILLUMINACLIP:/cbcb/sw/RedHat-7-x86_64/common/local/perl/5.34/lib/site_perl/5.34.0/auto/share/dist/Bio-Adventure/genome/adapters.fa:2:20:10:2:keepBothReads \
SLIDINGWINDOW:4:20 MINLEN:40 \
1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/TMRC30140_S3_R1_001-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
trimmomatic PE \
-threads 1 \
-phred33 \
TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
SLIDINGWINDOW:4:25 MINLEN:50\
1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
fi
sleep 10
mv TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed.fastq
mv TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed.fastq
## Recompress the unpaired reads, this should not take long.
xz -9e -f TMRC30140_S3_R1_001-trimmed_unpaired.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed_unpaired.fastq
## Recompress the paired reads.
xz -9e -f TMRC30140_S3_R1_001-trimmed.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed.fastq
ln -sf TMRC30140_S3_R1_001-trimmed.fastq.xz r1_trimmed.fastq.xz
ln -sf TMRC30140_S3_R2_001-trimmed.fastq.xz r2_trimmed.fastq.xz
## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 01trim_TMRC30140_S3_R1_001.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt
walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt
Mapping the trimmed
reads
There is a similar shell fragment used to map these trimmed reads
against the hg38_100 genome/transcriptome along with the panamensis
genome:
#!/usr/bin/env bash
input=$(/bin/ls *_trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
cyoa --task map --method hisat --species hg38_100 --gff_type gene --gff_tag gene_id \
--input ${input} 2>tmrc3_map.out 1>&2 &
cyoa --task map --method hisat --species lpanamensis_v36 --gff_type gene --gff_tag ID \
--input ${input} 2>>tmrc3_map.out 1>&2 &
cyoa --task map --method salmon --species hg38_100 \
--input ${input} 2>tmrc3_salmon.out 1>&2 &
The following is the resulting mapping script taken arbitrarily from
sample TMRC30205. There are a few aspects of this script which changed
from the beginning to end of the project: the prefix # changed, I added
some logic to check if the requisite software (in this case
hisat2/samtools/htseq) is already in the path, and load it if not. I
also changed the htseq output filenames slightly over time to try to
make certain that they are consistent and that it is easy to identify
the important parameters just from the output filename.
#!/usr/bin/env bash
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=40hisat2_hg38_100_genome
#SBATCH --mem=48G
#SBATCH --cpus-per-task=4
#SBATCH --output=outputs/logs/hisat2_hg38_100_genome.sbatchout
echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/scripts/40hisat2_hg38_100_genome.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt
module add hisat2 samtools htseq
## This is a hisat2 alignment of -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz) against /cbcbhomes/abelew/libraries/genome/indexes/hg38_100
mkdir -p outputs/40hisat2_hg38_100
sleep 3
hisat2 -x /cbcbhomes/abelew/libraries/genome/indexes/hg38_100 \
-p 4 \
-q -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz) \
--phred33 \
--un-gz outputs/40hisat2_hg38_100/TMRC30205_unaldis_hg38_100_genome.fastq \
--al-gz outputs/40hisat2_hg38_100/TMRC30205_aldis_hg38_100_genome.fastq \
--un-conc-gz outputs/40hisat2_hg38_100/TMRC30205_unalcon_hg38_100_genome.fastq \
--al-conc-gz outputs/40hisat2_hg38_100/TMRC30205_alcon_hg38_100_genome.fastq \
-S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam \
2>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.err \
1>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.out
## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 40hisat2_hg38_100_genome.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt
walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt
From here on I will remove the suffix of the script. Here is the sam
to bam conversion script which was created by cyoa.
module add samtools bamtools
## Converting the text sam to a compressed, sorted, indexed bamfile.
## Also printing alignment statistics to outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats
## This job depended on: 133995
echo "Starting samtools"
if $(test ! -r outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam); then
echo "Could not find the samtools input file."
exit 1
fi
samtools view -u -t /cbcbhomes/abelew/libraries/genome/hg38_100.fasta \
-S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam \
2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.out && \
samtools sort -l 9 outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam \
2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.out && \
rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && \
rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam && \
mv outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam
bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats 1>&2
## The following will fail if this is single-ended.
samtools view -b -f 2 -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam
bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.stats 1>&2
And the counting script for htseq. I changed the output filename for
this step on multiple occasions over the course of sample
collection.
module add htseq
## Counting the number of hits in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam for each feature found in /cbcbhomes/abelew/libraries/genome/hg38_100.gff
## Is this stranded? no. The defaults of htseq are:
## --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union
htseq-count \
-q -f bam -s no --type gene --idattr gene_id \
outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam \
/cbcbhomes/abelew/libraries/genome/hg38_100.gff \
2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err \
1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count && \
xz -f -9e outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err.xz 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count.xz
The same cyoa invocation produced a series of scripts to aggressively
compress the (un)mapped reads from hisat2 and print out some summary
statistics of the runs.
Conversely, here is a salmon script which was used for sample
TMRC30124 (without the slurm prefix/suffix):
mkdir -p outputs/salmon_hg38_100 && sleep 3 && \
salmon quant -i /cbcbhomes/abelew/libraries/genome/indexes/hg38_100_salmon_index \
-l A --gcBias --validateMappings \
-r <(less r1_trimmed.fastq.gz) \
-o outputs/salmon_hg38_100 \
2>outputs/salmon_hg38_100/salmon.err 1>outputs/salmon_hg38_100/salmon.out
Preprocessing data from
Dr. Scott’s lab
One potentially obvious comparison I neglected to make was against a
newer data set from the Scott lab. (hmm zotero cannot seem to suck in
the author names!? Did science somehow sanitize the pdf or is zotero
just being a doofus? Nope it is definitely science, I asked zotero to
index a fresh copy I downloaded from pubmed and it worked fine, that
sucks, I kind of wish we would collectively refuse to engage with the
for-profit journals.) ((amorimVariableGeneExpression2019?))
In any event, the bioproject accession for this experiment is:
PRJNA525604 which is handy for me, because cyoa has a target named
‘sradownload’ which will, if it sees a PRJNA, do a little web-scraping,
download the metadata.csv from SRA run finder and then download the data
for every run into separate directories.
I have since started (but not finished) the implementation of a
–iterate target which will do what it says on the tin: iterate over
every directory in a tree and invoke the same pipeline on it; but until
I finish that, bash for loops for everyone!
Here is what I did:
mkdir -p preprocessing/PRJNA525604/
module add cyoa
cd preprocessing/PRJNA525604
start=$(pwd)
cyoa --method sra --input PRJNA525604
## wait a while, I got a coffee! and listened to the first song from Olarfur Arnaulds via Cercle:
## https://www.youtube.com/watch?v=bMCiAKNUpTY
## I then realized my script has an error where it leaves the job log
## for this, it created another preprocessing directory with the log.
## Run the processing pipeling
for i in $(/bin/ls -d SRR*); do
cd $i
cyoa --method prnaseq --species hg38_11 \
--input $(/bin/ls *.fastq.gz | tr '\n' ':')
cd $start
done
## The Process_RNASeq target (prna) handles all the stuff above.
The scott lab followed this work up with an examination of the
interaction between microbial buren of the skin and its connection to
increased unpredictability in clinical outcome as well as increased
inflammation during infection ((fariasamorimMultiomicProfilingCutaneous2023?)).
I completely missed this paper until quite recently (202404); in
addition, as I was finishing my quick first-pass read through, I found
the bioproject accession, and upon reading the metadata at SRA realized
they also followed that work up with another piece of intriguing work
((sacramentoCCR5PromotesMigration2024?)).
So, with that in mind:
ac=PRJNA885131
mkdir -p preprocessing/${ac}
module add cyoa/202404
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac
ac=PRJNA682985
mkdir -p preprocessing/${ac}
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac
for i in $(/bin/ls -d SRR*); do
cd $i
cyoa --method prnaseq --species hg38_11 \
--input $(/bin/ls *.fastq.gz | tr '\n' ':')
cd $start
done
## Note, I only asked it to do hg38 for now, but I could/should change it to
## hg38_11:lbraziliensis_2904_v46:saureus_8325
## once I verify those are the appropriate genomes -- and I should
## probably grab fresh copies, I haven't updated them in a while.
Oh yeah, my Staphylococcus assembly is stupidly old, I cannot even
tell where I downloaded it (oh, I checked the fasta file, it is current,
I just manually downloaded everything like a doofus and it is hard to
tell the version) I am just going to redownload it via cyoa so I have
the full accession on hand:
cd ~/libraries/genome/genbank
cyoa --method ncbidownload --input GCF_000013425
## --download downloads whatever ncbi accession I feed it, maybe I
## should change that to something like --ncbidown (I did!) and make one for
## ensembl and microbesonline and friends. Thing is, I like having an
## archive of the gbk file so I can later extract stuff, ensembl/etc
## do not give me that, so it is often easier to just download
## manually, but then again, if I add it to cyoa I do not have to
## scroll around and accidently download some rando chromosome instead
## of the full assembly; and I can add a year or release ID and have
## it go to the appropriate archive server. yeah, I like that.
## maybe I should change --input to --accession for stuff like this?
ls -ltr
ln -s GCF_000013425.gbff saureus_8325.gbk
cyoa --method gb2gff --input saureus_8325.gbk
ls -ltr
mv saureus_8325.fsa ../fasta/
mv saureus_8325_all.gff ../gff/saureus_8325.gff
mv saureus_8325.faa ../../amino_acid/
mv saureus_8325.ffn ../../CDS/fasta/
mv saureus_8325_cds.gff ../../CDS/fasta/saureus_8325.gff
mv saureus_8325_interCDS.gff ../../interCDS/gff/saureus_8325.gff
mv saureus_8325_rrna.gff ../../rRNA/gff/saureus_8325.gff
mv saureus_8325_rrna.fasta ../../rRNA/fasta/saureus_8325.gasta
cd ../fasta
cyoa --method indexhisat --species saureus_8325 \
--input saureus_8325.fsa \
--genome saureus_8325
My favorite part of the above: I can provide the csv downloaded from
SRA to my hpgltools::gather_preprocessing_metadata() function along with
the root directory of the downloaded data (preprocessing/$ac) and it
will suck up all the numbers for me automagically, so I do not have to
worry any more about collecting mapping statistics and potentially
messing them up. One caveat: the rRNA collector still sometimes grabs
genomic numbers because I forgot to add the genome variable somewhere in
the last version… Nonetheless, I think that is awesome (I even made it
smrt enough to handle a colon separated list of species). Oh, this
reminds me, I should check that my S. aureus and braziliensis assemblies
are the same as the ones they used.
Bibliography
Andrews, Simon. n.d. “FastQC: A Quality Control
Tool for High Throughput Sequence Data.”
Garrison, Erik, and Gabor Marth. 2012.
“Haplotype-Based Variant
Detection from Short-Read Sequencing.” arXiv.
https://doi.org/10.48550/arXiv.1207.3907.
Kim, Daehwan, Joseph M. Paggi, Chanhee Park, Christopher Bennett, and
Steven L. Salzberg. 2019.
“Graph-Based Genome Alignment and
Genotyping with Hisat2 and HISAT-genotype.” Nature
Biotechnology 37 (8): 907–15.
https://doi.org/10.1038/s41587-019-0201-4.
Patro, Rob, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl
Kingsford. 2017.
“Salmon Provides Fast and Bias-Aware
Quantification of Transcript Expression.” Nature Methods
14 (4): 417–19.
https://doi.org/10.1038/nmeth.4197.
Putri, Givanna H, Simon Anders, Paul Theodor Pyl, John E Pimanda, and
Fabio Zanini. 2022.
“Analysing High-Throughput Sequencing Data in
Python with HTSeq 2.0.”
Bioinformatics 38 (10): 2943–45.
https://doi.org/10.1093/bioinformatics/btac166.
---
title: "TMRC3 202308: Creating Data Structures"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
bibliography: atb.bib
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
library(hpgltools)
library(dplyr)
library(forcats)
library(glue)
library(EuPathDB)

tt <- try(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  fig.pos = "t", fig.align = "center", dpi = if (knitr::is_latex_output()) 72 else 300,
  out.width = "100%", dev = "svg",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(
  digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- Sys.getenv("VERSION")
previous_file <- ""
rundate <- format(Sys.Date(), format = "%Y%m%d")

##tmp <- try(sm(loadme(filename=gsub(pattern="\\.Rmd", replace="\\.rda\\.xz", x=previous_file))))
rmd_file <- glue("tmrc3_datasets_{ver}.Rmd")
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
data_structures <- c()
```

# Changelog

* 202406: Adding a neat experiment from the Scott lab which I had
  sitting in my tree but somehow forgot about.
* 202308: Moved preprocessing material from the datastructures file to here.

# Introduction

This document outlines in some detail the tasks performed in order to
process the 'raw' data.

# Sequence preprocessing

All of the preprocessing tasks were performed with CYOA (@ElsayedlabCYOA) and included
the following:

1. All samples were trimmed with trimomatic
   (@bolgerTrimmomaticFlexibleTrimmer2014a) using options to remove
   adapters using a supplemented copy of the default adapter database
   (ILLUMINACLIP:2:30:10:2:keepBothReads), reads with fewer than 40
   nucleotides remaining were discarded (MINLEN:40), and a mean quality
   >= 25 over 4 nucleotide window was used for trimming
   (SLIDINGWINDOW:4:25).  When paired end reads were available, only the
   properly paired were used for the tasks that follow.  The generated
   scripts for each sample which performed this trimming have the prefix
   ‘05trim’.
2. Reads were passed to fastqc (@andrewsFastQCQualityControl) using
   the default parameters to query read quality, remaining adapter
   content, and as an initial survey for contaminants.  The generated
   scripts which performed this have the prefix ‘06fastqc’
3. Every sample was passed to kraken2 (@luMetagenomeAnalysisUsing2022)
   against its comprehensive ‘standard’ and ‘virus’ databases, updated
   2021/06 in order to query for putative contaminants and potentially
   interesting viral sequences.  The scripts which performed this have
   the prefix ‘11kraken’.
4. Samples were passed to hisat2 (@kimGraphbasedGenomeAlignment2019)
   against the hg38 release 100 human genome (@HomoSapiensEnsembl)
   with the default options as well as the Leishmania panamensis
   MHOM/COL release 46
   (@shanmugasundramTriTrypDBIntegratedFunctional2023) in serial. The
   scripts which performed these operations have the prefixes
   ‘15hisat2_hg38_100’ and ‘15hisat2_lpanamensis_v46’ respectively.
5. The resulting alignments were converted to the binary, sorted,
   compressed, and indexed format via samtools
   (@bonfieldHTSlibLibraryReading2021).  Additional filters were
   performed to extract only the non-variant alignments .  The scripts
   which performed this have the prefix ‘19s2b_{genome}’.
6. These were cross referenced against the relevant feature databases in
   order to generate count tables by gene via htseq
   (@putriAnalysingHighthroughputSequencing2022).  These scripts have
   the prefix ‘21hts’.
7. Each sample was passed to salmon (@patroSalmonProvidesFast2017)
   against the transcript databases for hg38 release 100,
   L. panamensis release 46, and a concatenation of the two.  These
   scripts have the suffix ‘30sal_{genome}’.
8. A regular expression search was performed for the observed set of
   spliced leader sequences.  The sequence ‘AGTTTCTGTACTTTATTGG’ rather
   than the entire SL sequence was searched to avoid potential SL
   variants.  The relevant scripts have the prefix ‘50slsearch’.  I
   later set up a fastp-based (@chenFastpUltrafastAllinone2018)
   search, but I do not think it is included.
9. When sufficient coverage was observed against the parasite,
   freebayes (@garrisonHaplotypebasedVariantDetection2012) was used to
   quantify variants vs. the reference genome and infer the most
   likely parasite subtype.

(if anyone ever reads this (no one will ever read this), can you tell
me how in the flying hell the fancy single quotes appeared in my text
document?  I am writing this in emacs using markdown-mode and when I
am typing I see normal person quotes, but when I save it.... oooohhh I
added a text-mode hook and forgot it, I think?)

Given the fact that these tasks were performed over the course of 4
years, during which time we changed sequencing platforms and
computational infrastructure, the actual prefixes/suffixes of the
various scripts and outputs are not as consistent as I would like.

### Trimming raw reads

With the above caveat in mind, I am copy/pasting the little shell fragments I
used when trimming the samples:

```{bash preprocess_fragment, eval=FALSE}
## trim.sh, this assumes one will manually cd into the new sample's directory and run ../tmrc_trim.sh
## In addition, it assumes that the new data files will reside in the directory 'unprocessed/'.
input=$(/bin/ls -d unprocessed | tr '\n' ':' | sed 's/:$//g')
echo "Running trimomatic on ${input}"
cmd="cyoa --method trim --input ${input} 2>>cyoa_trim.out 1>&2 &"
echo "${cmd}" > cyoa_trim.out
${cmd}
```

The following is a script which is generated by the above, arbitrarily
chosen from sample TMRC30140.  If anyone reads it carefully, one might
note that this sample is from before I started depositing new raw
reads into the 'unprocessed' directory, as a result the input reads
were in the cwd.

```{bash trim_sequence_script, eval=FALSE}
#!/usr/bin/env bash
#SBATCH --export=ALL
#SBATCH --mail-type=NONE
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=01trim_TMRC30140_S3_R1_001
#SBATCH --mem=40G
#SBATCH --cpus-per-task=3
#SBATCH --output=outputs/logs/trim_TMRC30140_S3_R1_001.sbatchout

echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140/scripts/01trim_TMRC30140_S3_R1_001.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt

module add trimomatic

## This call to trimomatic removes illumina and epicentre adapters from TMRC30140_S3_R1_001.fastq.gz:TMRC30140_S3_R2_001.fastq.gz.
## It also performs a sliding window removal of anything with quality <25;
## cutadapt provides an alternative to this tool.
## The original sequence data is recompressed and saved in the sequences/ directory.
mkdir -p outputs/01trimomatic
## Note that trimomatic prints all output and errors to STDERR, so send both to output
trimmomatic PE \
  -threads 1 \
  -phred33 \
  TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
  TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
  TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
   ILLUMINACLIP:/cbcb/sw/RedHat-7-x86_64/common/local/perl/5.34/lib/site_perl/5.34.0/auto/share/dist/Bio-Adventure/genome/adapters.fa:2:20:10:2:keepBothReads \
  SLIDINGWINDOW:4:20 MINLEN:40 \
  1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/TMRC30140_S3_R1_001-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
  trimmomatic PE \
    -threads 1 \
    -phred33 \
    TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
    TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
    TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
     SLIDINGWINDOW:4:25 MINLEN:50\
    1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
fi
sleep 10
mv TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed.fastq
mv TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed.fastq

## Recompress the unpaired reads, this should not take long.
xz -9e -f TMRC30140_S3_R1_001-trimmed_unpaired.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed_unpaired.fastq
## Recompress the paired reads.
xz -9e -f TMRC30140_S3_R1_001-trimmed.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed.fastq
ln -sf TMRC30140_S3_R1_001-trimmed.fastq.xz r1_trimmed.fastq.xz
ln -sf TMRC30140_S3_R2_001-trimmed.fastq.xz r2_trimmed.fastq.xz


## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 01trim_TMRC30140_S3_R1_001.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt

walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt
```

### Mapping the trimmed reads

There is a similar shell fragment used to map these trimmed reads
against the hg38_100 genome/transcriptome along with the panamensis
genome:

```{bash map_fragment, eval=FALSE}
#!/usr/bin/env bash
input=$(/bin/ls *_trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
cyoa --task map --method hisat --species hg38_100 --gff_type gene --gff_tag gene_id  \
  --input ${input} 2>tmrc3_map.out 1>&2 &
cyoa --task map --method hisat --species lpanamensis_v36 --gff_type gene --gff_tag ID \
  --input ${input} 2>>tmrc3_map.out 1>&2 &
cyoa --task map --method salmon --species hg38_100 \
     --input ${input} 2>tmrc3_salmon.out 1>&2 &
```

The following is the resulting mapping script taken arbitrarily from
sample TMRC30205.  There are a few aspects of this script which
changed from the beginning to end of the project: the prefix #
changed, I added some logic to check if the requisite software (in
this case hisat2/samtools/htseq) is already in the path, and load it
if not.  I also changed the htseq output filenames slightly over time
to try to make certain that they are consistent and that it is easy to
identify the important parameters just from the output filename.

```{bash hisat_map_script, eval=FALSE}
#!/usr/bin/env bash
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=40hisat2_hg38_100_genome
#SBATCH --mem=48G
#SBATCH --cpus-per-task=4
#SBATCH --output=outputs/logs/hisat2_hg38_100_genome.sbatchout

echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/scripts/40hisat2_hg38_100_genome.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt

module add hisat2 samtools htseq

## This is a hisat2 alignment of  -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz)  against /cbcbhomes/abelew/libraries/genome/indexes/hg38_100

mkdir -p outputs/40hisat2_hg38_100
sleep 3
hisat2 -x /cbcbhomes/abelew/libraries/genome/indexes/hg38_100  \
  -p 4 \
  -q   -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz)  \
  --phred33 \
  --un-gz outputs/40hisat2_hg38_100/TMRC30205_unaldis_hg38_100_genome.fastq \
  --al-gz outputs/40hisat2_hg38_100/TMRC30205_aldis_hg38_100_genome.fastq \
  --un-conc-gz outputs/40hisat2_hg38_100/TMRC30205_unalcon_hg38_100_genome.fastq \
  --al-conc-gz outputs/40hisat2_hg38_100/TMRC30205_alcon_hg38_100_genome.fastq \
  -S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam \
  2>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.err \
  1>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.out


## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 40hisat2_hg38_100_genome.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt

walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt
```

From here on I will remove the suffix of the script.  Here is the sam
to bam conversion script which was created by cyoa.

```{bash sam2bam_script, eval=FALSE}
module add samtools bamtools

## Converting the text sam to a compressed, sorted, indexed bamfile.
## Also printing alignment statistics to outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats
## This job depended on: 133995

echo "Starting samtools"
if $(test ! -r outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam); then
    echo "Could not find the samtools input file."
    exit 1
fi
samtools view -u -t /cbcbhomes/abelew/libraries/genome/hg38_100.fasta \
  -S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam  \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.out && \

  samtools sort -l 9 outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.out && \
  rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && \
  rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam && \
  mv outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam &&  samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam

bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats 1>&2

## The following will fail if this is single-ended.
samtools view -b -f 2 -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam
bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.stats 1>&2
```

And the counting script for htseq.  I changed the output filename for
this step on multiple occasions over the course of sample collection.

```{bash htseq_hg38_100, eval=FALSE}
module add htseq

## Counting the number of hits in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam for each feature found in /cbcbhomes/abelew/libraries/genome/hg38_100.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union

htseq-count \
  -q -f bam -s no  --type gene  --idattr gene_id \
  outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam \
  /cbcbhomes/abelew/libraries/genome/hg38_100.gff \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err \
  1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count && \
    xz -f -9e outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err.xz 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count.xz
```

The same cyoa invocation produced a series of scripts to aggressively
compress the (un)mapped reads from hisat2 and print out some summary
statistics of the runs.

Conversely, here is a salmon script which was used for sample
TMRC30124 (without the slurm prefix/suffix):

```{bash salmon_script, eval=FALSE}
mkdir -p outputs/salmon_hg38_100 && sleep 3 && \
salmon quant -i /cbcbhomes/abelew/libraries/genome/indexes/hg38_100_salmon_index \
  -l A --gcBias --validateMappings  \
   -r <(less r1_trimmed.fastq.gz)  \
  -o outputs/salmon_hg38_100 \
  2>outputs/salmon_hg38_100/salmon.err 1>outputs/salmon_hg38_100/salmon.out
```

# Preprocessing data from Dr. Scott's lab

One potentially obvious comparison I neglected to make was against a
newer data set from the Scott lab.  (hmm zotero cannot seem to suck in
the author names!?  Did science somehow sanitize the pdf or is zotero
just being a doofus?  Nope it is definitely science, I asked zotero to
index a fresh copy I downloaded from pubmed and it worked fine, that
sucks, I kind of wish we would collectively refuse to engage with the
for-profit journals.)  (@amorimVariableGeneExpression2019)

In any event, the bioproject accession for this experiment is:
PRJNA525604 which is handy for me, because cyoa has a target named
'sradownload' which will, if it sees a PRJNA, do a little
web-scraping, download the metadata.csv from SRA run finder and then
download the data for every run into separate directories.

I have since started (but not finished) the implementation of a
--iterate target which will do what it says on the tin: iterate over
every directory in a tree and invoke the same pipeline on it; but
until I finish that, bash for loops for everyone!

Here is what I did:

```{bash, eval=FALSE}
mkdir -p preprocessing/PRJNA525604/
module add cyoa
cd preprocessing/PRJNA525604
start=$(pwd)
cyoa --method sra --input PRJNA525604
## wait a while, I got a coffee! and listened to the first song from Olarfur Arnaulds via Cercle:
## https://www.youtube.com/watch?v=bMCiAKNUpTY
## I then realized my script has an error where it leaves the job log
## for this, it created another preprocessing directory with the log.

## Run the processing pipeling
for i in $(/bin/ls -d SRR*); do
    cd $i
    cyoa --method prnaseq --species hg38_11 \
         --input $(/bin/ls *.fastq.gz | tr '\n' ':')
    cd $start
done
## The Process_RNASeq target (prna) handles all the stuff above.
```

The scott lab followed this work up with an examination of the
interaction between microbial buren of the skin and its connection to
increased unpredictability in clinical outcome as well as increased
inflammation during
infection (@fariasamorimMultiomicProfilingCutaneous2023).  I
completely missed this paper until quite recently (202404); in
addition, as I was finishing my quick first-pass read through, I found
the bioproject accession, and upon reading the metadata at SRA
realized they also followed that work up with another piece of
intriguing work (@sacramentoCCR5PromotesMigration2024).

So, with that in mind:

```{bash, eval=FALSE}
ac=PRJNA885131
mkdir -p preprocessing/${ac}
module add cyoa/202404
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac
ac=PRJNA682985
mkdir -p preprocessing/${ac}
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac

for i in $(/bin/ls -d SRR*); do
    cd $i
    cyoa --method prnaseq --species hg38_11 \
         --input $(/bin/ls *.fastq.gz | tr '\n' ':')
    cd $start
done
## Note, I only asked it to do hg38 for now, but I could/should change it to
## hg38_11:lbraziliensis_2904_v46:saureus_8325
## once I verify those are the appropriate genomes -- and I should
## probably grab fresh copies, I haven't updated them in a while.
```

Oh yeah, my Staphylococcus assembly is stupidly old, I cannot even
tell where I downloaded it (oh, I checked the fasta file, it is
current, I just manually downloaded everything like a doofus and it is
hard to tell the version)  I am just going to redownload it via cyoa
so I have the full accession on hand:

```{bash, eval=FALSE}
cd ~/libraries/genome/genbank
cyoa --method ncbidownload --input GCF_000013425
## --download downloads whatever ncbi accession I feed it, maybe I
## should change that to something like --ncbidown (I did!) and make one for
## ensembl and microbesonline and friends.  Thing is, I like having an
## archive of the gbk file so I can later extract stuff, ensembl/etc
## do not give me that, so it is often easier to just download
## manually, but then again, if I add it to cyoa I do not have to
## scroll around and accidently download some rando chromosome instead
## of the full assembly; and I can add a year or release ID and have
## it go to the appropriate archive server.  yeah, I like that.
## maybe I should change --input to --accession for stuff like this?
ls -ltr
ln -s GCF_000013425.gbff saureus_8325.gbk
cyoa --method gb2gff --input saureus_8325.gbk
ls -ltr
mv saureus_8325.fsa ../fasta/
mv saureus_8325_all.gff ../gff/saureus_8325.gff
mv saureus_8325.faa ../../amino_acid/
mv saureus_8325.ffn ../../CDS/fasta/
mv saureus_8325_cds.gff ../../CDS/fasta/saureus_8325.gff
mv saureus_8325_interCDS.gff ../../interCDS/gff/saureus_8325.gff
mv saureus_8325_rrna.gff ../../rRNA/gff/saureus_8325.gff
mv saureus_8325_rrna.fasta ../../rRNA/fasta/saureus_8325.gasta
cd ../fasta
cyoa --method indexhisat --species saureus_8325 \
     --input saureus_8325.fsa \
     --genome saureus_8325

```


My favorite part of the above: I can provide the csv downloaded from
SRA to my hpgltools::gather_preprocessing_metadata() function along
with the root directory of the downloaded data (preprocessing/$ac) and
it will suck up all the numbers for me automagically, so I do not have
to worry any more about collecting mapping statistics and potentially
messing them up.  One caveat: the rRNA collector still sometimes grabs
genomic numbers because I forgot to add the genome variable somewhere
in the last version...  Nonetheless, I think that is awesome (I even
made it smrt enough to handle a colon separated list of species).  Oh,
this reminds me, I should check that my S. aureus and braziliensis
assemblies are the same as the ones they used.

# Bibliography
