1 Changelog

  • 202406: Adding a neat experiment from the Scott lab which I had sitting in my tree but somehow forgot about ((Farias Amorim et al. (2023)), which seems to me a fascinating line of inquiry following (Amorim et al. (2019)); the second of which I attempted to explicitly compare).
  • 202308: Moved preprocessing material from the datastructures file to here.

2 Introduction

This document outlines in some detail the tasks performed in order to process the ‘raw’ data.

3 Sequence preprocessing

All of the preprocessing tasks were performed with CYOA (“Elsayed-Lab/CYOA (n.d.)) and included the following:

  1. All samples were trimmed with trimomatic
    1. using options to remove adapters using a supplemented copy of the default adapter database (ILLUMINACLIP:2:30:10:2:keepBothReads), reads with fewer than 40 nucleotides remaining were discarded (MINLEN:40), and a mean quality >= 25 over 4 nucleotide window was used for trimming (SLIDINGWINDOW:4:25). When paired end reads were available, only the properly paired were used for the tasks that follow. The generated scripts for each sample which performed this trimming have the prefix ‘05trim’.
  2. Reads were passed to fastqc (Andrews (n.d.)) using the default parameters to query read quality, remaining adapter content, and as an initial survey for contaminants. The generated scripts which performed this have the prefix ‘06fastqc’
  3. Every sample was passed to kraken2 (Lu et al. (2022)) against its comprehensive ‘standard’ and ‘virus’ databases, updated 2021/06 in order to query for putative contaminants and potentially interesting viral sequences. The scripts which performed this have the prefix ‘11kraken’.
  4. Samples were passed to hisat2 (Kim et al. (2019)) against the hg38 release 100 human genome (“Homo Sapiens - Ensembl Genome Browser 100” (n.d.)) with the default options as well as the Leishmania panamensis MHOM/COL release 46
    1. in serial. The scripts which performed these operations have the prefixes ‘15hisat2_hg38_100’ and ‘15hisat2_lpanamensis_v46’ respectively.
  5. The resulting alignments were converted to the binary, sorted, compressed, and indexed format via samtools (Bonfield et al. (2021)). Additional filters were performed to extract only the non-variant alignments . The scripts which performed this have the prefix ‘19s2b_{genome}’.
  6. These were cross referenced against the relevant feature databases in order to generate count tables by gene via htseq (Putri et al. (2022)). These scripts have the prefix ‘21hts’.
  7. Each sample was passed to salmon (Patro et al. (2017)) against the transcript databases for hg38 release 100, L. panamensis release 46, and a concatenation of the two. These scripts have the suffix ‘30sal_{genome}’.
  8. A regular expression search was performed for the observed set of spliced leader sequences. The sequence ‘AGTTTCTGTACTTTATTGG’ rather than the entire SL sequence was searched to avoid potential SL variants. The relevant scripts have the prefix ‘50slsearch’. I later set up a fastp-based (Chen et al. (2018)) search, but I do not think it is included.
  9. When sufficient coverage was observed against the parasite, freebayes (Garrison and Marth (2012)) was used to quantify variants vs. the reference genome and infer the most likely parasite subtype.

(if anyone ever reads this (no one will ever read this), can you tell me how in the flying hell the fancy single quotes appeared in my text document? I am writing this in emacs using markdown-mode and when I am typing I see normal person quotes, but when I save it…. oooohhh I added a text-mode hook and forgot it, I think?)

Given the fact that these tasks were performed over the course of 4 years, during which time we changed sequencing platforms and computational infrastructure, the actual prefixes/suffixes of the various scripts and outputs are not as consistent as I would like.

Also, the samples we did in the midst of pandemic are sometimes quite a mess, some of them have far far too much data, others have far too little.

3.0.1 Trimming raw reads

With the above caveat in mind, I am copy/pasting the little shell fragments I used when trimming the samples:

## trim.sh, this assumes one will manually cd into the new sample's directory and run ../tmrc_trim.sh
## In addition, it assumes that the new data files will reside in the directory 'unprocessed/'.
input=$(/bin/ls -d unprocessed | tr '\n' ':' | sed 's/:$//g')
echo "Running trimomatic on ${input}"
cmd="cyoa --method trim --input ${input} 2>>cyoa_trim.out 1>&2 &"
echo "${cmd}" > cyoa_trim.out
${cmd}

The following is a script which is generated by the above, arbitrarily chosen from sample TMRC30140. If anyone reads it carefully, one might note that this sample is from before I started depositing new raw reads into the ‘unprocessed’ directory, as a result the input reads were in the cwd.

#!/usr/bin/env bash
#SBATCH --export=ALL
#SBATCH --mail-type=NONE
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=01trim_TMRC30140_S3_R1_001
#SBATCH --mem=40G
#SBATCH --cpus-per-task=3
#SBATCH --output=outputs/logs/trim_TMRC30140_S3_R1_001.sbatchout

echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140/scripts/01trim_TMRC30140_S3_R1_001.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt

module add trimomatic

## This call to trimomatic removes illumina and epicentre adapters from TMRC30140_S3_R1_001.fastq.gz:TMRC30140_S3_R2_001.fastq.gz.
## It also performs a sliding window removal of anything with quality <25;
## cutadapt provides an alternative to this tool.
## The original sequence data is recompressed and saved in the sequences/ directory.
mkdir -p outputs/01trimomatic
## Note that trimomatic prints all output and errors to STDERR, so send both to output
trimmomatic PE \
  -threads 1 \
  -phred33 \
  TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
  TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
  TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
   ILLUMINACLIP:/cbcb/sw/RedHat-7-x86_64/common/local/perl/5.34/lib/site_perl/5.34.0/auto/share/dist/Bio-Adventure/genome/adapters.fa:2:20:10:2:keepBothReads \
  SLIDINGWINDOW:4:20 MINLEN:40 \
  1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/TMRC30140_S3_R1_001-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
  trimmomatic PE \
    -threads 1 \
    -phred33 \
    TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
    TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
    TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
     SLIDINGWINDOW:4:25 MINLEN:50\
    1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
fi
sleep 10
mv TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed.fastq
mv TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed.fastq

## Recompress the unpaired reads, this should not take long.
xz -9e -f TMRC30140_S3_R1_001-trimmed_unpaired.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed_unpaired.fastq
## Recompress the paired reads.
xz -9e -f TMRC30140_S3_R1_001-trimmed.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed.fastq
ln -sf TMRC30140_S3_R1_001-trimmed.fastq.xz r1_trimmed.fastq.xz
ln -sf TMRC30140_S3_R2_001-trimmed.fastq.xz r2_trimmed.fastq.xz


## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 01trim_TMRC30140_S3_R1_001.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt

walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt

3.0.2 Mapping the trimmed reads

There is a similar shell fragment used to map these trimmed reads against the hg38_100 genome/transcriptome along with the panamensis genome:

#!/usr/bin/env bash
input=$(/bin/ls *_trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
cyoa --task map --method hisat --species hg38_100 --gff_type gene --gff_tag gene_id  \
  --input ${input} 2>tmrc3_map.out 1>&2 &
cyoa --task map --method hisat --species lpanamensis_v36 --gff_type gene --gff_tag ID \
  --input ${input} 2>>tmrc3_map.out 1>&2 &
cyoa --task map --method salmon --species hg38_100 \
     --input ${input} 2>tmrc3_salmon.out 1>&2 &

The following is the resulting mapping script taken arbitrarily from sample TMRC30205. There are a few aspects of this script which changed from the beginning to end of the project: the prefix # changed, I added some logic to check if the requisite software (in this case hisat2/samtools/htseq) is already in the path, and load it if not. I also changed the htseq output filenames slightly over time to try to make certain that they are consistent and that it is easy to identify the important parameters just from the output filename.

#!/usr/bin/env bash
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=40hisat2_hg38_100_genome
#SBATCH --mem=48G
#SBATCH --cpus-per-task=4
#SBATCH --output=outputs/logs/hisat2_hg38_100_genome.sbatchout

echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/scripts/40hisat2_hg38_100_genome.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt

module add hisat2 samtools htseq

## This is a hisat2 alignment of  -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz)  against /cbcbhomes/abelew/libraries/genome/indexes/hg38_100

mkdir -p outputs/40hisat2_hg38_100
sleep 3
hisat2 -x /cbcbhomes/abelew/libraries/genome/indexes/hg38_100  \
  -p 4 \
  -q   -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz)  \
  --phred33 \
  --un-gz outputs/40hisat2_hg38_100/TMRC30205_unaldis_hg38_100_genome.fastq \
  --al-gz outputs/40hisat2_hg38_100/TMRC30205_aldis_hg38_100_genome.fastq \
  --un-conc-gz outputs/40hisat2_hg38_100/TMRC30205_unalcon_hg38_100_genome.fastq \
  --al-conc-gz outputs/40hisat2_hg38_100/TMRC30205_alcon_hg38_100_genome.fastq \
  -S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam \
  2>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.err \
  1>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.out


## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 40hisat2_hg38_100_genome.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt

walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt

From here on I will remove the suffix of the script. Here is the sam to bam conversion script which was created by cyoa.

module add samtools bamtools

## Converting the text sam to a compressed, sorted, indexed bamfile.
## Also printing alignment statistics to outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats
## This job depended on: 133995

echo "Starting samtools"
if $(test ! -r outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam); then
    echo "Could not find the samtools input file."
    exit 1
fi
samtools view -u -t /cbcbhomes/abelew/libraries/genome/hg38_100.fasta \
  -S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam  \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.out && \

  samtools sort -l 9 outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.out && \
  rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && \
  rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam && \
  mv outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam &&  samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam

bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats 1>&2

## The following will fail if this is single-ended.
samtools view -b -f 2 -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam
bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.stats 1>&2

And the counting script for htseq. I changed the output filename for this step on multiple occasions over the course of sample collection.

module add htseq

## Counting the number of hits in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam for each feature found in /cbcbhomes/abelew/libraries/genome/hg38_100.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union

htseq-count \
  -q -f bam -s no  --type gene  --idattr gene_id \
  outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam \
  /cbcbhomes/abelew/libraries/genome/hg38_100.gff \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err \
  1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count && \
    xz -f -9e outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err.xz 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count.xz

The same cyoa invocation produced a series of scripts to aggressively compress the (un)mapped reads from hisat2 and print out some summary statistics of the runs.

Conversely, here is a salmon script which was used for sample TMRC30124 (without the slurm prefix/suffix):

mkdir -p outputs/salmon_hg38_100 && sleep 3 && \
salmon quant -i /cbcbhomes/abelew/libraries/genome/indexes/hg38_100_salmon_index \
  -l A --gcBias --validateMappings  \
   -r <(less r1_trimmed.fastq.gz)  \
  -o outputs/salmon_hg38_100 \
  2>outputs/salmon_hg38_100/salmon.err 1>outputs/salmon_hg38_100/salmon.out

4 Preprocessing data from Dr. Scott’s lab

One potentially obvious comparison I neglected to make was against a newer data set from the Scott lab. (hmm zotero cannot seem to suck in the author names!? Did science somehow sanitize the pdf or is zotero just being a doofus? Nope it is definitely science, I asked zotero to index a fresh copy I downloaded from pubmed and it worked fine, that sucks, I kind of wish we would collectively refuse to engage with the for-profit journals.) (Amorim et al. (2019))

In any event, the bioproject accession for this experiment is: PRJNA525604 which is handy for me, because cyoa has a target named ‘sradownload’ which will, if it sees a PRJNA, do a little web-scraping, download the metadata.csv from SRA run finder and then download the data for every run into separate directories.

I have since started (but not finished) the implementation of a –iterate target which will do what it says on the tin: iterate over every directory in a tree and invoke the same pipeline on it; but until I finish that, bash for loops for everyone!

Here is what I did:

mkdir -p preprocessing/PRJNA525604/
module add cyoa
cd preprocessing/PRJNA525604
start=$(pwd)
cyoa --method sra --input PRJNA525604
## wait a while, I got a coffee! and listened to the first song from Olarfur Arnaulds via Cercle:
## https://www.youtube.com/watch?v=bMCiAKNUpTY
## I then realized my script has an error where it leaves the job log
## for this, it created another preprocessing directory with the log.

## Run the processing pipeling
for i in $(/bin/ls -d SRR*); do
    cd $i
    cyoa --method prnaseq --species hg38_11 \
         --input $(/bin/ls *.fastq.gz | tr '\n' ':')
    cd $start
done
## The Process_RNASeq target (prna) handles all the stuff above.

The scott lab followed this work up with an examination of the interaction between microbial buren of the skin and its connection to increased unpredictability in clinical outcome as well as increased inflammation during infection (Farias Amorim et al. (2023)). I completely missed this paper until quite recently (202404); in addition, as I was finishing my quick first-pass read through, I found the bioproject accession, and upon reading the metadata at SRA realized they also followed that work up with another piece of intriguing work (Sacramento et al. (2024)).

So, with that in mind:

ac=PRJNA885131
mkdir -p preprocessing/${ac}
module add cyoa/202404
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac
ac=PRJNA682985
mkdir -p preprocessing/${ac}
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac

## This set of samples is not paired ended
for i in $(/bin/ls -d SRR*); do
    cd $i
    cyoa --method prnaseq --species hg38_111 \
         --input $(/bin/ls *.fastq.gz)
    cd $start
done
## Note, I only asked it to do hg38 for now, but I could/should change it to
## hg38_11:lbraziliensis_2904_v46:saureus_8325
## once I verify those are the appropriate genomes -- and I should
## probably grab fresh copies, I haven't updated them in a while.

Oh yeah, my Staphylococcus assembly is stupidly old, I cannot even tell where I downloaded it (oh, I checked the fasta file, it is current, I just manually downloaded everything like a doofus and it is hard to tell the version) I am just going to redownload it via cyoa so I have the full accession on hand:

cd ~/libraries/genome/genbank
cyoa --method ncbidownload --input GCF_000013425
## --download downloads whatever ncbi accession I feed it, maybe I
## should change that to something like --ncbidown (I did!) and make one for
## ensembl and microbesonline and friends.  Thing is, I like having an
## archive of the gbk file so I can later extract stuff, ensembl/etc
## do not give me that, so it is often easier to just download
## manually, but then again, if I add it to cyoa I do not have to
## scroll around and accidently download some rando chromosome instead
## of the full assembly; and I can add a year or release ID and have
## it go to the appropriate archive server.  yeah, I like that.
## maybe I should change --input to --accession for stuff like this?
ls -ltr
ln -s GCF_000013425.gbff saureus_8325.gbk
cyoa --method gb2gff --input saureus_8325.gbk
ls -ltr
mv saureus_8325.fsa ../fasta/
mv saureus_8325_all.gff ../gff/saureus_8325.gff
mv saureus_8325.faa ../../amino_acid/
mv saureus_8325.ffn ../../CDS/fasta/
mv saureus_8325_cds.gff ../../CDS/fasta/saureus_8325.gff
mv saureus_8325_interCDS.gff ../../interCDS/gff/saureus_8325.gff
mv saureus_8325_rrna.gff ../../rRNA/gff/saureus_8325.gff
mv saureus_8325_rrna.fasta ../../rRNA/fasta/saureus_8325.gasta
cd ../fasta
cyoa --method indexhisat --species saureus_8325 \
     --input saureus_8325.fsa \
     --genome saureus_8325

My favorite part of the above: I can provide the csv downloaded from SRA to my hpgltools::gather_preprocessing_metadata() function along with the root directory of the downloaded data (preprocessing/$ac) and it will suck up all the numbers for me automagically, so I do not have to worry any more about collecting mapping statistics and potentially messing them up. One caveat: the rRNA collector still sometimes grabs genomic numbers because I forgot to add the genome variable somewhere in the last version… Nonetheless, I think that is awesome (I even made it smrt enough to handle a colon separated list of species). Oh, this reminds me, I should check that my S. aureus and braziliensis assemblies are the same as the ones they used.

Bibliography

Amorim, Camila Farias, Fernanda O. Novais, Ba T. Nguyen, Ana M. Misic, Lucas P. Carvalho, Edgar M. Carvalho, Daniel P. Beiting, and Phillip Scott. 2019. “Variable Gene Expression and Parasite Load Predict Treatment Outcome in Cutaneous Leishmaniasis.” Science Translational Medicine 11 (519): eaax4204. https://doi.org/10.1126/scitranslmed.aax4204.
Andrews, Simon. n.d. FastQC: A Quality Control Tool for High Throughput Sequence Data.”
Bonfield, James K, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, and Robert M Davies. 2021. HTSlib: C Library for Reading/Writing High-Throughput Sequencing Data.” GigaScience 10 (2): giab007. https://doi.org/10.1093/gigascience/giab007.
Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor.” Bioinformatics 34 (17): i884–90. https://doi.org/10.1093/bioinformatics/bty560.
“Elsayed-Lab/CYOA.” n.d. https://github.com/elsayed-lab/CYOA. Accessed June 24, 2024.
Farias Amorim, Camila, Victoria M. Lovins, Tej Pratap Singh, Fernanda O. Novais, Jordan C. Harris, Alexsandro S. Lago, Lucas P. Carvalho, et al. 2023. “Multiomic Profiling of Cutaneous Leishmaniasis Infections Reveals Microbiota-Driven Mechanisms Underlying Disease Severity.” Science Translational Medicine 15 (718): eadh1469. https://doi.org/10.1126/scitranslmed.adh1469.
Garrison, Erik, and Gabor Marth. 2012. “Haplotype-Based Variant Detection from Short-Read Sequencing.” arXiv. https://doi.org/10.48550/arXiv.1207.3907.
“Homo Sapiens - Ensembl Genome Browser 100.” n.d. http://apr2020.archive.ensembl.org/Homo_sapiens/Info/Index. Accessed June 24, 2024.
Kim, Daehwan, Joseph M. Paggi, Chanhee Park, Christopher Bennett, and Steven L. Salzberg. 2019. “Graph-Based Genome Alignment and Genotyping with Hisat2 and HISAT-genotype.” Nature Biotechnology 37 (8): 907–15. https://doi.org/10.1038/s41587-019-0201-4.
Lu, Jennifer, Natalia Rincon, Derrick E. Wood, Florian P. Breitwieser, Christopher Pockrandt, Ben Langmead, Steven L. Salzberg, and Martin Steinegger. 2022. “Metagenome Analysis Using the Kraken Software Suite.” Nature Protocols 17 (12): 2815–39. https://doi.org/10.1038/s41596-022-00738-y.
Patro, Rob, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14 (4): 417–19. https://doi.org/10.1038/nmeth.4197.
Putri, Givanna H, Simon Anders, Paul Theodor Pyl, John E Pimanda, and Fabio Zanini. 2022. “Analysing High-Throughput Sequencing Data in Python with HTSeq 2.0.” Bioinformatics 38 (10): 2943–45. https://doi.org/10.1093/bioinformatics/btac166.
Sacramento, Laís Amorim, Camila Farias Amorim, Claudia G. Lombana, Daniel Beiting, Fernanda Novais, Lucas P. Carvalho, Edgar M. Carvalho, and Phillip Scott. 2024. Ccr5 Promotes the Migration of Pathological Cd8+ T Cells to the Leishmanial Lesions.” PLOS Pathogens 20 (5): e1012211. https://doi.org/10.1371/journal.ppat.1012211.
---
title: "TMRC3 202308: Preprocessing"
author: "atb abelew@gmail.com"
date: "`r Sys.Date()`"
bibliography: atb.bib
output:
 html_document:
  code_download: true
  code_folding: show
  fig_caption: true
  fig_height: 7
  fig_width: 7
  highlight: default
  keep_md: false
  mode: selfcontained
  number_sections: true
  self_contained: true
  theme: readable
  toc: true
  toc_float:
   collapsed: false
   smooth_scroll: false
---

<style>
  body .main-container {
    max-width: 1600px;
  }
</style>

```{r options, include=FALSE}
library(hpgltools)
library(dplyr)
library(forcats)
library(glue)

tt <- try(devtools::load_all("~/hpgltools"))
knitr::opts_knit$set(
  progress = TRUE, verbose = TRUE, width = 90, echo = TRUE)
knitr::opts_chunk$set(
  error = TRUE, fig.width = 8, fig.height = 8, fig.retina = 2,
  fig.pos = "t", fig.align = "center", dpi = if (knitr::is_latex_output()) 72 else 300,
  out.width = "100%", dev = "svg",
  dev.args = list(png = list(type = "cairo-png")))
old_options <- options(
  digits = 4, stringsAsFactors = FALSE, knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 12))
ver <- Sys.getenv("VERSION")
rundate <- format(Sys.Date(), format = "%Y%m%d")
rmd_file <- glue("tmrc3_datasets_{ver}.Rmd")
savefile <- gsub(pattern = "\\.Rmd", replace = "\\.rda\\.xz", x = rmd_file)
data_structures <- c()
```

# Changelog

* 202406: Adding a neat experiment from the Scott lab which I had
  sitting in my tree but somehow forgot about
  ((@fariasamorimMultiomicProfilingCutaneous2023), which seems to me a
  fascinating line of inquiry following
  (@amorimVariableGeneExpression2019); the second of which I attempted
  to explicitly compare).
* 202308: Moved preprocessing material from the datastructures file to here.

# Introduction

This document outlines in some detail the tasks performed in order to
process the 'raw' data.

# Sequence preprocessing

All of the preprocessing tasks were performed with CYOA (@ElsayedlabCYOA) and included
the following:

1. All samples were trimmed with trimomatic
   (@bolgerTrimmomaticFlexibleTrimmer2014a) using options to remove
   adapters using a supplemented copy of the default adapter database
   (ILLUMINACLIP:2:30:10:2:keepBothReads), reads with fewer than 40
   nucleotides remaining were discarded (MINLEN:40), and a mean quality
   >= 25 over 4 nucleotide window was used for trimming
   (SLIDINGWINDOW:4:25).  When paired end reads were available, only the
   properly paired were used for the tasks that follow.  The generated
   scripts for each sample which performed this trimming have the prefix
   ‘05trim’.
2. Reads were passed to fastqc (@andrewsFastQCQualityControl) using
   the default parameters to query read quality, remaining adapter
   content, and as an initial survey for contaminants.  The generated
   scripts which performed this have the prefix ‘06fastqc’
3. Every sample was passed to kraken2 (@luMetagenomeAnalysisUsing2022)
   against its comprehensive ‘standard’ and ‘virus’ databases, updated
   2021/06 in order to query for putative contaminants and potentially
   interesting viral sequences.  The scripts which performed this have
   the prefix ‘11kraken’.
4. Samples were passed to hisat2 (@kimGraphbasedGenomeAlignment2019)
   against the hg38 release 100 human genome (@HomoSapiensEnsembl)
   with the default options as well as the Leishmania panamensis
   MHOM/COL release 46
   (@shanmugasundramTriTrypDBIntegratedFunctional2023) in serial. The
   scripts which performed these operations have the prefixes
   ‘15hisat2_hg38_100’ and ‘15hisat2_lpanamensis_v46’ respectively.
5. The resulting alignments were converted to the binary, sorted,
   compressed, and indexed format via samtools
   (@bonfieldHTSlibLibraryReading2021).  Additional filters were
   performed to extract only the non-variant alignments .  The scripts
   which performed this have the prefix ‘19s2b_{genome}’.
6. These were cross referenced against the relevant feature databases in
   order to generate count tables by gene via htseq
   (@putriAnalysingHighthroughputSequencing2022).  These scripts have
   the prefix ‘21hts’.
7. Each sample was passed to salmon (@patroSalmonProvidesFast2017)
   against the transcript databases for hg38 release 100,
   L. panamensis release 46, and a concatenation of the two.  These
   scripts have the suffix ‘30sal_{genome}’.
8. A regular expression search was performed for the observed set of
   spliced leader sequences.  The sequence ‘AGTTTCTGTACTTTATTGG’ rather
   than the entire SL sequence was searched to avoid potential SL
   variants.  The relevant scripts have the prefix ‘50slsearch’.  I
   later set up a fastp-based (@chenFastpUltrafastAllinone2018)
   search, but I do not think it is included.
9. When sufficient coverage was observed against the parasite,
   freebayes (@garrisonHaplotypebasedVariantDetection2012) was used to
   quantify variants vs. the reference genome and infer the most
   likely parasite subtype.

(if anyone ever reads this (no one will ever read this), can you tell
me how in the flying hell the fancy single quotes appeared in my text
document?  I am writing this in emacs using markdown-mode and when I
am typing I see normal person quotes, but when I save it.... oooohhh I
added a text-mode hook and forgot it, I think?)

Given the fact that these tasks were performed over the course of 4
years, during which time we changed sequencing platforms and
computational infrastructure, the actual prefixes/suffixes of the
various scripts and outputs are not as consistent as I would like.

Also, the samples we did in the midst of pandemic are sometimes quite
a mess, some of them have far far too much data, others have far too
little.

### Trimming raw reads

With the above caveat in mind, I am copy/pasting the little shell fragments I
used when trimming the samples:

```{bash preprocess_fragment, eval=FALSE}
## trim.sh, this assumes one will manually cd into the new sample's directory and run ../tmrc_trim.sh
## In addition, it assumes that the new data files will reside in the directory 'unprocessed/'.
input=$(/bin/ls -d unprocessed | tr '\n' ':' | sed 's/:$//g')
echo "Running trimomatic on ${input}"
cmd="cyoa --method trim --input ${input} 2>>cyoa_trim.out 1>&2 &"
echo "${cmd}" > cyoa_trim.out
${cmd}
```

The following is a script which is generated by the above, arbitrarily
chosen from sample TMRC30140.  If anyone reads it carefully, one might
note that this sample is from before I started depositing new raw
reads into the 'unprocessed' directory, as a result the input reads
were in the cwd.

```{bash trim_sequence_script, eval=FALSE}
#!/usr/bin/env bash
#SBATCH --export=ALL
#SBATCH --mail-type=NONE
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=01trim_TMRC30140_S3_R1_001
#SBATCH --mem=40G
#SBATCH --cpus-per-task=3
#SBATCH --output=outputs/logs/trim_TMRC30140_S3_R1_001.sbatchout

echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30140/scripts/01trim_TMRC30140_S3_R1_001.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt

module add trimomatic

## This call to trimomatic removes illumina and epicentre adapters from TMRC30140_S3_R1_001.fastq.gz:TMRC30140_S3_R2_001.fastq.gz.
## It also performs a sliding window removal of anything with quality <25;
## cutadapt provides an alternative to this tool.
## The original sequence data is recompressed and saved in the sequences/ directory.
mkdir -p outputs/01trimomatic
## Note that trimomatic prints all output and errors to STDERR, so send both to output
trimmomatic PE \
  -threads 1 \
  -phred33 \
  TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
  TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
  TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
   ILLUMINACLIP:/cbcb/sw/RedHat-7-x86_64/common/local/perl/5.34/lib/site_perl/5.34.0/auto/share/dist/Bio-Adventure/genome/adapters.fa:2:20:10:2:keepBothReads \
  SLIDINGWINDOW:4:20 MINLEN:40 \
  1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
excepted=$(grep "Exception" outputs/TMRC30140_S3_R1_001-trimomatic.out)
## The following is in case the illumina clipping fails, which it does if this has already been run I think.
if [[ "${excepted}" != "" ]]; then
  trimmomatic PE \
    -threads 1 \
    -phred33 \
    TMRC30140_S3_R1_001.fastq.gz TMRC30140_S3_R2_001.fastq.gz \
    TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed_unpaired.fastq \
    TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed_unpaired.fastq \
     SLIDINGWINDOW:4:25 MINLEN:50\
    1>outputs/01trimomatic/TMRC30140_S3_R1_001-trimomatic.out 2>&1
fi
sleep 10
mv TMRC30140_S3_R1_001-trimmed_paired.fastq TMRC30140_S3_R1_001-trimmed.fastq
mv TMRC30140_S3_R2_001-trimmed_paired.fastq TMRC30140_S3_R2_001-trimmed.fastq

## Recompress the unpaired reads, this should not take long.
xz -9e -f TMRC30140_S3_R1_001-trimmed_unpaired.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed_unpaired.fastq
## Recompress the paired reads.
xz -9e -f TMRC30140_S3_R1_001-trimmed.fastq
xz -9e -f TMRC30140_S3_R2_001-trimmed.fastq
ln -sf TMRC30140_S3_R1_001-trimmed.fastq.xz r1_trimmed.fastq.xz
ln -sf TMRC30140_S3_R2_001-trimmed.fastq.xz r2_trimmed.fastq.xz


## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 01trim_TMRC30140_S3_R1_001.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt

walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt
```

### Mapping the trimmed reads

There is a similar shell fragment used to map these trimmed reads
against the hg38_100 genome/transcriptome along with the panamensis
genome:

```{bash map_fragment, eval=FALSE}
#!/usr/bin/env bash
input=$(/bin/ls *_trimmed.fastq.xz | tr '\n' ':' | sed 's/:$//g')
cyoa --task map --method hisat --species hg38_100 --gff_type gene --gff_tag gene_id  \
  --input ${input} 2>tmrc3_map.out 1>&2 &
cyoa --task map --method hisat --species lpanamensis_v36 --gff_type gene --gff_tag ID \
  --input ${input} 2>>tmrc3_map.out 1>&2 &
cyoa --task map --method salmon --species hg38_100 \
     --input ${input} 2>tmrc3_salmon.out 1>&2 &
```

The following is the resulting mapping script taken arbitrarily from
sample TMRC30205.  There are a few aspects of this script which
changed from the beginning to end of the project: the prefix #
changed, I added some logic to check if the requisite software (in
this case hisat2/samtools/htseq) is already in the path, and load it
if not.  I also changed the htseq output filenames slightly over time
to try to make certain that they are consistent and that it is easy to
identify the important parameters just from the output filename.

```{bash hisat_map_script, eval=FALSE}
#!/usr/bin/env bash
#SBATCH --chdir=/fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205
#SBATCH --partition=dpart
#SBATCH --qos=workstation --nice=10
#SBATCH --nodes=1 --requeue
#SBATCH --time=10:00:00
#SBATCH --job-name=40hisat2_hg38_100_genome
#SBATCH --mem=48G
#SBATCH --cpus-per-task=4
#SBATCH --output=outputs/logs/hisat2_hg38_100_genome.sbatchout

echo "## Started /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/scripts/40hisat2_hg38_100_genome.sh at $(date) on $(hostname) with id ${SLURM_JOBID}." >> outputs/log.txt

module add hisat2 samtools htseq

## This is a hisat2 alignment of  -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz)  against /cbcbhomes/abelew/libraries/genome/indexes/hg38_100

mkdir -p outputs/40hisat2_hg38_100
sleep 3
hisat2 -x /cbcbhomes/abelew/libraries/genome/indexes/hg38_100  \
  -p 4 \
  -q   -1 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r1_trimmed.fastq.xz) -2 <(less /fs/cbcb-lab/nelsayed/scratch/atb/rnaseq/lpanamensis_tmrc_2019/preprocessing/TMRC30205/r2_trimmed.fastq.xz)  \
  --phred33 \
  --un-gz outputs/40hisat2_hg38_100/TMRC30205_unaldis_hg38_100_genome.fastq \
  --al-gz outputs/40hisat2_hg38_100/TMRC30205_aldis_hg38_100_genome.fastq \
  --un-conc-gz outputs/40hisat2_hg38_100/TMRC30205_unalcon_hg38_100_genome.fastq \
  --al-conc-gz outputs/40hisat2_hg38_100/TMRC30205_alcon_hg38_100_genome.fastq \
  -S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam \
  2>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.err \
  1>outputs/40hisat2_hg38_100/hisat2_hg38_100_genome_TMRC30205.out


## The following lines give status codes and some logging
echo "## Job status: $? " >> outputs/log.txt
echo "## $(hostname) Finished ${SLURM_JOBID} 40hisat2_hg38_100_genome.sh at $(date), it took $(( SECONDS / 60 )) minutes." >> outputs/log.txt

walltime=$(scontrol show job "${SLURM_JOBID}" | grep RunTime | perl -F'/\s+|=/' -lane '{print $F[2]}')
echo "#### walltime used by ${SLURM_JOBID} was: ${walltime:-null}" >> outputs/log.txt
maxmem=$(sstat --format=MaxVMSize -n "${SLURM_JOBID}.batch")
echo "#### maximum memory used by ${SLURM_JOBID} was: ${maxmem:-null}" >> outputs/log.txt
avecpu=$(sstat --format=AveCPU -n "${SLURM_JOBID}.batch")
echo "#### average cpu used by ${SLURM_JOBID} was: ${avecpu:-null}" >> outputs/log.txt
```

From here on I will remove the suffix of the script.  Here is the sam
to bam conversion script which was created by cyoa.

```{bash sam2bam_script, eval=FALSE}
module add samtools bamtools

## Converting the text sam to a compressed, sorted, indexed bamfile.
## Also printing alignment statistics to outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats
## This job depended on: 133995

echo "Starting samtools"
if $(test ! -r outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam); then
    echo "Could not find the samtools input file."
    exit 1
fi
samtools view -u -t /cbcbhomes/abelew/libraries/genome/hg38_100.fasta \
  -S outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam  \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.out && \

  samtools sort -l 9 outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.err 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.out && \
  rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && \
  rm outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.sam && \
  mv outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-sorted.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam &&  samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam

bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam.stats 1>&2

## The following will fail if this is single-ended.
samtools view -b -f 2 -o outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome.bam && samtools index outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam
bamtools stats -in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.stats 1>&2
```

And the counting script for htseq.  I changed the output filename for
this step on multiple occasions over the course of sample collection.

```{bash htseq_hg38_100, eval=FALSE}
module add htseq

## Counting the number of hits in outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam for each feature found in /cbcbhomes/abelew/libraries/genome/hg38_100.gff
## Is this stranded? no.  The defaults of htseq are:
##  --order=name --idattr=gene_id --minaqual=10 --type=exon --stranded=yes --mode=union

htseq-count \
  -q -f bam -s no  --type gene  --idattr gene_id \
  outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired.bam \
  /cbcbhomes/abelew/libraries/genome/hg38_100.gff \
  2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err \
  1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count && \
    xz -f -9e outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count 2>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.err.xz 1>outputs/40hisat2_hg38_100/TMRC30205_hg38_100_genome-paired_all_hg38_100_sno_gene_gene_id.count.xz
```

The same cyoa invocation produced a series of scripts to aggressively
compress the (un)mapped reads from hisat2 and print out some summary
statistics of the runs.

Conversely, here is a salmon script which was used for sample
TMRC30124 (without the slurm prefix/suffix):

```{bash salmon_script, eval=FALSE}
mkdir -p outputs/salmon_hg38_100 && sleep 3 && \
salmon quant -i /cbcbhomes/abelew/libraries/genome/indexes/hg38_100_salmon_index \
  -l A --gcBias --validateMappings  \
   -r <(less r1_trimmed.fastq.gz)  \
  -o outputs/salmon_hg38_100 \
  2>outputs/salmon_hg38_100/salmon.err 1>outputs/salmon_hg38_100/salmon.out
```

# Preprocessing data from Dr. Scott's lab

One potentially obvious comparison I neglected to make was against a
newer data set from the Scott lab.  (hmm zotero cannot seem to suck in
the author names!?  Did science somehow sanitize the pdf or is zotero
just being a doofus?  Nope it is definitely science, I asked zotero to
index a fresh copy I downloaded from pubmed and it worked fine, that
sucks, I kind of wish we would collectively refuse to engage with the
for-profit journals.)  (@amorimVariableGeneExpression2019)

In any event, the bioproject accession for this experiment is:
PRJNA525604 which is handy for me, because cyoa has a target named
'sradownload' which will, if it sees a PRJNA, do a little
web-scraping, download the metadata.csv from SRA run finder and then
download the data for every run into separate directories.

I have since started (but not finished) the implementation of a
--iterate target which will do what it says on the tin: iterate over
every directory in a tree and invoke the same pipeline on it; but
until I finish that, bash for loops for everyone!

Here is what I did:

```{bash, eval=FALSE}
mkdir -p preprocessing/PRJNA525604/
module add cyoa
cd preprocessing/PRJNA525604
start=$(pwd)
cyoa --method sra --input PRJNA525604
## wait a while, I got a coffee! and listened to the first song from Olarfur Arnaulds via Cercle:
## https://www.youtube.com/watch?v=bMCiAKNUpTY
## I then realized my script has an error where it leaves the job log
## for this, it created another preprocessing directory with the log.

## Run the processing pipeling
for i in $(/bin/ls -d SRR*); do
    cd $i
    cyoa --method prnaseq --species hg38_11 \
         --input $(/bin/ls *.fastq.gz | tr '\n' ':')
    cd $start
done
## The Process_RNASeq target (prna) handles all the stuff above.
```

The scott lab followed this work up with an examination of the
interaction between microbial buren of the skin and its connection to
increased unpredictability in clinical outcome as well as increased
inflammation during
infection (@fariasamorimMultiomicProfilingCutaneous2023).  I
completely missed this paper until quite recently (202404); in
addition, as I was finishing my quick first-pass read through, I found
the bioproject accession, and upon reading the metadata at SRA
realized they also followed that work up with another piece of
intriguing work (@sacramentoCCR5PromotesMigration2024).

So, with that in mind:

```{bash, eval=FALSE}
ac=PRJNA885131
mkdir -p preprocessing/${ac}
module add cyoa/202404
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac
ac=PRJNA682985
mkdir -p preprocessing/${ac}
cd preprocessing/${ac}
start=$(pwd)
cyoa --method sra --input $ac

## This set of samples is not paired ended
for i in $(/bin/ls -d SRR*); do
    cd $i
    cyoa --method prnaseq --species hg38_111 \
         --input $(/bin/ls *.fastq.gz)
    cd $start
done
## Note, I only asked it to do hg38 for now, but I could/should change it to
## hg38_11:lbraziliensis_2904_v46:saureus_8325
## once I verify those are the appropriate genomes -- and I should
## probably grab fresh copies, I haven't updated them in a while.
```

Oh yeah, my Staphylococcus assembly is stupidly old, I cannot even
tell where I downloaded it (oh, I checked the fasta file, it is
current, I just manually downloaded everything like a doofus and it is
hard to tell the version)  I am just going to redownload it via cyoa
so I have the full accession on hand:

```{bash, eval=FALSE}
cd ~/libraries/genome/genbank
cyoa --method ncbidownload --input GCF_000013425
## --download downloads whatever ncbi accession I feed it, maybe I
## should change that to something like --ncbidown (I did!) and make one for
## ensembl and microbesonline and friends.  Thing is, I like having an
## archive of the gbk file so I can later extract stuff, ensembl/etc
## do not give me that, so it is often easier to just download
## manually, but then again, if I add it to cyoa I do not have to
## scroll around and accidently download some rando chromosome instead
## of the full assembly; and I can add a year or release ID and have
## it go to the appropriate archive server.  yeah, I like that.
## maybe I should change --input to --accession for stuff like this?
ls -ltr
ln -s GCF_000013425.gbff saureus_8325.gbk
cyoa --method gb2gff --input saureus_8325.gbk
ls -ltr
mv saureus_8325.fsa ../fasta/
mv saureus_8325_all.gff ../gff/saureus_8325.gff
mv saureus_8325.faa ../../amino_acid/
mv saureus_8325.ffn ../../CDS/fasta/
mv saureus_8325_cds.gff ../../CDS/fasta/saureus_8325.gff
mv saureus_8325_interCDS.gff ../../interCDS/gff/saureus_8325.gff
mv saureus_8325_rrna.gff ../../rRNA/gff/saureus_8325.gff
mv saureus_8325_rrna.fasta ../../rRNA/fasta/saureus_8325.gasta
cd ../fasta
cyoa --method indexhisat --species saureus_8325 \
     --input saureus_8325.fsa \
     --genome saureus_8325

```

My favorite part of the above: I can provide the csv downloaded from
SRA to my hpgltools::gather_preprocessing_metadata() function along
with the root directory of the downloaded data (preprocessing/$ac) and
it will suck up all the numbers for me automagically, so I do not have
to worry any more about collecting mapping statistics and potentially
messing them up.  One caveat: the rRNA collector still sometimes grabs
genomic numbers because I forgot to add the genome variable somewhere
in the last version...  Nonetheless, I think that is awesome (I even
made it smrt enough to handle a colon separated list of species).  Oh,
this reminds me, I should check that my S. aureus and braziliensis
assemblies are the same as the ones they used.

# Bibliography
