First of all read the GATK best Practices to obtain a overal perspective of variant analysis
Next, just for your records Ricardo in order to understand what you are going to do (and the limitations) in the variant analysis for RNASeq data using GATK read also the following threads.
The GATK Best Practices for variant calling on RNAseq, in full detail
Calling variants in RNAseq
Optional. You can also read this powerpoint in you want to know more about the whole perspective
Variant Calling and Clustering on RNA-Seq Data
From here i will summarize you all the steps to perform in the analysis for Mus musculus RNAseq Data (ensembl release 38) using GATK as implemented in variantSeq of GPRO in combination with some commands
1) Quality analysis with FASTQC, as you has been previously told in other analysis using the preprocessing interface
2) Preprocessing with cutadapt and prinseq as you has also been previously told in other analyses using the preprocessing interface.
3) downloading the genome reference and the SNP set you will use as training set
To do this you must enter in your user account in your server and go to the folder you want where you want to place the genome and the VNP vcf file
then get them from ensembl
4) Prepare the reference fasta file names of each chromosome to be concordant with the chromosome names assigned to each chromosome
You will see that the fasta file present the following fasta heads
1 dna_sm:chromosome chromosome:GRCm38:1:1:195471971:1 REF
2 dna_sm:chromosome chromosome:GRCm38:2:1:182113224:1 REF
3 dna_sm:chromosome chromosome:GRCm38:3:1:160039680:1 REF
4 dna_sm:chromosome chromosome:GRCm38:4:1:156508116:1 REF
5 dna_sm:chromosome chromosome:GRCm38:5:1:151834684:1 REF
6 dna_sm:chromosome chromosome:GRCm38:6:1:149736546:1 REF
7 dna_sm:chromosome chromosome:GRCm38:7:1:145441459:1 REF
MT dna_sm:chromosome chromosome:GRCm38:MT:1:16299:1 REF
You must to edit the file using the sed command to let the names to be only as
You must to also remove the scaffolds (CHR_xx) using the seqeditor of GPRO to keep only the chromosomes to 1 to MT
It seems there are recommended mappers to perform variant analysis with RNAseq data: STAR aligner and Tophap
Both work well. We are going to choose STAR as it implemented in variantseq.
Open variantseq and go
Variant Protocols -> SNPs/Indels -> Exome -> Mapping -> STARS
Then configure the inferface as indicated in the screenshot
Basically in the basal mode to add the GTF file indicate below, but yo can also try to improve the alignment adding some parameters for readmap number o clipping the ends as also highlighted in the image below.
6) Postprocessing: you need to postprocess your bam files (the results of mapping)
in the same menu for SNPs & Indels go to
Postprocessing -> picard-tools -> Picard tools : AddReplaceReadGroups
And then do as in the image below
Read also these threads of seqanswer and biostars to understang what you are doing
Seqanswers thread on topic
biostars thread on topic
7) Postprocessing: already in postprocess step take your bam files (the results of from addreplace group) and mark duplicates going to
Postprocessing -> picard-tools -> Picard tools : MarkDuplicates
8) Postprocessing: then you need to Split'N'Trim and reassign mapping qualities (you are still in the postprocessing step).
SplitNCigarReads is a GATK tool developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.
Do as indicated in the image below
9) Postprocessing: Base recalibration (BSQR)
Here is where you should to use the mus_musculus.vcf.gz
Go to Postprocessing -> GATK-tools -> gatk tools : BSQR
And do as in the image
Then you are ready to make the calling
10) Calling of SNPs and Indels
Here we must to use the interface for Haplotype caller of GATK in GPRO.
To do this
Go to SNPs/Indels calling -> Variant calling -> GATK based -> gatk haplotyper
The way of usage is similar to the other cases. Do as indicated in the figure
11) when you done with the calling you must to apply the command of filtering
The protocol addressed by GATK good practices (the first links given above to the GATK forum) recommend to use the tool variantFiltration instead of VQSR which is the typical tool that must be used in this step. The problem is that VSQR requires truth and training sets created specifically for RNAseq data and we only have a trining set created based on genomic variants.
Then we need to make the variant filtration, which i see is not implemented as an interface in variant seq
Ahmed will made an interface doing this funtion.
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf
Ricardo once Ahmed has this interface the next week you only need to input your reference fasta files as in the other cases, your results.vcf files obtained from the calling and add these parameters below in the box of parameters
window = 35
cluster = 3
filterName = FS
filter = FS > 30.0
filterName = QD
filter QD < 2.0
When having this completed, we can pass to the next step, which is annotaiton.
12 ) annotation effects
Here we need to run VEP
To do this
Go to SNPs/Indels calling -> Annotation -> VEP –Variant effect predictor
I do not put example of input fields of the interface because you are already familiar with them.
Yo just need to select your species mus musculus (you will see you have several mus musculus options you can select any of them or optionally if you want to do it check in ensembl which of these is more suitable to your reference genome. The easy thing is to take one names just as mus musculus
Then drag your filtered vcf files from the last filter analysis, to the interface
Following is a screenshot of the parameters selected, Look at all checked items! Anyway you can run it with default conditions.