Variant analysis on RNA

tripleR

Dear Ahmed/Carlos/Bea
Could you please how to proceed with GPRO for making a variant analysis using RNASeq data. They are mouse sequences
Thank you in advance
Ricardo

llorens

First of all read the GATK best Practices to obtain a overal perspective of variant analysis

Next, just for your records Ricardo in order to understand what you are going to do (and the limitations) in the variant analysis for RNASeq data using GATK read also the following threads.

The GATK Best Practices for variant calling on RNAseq, in full detail

Calling variants in RNAseq

Optional. You can also read this powerpoint in you want to know more about the whole perspective

Variant Calling and Clustering on RNA-Seq Data

Keep working

From here i will summarize you all the steps to perform in the analysis for Mus musculus RNAseq Data (ensembl release 38) using GATK as implemented in variantSeq of GPRO in combination with some commands

1) Quality analysis with FASTQC, as you has been previously told in other analysis using the preprocessing interface

2) Preprocessing with cutadapt and prinseq as you has also been previously told in other analyses using the preprocessing interface.

3) downloading the genome reference and the SNP set you will use as training set

To do this you must enter in your user account in your server and go to the folder you want where you want to place the genome and the VNP vcf file

then get them from ensembl

wget ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_rm.toplevel.fa.gz

wget ftp://ftp.ensembl.org/pub/release-96/variation/vcf/mus_musculus/mus_musculus.vcf.gz

4) Prepare the reference fasta file names of each chromosome to be concordant with the chromosome names assigned to each chromosome

You will see that the fasta file present the following fasta heads

1 dna_sm:chromosome chromosome:GRCm38:1:1:195471971:1 REF
2 dna_sm:chromosome chromosome:GRCm38:2:1:182113224:1 REF
3 dna_sm:chromosome chromosome:GRCm38:3:1:160039680:1 REF
4 dna_sm:chromosome chromosome:GRCm38:4:1:156508116:1 REF
5 dna_sm:chromosome chromosome:GRCm38:5:1:151834684:1 REF
6 dna_sm:chromosome chromosome:GRCm38:6:1:149736546:1 REF
7 dna_sm:chromosome chromosome:GRCm38:7:1:145441459:1 REF
.
.
.
MT dna_sm:chromosome chromosome:GRCm38:MT:1:16299:1 REF

You must to edit the file using the sed command to let the names to be only as

1
2
.
.
MT

You must to also remove the scaffolds (CHR_xx) using the seqeditor of GPRO to keep only the chromosomes to 1 to MT

5) mapping

It seems there are recommended mappers to perform variant analysis with RNAseq data: STAR aligner and Tophap

Both work well. We are going to choose STAR as it implemented in variantseq.

Then
Open variantseq and go

Variant Protocols -> SNPs/Indels -> Exome -> Mapping -> STARS

Then configure the inferface as indicated in the screenshot

Basically in the basal mode to add the GTF file indicate below, but yo can also try to improve the alignment adding some parameters for readmap number o clipping the ends as also highlighted in the image below.

[attachment:5d63b6c6b059b]

6) Postprocessing: you need to postprocess your bam files (the results of mapping)

in the same menu for SNPs & Indels go to
Postprocessing -> picard-tools -> Picard tools : AddReplaceReadGroups
And then do as in the image below

[attachment:5d63f2f6bcf9a]

Read also these threads of seqanswer and biostars to understang what you are doing

Seqanswers thread on topic
biostars thread on topic

7) Postprocessing: already in postprocess step take your bam files (the results of from addreplace group) and mark duplicates going to

Postprocessing -> picard-tools -> Picard tools : MarkDuplicates

8) Postprocessing: then you need to Split'N'Trim and reassign mapping qualities (you are still in the postprocessing step).

SplitNCigarReads is a GATK tool developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

Do as indicated in the image below

[attachment:5d63f260da735]

9) Postprocessing: Base recalibration (BSQR)

Here is where you should to use the mus_musculus.vcf.gz

Go to Postprocessing -> GATK-tools -> gatk tools : BSQR

And do as in the image

[attachment:5d63f3b8822a3]

Then you are ready to make the calling

10) Calling of SNPs and Indels

Here we must to use the interface for Haplotype caller of GATK in GPRO.

To do this

Go to SNPs/Indels calling -> Variant calling -> GATK based -> gatk haplotyper

The way of usage is similar to the other cases. Do as indicated in the figure

[attachment:5d63f35f19c18]

11) when you done with the calling you must to apply the command of filtering

The protocol addressed by GATK good practices (the first links given above to the GATK forum) recommend to use the tool variantFiltration instead of VQSR which is the typical tool that must be used in this step. The problem is that VSQR requires truth and training sets created specifically for RNAseq data and we only have a trining set created based on genomic variants.

Then we need to make the variant filtration, which i see is not implemented as an interface in variant seq

Ahmed will made an interface doing this funtion.

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf

Ricardo once Ahmed has this interface the next week you only need to input your reference fasta files as in the other cases, your results.vcf files obtained from the calling and add these parameters below in the box of parameters

window = 35
cluster = 3
filterName = FS
filter = FS > 30.0
filterName = QD
filter QD < 2.0

When having this completed, we can pass to the next step, which is annotaiton.

12 ) annotation effects

Here we need to run VEP

To do this

Go to SNPs/Indels calling -> Annotation -> VEP –Variant effect predictor

I do not put example of input fields of the interface because you are already familiar with them.

Yo just need to select your species mus musculus (you will see you have several mus musculus options you can select any of them or optionally if you want to do it check in ensembl which of these is more suitable to your reference genome. The easy thing is to take one names just as mus musculus

Then drag your filtered vcf files from the last filter analysis, to the interface

Following is a screenshot of the parameters selected, Look at all checked items! Anyway you can run it with default conditions.

[attachment:5d63f3fb10fce]