exome jobs

tripleR

Dear Bea / Carlos / Ahmed,
Could you please provide me with a short description of variant analysis using human exome data?
Thanks in advance!
Ricardo

llorens

Hi Ricardo,
As you already know the basic procedure, I have not included explanatory images or instructions for uploading the files; I will just list the steps. If you have any further questions, call or email me and I will explain in more detail personally.

STEP 1: MATERIALS TO DOWNLOAD

First of all, download the hg38 resource bundle and place it in the folder where you are going to run the analysis.

Here is the link.

Basically, you have to do the following:

To access the bundle on the FTP server, use the following login credentials in your favorite FTP client (for instance, FileZilla)

Or you can simply open this link in a browser:
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/

If you are asked for a password, leave it blank.

The bundle directory contains five subdirectories, one for each build of the human genome that we have resources for: b36, b37, hg18, hg19 and hg38 (aka GRCh38). Be aware that the hg38 resource set is provided as-is, and its contents may still be incomplete.

Go to the hg38 directory and download everything except the beta folder. Note that some of the files are compressed.
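
If you prefer to decompress the downloaded files in one go, something along these lines should work. It is only a sketch: the folder name in the usage note is hypothetical, and whether every tool in the pipeline needs the uncompressed version is an assumption, so the original .gz files are kept.

```python
import gzip
import shutil
from pathlib import Path


def decompress_bundle(folder):
    """Decompress every *.gz file in `folder`, keeping the originals.

    Returns the list of files that were written.
    """
    written = []
    for gz_path in sorted(Path(folder).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop the trailing .gz
        if out_path.exists():
            continue  # already decompressed on an earlier run
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

For example, decompress_bundle("hg38_bundle") would process a folder called hg38_bundle (a made-up name) and skip anything that has already been decompressed.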

Then you are ready for the analysis.

STEP 2: QUALITY CONTROL AND PREPROCESSING

You already know how to do this with the FastQC interface for quality analysis and the Prinseq and Cutadapt interfaces for preprocessing. It works as before.

STEP 3: MAPPING
If you want to do it with Bowtie, do it as in previous cases, but use the reference genome you downloaded from the bundle:

Homo_sapiens_assembly38.fasta.gz
Homo_sapiens_assembly38.fasta.fai
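
As a quick sanity check on the reference, the .fai index can be read directly: it is a tab-separated text file whose first two columns are the contig name and its length. A minimal sketch (the function name is mine, not part of any tool):

```python
def read_fai(path):
    """Return {contig_name: length} from a FASTA index (.fai) file."""
    contigs = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            contigs[fields[0]] = int(fields[1])
    return contigs
```

Calling read_fai("Homo_sapiens_assembly38.fasta.fai") and looking at the contig names tells you which naming scheme the reference uses, which is worth knowing before combining it with the VCF files used in the later steps.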

POST PROCESSING STEPS

STEP 4: MARK DUPLICATES

Follow this path in VariantSeq:

Postprocessing -> Picard Tools -> mark duplicates

As an option, set create index = true.

STEP 5: REALIGN AROUND INDELS

Follow this path in VariantSeq:

Postprocessing -> GATK Tools -> Indel Local Realignment

In the “known sites files” field, put the two VCF files from the bundle:

Mills_and_1000G_gold_standard.indels.hg38.vcf
1000G_phase1.snps.high_confidence.hg38.vcf

You do not need to set any options; let the tool run with its defaults.

STEP 6: BQSR

The same as in previous cases.

Path in VariantSeq:

Postprocessing -> GATK Tools -> BQSR

In the “Known sites files” field you can put these three VCF files from the bundle:

dbsnp_146.hg38.vcf
1000G_phase1.snps.high_confidence.hg38.vcf
hapmap_3.3.hg38.vcf.gz

Again, you do not need to set any options; let the tool run with its defaults.
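
Before launching steps 5 and 6 it can save time to confirm that the known-sites files are really in your analysis folder under exactly these names. A small sketch, using the file names listed above; the function name is hypothetical, and if you decompressed or kept a file compressed you would need to adjust the names accordingly.

```python
from pathlib import Path

# Known-sites files per step, as listed in this post.
KNOWN_SITES = {
    "indel realignment": [
        "Mills_and_1000G_gold_standard.indels.hg38.vcf",
        "1000G_phase1.snps.high_confidence.hg38.vcf",
    ],
    "BQSR": [
        "dbsnp_146.hg38.vcf",
        "1000G_phase1.snps.high_confidence.hg38.vcf",
        "hapmap_3.3.hg38.vcf.gz",
    ],
}


def missing_known_sites(bundle_dir):
    """Return {step: [missing file names]} for steps with missing files."""
    bundle = Path(bundle_dir)
    missing = {}
    for step, names in KNOWN_SITES.items():
        absent = [name for name in names if not (bundle / name).is_file()]
        if absent:
            missing[step] = absent
    return missing
```

An empty dictionary means every file was found.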

ANALYSIS STEPS

STEP 7: VARIANT CALLING

Path in VariantSeq:

Variant Calling -> Mutect2

Same as always, but read this web page first for details.

As you have neither tumor-normal pairs nor a panel of normals, select Tumor-only mode.

In the options, put the bundle file wgs_calling_regions.hg38.interval_list (-L intervals).
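
If you want to see how much of the genome the interval file covers, it can be summarized with a few lines. This sketch assumes the usual Picard-style interval_list layout: header lines starting with @, followed by tab-separated contig, start, end, strand and name, with 1-based inclusive coordinates. The function name is mine.

```python
def summarize_interval_list(path):
    """Return (number_of_intervals, total_bases) for an interval_list file."""
    n_intervals = 0
    total_bases = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("@") or not line.strip():
                continue  # skip the header and blank lines
            contig, start, end = line.split("\t")[:3]
            n_intervals += 1
            # Coordinates are 1-based and inclusive, hence the +1.
            total_bases += int(end) - int(start) + 1
    return n_intervals, total_bases
```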

STEP 8: VARIANT QUALITY SCORE RECALIBRATION (VQSR)

Path in VariantSeq:

Variant Filtering -> Variant Quality Score Recalibration

There are three types of resources: training sites, truth sites and known sites. Put each of the files below in the corresponding field of the interface:

dbsnp_146.hg38.vcf
1000G_phase1.snps.high_confidence.hg38.vcf
hapmap_3.3.hg38.vcf.gz
Mills_and_1000G_gold_standard.indels.hg38.vcf
1000G_omni2.5.hg38.vcf
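
As a rough guide to where each file goes, the mapping below follows the roles usually shown in GATK's own VQSR examples. Treat it as an assumption rather than something specific to VariantSeq, and check the current GATK documentation before relying on it; in particular, the Omni file is marked as truth in some versions of those examples and not in others.

```python
# Resource roles as commonly shown in GATK's VQSR examples (an assumption
# here, not something confirmed for the VariantSeq interface).
VQSR_ROLES = {
    "hapmap_3.3.hg38.vcf.gz": {"training", "truth"},
    # Some GATK examples also flag Omni as a truth resource.
    "1000G_omni2.5.hg38.vcf": {"training"},
    "1000G_phase1.snps.high_confidence.hg38.vcf": {"training"},
    "dbsnp_146.hg38.vcf": {"known"},
    # Used for indel recalibration rather than SNPs.
    "Mills_and_1000G_gold_standard.indels.hg38.vcf": {"training", "truth"},
}


def files_for(role):
    """Return the bundle files assigned to a role: training, truth or known."""
    return sorted(name for name, roles in VQSR_ROLES.items() if role in roles)
```

For instance, files_for("known") lists what would go in the known sites field.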

STEP 9: VARIANT ANNOTATION

Path in VariantSeq:

Annotation -> VEP