- ASE using DNA-seq and RNA-seq data
- Introduction
One of the major steps in determining ASE requires identification of heterozygous biallelic variant sites. From a NGS perspective, when reads generated by a sequencer are aligned to a reference genome, ‘variant calling’ is performed to identify the biallelic variant sites and ‘genotype calling’ determines the genotype or if the studied organism is heterozygous at that site or not. As mentioned in the previous chapter, RNA-seq variant calling is an added advantage that can be achieved in addition to expression analysis and identification/quantification of transcript isoforms. Unfortunately, RNA-seq variant calling will have caveats such as alignment errors due to splicing, increased error rate due to reverse transcription, failure to call variants and infer genotypes in lowly expressed genes and incorrectly inferring RNA-editing sites as genomic variation (1).
In the previous chapter, we have aligned RNA-seq reads to a reference genome to identify genomic variants. RNA-seq short reads are derived from mature-RNA (in case of mRNA-seq) which do not have introns on their sequence. A read generated from an mRNA-seq experiment will span two exons, while the reference genome will have an exon, followed by an intron leading to splitting of a single read (popularly known as ‘split-reads’ formed due to a splice junction). This would need an aligner to deal with issues such as mapping ambiguity due to short length of reads and sequencing errors leading to misaligned split reads. For this reason, specific aligners known as ‘splice-aware’ aligners are used and many of such aligners follow different strategies to identify splice junctions and also alternative splicing that enables a gene to code for multiple proteins through inclusion or exclusion of exons (2-4). It was also observed that technical error from reverse transcription and sequencing, and biological error caused from RNA polymerase look undistinguishable, challenging the identification of variants (5). RNA-seq read distribution on a reference genome (when aligned to one) is non-uniform in nature which means that reads will be concentrated on highly expressed genes than genes which have low or no expression. Enough depth is required at a particular locus in order to call genotype and variant confidently and lowly expressed genes or even unexpressed genes defy this (6). On the other hand, reads are distributed uniformly across the whole genome in DNA-seq and is helpful in calling genotypes as well as variants from any locus. RNA-editing refers to a post-transcriptional modification that makes a mature RNA sequence differ from its template DNA sequence by base insertion, deletions or substation leading to RNA mutations and altering gene regulation (7, 8). This means that when only RNA-seq is used for detecting variations after alignment to the reference genome, any change from the genomic DNA will be misinterpreted as a single nucleotide variation or SNV. Having the genomic sequence of the same organism under study will aid in identification and correct interpretation of such events which have been previously done in cow and humans (9-12).
This chapter deals with the determination of ASE by identifying and genotyping variants using DNA-seq data. The variant calling workflow has been ‘adapted from’ the GATK (The Genome Analysis Toolkit) (13) best practices workflow for Germline single nucleotide polymorphisms (SNPs) + Indels (Insertions or deletions) for our ‘non-human’ organism as mentioned in (14).
- Methods
- WGS Raw data description
As seen in table 1, the raw data consisted of Whole Genome sequencing (WGS) data from 81 water buffaloes from 7 breeds. Out of 7, 6 breeds are of Indian origin. Paired-end sequencing of the animals were done in two sequencing centres- Edinburgh Genomics, UK and SciGenom (India) in different depths-30x and 10x, respectively. The number of animals sequenced at 10x and 30x depth is mentioned in table 1. Edinburgh genomics, UK sequenced the animals using Illumina HiSeq X machine (read length: 150) whereas SciGenom used Illumina Illumina HiSeq 2500 machine (read length: 250). There was no particular reason for using different sequencing technologies for different animals, except from what was available during the time when the samples were ready for sequencing as they were sequenced in different times. Due to this reason, batch effect (15) could have hampered downstream analysis and its presence was checked as described further ahead in this chapter. The six Mediterranean breed animals consists of two animals from Lodi (Italy), and four animals from the same farm in Kirkcaldy, town in Fife, on the east coast of Scotland, UK. The Indian breed samples were collected from their respective breeding tracts as mentioned in table 2.
Breed |
No. of animals |
Sequencing Centre |
Depth |
Italian (Mediterranean) |
6 |
Edinburgh Genomics |
30x |
Jaffrabadi |
13 |
SciGenom (India) and |