How to analyze your genome

in #genetics, 6 years ago

As they say, “know thyself is the advice of the sage”, and so after a few years of procrastination (i.e. waiting for a price I could actually afford), I decided to make the leap and have my genome sequenced. I wanted to document my process in the hope that it might help someone else looking into the same topic. This is a quick overview of the steps I took to get my genome sequenced and to get some useful info out of it.

A quick disclaimer: I am by no means a geneticist, nor do I have advanced knowledge of genetics (except what was gleaned through hours of web browsing). As such, most of the steps in the analysis section were the product of trial and error. There might be better ways of achieving the desired results, and if you know of some, let me know in the comments.

What’s contained in this post?

  • Chapter 1: DNA sequencing, what to choose?
  • Chapter 2: Sequencing companies
  • Chapter 3: …and the results are…
  • Chapter 4: Raw data analysis (i.e. the good stuff)

DNA sequencing: what to choose?

There are many choices when it comes to DNA sequencing. Which one to choose?

Well, the short answer is: it depends! Are you trying to figure out which Neolithic tribe you come from? Which diseases you are susceptible to? All of the above? Is cost an issue? How about privacy?

Here is an attempt to describe what is out there:

Genealogy/Ancestry testing

As its name suggests, genealogy testing examines your DNA to help determine whether you are of African, Asian, European, Native American etc. descent, or to provide more specific information and trace your family tree back to localised prehistoric tribes. As such, these tests focus on specific DNA sequences known to correlate with those populations.
As multiple companies offer genealogy testing (23andMe, Family Tree DNA and AncestryDNA being the most popular), there will be some variation in what the tests look at. However, there are three main types of tests available:

Y-chromosome DNA Tests: Y-DNA tests look for specific markers on the Y-chromosome. Since females do not carry the Y-chromosome, the test only applies to males. As the Y-chromosome is passed from father to son, the tests trace the direct patrilineal line (father, father’s father and so on). Although the main Y-DNA Haplogroups are known (e.g. Eastern Europeans: Haplogroup R1a, Western Europeans: Haplogroup R1b etc.), finer-grained Haplogroups are discovered as the databases grow, and some can even correlate to a surname (e.g. surname projects).

Mitochondrial DNA Tests: Although not strictly part of the human chromosomes, mitochondria are sub-units present in most human cells that are in charge of generating energy for the cell. They carry their own separate DNA, which is transmitted from the mother to both male and female children. As such, the test can be taken by both men and women and shows the matrilineal line. Note that mtDNA changes very slowly over time (slower than chromosomal DNA) and can therefore only be used for deep ancestry, high up in the proverbial genealogy tree. If two people share the same mtDNA, it is likely they are related; however, it may be difficult to tell when the common ancestor lived.
Although some sequencing companies look at specific Haplogroups in mtDNA, others sequence the whole mtDNA genome (e.g. Family Tree DNA).

Autosomal DNA Tests: Autosomal tests look at genetic markers in the 22 pairs of non-sex chromosomes. Unlike Y-DNA and mtDNA tests, which each identify a single ancestral line (i.e. patrilineal or matrilineal), autosomal DNA reflects the input from all ancestors. Although perhaps less satisfying because the answer is more nuanced, this paints a more complete picture of the subject’s geographical origins, expressed as percentages of various population groups.
Note that these percentages are based on different models and, therefore, the results can vary depending on the specific model that is applied.

It is worth mentioning that the International Society of Genetic Genealogy (ISOGG) offers continuously updated information about companies doing genealogical testing, as well as details of the tests they offer, to help you make a choice. The main advantage of ancestry testing is certainly its simplicity. It is a one-stop shop at a low price: you provide a spit sample/cheek swab and in return you get an analysis of your ancestry. Some companies, but not all, also offer the raw data for the relevant DNA markers. However, from what I saw, that data is already in a pre-processed format, which limits its usability.
The drawback, however, is the limited scope of the analysis: only a subset of known markers is tested. Some companies offer additional visibility into specific sequences where relevant variations might occur (i.e. SNPs) to further refine the analysis, through the purchase of additional tests, but this increases the overall cost. When all the costs are added up, the total is not insignificant compared to whole genome analysis (see below).

Health-centric gene analysis:

As knowledge about the impact of genes keeps growing, there are now multiple health-related DNA tests. As with their genealogy counterparts, the tests look for specific markers. For example, they check carrier status for genetic diseases (e.g. certain types of cancer, sickle cell anemia etc.), how your genes can impact your lifestyle (e.g. sensitivity to coffee, proneness to alcoholism etc.) and general health risks. As these results are more likely to have a significant impact (e.g. think double mastectomy because of BRCA genes), it is critical that these tests are held to a higher standard than run-of-the-mill genealogy testing.

Whole Exome Sequencing (WES):

This type of analysis targets protein-coding regions of the genome, which is arguably the most important part. This region represents less than 2% of the 3 billion nucleotides in the human DNA.

This reduces the cost of sequencing relative to a whole genome analysis. For example, in the case of DanteLabs, where I chose to have my genome sequenced, the cost of WES is $499 USD vs. $699 USD for the whole genome analysis. Furthermore, because the exome is so small relative to the genome, it is possible to focus on those key sequences by increasing the number of reads for a given nucleotide sequence (i.e. increasing coverage or depth). This is important since, despite the high accuracy of sequencing techniques, the number of nucleotides involved is enormous, so statistically a large number of sequencing errors will occur. These errors will, of course, lead to false results and an erroneous analysis. To address this issue, it is necessary to sequence the same region a large number of times, thereby increasing sequencing accuracy. Research has found that 120X coverage is adequate for detecting most variations in human genes. Since, from what I saw, the accepted standard for exome testing is 100X, it is in the ballpark of meeting clinical requirements.

Whole Genome Sequencing (WGS):

The Holy Grail of DNA sequencing. Given infinite resources, it is the best choice since it gives you a complete snapshot of your chromosomes. The first genome sequencing effort, the Human Genome Project, cost 2.7 billion dollars and took 13 years to complete. As the technology improved at an exponential rate, the cost dropped just as fast. It now takes a few weeks and, with some companies, less than $1,000 to complete.
The good news is that WGS encompasses all the tests previously mentioned (with the exception of mtDNA, which does not sequence chromosomes, but instead, the mitochondrion’s genetic code). This means that the sequencing needs to be done only once and the analysis repeated for different information or as new discoveries come around. This might be especially useful for the emerging field of personalized medicine where the treatment can be uniquely adapted to individual patients.
Compared to WES, WGS also examines non-coding regions of the genome, such as promoters and enhancers, which can change the likelihood that transcription of a gene is initiated. Further, due to technical quirks, WGS has more reliable/uniform sequence coverage and is therefore less prone to false negatives, whereas WES can leave regions with little or no coverage. On the downside, there is of course the cost, although the difference is not as large as one might expect (30% in my experience). Additionally, because of the significantly larger amount of data, the industry standard is 30X instead of the 100X for WES, which in some scenarios, such as clinical disease detection, is a non-starter.
To summarize, while WES has deeper coverage, it may miss specific regions that are accessible through WGS with its more uniform/reliable coverage. Note that in the US, WGS requires a doctor’s approval, which may be a major downside if one lives outside the US and has no access to a US doctor, or if one has privacy concerns about the possibility of one’s genetic results entering the healthcare system.
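
To put rough numbers on the depth-versus-breadth trade-off, here is a back-of-the-envelope sketch in Python. It only uses the figures quoted above (3 billion nucleotides, an exome of about 2% of that, 100X for WES and 30X for WGS). The 2% is treated as an upper bound and the function name is just an illustration, so read the result as an order of magnitude, nothing more.

```python
GENOME_SIZE = 3_000_000_000   # nucleotides in the human genome (figure quoted above)
EXOME_FRACTION = 0.02         # "less than 2%": treated here as an upper bound


def sequenced_bases(target_size, coverage):
    """Total bases the sequencer must read to cover target_size at the given depth."""
    return target_size * coverage


wes_bases = sequenced_bases(GENOME_SIZE * EXOME_FRACTION, 100)  # exome at 100X
wgs_bases = sequenced_bases(GENOME_SIZE, 30)                    # genome at 30X

print(f"WES at 100X: {wes_bases / 1e9:.0f} billion bases")
print(f"WGS at 30X:  {wgs_bases / 1e9:.0f} billion bases")
print(f"WGS reads about {wgs_bases / wes_bases:.0f} times more raw sequence")
```

Even at a third of the depth, the whole genome requires far more raw sequence than the exome, which is why the deeper 100X standard is affordable for WES but not for WGS.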

The Sequencing companies

After looking at the different sequencing options, I decided on WGS; why limit oneself? WGS gives you access to all the other sequencing types (genealogy, health etc., mtDNA notwithstanding), so I might as well go with the whole master blueprint, which should become more and more valuable as new discoveries about the impact of specific sequences are made. It is the most exhaustive type of sequencing, and since I am looking more for qualitative completeness, I do not need the 100X coverage offered by WES. As of January 2018, several companies offer WGS:

Check out isogg.org for additional info.

After looking into it, I opted for DanteLabs, for several reasons:

  • Price: Lowest price out of all the companies above, which always counts for something.
  • Physician’s participation: Unlike companies based in the US, European companies do not require a US medical doctor’s note. Since I am not living in the States, another choice seemed like too much hassle.
  • Privacy (just because you’re paranoid doesn’t mean they are not out there to get you): Given the “official healthcare” route of the analysis in the US, where my name is associated with the analysis, I would not be surprised if my name/DNA ended up in some research database or, more problematically, in an insurer’s database. Not that I believe that my DNA will not end up in a research database anyway, but at least it will be there anonymously. Plus, European privacy laws are more stringent than in the US (then again, I might be fooling myself).
  • Free raw data: some of the other providers (e.g. Veritas) only provide raw data at an extra cost, which increases the total price tag.

...and the results are...

After ordering, I received a sample kit and shipped it back with a lot of spit in it. The analysis took longer than expected, i.e. 3 months instead of 6 weeks. This was not a huge issue, however: I had already waited 5 years for prices to reach a sweet spot I could afford, so I was ready to handle a few extra weeks. On the plus side, the team at DanteLabs was extremely responsive, before and after I received my results.

With the sequencing completed, I received a report and two raw data VCF files. The report analyzes key genes (a sample report is available here), detailing sensitivity to different medications as well as resistance or vulnerability to specific diseases and general health information. There was no ancestry information apart from me being Caucasian (not necessarily a huge discovery). The information was interesting, but limited to the subset they deemed most relevant (or maybe least controversial?). However, there is a much more interesting source of information: the raw data.

Raw data analysis (i.e. the good stuff)

The raw data is in VCF format. The original files obtained after sequencing are SAM files, or their binary equivalent (BAM files), which take up approx. 120GB for a 30X analysis. The VCF is a more portable version: it has already been processed to retain only the differences relative to a reference genome (GRCh37/hg19 in this case). I received two of them:

  • SNP VCF describing single-nucleotide differences from the reference
  • INDEL VCF describing inserted or deleted sequences
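
To make the SNP/INDEL distinction concrete, here is a minimal Python sketch that reads the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) and labels the record. The two example lines are made up and the function names are only for illustration. The length test is a simplification: symbolic alleles such as `<NON_REF>` are not handled.

```python
def parse_vcf_line(line):
    """Split one VCF data line into the fixed columns used below."""
    chrom, pos, rsid, ref, alt = line.rstrip("\n").split("\t")[:5]
    return {"chrom": chrom, "pos": int(pos), "id": rsid,
            "ref": ref, "alts": alt.split(",")}


def variant_kind(record):
    """Return 'SNP' when REF and every ALT are a single base, else 'INDEL'."""
    alleles = [record["ref"]] + record["alts"]
    return "SNP" if all(len(a) == 1 for a in alleles) else "INDEL"


# Made-up example lines, not taken from real sequencing data.
lines = [
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",   # one base replaced by another
    "chr1\t23456\t.\tCT\tC\t50\tPASS\t.\tGT\t1/1",  # one base deleted
]
for line in lines:
    rec = parse_vcf_line(line)
    print(rec["chrom"], rec["pos"], rec["ref"], rec["alts"], variant_kind(rec))
```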

With this, I explored available third-party software that would answer all my little questions and give me as complete a level of detail as possible. After some research, there did not seem to be a one-stop solution, and a piecemeal approach was needed:

Promethease

Promethease is a cloud-based service that uses SNPedia entries to analyze raw data files. SNPedia is a vast repository documenting human genetics. Amongst other things, it stores the known effect of each SNP, its location in the GRCh38 reference genome, as well as references. To my knowledge, it is the largest publicly available repository of its kind and is continuously updated as new SNPs are discovered. Promethease uses the database to parse the raw data and extract all available analysis. The trouble is, however, that the service accepts only a single VCF already annotated with SNP entries (which was not provided in the data I received).
After some (well, a lot of) investigation into tools to generate/annotate/merge VCF files, I realized that the most flexible program was GATK, a toolbox used by geneticists. The next issue was that the software only works on Unix/Linux boxes. Since I do not have a machine running Linux, I decided to install a virtual Ubuntu machine using VirtualBox. The scheme seems to work fairly well. I first tried to merge and annotate my VCF files.

Working with INDEL and SNP VCFs

Since VCF files are basically the delta against a known genome, you will need the reference genome they were created with. I downloaded the reference files using these instructions:

  • Dictionary file (ucsc.hg19.dict.gz)
  • Index file (ucsc.hg19.fasta.fai.gz)
  • Reference file in FASTA format (ucsc.hg19.fasta.gz)

Next, I attempted to combine the files using GATK:
# Merge the SNP and INDEL VCFs into a single file; on collisions the INDEL record wins
java -jar GenomeAnalysisTK.jar \
-T CombineVariants \
-R Reference/ucsc.hg19.fasta \
--variant:snp sequenced_genome.snp.vcf \
--variant:indel sequenced_genome.indel.vcf \
-o sequenced_genome.merged.vcf \
-genotypeMergeOptions PRIORITIZE -priority indel,snp

A few words of caution: from my conversations with GATK users, this is not a typical use case of the software. In addition, although the reference genome is GRCh37/hg19, there are many revisions of this file, and to this day I have not been able to figure out which one was used to create the VCF files. As such, there might be misalignments of the genome.
There were collisions between the INDEL and SNP files, and a prioritization strategy was necessary (in this case, I chose -priority indel,snp).
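
The idea of a priority on collisions can be shown with a toy Python sketch. This is not how GATK implements CombineVariants internally. It only illustrates the principle under a simplifying assumption: two records collide when they share the same chromosome and position. The records and the function name are made up.

```python
def merge_with_priority(prioritized, other):
    """Merge two lists of (chrom, pos, ref, alt) records.

    When both lists contain a record at the same (chrom, pos), the one
    from `prioritized` is kept and the one from `other` is dropped.
    """
    merged = {(rec[0], rec[1]): rec for rec in other}
    merged.update({(rec[0], rec[1]): rec for rec in prioritized})
    # Sort by chromosome name (plain string order), then by position.
    return [merged[key] for key in sorted(merged)]


# Made-up records: both files report something at chr1:100.
indels = [("chr1", 100, "AT", "A")]
snps = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]

merged = merge_with_priority(prioritized=indels, other=snps)
for record in merged:
    print(record)
```

With the INDEL list prioritized, the SNP reported at the colliding position is dropped and the non-colliding SNP is kept.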
The merged file is then annotated with a database file (dbsnp_138.hg19.vcf.gz, found in the same place as the reference files) that correlates sequences to known SNPs.
# Annotate the merged VCF with the rsIDs of known SNPs from dbSNP
java -jar GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R Reference/ucsc.hg19.fasta \
-V sequenced_genome.merged.vcf \
-o sequenced_genome.annotated_snp.vcf \
--dbsnp Reference/dbsnp_138.hg19.vcf

The output can then be uploaded to Promethease.
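
Conceptually, the annotation step is a lookup: each variant is matched against the database and, when found, its ID column is filled with the known SNP identifier. The Python sketch below illustrates that idea only. The matching key (chromosome, position, REF, ALT) is a simplifying assumption, and the rs number is an invented placeholder, not a real dbSNP entry.

```python
def annotate_ids(records, known_snps):
    """Fill in the ID of records whose (chrom, pos, ref, alt) appears in known_snps."""
    annotated = []
    for chrom, pos, rsid, ref, alt in records:
        if rsid == ".":  # "." means the ID is missing
            rsid = known_snps.get((chrom, pos, ref, alt), ".")
        annotated.append((chrom, pos, rsid, ref, alt))
    return annotated


# Placeholder database with a single invented entry.
known_snps = {("chr1", 12345, "A", "G"): "rs0000001"}

# Made-up, un-annotated records as (chrom, pos, id, ref, alt).
records = [("chr1", 12345, ".", "A", "G"), ("chr1", 23456, ".", "C", "T")]

annotated = annotate_ids(records, known_snps)
for record in annotated:
    print(record)
```

Variants that are not in the database keep "." as their ID, which is why a service that relies on SNP identifiers can only comment on the annotated ones.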

Working with BAM files

Since there were a couple of issues with the VCF approach (e.g. it is outside the normal process flow, and I could not be sure about the quality of the output because of the uncertainty about the reference file), I asked DanteLabs to send me the BAM file (the original raw data). The file is huge (120GB) and was sent on a memory stick. From the original BAM file, it was relatively straightforward to create a single annotated VCF file:
# Call variants from the BAM file and write a GVCF annotated with dbSNP rsIDs
java -Xmx10g -jar GenomeAnalysisTK.jar \
-R /Reference/ucsc.hg19.fasta \
-T HaplotypeCaller \
-I /media/sf_wgs/sequenced_genome.bam \
--emitRefConfidence GVCF \
-maxAltAlleles 24 \
--dbsnp /Reference/dbsnp_138.hg19.vcf \
--output_mode EMIT_ALL_SITES \
-o /media/sf_wgs/sequenced_genome.g.vcf

A few comments:

  • This command creates a GVCF file, which is comparable to a VCF except that it produces a record for every site, not only the ones that differ from the reference genome. This is useful for Promethease, as all your SNPs will be analyzed, not just the ones that differ from the reference genome. Beware, however, as the output is large (20GB).
    To obtain the smaller VCF file instead, remove --emitRefConfidence GVCF and --output_mode EMIT_ALL_SITES.
  • The files involved in this processing are rather large and unless you created a virtual image to accommodate large files, it is simpler to mount them from a remote drive.
  • -Xmx sets the maximum amount of memory that Java can use. The process is memory and CPU intensive, and without this setting it will likely crash (unless you have a powerhouse rig).
  • -maxAltAlleles raises the maximum number of alternate alleles that can be presented to the engine. The default is six. This might increase accuracy in some cases. However, as the process is both CPU and memory intensive and scales exponentially with the number of alternate alleles, it will make the processing that much longer.
    The process is intensive: to process my BAM file, the software ran for four days! However, at the end I had what I needed: a 20GB file to be uploaded into Promethease. One piece of advice: compress it with gzip first, otherwise the upload will most likely fail, in addition to costing extra bandwidth.
    The resulting report was as comprehensive as expected, costing me one sleepless night just to start understanding the main points.
    Always request the BAM file from your sequencing company: it’s free and gives you access to the most useful data format, since everything else can be derived from it.
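
The difference between a GVCF and a plain VCF can be illustrated with a small Python filter. In GATK GVCFs, sites that match the reference carry only the placeholder allele `<NON_REF>` in the ALT column, so keeping the records with at least one other ALT allele leaves the variant sites. This is only a rough sketch for getting a feel for the data, with made-up example lines. It is not a substitute for regenerating a proper VCF with GATK.

```python
def variant_lines_only(gvcf_lines):
    """Yield header lines plus the records that carry a real alternate allele."""
    for line in gvcf_lines:
        if line.startswith("#"):
            yield line  # keep the header untouched
            continue
        alt = line.split("\t")[4]
        real_alts = [a for a in alt.split(",") if a not in ("<NON_REF>", ".")]
        if real_alts:
            yield line


# Made-up GVCF excerpt: one header line, one reference block, one variant site.
gvcf = [
    "##fileformat=VCFv4.2\n",
    "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=150\tGT\t0/0\n",
    "chr1\t151\t.\tC\tT,<NON_REF>\t50\t.\t.\tGT\t0/1\n",
]

kept = list(variant_lines_only(gvcf))
print("".join(kept), end="")
```

Because the function is a generator over lines, it can also be fed a file handle, for example one opened with `gzip.open(path, "rt")`, without loading a 20GB file into memory.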

Tools from Sequencing.com

Another toolset worth mentioning is sequencing.com, an app-store for everything related to genes. Although my use of the platform was minimal due to the immaturity of the apps relative to what I wanted to do, its approach is promising: the apps are much more user-friendly than GATK and do not require setting up a whole OS/environment. Two apps are worth mentioning:
  • EvE: converts between different formats, annotates the data and performs multiple other processing operations that only true geneticists would understand. However, at the time of my little investigation, it was not able to combine the two files I had (namely the INDEL and SNP VCFs). I will say this though: the support was tremendous, and the support staff tried to combine the two files manually for me.
  • Genome-vcf: I was not aware of this app, but Promethease recommends it to derive a VCF or GVCF from a BAM file. It might be significantly more straightforward than GATK. Once again, the moral of the story is to obtain the BAM files from your sequencing company.
The field is evolving at an incredible speed, and I would expect new functionality/new apps to be now available on the platform.
GEDmatch

The largest downside of Promethease is how little information is available for genealogy, especially from the autosomal chromosomes. Fear not: GEDmatch, a site specializing in just that, promises to provide you with information about your origins. For some reason, however, it does not look at the X and Y chromosomes, just the remaining 22 pairs of autosomal chromosomes. Since these chromosomes receive a contribution from every ancestor, it does not give one specific tribe of origin, but rather a statistical distribution of the different locations you might be from. The distributions depend on the specific model, so the results might vary somewhat depending on the model chosen. The site also allows additional fun stuff, such as matching genomes to see how they are related.
GEDmatch allows you to upload raw data from different genealogy companies. However, at the time of my investigation, (g)VCFs were not accepted, so I had to convert (again) into an acceptable format. There seem to be a few tools that can convert VCF to the 23andMe text format:

PLINK2: A command-line utility, plink2 can be used in the Linux environment after obtaining annotated VCF files:
# Call variants from the BAM file into a regular (variant-only) VCF
java -Xmx10g -jar GenomeAnalysisTK.jar \
-R /Reference/ucsc.hg19.fasta \
-T HaplotypeCaller \
-I /media/sf_wgs/sequenced_genome.bam \
-maxAltAlleles 24 \
--dbsnp /Reference/dbsnp_138.hg19.vcf \
-o /media/sf_wgs/sequenced_genome.vcf

# Keep SNPs only and export them in 23andMe text format
./plink2 --vcf sequenced_genome.vcf --snps-only --recode 23

Alternatively, EvE from sequencing.com can also be used to convert VCFs to the 23andMe format. Upload the new file into GEDmatch and get your heart’s fill of information about the geographical regions your family tree most likely came from.
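
For a feel of what this conversion produces, here is a minimal Python sketch that turns one biallelic SNP call into a line of the 23andMe text layout, assuming that layout is four tab-separated columns: rsid, chromosome, position and genotype. The function name and the rs numbers are placeholders, and missing calls such as "./." are deliberately not handled. For real data, use plink or EvE as described above.

```python
def to_23andme_line(chrom, pos, rsid, ref, alt, gt):
    """Turn one biallelic SNP call into an 'rsid chromosome position genotype' line."""
    alleles = [ref, alt]
    # GT looks like "0/1" or "1|1": index 0 means REF, index 1 means ALT.
    indexes = gt.replace("|", "/").split("/")
    genotype = "".join(alleles[int(i)] for i in indexes)
    # hg19-style names carry a "chr" prefix that is dropped here.
    chrom = chrom[3:] if chrom.startswith("chr") else chrom
    return f"{rsid}\t{chrom}\t{pos}\t{genotype}"


# Invented example calls.
print(to_23andme_line("chr1", 12345, "rs0000001", "A", "G", "0/1"))
print(to_23andme_line("chrX", 5, "rs0000002", "C", "T", "1|1"))
```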

Y-DNA genealogy - Manual analysis

This was at the same time the most fun and the most tedious part of this project (… a long sleepless night…). Promethease does reference SNPs correlated to the different Haplogroups. As Haplogroups are subsets of each other (e.g. Y-Haplogroup A being the prototypical Adam), one will belong to multiple Haplogroups, each subset bringing us closer to the present time. However, no details are provided for each SNP, making it necessary to research the meaning of each one. Furthermore, in this case, the SNP database will only take you back to a few millennia BC. For the rest, manual work might be necessary (unless, of course, there is a better way).
Starting from my Haplogroup (Y-R1b, since you’re asking), I started researching possible subgroups. I would then figure out whether I have the corresponding mutation and repeat the process one subgroup down. Like peeling the layers of an onion, really. To view the VCF file, I used the IGV viewer. Just make sure you don’t load a GVCF! As it has an entry for each SNP, whether different from the reference genome or not, the amount of data will make the software freeze. To correlate the haplotype name to its location on the genome, I can suggest this site: https://www.genetichomeland.com/welcome/dnamarkerindex.asp.

It provides the location for different reference genomes, and so far it has not failed me once. For the different Haplogroups and subgroups, however, you are on your own. I was not able to find a single site that would be relevant for all of them, and it took a lot of searching to find subgroups that were relevant to my case.
Through this, I was able to get to circa 1000 AD; after that, the trail goes cold (waiting for more subgroups to be found). Not sure what that got me, but it was amusing to play detective for a while. I am sure I could contact some genealogy groups to discuss our common heritage, but this seems superfluous right now.
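
The onion-peeling loop described above could be sketched in Python as follows. Everything in the example data is hypothetical: the subgroup names, the tree and the chrY positions are placeholders, not real markers. The check is also simplified to "is there a variant at the marker's position", which ignores which allele was actually found.

```python
def deepest_subgroup(start, children, marker_positions, my_variants):
    """Walk down the tree for as long as one child's defining SNP is in my_variants."""
    current = start
    path = [current]
    while True:
        matches = [child for child in children.get(current, [])
                   if marker_positions[child] in my_variants]
        if not matches:
            return path  # no deeper subgroup matches: the trail goes cold here
        current = matches[0]
        path.append(current)


# Hypothetical tree and marker positions (placeholders only).
children = {"R1b": ["R1b-sub1", "R1b-sub2"], "R1b-sub1": ["R1b-sub1a"]}
marker_positions = {
    "R1b-sub1": ("chrY", 1000),
    "R1b-sub2": ("chrY", 2000),
    "R1b-sub1a": ("chrY", 3000),
}
# Positions at which the (made-up) VCF reports a variant.
my_variants = {("chrY", 1000), ("chrY", 3000), ("chrY", 9999)}

path = deepest_subgroup("R1b", children, marker_positions, my_variants)
print(" > ".join(path))
```

The hard part in practice is building the tree and the marker table, since that information is scattered across many sites.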

In any case, here is the first chapter of my little adventure in genetics. Feel free to comment about your experience or the things I got wrong. As time goes on, I am sure there will be more useful information to get out of the data.

Thank you for this excellent post! I think as sequencing becomes cheaper many people will have their genome sequenced and I am glad you have shared your experience! I think one of the major factors (other than cost) is the privacy issue. Many companies will use your DNA information, and although they can say it will be used in an anonymous way, there is no guarantee. What is the rationale for having to get a doctor's approval for whole genome sequencing? Shouldn't a person have the freedom to choose to have (or not to have) their genome sequenced independent of a doctor's approval?
I look forward to your future posts!
Cheers!
Ian
