How I Analyze DNA with Biopython

cristi (70)in #bioinformatics • 8 years ago

With its highly readable syntax and uncluttered visual layout, Python is one of the most versatile and adopted programming language. Many programmers can ignore that Python is not as fast as lower level languages when they have the modularity and the ability to scale things faster. Plus, the community is nothing short of amazing.

Speaking of modularity, in Python there's a module for everything. Whether you want to do complex math, statistics, machine learning, text processing and text recognition, penetration testing and even bioinformatics, there's at least one module you can work with.

In fact, my interest in bioinformatics is what allowed me to go over the struggle of learning to program. I tried to teach myself coding several times and I failed every single time. It was earlier this year when I started to take off from the ground and it was because of my active interest in genomics and bioinformatics.

This is how I discovered Biopython, a Python module. With something specific to practice on and to play with everyday, I started having a better understanding of both worlds.

In this post I will show you some basic stuff you can do with the Biopython module; and if you're interested you can learn on your own from here on. In future posts I will discuss other interesting/helpful Python modules.

Playing with DNA - The Biopython Way

While there are several modules geared towards bioinformatics, the one that I learned and practice with is Biopython. It was my best choice because I found several books discussing it at length, one of these books being the free Biopython Cookbook that you can read here. It's like a hands-on approach, 400 pages long. I've inspired the following examples from the book.

Before getting into the nitty-gritty I'm going to make a couple of assumptions:

you have some knowledge of biology, especially genomics
you have some knowledge of computation and programming languages

I am not going to explain genomics and DNA sequences, what Python is or how to install it because the post would turn to be unbearably long. There are countless tutorials/lectures/videos freely available out there for that purpose. Here I'm trying to be very specific:

how you can use Biopython in computational biology.

According to the Cookbook:

"The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology."

To install this module, go to the download page here. Make sure to install dependencies before you install Biopython.

I use Biopython in Windows, but you can use it in Mac or Linux as well. In Windows, for easier installation, you could get the 'wheels' from here (dependencies and latest Biopython) and install them via pip.

Now I'll show you some stuff you can do with it.

1. Instead of making up a DNA sequence, I'll take a real genetic sequence from the National Center for Biotechnology Information (NCBI):

go to NCBI homepage
search for 'tiger' (lack of inspiration :) )
use the sequence:

Panthera tigris voucher IRSNB-908B cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial

I'll click on Fasta
copy the sequence:

CTGATTGGCCACTCTTCACGGGGGTAATATTAAATGGTCTCCCGCTATACTATGGGCTTTGGGATTCATTTTCCTATTCACCG
TAGGGGGCTTAACAGGAATTGTATTAGCAAACTCCTCATTGGATATTGTCCTTCACGACACATACTACGTAGTAGCCCACTTC
CACTACGTCTA

alternatively I could save it as Fasta or Genbank file and work with it directly. Biopython knows both extensions.

2. In Python

importing Biopython Seq and Alphabet methods
inputing our sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> myseq = Seq('CTGATTGGCCACTCTTCACGGGGGTAATATTAAATGGTCTCCCGCTATACTATGGGCTTTGGGATTCATTTTCCTATTCA
CCGTAGGGGGCTTAACAGGAATTGTATTAGCAAACTCCTCATTGGATATTGTCCTTCACGACACATACTACGTAGTAGCCCACTTCCACTACGTCTA',
IUPAC.unambiguous_dna)

3. Let's find the complement sequence of myseq:

>>> myseq.complement()

You can see it shows truncated (not in its entirety). You could use the print function to see the full sequence:

>>> print(myseq.complement())

similarly, you find the reverse complement with myseq.reverse_complement()

4. Assuming that myseq is a coding sequence, let's transcribe it to messenger RNA.

According to the cookbook:

"The actual biological transcription process works from the template strand, doing a reverse complement (TCAG ! CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T ! U."

>>> mrna = myseq.transcribe()
>>> mrna

5. Let's translate the messenger RNA to protein:

>>> mrna.translate()

5. Parsing full records

search the Entrez databases (like Pubmed, Genbank, Nucleotide, etc) with Bio.Entrez.esearch():

>>> from Bio import Entrez
>>> Entrez.email = "[email protected]" 
>>> handle = Entrez.esearch(db="pubmed", term="human monoclonal antibodies")
>>> record = Entrez.read(handle)
>>> record["IdList"]

And you get a list of Pubmed IDs of articles related to your search:

you could retrieve the full entry for an id with Efectch:

>>> handle = Entrez.efetch(db="nucleotide", id="186972442", rettype="gb", retmode="text")
>>> print(handle.read())

Here I retrieved the id 186972442 from the nucleotide database as Genbank file (rettype = 'gb' and retmode = 'text').

What else can you do with Biopython?

find the lineage of an organism
run BLAST (basic local alignment search tool) with Bio.Blast
work with Swiss-Port and ExPaSy
3D protein rendering with the PDB module
Population genetics
Phylogenetics and Sequence motif analysis
machine learning (!!!)
and many other useful operations for bioinformaticians and computational biologists.

For this tutorial I used the Cookbook version from December 2015. The latest (which I linked above and here - the pdf) is from August 25, 2016. So, it's 'extremely' recent.

Ending Thoughts

If you want to follow the same learning curve as I did (genomics + programming), there are a few free resources that I'm going to provide to get you started. These are the ones that I used:

Introduction to Biostatistics and Bioinformatics from NYU

Course Materials: here
My favorite lecture: IBB 4: Biopython

Dr. Martin Jones' books:

Python for Biologists
Advanced Python for Biologists

Via, Rother, and Tramontano's book:

Managing your Biological Data with Python

there are a few other books that I wish to go through next (on my watchlist), but I'm only going to mention one:

Haddock and Dunn - Practical Computing for Biologists

Of course, you can watch many other free lectures on Youtube and go through course materials from top universities.

The only thing that matters, in my opinion, is to actually do the work. Start with 1 book or 1 course and get your hands dirty. Practice deliberately and do it consistently.

To stay in touch, follow @cristi

#bioinformatics #programming #genomics #research

Credits for Images: Dna Strands via Wikimedia Commons and Biopython Logo.

Cristi Vlad, Self-Experimenter and Author

#programming #genomics #research

8 years ago in #bioinformatics by cristi (70)

$11.57

Sort:

Trending

[-]

bitcalm (64) 8 years ago

You'll probably be interested in my (rather underappreciated) how to compare gps tracks post.

$0.00

[-]

cristi (70) 8 years ago

you know, I check the bioinformatics tags :) and I saw you posted it a few weeks ago.

$0.00

[-]

runaway-psyche (57) 8 years ago

Awesome post, thanks for sharing. I can't wait to see what the future of bioinformatics looks like!

$0.00

[-]

jlufer (75) 8 years ago

Thank you for sharing this material, I like what you posted. Thank you so much

$0.00

[-]

cristi (70) 8 years ago

what do you like about it?

$0.00

[-]

justtryme90 (71) 8 years ago

Interesting application of python, didn't know this existed. I don't know if it would save me any time over any of the available internet tools though. Will think about looking more into this. Thanks for posting about it @cristi!

$0.00

[-]

cristi (70) 8 years ago

it could save time if you script things, not doing them in the interpreter like I did here :) think about it.

$0.00

[-]

gargon (71) 8 years ago

This is soooooo good, loved this post, followed four times if i could lol

$0.00

[-]

cristi (70) 8 years ago

shall I consider this a troll post?

$0.00

[-]

lemouth (72) 8 years ago

Four times maybe? :)

$0.00

[-]

thecryptofiend (74) 8 years ago

Fantastic post. I really must learn at least one programming language. Would Python be the best one to start with?

$0.00

[-]

cristi (70) 8 years ago

any kind of programming language is good to start with. for me it was python, as I said in the end, what's important is to code everyday, consistently!

$0.00

[-]

lemouth (72) 8 years ago

Python is indeed very powerful. In particle physics, many largely used software have a pythonic interface and a C++ or Fortran core.

$0.00

[-]

cristi (70) 8 years ago

now that I have that particle physics books maybe I can find some applications to work with .

$0.00

[-]

lemouth (72) 8 years ago

I can actually show you what I am developing with a colleague, if you are interested. That's an 'LHC analysis made easy' pythonic tool, that is actually used a lot by high-energy physicists (and we are missing developers). And let's stop the advertisement here ^^

$0.00

[-]

cristi (70) 8 years ago

dont get me excited! :D really, I'm interested.

$0.00

[-]

lemouth (72) 8 years ago

I was actually thinking about writing a post about the idea behind this program. I will take your comment as a further motivation. When the post will be written, maybe could we then discuss about it on the chat.

What do you think?

$0.00

[-]

cristi (70) 8 years ago (edited)

Im up for that

$0.00

STEEM 0.25

TRX 0.11

JST 0.032

BTC 61041.41

ETH 2947.17

USDT 1.00

SBD 3.85

How I Analyze DNA with Biopython

Playing with DNA - The Biopython Way

1. Instead of making up a DNA sequence, I'll take a real genetic sequence from the National Center for Biotechnology Information (NCBI):

2. In Python

3. Let's find the complement sequence of myseq:

4. Assuming that myseq is a coding sequence, let's transcribe it to messenger RNA.

5. Let's translate the messenger RNA to protein:

5. Parsing full records

What else can you do with Biopython?

Ending Thoughts

To stay in touch, follow @cristi

Coin Marketplace