How to compare GPS tracks

in #programming8 years ago

I've been using my Garmin Forerunner 210 to record my running progress. Like many gadgets it has a built in GPS receiver, allowing the watch to not only record how long my run is, but also where I run.

Running the same route every time can be really boring. To keep things interesting, I take slight detours on my routes. Then a crazy idea popped into my head: can I automatically compare GPS tracks to see where they differ?


Aligning Protein Sequences

Aligning protein sequences is really useful. It allows bioinformaticians to find regions of similarity which may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Sequence 1: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Sequence 2: TCCPSIVARRSNFNVCRLTPEAICATYTGCIIIPGATSPGDYAN

When the above sequences are aligned, gaps are shown to indicate insertions and deletions.

Alignment 1: TTCCPSIVA-RSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Alignment 2: -TCCPSIVARRSNFNVCRL--TPEAICATYTGCIIIPGATSPGDYAN

One of the earliest methods for aligning sequences is the Needleman-Wunsch algorithm, published in 1970.

At the heart of the algorithm is the scoring system, which is used to determine how a pair of residues should align. The higher the score for a given combination, the more it is biased in the resulting alignment.

A gap penalty is also applied during alignment. If the gap penalty is greater than the score for aligning two residues, a gap is introduced in the alignment instead.

You can use any scoring system you like, depending on the problem you're trying to solve. BLOSUM is a similarity matrix commonly used as the scoring system to align protein sequences. It takes into account the liklihood that a residue will mutate into another based on what has already been observed in nature.

Aligning GPS tracks

How many of you skipped the bit about protein sequences? Come on, be honest. It's okay, I don't mind, but you might be confused if you don't read that first.

Hopefully you can already see the connection: instead of aligning residues in protein sequences, we'll be aligning co-ordinates in GPS tracks. The bits that don't align, the gaps, will be the sections of the tracks that differ.

But what scoring system and gap penalty should be used?

When we compare tracks by eye, we look at the distance between points. We can use the same metric to score GPS points in the alignment; however, the Needleman-Wunsch algorithm considers higher values a match.

Higher distances mean the points are further apart and therefore non-matching, so we need to use the negative distance as the score, which results in larger distances having smaller values.

The gap penalty works as the cut-off for when the distance between two points is considered large enough that they don't represent the same location. As the distance is always negative, the gap penalty distance must also be negative.

There is no universal value for this cut-off. It depends on the quality of the data and the context.

If the GPS device is accurate to 5 metres, setting the cut-off to 2 metres will result in a noisy alignment. Likewise, if the tracks being compared are from planes, you might consider them on the same flight path if they are within 200 metres of one another; for tracks of people walking, this distance will be far too large.

So what happens if we use all this information to align two GPS tracks stored in a GPX file and plot them on a map, showing the matched points in blue and the differences in red and orange?

It's not bad, but also not perfect. It's found the major detour at the bottom-left of the map (in red), but there are also a lot of incorrect mismatches spread throughout the tracks (in orange). What's causing this?

The GPS points in the two tracks are not evenly distributed, resulting in points that are far apart being compared when the density in a section of one track differs greatly from the other. If you look at the locations of the orange points in the alignment, they match very closely to the low-density regions in the plot of the first track shown in the introduction.

This is fixed by pre-processing the GPX files to distribute GPS points evenly over the track. The closer together the points, the more accurate the result, but the longer the alignment will take.

So how does it look now? Amazing!

Show me the code

The implementation of the Needleman-Wunsch algorithm, adapted for GPS tracks, is shown below:

def align_tracks(track1, track2, gap_penalty):
    """ Needleman-Wunsch algorithm adapted for gps tracks. """

    _log.info("Aligning tracks")

    def similarity(p1, p2):
        d = gpxpy.geo.distance(p1.latitude, p1.longitude, p1.elevation,
                               p2.latitude, p2.longitude, p2.elevation)
        return -d

    # construct f-matrix
    f = numpy.zeros((len(track1), len(track2)))
    for i in range(0, len(track1)):
        f[i][0] = gap_penalty * i
    for j in range(0, len(track2)):
        f[0][j] = gap_penalty * j
    for i in range(1, len(track1)):
        t1 = track1[i]
        for j in range(1, len(track2)):
            t2 = track2[j]
            match = f[i-1][j-1] + similarity(t1, t2)
            delete = f[i-1][j] + gap_penalty
            insert = f[i][j-1] + gap_penalty
            f[i, j] = max(match, max(delete, insert))

    # backtrack to create alignment
    a1 = []
    a2 = []
    i = len(track1) - 1
    j = len(track2) - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           f[i, j] == f[i-1][j-1] + similarity(track1[i], track2[j]):
            a1.insert(0, track1[i])
            a2.insert(0, track2[j])
            i -= 1
            j -= 1
        elif i > 0 and f[i][j] == f[i-1][j] + gap_penalty:
            a1.insert(0, track1[i])
            a2.insert(0, None)
            i -= 1
        elif j > 0 and f[i][j] == f[i][j-1] + gap_penalty:
            a1.insert(0, None)
            a2.insert(0, track2[j])
            j -= 1
    return a1, a2

There are more examples, along with the code to generate your own alignments, in the jonblack/cmpgpx repository on GitHub. It's in the public domain, too, so do what you please with it. Yay!


This was originally posted on my blog.

Sort:  

Hi! This post has a Flesch-Kincaid grade level of 8.9 and reading ease of 68%. This puts the writing level on par with Leo Tolstoy and David Foster Wallace.

I've been trying to break into bioinformatics (to no avail) for some time now, so I really appreciate this example. I really appreciate that the code you posted is written in Python, because that's what I'm heavily into these days. There are tons of applications for your idea especially for traffic management and rerouting.

Based on the problem you posited, my first thought for a solution was similar to how noise from images are removed. Since we're talking about coordinates on a 2d map, I feel that it could be used in conjunction with your sequencing algorithm. I've had more experience (and success) dealing with images that genetic sequencing, so my stand is a bit biased.

Here's another resource I found that discussed a different algorithm for the same problem: http://gis.stackexchange.com/questions/81551/matching-gps-tracks

When you told me that you weren't in the field of psychology, I didn't imagine that we belong in the same field. Though I bet you're miles ahead from where I'm currently at.

The book An Introduction to Bioinformatics Algorithms is a good place to start. The author also has a course on Coursera.

It's funny you mention image noise removal because Needleman-Wunsch is also used for that, particularly with stereoscopic 3D images, where one side can be considered a mutated form of the other.

There are lots of other possibilities as well. Needleman-Wunsch is a global alignment algorithm which finds the optimal solution. With two sequences/tracks this isn't a problem, but if you want to compare a track to a database, it would take too long. To solve this problem, a local alignment algorithm would be much more efficient but not optimal. A popular algorithm for this is BLAST (Basic Local Alignment Search Tool). I've been meaning to implement this for GPS tracks but haven't found the time.

The psychology articles are things I'm interested in, plus it helps that my girlfriend is a psychologist :)

I've actually enrolled in the Coursera specialization. I didn't get the verified course, but I passed all the courses. It's hard to break into the industry though. Having a terrible time finding (remote) employment in the field.

I've actually used BLAST in genetic sequencing, so I already have a brief notion of its capabilities for comparing GPS tracks.

Coin Marketplace

STEEM 0.16
TRX 0.13
JST 0.027
BTC 60589.35
ETH 2628.62
USDT 1.00
SBD 2.53