CIS 110 Summer 2012 - Introduction to Computer Programming

Java | /r/compsci (reddit) | PennPortal |
upenn.edu | directories | van pelt library
seas.upenn.edu | CETS | engineering library
computer graphics & game dev (SIGGRAPH @ Penn) | dining philosophers (DP) | science & tech wing (STWING) | women in cs (WICS)
CETS Answers

Checklist: Edit Distance

Frequently Asked Questions: Sequence Alignment

What are the main goals of this assignment? You will (i) solve a fundamental problem in computational biology, and (ii) learn about a powerful programming paradigm known as dynamic programming.

How do I access the length of a string s? The ith character? Use s.length() and s.charAt(i), respectively. As with arrays, indices start at 0. We'll learn about this notation for manipulating (String) objects in Section 3.1. For this assignment, this is all you'll need to know about objects.

How do I tell Java to use more of my computer's memory? By default, Java will only use 64MB of memory (not very much for this application). You must explicitly ask for more by executing with

java -Xmx1024m EditDistance < input.txt
The 1024m means 1024MB (1 GB), and you should adjust this number depending on the amount of memory your computer has and the size of the arrays you will need for the data set you are running. (Our implementation needs close to 3GB of RAM for ecoli7000.txt and about 5GB for ecoli10000.txt.)

How do I determine how much memory I have? On Mac, select About this Mac from the Apple menubar. You can also observe the CPU and memory usage using the Activity Monitor. On Windows, type Ctrl-Shift-Escape, and look at the last tab so see CPU and memory usage.

Can I assume that the input characters will always be A, C, G or T? NO! Your program should work equally well for any letter, upper case or lower case.

It seems strange to define x[i..M] to be the substring consisting of x[i] through x[M-1] instead of x[i] through x[M]. Is this a typo? It's a little strange, but no, it's not a typo. It's consistent with Java's indexing notation where the left endpoint is inclusive and the right endpoint is exclusive.

Which alignment should I print out if there are two or more optimal ones? Output any one you like.

Testing: Sequence Alignment

Testing.   To help you check the part of your program that finds the edit distance, the edit distances of gene57.txt, stx1230.txt, and ftsa1272.txt are 8, 521, and 758, respectively.

To help you check the part of your program that generates the alignment, there are several short test files in the sequence directory whose alignments are easy to generate manually.

In addition, we require that you generate at least one input file of your own to be used for testing special cases. Create a new input file with some interesting property. Then test your code using your file and make sure your program behaves as expected.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download Path.java. Download and unzip sequence.zip. It contains sample data files.

  2. Create a skeleton Match.java file including the simple test routine below:
    public class Match { public static Path match(String a, String b) { return null; // placeholder so Match will compile } public static void main(String[] args) { Path path = Match.match("AACAGTTACC", "TAAGGTCA"); for (Path p = path; p != null; p = p.next) System.out.println(p.row + " " + p.col); } }

  3. Write the following two short helper methods in Match.java.
    // return the penalty for aligning character a with character b private static int penalty(char a, char b) // return the min of 3 integers private static int min(int a, int b, int c)
    You will call these from your match method to compute penalties and to determine which of the three cases yields the minimum edit distance.

  4. Declare and initialize an a.length() + 1 x b.length() + 1 array Path[][] opt in Match.match(). Create a new Path for each entry of the array (they all start out null). Set the row and col fields of each element to correspond to the node's row and column in the array. Print out the 2D array to check your work.

    To print the matrix out in nicely formatted columns, use

    System.out.printf("%2d %2d %3d | ", opt[row][col].row, opt[row][col].col, opt[row][col].cost);
    within nested for loops. Remember to remove this debugging print section before submitting the final version of your program.

  5. Now, it's time to implement the crucial dynamic programming part. First read the dynamic programming portion of Section 9.6 and make sure you understand how dynamic programming works. Think carefully about the order in which you will perform the computation since this is critical. Hint: first fill in the base case of opt[row][col], i.e., when row == a.length() or col == b.length(). Now, fill in the remaining values using a nested loop. Test that it works by printing out the contents of opt. Make sure you update the cost in each node to the total cost so far and also update the next pointer to keep track of which direction you came from.

  6. Test Match.match() using the sample Match.main() method. It should print out the best path through the opt[][] matrix for the hard-coded strings, like this:
    0 0 7 1 1 6 2 2 6 3 2 4 4 3 4 5 4 4 6 5 3 7 6 3 8 6 1 9 7 1 10 8 0

    This test is a sanity check that Match.match() is doing something reasonable, but is too simple to guarantee everything works. You may want to modify TestMatch or create additional test classes to create more complex test cases as well. Since these classes and Match.main() exist only for testing purposes, you should leave them and any debug output in them when you submit Match.java. However you should remove any print statements from Match.match() that you add for debugging purposes.

  7. Next write EditDistance.java. Make sure when you print out the match in EditDistance.main() that you handle the base case of gaps at the end of one of the strings correctly. Finally, make sure you do not print the last node in the path, which will always correspond to matching gaps at the end of both strings (the lower-right corner of the opt matrix).

Enrichment

  • The idea of dynamic programming was first advanced by Bellman (1957). Here is an interview in which he discusses the birth of dynamic programming and the origin of the name. Levenshtein (1966) formalized the notion of edit distance. Needleman-Wunsch (1970) were the first to apply edit distance and dynamic programming for aligning biological sequences, and our algorithm is essentially the one proposed in their seminal paper. The widely-used Smith-Waterman (1981) algorithm is quite similar, but solves a slightly different problem (local sequence alignment instead of global sequence alignment).

  • The genetic data are taken from GenBank. The National Center for Biotechnology Information also contains many examples of such database and alignment software.