[an error occurred while processing this directive]
Checklist: Dynamic Programming
Frequently Asked Questions: Sequence Alignment
|
What are the main goals of this assignment?
You will (i) solve a fundamental problem in computational biology,
(ii) learn about an important application in computer graphics, and (iii)
learn about a powerful programming paradigm known as dynamic
programming.
How do I tell Java to use more of my computer's memory? By
default, Java will only use 64MB of memory (not very much for this
application). You must explicitly ask for more by executing
with
java -Xmx300m EditDistance < input.txt
The 300m means 300MB, and you should
adjust this number depending on the amount of memory your computer has
and the size of the arrays you will need for the data set you are
running. (Everyone should be able to request 600m which should get
you through ecoli10000.txt)
How do I determine how much memory I have? On Mac,
select About this Mac from the Apple menubar. You can also
observe the CPU and memory usage using the Activity Monitor. On
Windows, type Ctrl-Shift-Escape, and look at the last tab so see CPU
and memory usage.
Can I assume that the input characters will always be A, C, G or T?
NO! Your program should work equally well for any letter, upper case or
lower case.
It seems strange to define x[i..M] to be the substring consisting
of x[i] through x[M-1] instead of x[i]
through x[M]. Is this a typo?
It's a little strange, but no, it's not a typo. It's consistent
with Java's indexing notation where the left endpoint is inclusive
and the right endpoint is exclusive.
Which alignment should I print out if there are two or more
optimal ones?
Output any one you like.
Testing: Sequence Alignment
|
Testing.
To help you check the part of your program that finds the edit distance,
the edit distances of
gene57.txt, stx1230.txt, and ftsa1272.txt are
8, 521, and 758, respectively.
To help you check the part of your program that generates the alignment,
there are several short test files in the sequence directory whose alignments
are easy to generate manually.
In addition, we require that you generate one input file of your own to be
used for testing special cases. Create a new input file with some
interesting property. Then test your code using your file and make sure
your program behaves as expected.
Possible Progress Steps: Sequence Alignment
|
These are purely suggestions for how you might make progress. You do
not have to follow these steps.
- Download DataPair.java
and Path.java Download and unzip sequence.zip to your system. It contains sample
data files.
- Create a skeleton Match.java file including the
simple test routing below:
public class Match {
public static Path match(DataPair dp) {
return null; // placeholder so Match will compile
}
public static void main(String[] args) {
TestMatch tm = new TestMatch();
Path path = Match.match(tm);
for (Path p = path; p != null; p = p.next)
System.out.println(p.row + " " + p.col);
}
}
class TestMatch implements DataPair {
public int rows() { return 10; }
public int cols() { return 10; }
public int cost(int row, int col, int dir) { return baseCost(row, col); }
public int baseCost(int row, int col) { return Math.abs(row - col); }
}
- Declare and initialize
a dp.rows() x dp.cols() array Path
opt[][] in Match.match(). Set the row and col fields of
each element to correspond to the node's row and column in the array.
Print out the 2D array to check your work.
To print the matrix out in nicely formatted columns, use
System.out.printf("%2d %2d %3d",
opt[row][col].row,
opt[row][col].col,
opt[row][col].cost);
with nested for loops. Remember to remove this debugging print
section before submitting the final version of your program.
-
Now, it's time to implement the crucial dynamic programming part.
First read the dynamic programming portion of
Section
9.6 and make sure you understand how dynamic programming works.
Think carefully about the order in which you will perform the
computation since this is critical. Hint: first fill in the base case
of opt[row][col], i.e.,
when row == dp.rows() - 1
or col == dp.cols() - 1. Now, fill in
the remaining values using a nested loop. Test that it works by
printing out the contents of opt. Make sure you update the
cost in each node to the total cost so far and also update the next
pointer to keep track of which direction you came from.
-
Test Match.match() using the TestMatch class
and main method from the template above. TestMatch
is a very simple implementation of the DataPair interface
that yields a 10x10 matrix where the cost of matching a row i
and column j is the absolute value of the difference
between i and j. In particular, the cost of
matching along the diagonal from upper-left to lower-right is 0, and
the cost of matching anywhere else in the matrix is larger than zero.
So Match.match() should return a path that follows this
diagonal, and your output should be:
% java Match
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
This test is a sanity check that Match.match() is doing
something reasonable, but is too simple to guarantee everything works.
You may want to modify TestMatch or create additional test
classes to create more complex test cases as well. Since these
classes and Match.main() exist only for testing purposes, you
should leave them and any debug output in them when you
submit Match.java. However you should remove any print
statements from Match.match() that you add for debugging
purposes.
- Now it is time to write EditDistance.java. Create
private String instance variables for the two strings you are
matching, as well as any other data you need to keep track of. Since
the base case for edit distance involves matching the final characters
of one string with "gaps" in the other, you need one additional row
and column beyond the length of your two strings. baseCost()
and cost need to take this extra row and column into account.
There is more than one way of handling this base case, and you may use
any reasonable method that works.
- Make sure when you print out the match
in EditDistance.main() that you handle the base case of gaps
at the end of one of the strings correctly. Finally, make sure you do
not print the last node in the path, which will always correspond to
matching gaps at the end of both strings (the lower-right corner of
the opt matrix).
Enrichment: Sequence Alignment
|
- The idea of dynamic programming was first advanced by Bellman
(1957). Here is
an
interview in which he discusses the birth of dynamic programming
and the origin of the name. Levenshtein (1966) formalized the notion
of edit distance.
Needleman-Wunsch
(1970) were the first to apply edit distance and dynamic programming
for aligning biological sequences, and our algorithm is essentially
the one proposed in their seminal paper. The widely-used
Smith-Waterman
(1981) algorithm is quite similar, but solves a slightly different
problem (local sequence alignment instead of global sequence
alignment).
-
The genetic data are taken from
GenBank.
The National Center
for Biotechnology Information also contains
many examples of such database and alignment software.
Frequently Asked Questions: Image Stitching
|
Do I need to include a main method in ImageSeam?
You do not have to include a main method, but if you write
one for testing purposes, you should not remove it. You should remove
any debugging-related code from the other methods however.
How robust should ImageSeam be to invalid input?
The ImageSeam constructor and getStitched methods
should handle invalid pictures as described in the assignment.
However the other methods may assume they will only be called if the
data is valid.
Possible Progress Steps: Image Stitching
|
These are purely suggestions for how you might make progress. You do
not have to follow these steps.
- Complete Part 1 of the assignment. Image stitching is
conceptually easier than sequence alignment and involves a simpler
cost function, but it's harder to debug because the intermediate and
final results are much harder to evaluate by hand.
- Implement rows(), cols(), baseCost(),
cost(), and the constructor.
- Write a main method that creates two 10x10 pictures. Set all
pixels of the first picture to Color.BLACK. Set all pixels
of the second picture to Color.WHITE except the
principal diagonal (the diagonal running from the upper-left to
lower-right corners). Set the pixels along that diagonal
to Color.BLACK. Create an ImageSeam object using
these two images, and print out the path computed in the constructor.
The path should follow the principal diagonal where the two pictures
are both black.
- Implement getStitched(). This is the trickiest method
in ImageSeam. Remember that the first node in the path will
always be at row 0 and column 0. The next node in the current path
will always be one to the right, one below, or one to the right and
below the current one.
- Update main() to call getStitched() and
display the result. You should get an image with one white triangle
and one black triangle.
- Test the image examples from the assignment
using Stitcher. Make sure you
have not mixed up which image getStitched() uses for the
portion to the left of the seam, and which portion it uses to the
right of the seam.
Enrichment: Image Stitching
|
- There are numerous tools available for building panoramas,
including Photoshop. One free tool
is Hugin
- A closely related problem to image stitching
is seam
carving (demos here
and here), which uses dynamic
programming to figure out how to intelligently resize pictures to
arbitrary dimensions. The basic dynamic program is identical
to ImageSeam but with a slightly different cost function.
- It makes sense that the seam should run from upper-left to
lower-right because thoses are the corners of the region where the two
images overlap. But there is no reason the seam should always proceed
down and to the right. What we really want is the path whose total
cost is lowest, no matter how it meanders through the image. This is
called the Shortest
Path problem, and is a classic problem in computer science. It
can be solved
using Dijkstra's
algorithm. One way to find the optimal seam for stitching several
images together simultaneously
is graph
cuts. This is
an early
example of using graph cuts to stitch images.
- Another important issue in stitching panoramas is correcting for
exposure differences between images. This can be done by adjusting
the brightness and contrast of images so they match, but a more
successful method is to work with the image gradients, or
changes in color, rather than the pixel colors themselves. These are
stitched just like the colors, and the resulting gradient field is
reintegrated to obtain
the final
image. This technique is now widely used to manipulate images
because it works so well.