[an error occurred while processing this directive]
Checklist: Dynamic Programming
Frequently Asked Questions: Match Sequence Alignment
|
What are the main goals of this assignment?
You will (i) solve a fundamental problem in computational biology,
(ii) learn about an important application in computer graphics, and (iii)
learn about a powerful programming paradigm known as dynamic
programming.
How do I tell Java to use more of my computer's memory? By
default, Java will only use 64MB of memory (not very much for this
application). You must explicitly ask for more by executing
with
java -Xmx1024m EditDistance < input.txt
The 1024m means 1024MB (1 GB), and you
should adjust this number depending on the amount of memory your
computer has and the size of the arrays you will need for the data set
you are running. (Our implementation needs close to 3GB of RAM for
ecoli7000.txt and about 5GB for ecoli10000.txt.)
How do I determine how much memory I have? On Mac,
select About this Mac from the Apple menubar. You can also
observe the CPU and memory usage using the Activity Monitor. On
Windows, type Ctrl-Shift-Escape, and look at the last tab so see CPU
and memory usage.
Can I assume that the input characters will always be A, C, G or T?
NO! Your program should work equally well for any letter, upper case or
lower case.
It seems strange to define x[i..M] to be the substring consisting
of x[i] through x[M-1] instead of x[i]
through x[M]. Is this a typo?
It's a little strange, but no, it's not a typo. It's consistent
with Java's indexing notation where the left endpoint is inclusive
and the right endpoint is exclusive.
Which alignment should I print out if there are two or more
optimal ones?
Output any one you like.
Testing: Sequence Alignment
|
Testing.
To help you check the part of your program that finds the edit distance,
the edit distances of
gene57.txt, stx1230.txt, and ftsa1272.txt are
8, 521, and 758, respectively.
To help you check the part of your program that generates the alignment,
there are several short test files in the sequence directory whose alignments
are easy to generate manually.
In addition, we require that you generate one input file of your own to be
used for testing special cases. Create a new input file with some
interesting property. Then test your code using your file and make sure
your program behaves as expected.
Possible Progress Steps: Match Sequence Alignment
|
These are purely suggestions for how you might make progress. You do
not have to follow these steps.
- Download DataPair.java
and Path.java Download and unzip sequence.zip to your system. It contains sample
data files.
- Create a skeleton Match.java file including the
simple test routing below. Make sure that the DataPair
class is declared after the closing brace of
the Match class or you will get some very strange compile
errors.
public class Match {
public static Path match(DataPair dp) {
return null; // placeholder so Match will compile
}
public static void main(String[] args) {
TestMatch tm = new TestMatch();
Path path = Match.match(tm);
for (Path p = path; p != null; p = p.next)
System.out.println(p.row + " " + p.col);
}
}
class TestMatch implements DataPair {
public int rows() { return 10; }
public int cols() { return 10; }
public int penalty(int row, int col, int dir) { return basePenalty(row, col); }
public int basePenalty(int row, int col) { return Math.abs(row - col); }
}
-
Copy the body of Match.match() from your Sequence Alignment
homework into the body of the new Match.match(). Change all
references to a.length() + 1 and b.length() + 1
to dp.rows() and dp.cols(). Change all penalty
computations to use dp.penalty()
and dp.basePenalty(). Make sure to update all your base
cases as well as the main loop!
-
Test Match.match() using the TestMatch class
and main method from the template above. TestMatch
is a very simple implementation of the DataPair interface
that yields a 10x10 matrix where the penalty of matching a row i
and column j is the absolute value of the difference
between i and j. In particular, the penalty of
matching along the diagonal from upper-left to lower-right is 0, and
the penalty of matching anywhere else in the matrix is larger than zero.
So Match.match() should return a path that follows this
diagonal, and your output should be:
% java Match
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
This test is a sanity check that Match.match() is doing
something reasonable, but is too simple to guarantee everything works.
You may want to modify TestMatch or create additional test
classes to create more complex test cases as well. (Try making test
cases with unequal numbers of rows and columns, and with different
values in the bottom row and last column.) Since these classes
and Match.main() exist only for testing purposes, you should
leave them and any debug output in them when you
submit Match.java. However you should remove any print
statements from Match.match() that you add for debugging
purposes.
- Now it is time to update EditDistance.java. Create
private String instance variables for the two strings you are
matching, as well as any other data you need to keep track of.
Implement the basePenalty() method that deals with the
lower-right corner of the match matrix, and the penalty()
method for the rest of the matrix. Since the lower-right corner of
the match matrix corresponds to matching a gap to the
gap, EditDistance.basePenalty() should return
0 in that case. EditDistance.penalty() should return 0, 1, or 2. Also,
remember that the last row and column of the match matrix correspond
to gaps, so it is possible that Match.match() will
call penalty(i, j, dir) with values of i
and j that are equal to a.length()
or b.length(). EditDistance.penalty() will need to
handle this case properly. There is more than one way of doing so,
and you may use any reasonable approach.
Frequently Asked Questions: Image Stitching
|
Do I need to include a main method in ImageSeam?
You do not have to include a main method, but if you write
one for testing purposes, you should not remove it. You should remove
any debugging-related code from the other methods however.
How robust should ImageSeam be to invalid input?
The ImageSeam constructor and getStitched methods
should handle invalid pictures as described in the assignment.
However the other methods may assume they will only be called if the
data is valid.
Possible Progress Steps: Image Stitching
|
These are purely suggestions for how you might make progress. You do
not have to follow these steps.
- Complete Part 1 of the assignment. Image stitching is
conceptually easier than sequence alignment and involves a simpler
cost function, but it's harder to debug because the intermediate and
final results are much harder to evaluate by hand.
- Implement rows(), cols(), baseCost(),
cost(), and the constructor.
- Write a main method that creates two 10x10 pictures. Set all
pixels of the first picture to Color.BLACK. Set all pixels
of the second picture to Color.WHITE except the
principal diagonal (the diagonal running from the upper-left to
lower-right corners). Set the pixels along that diagonal
to Color.BLACK. Create an ImageSeam object using
these two images, and print out the path computed in the constructor.
The path should follow the principal diagonal where the two pictures
are both black.
- Implement getStitched(). This is the trickiest method
in ImageSeam. Remember that the first node in the path will
always be at row 0 and column 0. The next node in the current path
will always be one to the right, one below, or one to the right and
below the current one.
- Update main() to call getStitched() and
display the result. You should get an image with one white triangle
and one black triangle.
- Test the image examples from the assignment
using Stitcher. Make sure you
have not mixed up which image getStitched() uses for the
portion to the left of the seam, and which portion it uses to the
right of the seam.
Enrichment: Image Stitching
|
- There are numerous tools available for building panoramas,
including Photoshop. One free tool
is Hugin
- A closely related problem to image stitching
is seam
carving (demos here
and here), which uses dynamic
programming to figure out how to intelligently resize pictures to
arbitrary dimensions. The basic dynamic program is identical
to ImageSeam but with a slightly different cost function.
- It makes sense that the seam should run from upper-left to
lower-right because thoses are the corners of the region where the two
images overlap. But there is no reason the seam should always proceed
down and to the right. What we really want is the path whose total
cost is lowest, no matter how it meanders through the image. This is
called the Shortest
Path problem, and is a classic problem in computer science. It
can be solved
using Dijkstra's
algorithm. One way to find the optimal seam for stitching several
images together simultaneously
is graph
cuts. This is
an early
example of using graph cuts to stitch images.
- Another important issue in stitching panoramas is correcting for
exposure differences between images. This can be done by adjusting
the brightness and contrast of images so they match, but a more
successful method is to work with the image gradients, or
changes in color, rather than the pixel colors themselves. These are
stitched just like the colors, and the resulting gradient field is
reintegrated to obtain
the final
image. This technique is now widely used to manipulate images
because it works so well.