The purpose of this assignment is to gain practice with sorting, functions, and coordinate transformations.
Cheat checkers typically compare all pairs of code submission using techniques similar to those of the previous assignment, and give a sorted list of the top matches. One simple, but surprisingly effective way of comparing two code submissions is to compare them line by line, counting up how many characters differ between the two lines, and how many characters are added or removed from the ends of the line. The number of differences is used to computing a matching cost that represents how different the two files are. The lower the matching cost, the more similar the files.
You already know enough to be able to implement this — we often assign this in CIS 110 before the current assignment, although not this semester. We also often include the results of this cheat checker in our reports to the OSC even though we also use much more sophisticated cheat checkers.
We're also interested in the distribution, also called a histogram, of matching costs. That is the focus of this assignment. Any pair of independently written programs should have fairly similar matching costs (actually, you expect the cost distribution to form a bell curve); assignment pairs with abnormally low matching costs relative to the bell curve are likely to be copied. The images below show the kind of histogram your program will generate, as well as a report generated by an on-line cheat checker. All identifying information has been removed.
![]() |
![]() |
Comparing a protein from two different species can give us some information about how closely related they are, but it is even more interesting to compare the protein among numerous species. That helps us answer questions like, "Which two species are most closely related?" or "Which two species are most distant?" We can also build a phylogenetic tree that hierarchically groups species based on how similar the protein is, and thereby gain insight into the entire evolutionary tree. Computational biologists build similar trees of different genes to help learn about their functions — very similar genes often control similar functions. (This type of tree is called a minimum spanning tree, and is at the heart of many compression formats including JPEG, as well as being an essential part of route planning and many other algorithms. You will learn all about minimum spanning trees in CIS 121.) Two examples of phylogenetic trees based on the cytochrome c protein are shown below.
![]() |
![]() |
Complex data often requires illustration. In this assignment, you will be implementing a histogram to visualize complex data. This data examines the similarity between two pairs of information.
This assignment builds on understanding the range of similarity between pairs, and find the most similar pairs. Finding the most similar items out of a pool of many is of course important to a very wide range of applications, starting with searching for similar documents or images. We will focus on two applications: protein matching and cheat checking.
We provide you a reference implementation of the pairwise comparison programs. You can assume this implementation has no defects, and will work as specified in this assignment.
You will write a program RankSimilarity.java that matches all pairs of documents, computes a histogram of the result, and plots it. Each of these steps is explained in detail in its own section of the assignment, in the order we recommend you implement them. The histogram is built using a cost matrix. A cost matrix is a matrix that identifies the similarity between two items being classified. In Cheat Checking, this can be two student homeworks. In biology, this can be used to determine how similar protein chairs are. The lower the cost is, the more similar the documents are. We use a histogram to identify documents whose cost difference is abnormally small, indication cheating. You do not need to worry about the details of how this difference is calculated for this assignment.
To get you started, we provide a complete reference implementation as a compiled .class file that you can play with. We also give you a skeleton version of the program that calls the reference implementation for the interesting bits and a RankHelpers class with the tedious code to read in data. This lets you start by implementing the drawHistogram() method, then replace the other pieces with your own code one by one. Your program must conform to the API given below.
API
public class RankSimilarity -------------------------------------------------------------------------------------------------------- * Draw the histogram */ public static void drawHistogram(double[] histogram) * Compute the matching costs for all pairs of documents * returns an array giving the number of matches in each histogram bin * costs is an array of matches in the same format as costArray * numBins is the number of bins for the histogram (must be at least 1) */ public static int[] buildHistogram(int[][] costs, int numBins, boolean inverse) * Sorts the costs array in increasing order of cost using selection sort * costs is an array of matches in the same format as costArray */ public static void selectionSortCosts(int[][] costs) * Computes the matching costs for all pairs of documents * returns a 2-D array with one row for each match * docs is an array of documents, each represented as an array of lines */ public static int[][] computeCosts(String[][] docs) * read in the protein file, cost matrix, or list of documents, * compute all match costs, sort them, and create a histogram */ public static void main(String[] args)
Files to Download - for your convenience, all files can be found in this zip file here
Before you start
We provide you with a reference implementation of the program that you can run to explore how your program is supposed to work. You can also use the functions it provides as drop-in replacements for many of the parts that you will write so that you end each of the subsequent assignment parts with a complete, working program. This makes your program much easier to test.
At this time, do not worry about the contents of the data files, just know that the ReferenceRankSimilarity.class file works to the specification of this assignment.
There are a few ways to run the reference implementation. We provide the same main() function in the skeleton code for your own program, which you will be able to run in exactly the same ways.
Drawing the Histogram: A histogram is just a type of bar chart. It groups the possible values of each piece of data into a series of ranges (called bins), and counts how many data values lie in each range (i.e. fall into each bin). For example, if the possible matching costs fall between 0 and 100, a 10-bin histogram counts the number of matches with costs of 0 to 9 (bin 0), 10 to 19 (bin 1), 20 to 29 (bin 2), etc. A bar chart of the numbers of A, B, C, D, and F grades given in a course is another example of a histogram.
You will store the histogram of matching scores as an array of doubles. Each entry of the array stores the number of matches whose costs fall into that bin. The length of the array tells you the number of bins.
Checkpoint: to ensure your histogram works, you should get this as a result of calling "java RankSimilarity costs-cheatchecker.txt" in the interactions pane.
![]() |
A Sample Histogram. |
As another test, try creating your own histograms as double
arrays. For example, if in main you write:
double[] thisHist = {0.4, 10.0, 5.3, 3.2, 4.8, 6.1, 3.0, 0.0};
drawHistogram(thisHist);
You should get the histogram pictured below:
![]() |
Writing your event loop: Just like in previous assignments using PennDraw, you need to add a loop to your main() function that checks for keyboard input and draws the histogram. This is your event loop, and it will be responsible for calling drawHistogram().
Inside your main method you should construct a while loop to monitor for key press events listed below. This loops will be similar to StampSketch in HW1. You must implement the following three behaviors:
If the 'p' key is pressed, find the mouse's x location (there is an PennDraw method for doing this) and multiply by 100, and cast to an integer. (The mouse location should be a value between 0 and 1.) Print out that many best matches in descending order of cost. For instance, if the x location is 0.1, you should print the 10 most similar matches, starting with the 10th-most similar and ending with the top match. For each match, you must print all the data about the match (index of the first document, index of the second document, and matching score), but the format does not need to be identical to the reference output.
How to access the individual matches: The matches are stored in the two-dimensional array variable costArray that is sorted in increasing order of cost. This means the most similar pair is at costArray[0], the next most similar pair at costArray[1], etc. Each row of this array corresponds to a single match and has three columns. The first two columns record the indices of the first and second documents in the match; the last column records the match's cost. We have defined symbolic constants in the skeleton for each of these columns that you should use. For instance, the cost of match i is costArray[i][COST].
Checkpoint: if you run java
RankSimilarity costs-cheatchecker.txt and
press p to generate AT LEAST 5 matches, the LAST 5
matches should be (in order):
623 530 2402
294 94 2125
707 94 2079
707 294 2009
759 623 1804
If you are getting different values, you have made an
error.
One common error is printing in reverse order. If
your cost scores are ASCENDING, rather than
DESCENDING, you are likely iterating through the list
in the wrong direction with your for loop.
Another possible error is that you are printing in
descending order, but you get different values that
what is above. In this case, you are likely starting
at the wrong index point in the array.
If the 'i' key is pressed, toggle between the normal histogram and a modified inverse of the bin values. You should handle this by calling buildHistogram() with the inverse argument set to true or false. Toggle between them each time you press 'i'.
The formula for computed inverse bin values is described in the "Building the Histogram" section on the "Preparing the Data" tab. For now you do not need to worry about it. For now, we suggest testing this with the costs-cheatchecker.txt data file. The standard histogram shows a bell-curve-like distribution of similarity between submission. There are hundreds of thousands of submission pairs in this data set, and a few are plagiarized. Those assignments are represented by very small bars at the left edge of the graph—so small you can't see them. Inverting the histogram will make these "pop out," and make the bulge of "normal similarity" disappear.
Call# of bins = 100 * mouse's x location / 5
to rebuild the histogram with the computed number of bins. Use Math.ceil to force the number of bins to round up. If the user clicks in a place such that the number of bins is zero, instead show a histogram with 1 bin (as you cannot have a histogram with zero bins). Note: If r is pressed while the mouse is out of bounds (that is, off the window pane) to the left or right, ensure your histogram does not generate more than 20 bins or less than 1 bin.histogram = buildHistogram(costArray, # of bins, inverse);
Checkpoint: when you alternate r and i, the number of bins and the state of whether or not the histogram is showing the inverse should be maintained. For example, if you press r to set the number of bins to 10, and then press i to show the inverse, the histogram shown should still have 10 bins. Additionally, if you change the number of bins while showing the inverse, the new histogram should still be in an inverse state.
Detecting mouse and key presses: Consult the documentation for PennDraw.
"Building the histogram"
Now that drawing works, you can implement your own buildHistogram() function. The function should use the costs array to create and return a histogram with numBins bins. It is the last step before drawing, which means you can easily test and debug it using a cost matrix that has been computed and sorted by the reference implementation.
costs[0][FIRST]: index of first document in match 0 | costs[0][SECOND]: index of second document in match 0 | costs[0][COST]: cost of comparing (first, second) |
costs[1][FIRST]: index of first document in match 1 | costs[1][SECOND]: index of second document in match 1 | costs[1][COST]: cost of comparing (first, second) |
costs[2][FIRST]: index of first document in match 2 | costs[2][SECOND]: index of second document in match 2 | costs[2][COST]: cost of comparing (first, second) |
... | ... | ... |
Inverting the histogram: When the inverse argument is true, you should compute a modified inverse of each bin value. Replace each bin value n with it's reciprocal 1 / n as long as n is not zero. If n is zero, do not change it.
The effect of inverting the histogram is that very tall bars in the histogram disappear (1 / very large number is close to 0), while very short bars suddenly get tall (the lowest non-zero count a bin could have is 1, and 1 / 1 = 1). For cheat-checking, we aren't interested in the masses of assignment pairs that are equally similar—the ones that correspond to tall histogram bars. Instead, we are interesting in the very few assignment pairs that are much more similar than all other pairs. In terms of the histogram, we expect most bins correspond to low cost/high similarity to have a zero count, but a few my have a non-zero (but small) count corresponding to plagiarized assignments. On the standard histogram, the counts in those bins are still so low that we can't see the bars; but inverting them makes the bars large. So the plagiarized assignments tend to pop out, and the non-plagiarized ones tend to disappear.
Computing costs
computeCosts() takes a 2-D array of documents—docs[i](look below for a visual representation of this array)is the ith document, represented as an array of lines—and computes all pairwise matches. It returns the resulting 2-D array.
docs[0][0]: line 0 of document 0 | docs[0][1]: line 1 of document 0 | docs[0][2]: line 2 of document 0 | ... |
docs[1][0]: line 0 of document 1 | docs[1][1]: line 1 of document 1 | docs[1][2]: line 2 of document 1 | ... |
docs[2][0]: line 0 of document 2 | docs[2][1]: line 1 of document 2 | docs[2][2]: line 2 of document 2 | ... |
... | ... | ... | ... |
Sorting the data: For the following sections of the assignment, you may find it neccessary to have the data sorted by the COST value. While we have not covered sorting in class yet, you will need to have selectionSortCosts(int[][] costs) implemented. Simply use the following code body to implement this function:
public static void selectionSortCosts(int[][] array) { for(int i = 0; i < array.length - 1; i++) { int min = array[i][COST]; int minIndex = i; for(int j = i + 1; j < array.length; j++) { if (array[j][COST] < min) { min = array[j][COST]; minIndex = j; } } int[] temp = array[i]; array[i] = array[minIndex]; array[minIndex] = temp; } }
Sorting will be covered in the class at a later date. However, this is an example of an selection sort. Take some time to try to understand how the code works. To sort the costArray, simply call:
selectionSortCosts(costArray)
Note that we don't directly set costArray equal to the output of this method. This is because arrays are pass-by-reference. This means that when this method alters the input array, it alters the array's location in memory, not just the value of the array inside the function.
We provide a few possible extra credit ideas for this assignment, which are described below.
There are many opportunities for extra credit on this assignment, although you will be able to earn at most 3 extra credit points (and remember that the impact of extra credit on your grade is limited). Some of these ideas are relatively easy, and others are harder. Finish the assignment first, submit it, then work on these ideas as they interest you.
You should implement your extra credit in a program RankSimilarityEC.java that you submit as a separate file from your main submission. This simplifies grading and gives you more latitude to change any part of the program you want. You can also submit a zip file, extra.zip containing any additional java files, data, images, etc. that are needed to run you extra credit.
Group Extra Credit
Important: each group member must have completed the assignment BEFORE working on the group assignment. This is to ensure that you are not sharing code on the base homework.
If you want to get ambitious with the extra credit, You may work in groups of up to four students provided:
Of course, you must not share your code with anyone outside the group either. Remember that we run extra credit code through cheat checkers as well as the main assignment. If you haven't properly documented who you worked with, we won't be able to ascertain that the identical code you are submitting falls within the permitted collaboration!
Fill out the readme_rank.txt file in DrJava and submit the homework using the submission system. If you did extra credit requiring extra classes or other material, remember to submit your RankSimilarityEC.java file as well as any extra material if needed (compress this into an extra.zip file as with previous homework extra credit).