Homework 3: Cheat Checker Visualization

GOALS OF THE ASSIGNMENT

The purpose of this assignment is to gain practice with sorting, functions, and coordinate transformations.

Cheat checkers typically compare all pairs of code submission using techniques similar to those of the previous assignment, and give a sorted list of the top matches. One simple, but surprisingly effective way of comparing two code submissions is to compare them line by line, counting up how many characters differ between the two lines, and how many characters are added or removed from the ends of the line. The number of differences is used to computing a matching cost that represents how different the two files are. The lower the matching cost, the more similar the files.

You already know enough to be able to implement this — we often assign this in CIS 110 before the current assignment, although not this semester. We also often include the results of this cheat checker in our reports to the OSC even though we also use much more sophisticated cheat checkers.

We're also interested in the distribution, also called a histogram, of matching costs. That is the focus of this assignment. Any pair of independently written programs should have fairly similar matching costs (actually, you expect the cost distribution to form a bell curve); assignment pairs with abnormally low matching costs relative to the bell curve are likely to be copied. The images below show the kind of histogram your program will generate, as well as a report generated by an on-line cheat checker. All identifying information has been removed.

Comparing a protein from two different species can give us some information about how closely related they are, but it is even more interesting to compare the protein among numerous species. That helps us answer questions like, "Which two species are most closely related?" or "Which two species are most distant?" We can also build a phylogenetic tree that hierarchically groups species based on how similar the protein is, and thereby gain insight into the entire evolutionary tree. Computational biologists build similar trees of different genes to help learn about their functions — very similar genes often control similar functions. (This type of tree is called a minimum spanning tree, and is at the heart of many compression formats including JPEG, as well as being an essential part of route planning and many other algorithms. You will learn all about minimum spanning trees in CIS 121.) Two examples of phylogenetic trees based on the cytochrome c protein are shown below.

Complex data often requires illustration. In this assignment, you will be implementing a histogram to visualize complex data. This data examines the similarity between two pairs of information.

This assignment builds on understanding the range of similarity between pairs, and find the most similar pairs. Finding the most similar items out of a pool of many is of course important to a very wide range of applications, starting with searching for similar documents or images. We will focus on two applications: protein matching and cheat checking.

We provide you a reference implementation of the pairwise comparison programs. You can assume this implementation has no defects, and will work as specified in this assignment.

You will write a program RankSimilarity.java that matches all pairs of documents, computes a histogram of the result, and plots it. Each of these steps is explained in detail in its own section of the assignment, in the order we recommend you implement them. The histogram is built using a cost matrix. A cost matrix is a matrix that identifies the similarity between two items being classified. In Cheat Checking, this can be two student homeworks. In biology, this can be used to determine how similar protein chairs are. The lower the cost is, the more similar the documents are. We use a histogram to identify documents whose cost difference is abnormally small, indication cheating. You do not need to worry about the details of how this difference is calculated for this assignment.

To get you started, we provide a complete reference implementation as a compiled .class file that you can play with. We also give you a skeleton version of the program that calls the reference implementation for the interesting bits and a RankHelpers class with the tedious code to read in data. This lets you start by implementing the drawHistogram() method, then replace the other pieces with your own code one by one. Your program must conform to the API given below.

API

public class RankSimilarity
--------------------------------------------------------------------------------------------------------
* Draw the histogram */
public static void drawHistogram(double[] histogram)

* Compute the matching costs for all pairs of documents
* returns an array giving the number of matches in each histogram bin
* costs is an array of matches in the same format as costArray
* numBins is the number of bins for the histogram (must be at least 1)
*/
public static int[] buildHistogram(int[][] costs, int numBins, boolean inverse)

* Sorts the costs array in increasing order of cost using selection sort
* costs is an array of matches in the same format as costArray
*/
public static void selectionSortCosts(int[][] costs)

* Computes the matching costs for all pairs of documents
* returns a 2-D array with one row for each match
* docs is an array of documents, each represented as an array of lines
*/
public static int[][] computeCosts(String[][] docs)

* read in the protein file, cost matrix, or list of documents,
* compute all match costs, sort them, and create a histogram
*/
public static void main(String[] args)


Files to Download - for your convenience, all files can be found in this zip file here



Before you start

Reference Implementation

We provide you with a reference implementation of the program that you can run to explore how your program is supposed to work. You can also use the functions it provides as drop-in replacements for many of the parts that you will write so that you end each of the subsequent assignment parts with a complete, working program. This makes your program much easier to test.

At this time, do not worry about the contents of the data files, just know that the ReferenceRankSimilarity.class file works to the specification of this assignment.

There are a few ways to run the reference implementation. We provide the same main() function in the skeleton code for your own program, which you will be able to run in exactly the same ways.

  1. java RankSimilarityReference costfile.txt: If you run the program with exactly one argument, and that argument begins with "cost," (e.g. "java RankSimilarityReference costs-cheatchecker.txt") the program assumes it is the name of a file that contains a pre-computed cost matrix. It will load this data as the costArray, sort it, build a histogram, and plot it. This allows us to provide you datasets for which it would take many hours to compute the cost matrix, and/or whose underlying data is confidential and cannot be distributed.
  2. java RankSimilarityReference filename.txt: If you run the program with exactly one argument, and that argument does not begin with "cost," the program assumes it is a text file containing protein data. This is useful for the cytochrome_c.txt dataset. Each protein will be treated as a one-line document, and the species's English name will be used as the document's name.
  3. java RankSimilarityReference file1.txt file2.txt file3.txt ...: If you run the program with more than one argument, it assumes each argument is a text file that should be compared. Each file is loaded as a separate document, and the filenames are used as the documents' names. You can use this to assemble datasets of your own, but be warned that you need about 30 documents (or more) for the histograms you will plot to become interesting.
Draw Histogram We recommend that you start by implementing the drawHistogram() function. You will be able to completely test it using the skeleton code provided because the stubs for every other function you will implement call the reference implementation. Your plot does not have to look identical to the reference version, and we encourage you to get creative. However it must meet the following criteria:

  • You must have a vertical and horizontal axis. You do not need to label these axes, nor do they need to have a numeric scale, but the axes must be visible.
  • You must compute the width and height of each bin so that they fit no matter how many bins there are. Both the number of bins and maximum value may change while your program is running.
  • You must draw a vertical line (or something equally helpful) at the mouse's location that stretches from the top to the bottom of the plot. The line should not cover the bars of the histogram.

Drawing the Histogram: A histogram is just a type of bar chart. It groups the possible values of each piece of data into a series of ranges (called bins), and counts how many data values lie in each range (i.e. fall into each bin). For example, if the possible matching costs fall between 0 and 100, a 10-bin histogram counts the number of matches with costs of 0 to 9 (bin 0), 10 to 19 (bin 1), 20 to 29 (bin 2), etc. A bar chart of the numbers of A, B, C, D, and F grades given in a course is another example of a histogram.

You will store the histogram of matching scores as an array of doubles. Each entry of the array stores the number of matches whose costs fall into that bin. The length of the array tells you the number of bins.

  • Plot each histogram bin b as a rectangle using PennDraw.filledRectangle() or PennDraw.filledPolygon().
  • Use the number of bins and the fact that the x-axis scales from 0 to 1 in order to determine the width of each bin. Determine the height of the bin by dividing the bin value by the maximum bin value. Use the bin index and the previously calculated bin width in order to determine the x-coordinate of the lower left-hand vertex. Using the lower lefthand vertex, the bin width, and bin height, you can now calculate all x and y coordinates for the bin's rectangle.
  • Pick any color you find attractive for the rectangles.

Checkpoint: to ensure your histogram works, you should get this as a result of calling "java RankSimilarity costs-cheatchecker.txt" in the interactions pane.

A Sample Histogram.

As another test, try creating your own histograms as double arrays. For example, if in main you write:

double[] thisHist = {0.4, 10.0, 5.3, 3.2, 4.8, 6.1, 3.0, 0.0};
drawHistogram(thisHist);


You should get the histogram pictured below:

Handling Events

Writing your event loop: Just like in previous assignments using PennDraw, you need to add a loop to your main() function that checks for keyboard input and draws the histogram. This is your event loop, and it will be responsible for calling drawHistogram().

Inside your main method you should construct a while loop to monitor for key press events listed below. This loops will be similar to StampSketch in HW1. You must implement the following three behaviors:

Detecting mouse and key presses: Consult the documentation for PennDraw.

Preparing the data

"Building the histogram"

Now that drawing works, you can implement your own buildHistogram() function. The function should use the costs array to create and return a histogram with numBins bins. It is the last step before drawing, which means you can easily test and debug it using a cost matrix that has been computed and sorted by the reference implementation.



Computing costs

computeCosts() takes a 2-D array of documents—docs[i](look below for a visual representation of this array)is the ith document, represented as an array of lines—and computes all pairwise matches. It returns the resulting 2-D array.

Sorting the data: For the following sections of the assignment, you may find it neccessary to have the data sorted by the COST value. While we have not covered sorting in class yet, you will need to have selectionSortCosts(int[][] costs) implemented. Simply use the following code body to implement this function:

public static void selectionSortCosts(int[][] array) {
    for(int i = 0; i < array.length - 1; i++) {
        int min = array[i][COST];
        int minIndex = i;
        for(int j = i + 1; j < array.length; j++) {
            if (array[j][COST] < min) {
                min = array[j][COST];
                minIndex = j;
            }
        }
        int[] temp = array[i];
        array[i] = array[minIndex];
        array[minIndex] = temp;
    }
}

Sorting will be covered in the class at a later date. However, this is an example of an selection sort. Take some time to try to understand how the code works. To sort the costArray, simply call:

selectionSortCosts(costArray)

Note that we don't directly set costArray equal to the output of this method. This is because arrays are pass-by-reference. This means that when this method alters the input array, it alters the array's location in memory, not just the value of the array inside the function.

We provide a few possible extra credit ideas for this assignment, which are described below.

There are many opportunities for extra credit on this assignment, although you will be able to earn at most 3 extra credit points (and remember that the impact of extra credit on your grade is limited). Some of these ideas are relatively easy, and others are harder. Finish the assignment first, submit it, then work on these ideas as they interest you.

You should implement your extra credit in a program RankSimilarityEC.java that you submit as a separate file from your main submission. This simplifies grading and gives you more latitude to change any part of the program you want. You can also submit a zip file, extra.zip containing any additional java files, data, images, etc. that are needed to run you extra credit.

Group Extra Credit

Important: each group member must have completed the assignment BEFORE working on the group assignment. This is to ensure that you are not sharing code on the base homework.

If you want to get ambitious with the extra credit, You may work in groups of up to four students provided:

  1. you have all completed and submitted the main assignment before you start on them;
  2. you share only the code specifically related to these ideas with each other (not your entire extra credit program);
  3. you each list everyone you work with on each idea in your readme, along with the contribution each person made.

Of course, you must not share your code with anyone outside the group either. Remember that we run extra credit code through cheat checkers as well as the main assignment. If you haven't properly documented who you worked with, we won't be able to ascertain that the identical code you are submitting falls within the permitted collaboration!

Fill out the readme_rank.txt file in DrJava and submit the homework using the submission system. If you did extra credit requiring extra classes or other material, remember to submit your RankSimilarityEC.java file as well as any extra material if needed (compress this into an extra.zip file as with previous homework extra credit).