Erlang Assignment 2: Scattershot Curve Fitting
Fall 2010, David Matuszek

Purposes of this assignment

General idea of the assignment

Given a set of (x, y) points, curve fitting is the process of finding a polynomial that most nearly goes through those points.

If there are N data points, then a polynomial of degree N-1 can always be found that passes exactly through those points. For a large number of data points (say, more than four or five), the exact polynomial is usually too "bouncy" to give a good estimate for other data points. What we want is a polynomial of smaller degree (typically, not more than about 4). Such a polynomial will not, in general, be perfectly accurate.

The typical error measure used is the sum of squared deviations (SSD).To compute this for a polynomial P, compute P(x) = y' for each data point (x, y), and find the sum of the (y - y')2 values. The smaller the number, the better the fit.

For example, given the data [{0, 4}, {1, 5}, {2, 10}, {-1, 10}, {5, 65}] and the polynomial P = 3x2-2x+5, we can compute
     P(0)=5, P(1)=6, P(2)=13, P(-1)=10, P(5)=70.
     SSD = (4-5)2+(5-6)2+(10-13)2+(10-10)2+(65-70)2=1+1+9+0+25 = 36

Note: I am using integers in this assignment writeup for convenience. Your program should work with floating point numbers. Also for convenience I use the term "best fit" in many places, where I really mean "the best fit your program is reasonably able to find."

The goal in curve fitting is to find the polynomial of some fixed degree N that gives the best fit (least SSD). Lower degree polynomials with "adequate" fit are generally preferable to higher degree polynomials with somewhat better fit.

There are mathematical algorithms for finding a best-fit polynomial of a given degree. I don't know them. If you know them, please forget them for the purposes of this assignment. The assignment is to write an Erlang program to find a "pretty good" fit, using a brute force approach.

In order to allow automated testing, input will be from a file. The input will consist of a single Erlang term, a list of some number of two-tuple {x, y} pairs. For example, the above data would be on the file as:
          [{0, 4}, {1, 5}, {2, 10}, {-1, 10}, {5, 65}].


By "server," I mean an Erlang process that behaves as follows: When it gets a request from another process, it computes a response and sends it back to that same process, then waits for the next request. Hence, the Pid of the process making the request must be included in the message to the server.

Represent a polynomial as a list of coefficients. For example, the polynomial 3x2-2x+5 should be represented as the list [3.0, -2.0, 5.0].

Write an evaluate function that, given a polynomial and a point (x-value), evaluates the polynomial at that point and returns the result. For example, evaluating [3.0, -2.0, 5.0] at 2.0 would give 12.0-4.0+5.0=13.0. (This evaluation can be done with a single call to foldl, but it's tricky to get the parameters right.)

Write a function start() to read in one Erlang term from a file named curve_data (no extension), as a list of {x, y} tuples, and tries to fit curves of various degrees to that data.

Given a polynomial (as a list of coefficients) and a set of data, one way to attempt to improve the fit is to adjust one of the coefficients up or down, holding the other coefficients constant, until you find a value for that coefficient that does a reasonably good job of minimizing the SSD. Then adjust some other coefficient. Iterate, adjusting only one coefficient at a time.

start() should spawn four processes: One to find the best linear (ax+k) fit to the data, one to find the best quadratic (ax2+bx+k) fit, one to find the best cubic (ax3+bx2+cx+k) fit, and one to find the best quartic (ax4+bx3+cx2+dx+k) fit. When these processes have found the best fit they reasonably can, they report back to start(), and quit. When all four processes have responded, start() will print out the final results, and the program terminates.

These four process all have the same structure (and, in fact, can all use the same function, which I'll call fit). I'll take one of them, the quadratic (ax2+bx+k), to explain what I want them to do.

The polynomial, [a, b, k] has three coefficients that can be adjusted to improve the fit. Spawn three servers: one to adjust the first coefficient, another to adjust the second coefficient, and a third to adjust the third coefficient.

When each process finds the best value it reasonably can for its given coefficient, holding the others constant, it returns the modified list of coefficients, and the SSD, as its result.

When fit receives a modified quadratic equation, it compares it to the best equation it currently has. If the new equation is better, it sends a request to each of the other two processes, giving them the new equation to work on. But if it already has a better equation, it sends that equation to the process it just got a (worse) response from.

When the fit server decides it has the best value it is going to get for a polynomial of the given degree, it stops the servers it has spawned, returns its final result, and exits.

It's often desirable to include, in a response, the original request (in this case, the initial polynomial). This is because the program might be written in such a way that the server gets multiple requests from the same process, and when that process gets a response, it may not otherwise know which request is being responded to.

Remember that process ≠ function. While the above asks you to run quite a lot of servers (three for the quadratic equation alone), you do not need to write a lot of different functions. With a little planning, each server can use the same function, with different parameters, such as, which coefficient (1, 2, ...) to adjust. Similarly, you don't need a different function for each polynomial degree; just give it lists of different lengths.

Be sure that the program ends cleanly: Do not leave any servers running!

Put your program on a file named cf.erl.


I've told you what the input should look like (I'll supply my own input, for testing), and I've told you the general approach I want you to take (adjusting one coefficient at a time). I want you to use processes, some of which are servers. And I want a start() function, so that I know how to start your program.

There's a lot I haven't specified, and for that, you are on your own. Parameters to functions can be whatever you need them to be. You probably will want to print a lot of intermediate results, and that's fine (but I want to be able to find the final results at the end). You probably want to link some processes, or send exit signals to them. Fine. Do whatever works.

I'm pretty sure this assignment is doable, because I've done most of it. However, please let me know ASAP if you find any significant errors or inconsistencies.

Finally: Yes, I know this is a poor way to fit curves. I've tried very hard to come up with an assignment that showcases Erlang features, and I think it does that, even if the whole thing is pretty hokey.

Due date:

Turn in your cf.erl file to Blackboard by midnight, Wednesday October 13. As only one file is required, there is no need to zip it.