CIT 597 Assignment 6: HTML --> Plain Text (Perl)
Fall 2004, David Matuszek

Purpose of this assignment:

General idea of the assignment:

Read in an HTML file and turn it into plain text by removing all tags and replacing all entities with the characters that they represent. Try to insert newlines in appropriate places.
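One possible shape for the core of this, as a rough sketch rather than a complete solution: use substitutions to turn a few block-ending tags into newlines, delete whatever tags remain, and then decode entities in a single pass. The name html_to_text, the %entity table, and the particular tags chosen for newlines are all assumptions made up for this illustration; the entity table is far from complete, and the tag pattern will be fooled by a > inside an attribute value or a comment.

```perl
use strict;
use warnings;

# Hypothetical lookup table of named entities (deliberately incomplete).
my %entity = (
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
    nbsp => ' ',
);

# Hypothetical helper: takes a string of HTML, returns plain text.
sub html_to_text {
    my ($html) = @_;

    # Put a newline where <br> occurs or where a few block-level tags end.
    $html =~ s{<br\s*/?>|</(?:p|div|h[1-6]|li|tr)\s*>}{\n}gi;

    # Delete every remaining tag; [^>] also matches newlines,
    # so tags that span several lines are removed too.
    $html =~ s{<[^>]*>}{}g;

    # Decode numeric and named entities in one pass, so that text such as
    # "&amp;lt;" becomes "&lt;" and is not decoded a second time.
    # Unknown entity names are left exactly as they were.
    $html =~ s{&(#(\d+)|(\w+));}{
        defined $2         ? chr($2)
      : exists $entity{$3} ? $entity{$3}
      :                      "&$1;"
    }ge;

    return $html;
}

# prints: Fish & chips < $5
print html_to_text('<p>Fish &amp; chips &lt; &#36;5</p>');
```

Decoding entities only after the tags are gone matters: if &lt; were turned into < first, ordinary text could be mistaken for a tag and deleted.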

Details:

Basically, the problem is to turn a correct HTML file into plain text.

You should read from STDIN and write to STDOUT.
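A minimal sketch of the input/output side, assuming the whole file is small enough to hold in memory. Reading everything at once (rather than line by line) is a design choice made here because a tag may be split across lines. The helper name filter_html is hypothetical, and the single substitution inside it is only a stand-in for the real conversion.

```perl
use strict;
use warnings;

# Hypothetical helper: read HTML from one filehandle, write text to another.
sub filter_html {
    my ($in, $out) = @_;

    # Undefine the input record separator so one read returns everything.
    my $html = do { local $/; <$in> };
    $html = '' unless defined $html;

    # Stand-in for the real conversion: delete anything shaped like a tag.
    $html =~ s/<[^>]*>//g;

    print {$out} $html;
}

filter_html(\*STDIN, \*STDOUT);
```

With the program saved as, say, test.pl, it could be run with input and output redirected: perl test.pl < page.html > page.txt (both file names are just examples).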

Things to watch out for:

There are many other tags that could be handled specially. However, doing a really good job of turning HTML into text is far more work than I want for this assignment, so don't worry if your output isn't perfect for every possible HTML file. If you can do as much as I've asked for, you've learned enough Perl for this course!

Getting Perl:

Linux in Moore 207: ActivePerl is installed in Moore 207. To run it, create a Perl file (with extension .pl), say, test.pl. Open an X Terminal window in the directory with your program, and type perl test.pl. The initial line, #!/usr/bin/perl, is accepted (if correct) but not required.

Windows: Go to http://www.activestate.com/Products/ActivePerl/ and click the DOWNLOAD link at the very bottom of the page. Download ActivePerl 5.8.4, Windows, MSI. Once it has downloaded, double-click the file ActivePerl-5.8.4.810-MSWin32-x86.msi and just accept all the defaults. To run it, create a Perl file (with extension .pl), say, test.pl. Open a DOS (cmd) window in the directory with your program, and type perl test.pl. The initial line, #!/usr/bin/perl, if present, appears to be ignored. (If you really want more detail than this, see http://www.extropia.com/tutorials/winperl/ .)

Macintosh OS X: You already have Perl (of course!). Open a Terminal window, create a Perl file (say, test), and type perl test. The initial line, #!/usr/bin/perl, if present, appears to be ignored.

Due date:

Tuesday, November 23, before midnight.