CIT 597 Assignment 6: HTML --> Plain Text (Perl)
Fall 2004, David Matuszek
Purpose of this assignment:
General idea of the assignment:
Read in an HTML file and turn it into plain text by removing all tags and replacing all entities with the characters that they represent. Try to insert newlines in appropriate places.
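As a starting point, the general idea can be sketched as follows. This is only a minimal illustration, not a complete solution: the name strip_tags_from is hypothetical, and the sketch removes tags without doing any of the special processing or entity replacement described under Details. The demo reads from an in-memory filehandle so that it is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle and strip tags.
sub strip_tags_from {
    my ($fh) = @_;
    local $/;                       # slurp mode: read the whole input at once
    my $html = <$fh>;
    $html = '' unless defined $html;
    $html =~ s/<[^>]*>//g;          # drop anything between < and >
    return $html;
}

# Demo on an in-memory filehandle (an assumption made for the example).
my $sample = "<html><b>Hello</b>, world</html>\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
print strip_tags_from($in);
close $in;
```

In the actual program the same subroutine could be called as strip_tags_from(\*STDIN), with the result printed to STDOUT.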
Details:
Basically, the problem is to turn a correct HTML file into plain text.
Tags (anything enclosed in < >) should be removed. However, a few tags will require a bit of special processing.
<a href=URL ...>Link text</a> tags should be replaced by Link text [URL], that is, the link text followed by the URL in square brackets.
<img src=URL alt=text> tags should be replaced by text [URL] if the alt attribute is present, or just by [URL] if the alt attribute is not present.
Each <br> tag should be replaced by a newline, and each <p> tag by two newlines.
The text in certain elements (<h1>...<h6>, <title>, <div>) should be on a line by itself, with at least one blank line before and after (more than one blank line is OK).
Within a <pre> tag, don't run lines together or split long lines (other tags within <pre> should be handled normally, however).
Entities should be replaced by the characters they represent: < > & " ' and the non-breaking space.
You should read from STDIN and write to STDOUT.
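The link, image, line-break, and entity rules above might be approached with regular-expression substitutions, roughly as below. This is a hedged sketch under several assumptions: the names html_to_text and img_text are hypothetical, attribute values are assumed to be unquoted or double-quoted, the apostrophe entity is assumed to appear as &apos; or &#39;, and headings and <pre> are not handled at all.

```perl
use strict;
use warnings;

# Hypothetical helper: build the replacement text for one <img> tag.
sub img_text {
    my ($attrs) = @_;
    my ($src) = $attrs =~ /\bsrc\s*=\s*["']?([^"'\s>]+)/i;
    my ($alt) = $attrs =~ /\balt\s*=\s*"([^"]*)"/i;        # alt="some text"
    ($alt) = $attrs =~ /\balt\s*=\s*([^\s"'>]+)/i          # alt=word
        unless defined $alt;
    $src = '' unless defined $src;
    return (defined $alt && length $alt) ? "$alt [$src]" : "[$src]";
}

# Hypothetical helper: convert a small HTML fragment to plain text.
sub html_to_text {
    my ($html) = @_;

    # <a href=URL ...>Link text</a>  becomes  Link text [URL]
    $html =~ s{<a\b[^>]*?\bhref\s*=\s*["']?([^"'\s>]+)["']?[^>]*>(.*?)</a\s*>}{$2 [$1]}gis;

    # <img src=URL alt=text>  becomes  text [URL], or [URL] without alt
    $html =~ s{<img\b([^>]*)>}{img_text($1)}gie;

    # <br> becomes one newline, <p> becomes two newlines
    $html =~ s{<br\b[^>]*>}{\n}gi;
    $html =~ s{<p\b[^>]*>}{\n\n}gi;

    # Remove every remaining tag.
    $html =~ s{<[^>]*>}{}g;

    # Replace entities after the tags are gone. &amp; is done last so
    # that "&amp;lt;" ends up as the text "&lt;" rather than "<".
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&quot;/"/g;
    $html =~ s/&(?:apos|#39);/'/g;
    $html =~ s/&nbsp;/ /g;
    $html =~ s/&amp;/&/g;

    return $html;
}

print html_to_text('<a href="http://x.org">Home</a><br>A &amp; B'), "\n";
```

Doing the entity replacement only after tag removal means that an escaped tag such as &lt;b&gt; survives as literal text instead of being stripped.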
Things to watch out for:
<br>,
for example), and the attributes may be in any order.
There are many other tags that could be handled specially. However, to do a really good job turning HTML into text is far more work than I want for this assignment, so if your output isn't perfect for every possible HTML file, don't worry about it. If you can do as much as I've asked for, you've learned enough Perl for this course!
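Because attributes may appear in any order, one possible approach is to collect all of a tag's attributes into a hash first and then look up the ones needed. The sketch below is only an illustration: parse_attrs is a hypothetical name, and lowercasing the attribute names is an assumption made so that lookups work regardless of how the tag was written.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash of attribute name => value for one tag.
sub parse_attrs {
    my ($tag) = @_;
    my %attr;
    # Accept name="value", name='value', or name=value, in any order.
    while ($tag =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g) {
        my $value = defined $2 ? $2 : defined $3 ? $3 : $4;
        $attr{lc $1} = $value;      # store names in lowercase
    }
    return %attr;
}

my %a = parse_attrs('<IMG ALT="A cat" SRC=cat.png>');
print "$a{alt} [$a{src}]\n";
```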
Getting Perl:
Linux in Moore 207: ActivePerl is installed in Moore 207. To run it,
create a Perl file (with extension .pl), say, test.pl.
Open an X Terminal window in the directory with your program, and type perl test.pl.
The initial line, #!/usr/bin/perl, is accepted (if correct) but
not required.
Windows: Go to http://www.activestate.com/Products/ActivePerl/
and click the DOWNLOAD link at the very bottom of the page. Download
ActivePerl 5.8.4, Windows, MSI. When downloaded, double-click the file ActivePerl-5.8.4.810-MSWin32-x86.msi
to install it. To run it, create a Perl file (with extension .pl), say, test.pl. Open a DOS (cmd) window
in the directory with your program, and type perl test.pl.
The initial line, #!/usr/bin/perl, if present, appears to be ignored.
(If you really want more detail than this, see http://www.extropia.com/tutorials/winperl/).
Macintosh OS X: You already have Perl (of course!). Open a Terminal
window, create a Perl file (say, test), and type .
The initial line, #!/usr/bin/perl, if present, appears to be ignored.
Due date:
Tuesday, November 23, before midnight.