In this assignment, your task is to write a two-level grammar that maps orthographic strings in Portuguese (the lexical level) to strings that represent their pronunciation (the surface level). A sample mapping of written "caso" to spoken "kazu" looks like this, with the lexical string on top and the surface string on the bottom, as is conventional.
| Lexical: | caso |
| Surface: | kazu |
Standard Portuguese orthography is not always a complete guide to the pronunciation of a word (especially in the case of the letter "x" and the vowels written "o" and "e"). As usual, we will restrict and simplify the data slightly to make the solution manageable as a class exercise. Later we will redo the same example using two-level rules.
| casa cimento me disse peruca simpático braço árvore |
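The surface level produced by your grammar will be a kind of crude phonemic alphabet, with six extra symbols, all of which appear in the examples below: $, L, N, R, J and C. The digit 0 in a surface string is not a sound; it marks a lexical symbol that is realized as nothing, as with the lexical "h" of "homem", which is paired with 0 in "0omem".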
Because we have limited our input words to lowercase letters, the six special characters will appear only in surface strings, never at the lexical level. The dollar sign ($) is special in regular expressions, so precede it with a percent sign (%) to make it literal, or enclose it in double quotes.
The mapping from orthography (lexical side) to pronunciation (surface side) includes the following:
| braço brasu |
| interesse interes0i |
| cimento simentu |
| chato $0atu |
| casa kaza |
| filho fiL0u |
| ninho niN0u |
| homem 0omem |
The orthographical digraph "rr" is always realized as /R/. Also, the single r at the beginning of a word is always realized as /R/. Elsewhere, r:r, i.e. lexical "r" is realized as /r/.
| carro | rápido | caro | cantar |
| kaR0u | Rapidu | karu | kantar |
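The three cases of the r rule can be checked mechanically on an aligned pair of strings. The following Python sketch is not twolc syntax and is not a solution to the assignment; it only tests whether a given lexical/surface pair obeys the r constraints. It assumes that the two strings have equal length and that the first r of "rr" is paired with R and the second with 0, as in the "kaR0u" example. The function name `r_rule_ok` is made up for illustration.

```python
def r_rule_ok(lexical: str, surface: str) -> bool:
    """Return True if every lexical r is realized as the r rules require.

    Assumes the two strings are aligned symbol by symbol, with 0 on the
    surface side standing for a lexical symbol that is not pronounced.
    """
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface strings must have equal length")

    for i, (lex, srf) in enumerate(zip(lexical, surface)):
        if lex != "r":
            continue
        follows_r = i > 0 and lexical[i - 1] == "r"
        precedes_r = lexical[i + 1 : i + 2] == "r"
        if follows_r:
            expected = "0"      # second half of the digraph rr
        elif precedes_r or i == 0:
            expected = "R"      # first half of rr, or word-initial r
        else:
            expected = "r"      # elsewhere r:r
        if srf != expected:
            return False
    return True


print(r_rule_ok("carro", "kaR0u"))   # True: the rr digraph
print(r_rule_ok("caro", "karu"))     # True: single r inside the word
print(r_rule_ok("caro", "kaRu"))     # False: R is not licensed here
```

A checker like this is handy for validating the expected answers by hand before you start debugging the grammar itself.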
| peruca | case |
| piruka | kazi |
| cases kazis |
| braço brasu |
| braços brasus |
| camisa | case |
| kamiza | kazi |
| vez ves |
| disse | verdade | paredes |
| Jis0i | verdaJi | pareJis |
A "t" is pronounced /C/ when it appears before a surface sound /i/. (N.B. This change occurs in the environment of any SURFACE /i/, no matter what that surface /i/ may have been at the lexical level.) Elsewhere t:t.
| tio | partes |
| Ciu | parCis |
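The point that the context is a surface /i/ can be made concrete with a small Python check. This is a sketch for verifying aligned string pairs, not twolc code; it assumes equal-length strings and looks only at the surface symbol immediately after the t, without skipping any 0 placeholders. The name `t_rule_ok` is hypothetical.

```python
def t_rule_ok(lexical: str, surface: str) -> bool:
    """Return True if every lexical t obeys: t:C before surface i, else t:t."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface strings must have equal length")

    for i, (lex, srf) in enumerate(zip(lexical, surface)):
        if lex != "t":
            continue
        # The context is read from the SURFACE string, not the lexical one,
        # so a lexical e that surfaces as i still triggers the change.
        next_surface = surface[i + 1 : i + 2]
        expected = "C" if next_surface == "i" else "t"
        if srf != expected:
            return False
    return True


print(t_rule_ok("tio", "Ciu"))        # True: lexical i, surface i
print(t_rule_ok("partes", "parCis"))  # True: lexical e, surface i
print(t_rule_ok("partes", "parCes"))  # False: no surface i after the t
```

In "partes" the letter after the t is a lexical "e", yet the pair is accepted only when that "e" surfaces as /i/ and the t surfaces as /C/ together.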
The vowels are a, e, i, o, u, á, é, í, ó, ú, ã, õ, â, ê, ô, ü and à. All lexical symbols map to themselves on the surface level by default.
Note that each two-level rule is a constraint that must be satisfied independently from, and simultaneously with, all the other two-level rules. You can use the lexical side or the surface side or some combination thereof to restrict the context but there is no rule ordering.
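One way to picture simultaneous, unordered constraints is as a set of independent yes/no checks that all inspect the same aligned pair of strings. The Python sketch below is only an illustration of that idea, using compact checks for the r and t rules stated in this assignment; it is not how twolc compiles rules, and the names `CONSTRAINTS` and `violations` are invented here.

```python
from typing import Callable, Dict, List

Constraint = Callable[[str, str], bool]


def r_constraint(lexical: str, surface: str) -> bool:
    """rr is realized as R0, word-initial r as R, any other r as r."""
    for i, (lex, srf) in enumerate(zip(lexical, surface)):
        if lex != "r":
            continue
        if i > 0 and lexical[i - 1] == "r":
            expected = "0"
        elif i == 0 or lexical[i + 1 : i + 2] == "r":
            expected = "R"
        else:
            expected = "r"
        if srf != expected:
            return False
    return True


def t_constraint(lexical: str, surface: str) -> bool:
    """t is realized as C before a surface i, otherwise as t."""
    for i, (lex, srf) in enumerate(zip(lexical, surface)):
        if lex != "t":
            continue
        expected = "C" if surface[i + 1 : i + 2] == "i" else "t"
        if srf != expected:
            return False
    return True


# A dictionary rather than a sequence of steps: the constraints have no order.
CONSTRAINTS: Dict[str, Constraint] = {
    "r rules": r_constraint,
    "t rule": t_constraint,
}


def violations(lexical: str, surface: str) -> List[str]:
    """Return the names of all constraints that the aligned pair violates."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface strings must have equal length")
    return sorted(
        name for name, rule in CONSTRAINTS.items() if not rule(lexical, surface)
    )


print(violations("partes", "parCis"))  # []
print(violations("partes", "paRtis"))  # ['r rules', 't rule']
```

Each check sees the complete lexical and surface strings, so no check depends on another having been applied first; a pair is acceptable only when the list of violations is empty.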
Use the command lex-test to test individual words. Because of the accented characters, it is best to run the tests in an Emacs shell buffer. To create one, use the command M-x shell.
| twolc> lex-test |
| Lexical string ('q' = quit): braço |
| brasu |
| b r a ç:s o:u |
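The pair notation in the lex-test display, where "ç:s" means lexical ç realized as surface s and a bare symbol means that it maps to itself, can be reproduced from two aligned strings. The helper below is a Python sketch for reading and writing that notation by hand; `pair_string` is a made-up name and not part of twolc.

```python
def pair_string(lexical: str, surface: str) -> str:
    """Render two aligned strings in lexical:surface pair notation.

    Identity pairs are written as a single symbol, as in "b r a ç:s o:u".
    """
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface strings must have equal length")
    return " ".join(
        lex if lex == srf else f"{lex}:{srf}"
        for lex, srf in zip(lexical, surface)
    )


print(pair_string("caso", "kazu"))    # c:k a s:z o:u
print(pair_string("carro", "kaR0u"))  # c:k a r:R r:0 o:u
```

Writing out the pairs this way makes it easy to see which lexical symbol each 0 on the surface side belongs to.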
To make testing easier, you can type all the input (upper-side, i.e. lexical) words into a file, called something like portuguese.words, and use the command lex-test-file to read the input strings from that file. The command will prompt you for the names of the input file and the output file.
| twolc> lex-test-file portuguese.words |
| Output file (- = stdout) [cancel]): portuguese.out |
| ........................ |
| twolc> |
The file /mnt/linc/ftp/pub/cis639/assign/portuguese.words contains the following test words. Be sure that you handle them all correctly.
| disse | peru | pedaço | livro | parte | parede | sabe | cada |
| simpático | verdade | casa | braço | chato | vermelho | gatinho | filhos |
| luz | case | braços | partes | paredes | me | antes | ninhos |
Be sure to test ALL the examples to make sure that your rules are really working as they should. Modify your rules and re-apply the input words until the grammar is working perfectly.
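Checking every example by eye is error-prone, so a small regression table can help. The Python sketch below records a few of the lexical/surface pairs given in this assignment and reports any word for which a candidate mapping disagrees. The `realize` argument is a hypothetical stand-in for however you obtain your grammar's output (for instance, by looking results up in a table you built from your output file); the format of that file is not assumed here.

```python
from typing import Callable, Dict, List, Tuple

# A subset of the example pairs from this assignment (lexical -> surface).
EXPECTED: Dict[str, str] = {
    "braço": "brasu",
    "cimento": "simentu",
    "chato": "$0atu",
    "filho": "fiL0u",
    "ninho": "niN0u",
    "homem": "0omem",
    "carro": "kaR0u",
    "caro": "karu",
    "cantar": "kantar",
    "tio": "Ciu",
    "partes": "parCis",
    "vez": "ves",
}


def failures(realize: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (word, expected, actual) for every word that realize() gets wrong."""
    wrong = []
    for word, expected in EXPECTED.items():
        actual = realize(word)
        if actual != expected:
            wrong.append((word, expected, actual))
    return wrong


# Usage with a deliberately naive mapping that copies the input unchanged:
for word, expected, actual in failures(lambda w: w):
    print(f"{word}: expected {expected}, got {actual}")
```

Re-running a check like this after each change to the rules shows quickly whether fixing one word has broken another.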