In this assignment, the task is to create a rule system that maps orthographical strings in Portuguese (the lexical side) to strings that represent their pronunciation (the surface side). A sample mapping of written "caso" to spoken "kazu" looks like this; by convention, the lexical string is on top and the surface string on the bottom.
| Lexical: | caso |
| Surface: | kazu |
Standard Portuguese orthography is not always a complete guide to the pronunciation of a word (especially in the case of the letter "x" and the vowels written "o" and "e"). As usual, we will restrict and simplify the data slightly to make the solution manageable as a class exercise. Later we will redo the same example using two-level rules.
| casa cimento me disse peruca simpático braço árvore |
The surface level produced by your grammar will be a kind of crude phonemic alphabet, with the following six extra symbols: $, L, N, R, J and C.
Because we have limited our input words to lowercase letters, the six special characters will appear only in surface strings, never at the lexical level. The dollar sign ($) is a special character in regular expressions, so precede it with a percent sign (%$) to literalize it, or put it in double quotes ("$").
The mapping from orthography (lexical side) to pronunciation (surface side) includes the following:
| braço | interesse | cimento | chato | casa | filho | ninho | homem |
| brasu | interes0i | simentu | $0atu | kaza | fiL0u | niN0u | 0omem |
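The 0 in these surface strings marks a position where a lexical letter has no surface counterpart, so each lexical/surface pair keeps the same length. As a rough illustration outside xfst, the Python sketch below pads the digraph and "h" mappings visible in the examples with "0" and then strips the padding for display. The rule list and the function names are assumptions read off the examples, and the vowel changes and the treatment of plain "c" are deliberately left out.

```python
# Mappings inferred from the examples above (an assumption, not an
# official rule list). Digraphs come before the bare "h" rule so that
# "ch", "lh" and "nh" are consumed before any remaining "h" is silenced.
PADDED_RULES = [
    ("ç", "s"),
    ("ss", "s0"),
    ("ch", "$0"),
    ("lh", "L0"),
    ("nh", "N0"),
    ("h", "0"),
]

def align_consonants(word: str) -> str:
    """Apply the padded mappings so the result has the same length as the input."""
    for lexical, surface in PADDED_RULES:
        word = word.replace(lexical, surface)
    return word

def strip_zeros(aligned: str) -> str:
    """Drop the 0 placeholders, as in the strings a user would actually see."""
    return aligned.replace("0", "")

for w in ["chato", "filho", "ninho", "homem", "interesse"]:
    print(w, align_consonants(w), strip_zeros(align_consonants(w)))
```

Because only these mappings are applied, "filho" comes out as "fiL0o" here rather than "fiL0u"; the final vowel is handled by a separate rule.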
The orthographical digraph "rr" is always realized as /R/. Also, the single r at the beginning of a word is always realized as /R/. Elsewhere, r:r, i.e. lexical "r" is realized as /r/.
| carro | rápido | caro | cantar |
| kaR0u | Rápidu | karu | kantar |
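The three-way behaviour of "r" can be mimicked with two substitutions and a default. This is only a Python sketch of the logic, not xfst syntax; the function name `realize_r` is hypothetical, and no other rule (such as c to k or final o to u) is applied.

```python
import re

def realize_r(word: str) -> str:
    """Apply only the r rules: rr -> R, word-initial r -> R, otherwise r stays r."""
    word = word.replace("rr", "R")   # the digraph rr is always R
    word = re.sub(r"^r", "R", word)  # a single r at the start of the word is R
    return word                      # every other r is left unchanged

for w in ["carro", "rápido", "caro", "cantar"]:
    print(w, realize_r(w))
```

With only these rules applied, "carro" becomes "caRo" and "caro" is unchanged; the remaining differences from the table come from other rules.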
| peruca | case | cases |
| piruka | kazi | kazis |

| braço | braços |
| brasu | brasus |

| camisa | case |
| kamiza | kazi |

| vez |
| ves |

| disse | verdade | paredes |
| Jis0i | verdaJi | pareJis |
A "t" is pronounced /C/ when it appears before a surface sound /i/. (N.B. This change occurs in the environment of any SURFACE /i/, no matter what that surface /i/ may have been at the lexical level.) Elsewhere t:t.
| tio | partes |
| Ciu | parCis |
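Because the trigger is a surface /i/, any rule that creates an /i/ has to run before the "t" rule. The Python sketch below imitates that dependency with two substitutions. The final "e"/"es" rule is an assumption inferred from pairs such as partes/parCis, and both function names are hypothetical.

```python
import re

def raise_final_e(word: str) -> str:
    """Assumed rule: word-final e (optionally followed by s) surfaces as i."""
    return re.sub(r"e(?=s?$)", "i", word)

def affricate_t(word: str) -> str:
    """t surfaces as C when the next symbol is i."""
    return re.sub(r"t(?=i)", "C", word)

# The i-creating rule runs first, so the t rule sees the surface i.
print(affricate_t(raise_final_e("partes")))   # parCis

# In the opposite order the t rule runs too early and misses the new i.
print(raise_final_e(affricate_t("partes")))   # partis
```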
The vowels are a, e, i, o, u, á, é, í, ó, ú, ã, õ, â, ê, ô, ü and à. All lexical symbols map to themselves on the surface level by default.
Write a set of rules that performs the mappings indicated. As in the kaNpat example, the rules should be organized in a cascade, with the composition operator (.o.) between the rules. Be very careful about ordering your rules correctly; the rules cannot be expressed in exactly the same order as the facts listed just above. Compile the rules using the read regex utility in xfst and test them using the apply down utility.
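To see why the order of a cascade matters, it can help to imitate composition with ordinary function chaining. The Python sketch below is an analogy only, not xfst code: `cascade` is a hypothetical helper that feeds the output of each rule to the next, and the two rules (s voiced to z between vowels, ç realized as s) are assumptions inferred from the examples.

```python
import re
from functools import reduce

VOWELS = "aeiouáéíóúãõâêôüà"

def cascade(*rules):
    """Return a function that applies the rules left to right, first rule first."""
    return lambda word: reduce(lambda w, rule: rule(w), rules, word)

def voice_s(word: str) -> str:
    """Assumed rule: s between two vowels surfaces as z."""
    return re.sub(rf"(?<=[{VOWELS}])s(?=[{VOWELS}])", "z", word)

def cedilla(word: str) -> str:
    """ç always surfaces as s."""
    return word.replace("ç", "s")

good = cascade(voice_s, cedilla)   # voicing sees only lexical s
bad = cascade(cedilla, voice_s)    # voicing also hits the s that came from ç

print(good("braço"), bad("braço"))  # braso brazo
```

In the badly ordered cascade the "s" produced from "ç" is wrongly voiced, which is the kind of interaction the ordering warning above is about.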
You should be able to handle the following examples, entering the lexical (top) string in each case and getting back the surface (bottom) string. (The zeros are not shown here and should not appear in your output.) To facilitate the testing, you can type all the input (upper-side) words into a file, called something like mydata, and tell apply down to read the various input strings from that file.
| xfst[1] apply down < mydata |
| disse | peru | pedaço | livro | parte | parede | sabe | cada |
| Jisi | piru | pedasu | livru | parCi | pareJi | sabi | kada |
| simpático | verdade | casa | braço | chato | vermelho | gatinho | filhos |
| simpáCiku | verdaJi | kaza | brasu | $atu | vermeLu | gaCiNu | fiLus |
| luz | case | braços | partes | paredes | me | antes | ninhos |
| lus | kazi | brasus | parCis | pareJis | mi | anCis | niNus |
Be sure to test ALL the examples to make sure that your rules are really working as they should. Modify your rules and re-apply the input words until the grammar is working perfectly.