*
* Lecture notes by Edward Loper
*
* Course: CIS 639 (Statistical Approaches to Natural Language Processing)
* Professor: Mitch Marcus
* Institution: University of Pennsylvania
*
# http://www.cis.upenn.edu/~mitch/cis639.html

> Logistics
Go over section III of Manning carefully.
Go over Mike Collins's & Adwait's code..
Fred Jelinek?
Bulk pack

Topics
  - Brill taggers
  - HMMs
  - maxent
  - PCFGs
  - generative statistical models
  - SVMs
  - memory based learning
  - voting methods

> Why corpus based approaches?
In the 80's, people said the parser problem was solved.
Informal IBM study in 1990: parsers get <40% correct.

Why did people say the parser problem was solved? People can "magically adapt" to the capabilities of the system.. e.g., Zork.

The apparent problem: NL grammars are very big
  - lexical idiosyncrasies?
Pervasive ambiguity
Working hypothesis: build systems that learn? \to Supervised learning

[01/10/02 12:35 PM]

> Plan for the 1st part of the class
  - HMM basics
  - Chapter 10 (tagging), quickly
  - Chapter 9 (HMMs), in depth
  - chunking

> Tagging

>> Language Modelling
Task: predict the next item in a sequence

>> Markov Models
We can think of a bigram model as a Markov model. Start out with a finite state automaton:
#   farm \to subsidy     \to for
#       \searrow\nearrow     \searrow\nearrow
#       \nearrow\searrow     \nearrow\searrow
#   form \to subsidies       far

Instead of using a transition function, use a transition matrix: a(i,j) = probability of going from state i to state j. Estimate a(i,j):
#            count(w_i,w_j)
#   a(i,j) = --------------
#              count(w_i)

>> POS tagging
Corpus-based techniques do very well
  - unigram \to 91%
  - simple, impossible approach: use argmax_T P(T|W)..
    - T = sequence of tags, W = sequence of words
    - sparse data (zeros)
    - computationally expensive
  - But we only have probability estimates..

>>> What can we estimate well?
If we make assumptions about the distributions, then we can figure things out. First, assume there's a uniform distribution of 5k words, and 40 part of speech tags.
Then we can figure out the best-case number of samples/instance we can get. E.g., for \langle word,tag\rangle, we have 5 samples/instance, on average (5k * 40 = 200k possible pairs, so this figure corresponds to a corpus of about 1M tokens).
Accurate models require a very large amount of data

> A practical statistical tagger
By Bayes' rule:
#            P(W|T) P(T)
#   P(T|W) = -----------
#               P(W)
P(W) is the same for every T, so we want to maximize P(W|T) P(T). What distribution are we actually drawing from? (Joint of W and T)

>> Computing P(T)
By the chain rule:
#   P(T) = P(t_1) * P(t_2|t_1) * P(t_3|t_1,t_2) * \ldots * P(t_n|t_1\ldots t_{n-1})
But we can't estimate the long-history terms of P(T). So make the Markov assumption:
#   P(t_i|t_1, \ldots, t_{i-1}) \approx P(t_i|t_{i-1})

>> Computing P(W|T)
#   P(w_i|t_1,\ldots t_n) \approx P(w_i|t_i)

[01/15/02 12:34 PM]

>> HMM..
Use an HMM to implement the statistical tagger
#     \downarrow\nwarrow      \downarrow\nwarrow      \downarrow\nwarrow
#   [Det]\longrightarrow[Adj]\longrightarrow[Noun]
#       \searrow    \nearrow\searrow    \nearrow
#         \longrightarrow     \longrightarrow        (fully connected)
#   P(w|t); P(t|t_{i-1})
P(T,W) reduces to:
#   \pi(t_1) * \prod_{i=1\ldots n} P(w_i|t_i) * \prod_{i=2\ldots n} P(t_i|t_{i-1})
So the Markov model gives us the same equation as Bayes' rule..

>> Noisy Channel
The noisy channel is memoryless
#   \longrightarrow P(e_k|e_{k-1}) \longrightarrow P(f_k|e_k) \longrightarrow out

>> HMMs again
But HMMs can be trained without a tagged corpus. In particular, if we have a set of possible tags for words, and a large unannotated corpus, we can learn all HMM parameters.

[01/17/02 12:36 PM]

> Symbolic Learning for POS Tagging
Fun with Brill taggers!
  - Use iterative improvement with transformational rules
  - Use transformational rule templates
Problems with overfitting?
  - Some rule templates can have many many parameters
  - You basically don't get overfitting -- you're only selecting some rules, so few parameters?

>> Why does it work?
  - Try scorings other than (right-wrong)?
    - They don't help
  - GPS: General Problem Solver
    - Newell, Shaw, Simon 1958
    - Means-ends analysis

>> So what: Brill's Existential Despair
What if we just have an undergrad do this?
  - in a short time, they can produce rules that are just as good.

[01/22/02 12:33 PM]

> Formalizing HMMs
#   HMM = \langle A,B,\Pi,Q,V\rangle
#   A = {A_{ij}}
#     A_{ij} = P(q_j at t+1|q_i at t)
#   B = {B_{jk}}
#     B_{jk} = P(v_k|q_j at t)
#   \Pi = {\pi_i}
#     \pi_i = P(q_i at 0)
#   v_k = vocabulary items
#   Q = states

Vector quantization:
  - method of producing small fixed vocab (v_k)
  - first, make a vector discrete by quantizing
  - then produce a fixed-size vocab by picking the best n exemplars; the vocab item selected is whichever exemplar is closest.
(For speech recognition: use a cepstrum, not an FFT.)

For speech recognition:
  - use fixed number of transitions/state. eg:
#     \downarrow\nwarrow      \downarrow\nwarrow      \downarrow\nwarrow
#   [Det]\longrightarrow[Adj]\longrightarrow[Noun]
#       \searrow    \nearrow\searrow    \nearrow
#         \longrightarrow     \longrightarrow
  - use vector quantization to produce vocab

Arc Emission HMMs:
#   B = {B_{ijk}}
#     B_{ijk} = P(v_k\in V|q_i at t, q_j at t+1)
State emission HMMs are a special case of arc emission HMMs.

We love Viterbi!

What's P(i\to j with output v_k)?
#   = a_{i,j}b_{i,j,k}
#   \alpha_t[i] = P(q_i at t, the output seen before t)
#   \alpha_{t+1}[j] = \sum_i \alpha_t[i] a[i,j] b[i,j,k]      (v_k = the symbol output at time t)

>> Decoder lattice
Arc is the joint probability on a transition and an output.
#   s_i --a(i,j)b(i,j,o_k)\longrightarrow s_j
Node is the joint probability of being in a state and having seen a partial output:
#   \alpha(j,t) = P(x_t=s_j,o_{0\ldots t-1}|M)
We can compute it in O(N^2T) time. Basically, this is because of the Markov property: the only things we need to know to compute the probability for a state are the probabilities for the previous states (and the input symbol). Locality is essential for dynamic programming.
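The lattice computation above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `forward`, the nested-list layout of `pi`, `a`, and `b`, and the toy model in the usage note are made up for the example and are not from the lecture. Here `alpha[t][j]` plays the role of \alpha(j,t), with `alpha[0]` equal to \Pi.

```python
def forward(pi, a, b, obs):
    """Forward pass for an arc-emission HMM (illustrative sketch).

    pi[i]      : probability of starting in state i
    a[i][j]    : probability of moving from state i to state j
    b[i][j][k] : probability of emitting symbol k on the arc i -> j
    obs        : list of observed symbol indices

    Returns (alpha, P(O|M)), where alpha[t][j] is the joint probability
    of being in state j after t arcs and having emitted obs[0..t-1].
    """
    n = len(pi)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        # Each new column depends only on the previous column (Markov
        # property): O(N^2) work per column, O(N^2 T) for the lattice.
        col = [sum(prev[i] * a[i][j] * b[i][j][o] for i in range(n))
               for j in range(n)]
        alpha.append(col)
    # Summing the last column marginalizes over the final state.
    return alpha, sum(alpha[-1])
```

For instance, with a hypothetical two-state model `pi = [1.0, 0.0]`, `a = [[0.5, 0.5], [0.0, 1.0]]`, `b = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]`, calling `forward(pi, a, b, [0, 1])` follows the two surviving paths through the lattice. A useful sanity check is that the returned probabilities sum to 1 over all observation sequences of a fixed length.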
[01/29/02 12:34 PM]

> Forward-backward algorithm (aka Baum-Welch algorithm)
For any 1\leq t\leq T+1:
#   [Eq. 8] P(O|M) = \sum_{i=1}^N\alpha(i,t)\beta(i,t)
Find the best model for this output:
#   argmax_m P(O|m)

>> Expectation phase
Compute:
#   P_{arc}(i,j,t)
#   (=P_t(i,j) in the book)
The probability that we go from i to j at time t, given the output and the model.
#                    \alpha_t(i)a_{i,j}b_{i,j,o_t}\beta_{t+1}(j)
#   P_{arc}(i,j,t) = ------------------------------------------
#                                     P(O|M)
This is basically just:
#   =P(x_i at t; x_j at t+1 | O,M)
Note that for all t:
#   \sum_i\sum_j P_{arc}(i,j,t) = 1

>> Maximization
#   \pi'_i = \sum_j P_{arc}(i,j,1)
#
#                 \sum_t P_{arc}(i,j,t)
#   a'_{i,j} = ------------------------------
#              \sum_t\sum_{j'} P_{arc}(i,j',t)
#
#                \sum_{t s.t. o_t=k} P_{arc}(i,j,t)
#   b'_{i,j,k} = ----------------------------------
#                      \sum_t P_{arc}(i,j,t)

[01/31/02 12:37 PM]

> Speech recognition

>> Vowels
  - characterized by 3 resonance frequencies.
Model throat as tube with semi-divider:
#   --------------------------
#
#                |
#   --------------------------
We can move the divider left/right (affects height of freqs) or up/down (affects the relative power of formants)

[02/05/02 12:37 PM]

> Statistical MT
We want:
#   ehat = argmax_E P(E|F)
Noisy channel:
#   P(E) \to P(F|E) \to
  - P(E) is language model
  - P(F|E) is an HMM with state=word
By Bayes' rule, argmax_E P(E|F) = argmax_E P(E)P(F|E), since P(F) is fixed for a given input.

What's wrong with this model for P(F|E)?
  - word order problem: English and French word orders may differ.
  - fertility: n-to-m translations (e.g., not \to ne...pas)
Use a generative model (HMMs are a generative model)

Why estimate P(F|E) rather than just directly estimating P(E|F)? Well, one problem is that our model of P(E|F) gives a lot of probability mass to gibberish. This happens because there are many more sentences that are not English than those that are. But now consider running P(E)P(F|E). Here, P(E) helps us select things that are not gibberish; and then it doesn't matter that P(F|E) assigns lots of probability mass to gibberish; we throw that out. Since we're doing an argmax, that's ok.
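A toy Python illustration of the argument above, not a real decoder: the function name `noisy_channel_decode`, the candidate list, and every probability below are invented for the example. The translation model is deliberately blind to word order, so it scores a gibberish candidate as highly as the good one, and the language model is what breaks the tie.

```python
import math

def noisy_channel_decode(f, candidates, lm, tm):
    """Pick the English candidate e maximizing P(e) * P(f|e).

    lm[e]      : language-model probability P(e)
    tm[(f, e)] : translation-model probability P(f|e)
    Scores are compared in log space to avoid underflow.
    """
    def score(e):
        p_e = lm.get(e, 0.0)
        p_f_given_e = tm.get((f, e), 0.0)
        if p_e == 0.0 or p_f_given_e == 0.0:
            return float("-inf")
        return math.log(p_e) + math.log(p_f_given_e)
    return max(candidates, key=score)

# Made-up numbers: P(f|e) cannot tell the two word orders apart,
# but P(e) gives the gibberish order very little mass.
lm = {"the house": 1e-2, "house the": 1e-6}
tm = {("la maison", "the house"): 0.2, ("la maison", "house the"): 0.2}
best = noisy_channel_decode("la maison", ["house the", "the house"], lm, tm)
```

Because only the argmax matters, the mass that P(F|E) wastes on "house the" never has to be normalized away; it simply loses once P(E) is multiplied in.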
Picture:
#   [ JUNK      --]
#   [         -- -\]\
#   [ [    ]\--]>---[    ]
#   [ [ Fr ]---]----[ En ]
#   [ [    ]-x-]>---[    ]
#   [         x-/]/
Put another way: we have a generative model that overgenerates. How do we know where it's overgenerating?

Use "alignment" to take care of order and fertility
  - consists of connections: e.g., \langle2,1\rangle.
  - for an English sentence with length l, and a French target length m, there are l*m possible connections, so 2^{l*m} possible alignments.

Generative model (roughly model 1):
  1. pick target length
  2. pick connections (independently)
  3. for each connection, generate a word (conditional only on the English word)

cepts = concepts: map via an intermediate generative locus, which allows us to handle fertility more gracefully. Empty \to empty cept; multi\to helps us with order.

Adding alignments:
#   P(F=f,A=a|E=e)
So to get P(f|e):
#   P(f|e) = \sum_a P(f,a|e)
To get P(f,a|e) we can do:
#   P(f,a|e) = P(f|a,e)P(a|e)
Note that the length m is hidden in the choice of alignment.

>> Model 1
Limit ourselves to n-to-1 (n\geq0). (n French words to 1 English word; we translate French\to English by modelling P(f|e).)
  - Add a single {\o} symbol (the empty cept) to each English sentence, to generate French words that have no English counterpart (the 0-to-1 translations).
The following is true (derived from chain rule)
#   P(f,a|e) = P(m|e)\prod_{j=1\ldots m}P(a_j|a_1^{j-1},f_1^{j-1},m,e)
#                                       P(f_j|a_1^j,f_1^{j-1},m,e)
Where:
#   a_1^j = a_1a_2\ldots a_j
#   f_1^j = f_1f_2\ldots f_j
#   a_j = position in English of French word j.
Note that this a_j notation imposes the constraint that each French word connects to exactly one English word.

This equation says:
#   For each j: first, generate the next alignment for j; then
#   pick the word for that alignment.
Now simplify it, because of sparse data problem: use backoff to simpler things.

[02/12/02 12:44 PM]
fun with mt

>> More Model 1

>>> Estimating P(m|e)
  - Assume that it's independent of both e and m.
  - Assume that there's a maximum length 1/\epsilon.
So:
#   P(m|e) \approx \epsilon

>>> Estimating P(a_j|a_1^{j-1},f_1^{j-1},m,e)
This is basically the alignment probability. Some plausible alternatives:
  - Condition it on j
  - Condition it on a_{j-1}
But we'll be even more simple. Free word order, but make sure you at least get some word. So:
#   P(a_j|a_1^{j-1},f_1^{j-1},m,e) \approx 1/(l+1)
(The "+1" is for the empty cept)

>>> Estimating P(f_j|a_1^j,f_1^{j-1},m,e)
Use the alignment to directly translate words. Estimate probability of French words using the "translation probability":
#   P(f_j|a_1^j,f_1^{j-1},m,e) \approx P(f_j|e[a_j])
Use frequency, or maybe smoothed frequencies

>>> Putting it all together
#   P(f,a|e) = \epsilon * (l+1)^{-m} \prod_j P(f_j|e[a_j])
Sum that over all alignments.. But there are (l+1)^m possible alignments. We could just enumerate all alignments, and sum the P's, but that is exponential in m.

>> Model 2

>>> Estimating P(a_j|a_1^{j-1},f_1^{j-1},m,e)
Make a smarter model: alignment depends on j, m, and l.
#   P(a_j|a_1^{j-1},f_1^{j-1},m,e) \approx P(a_j|j,m,l)
Impose the constraint:
#   \sum_{i=0\ldots l} P(a_j=i|j,m,l) = 1

>>> Putting it all together
#   P(f,a|e) = \epsilon * \prod_j P(f_j|e[a_j])P(a_j|j,m,l)

>> Model 3
New basic model. Explicitly model the cepts.
  - \phi Fertility
  - \tau Tableau (translation table for a given fertility)
  - \pi Permutation

[02/14/02 12:43 PM]

Full model:
#   P(\tau,\pi|e) =
#     \prod_{i=1\ldots l} P(\phi_i|\phi_1^{i-1},e) *
#     P(\phi_0|\phi_1^l,e) *
#     \prod_{i=0\ldots l}\prod_{k=1\ldots\phi_i} P(\tau_{ik}|\tau_{i1}^{k-1},\tau_0^{i-1},\phi_0^l,e) *
#     \prod_{i=1\ldots l}\prod_{k=1\ldots\phi_i} P(\pi_{ik}|\pi_{i1}^{k-1},\pi_1^{i-1},\tau_0^l,\phi_0^l,e) *
#     \prod_{k=1\ldots\phi_0} P(\pi_{0k}|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\phi_0^l,e)
  - Line 0: to find the probability of a translation\ldots
  - Line 1: pick a fertility for each word
  - Line 2: pick a fertility for the empty cept
  - Line 3: pick a translation for each word
  - Line 4: pick a permutation for each word
  - Line 5: pick a permutation for the empty cept.
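Returning to the Model 1 sum over alignments: a small Python sketch, under assumptions, of the brute-force enumeration next to a cheaper equivalent. The cheaper version relies on a standard rearrangement that the notes do not state: because each a_j is chosen independently and uniformly, the sum over alignments of a product over j equals the product over j of a sum over English positions. The function names, the `"NULL"` token standing in for the empty cept, and the translation table `t` are all hypothetical.

```python
from itertools import product

def model1_joint(f, a, e, t, eps=1.0):
    """P(f, a | e) = eps * (l+1)^(-m) * prod_j t(f_j | e[a_j]).

    e[0] is the empty cept, so len(e) == l + 1.
    """
    p = eps * len(e) ** (-len(f))
    for fj, aj in zip(f, a):
        p *= t.get((fj, e[aj]), 0.0)
    return p

def model1_brute(f, e, t, eps=1.0):
    """Sum P(f, a | e) over all (l+1)^m alignments explicitly."""
    return sum(model1_joint(f, a, e, t, eps)
               for a in product(range(len(e)), repeat=len(f)))

def model1_fast(f, e, t, eps=1.0):
    """Same value in O(l*m): swap the sum over alignments and the product."""
    p = eps * len(e) ** (-len(f))
    for fj in f:
        p *= sum(t.get((fj, ei), 0.0) for ei in e)
    return p

# Made-up translation probabilities t[(french, english)].
e = ["NULL", "the", "house"]
f = ["la", "maison"]
t = {("la", "NULL"): 0.1, ("la", "the"): 0.7, ("la", "house"): 0.05,
     ("maison", "NULL"): 0.02, ("maison", "the"): 0.05, ("maison", "house"): 0.8}
```

On this toy input `model1_brute(f, e, t)` walks all 9 alignments while `model1_fast(f, e, t)` does 6 table lookups; the two agree up to floating-point rounding.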
What independence assumptions do we want to use? (Too many parameters!)

Model with independences:
#   P(\tau,\pi|e) =
#     \prod_{i=1\ldots l} P(\phi_i|e_i) *
#     CHOOSE(\phi_1+\ldots+\phi_l, \phi_0) (1-p_1)^{\phi_1+\ldots+\phi_l-\phi_0} p_1^{\phi_0} *
#     \prod_{i=0\ldots l}\prod_{k=1\ldots\phi_i} P(\tau_{ik}|e_i,\phi_i) *
#     \prod_{i=1\ldots l}\prod_{k=1\ldots\phi_i} P(\pi_{ik}|l, m, i) *
#     \prod_{k=1\ldots\phi_0} P(\pi_{0k}|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\phi_0^l)
  - Line 1: fertility of normal words just depends on the word
  - Line 2: assume we can get fertility of the empty cept by using a probability p_1 for getting an empty cept for a given word..
  - Line 3: translation depends on the English word & its fertility
  - Line 4: permutation depends on the length of the English & French sentences; and the index of the current (English) word.
  - Line 5: permutation of the empty cept: put the empty cept only in places where there weren't already words. Give it even distribution in empty spots..

[02/19/02 12:37 PM]

> Parsing by "Chunking"
Find non-recursive grammatical chunks
Non-nested NPs up to the head.

[02/28/02 12:42 PM]

> Smoothing
Smoothing is fun! Smoothing is good!

>> Theory
  - Small counts yield bad estimates

[03/05/02 12:41 PM]

>> Good-Turing Estimation
Make the assumption that the material is binomial. I.e., words in a document are iid.
  - Let N_r be the number of items that occur r times
  - Insight: N_r can provide a better estimate of r
  - Adjusted frequency r^*:
#         E[N_{r+1}]
#   r^* = ---------- (r+1)
#           E[N_r]
Works well for language modelling, despite the fact that the binomial condition doesn't really hold..

Problems:
  - To estimate r^* for r=0, we must know how many things never occurred (=N_0)
  - For large r, N_r gets small, so E[N_r]'s must be smoothed

>> Terminology
  - Smoothing: average two distributions
  - Backoff: switch from one distribution to another distribution, depending on (some aspect of) the input.
  - So in some cases, backoff is a special case of smoothing.
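The Good-Turing adjusted frequency from the previous section can be sketched as follows. This is a minimal Python illustration under one loud assumption: it plugs the raw counts-of-counts N_r in for E[N_r], which is exactly what breaks down for large r, so the sketch returns `None` wherever N_{r+1} is zero instead of inventing a smoothed value. The function name and the toy counts are hypothetical.

```python
from collections import Counter

def good_turing(counts):
    """Return (adjusted, unseen_mass) for a dict of item -> count.

    adjusted[r] = (r+1) * N_{r+1} / N_r, using raw N_r as a stand-in
    for E[N_r]; None when N_{r+1} == 0 (the N_r would need smoothing).
    unseen_mass = N_1 / N, the total probability reserved for items
    that never occurred.
    """
    n_r = Counter(counts.values())   # N_r: number of items seen r times
    total = sum(counts.values())     # N: total number of observations
    adjusted = {}
    for r in sorted(n_r):
        if n_r[r + 1] > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = None
    unseen_mass = n_r[1] / total if total else 0.0
    return adjusted, unseen_mass

# Toy data: three items seen once, two seen twice, one seen three times.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted, unseen_mass = good_turing(counts)
```

Here N_1 = 3, N_2 = 2, N_3 = 1, so items seen once are discounted to r^* = 2 * 2/3 and the highest count has no usable estimate, which is the "large r" problem noted above. Dividing `unseen_mass` among individual unseen items still requires knowing N_0.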
[03/05/02 01:15 PM]

> Smoothing in Practice\ldots
! Dan Bikel

Smoothing..
  - We want to estimate the likelihood of things that weren't observed (esp zero-count items).

>> Deleted interpolation
  - Create a smoother distribution by linearly interpolating several (hopefully) related distributions
#   p(A|B) = \sum_i\alpha_iP(A|\phi_i(B))
    (the \alpha_i are non-negative and sum to 1)

>> Witten-Bell
  - Instead of modifying "rough" estimates from one part of the corpus, using counts gathered from a held-out section, try to estimate the confidence in estimates directly.
  - We want direct confidence estimates for probability estimates (which can be used as \alpha s in smoothing)
  - How do we estimate confidence in a conditional probability estimate?
    - Base it on the *shape* of the distribution

Begin Digression\ldots

Define probability theory:
  - \Omega: set of events
  - F \subseteq 2^\Omega
  - P: F \to R
Define expectation:
  - E[X] = \sum_x xp(x) is the center of mass (\mu_x).
  - Also consider the center of mass in the y dimension, \mu_y. This quantity is related to entropy (in particular, entropy is the expected value of -log[p(x)]; \mu_y is the expected value of p(x)).

End Digression\ldots

For more uniform distributions, we have less confidence; for less uniform distributions, we have more confidence. For example, MLE will do very badly if everything occurs exactly once. This is esp bad if we don't know the underlying set of events.. But still true otherwise.

So, we trust distributions with lower entropy, and distrust distributions with higher entropy.

>> Basic Witten-Bell
Confidence for a pdf P(A|B):
#                       c(B)
#   \lambda = -----------------------
#             |{A_i:c(A_i,B)>0}|+c(B)
c is count. Or in simpler notation, with d = c(B) and u = the number of distinct outcomes seen with B:
#   \lambda = d/(d+u) = 1/(1+u/d)
("u" for unique, aka diversity)
The link to entropy:
#   u/d = 1/nbar
#   nbar = d/u = average count per distinct outcome
Using weights..
#   \lambda_1e_1 + (1-\lambda_1)[\lambda_2e_2 + (1-\lambda_2)e_3]
Chen & Goodman (96, 98) did analysis of smoothing techniques for language modeling.
They found that Witten-Bell was very bad for language modeling.. Why? Does this mean we shouldn't use Witten-Bell?
  - They didn't explore the formula as it actually gets used
People actually use 1/(1+k*u/d) instead of 1/(1+u/d). k is a "fudge factor," typically at least one. It "fudges" the number of unique outcomes for some history. This allows us to reserve more of the probability mass. But what do we use for k? Use held-out data to optimize k..

>>> Other tricks..
  - Add a factor to compensate for equi-trained submodels: if we have 2 models that are equally well trained for some instance, then we tend to trust the one with more confidence.
#               [     c(\phi_{i-1}(B)) ]      d_i
#   \lambda_i = [ 1 - ---------------- ] * ---------
#               [       c(\phi_i(B))   ]   d_i+k*u_i
  - Use an additive factor

Witten-Bell just takes one pass through the data. We can count things directly, etc. Good for estimating probabilities of unseen events. Fast reestimation, etc.

check newsgroup\ldots

(next time: PCFGs & probabilistic parsing)

[03/07/02 12:41 PM]

> Introduction to Statistical Parsing
Determining Grammatical Structure.. We need (roughly):
  - A grammar that specifies which sentences are legal
  - A parsing algorithm that assigns possible structures to new word strings.
  - A method for resolving ambiguities

>>> Begin digression
Terminology: "recursive transition networks" are those FSA things which consist of a set of named FSAs, where edges can be labeled with the name of an FSA.. eg:
#   [S] --NP\longrightarrow[ ]--VP\longrightarrow[E]
#                          \_\_
#                         /    \ adj
#                         \searrow /
#   [NP] --det\longrightarrow[ ]--Noun\longrightarrow[E]
#       \                    \nearrow
#        -NP\to[ ]-'s
Augmented transition networks: recursive transition networks with registers.. Woods '69
>>> End digression

[04/02/02 01:54 PM]

Assembling Current Parsing Technology
  - Inside algorithm -- PCKY
  - (outside prob) * (inside prob) = prob that the constituent is in the sentence (used to do a beam search of the space; usually, approximate outside prob).
  - lexicalized CFGs: associate a head word with each node. Gives us a good stand-in for context sensitivity. But creates a *lot* of rules.
    - So we need to deal with sparse data

Today:
  - Prepositional phrase attachment as language modelling
  - sparse data -- backoff
  - "linguistic" analysis & sparse data
  - Stetina & Nagao

> Presentation schedule
#   Thurs 4    [Cardie & Pierce]    Erwin, Seung-Yun
#   Mon 8      [Veenstra]           Mike, Dave
#   Tues 9     [ADK]                Xiayi, Shudong
#   Thurs 11   [TKS, Veenstra]      Edward, Nikhil
#   Mon 15     [MPRZ]               Jinying Chen, Libin Shen
#   Tues 16    [Kudo, Matsumoto]    Alex, Anne
#   Thurs 18                        Fernando