Fernando C. N. Pereira

Andrew and Debra Rachleff Professor
Dept of Computer and Information Science
University of Pennsylvania

[Photo: after skiing in Chillán]

On Leave

I have just stepped down as chair of Penn's CIS department after six and a half action-filled years. The new department chair is Susan Davidson. I am on leave from Penn, working at Google in Mountain View. I'm continuing to work with my research group at Penn, but I am not taking any new graduate students in 2008.


The “other” Fernando Pereira

“Fernando Pereira” is a pretty common first-last name combination in Portugal, Brazil, and several other countries. In particular, I am not the “MPEG” Fernando Pereira, who is a professor at IST in Lisbon. If you came to this page looking for multimedia, signal processing, MPEG, or IEEE matters, you are probably in the wrong place.


To Prospective Graduate Students

I'm on leave and not taking any new students. I receive so many e-mail messages from potential graduate students that if I replied thoughtfully to each I would have no time left for research. I try to reply to specific questions about my research. Anything you have to say about your background, qualifications, and research interests would be better put in your official application to the Computer and Information Science graduate program or to the Genomics and Computational Biology graduate group, depending on your background and interests. Graduate students are admitted to a graduate program, not into individual research groups, based on the record of their studies and tests, recommendation letters, and the match between their stated interests and the ongoing research in the program. All applications are studied carefully. Successful applicants will be offered full research assistantships. E-mail from applicants to individual faculty is not necessary.


Research

The main goal of my research group is to develop machine-learnable models of language and of other natural sequential data, such as biological sequences. Penn, with its strong machine learning group and its deep connections between computer science, linguistics, and computational biology, is the ideal place to pursue those goals. My most recent work has been on machine-learning techniques for parsing and text information extraction, but I have also worked on finite-state methods for speech recognition, on information-theoretic approaches to inducing compact representations of multivariate data, and on bridging the gap between distributional and logical views of natural-language syntax and semantics.

Structured linear models for information extraction and parsing

Many sequence-processing problems involve segmenting a sequence into subsequences (for example, person names vs. other text) or labeling its elements (for example, with parts of speech). Higher-level analyses parse the sequence into a tree or graph representing its hierarchical structure. Previous approaches based on probabilistic generative models like HMMs and PCFGs have difficulty dealing with correlated features of the input sequence. We have been developing and applying structured linear models, starting with conditional random fields, as a more flexible and effective approach to learning how to segment and parse sequences. We are applying these models to information extraction from biomedical text, dependency parsing, and gene finding.
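
To give the flavor of decoding with such a model, here is a minimal sketch (not our group's actual code; it assumes NumPy, and the scores are invented for illustration) of Viterbi search for the best label sequence. In a trained conditional random field, the per-position and transition scores below would come from learned feature weights:

import numpy as np

def viterbi(emit_scores, trans_scores):
    """emit_scores: (T, K) array, score of label k at position t.
    trans_scores: (K, K) array, score of moving from label i to label j.
    Returns the highest-scoring label sequence (toy sketch)."""
    T, K = emit_scores.shape
    best = np.zeros((T, K))               # best score of any path ending in (t, k)
    back = np.zeros((T, K), dtype=int)    # back-pointers for path recovery
    best[0] = emit_scores[0]
    for t in range(1, T):
        # candidate[i, j] = best path with label i at t-1, then label j at t
        candidate = best[t - 1][:, None] + trans_scores + emit_scores[t]
        back[t] = candidate.argmax(axis=0)
        best[t] = candidate.max(axis=0)
    # follow back-pointers from the best final label
    labels = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]

# Toy run: 3 positions, 2 labels (0 = "other", 1 = "person name");
# the transition scores encode that names tend to continue.
emit = np.array([[2.0, 0.2], [0.1, 2.0], [0.1, 1.5]])
trans = np.array([[0.5, 0.0], [0.0, 1.0]])
print(viterbi(emit, trans))   # prints [0, 1, 1]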

Finite-state speech processing

What do regular expressions turn into when we need to assign weights (perhaps probabilities) to alternative matches, and to compose pattern matchers? Weighted finite-state transducers. At AT&T, I was involved in developing these as a framework for speech recognition, leading to the creation of a powerful library that has been made available for non-commercial use. An open-source reimplementation is also available.
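
As a small illustration of the idea, the following sketch composes two toy weighted transducers in the tropical semiring (weights add along a path; the best path minimizes total weight) and extracts the best path. It ignores epsilon transitions and the optimizations of real libraries such as OpenFst, and the machines and symbols are invented for the example:

import heapq

# An arc is (src, input, output, weight, dst); states are integers,
# state 0 is initial, and `finals` lists the final states.
def compose(arcs1, finals1, arcs2, finals2):
    """Match the output tape of machine 1 against the input tape of
    machine 2; composed states are pairs, and weights add."""
    arcs, finals = [], []
    for (p, i, m, w1, q) in arcs1:
        for (r, m2, o, w2, s) in arcs2:
            if m == m2:
                arcs.append(((p, r), i, o, w1 + w2, (q, s)))
    for f1 in finals1:
        for f2 in finals2:
            finals.append((f1, f2))
    return arcs, finals

def shortest_path(arcs, finals, start):
    """Dijkstra over the machine: minimum-weight path to a final state."""
    adj = {}
    for (p, i, o, w, q) in arcs:
        adj.setdefault(p, []).append((w, q, (i, o)))
    heap, seen = [(0.0, start, [])], set()
    while heap:
        w, p, path = heapq.heappop(heap)
        if p in seen:
            continue
        seen.add(p)
        if p in finals:
            return w, path
        for (dw, q, lab) in adj.get(p, []):
            heapq.heappush(heap, (w + dw, q, path + [lab]))
    return None

# Toy: a "pronunciation" machine composed with a "word" machine.
A = [(0, "d", "d", 0.1, 1), (1, "ey", "data", 0.4, 2),
     (1, "ae", "data", 0.6, 2)]
B = [(0, "d", "d", 0.0, 0), (0, "data", "data", 0.0, 1)]
arcs, finals = compose(A, [2], B, [1])
print(shortest_path(arcs, set(finals), (0, 0)))
# -> (0.5, [('d', 'd'), ('ey', 'data')]): the cheaper pronunciation wins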

The information bottleneck

How does one quantify the notion of information about something? Given some variables of interest, a source of information about them can be compressed while preserving what it tells us about those variables. The tradeoff between compression and information preservation, which we call the information bottleneck, answers the question. Using this model, we can build compact representations of complex relationships, for instance word co-occurrences in text.
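
In symbols: writing X for the source, Y for the variables of interest, T for the compressed representation, and β > 0 for the tradeoff parameter, the method seeks a stochastic map p(t|x) that optimizes the bottleneck objective:

% Information bottleneck objective: compress X into T (small I(X;T))
% while preserving information about Y (large I(T;Y)).
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)

Larger β favors preserving relevant information; smaller β favors more aggressive compression.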

Formal semantics of natural language

The syntactic structures of natural-language sentences and their meanings must be linked by a systematic, compositional process for language learning and use to be possible. However, this form of compositionality is more subtle than that of logical and programming languages. Linear logic turns out to be a good metalanguage for describing the natural-language syntax-semantics mapping.
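
For a taste of the approach, here is a schematic toy derivation (the resource names g, h, f are invented labels for the subject, object, and sentence): each word contributes a meaning constructor pairing a term with a linear-logic formula, and linear implication (\multimap) consumes each premise exactly once, so no word's contribution is duplicated or discarded:

    john : g
    mary : h
    saw  : g \multimap (h \multimap f)
    ----------------------------------------------
    saw(john)(mary) : f    (two \multimap eliminations)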


Bio

I was born and raised around Lisbon, Portugal. I started college studying electrical engineering but majored in mathematics. While in college, I worked part-time on an architectural CAD project at LNEC, a government engineering laboratory. After graduating, I stayed at LNEC for two years as a systems programmer and administrator, but I also got involved in urban traffic modeling, artificial intelligence, and logic programming.

In 1977 I took up a scholarship from the British Council to study artificial intelligence at the University of Edinburgh. There I worked on natural-language understanding and logic programming, and for a while again on architectural CAD. I was involved in creating the first Prolog compiler (for the PDP-10), and I also wrote the first widely used Prolog interpreter for 32-bit Unix machines.

I graduated in 1982 and joined the Artificial Intelligence Center of SRI International in Menlo Park, CA, where I worked on logic programming, natural-language understanding, and later on speech-understanding systems. During 1987-88, I headed SRI's research center in Cambridge, England. I joined AT&T in the summer of 1989, where I worked on speech recognition, speech retrieval, probabilistic language models, and several other topics. From 1994 to 2000, I headed the Machine Learning and Information Retrieval department of AT&T Labs -- Research. I spent the 2000-2001 academic year as a research scientist at WhizBang! Labs, where I developed finite-state models and algorithms for information extraction from the Web. I have been at Penn since 2001.