Department of Statistics Seminar
North Carolina State University
presents
David Madigan
Rutgers University
Bayesian Text Categorization
ABSTRACT
Text categorization concerns the assignment of documents to predefined categories. Traditionally, librarians and human indexers have carried out such categorization tasks, sometimes on a large scale. For example, the US National Library of Medicine engages over 100 human indexers to assign medical subject headings to 400,000 medical articles a year.
Applications such as e-mail filtering, pornography detection, medical coding, and news filtering are creating a growing demand for automated text categorization, especially for categorization algorithms that can learn from examples. The statistical challenges revolve around issues of scale (the number of predictor variables can run to the tens of thousands) and model structure.
In recent empirical evaluations, support vector machines and boosting algorithms have overtaken more traditional probabilistic classifiers such as Naive Bayes. This talk will describe these approaches. It will also present tractable variants of the probabilistic approach, in particular "sparse Bayesian" classifiers that perform well predictively. (Joint work with David D. Lewis.)
Friday, May 2, 2003
3:35 - 4:35 pm
206 Cox Hall
Refreshments will be served on the second floor of Dabney Hall (left of Room 222) at 3:00 pm.