Research Seminar: Statistical Natural Language Processing
Class Objectives (Learning Outcomes)
After completing the course, participants will be able to:
- apply concepts from text mining and natural language processing (NLP), in particular statistical NLP (see the Literature section for basic references),
- employ tools and methods used in text mining and NLP,
- develop extensions to the text mining infrastructure of R, i.e., create so-called plugin packages for tm,
- create an R package vignette.
Prerequisites
Good knowledge of the R language, statistics, and the tm package.
Teaching and Learning Methods
Reading seminar and R-based mini labs.
Contact
Instructor: Kurt Hornik (Kurt.Hornik _AT_ wu.ac.at)
Assistant: Stefan Theussl (Stefan.Theussl _AT_ wu.ac.at)
Schedule
Unit | Date | Time | Topic | Slides |
---|---|---|---|---|
1 | 08.10. | 15:00 -- 18:00 | Preliminary Talk | -- |
2 | 22.10. | 15:00 -- 18:00 | tm and plugins: Ingo, Stefan (Chapter 2/11) | [1, 2, R] |
3 | 05.11. | 09:00 -- 12:00 | Raw Text/Tagging: Karl, Norbert / Kamran (Chapter 3/5) | [1, 2, R] |
4 | 26.11. | 09:00 -- 12:00 | Classification/Information extraction: Paul, Thomas, Willy / Angela, Mathias (Chapter 6/7) | [1, 2] |
5 | 10.12. | 09:00 -- 12:00 | Data retrieval/Sentiment Analysis: Mario | [1] |
6 | 14.01. | 09:00 -- 12:00 | Project discussion | -- |
7 | 28.01. | 09:00 -- 12:00 | Project discussion | -- |
The seminar takes place in the meeting room (Besprechungsraum) of the Institute for Statistics and Mathematics, UZA2, Ebene 4 (level 4).
Projects
Participants should pick one project from the following list, preferably one related to the topic they presented in the first half of the course.
- WordNet (review/extend wordnet package)
- Interfaces to web resources (RSS feed parser, Wikipedia, etc.)
- Lexical resources for ontology learning (e.g., build corpora from sources like Project Euclid)
- Sentiment analysis (literature overview, implement/compare scoring methods)
- POS Tagger/OpenNLP (literature/software overview, data structures, review/extend openNLP package), Stanford Tagger
Criteria for Successful Completion
Presentations and package vignette.
Literature
- Bird, Steven; Klein, Ewan; Loper, Edward: Natural Language Processing with Python. O'Reilly Media, 2009. http://www.nltk.org/book
- Friedman, Jerome; Hastie, Trevor; Tibshirani, Robert: The Elements of Statistical Learning. Springer, 2008.
- Jurafsky, Daniel; Martin, James H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2008.
- Manning, Christopher D.; Schuetze, Hinrich: Foundations of Statistical Natural Language Processing. MIT Press, 2000.
- Manning, Christopher D.; Raghavan, Prabhakar; Schuetze, Hinrich: Introduction to Information Retrieval. Cambridge University Press, New York, 2009.
- Wasserman, Larry: All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.
- Zhai, ChengXiang: Statistical Language Models for Information Retrieval. Morgan & Claypool, 2009.
Additional Resources
See also
Last change: 2010-12-02 by Stefan Theussl