Research Seminar: Statistical Natural Language Processing

Class Objectives (Learning Outcomes)

After completing the course, participants will be able to:

apply concepts from text mining and natural language processing (NLP), in particular statistical NLP (see the Literature section for introductory reading),
employ tools and methods used in text mining and NLP,
develop extensions to the text mining infrastructure of R, i.e., create so-called plugin packages for tm,
create an R package vignette.

Prerequisites

Good knowledge of the R language, statistics, and the tm package.

Teaching and Learning Methods

Reading seminar and R-based mini labs.

Contact

Instructor: Kurt Hornik (Kurt.Hornik _AT_ wu.ac.at)

Assistant: Stefan Theussl (Stefan.Theussl _AT_ wu.ac.at)

Schedule

Unit	Date	Time	Topic	Slides
1	08.10.	15:00 -- 18:00	Preliminary Talk	--
2	22.10.	15:00 -- 18:00	tm and plugins: Ingo, Stefan (Chapter 2/11)	[1, 2, R]
3	05.11.	09:00 -- 12:00	Raw Text/Tagging: Karl, Norbert / Kamran (Chapter 3/5)	[1, 2, R]
4	26.11.	09:00 -- 12:00	Classification/Information extraction: Paul, Thomas, Willy / Angela, Mathias (Chapter 6/7)	[1, 2]
5	10.12.	09:00 -- 12:00	Data retrieval/Sentiment Analysis: Mario	[1]
6	14.01.	09:00 -- 12:00	Project discussion	--
7	28.01.	09:00 -- 12:00	Project discussion	--

The seminar will take place in the meeting room (Besprechungsraum) of the Institute for Statistics and Mathematics, UZA2, level 4 (Ebene 4).

Projects

Participants should pick one project from the following list, preferably the one related to the topic they presented in the first half of the course.

WordNet (review/extend the wordnet package)
Interfaces to web resources (RSS feed parser, Wikipedia, etc.)
Lexical resources for ontology learning (e.g., build corpora from sources like Project Euclid)
Sentiment analysis (literature overview, implement/compare scoring methods)
POS Tagger/OpenNLP (literature/software overview, data structures, review/extend openNLP package), Stanford Tagger

A package vignette (eight to twelve pages) is to be delivered at the end of the project.

Criteria for Successful Completion

Presentations and package vignette.

Literature

Bird, Steven; Klein, Ewan; Loper, Edward: Natural Language Processing with Python. O'Reilly Media, 2009. http://www.nltk.org/book
Friedman, Jerome; Hastie, Trevor; Tibshirani, Robert: The Elements of Statistical Learning. Springer, 2008.
Jurafsky, Daniel; Martin, James H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2008.
Manning, Christopher D.; Schuetze, Hinrich: Foundations of Statistical Natural Language Processing. MIT Press, 2000.
Manning, Christopher D.; Raghavan, Prabhakar; Schuetze, Hinrich: Introduction to Information Retrieval. Cambridge University Press, New York, 2009.
Wasserman, Larry: All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.
Zhai, ChengXiang: Statistical Language Models for Information Retrieval. Morgan & Claypool, 2009.