Kurt Hornik: Data and Text Mining Summer Semester 2022


All classes start at 09:00, except the presentations in Unit 9, which start at 17:30.
Unit  Date        Time         Location  Topic                             Materials  Assignments
                                                                                      Read Ch 1 pages 15-42
1     2022-03-03  09:00-12:00  TC.3.07   Generalized linear models         Slides     Ex 1-6 (due 2022-03-09 23:59); read Ch 4 pages 129-189
2     2022-03-10  09:00-12:00  TC.3.07   Resampling                        Slides     Ex 7-13 (due 2022-03-16 23:59); read Ch 5 pages 197-219
3     2022-03-17  09:00-12:00  TC.3.07   Penalized regression              Slides     Ex 14-16 (due 2022-03-23 23:59); read Ch 6 pages 225-282
4     2022-03-24  09:00-12:00  TC.3.07   Trees                             Slides     Ex 17-21, 24-25 (due 2022-03-30 23:59); read Ch 8 pages 327-361
5     2022-03-31  09:00-12:00  TC.3.07   Forests and boosting              Slides     Ex 22-23, 26-28 (due 2022-04-06 23:59)
6     2022-04-07  09:00-12:00  TC.3.07   Additional topics and case study  Slides
7     2022-04-21  09:00-12:00  TC.3.07   Text mining foundations           Slides
8     2022-04-28  09:00-12:00  TC.3.07   Text mining applications          Slides
9     2022-05-11  17:30-20:00  D4.0.019  Presentations


When submitting homework assignments by email, please use the subject ‘DTM Unit n Team k’, where n is the number of the unit and k is the number of your team.
Submit by email to <Kurt.Hornik@wu.ac.at> and cc all team members.
Chapters in assignments refer to the textbook by James et al.
Homework and presentation teams:
Team           Package     Papers                              Presentations
A     Natalie  mlr3        Gunnarsson, Loughran et al. (2016)  mlr3
B     Alissia  ranger      Renault, Gu                         ranger
C     Nikita   xgboost     Loughran et al. (2011)              xgboost
D     Nicolas  caret       Loughran et al. (2014)              caret
E     Swapnil  finreportr  Araci, Chen

R package projects

caret, mlr3, ranger, xgboost, edgar/finreportr/XBRL.

Reading list

Dogu Araci (2019), FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063.
Mike Chen, George Mussalli, Amir Amel-Zadeh and Michael Oliver Weinberg (2021), NLP for SDGs: Measuring Corporate Alignment with the Sustainable Development Goals. The Journal of Impact and ESG Investing. https://jesg.pm-research.com/content/early/2021/12/12/jesg.2021.1.035.
Chen Gu and Alexander Kurov (2020), Informational role of social media: Evidence from Twitter sentiment. Journal of Banking & Finance, volume 121. DOI:10.1016/j.jbankfin.2020.105969.
Björn Rafn Gunnarsson, Seppe vanden Broucke, Bart Baesens, María Óskarsdóttir and Wilfried Lemahieu (2021), Deep learning for credit scoring: Do or don’t? European Journal of Operational Research, volume 295, issue 1, pages 292-305. DOI:10.1016/j.ejor.2021.03.006.
Tim Loughran and Bill McDonald (2011), When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, volume 66, issue 1, pages 35-65. DOI:10.1111/j.1540-6261.2010.01625.x.
Tim Loughran and Bill McDonald (2014), Measuring Readability in Financial Disclosures. The Journal of Finance, volume 69, issue 4, pages 1643-1671. DOI:10.1111/jofi.12162.
Tim Loughran and Bill McDonald (2016), Textual Analysis in Accounting and Finance: A Survey. Journal of Accounting Research, volume 54, issue 4, pages 1187-1230. DOI:10.1111/1475-679X.12123.
Thomas Renault (2019), Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digital Finance, volume 2, pages 1-13. DOI:10.1007/s42521-019-00014-x.

Data sets

german (data, docs), firms (data, docs), Financial PhraseBank (data).

Textbooks

Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, "An Introduction to Statistical Learning (with Applications in R)", Second edition. https://www.statlearning.com/.


Electronic University Calendar (eVVZ)
