# Teaching

## Machine learning for NLP (2022)

Any questions? Contact me at aurelie DOT herbelot AT unitn DOT it.

## Description

The course introduces core Machine Learning algorithms for computational linguistics (CL). Its goals are to (1) provide students with an overview of core Machine Learning techniques widely used in CL; (2) explain in which contexts and for which applications each technique is suitable; (3) present the experimental pipeline needed to apply a technique to a particular problem, including data collection and the choice of evaluation method; and (4) give students practice in running Machine Learning software and interpreting its output. The syllabus covers Machine Learning methods from both a theoretical and a practical point of view, and aims to equip students to read the relevant scientific literature with a critical mind.

At the end of the course, students will: (1) demonstrate knowledge of the principles of core Machine Learning techniques; (2) be able to read and understand CL literature using the introduced techniques, and critically assess their use in research and applications; (3) have the fundamental computational skills needed to run existing Machine Learning software and interpret its output.

*Pre-requisites:* There are no prerequisites for this course. Students with no computational background will acquire good intuitions for a range of Machine Learning techniques, as well as basic practical skills to interact with existing software. Students with a good mathematical and computational background (including programming and familiarity with the Unix command line) will be invited to gain a deeper understanding of each introduced algorithm, and to try out their own modifications of the course software.

The course covers ten topics, including a general introduction in the first week. Each topic is taught over three sessions: 1) a lecture explaining the theory behind a technique/algorithm; 2) a discussion group focusing on a scientific paper in which the technique is put to work, demonstrating the experimental pipeline around the algorithm; 3) a hands-on session in which students run some software to familiarise themselves with the practical implementation of the method.

## Course schedule

### Week 1: Introduction

**Lecture 1:** General introduction. Slides

**Lecture 2:** Basic principles of statistical NLP: Language modelling, Naive Bayes, evaluation with Precision/Recall. Slides

**Practical:** Run a simple authorship attribution algorithm using Naive Bayes. (The code is on GitHub.)
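
To give a flavour of the practical, here is a minimal sketch of Naive Bayes authorship attribution (not the course code: the training sentences are made-up toy examples):

```python
import math
from collections import Counter

# Toy training data: a few sentences per "author" (made-up examples,
# not the corpus used in the course practical).
train = {
    "austen":  ["it is a truth universally acknowledged",
                "she was a woman of mean understanding"],
    "carroll": ["the rabbit hole went straight on like a tunnel",
                "curiouser and curiouser cried alice"],
}

# Word frequencies per author.
counts = {a: Counter(w for s in sents for w in s.split())
          for a, sents in train.items()}
vocab = {w for c in counts.values() for w in c}

def log_prob(sentence, author):
    """Log P(author) + log P(sentence | author), with add-one smoothing."""
    c = counts[author]
    total = sum(c.values())
    lp = math.log(1 / len(train))  # uniform prior over authors
    for w in sentence.split():
        lp += math.log((c[w] + 1) / (total + len(vocab)))
    return lp

def classify(sentence):
    return max(train, key=lambda a: log_prob(sentence, a))

print(classify("alice went down the rabbit hole"))  # → carroll
```

Working in log space avoids numerical underflow when multiplying many small probabilities, and add-one (Laplace) smoothing keeps unseen words from zeroing out an author's score.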

### Week 2: Data preparation techniques

**Lecture:** How to choose your data. Annotating. Focus on inter-annotator agreement metrics. Slides
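
As a quick illustration of an inter-annotator agreement metric, the following is a sketch of Cohen's kappa on hypothetical annotations (the label sequences are invented for the example):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations (1 = positive, 0 = negative).
ann1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
ann2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.583
```

Kappa corrects raw agreement for what the annotators would agree on by chance: here observed agreement is 0.8, chance agreement 0.52, giving kappa ≈ 0.583.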

**Practical:** Hands-on intro to Wikipedia pre-processing. (The code is on GitHub.)

### Week 3: Supervised learning

**Lecture:** Introduction to regression (linear, gradient descent, PLSR) and to the k-nearest neighbours algorithm. Slides
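
A minimal sketch of linear regression fitted by batch gradient descent (toy data generated from y = 2x + 1; not the course software):

```python
# Fit y = w*x + b by batch gradient descent on mean squared error.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)
for _ in range(2000):
    # Gradients of (1/n) * sum (w*x + b - y)^2 w.r.t. w and b.
    err = [w * x + b - y for x, y in zip(xs, ys)]
    dw = 2 / n * sum(e * x for e, x in zip(err, xs))
    db = 2 / n * sum(err)
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))  # → 2.0 1.0
```

Each iteration moves the parameters a small step against the gradient of the loss; with this well-conditioned toy data the fit recovers the generating slope and intercept.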

**Practical:** Intro to mapping between semantic spaces for translation. (The code is on GitHub.) For those wanting a simple tutorial on linear regression in Python, check: http://www.dataschool.io/linear-regression-in-python/.

### Week 4: Unsupervised learning

**Lecture:** Dimensionality reduction and clustering (SVD, LSH, random indexing, K-means). Slides
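
As a taste of the clustering part, here is a plain K-means sketch on made-up 2D points, with fixed starting centroids so the example stays deterministic (in practice initialisation is usually random):

```python
def kmeans(points, centroids, iters=10):
    """Plain K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl))
                     for cl in clusters if cl]
    assign = [min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(p, centroids[i])))
              for p in points]
    return centroids, assign

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Fixed starting centroids keep the toy example deterministic.
cents, labels = kmeans(pts, [(0.0, 0.0), (10.0, 10.0)])
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

The two alternating steps (assignment and centroid update) each decrease the within-cluster squared distance, which is why the algorithm converges.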

**Practical:** Implementing the fruit fly for similarity search with random indexing. (The code is on GitHub.)

### Week 5: Support Vector Machines

**Lecture:** SVM principles. Introduction to kernels. Slides
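
To preview the idea of margin maximisation, here is a sketch of a linear SVM trained by sub-gradient descent on the hinge loss (Pegasos-style updates). The data are toy 2D points, and the bias is folded in as a constant third feature, a common shortcut that means the bias is regularised too:

```python
# Linear SVM via sub-gradient descent on the regularised hinge loss.
data = [((2, 2, 1), 1), ((3, 3, 1), 1), ((2, 3, 1), 1),
        ((-2, -2, 1), -1), ((-3, -3, 1), -1), ((-2, -3, 1), -1)]

lam = 0.1                  # regularisation strength
w = [0.0, 0.0, 0.0]
t = 0
for epoch in range(100):
    for x, y in data:
        t += 1
        eta = 1.0 / (lam * t)  # decaying learning rate
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Shrink w (regulariser) and, if the margin is violated,
        # push w towards the misclassified/marginal example.
        w = [(1 - eta * lam) * wi for wi in w]
        if margin < 1:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]

preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
         for x, _ in data]
print(preds)  # → [1, 1, 1, -1, -1, -1]
```

Only points inside the margin (margin < 1) pull on the weights; the rest contribute nothing, which is the "support vector" intuition in miniature.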

**Practical:** Classify documents into topics using SVMs. (The code is on GitHub.)

### Week 6: Introduction to Neural Networks

**Lecture:** Basics of NNs. Slides
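
As a warm-up for the tutorial below, this is a from-scratch sketch of a tiny network (one hidden layer, sigmoid activations) trained by backpropagation on XOR, the classic problem a single layer cannot solve. All sizes and the learning rate are illustrative choices:

```python
import math, random

random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# XOR data.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 0]

# One hidden layer of 2 units; each row is [w_x0, w_x1, bias].
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W2 = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W1]
    o = sigmoid(W2[0] * h[0] + W2[1] * h[1] + W2[2])
    return h, o

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(X, T)) / len(X)

lr = 0.5
before = loss()
for _ in range(5000):                      # full-batch gradient descent
    gW1 = [[0.0] * 3 for _ in range(2)]
    gW2 = [0.0] * 3
    for x, t in zip(X, T):
        h, o = forward(x)
        d_o = 2 * (o - t) * o * (1 - o)    # backprop: output delta
        for j in range(2):
            gW2[j] += d_o * h[j]
            d_h = d_o * W2[j] * h[j] * (1 - h[j])  # hidden delta
            gW1[j][0] += d_h * x[0]
            gW1[j][1] += d_h * x[1]
            gW1[j][2] += d_h
        gW2[2] += d_o
    for j in range(2):
        W2[j] -= lr * gW2[j]
        for k in range(3):
            W1[j][k] -= lr * gW1[j][k]
    W2[2] -= lr * gW2[2]

print(before > loss())  # the loss should have decreased after training
```

The deltas are just the chain rule applied layer by layer: the output error is scaled by each unit's sigmoid derivative and passed back through the weights.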

**Practical:** Follow tutorial on implementing an NN from scratch: http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/.

### Week 7: RNNs and LSTMs

**Lecture:** Sequence learning with RNNs and LSTMs. Slides
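
The core of an RNN is a single recurrence reused at every time step. The sketch below shows just that forward recurrence, h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b), on an invented one-hot sequence; the dimensions and random weights are illustrative and no training is performed:

```python
import math, random

random.seed(0)

# A vanilla RNN cell: h_t = tanh(W_xh . x_t + W_hh . h_{t-1} + b).
IN, HID = 4, 3
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(IN)] for _ in range(HID)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(HID)] for _ in range(HID)]
b = [0.0] * HID

def step(x, h):
    return [math.tanh(sum(W_xh[i][j] * x[j] for j in range(IN)) +
                      sum(W_hh[i][j] * h[j] for j in range(HID)) + b[i])
            for i in range(HID)]

# One-hot encoded sequence over a 4-symbol alphabet (e.g. characters).
seq = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
h = [0.0] * HID
states = []
for x in seq:
    h = step(x, h)       # the same weights are reused at every time step
    states.append(h)

print(len(states), len(states[0]))  # → 4 3
```

Because h_{t-1} feeds into h_t, the hidden state carries information about earlier symbols forward, which is what lets RNNs model sequences (and, with the LSTM's gating, longer dependencies).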

**Practical:** Generate ASCII cats with an RNN. (The code is on GitHub.)


### Week 8: Adopt a network week!

**The Network Zoo:** A wild race through a few architectures. Slides

Pick a network from the Network Zoo and check whether it has language applications. If so, adopt it!

### Week 9: Reinforcement learning

**Lecture:** Principles of RL. Slides

**Practical:** Solving the OpenAI gym ‘frozen lake’ puzzle: https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb. And ordering a coffee at Rovereto train station. (The coffee code is on GitHub.)
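
In the spirit of the frozen-lake exercise, here is a tabular Q-learning sketch on a much-simplified stand-in environment (a deterministic 5-state corridor invented for illustration, not the notebook's code):

```python
import random

random.seed(0)

# Tabular Q-learning on a toy corridor: start in state 0,
# reward 1 for reaching state 4. Actions: 0 = left, 1 = right.
N_STATES, GOAL = 5, 4
LEFT, RIGHT = 0, 1
alpha, gamma, eps = 0.5, 0.9, 0.2   # learning rate, discount, exploration

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def move(s, a):
    s2 = max(0, s - 1) if a == LEFT else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action choice.
        a = random.randrange(2) if random.random() < eps else Q[s].index(max(Q[s]))
        s2, r = move(s, a)
        # Q-learning update: bootstrap from the best next-state value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [Q[s].index(max(Q[s])) for s in range(GOAL)]
print(policy)  # → [1, 1, 1, 1] (always go right)
```

The update rule nudges Q(s, a) towards the reward plus the discounted value of the best next action; over many episodes the goal reward propagates backwards until the greedy policy heads straight for it.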

### Week 10: The ethics of machine learning

**Lecture:** Ethical issues with ML. Bias in distributional vectors. Slides

**Practical:** Finding indirect gender biases in FastText vectors. (This exercise is open-ended, but there is some data and code on GitHub.)
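
One standard way to probe such biases is to project words onto a gender direction. The sketch below shows the mechanics with tiny hand-made vectors (NOT real FastText embeddings; the numbers are invented purely to illustrate the computation):

```python
import math

# Toy, hand-made "embeddings" (not real FastText vectors), chosen so the
# first dimension loosely encodes a he/she contrast.
vecs = {
    "he":       [1.0, 0.2, 0.1],
    "she":      [-1.0, 0.2, 0.1],
    "nurse":    [-0.6, 0.5, 0.3],
    "engineer": [0.7, 0.4, 0.2],
}

def sub(u, v):  return [a - b for a, b in zip(u, v)]
def dot(u, v):  return sum(a * b for a, b in zip(u, v))
def norm(u):    return math.sqrt(dot(u, u))

def cos(u, v):
    return dot(u, v) / (norm(u) * norm(v))

gender = sub(vecs["he"], vecs["she"])   # the "he - she" direction

# Positive score: the word leans towards "he"; negative: towards "she".
for w in ("nurse", "engineer"):
    print(w, round(cos(vecs[w], gender), 2))  # → nurse -0.72, engineer 0.84
```

With real embeddings the same projection, applied to occupation words, is one way such associations have been quantified; the open-ended part of the exercise is deciding what to measure and how to interpret it.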