# Teaching

# Machine learning for NLP (2019)

Any question? Contact me at aurelie DOT herbelot AT unitn DOT it.

### Description

The course introduces core Machine Learning algorithms for computational linguistics (CL). Its goal is to (1) provide students with an overview of core Machine Learning techniques widely used in CL; (2) help them understand in which contexts and for which applications each technique is suitable; (3) help them understand the experimental pipeline needed to apply a technique to a particular problem, including possible data collection and the choice of evaluation method; (4) give them some practice in running Machine Learning software and interpreting the output. The syllabus covers Machine Learning methods from both a theoretical and a practical point of view, and gives students the tools to read the relevant scientific literature with a critical mind.

At the end of the course, students will: (1) demonstrate knowledge of the principles of core Machine Learning techniques; (2) be able to read and understand CL literature using the introduced techniques, and critically assess their use in research and applications; (3) have some fundamental computational skills allowing them to run existing Machine Learning software and interpret its output.

### Pre-requisites

There are no prerequisites for this course. Students with no computational background will acquire good intuitions for a range of Machine Learning techniques, as well as basic practical skills to interact with existing software. Students with a good mathematical and computational background (including programming and familiarity with the Unix command line) will be invited to gain a deeper understanding of each introduced algorithm, and to try out their own modifications of the course software.

### Methods

The whole course will introduce ten topics, including a general introduction in the first week. Each topic will be taught over three sessions including 1) a lecture explaining the theory behind a technique/algorithm; 2) a lecture discussing a scientific paper where the technique is put to work, demonstrating the experimental pipeline around the algorithm; 3) a hands-on session where the students will have a chance to run some software to familiarise themselves with the practical implementation of the method.

# Course schedule:

### February 25/26/28 Introduction

**Lecture 1:** General introduction. What is ML? What is it for? How do you do it? Slides

**Lecture 2:** Basic principles of statistical NLP: Language modelling, Naive Bayes, evaluation with Precision/Recall. Slides

**Practical:** Set up access to PythonAnywhere for experimentation. Run a simple authorship attribution algorithm using Naive Bayes. (The code is on GitHub.)

### March 4/5/7 Data preparation techniques

**Lecture 1:** How to choose your data. Annotating. Focus on inter-annotator agreement metrics. Slides

**Practical:** Hands-on intro to crowdsourcing. Perform some annotation and calculate your interannotator agreement. (The code is on GitHub.)

**Scientific reading:** a) Paperno et al (2016) and b) Herbelot & Vecchi (2016). Read the former to understand what good/bad data is and how it affects a task, and the latter to understand issues in annotation.

### March 11/12/14 Supervised learning

**Lecture 1:** Introduction to regression (linear, gradient descent, PLSR) and to the k-nearest neighbours algorithm. Slides

**Practical:** Intro to mapping between semantic spaces for translation. (The code is on GitHub.) For those wanting a simple tutorial on linear regression in Python, check: http://www.dataschool.io/linear-regression-in-python/.

**Scientific reading:** a) Herbelot & Vecchi (2015) on using regression to map between different semantic spaces, and b) Erk & Padó (2010) on k-NN for sense clustering.

### March 18/19/21 Unsupervised learning

**Lecture 1:** Dimensionality reduction and clustering (SVD, LSH, random indexing, K-means). Slides

**Practical:** Implementing the fruit fly for similarity search with random indexing (Dasgupta et al (2017)). Playing with the PeARS search engine. (The code is on GitHub.)

**Scientific reading:** Murphy et al (2012) on the cognitive plausibility of matrix factorisation techniques.

### March 25/26/28 Support Vector Machines

**Lecture 1:** SVM principles. Introduction to kernels. Slides

**Practical:** Classify documents into topics using SVMs. (The code is on GitHub.)

**Scientific reading:** Herbelot & Kochmar (2016) on semantic error detection.

### April 1/2/4 Neural Networks: introduction

**Lecture 1:** Basics of NNs. Slides

**Practical:** Follow tutorial on implementing an NN from scratch: http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/.

**Scientific reading:** Pater (2018) - a historical view on generative linguistics and neural networks. Optional: for those interested, look at Marblestone et al (2016) on what NNs really have to do with neuroscience (this is a long article!)

### April 8/9/11 RNNs and LSTMs

**Lecture 1:** Sequence learning with RNNs and LSTMs. Slides

**Practical:** Generate ASCII cats with an RNN. (The code is on GitHub.)

**Scientific reading:** The most famous word / sentence embedding models: Word2Vec, ELMo, BERT.

### April 15/16/18 Adopt a network week!

**The network Zoo:** A wild race through a few architectures. Slides

Pick a network from the Network Zoo and check whether it has language applications. If so, adopt it!

### April 29/30, May 2 Reinforcement learning

**Lecture 1:** Principles of RL. Slides

**Practical:** Solving the OpenAI gym ‘frozen lake’ puzzle: https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb. And ordering a coffee at Rovereto train station. (The coffee code is on GitHub.)

**Scientific reading:** Lazaridou et al (2017) on multi-agent emergence of natural language.

### May 6/7/9 The ethics of machine learning

**Lecture 1:** Ethical issues with ML. Bias in distributional vectors. Slides

**Practical:** Finding indirect gender biases in FastText vectors. (This exercise is open-ended, but there is some data and code on GitHub.)

**Scientific reading:** Bolukbasi et al (2016) on debiasing vectors.

### May 13/14 Revision sessions

# Exam information: what should you know?

**NB: one of the exam questions will focus on describing a neural architecture of your choice. See the slides for Week 8 for suggestions.**

### Introductory material

- Describe an ML pipeline, including data production, dataset split, model construction and evaluation.
- What is a language model (LM)? What is the Markov assumption in relation to LMs?
- Explain the use of Naive Bayes for document classification. (Be able to reproduce the example with email classification.)
- What is the difference between precision / recall / F-score / accuracy? What are their advantages / drawbacks?
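
As a revision aid for the last question above, here is a minimal sketch of how the four measures can be computed for a binary classifier. The function name `classification_metrics` and the toy labels are assumptions made for illustration; they are not taken from the course code.

```python
def classification_metrics(gold, predicted, positive=1):
    """Return (precision, recall, F1, accuracy) for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators (no predicted or no gold positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, f1, accuracy


# Hypothetical toy data: 3 positive items, 5 negative items.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1, accuracy = classification_metrics(gold, predicted)
print(precision, recall, f1, accuracy)
```

On this toy data the counts are 2 true positives, 1 false positive, 1 false negative and 4 true negatives. Accuracy comes out higher than precision and recall because the many easy negatives count towards it, which is one of the drawbacks the question asks about.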

### Data preparation techniques

- Explain K-fold cross validation and leave-one-out in relation to evaluation.
- Explain the curse of dimensionality and overfitting.
- Explain all interannotator agreement measures we have seen in class, together with their drawbacks.
- Be able to calculate kappa for a simple example.
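
To practise the kappa calculation, the sketch below assumes that "kappa" means Cohen's kappa for two annotators (the list above does not say which variant is meant). The function name `cohen_kappa` and the annotations are hypothetical.

```python
from collections import Counter


def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(ann_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected (chance) agreement from each annotator's label distribution.
    count_a, count_b = Counter(ann_a), Counter(ann_b)
    labels = set(ann_a) | set(ann_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    # Undefined when expected == 1 (both annotators always use one label).
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for 10 items.
ann_a = ["y", "y", "y", "y", "y", "y", "n", "n", "n", "n"]
ann_b = ["y", "y", "y", "y", "y", "n", "n", "n", "n", "y"]
kappa = cohen_kappa(ann_a, ann_b)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 items (observed agreement 0.8), while chance agreement is 0.6 × 0.6 + 0.4 × 0.4 = 0.52, so kappa is 0.28 / 0.48.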

### Supervised learning

- What is supervised learning?
- Intuition of linear regression (be able to reproduce the reading speed example).
- Explain gradient descent.
- What is PCA? What is its relation to PLSR? Explain the basic intuition behind PLSR.
- Explain how you might use PLSR to translate one semantic space into another.
- Explain the k-NN algorithm.
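
For the gradient descent item above, the following is a minimal sketch of batch gradient descent fitting a line `y = w * x + b` under a mean squared error loss. The data, learning rate and number of steps are illustrative assumptions; this is not the reading speed example from the lecture.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training point under current w, b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

With a much larger learning rate the updates would overshoot and diverge on this data, which is a useful thing to try when explaining the role of the step size.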

### Unsupervised learning

- What is unsupervised learning?
- What is the relation of SVD to PCA? What matrices are produced by SVD and what do they represent?
- What is the difference between flat and hierarchical clustering? Between soft and hard clustering?
- Explain hierarchical clustering. Discuss the effect of similarity functions on clustering quality.
- Explain K-means clustering.
- What is purity, in terms of clustering evaluation?
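
As a sketch for the purity question, the code below assigns each cluster its majority gold label and reports the proportion of items covered by those majorities. The function name `purity` and the toy clustering are assumptions for illustration.

```python
from collections import Counter


def purity(cluster_ids, gold_labels):
    """Proportion of items that carry the majority label of their cluster."""
    members = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        members.setdefault(cluster, []).append(label)
    # For each cluster, count the items bearing its most frequent gold label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(gold_labels)


# Hypothetical clustering of 8 items into 3 clusters.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
gold_labels = ["x", "x", "o", "o", "o", "d", "d", "d"]
score = purity(cluster_ids, gold_labels)
print(score)
```

Each of the three clusters contributes 2 majority items, giving 6 / 8. Note that putting every item in its own cluster would give a purity of 1, so purity alone does not penalise having too many clusters.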

### Support Vector Machines

- Give the intuition behind SVMs (hyperplane separator, margins, etc). What are support vectors?
- What is the optimisation problem you have to solve to find the optimal separating hyperplane?
- Explain the trade-off between margin size and error, and give a way to solve the issue. What is parameter C?
- What is the kernel trick? Why is it useful? What kind of kernels are available?

### Neural Networks: introduction

- Describe the main advantage of neural networks in terms of feature learning.
- Describe the typical components of a neural net in terms of layers. Explain the function of each layer.
- Describe the components of a single artificial neuron. Be able to explain how a single neuron can act as a classifier.
- Explain the difference between softmax and sigmoid output functions.
- Explain the standard NN learning algorithm (forward propagation, objective function, backpropagation with gradient descent). Be able to calculate forward propagation for a toy example.
- Describe the activation functions we learnt (step, linear, sigmoid, tanh, ReLU), together with their advantages / drawbacks.
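
To practise forward propagation on a toy example, here is a minimal sketch of a network with two inputs, two sigmoid hidden units and one sigmoid output unit. All weights, biases and function names are hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def forward(inputs, hidden_layer, output_unit):
    """Forward pass: inputs -> hidden activations -> single output."""
    hidden = [neuron(inputs, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_unit
    return neuron(hidden, out_weights, out_bias)


# Hypothetical parameters: each unit is a (weights, bias) pair.
hidden_layer = [([0.5, -0.5], 0.0), ([-1.0, 1.0], 0.5)]
output_unit = ([1.0, -1.0], 0.0)

output = forward([1.0, 0.0], hidden_layer, output_unit)
predicted_class = 1 if output > 0.5 else 0  # thresholding turns it into a classifier
print(output, predicted_class)
```

For the input `[1.0, 0.0]`, the hidden units receive weighted sums of 0.5 and -0.5, and the output unit applies a sigmoid to the difference of the two hidden activations.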

### RNNs and LSTMs

- Explain the point of recurrent architectures.
- Explain the principle behind RNNs (unrolling, step function). Be able to draw the basic architecture of a language model using an RNN (as shown on the slides).
- Be able to illustrate the problem RNNs have with long-distance dependencies, and to explain it in terms of vanishing gradients.
- Explain the architecture of an LSTM.

### Reinforcement learning

- Explain the difference between reinforcement learning and the other types of learning we have covered.
- What is a Markov chain? (Be able to draw a simple example!)
- What is a Markov decision process? (Be able to draw one too, and explain its components.)
- Explain the concepts of policy, cumulative expected reward, value function, and discount.
- How does reinforcement learning expand MDPs?
- What is model-based RL? Direct evaluation? Temporal Difference Learning? (Be able to calculate simple examples, as in the slides).
- Describe the Q-learning algorithm.
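
As a revision aid for the last item, the sketch below runs tabular Q-learning with an epsilon-greedy policy. The environment is a hypothetical four-state corridor invented for this example (it is not the frozen lake or coffee exercise from the practical), and the hyperparameter values are assumptions.

```python
import random

N_STATES = 4      # states 0..3; state 3 is the goal (hypothetical corridor)
ACTIONS = (0, 1)  # 0 = step left, 1 = step right


def step(state, action):
    """Deterministic transition; reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next-state value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

The update uses the maximum over next-state actions regardless of which action the agent takes next, which is what makes Q-learning an off-policy method.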

### Ethics

- Why do ML models tend to reproduce majority opinion?
- Why can regression *emphasise* biases in the data?
- Give arguments for and against debiasing an algorithm vs debiasing the data itself.