# Teaching

# Machine learning for NLP (2018)

Any questions? Contact me at aurelie DOT herbelot AT unitn DOT it.

### Description

The course introduces core Machine Learning algorithms for computational linguistics (CL). Its goals are to (1) give students an overview of core Machine Learning techniques widely used in CL; (2) explain in which contexts and for which applications each technique is suitable; (3) explain the experimental pipeline needed to apply a technique to a particular problem, including possible data collection and the choice of evaluation method; (4) give students some practice in running Machine Learning software and interpreting its output. The syllabus covers Machine Learning methods from both a theoretical and a practical point of view, and gives students the tools to read the relevant scientific literature with a critical mind.

At the end of the course, students will: (1) demonstrate knowledge of the principles of core Machine Learning techniques; (2) be able to read and understand CL literature using the introduced techniques, and critically assess their use in research and applications; (3) have some fundamental computational skills allowing them to run existing Machine Learning software and interpret its output.

### Pre-requisites

There are no prerequisites for this course. Students with no computational background will acquire good intuitions for a range of Machine Learning techniques, as well as basic practical skills to interact with existing software. Students with a good mathematical and computational background (including programming and familiarity with the Unix command line) will be invited to gain a deeper understanding of each introduced algorithm, and to try out their own modifications of the course software.

### Methods

The course covers ten topics, including a general introduction in the first week. Each topic is taught over three sessions: 1) a lecture explaining the theory behind a technique/algorithm; 2) a lecture discussing a scientific paper where the technique is put to work, demonstrating the experimental pipeline around the algorithm; 3) a hands-on session where students run some software to familiarise themselves with the practical implementation of the method.

# Course schedule:

### March 15/19/20: Introduction

**Lecture 1:** General introduction. What is ML? What is it for? What has it got to do with AI? Slides

**Lecture 2:** Basic principles of statistical NLP: Naive Bayes, Maximum Likelihood, Precision/Recall. Slides

**Practical:** Set up computers and/or access to development server for experimentation. Run a simple authorship attribution algorithm using Naive Bayes. (The code is on GitHub under the *Authorship* directory.)
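
To give a feel for what a Naive Bayes classifier does before running the course software, here is a minimal sketch of multinomial Naive Bayes with add-one smoothing. It is not the code from the *Authorship* directory: the function names, the author labels and the toy sentences are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (label, text) pairs. Returns the counts needed for classification."""
    class_counts = Counter()             # number of training documents per author
    word_counts = defaultdict(Counter)   # word frequencies per author
    vocabulary = set()
    for label, text in documents:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def classify(text, class_counts, word_counts, vocabulary):
    """Return the label with the highest log posterior (up to a constant)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue   # skip words never seen in training
            count = word_counts[label][token]
            # add-one (Laplace) smoothing avoids zero probabilities
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus with two invented authors.
training_data = [
    ("author_a", "the sea the ship the storm"),
    ("author_a", "the ship sailed the sea"),
    ("author_b", "love and marriage and fortune"),
    ("author_b", "a fortune and a marriage"),
]
model = train(training_data)
print(classify("the storm and the sea", *model))   # author_a
print(classify("a marriage and love", *model))     # author_b
```

Working in log space keeps the product of many small probabilities from underflowing, which matters as soon as documents are longer than a few words.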

### March 22/26/27 Data preparation techniques

**Lecture 1:** How to choose your data. Annotating. Focus on inter-annotator agreement metrics. Slides

**Lecture 2:** a) Shekhar et al (2017) and b) Herbelot & Vecchi (2016). The former to understand what good/bad data is and how it affects a task. The latter to understand issues in annotation. Slides

**Practical:** Hands-on intro to crowdsourcing. Perform some annotation and calculate your inter-annotator agreement. (The code is on GitHub under the *Agreement* directory.)

### March 29, April 3/5 Supervised learning

**Lecture 1:** Introduction to regression and to the k-nearest neighbours algorithm. Slides

**Lecture 2:** Herbelot & Vecchi (2015) on using regression to map between different semantic spaces. Slides

**Practical:** Intro to mapping between semantic spaces for translation. (The code is on GitHub under the *Translation* directory.) For those wanting a simple tutorial on linear regression in Python, check: http://www.dataschool.io/linear-regression-in-python/.

### April 9/10/12 Unsupervised learning

**Lecture 1:** Clustering and dimensionality reduction. Slides

**Lecture 2:** Dasgupta et al (2017) on doing locality-sensitive hashing by imitating the fruit fly! Slides

**Practical:** Implementing the fruit fly for similarity search with random indexing. Playing with the PeARS search engine. (The code is on GitHub under the *FruitFly* directory.)

### April 16/17/19 Support Vector Machines

**Lecture 1:** SVM principles. Introduction to kernels. Slides

**Lecture 2:** Herbelot & Kochmar (2016) on semantic error detection. Slides

**Practical:** Classify documents into topics using SVMs. (The code is on GitHub under the *SVM* directory.)

### April 23/24/26 Neural Networks: introduction

**Lecture 1:** Basics of NNs. Slides

**Lecture 2:** Marblestone et al (2016) - what NNs really have to do with neuroscience. Slides

**Practical:** Follow tutorial on implementing an NN from scratch: http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/.

### May 7/8/10 RNNs and LSTMs

**Lecture 1:** Sequence learning with RNNs and LSTMs. Slides

**Lecture 2:** RNN literature review. Student-selected papers.

**Practical:** Generate ASCII cats with an RNN. (The code is on GitHub under the *RNNCats* directory.)

### May 14/15/17 Reinforcement learning

**Lecture 1:** Principles of RL. Slides

**Lecture 2:** Lazaridou et al (2017) on multi-agent emergence of natural language. Slides

**Practical:** Solving the OpenAI gym ‘frozen lake’ puzzle: https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb.

### May 21/22/24 ML and small data

**Lecture 1:** Dealing with no or small data. Issues and solutions. Examples from low-resource languages. Slides

**Lecture 2:** Herbelot & Baroni (2017) on building vectors from tiny data. (Opportunity to review Word2Vec.) Slides

**Practical:** Play with the Herbelot & Baroni implementation of the addition model. Use it on new ‘tiny’ data (one sentence only!).

### May 28/29 The ethics of machine learning

**Lecture 1:** Ethical issues with ML. Bias in distributional vectors. Slides

**Lecture 2:** Bolukbasi et al (2016) on debiasing vectors.

### May 30/31 Revision sessions

# Exam information: what should you know?

**NB:** if you don’t have a background in maths / engineering, don’t try to memorise all the equations! Just understand what they do and be able to explain it in words.

### Introductory material

- Describe an NLP pipeline, including data production, dataset split, model construction and evaluation.
- What is a language model (LM)? What is the Markov assumption in relation to LMs?
- Explain the use of Naive Bayes for document classification. (Be able to reproduce the example with email classification.)
- What is the difference between precision / recall / F-score / accuracy? What are their advantages / drawbacks?
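
The difference between the four evaluation measures is easiest to see on imbalanced data. The sketch below computes them from scratch for a binary task; the function name and the toy labels are illustrative assumptions, not course material.

```python
def classification_metrics(gold, predicted, positive=1):
    """Return precision, recall, F1 and accuracy for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical imbalanced data: only 2 positive items out of 10.
gold      = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(gold, predicted)
print(metrics)   # precision 1.0, recall 0.5, F1 about 0.67, accuracy 0.9
```

Accuracy looks high (0.9) even though half of the positive items were missed, which is why recall and F-score are usually reported alongside it when classes are imbalanced.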

### Data preparation techniques

- Explain K-fold cross validation and leave-one-out in relation to evaluation.
- Explain the curse of dimensionality and overfitting.
- Explain all inter-annotator agreement measures we have seen in class, together with their drawbacks.
- Be able to calculate kappa for a simple example.
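
As a way to check a hand calculation, here is a minimal sketch of kappa, assuming the two-annotator version (Cohen's kappa): observed agreement corrected by the agreement expected by chance. The annotations are invented for the example.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same category independently.
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(counts_a) | set(counts_b))
    # Undefined (division by zero) if expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of 10 items by two annotators.
annotator_a = ["yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no"]
annotator_b = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))   # 0.4
```

Here the annotators agree on 7 items out of 10 (0.7), chance agreement is 0.6 × 0.5 + 0.4 × 0.5 = 0.5, so kappa is (0.7 − 0.5) / (1 − 0.5) = 0.4.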

### Supervised learning

- What is supervised learning?
- Intuition of linear regression (be able to reproduce the reading speed example).
- Explain gradient descent.
- What is PCA? How is it used in PLSR? Explain the basic intuition behind PLSR.
- Explain how you might use PLSR to translate one semantic space into another.
- Explain the k-NN algorithm.
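
Gradient descent for linear regression can be sketched in a few lines. The example below fits a line by repeatedly moving the slope and intercept against the gradient of the mean squared error. The data points are hypothetical (generated from y = 2x + 1), not the reading speed example from the lectures, and the learning rate and number of epochs are arbitrary choices that happen to work for this data.

```python
def fit_line(xs, ys, learning_rate=0.01, epochs=5000):
    """Fit y = slope * x + intercept by batch gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training point with the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        # Step in the direction that reduces the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Hypothetical noise-free data from the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))   # 2.0 1.0
```

With a learning rate that is too large the updates overshoot and diverge; with one that is too small, many more epochs are needed.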

### Unsupervised learning

- What is unsupervised learning?
- What is the relation of SVD to PCA? What matrices are produced by SVD and what do they represent?
- What is the difference between flat and hierarchical clustering? Between soft and hard clustering?
- Explain hierarchical clustering. Discuss the effect of similarity functions on clustering quality.
- Explain K-means clustering.
- What is purity, in terms of clustering evaluation?
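
Purity can be computed by assigning each cluster its most frequent gold label and counting how many items that gets right. A minimal sketch, with invented cluster assignments and labels:

```python
from collections import Counter, defaultdict

def purity(cluster_ids, gold_labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, gold_labels):
        members[cluster].append(label)
    # For each cluster, count the items belonging to its most common gold class.
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in members.values())
    return majority_total / len(gold_labels)

# Hypothetical clustering of 8 items into 3 clusters.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
gold_labels = ["x", "x", "o", "o", "o", "x", "d", "d"]
score = purity(cluster_ids, gold_labels)
print(score)   # 0.75
```

Note that putting every item in its own cluster gives a purity of 1.0, so purity is only meaningful when the number of clusters is taken into account.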

### Support Vector Machines

- Give the intuition behind SVMs (hyperplane separator, margins, etc). What are support vectors?
- What is the optimisation problem you have to solve to find the optimal separating hyperplane?
- Explain the trade-off between margin size and error, and give a way to solve the issue. What is parameter C?
- What is the kernel trick? Why is it useful? What kind of kernels are available?
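
The kernel trick can be illustrated with the homogeneous degree-2 polynomial kernel, chosen here as one convenient example. Computing the kernel on the original 2-D vectors gives the same number as mapping both vectors into a 3-D feature space and taking the dot product there, but without ever building the higher-dimensional vectors. The function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z) squared."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D inputs that corresponds to poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly2_kernel(x, z)                            # computed in 2-D
explicit_value = dot(poly2_features(x), poly2_features(z))   # computed in 3-D
print(kernel_value, round(explicit_value, 6))   # 121.0 121.0
```

For higher-degree or RBF kernels the explicit feature space becomes very large or infinite, which is where the saving becomes essential.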

### Neural Networks: introduction

- Describe the main advantage of neural networks in terms of feature learning.
- Describe the typical components of a neural net in terms of layers. Explain the function of each layer.
- Describe the components of a single artificial neuron. Be able to explain how a single neuron can act as a classifier.
- Explain the difference between softmax and sigmoid output functions.
- Explain the standard NN learning algorithm (forward propagation, objective function, backpropagation with gradient descent). Be able to calculate forward propagation for a toy example.
- Describe the activation functions we learnt (step, linear, sigmoid, tanh, ReLU), together with their advantages / drawbacks.
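
Forward propagation for a toy network can be traced by hand and checked against a few lines of code. The sketch below uses a hypothetical network with two inputs, two sigmoid hidden units and one sigmoid output unit; all weights and biases are made up. A softmax function is included for comparison with the sigmoid output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus bias, passed through an activation function."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_unit):
    """Forward propagation through one hidden layer and one output unit."""
    hidden = [neuron(inputs, weights, bias) for weights, bias in hidden_layer]
    weights, bias = output_unit
    return neuron(hidden, weights, bias)

# Hypothetical weights: each unit is a (weights, bias) pair.
hidden_layer = [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)]
output_unit = ([1.0, -1.0], 0.0)

prediction = forward([1.0, 1.0], hidden_layer, output_unit)
label = 1 if prediction >= 0.5 else 0   # thresholding makes the net a binary classifier
print(round(prediction, 4), label)      # 0.4425 0
print(softmax([2.0, 1.0, 0.1]))         # three probabilities summing to 1
```

For input (1, 1) the hidden activations are sigmoid(0) = 0.5 and sigmoid(1) ≈ 0.731, so the output is sigmoid(0.5 − 0.731) ≈ 0.44.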

### RNNs and LSTMs

- Explain the point of recurrent architectures.
- Explain the principle behind RNNs (unrolling, step function). Be able to draw the basic architecture of a language model using an RNN (as shown on the slides).
- Be able to illustrate the problem RNNs have with long-distance dependencies, and to explain it in terms of vanishing gradients.
- Explain the architecture of an LSTM.
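
Unrolling and vanishing gradients can both be seen in a deliberately tiny RNN. The sketch below assumes a single scalar hidden state with a tanh step function and made-up weights; real RNNs use vectors and weight matrices, but the mechanism is the same. Alongside the hidden states it tracks the derivative of the last state with respect to the initial state, which is a product of one factor per time step.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One application of the step function: new state from input and previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def unroll(inputs, w_x=0.5, w_h=0.5, b=0.0, h0=0.0):
    """Apply the same step at every time step; also track d h_T / d h_0."""
    states, grad = [], 1.0
    h = h0
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2) for a tanh unit.
        grad *= w_h * (1.0 - h * h)
        states.append(h)
    return states, grad

# Hypothetical input sequence of 20 identical values.
states, grad = unroll([1.0] * 20)
print(len(states), grad)   # 20 states; the gradient is vanishingly small
```

Every factor in the product is at most 0.5 here, so after 20 steps the gradient is below 0.5 to the power 20, about one in a million: the first input has almost no influence on learning at the last step. This is the problem LSTM gating is designed to mitigate.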

### Reinforcement learning

- Explain the difference between reinforcement learning and the other types of learning we have covered.
- What is a Markov chain? (Be able to draw a simple example!)
- What is a Markov decision process? (Be able to draw one too, and explain its components.)
- Explain the concepts of policy, cumulative expected reward, value function, and discount.
- How does reinforcement learning extend MDPs?
- What is model-based RL? Direct evaluation? Temporal Difference Learning? (Be able to calculate simple examples, as in the slides).
- Describe the Q-learning algorithm.
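
Tabular Q-learning can be sketched on a problem much smaller than Frozen Lake. The example below assumes a hypothetical corridor of four states where the agent starts at state 0 and receives a reward of 1 on reaching state 3. The environment, the hyperparameters and the random seed are all illustrative choices.

```python
import random

N_STATES, GOAL = 4, 3    # corridor with states 0..3; state 3 is terminal
LEFT, RIGHT = 0, 1
ACTIONS = (LEFT, RIGHT)

def step(state, action):
    """Deterministic transition; reward 1 only when the goal is reached."""
    next_state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: one row per state
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward = step(state, action)
            # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)   # [1, 1, 1]: always move right
```

The learned values for moving right approach 0.81, 0.9 and 1.0 for states 0, 1 and 2, showing how the discount makes rewards that are further away worth less.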

### ML and small data

- Explain the issues in collecting data for low-resource languages.
- Describe an algorithm for language classification in a multilingual setup (the Linguini system).
- What is the intuition behind using projections for dealing with low-resource languages?
- Explain morphological analysis induction with projections.