class: center, middle, inverse

# Deviation Inequalities in Machine Learning
## Jacob Abernethy - Georgia Tech
### Herman Chernoff Symposium
#### April 27, 2018

$$
\def\DD{\mathbf{D}}
\def\EE{\mathop{\mathbb{E}}}
\def\argmin{\mathop{\arg\min}}
\def\argmax{\mathop{\arg\max}}
\def\K{\mathcal{K}}
\def\reals{\mathbb{R}}
\def\reg{\mathcal{R}}
\def\areg{\overline{\reg}}
$$

---
class: top

## Machine Learning

- How can we use computers to make helpful predictions?
- How can data drive automation?

---
class: top

## Object recognition

- In the **imagenet** dataset, we are given *labelled* images:

Can we "train" an algorithm to detect these objects without human intervention?

---
class: top

## We have made **huge** strides in ML in recent years
---
class: middle

## Object detection for autonomous vehicles

---
class: middle

## Making Nicolas Cage appear everywhere

---
class: middle

## Can use learning to generate **new** data!
---
class: top

## Why does Machine Learning *work*?

- More specifically, when does our estimation *generalize*?
- Key components: a typical supervised learning problem is characterized by
  + An *unknown* distribution $\mathcal{D}$ on $\mathcal{X} \times \\{-1,1\\}$
  + A "training" dataset $(x\_1, y\_1), \ldots, (x\_n, y\_n)$ sampled IID from $\mathcal{D}$
  + $x\_i \in \mathcal{X} \subset \reals^d$ is the *observation*
  + $y\_i \in \\{-1,1\\}$ is the *label*, the thing we want to predict
  + A learning algorithm selects a function from a *hypothesis class* $\mathcal{H} := \\{f_\theta : \theta \in \Theta \\}$
  + a hypothesis $f\_\theta : \mathcal{X} \to \\{-1,1\\}$ *predicts* that $y$ is $f\_\theta(x)$ on input $x$
  + A loss function $\ell(y, \hat y)$ specifies how "wrong" the prediction $\hat y$ is relative to the true $y$

---
class: top

## Machine Learning: Just fit the data

- The most basic foundational tool in ML/Statistics is... fit the data!
- **Algorithm** (Empirical Risk Minimization):
  + compute an (almost) optimal $\theta^*$ that minimizes the "training loss":
$$\min_{\theta \in \Theta} \frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i)$$
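To make the recipe concrete, here is a minimal sketch of ERM over a small finite class of threshold classifiers; the synthetic distribution, the class of thresholds, and the function names are all illustrative assumptions, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: x ~ Uniform[-1, 1], y = sign(x - 0.3), with 10% label noise.
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.where(x > 0.3, 1, -1)
y[rng.random(n) < 0.1] *= -1

# Finite hypothesis class: threshold classifiers f_theta(x) = sign(x - theta).
thetas = np.linspace(-1, 1, 101)

def zero_one_loss(theta, x, y):
    """Average 0/1 loss of f_theta on the sample (x, y)."""
    return np.mean(np.where(x > theta, 1, -1) != y)

# Empirical Risk Minimization: pick the theta with the smallest training loss.
train_losses = np.array([zero_one_loss(t, x, y) for t in thetas])
theta_star = thetas[np.argmin(train_losses)]
print(f"ERM choice theta* = {theta_star:.2f}, training loss = {train_losses.min():.3f}")
```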
--

**Big Question**: Does $\theta^*$ generalize?

- In other words, if $f\_{\theta^*}$ predicts really well on my *training data*, can I expect it to perform well on *unseen data*?
- The generalization loss of $f\_{\theta^*}$ is
$$\EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta^*}(x),y)]$$
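Since $\mathcal D$ is unknown, this expectation is typically estimated on fresh held-out data. A sketch continuing the illustrative threshold example above (the distribution in `sample` and the value of `theta_star` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, rng):
    """Illustrative distribution D: x ~ Uniform[-1, 1], y = sign(x - 0.3), 10% label noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.where(x > 0.3, 1, -1)
    y[rng.random(n) < 0.1] *= -1
    return x, y

def zero_one_loss(theta, x, y):
    return np.mean(np.where(x > theta, 1, -1) != y)

theta_star = 0.28                      # suppose ERM returned this threshold
x_new, y_new = sample(100_000, rng)    # a large fresh sample stands in for E_{(x,y)~D}
print(f"estimated generalization loss: {zero_one_loss(theta_star, x_new, y_new):.3f}")
```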
---
class: top

## Typical Problem in ML/Statistics: OVERFITTING

---
class: top

## Learning curves in practice

---
class: top

## What does overfitting look like?
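As a numerical illustration (not part of the original slides, which showed a figure here): fitting polynomials of increasing degree to a tiny noisy sample drives the training error down, while the held-out error typically grows for the highest degrees. All choices below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Small noisy regression problem (purely illustrative)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_train, y_train = sample(20)        # small training set
x_test, y_test = sample(10_000)      # large held-out set

for degree in (1, 3, 6, 12):
    coeffs = np.polyfit(x_train, y_train, degree)     # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```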
---
class: middle

# CHERNOFF BOUNDS
## The fundamental theorem of machine learning
### (In a sense)

---
class: top

**Important Note**: despite the namesake, it is apparently another Herman (Rubin) who deserves credit for this idea!!
---
class: top

## Chernoff Bounding Technique

Let $X\_1,X\_2,\ldots,X\_n$ be independent random variables such that $-1 \le X\_i\le 1$ and $\mathbf{E}[X\_i]=0$ for all $i=1,2,\ldots,n.$ Then, for all $t>0$,
$$\mathbf{Pr}\left[\sum_{i=1}^nX_i\geq t\right]\leq\exp{\left(-\frac{t^2}{2n}\right)}.$$

Equivalently: for any sequence of bounded IID random variables $Z_1, \ldots, Z_n$ with mean $\mu$, we have with prob. at least $1 - \delta$ that
$$\left|\frac 1 n \sum_{i=1}^n Z_i - \mu \right| \leq \sqrt{\frac{C \log (1/\delta)}{n}}$$
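A quick numerical sanity check (an illustrative sketch; the parameter choices are arbitrary): compare the empirical tail probability for a sum of centered $\pm 1$ coin flips against the $\exp(-t^2/2n)$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)

n, t = 1000, 60
num_trials = 200_000

# A sum of n IID Rademacher (+/-1) variables equals 2 * Binomial(n, 1/2) - n.
sums = 2 * rng.binomial(n, 0.5, size=num_trials) - n

empirical = np.mean(sums >= t)
bound = np.exp(-t**2 / (2 * n))
print(f"empirical Pr[sum >= {t}] = {empirical:.4f}")   # roughly 0.03 here
print(f"Chernoff/Hoeffding bound = {bound:.4f}")       # about 0.17
```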
This inequality is often referred to as the Hoeffding or Azuma inequality.

---
class: top

## Proof

**FACT**: For a random variable $-1\le X\le 1$ such that $\mathbf{E}[X]=0,$ we have $\mathbf{E}[\exp{\left(\lambda X\right)}]\leq \exp{\left(\lambda^2/2\right)}.$
\begin{align*} \mathbf{Pr}\left[\sum_{i=1}^nX_i\geq t\right]&=\mathbf{Pr}\left[\exp{\left(\lambda\sum_{i=1}^nX_i\right)}\geq \exp{\left(\lambda t\right)}\right]\\ &\leq \frac{\mathbf{E}\left[\exp{\left(\lambda\sum_{i=1}^nX_i\right)}\right]}{\exp{\left(\lambda t\right)}}\\ &\textstyle \le \exp{\left(-\lambda t\right)}\cdot\mathbf{E}\left[\prod_{i=1}^n\exp{\left(\lambda X_i\right)}\right]\\ &= \exp{\left(-\lambda t\right)}\cdot\prod_{i=1}^n\mathbf{E}\left[\exp{\left(\lambda X_i\right)}\right]\\ & \leq \exp{\left(-\lambda t\right)} \prod_{i=1}^n \exp(\lambda^2/2). \end{align*}
Plug in the optimal $\lambda = t/n$ and you get the result:
$$\exp\left(-\lambda t + \frac{n\lambda^2}{2}\right)\Bigg|_{\lambda = t/n} = \exp\left(-\frac{t^2}{n} + \frac{t^2}{2n}\right) = \exp\left(-\frac{t^2}{2n}\right)$$

---
class: top

## First Attempt: Chernoff/Hoeffding to control deviation

**Observation**: say we use the optimally chosen $\theta$ from our algorithm

- Let's set $Z\_i = \ell(f\_\theta(x\_i), y\_i)$
- Then the mean $\mu$ of $Z\_i$ is $\EE\_{(x,y)\sim \mathcal D}[\ell(f\_{\theta}(x),y)]$
- We can use Chernoff/Hoeffding to bound the deviation between training and testing error! With probability at least $1-\delta$, we have
$$\frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \leq \sqrt{\frac{C \log (1/\delta)}{n}}$$
--

**WRONG!!!!!**

- We used the data to pick $\theta$! The $Z_i$'s are no longer independent!!

---
class: top

## Second Attempt: finite class, use a union bound!

- Assume the hypothesis class $\mathcal{H} := \\{f_\theta : \theta \in \Theta \\}$ is *finite*
- We can now throw in a union bound:
\begin{eqnarray*} & & \mathbf{Pr}\left[ \exists \theta \in \Theta : \frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \geq t \right] \\ & \leq & \sum_{\theta \in \Theta} \mathbf{Pr}\left[ \frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \geq t \right] \\ & \leq & |\Theta| \exp{\left(-\frac{t^2}{2n}\right)} \end{eqnarray*}
--

With prob. $\geq 1-\delta$, we have, simultaneously for all $\theta \in \Theta$:
$$\frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \leq \sqrt{\frac{C \log (|\Theta|/\delta)}{n}}$$
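To see both the selection effect and the $\log|\Theta|$ price numerically, here is an illustrative simulation (the class of "random guess" classifiers and the coin-flip labels are assumptions, not from the slides): every hypothesis has true error exactly $1/2$, yet the best *training* error over the class looks much better, by roughly the $\sqrt{\log|\Theta|/(2n)}$ amount the union bound predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200            # training-set size
num_hyp = 10_000   # |Theta|: a finite class of "random guess" classifiers

# Labels are pure coin flips, so EVERY hypothesis has true error exactly 1/2.
y = rng.choice([-1, 1], size=n)

# Each hypothesis predicts independently at random; its training error is
# the fraction of disagreements with y.
preds = rng.choice([-1, 1], size=(num_hyp, n))
train_errors = np.mean(preds != y, axis=1)

best = train_errors.min()
deviation_scale = np.sqrt(np.log(num_hyp) / (2 * n))
print(f"best training error over {num_hyp} hypotheses: {best:.3f} (true error is 0.500)")
print(f"union-bound deviation scale sqrt(log|Theta|/2n): {deviation_scale:.3f}")
```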
---
class: top

## Generalization Bound for Infinite Classes?

- This is where things get **deep**! It relies on many ideas developed by Vapnik, Chervonenkis, and others, around the 1970s.

--

- We say that a binary function class $\mathcal H$ *shatters* $x\_1, \ldots, x\_d$ if $|\{(f(x\_1), \ldots, f(x\_d)) : f \in \mathcal H \}| = 2^d$; i.e., $\mathcal H$ can produce all $2^d$ labellings!
- **VC Dimension**: The Vapnik-Chervonenkis dimension of $\mathcal H$ is the size of the largest set *shattered* by $\mathcal H$.

--

The VC-dim of linear classifiers in $\reals^2$ is 3!
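A brute-force illustration of shattering (a sketch under the assumption that "linear" means affine classifiers $\mathrm{sign}(w \cdot x + b)$; the direction-grid separability test is approximate, not an exact algorithm):

```python
import numpy as np
from itertools import product

def separable(points, labels, num_angles=2000):
    """Approximate check: does some affine classifier sign(w.x + b) realize `labels`?
    In 2D it suffices to find a direction whose projections separate the two classes."""
    labels = np.asarray(labels)
    if np.all(labels == labels[0]):
        return True                                   # constant labelings are trivial
    for a in np.linspace(0.0, np.pi, num_angles):
        proj = points @ np.array([np.cos(a), np.sin(a)])
        for s in (labels, -labels):                   # try both orientations of the direction
            if proj[s == 1].min() > proj[s == -1].max():
                return True
    return False

def shattered(points):
    """True if affine classifiers realize all 2^d labelings of these d points."""
    return all(separable(points, lab) for lab in product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # XOR configuration
print("3 points shattered:", shattered(three))   # expect True
print("4 points shattered:", shattered(four))    # expect False: the XOR labeling fails
```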
---
class: top

## Core Result of Statistical Learning Theory

Up to constant factors, the expected worst-case gap between training and true loss satisfies:

\begin{eqnarray*}
& & \EE \left[ \sup_{f \in \mathcal H} \left( \frac 1 n \sum_{i=1}^n \ell(f(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f(x),y)] \right) \right]\\
& \leq & \EE_{\sigma_{1:n}\sim \{-1,1\}} \left[ \sup_{f \in \mathcal H} \frac 1 n \sum_{i=1}^n \sigma_i f(x_i) \right] \quad \quad \quad \quad \quad (\text{Rademacher Complexity})\\
& \leq & \sqrt{\frac{\log |\{(f(x_1), \ldots, f(x_n)) : f \in \mathcal H \}|}{n}} \quad \quad \text{(Massart/Chernoff Lemma)} \\
& \leq & \sqrt{\frac{\text{VC-dim}(\mathcal H) \log n }{n}}\quad \quad \quad \quad \quad \quad \quad \quad \quad \text{(Sauer/Shelah Lemma)}
\end{eqnarray*}
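The Rademacher complexity in the first inequality can be estimated by Monte Carlo over the random signs $\sigma_i$; here is a sketch for the illustrative finite class of threshold classifiers used earlier (the sample, the class, and the comparison quantity are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
x = rng.uniform(-1, 1, size=n)                       # a fixed sample
thetas = np.linspace(-1, 1, 101)                     # finite class of threshold classifiers

# predictions[j, i] = f_{theta_j}(x_i) in {-1, +1}
predictions = np.where(x[None, :] > thetas[:, None], 1, -1)

# Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ]
num_draws = 5000
sigmas = rng.choice([-1, 1], size=(num_draws, n))
sup_corr = (sigmas @ predictions.T / n).max(axis=1)  # sup over the class, per sigma draw
print(f"estimated Rademacher complexity: {sup_corr.mean():.3f}")
print(f"sqrt(log|H| / n) for comparison: {np.sqrt(np.log(len(thetas)) / n):.3f}")
```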
**What this means:** The amount of data you need grows roughly in proportion to the VC-dimension: a hypothesis class with $d$ times the VC-dimension needs about $d$ times as much data for the same generalization performance!

---
class: top

## Key Takeaway

- The ability to estimate models from data becomes "harder" if your functions are more "complex"
- Complexity is measured in terms of the Vapnik-Chervonenkis dimension
- **But** the Chernoff bounding technique sits underneath all of these results
- Chernoff bounds arise constantly in all types of computational statistical problems

---
class: middle
$\sim \sim$ FIN $\sim \sim$
.footnote[Slides made entirely with Markdown, Remark.js, MathJax, and animated GIFs]