class: center, middle, inverse

# Deviation Inequalities in Machine Learning
## Jacob Abernethy - Georgia Tech
### Herman Chernoff Symposium
#### April 27, 2018

$$
\def\DD{\mathbf{D}}
\def\EE{\mathop{\mathbb{E}}}
\def\argmin{\mathop{\arg\min}}
\def\argmax{\mathop{\arg\max}}
\def\K{\mathcal{K}}
\def\reals{\mathbb{R}}
\def\reg{\mathcal{R}}
\def\areg{\overline{\reg}}
$$

---
class: top

## Machine Learning

- How can we use computers to make helpful predictions?
- How can data drive automation?

---
class: top

## Object recognition

- In the **imagenet** dataset, we are given *labelled* images:

Can we "train" an algorithm to detect these objects without human intervention?

---
class: top

## We have made **huge** strides in ML in recent years
---
class: middle

## Object detection for autonomous vehicles

---
class: middle

## Making Nicolas Cage appear everywhere

---
class: middle

## Can use learning to generate **new** data!
---
class: top

## Why does Machine Learning *work*?

- More specifically, when does our estimation *generalize*?
- Key components: a typical supervised learning problem is characterized by
  + An *unknown* distribution $\mathcal{D}$ on $\mathcal{X} \times \\{-1,1\\}$
  + A "training" dataset $(x\_1, y\_1), \ldots, (x\_n, y\_n)$ sampled IID from $\mathcal{D}$
  + $x\_i \in \mathcal{X} \subset \reals^d$ is the *observation*
  + $y\_i \in \\{-1,1\\}$ is the *label*, the thing we want to predict
  + A learning algorithm selects a function from a *hypothesis class* $\mathcal{H} := \\{f_\theta : \theta \in \Theta \\}$
  + a hypothesis $f\_\theta : \mathcal{X} \to \\{-1,1\\}$ *predicts* that $y$ is $f\_\theta(x)$ on input $x$
  + A loss function $\ell(y, \hat y)$ specifies how "wrong" the prediction $\hat y$ is relative to the true $y$

---
class: top

## Machine Learning: Just fit the data

- The most basic foundational tool in ML/Statistics is... fit the data!
- **Algorithm** (Empirical Risk Minimization):
  + compute an (almost) optimal $\theta^*$ that minimizes the "training loss":
$$\min_{\theta \in \Theta} \frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i)$$
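To make the recipe concrete, here is a minimal sketch of ERM over a small finite class of threshold classifiers; the synthetic distribution, the class of thresholds, and the function names are all illustrative assumptions, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: x ~ Uniform[-1, 1], y = sign(x - 0.3), with 10% label noise.
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.where(x > 0.3, 1, -1)
y[rng.random(n) < 0.1] *= -1

# Finite hypothesis class: threshold classifiers f_theta(x) = sign(x - theta).
thetas = np.linspace(-1, 1, 101)

def zero_one_loss(theta, x, y):
    """Average 0/1 loss of f_theta on the sample (x, y)."""
    return np.mean(np.where(x > theta, 1, -1) != y)

# Empirical Risk Minimization: pick the theta with the smallest training loss.
train_losses = np.array([zero_one_loss(t, x, y) for t in thetas])
theta_star = thetas[np.argmin(train_losses)]
print(f"ERM choice theta* = {theta_star:.2f}, training loss = {train_losses.min():.3f}")
```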
--

**Big Question**: Does $\theta^*$ generalize?

- In other words, if $f\_{\theta^*}$ predicts really well on my *training data*, can I expect it to perform well on *unseen data*?
- The generalization loss of $f\_{\theta^*}$ is
$$\EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta^*}(x),y)]$$
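Since $\mathcal D$ is unknown, this expectation is typically estimated on fresh held-out data. A sketch continuing the illustrative threshold example above (the distribution in `sample` and the value of `theta_star` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, rng):
    """Illustrative distribution D: x ~ Uniform[-1, 1], y = sign(x - 0.3), 10% label noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.where(x > 0.3, 1, -1)
    y[rng.random(n) < 0.1] *= -1
    return x, y

def zero_one_loss(theta, x, y):
    return np.mean(np.where(x > theta, 1, -1) != y)

theta_star = 0.28                      # suppose ERM returned this threshold
x_new, y_new = sample(100_000, rng)    # a large fresh sample stands in for E_{(x,y)~D}
print(f"estimated generalization loss: {zero_one_loss(theta_star, x_new, y_new):.3f}")
```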
---
class: top

## Typical Problem in ML/Statistics: OVERFITTING

---
class: top

## Learning curves in practice

---
class: top

## What does overfitting look like?
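As a numerical illustration (not part of the original slides, which showed a figure here): fitting polynomials of increasing degree to a tiny noisy sample drives the training error down, while the held-out error typically grows for the highest degrees. All choices below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Small noisy regression problem (purely illustrative)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_train, y_train = sample(20)        # small training set
x_test, y_test = sample(10_000)      # large held-out set

for degree in (1, 3, 6, 12):
    coeffs = np.polyfit(x_train, y_train, degree)     # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```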
---
class: middle

# CHERNOFF BOUNDS
## The fundamental theorem of machine learning
### (In a sense)

---
class: top

**Important Note**: despite the namesake, it is apparently another Herman (Rubin) who deserves credit for this idea!!
---
class: top

## Chernoff Bounding Technique

Let $X\_1,X\_2,\ldots,X\_n$ be independent random variables such that $-1 \le X\_i\le 1$ and $\mathbf{E}[X\_i]=0$ for all $i=1,2,\ldots,n.$ Then, for all $t>0$,
$$\mathbf{Pr}\left[\sum_{i=1}^nX_i\geq t\right]\leq\exp{\left(-\frac{t^2}{2n}\right)}.$$

Equivalently: for any sequence of bounded IID random variables $Z_1, \ldots, Z_n$ with mean $\mu$, we have with prob. at least $1 - \delta$ that
$$\left|\frac 1 n \sum_{i=1}^n Z_i - \mu \right| \leq \sqrt{\frac{C \log (1/\delta)}{n}}$$
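A quick numerical sanity check (an illustrative sketch; the parameter choices are arbitrary): compare the empirical tail probability for a sum of centered $\pm 1$ coin flips against the $\exp(-t^2/2n)$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)

n, t = 1000, 60
num_trials = 200_000

# A sum of n IID Rademacher (+/-1) variables equals 2 * Binomial(n, 1/2) - n.
sums = 2 * rng.binomial(n, 0.5, size=num_trials) - n

empirical = np.mean(sums >= t)
bound = np.exp(-t**2 / (2 * n))
print(f"empirical Pr[sum >= {t}] = {empirical:.4f}")   # roughly 0.03 here
print(f"Chernoff/Hoeffding bound = {bound:.4f}")       # about 0.17
```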
This inequality is often referred to as the Hoeffding or Azuma inequality.

---
class: top

## Proof

**FACT**: For a random variable $-1\le X\le 1$ such that $\mathbf{E}[X]=0,$ we have $\mathbf{E}[\exp{\left(\lambda X\right)}]\leq \exp{\left(\lambda^2/2\right)}.$
\begin{align*} \mathbf{Pr}\left[\sum_{i=1}^nX_i\geq t\right]&=\mathbf{Pr}\left[\exp{\left(\lambda\sum_{i=1}^nX_i\right)}\geq \exp{\left(\lambda t\right)}\right]\\ &\leq \frac{\mathbf{E}\left[\exp{\left(\lambda\sum_{i=1}^nX_i\right)}\right]}{\exp{\left(\lambda t\right)}}\\ &\textstyle \le \exp{\left(-\lambda t\right)}\cdot\mathbf{E}\left[\prod_{i=1}^n\exp{\left(\lambda X_i\right)}\right]\\ &= \exp{\left(-\lambda t\right)}\cdot\prod_{i=1}^n\mathbf{E}\left[\exp{\left(\lambda X_i\right)}\right]\\ & \leq \exp{\left(-\lambda t\right)} \prod_{i=1}^n \exp(\lambda^2/2). \end{align*}
Plug in the optimal $\lambda = t/n$ and you get the result:
$$\exp\left(-\lambda t + \frac{n\lambda^2}{2}\right)\Bigg|_{\lambda = t/n} = \exp\left(-\frac{t^2}{n} + \frac{t^2}{2n}\right) = \exp\left(-\frac{t^2}{2n}\right)$$

---
class: top

## First Attempt: Chernoff/Hoeffding to control deviation

**Observation**: say we use the optimally chosen $\theta$ from our algorithm

- Let's set $Z\_i = \ell(f\_\theta(x\_i), y\_i)$
- Then the mean $\mu$ of $Z\_i$ is $\EE\_{(x,y)\sim \mathcal D}[\ell(f\_{\theta}(x),y)]$
- We can use Chernoff/Hoeffding to bound the deviation between training and testing error! With probability at least $1-\delta$, we have
$$\frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \leq \sqrt{\frac{C \log (1/\delta)}{n}}$$
--

**WRONG!!!!!**

- We used the data to pick $\theta$! The $Z_i$'s are no longer independent!!

---
class: top

## Second Attempt: finite class, use a union bound!

- Assume the hypothesis class $\mathcal{H} := \\{f_\theta : \theta \in \Theta \\}$ is *finite*
- We can now throw in a union bound:
\begin{eqnarray*} & & \mathbf{Pr}\left[ \exists \theta \in \Theta : \frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \geq t \right] \\ & \leq & \sum_{\theta \in \Theta} \mathbf{Pr}\left[ \frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \geq t \right] \\ & \leq & |\Theta| \exp{\left(-\frac{t^2}{2n}\right)} \end{eqnarray*}
--

With prob. $\geq 1-\delta$, we have, simultaneously for all $\theta \in \Theta$:
$$\frac 1 n \sum_{i=1}^n \ell(f_\theta(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f_{\theta}(x),y)] \leq \sqrt{\frac{C \log (|\Theta|/\delta)}{n}}$$
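To see both the selection effect and the $\log|\Theta|$ price numerically, here is an illustrative simulation (the class of "random guess" classifiers and the coin-flip labels are assumptions, not from the slides): every hypothesis has true error exactly $1/2$, yet the best *training* error over the class looks much better, by roughly the $\sqrt{\log|\Theta|/(2n)}$ amount the union bound predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200            # training-set size
num_hyp = 10_000   # |Theta|: a finite class of "random guess" classifiers

# Labels are pure coin flips, so EVERY hypothesis has true error exactly 1/2.
y = rng.choice([-1, 1], size=n)

# Each hypothesis predicts independently at random; its training error is
# the fraction of disagreements with y.
preds = rng.choice([-1, 1], size=(num_hyp, n))
train_errors = np.mean(preds != y, axis=1)

best = train_errors.min()
deviation_scale = np.sqrt(np.log(num_hyp) / (2 * n))
print(f"best training error over {num_hyp} hypotheses: {best:.3f} (true error is 0.500)")
print(f"union-bound deviation scale sqrt(log|Theta|/2n): {deviation_scale:.3f}")
```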
---
class: top

## Generalization Bound for Infinite Classes?

- This is where things get **deep**! It relies on many ideas developed by Vapnik, Chervonenkis, and others, around the 1970s.

--

- We say that a binary function class $\mathcal H$ *shatters* $x\_1, \ldots, x\_d$ if $|\{(f(x\_1), \ldots, f(x\_d)) : f \in \mathcal H \}| = 2^d$; i.e., $\mathcal H$ can produce all $2^d$ labellings!
- **VC Dimension**: The Vapnik-Chervonenkis dimension of $\mathcal H$ is the size of the largest set *shattered* by $\mathcal H$.

--

The VC-dim of linear classifiers in $\reals^2$ is 3!
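A brute-force illustration of shattering (a sketch under the assumption that "linear" means affine classifiers $\mathrm{sign}(w \cdot x + b)$; the direction-grid separability test is approximate, not an exact algorithm):

```python
import numpy as np
from itertools import product

def separable(points, labels, num_angles=2000):
    """Approximate check: does some affine classifier sign(w.x + b) realize `labels`?
    In 2D it suffices to find a direction whose projections separate the two classes."""
    labels = np.asarray(labels)
    if np.all(labels == labels[0]):
        return True                                   # constant labelings are trivial
    for a in np.linspace(0.0, np.pi, num_angles):
        proj = points @ np.array([np.cos(a), np.sin(a)])
        for s in (labels, -labels):                   # try both orientations of the direction
            if proj[s == 1].min() > proj[s == -1].max():
                return True
    return False

def shattered(points):
    """True if affine classifiers realize all 2^d labelings of these d points."""
    return all(separable(points, lab) for lab in product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # XOR configuration
print("3 points shattered:", shattered(three))   # expect True
print("4 points shattered:", shattered(four))    # expect False: the XOR labeling fails
```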
---
class: top

## Core Result of Statistical Learning Theory

Up to constant factors, the expected worst-case gap between training and true loss satisfies:

\begin{eqnarray*}
& & \EE \left[ \sup_{f \in \mathcal H} \left( \frac 1 n \sum_{i=1}^n \ell(f(x_i), y_i) - \EE_{(x,y)\sim \mathcal D}[\ell(f(x),y)] \right) \right]\\
& \leq & \EE_{\sigma_{1:n}\sim \{-1,1\}} \left[ \sup_{f \in \mathcal H} \frac 1 n \sum_{i=1}^n \sigma_i f(x_i) \right] \quad \quad \quad \quad \quad (\text{Rademacher Complexity})\\
& \leq & \sqrt{\frac{\log |\{(f(x_1), \ldots, f(x_n)) : f \in \mathcal H \}|}{n}} \quad \quad \text{(Massart/Chernoff Lemma)} \\
& \leq & \sqrt{\frac{\text{VC-dim}(\mathcal H) \log n }{n}}\quad \quad \quad \quad \quad \quad \quad \quad \quad \text{(Sauer/Shelah Lemma)}
\end{eqnarray*}
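The Rademacher complexity in the first inequality can be estimated by Monte Carlo over the random signs $\sigma_i$; here is a sketch for the illustrative finite class of threshold classifiers used earlier (the sample, the class, and the comparison quantity are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
x = rng.uniform(-1, 1, size=n)                       # a fixed sample
thetas = np.linspace(-1, 1, 101)                     # finite class of threshold classifiers

# predictions[j, i] = f_{theta_j}(x_i) in {-1, +1}
predictions = np.where(x[None, :] > thetas[:, None], 1, -1)

# Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ]
num_draws = 5000
sigmas = rng.choice([-1, 1], size=(num_draws, n))
sup_corr = (sigmas @ predictions.T / n).max(axis=1)  # sup over the class, per sigma draw
print(f"estimated Rademacher complexity: {sup_corr.mean():.3f}")
print(f"sqrt(log|H| / n) for comparison: {np.sqrt(np.log(len(thetas)) / n):.3f}")
```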
**What this means:** The amount of data you need grows roughly in proportion to the VC-dimension: a hypothesis class with $d$ times the VC-dimension needs about $d$ times as much data for the same generalization performance!

---
class: top

## Key Takeaway

- The ability to estimate models from data becomes "harder" if your functions are more "complex"
- Complexity is measured in terms of the Vapnik-Chervonenkis dimension
- **But** the Chernoff bounding technique sits underneath all of these results
- Chernoff bounds arise constantly in all types of computational statistical problems

---
class: middle
$\sim \sim$ FIN $\sim \sim$
.footnote[Slides made entirely with Markdown, Remark.js, MathJax, and animated GIFs]