Kamel's Notes

Code changes the world!



Study Note: Clustering

Posted on 2019-06-15 | Edited on 2019-10-19 | In Machine Learning
Symbols count in article: 7.1k | Reading time ≈ 6 mins.

Clustering Methods

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.

  • When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other

  • This is an unsupervised problem because we are trying to discover structure
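As a hedged sketch of one common clustering method, here is a minimal k-means loop in plain Python (the toy points and the choice of k = 2 are made up for illustration, not taken from the post):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated blobs in the plane (toy data).
pts = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.0),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, groups = kmeans(pts, k=2)
```

On separated blobs like these, the loop converges in a few iterations and the two centroids land near the blob means.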


Machine Learning Q&A Part II: COD, Reg, Model Evaluation, Dimensionality Reduction

Posted on 2019-06-15 | Edited on 2019-10-19 | In Machine Learning
Symbols count in article: 16k | Reading time ≈ 15 mins.


Curse of dimensionality

1. Describe the curse of dimensionality with examples.

Curse of dimensionality: as the dimensionality of the feature space increases, the number of possible configurations grows exponentially, so the fraction of configurations covered by any fixed number of observations shrinks.

As the number of features (dimensions) grows, the amount of data we need to generalise accurately grows exponentially.

(Fun example: it's easy to hunt a dog, and maybe catch it, if it is running around on a plain (two dimensions). It's much harder to hunt birds, which have an extra dimension they can move in. If we pretend that ghosts are higher-dimensional beings, catching one would be harder still.)
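The sparsity effect can be made concrete with a tiny pure-Python sketch (the 10% target fraction is an arbitrary illustration): for data spread uniformly over a unit hypercube, the edge length of a sub-cube that captures a fixed fraction of the data satisfies e**d = f, and it approaches 1 as the dimension d grows, i.e. "local" neighbourhoods stop being local.

```python
# Edge length e of a sub-cube covering fraction f of a unit
# hypercube in d dimensions satisfies e**d = f, so e = f**(1/d).
f = 0.10  # target fraction of the data (arbitrary illustration)
for d in (1, 2, 10, 100):
    e = f ** (1 / d)
    print(f"d={d:3d}: edge length needed = {e:.3f}")
# d=  1: 0.100, d=  2: 0.316, d= 10: 0.794, d=100: 0.977
```

Already at 10 dimensions you must span about 79% of each axis just to cover 10% of the data.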


Machine Learning Q&A Part I: Learning Theory & Model Selection

Posted on 2019-06-14 | Edited on 2019-10-19 | In Machine Learning
Symbols count in article: 21k | Reading time ≈ 19 mins.

Learning Theory

1. Describe bias and variance with examples.

Variance: refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set; more flexible statistical methods have higher variance.

  • Explanation: different training data sets will result in a different \(\hat{f}\). But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in \(\hat{f}\)

Bias: refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

  • Explanation: As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.

Decomposition: The expected test MSE, for a given value \(x_0\), can always be decomposed into the sum of three fundamental quantities: the variance of \(\hat{f}(x_0)\), the squared bias of \(\hat{f}(x_0)\), and the variance of the error term \(\epsilon\). \[ \begin{align} E(y_0-\hat{f}(x_0))^2=Var(\hat{f}(x_0))+[Bias(\hat{f}(x_0))]^2+Var(\epsilon) \end{align} \] The overall expected test MSE can be computed by averaging \(E(y_0-\hat{f}(x_0))^2\) over all possible values of \(x_0\) in the test set.
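The decomposition can be checked numerically. Below is a hedged Monte Carlo sketch with a deliberately biased estimator (the 0.9 shrinkage factor, sample size, and noise level are all made-up illustration constants, not from the text): we repeatedly draw a training set, form \(\hat{f}(x_0)\), and compare the empirical test MSE against Var + Bias² + Var(ε).

```python
import random
import statistics

random.seed(0)
f_x0 = 2.0       # true f(x0)
sigma = 1.0      # sd of the irreducible error epsilon
n, trials = 25, 20000

fhats, sq_errs = [], []
for _ in range(trials):
    # Fresh training set of n noisy responses at x0; shrinking the
    # sample mean by 0.9 deliberately introduces bias.
    train = [f_x0 + random.gauss(0, sigma) for _ in range(n)]
    fhat = 0.9 * statistics.fmean(train)
    y0 = f_x0 + random.gauss(0, sigma)   # fresh test response at x0
    fhats.append(fhat)
    sq_errs.append((y0 - fhat) ** 2)

mse = statistics.fmean(sq_errs)                  # E[(y0 - fhat)^2]
var_fhat = statistics.pvariance(fhats)           # Var(fhat)
bias_sq = (statistics.fmean(fhats) - f_x0) ** 2  # Bias(fhat)^2
rhs = var_fhat + bias_sq + sigma ** 2            # Var + Bias^2 + Var(eps)
# mse and rhs agree up to Monte Carlo noise (roughly 1.07 here:
# 0.0324 variance + 0.04 squared bias + 1.0 irreducible error).
```

Note that the irreducible \(Var(\epsilon)\) term puts a floor under the test MSE no matter how good the estimator is.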


Study Note: SVM

Posted on 2019-06-12 | Edited on 2019-10-19 | In Machine Learning
Symbols count in article: 11k | Reading time ≈ 10 mins.

Maximal Margin Classifier

What Is a Hyperplane?

Hyperplane: In a p-dimensional space, a hyperplane is a flat affine subspace of dimension \(p − 1\).

  • e.g. in two dimensions, a hyperplane is a flat one-dimensional subspace—in other words, a line.

Mathematical definition of a hyperplane: \[ \beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p=0, \quad (9.1) \]

  • Any \(X = (X_1,X_2,\dots,X_p)^T\) for which (9.1) holds is a point on the hyperplane.
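A hyperplane divides the space into two half-spaces: the sign of \(\beta_0+\beta_1X_1+\cdots+\beta_pX_p\) tells which side a point \(X\) falls on, which is what a maximal margin classifier exploits. A minimal sketch (the coefficients below are made-up examples, not from the post):

```python
def hyperplane_side(beta0, beta, x):
    """Return the sign of beta0 + beta . x: positive and negative
    values lie on opposite sides of the hyperplane; 0 is on it."""
    s = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return (s > 0) - (s < 0)

# The line 1 + 2*X1 + 3*X2 = 0 in two dimensions (made-up coefficients).
beta0, beta = 1.0, [2.0, 3.0]
print(hyperplane_side(beta0, beta, (1.0, 1.0)))    # 1 + 2 + 3 = 6 > 0  -> 1
print(hyperplane_side(beta0, beta, (-2.0, 1.0)))   # 1 - 4 + 3 = 0      -> 0
print(hyperplane_side(beta0, beta, (-2.0, -1.0)))  # 1 - 4 - 3 = -6 < 0 -> -1
```

A point exactly on the hyperplane returns 0; everything else is classified by which half-space it occupies.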
© 2021 Kamel Chehboun