In this project, I built an application that extracts streaming tweets from Twitter, transforms the data, and visualizes the results using Apache Spark Streaming to surface the trending hashtags for a specific topic. In particular, I used a window size of 5 minutes so the output always reflects the latest 5 minutes of tweets.
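The core of the pipeline is a windowed aggregation. Below is a minimal sketch of that idea using PySpark's DStream API; the socket source on localhost:5555, the 10-second batch interval, and the checkpoint directory are assumptions for illustration, not the project's actual configuration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TrendingHashtags")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches (assumed)
ssc.checkpoint("./checkpoint")                 # required by the inverse-reduce window below

# Assumed source: tweet texts arriving one per line on a local socket.
tweets = ssc.socketTextStream("localhost", 5555)

hashtag_counts = (
    tweets.flatMap(lambda line: line.split())
          .filter(lambda word: word.startswith("#"))
          .map(lambda tag: (tag, 1))
          # Count over a sliding 5-minute (300 s) window, updated every 10 s,
          # so the output always covers the latest 5 minutes of tweets.
          .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 300, 10)
)

# Print the top hashtags of the current window in each batch.
hashtag_counts.transform(
    lambda rdd: rdd.sortBy(lambda kv: -kv[1])
).pprint(10)

ssc.start()
ssc.awaitTermination()
```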
Analysis of 2018 H-1B Sponsorship for Data Science Employees
Introduction
The H-1B is a U.S. visa that allows U.S. employers to temporarily employ foreign workers in specialty occupations. For international students trying to find data science jobs in the U.S., the H-1B is the most common working visa. Job hunting is stressful, so tactics matter when selecting which companies to apply to. That is not to say there are no chances at companies with no H-1B sponsorship records in 2018, since policies vary by company and by year. Accessing more information and prioritizing the tasks at hand are what new grads need to do.
I want to dig a little into the 2018 H-1B case disclosure data. The data can be found on the U.S. Department of Labor's website: https://www.foreignlaborcert.doleta.gov/performancedata.cfm#dis. I will focus only on data related to entry-level jobs.
The Jupyter notebook is here.
Study Note: Linear Regression Example Prostate Cancer
```python
import scipy
import pandas as pd  # read_csv needs pandas, not scipy
# load the prostate data: tab-separated, first column is the row index
data = pd.read_csv('./data/prostate.data', delimiter='\t', index_col=0)
```
Study Note: Assessing Model Accuracy
no free lunch in statistics: no one method dominates all others over all possible data sets.
- Explanation: On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results.
Measuring the Quality of Fit
mean squared error (MSE)
\[ \begin{align} MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2 \end{align} \]
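In code, the formula is a one-liner; here is a minimal sketch with made-up numbers:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: the average squared gap between observed and predicted."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

# toy numbers for illustration only
print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.25 + 0.25 + 0.0) / 3 ≈ 0.167
```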
overfitting: When a given method yields a small training MSE but a large test MSE.
- Explanation: a less flexible model would have yielded a smaller test MSE. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up patterns that are caused by random chance rather than by true properties of the unknown function \(f\).
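A minimal simulation of this effect, using a made-up true \(f\) and noise level: as the polynomial degree grows, training MSE keeps shrinking while test MSE eventually climbs.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                      # stand-in for the unknown true f
x_tr = rng.uniform(0, 3, 30);  y_tr = f(x_tr) + rng.normal(0, 0.3, 30)
x_te = rng.uniform(0, 3, 200); y_te = f(x_te) + rng.normal(0, 0.3, 200)

for degree in (1, 3, 10, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)       # least squares polynomial fit
    train_mse = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    test_mse = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Exact numbers depend on the seed, but the most flexible fits chase noise in the 30 training points and pay for it on the test set.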
Study Note: Bias, Variance and Model Complexity
Bias, Variance and Model Complexity
Test error (generalization error): the prediction error over an independent test sample \[ \mathrm{Err}_{\tau}=E[L(Y,\hat{f}(X))\mid\tau] \] Here the training set \(\tau\) is fixed, and test error refers to the error for this specific training set.
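Since \(\tau\) is fixed, \(\mathrm{Err}_{\tau}\) can be approximated by fitting once on \(\tau\) and averaging the loss over a large independent test sample. A minimal sketch with squared error loss and a made-up data-generating process:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 1 + 2 * x                          # assumed true regression function

# One fixed training set tau, with f_hat fit on it by least squares.
x_tau = rng.uniform(0, 1, 20)
y_tau = f(x_tau) + rng.normal(0, 0.5, 20)
b1, b0 = np.polyfit(x_tau, y_tau, 1)             # slope, intercept

# A large independent test sample approximates the expectation E[L | tau].
x_new = rng.uniform(0, 1, 100_000)
y_new = f(x_new) + rng.normal(0, 0.5, 100_000)
err_tau = np.mean((y_new - (b0 + b1 * x_new)) ** 2)
print("Err_tau for this particular training set:", err_tau)
```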
Study Note: Dimension Reduction - PCA, PCR
Dimension Reduction Methods
Subset selection and shrinkage methods all use the original predictors, \(X_1, X_2, \ldots, X_p\).
Dimension Reduction Methods transform the predictors and then fit a least squares model using the transformed variables.
Approach
Let \(Z_1,Z_2, . . . ,Z_M\) represent \(M < p\) linear combinations of our original \(p\) predictors. That is,
\[ \begin{align} Z_m=\sum_{j=1}^p\phi_{jm}X_j \end{align} \]
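This is exactly what principal components regression does: the \(Z_m\) are the first \(M\) principal components, and least squares is then fit on the transformed variables. A minimal scikit-learn sketch on synthetic data; \(p = 10\), \(M = 3\), and the coefficients are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))                   # n = 100 observations, p = 10 predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Z_1..Z_M are linear combinations of the X_j (PCA loadings are the phi_jm);
# a least squares model is then fit on the M transformed variables.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print("training R^2 with M = 3 components:", pcr.score(X, y))
```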
Study Note: Decision Trees, Random Forest, and Boosting
Introduction to Decision Trees
Regression Trees
Predicting Baseball Players’ Salaries Using Regression Trees
Terminal nodes: The regions \(R_1\), \(R_2\), and \(R_3\) are known as terminal nodes or leaves of the tree.
Internal nodes: The points along the tree where the predictor space is split are referred to as internal nodes.
Branches: The segments of the tree that connect the nodes are referred to as branches.
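A minimal sketch of such a tree with scikit-learn. The data below are synthetic stand-ins for the Hitters salary data (Years and Hits predicting log salary), so the printed splits are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)
years = rng.integers(1, 20, size=200)            # made-up career lengths
hits = rng.integers(40, 200, size=200)           # made-up hit counts
log_salary = 4 + 0.1 * years + 0.005 * hits + rng.normal(0, 0.3, 200)

X = np.column_stack([years, hits])
tree = DecisionTreeRegressor(max_depth=2).fit(X, log_salary)

# export_text shows the internal nodes (splits), the branches, and the
# terminal nodes (leaf predictions) of the fitted tree.
print(export_text(tree, feature_names=["Years", "Hits"]))
```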
Study Note: Model Selection and Regularization (Ridge & Lasso)
Introduction to Model Selection
Setting:
In the regression setting, the standard linear model is \(Y = \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p + \epsilon\).
In the chapters that follow, we consider some approaches for extending the linear model framework.
Reasons for using fitting procedures other than least squares:
- Prediction Accuracy:
- Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias.
- If \(n \gg p\), least squares estimates tend to also have low variance \(\Rightarrow\) they perform well on test data.
- If \(n\) is not much larger than \(p\), the least squares fit has large variance \(\Rightarrow\) overfitting \(\Rightarrow\) consequently poor predictions on test data.
- If \(p > n\), there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all.
By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.
- Model Interpretability:
- Irrelevant variables lead to unnecessary complexity in the resulting model. By removing these variables, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted.
- Least squares is extremely unlikely to yield coefficient estimates that are exactly zero \(\Rightarrow\) it cannot perform feature selection on its own (see the sketch after the list below).
Alternatives to least squares:
- Subset Selection
- Shrinkage
- Dimension Reduction
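A minimal sketch contrasting least squares with ridge and lasso on synthetic data where \(n\) is not much larger than \(p\) and only three predictors matter; the `alpha` values are assumptions, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 20))                    # n = 50 not much larger than p = 20
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]                      # only 3 truly relevant predictors
y = X @ beta + rng.normal(scale=0.5, size=50)

for name, model in [("least squares", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    coefs = model.fit(X, y).coef_
    nonzero = int(np.sum(np.abs(coefs) > 1e-8))
    print(f"{name:>13}: {nonzero} nonzero coefficients")
```

Ridge shrinks but keeps all 20 coefficients nonzero, while lasso sets many exactly to zero, which is the feature selection that least squares cannot do.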
Study Note: Resampling Methods - Cross Validation, Bootstrap
Resampling methods: involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
model assessment: The process of evaluating a model’s performance
model selection: The process of selecting the proper level of flexibility for a model
cross-validation: can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
bootstrap: provides a measure of the accuracy of a parameter estimate or of a given statistical learning method.
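A minimal sketch of both ideas; the synthetic data, the 5-fold split, and the 1000 bootstrap replicates are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

# Cross-validation: estimate the test MSE of a given learning method.
cv_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")
print("5-fold CV estimate of test MSE:", cv_mse.mean())

# Bootstrap: resample with replacement to measure the variability of an
# estimate (here, the standard error of the sample mean of y).
boot_means = [rng.choice(y, size=len(y), replace=True).mean()
              for _ in range(1000)]
print("bootstrap SE of the sample mean:", np.std(boot_means))
```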
Study Note: Comparing Logistic Regression, LDA, QDA, and KNN
Logistic regression and LDA methods are closely connected.
Setting: Consider the two-class setting with \(p = 1\) predictor, and let \(p_1(x)\) and \(p_2(x) = 1−p_1(x)\) be the probabilities that the observation \(X = x\) belongs to class 1 and class 2, respectively.
In LDA, from
\[ \begin{align} p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp{\left( -\frac{1}{2\sigma^2}(x-\mu_k)^2 \right)}}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp{\left( -\frac{1}{2\sigma^2}(x-\mu_l)^2 \right)}} \end{align} \]
taking the log and discarding terms that do not depend on \(k\), the discriminant function is \[ \begin{align} \delta_k(x)=x\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k) \end{align} \] The log odds is given by
\[ \begin{align}\log{\frac{p_1(x)}{1-p_1(x)}}=\log{\frac{p_1(x)}{p_2(x)}}=c_0+c_1x \end{align} \] where \(c_0\) and \(c_1\) are functions of \(\mu_1\), \(\mu_2\), and \(\sigma^2\).
In Logistic Regression,
\[ \begin{align} \log{\frac{p_1}{1-p_1}}=\beta_0+\beta_1x \end{align} \]
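To see the connection numerically, here is a minimal sketch that fits both models on synthetic data generated under the LDA assumptions (one Gaussian per class with a shared \(\sigma\)); the class means and sample size are made up. Both recover a log odds that is linear in \(x\), with slightly different estimates of \(c_0\) and \(c_1\) because the two methods use different fitting procedures (maximum likelihood for logistic regression, plug-in Gaussian estimates for LDA):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 500
x_class1 = rng.normal(loc=-1.0, scale=1.0, size=n)   # mu_1 = -1, shared sigma = 1
x_class2 = rng.normal(loc=+1.0, scale=1.0, size=n)   # mu_2 = +1
X = np.concatenate([x_class1, x_class2]).reshape(-1, 1)
y = np.array([0] * n + [1] * n)

lda = LinearDiscriminantAnalysis().fit(X, y)
logreg = LogisticRegression().fit(X, y)

# Both decision functions take the form c0 + c1 * x.
print("LDA:      c0 = %.3f, c1 = %.3f" % (lda.intercept_[0], lda.coef_[0, 0]))
print("logistic: c0 = %.3f, c1 = %.3f" % (logreg.intercept_[0], logreg.coef_[0, 0]))
```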