In this project, I built an application that extracts streaming tweets from Twitter, transforms the data, and visualizes the results using Apache Spark Streaming to surface the trending hashtags for a specific topic. In particular, I used a window size of 5 minutes so the output always reflects the latest 5 minutes of tweets.
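The core of the pipeline is a windowed aggregation. Below is a minimal sketch of that idea using PySpark's DStream API; the socket source on localhost:5555, the 10-second batch interval, and the checkpoint directory are assumptions for illustration, not the project's actual configuration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TrendingHashtags")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches (assumed)
ssc.checkpoint("./checkpoint")                 # required by the inverse-reduce window below

# Assumed source: tweet texts arriving one per line on a local socket.
tweets = ssc.socketTextStream("localhost", 5555)

hashtag_counts = (
    tweets.flatMap(lambda line: line.split())
          .filter(lambda word: word.startswith("#"))
          .map(lambda tag: (tag, 1))
          # Count over a sliding 5-minute (300 s) window, updated every 10 s,
          # so the output always covers the latest 5 minutes of tweets.
          .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 300, 10)
)

# Print the top hashtags of the current window in each batch.
hashtag_counts.transform(
    lambda rdd: rdd.sortBy(lambda kv: -kv[1])
).pprint(10)

ssc.start()
ssc.awaitTermination()
```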
Analysis of 2018 H-1B Sponsorship for Data Science Employees
Introduction
The H-1B is a U.S. visa that allows U.S. employers to temporarily employ foreign workers in specialty occupations. For international students trying to find data science jobs in the U.S., the H-1B is the most common working visa. Job hunting is stressful, so tactics matter when selecting which companies to apply to. That is not to say there are no chances at companies with no H-1B sponsorship records in 2018, since policies vary by company and by year. Accessing more information and prioritizing the tasks at hand are what new grads need to do.
I want to dig a little into the 2018 H-1B case disclosure data. The data can be found on the U.S. Department of Labor's website: https://www.foreignlaborcert.doleta.gov/performancedata.cfm#dis. I will focus only on data related to entry-level jobs.
The Jupyter notebook is here.
Study Note: Linear Regression Example Prostate Cancer
```python
import scipy
import pandas as pd  # read_csv needs pandas, not scipy
# load the prostate data: tab-separated, first column is the row index
data = pd.read_csv('./data/prostate.data', delimiter='\t', index_col=0)
```
Study Note: Assessing Model Accuracy
no free lunch in statistics: no one method dominates all others over all possible data sets.
- Explanation: On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results.
Measuring the Quality of Fit
mean squared error (MSE)
\[ \begin{align} MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2 \end{align} \]
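In code, the formula is a one-liner; here is a minimal sketch with made-up numbers:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: the average squared gap between observed and predicted."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

# toy numbers for illustration only
print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.25 + 0.25 + 0.0) / 3 ≈ 0.167
```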
overfitting: When a given method yields a small training MSE but a large test MSE.
- Explanation: a less flexible model would have yielded a smaller test MSE. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up patterns that are caused by random chance rather than by true properties of the unknown function \(f\).
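A minimal simulation of this effect, using a made-up true \(f\) and noise level: as the polynomial degree grows, training MSE keeps shrinking while test MSE eventually climbs.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                      # stand-in for the unknown true f
x_tr = rng.uniform(0, 3, 30);  y_tr = f(x_tr) + rng.normal(0, 0.3, 30)
x_te = rng.uniform(0, 3, 200); y_te = f(x_te) + rng.normal(0, 0.3, 200)

for degree in (1, 3, 10, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)       # least squares polynomial fit
    train_mse = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    test_mse = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Exact numbers depend on the seed, but the most flexible fits chase noise in the 30 training points and pay for it on the test set.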
Study Note: Bias, Variance and Model Complexity
Bias, Variance and Model Complexity
Test error (generalization error): the prediction error over an independent test sample \[ \mathrm{Err}_{\tau}=E[L(Y,\hat{f}(X))\mid\tau] \] Here the training set \(\tau\) is fixed, and test error refers to the error for this specific training set.
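Since \(\tau\) is fixed, \(\mathrm{Err}_{\tau}\) can be approximated by fitting once on \(\tau\) and averaging the loss over a large independent test sample. A minimal sketch with squared error loss and a made-up data-generating process:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 1 + 2 * x                          # assumed true regression function

# One fixed training set tau, with f_hat fit on it by least squares.
x_tau = rng.uniform(0, 1, 20)
y_tau = f(x_tau) + rng.normal(0, 0.5, 20)
b1, b0 = np.polyfit(x_tau, y_tau, 1)             # slope, intercept

# A large independent test sample approximates the expectation E[L | tau].
x_new = rng.uniform(0, 1, 100_000)
y_new = f(x_new) + rng.normal(0, 0.5, 100_000)
err_tau = np.mean((y_new - (b0 + b1 * x_new)) ** 2)
print("Err_tau for this particular training set:", err_tau)
```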
Study Note: Dimension Reduction - PCA, PCR
Dimension Reduction Methods
Subset selection and shrinkage methods all use the original predictors, \(X_1, X_2, \ldots, X_p\).
Dimension Reduction Methods transform the predictors and then fit a least squares model using the transformed variables.
Approach
Let \(Z_1,Z_2, . . . ,Z_M\) represent \(M < p\) linear combinations of our original \(p\) predictors. That is,
\[ \begin{align} Z_m=\sum_{j=1}^p\phi_{jm}X_j \end{align} \]
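This is exactly what principal components regression does: the \(Z_m\) are the first \(M\) principal components, and least squares is then fit on the transformed variables. A minimal scikit-learn sketch on synthetic data; \(p = 10\), \(M = 3\), and the coefficients are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))                   # n = 100 observations, p = 10 predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Z_1..Z_M are linear combinations of the X_j (PCA loadings are the phi_jm);
# a least squares model is then fit on the M transformed variables.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print("training R^2 with M = 3 components:", pcr.score(X, y))
```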
Study Note: Decision Trees, Random Forest, and Boosting
Introduction to Decision Trees
Regression Trees
Predicting Baseball Players’ Salaries Using Regression Trees
Terminal nodes: The regions \(R_1\), \(R_2\), and \(R_3\) are known as terminal nodes or leaves of the tree.
Internal nodes: The points along the tree where the predictor space is split are referred to as internal nodes.
Branches: The segments of the tree that connect the nodes are referred to as branches.
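A minimal sketch of such a tree with scikit-learn. The data below are synthetic stand-ins for the Hitters salary data (Years and Hits predicting log salary), so the printed splits are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(3)
years = rng.integers(1, 20, size=200)            # made-up career lengths
hits = rng.integers(40, 200, size=200)           # made-up hit counts
log_salary = 4 + 0.1 * years + 0.005 * hits + rng.normal(0, 0.3, 200)

X = np.column_stack([years, hits])
tree = DecisionTreeRegressor(max_depth=2).fit(X, log_salary)

# export_text shows the internal nodes (splits), the branches, and the
# terminal nodes (leaf predictions) of the fitted tree.
print(export_text(tree, feature_names=["Years", "Hits"]))
```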
Study Note: Model Selection and Regularization (Ridge & Lasso)
Introduction to Model Selection
Setting:
In the regression setting, the standard linear model is \(Y = \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p + \epsilon\).
In the chapters that follow, we consider some approaches for extending the linear model framework.
Reasons for using fitting procedures other than least squares:
- Prediction Accuracy:
- Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias.
- If \(n \gg p\), least squares estimates tend to also have low variance \(\Rightarrow\) they perform well on test data.
- If \(n\) is not much larger than \(p\), the least squares fit has large variance \(\Rightarrow\) overfitting \(\Rightarrow\) consequently poor predictions on test data.
- If \(p > n\), there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all.
By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.
- Model Interpretability:
- Irrelevant variables lead to unnecessary complexity in the resulting model. By removing these variables, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted.
- Least squares is extremely unlikely to yield coefficient estimates that are exactly zero \(\Rightarrow\) it cannot perform feature selection on its own (see the sketch after the list below).
Alternatives to least squares:
- Subset Selection
- Shrinkage
- Dimension Reduction
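A minimal sketch contrasting least squares with ridge and lasso on synthetic data where \(n\) is not much larger than \(p\) and only three predictors matter; the `alpha` values are assumptions, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 20))                    # n = 50 not much larger than p = 20
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]                      # only 3 truly relevant predictors
y = X @ beta + rng.normal(scale=0.5, size=50)

for name, model in [("least squares", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    coefs = model.fit(X, y).coef_
    nonzero = int(np.sum(np.abs(coefs) > 1e-8))
    print(f"{name:>13}: {nonzero} nonzero coefficients")
```

Ridge shrinks but keeps all 20 coefficients nonzero, while lasso sets many exactly to zero, which is the feature selection that least squares cannot do.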
Study Note: Resampling Methods - Cross Validation, Bootstrap
Resampling methods: involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
model assessment: The process of evaluating a model’s performance
model selection: The process of selecting the proper level of flexibility for a model
cross-validation: can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
bootstrap: provides a measure of the accuracy of a parameter estimate or of a given statistical learning method.
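A minimal sketch of both ideas; the synthetic data, the 5-fold split, and the 1000 bootstrap replicates are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

# Cross-validation: estimate the test MSE of a given learning method.
cv_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")
print("5-fold CV estimate of test MSE:", cv_mse.mean())

# Bootstrap: resample with replacement to measure the variability of an
# estimate (here, the standard error of the sample mean of y).
boot_means = [rng.choice(y, size=len(y), replace=True).mean()
              for _ in range(1000)]
print("bootstrap SE of the sample mean:", np.std(boot_means))
```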
Study Note: Comparing Logistic Regression, LDA, QDA, and KNN
Logistic regression and LDA methods are closely connected.
Setting: Consider the two-class setting with \(p = 1\) predictor, and let \(p_1(x)\) and \(p_2(x) = 1−p_1(x)\) be the probabilities that the observation \(X = x\) belongs to class 1 and class 2, respectively.
In LDA, from
\[ \begin{align} p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp{\left( -\frac{1}{2\sigma^2}(x-\mu_k)^2 \right)}}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp{\left( -\frac{1}{2\sigma^2}(x-\mu_l)^2 \right)}} \end{align} \]
taking the log and discarding terms that do not depend on \(k\), the discriminant function is \[ \begin{align} \delta_k(x)=x\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k) \end{align} \] The log odds is given by
\[ \begin{align}\log{\frac{p_1(x)}{1-p_1(x)}}=\log{\frac{p_1(x)}{p_2(x)}}=c_0+c_1x \end{align} \] where \(c_0\) and \(c_1\) are functions of \(\mu_1\), \(\mu_2\), and \(\sigma^2\).
In Logistic Regression,
\[ \begin{align} \log{\frac{p_1}{1-p_1}}=\beta_0+\beta_1x \end{align} \]
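To see the connection numerically, here is a minimal sketch that fits both models on synthetic data generated under the LDA assumptions (one Gaussian per class with a shared \(\sigma\)); the class means and sample size are made up. Both recover a log odds that is linear in \(x\), with slightly different estimates of \(c_0\) and \(c_1\) because the two methods use different fitting procedures (maximum likelihood for logistic regression, plug-in Gaussian estimates for LDA):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 500
x_class1 = rng.normal(loc=-1.0, scale=1.0, size=n)   # mu_1 = -1, shared sigma = 1
x_class2 = rng.normal(loc=+1.0, scale=1.0, size=n)   # mu_2 = +1
X = np.concatenate([x_class1, x_class2]).reshape(-1, 1)
y = np.array([0] * n + [1] * n)

lda = LinearDiscriminantAnalysis().fit(X, y)
logreg = LogisticRegression().fit(X, y)

# Both decision functions take the form c0 + c1 * x.
print("LDA:      c0 = %.3f, c1 = %.3f" % (lda.intercept_[0], lda.coef_[0, 0]))
print("logistic: c0 = %.3f, c1 = %.3f" % (logreg.intercept_[0], logreg.coef_[0, 0]))
```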