This outline has not been updated since 2020 and may differ substantially from the current syllabus. I will remove this note when I update the content.
Number of elements in Dataset - Before and after modification
Create a summary of the data
dataset %>% summary() or dataset %>% glimpse()
Discuss Target Variable - Mention what the flag means and the ratio of observations in each class
dataset %>% count(target_variable)
If the responses are not roughly equally distributed, the classes are imbalanced
Some candidates may find it easier to manipulate the data in Excel and then import the adjusted file into R. Note: many Excel features, such as Pivot Tables, are disabled, which keeps Excel from being a good tool for Data Exploration
Delete Values Subset using set operators
dataset <- dataset[dataset$variable!=0,]
Replace Character Variables with Factors If any variables show character as their class, replace them with factors:
dataset <- dataset %>% modify_if(is.character, as.factor)
Alternatively, you can convert characters to releveled factors in one step:
dataset <- dataset %>% mutate_if(is.character, fct_infreq)
Level Reduction Consider both n and mean of the target variable before combining groups.
dataset <- dataset %>% mutate(variable_bucket = case_when(variable < 50 ~ 0, variable <= 70 ~ 1, variable > 70 ~ 2))
Set Reference Levels
This does not affect predictions nor any measures of model fit
This does affect decisions when hypothesis tests are conducted to consider removing specific factor levels, because each factor level is tested against the base level
Explicitly set a reference level - If you want to be testing other variables against one in particular
dataset$variable <- relevel(dataset$variable, ref = "newBaseLevel")
Set the reference level to the level with the most observations - If there is no obvious choice. Display the number of observations at each level:
dataset$column %>% as.factor() %>% summary()
Set the highest-frequency level of "column" to be the base level (tidyverse required):
dataset <- dataset %>% mutate(column = fct_infreq(column))
Train/Test Datasets Verify the mean of the target variable is similar between training and testing datasets
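A minimal base-R sketch of the split-and-verify step. The data frame and the `target_variable` name are placeholders from these notes; the toy data is illustrative only.

```r
# Toy data standing in for the real dataset (assumption for illustration)
set.seed(1234)                                   # reproducible data and split
dataset <- data.frame(target_variable = rbinom(100, 1, 0.3),
                      x = rnorm(100))

# 70/30 train/test split by random row index
train_index <- sample(seq_len(nrow(dataset)),
                      size = floor(0.7 * nrow(dataset)))
train <- dataset[train_index, ]
test  <- dataset[-train_index, ]

# Verify the mean of the target is similar between training and testing
mean(train$target_variable)
mean(test$target_variable)
```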
Regularized Regression (Penalized Regression) Standardize units - Coefficients are penalized the larger they are; therefore, if a variable is measuring Units consider changing to Dollars.
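A hedged sketch using the glmnet package (an assumption; the notes do not name a package). glmnet standardizes predictors internally by default, but if standardization is turned off, rescaling a variable (e.g., units to dollars) changes how strongly its coefficient is penalized.

```r
library(glmnet)  # assumed package for penalized regression

set.seed(42)
x <- matrix(rnorm(100 * 3), ncol = 3)   # toy predictors
y <- 2 * x[, 1] + rnorm(100)            # toy response

# alpha = 0 is ridge, alpha = 1 is lasso, in between is elastic net
fit <- glmnet(x, y, alpha = 0.5, standardize = TRUE)

coef(fit, s = 0.1)   # coefficients at a chosen penalty lambda = 0.1
```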
- Sensitivity - True Positive Rate = True Positives / (True Positives + False Negatives)
- Specificity - True Negative Rate = True Negatives / (False Positives + True Negatives)
- Precision - True Positives / (True Positives + False Positives)
- Recall - Sensitivity
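The metrics above can be computed directly from confusion-matrix counts; the counts here are made up for illustration.

```r
# Made-up confusion-matrix counts
tp <- 40   # true positives
fn <- 10   # false negatives
fp <- 5    # false positives
tn <- 45   # true negatives

sensitivity <- tp / (tp + fn)   # true positive rate (= recall)
specificity <- tn / (fp + tn)   # true negative rate
precision   <- tp / (tp + fp)

sensitivity   # 0.8
specificity   # 0.9
precision     # 40/45, about 0.889
```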
Predictions Including type = "response" returns predictions on the scale of the response (e.g., probabilities for a logistic model); otherwise, the output is on the scale of the linear predictor (the link scale).
sample.data %>% mutate(predicted_profit = predict(glm, sample.data, type = "response"))
- Linear Regression (Ordinary Least Squares Regression)
- Logistic Regression
- Regularized Regression
- Elastic Net
See casact article
The more fit the model is to the training data, the less bias there will be; however, this tends to result in more variance in the testing data.
Overfitting - “Overfitting” to the training data resulting in low bias, but high variance
Underfitting - “Underfitting” to the training data resulting in high bias, but low variance
Random Forest Hyperparameters
- Impurity measure - In most cases there is not much difference between selecting one method over the other.
- Gini - Default
- Entropy - More computationally intensive since it uses the logarithmic function
- Splitter - How a split is determined at each node.
- Best - Default - chooses the split with the best score under the impurity measure; computationally intensive
- Random - Chooses a random feature and generally ends up with longer and less precise trees, but can reduce overfitting
- Max_Depth - Controlling this is the main way to combat overfitting
- None - Default
- ntree - Number of trees in the forest
- mtry - Number of variables randomly sampled as split candidates at each node
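The ntree and mtry hyperparameters come from R's randomForest package; a hedged sketch on the built-in iris data (the specific values are illustrative, not recommendations):

```r
library(randomForest)  # assumed package, matching the ntree/mtry names above

set.seed(1)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,   # number of trees in the forest
                   mtry  = 2)     # variables sampled as candidates at each split

rf$confusion   # out-of-bag confusion matrix
```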
Summary of K-means clustering
- Pick k points at random; these are the initial cluster centers
- Assign every other point to the nearest center
- Calculate the mean of each cluster and move its center to that mean
- Repeat steps 2 and 3 until the means stop moving
- Repeat steps 1-4 several times. Each repetition is compared based on the variance of the distances between each observation and its cluster mean; the clustering with the smallest variance is the best clustering for a given k.
The higher k is, the lower the total variance will be. The ideal k tends to be where adding more clusters yields little further reduction in variance; this is shown graphically in an elbow plot.
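The steps above map directly onto base R's kmeans(): nstart repeats steps 1-4 and keeps the best run. A hedged sketch on the built-in iris measurements, ending with an elbow plot:

```r
data <- scale(iris[, 1:4])   # standardize the numeric columns before clustering

set.seed(7)
# Total within-cluster sum of squares for k = 1..8;
# nstart = 20 repeats the whole algorithm 20 times and keeps the best result
twss <- sapply(1:8, function(k) {
  kmeans(data, centers = k, nstart = 20)$tot.withinss
})

# Elbow plot: look for where adding clusters stops reducing variance much
plot(1:8, twss, type = "b",
     xlab = "k (number of clusters)",
     ylab = "Total within-cluster sum of squares")
```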
K-means vs hierarchical clustering
Start with each observation on its own and compare every observation to every other; the most similar pair becomes the first cluster. Repeat, treating that first cluster as a single combined unit in the comparisons, and keep repeating until only one cluster remains.
Dendrograms show similarities and the order in which clusters were formed.
Similarity needs to be defined:
- Commonly Euclidean Distance.
?dist - Can show other methods along with calculations
Compare also needs to be defined:
- Centroid - Average of each cluster
- “Ward” - Like centroid, but also takes into account the variance within each cluster
- Single-linkage - closest point in the cluster
- Complete-linkage - furthest point in each cluster
?hclust - Can show other methods, but knowing the above helps significantly
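A hedged sketch tying dist() and hclust() together on the built-in iris measurements, using Euclidean distance for similarity and Ward's method for comparison (the method choices are illustrative):

```r
# Similarity: Euclidean distance on standardized numeric columns
d <- dist(scale(iris[, 1:4]), method = "euclidean")

# Compare: Ward's method; alternatives include "centroid", "single", "complete"
hc <- hclust(d, method = "ward.D2")

plot(hc, labels = FALSE)        # dendrogram shows merge order and heights

clusters <- cutree(hc, k = 3)   # cut the tree into 3 clusters
table(clusters)                 # observations per cluster
```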
Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-down), and the two approaches should give similar results. K-means clustering can simulate the divisive technique and is computationally much faster.
- Shows all possible linkages between clusters
- Understand how much clusters differ based on dendrogram length
- No hyperparameters, the analyst can select the appropriate number of clusters for the business need
- Many methods to test and select which fits data best
- Poor scalability - as the number of observations increases, the dendrogram becomes hard to interpret
- Computationally intensive
Summary of PCA Method
- Two variables are graphically compared
- The average point is then calculated and plotted
- The graph is shifted so that this average point is at the origin
- PCA finds the best fitting line by maximizing the sum of the squared distances from the projected points to the origin
- This is computationally easier than finding the minimum distances between each point and the line
- The sums of squared distances are also called "eigenvalues"
- This is repeated for all variables
- Line with the largest eigenvalue becomes PC1
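The method above is what base R's prcomp() implements: center = TRUE shifts the average point to the origin, and the squared standard deviations of the components are the eigenvalues. A hedged sketch on the built-in iris measurements:

```r
# scale. = TRUE additionally puts variables in comparable units
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)   # proportion of variance explained by each PC (PC1 is largest)
pca$sdev^2     # eigenvalues: variance along each principal component
head(pca$x)    # observations projected onto the principal components
```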