This outline has not been updated since 2020 and may differ substantially from the current syllabus. I will remove this note when I update the content.

# Misc.

### Data exploration

Number of elements in Dataset - Before and after modification

Create a summary of the data

dataset %>% summary() or dataset %>% glimpse()

Discuss Target Variable - Mention what the flag means and the ratio of observations in each class

dataset %>% count(target_variable)

If the responses do not appear equally distributed then the classes would appear to be **imbalanced**
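A minimal sketch of the class-balance check, assuming a made-up data frame with a 0/1 `target_variable` (the values are hypothetical and only for illustration):

```r
library(dplyr)

# Hypothetical toy data: `target_variable` is a 0/1 flag
dataset <- data.frame(
  target_variable = c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Count each class and express it as a share of all rows
class_counts <- dataset %>%
  count(target_variable) %>%
  mutate(proportion = n / sum(n))

print(class_counts)
```

In this toy data one class holds 80% of the rows, so the classes would be described as imbalanced.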

### Issue Resolution

Some candidates may find it easier to manipulate the data in Excel and then import the adjusted file into R. *Note: Many Excel features, such as Pivot Tables, are disabled, which prevents Excel from being a good tool for Data Exploration*

Delete Values - Subset using comparison operators

dataset <- dataset[dataset$variable!=0,]
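As a sketch of recording the number of rows before and after a deletion, assuming a hypothetical data frame that also contains a missing value. Wrapping the condition in `which()` is a choice made for this example rather than part of the line above: with a plain logical index, an `NA` in `variable` produces a row of `NA`s instead of being dropped.

```r
# Hypothetical toy data with zeros and one missing value
dataset <- data.frame(
  variable = c(0, 5, NA, 12, 0),
  target_variable = c(1, 0, 1, 0, 0)
)

rows_before <- nrow(dataset)

# which() keeps only rows where the condition is TRUE, so NA rows are dropped
dataset <- dataset[which(dataset$variable != 0), ]

rows_after <- nrow(dataset)
cat("Rows before:", rows_before, "Rows after:", rows_after, "\n")
```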

Replace Character Variables with Factors - If any variables show character as their class, replace them with factors:

dataset <- dataset %>% modify_if(is.character, as.factor)

Alternatively, you can convert characters to releveled factors in one step:

dataset <- dataset %>% mutate_if(is.character, fct_infreq)
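A small sketch of the conversion on made-up data. It uses `mutate(across(where(is.character), ...))`, the newer dplyr spelling of `mutate_if`; the column names `region` and `amount` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one character column
dataset <- data.frame(
  region = c("north", "south", "south"),
  amount = c(100, 250, 75),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leave the rest untouched
dataset <- dataset %>%
  mutate(across(where(is.character), as.factor))

# Confirm the class of each column after conversion
print(sapply(dataset, class))
```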

### Pre-Modeling

**Level Reduction** Consider both **n** and **mean** of the target variable before combining groups.

dataset <- dataset %>% mutate(variable_bucket = case_when(variable < 50 ~ 0,variable <= 70 ~ 1,variable > 70 ~ 2))
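To look at both **n** and the **mean** of the target before combining groups, one option is a grouped summary. This is a sketch on hypothetical data; `target_variable` and the values are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data
dataset <- data.frame(
  variable = c(20, 45, 50, 60, 70, 71, 90, 30),
  target_variable = c(0, 1, 0, 1, 1, 1, 1, 0)
)

# Same bucketing rule as above
dataset <- dataset %>%
  mutate(variable_bucket = case_when(
    variable < 50 ~ 0,
    variable <= 70 ~ 1,
    variable > 70 ~ 2
  ))

# Number of rows and mean of the target in each bucket
bucket_summary <- dataset %>%
  group_by(variable_bucket) %>%
  summarise(n = n(), mean_target = mean(target_variable))

print(bucket_summary)
```

Buckets with few observations or with similar target means are the natural candidates for combining.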

**Set Reference Levels**

This **does not** affect predictions or any measures of model fit

This *does* affect decisions if hypothesis tests are conducted to consider removing specific factor levels. This is because the test compares each factor level to the base level

Explicitly set a reference level - If you want to be testing other levels against one in particular

dataset$variable <- relevel(dataset$variable, ref = "newBaseLevel")

Set a reference level to have the most observations - If there are no obvious choices

Displays the number of observations at each level:

dataset$column %>% as.factor() %>% summary()

Set the highest frequency level of “column” to be the base level (Tidyverse Required):

dataset <- dataset %>% mutate(column = fct_infreq(column))
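The claim that the reference level changes the coefficients being tested but not the predictions can be checked with a short sketch. The data are simulated and the names `fit_default` and `fit_releveled` are hypothetical; only base R is used.

```r
set.seed(1)

# Simulated data: a three-level factor and one numeric predictor
dataset <- data.frame(
  variable = factor(rep(c("a", "b", "c"), times = c(5, 10, 15))),
  x = rnorm(30)
)
dataset$y <- 1 + 2 * (dataset$variable == "b") +
  3 * (dataset$variable == "c") + dataset$x + rnorm(30)

# Fit with the default base level ("a")
fit_default <- lm(y ~ variable + x, data = dataset)

# Refit with "c" as the base level
dataset_releveled <- dataset
dataset_releveled$variable <- relevel(dataset_releveled$variable, ref = "c")
fit_releveled <- lm(y ~ variable + x, data = dataset_releveled)

# Fitted values agree; only the set of tested contrasts differs
print(all.equal(unname(fitted(fit_default)), unname(fitted(fit_releveled))))
print(names(coef(fit_default)))
print(names(coef(fit_releveled)))
```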

**Train/Test Datasets** Verify the mean of the target variable is similar between training and testing datasets
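A minimal sketch of this check, assuming a simple random 70/30 split with base R `sample()` on simulated data. The split proportion and the data are assumptions; a stratified split may be preferable in practice.

```r
set.seed(42)

# Simulated 0/1 target with roughly 30% positives
dataset <- data.frame(target_variable = rbinom(1000, 1, 0.3))

# Randomly pick 70% of the row numbers for training
train_idx <- sample(seq_len(nrow(dataset)), size = floor(0.7 * nrow(dataset)))
train <- dataset[train_idx, , drop = FALSE]
test <- dataset[-train_idx, , drop = FALSE]

# The two means should be close if the split is representative
cat("Train mean:", mean(train$target_variable), "\n")
cat("Test mean: ", mean(test$target_variable), "\n")
```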

#### Decision Trees

#### General Linear Models

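**Regularized Regression (Penalized Regression)** Standardize units - Larger coefficients are penalized more heavily, and the size of a coefficient depends on the units of its variable; therefore, put variables on comparable scales (for example, if a variable is measured in Units, consider changing it to Dollars).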
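Why units matter can be seen in a small sketch with simulated data: the same predictor expressed in dollars and in thousands of dollars gives coefficients that differ by a factor of 1000, so a penalty on coefficient size would treat them very differently. The variable names are hypothetical, and `scale()` is shown as one common way to standardize.

```r
set.seed(1)

# Simulated predictor in dollars and a response that depends on it
dollars <- runif(50, 1000, 5000)
y <- 0.002 * dollars + rnorm(50)

# Same information, different units
thousands <- dollars / 1000

coef_dollars <- coef(lm(y ~ dollars))[["dollars"]]
coef_thousands <- coef(lm(y ~ thousands))[["thousands"]]

# The coefficient is 1000 times larger when the variable is in thousands
print(coef_thousands / coef_dollars)

# Standardizing removes the dependence on units (mean 0, sd 1)
standardized <- as.numeric(scale(dollars))
print(c(mean = mean(standardized), sd = sd(standardized)))
```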

### Post-Modeling

Confusion Matrices

- Sensitivity - True Positive Rate = True Positives / (True Positives + False Negatives)
- Specificity - True Negative Rate = True Negatives / (False Positives + True Negatives)
- Precision - True Positives / (True Positives + False Positives)
- Recall - Sensitivity
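The formulas above can be worked through by hand on a small example. This sketch uses made-up vectors of actual and predicted classes, with 1 as the positive class.

```r
# Hypothetical actual and predicted classes (1 = positive)
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 0, 1, 0, 0)

# Cells of the confusion matrix
tp <- sum(actual == 1 & predicted == 1)
fn <- sum(actual == 1 & predicted == 0)
fp <- sum(actual == 0 & predicted == 1)
tn <- sum(actual == 0 & predicted == 0)

sensitivity <- tp / (tp + fn)  # also called recall
specificity <- tn / (fp + tn)
precision   <- tp / (tp + fp)

print(table(actual = actual, predicted = predicted))
print(c(sensitivity = sensitivity, specificity = specificity, precision = precision))
```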

Predictions - Including type = "response" returns predictions on the scale of the response variable (probabilities for a binomial model); otherwise, the output is on the scale of the linear predictor (the link function).

sample.data %>% mutate(predicted_profit = predict(glm, sample.data, type = "response"))
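The difference between the two prediction scales can be sketched with a logistic regression on simulated data. The model object is named `fit` here (a hypothetical name) to avoid clashing with the `glm()` function.

```r
set.seed(1)

# Simulated binary outcome driven by one predictor
sample.data <- data.frame(x = rnorm(100))
sample.data$y <- rbinom(100, 1, plogis(0.5 + sample.data$x))

fit <- glm(y ~ x, data = sample.data, family = binomial)

# Response scale: probabilities between 0 and 1
p_response <- predict(fit, sample.data, type = "response")

# Default is type = "link": the linear predictor (log-odds here)
p_link <- predict(fit, sample.data)

# Applying the inverse link to the linear predictor recovers the probabilities
print(all.equal(unname(p_response), unname(plogis(p_link))))
print(range(p_response))
```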

## 6. Generalized Linear Models

### a) Implement ordinary least squares regression in R and understand model assumptions.

### b) Understand the specifications of the GLM and the model assumptions.

- Linear Regression (Ordinary Least Squares Regression)
- Logistic Regression
- Regularized Regression
  - Ridge
  - Lasso
  - Elastic Net

#### Logistic Regression

#### Regularized Regression

### c) Create new features appropriate for GLMs.

### d) Interpret model coefficients, interaction terms, offsets, and weights.

See casact article

### e) Select and validate a GLM appropriately.

### f) Explain the concepts of bias, variance, model complexity, and the bias-variance trade-off.

The more closely the model is fit to the training data, the less bias there will be; however, this tends to result in more variance, which shows up as worse performance on the testing data.

**Overfitting** - The model follows the training data too closely, resulting in **low bias, but high variance**

**Underfitting** - The model is too simple to capture the patterns in the training data, resulting in **high bias, but low variance**
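One way to see the trade-off is to fit models of increasing complexity and compare training and testing error. This is a sketch on simulated data, using polynomial degree as a stand-in for model complexity; the degrees chosen and the helper `mse()` are assumptions made for illustration.

```r
set.seed(1)

# Simulated data: a smooth signal plus noise
n <- 60
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x[1:40], y = y[1:40])
test <- data.frame(x = x[41:60], y = y[41:60])

# Mean squared error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, data))^2)

# Fit polynomials of increasing degree and record both errors
degrees <- c(1, 3, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))

print(results)
```

Training error can only fall as the degree rises because the models are nested; whether testing error rises again at high degree depends on the data, which is what the comparison is meant to reveal.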

### g) Select appropriate hyperparameters for regularized regression

## 7. Decision Trees

### a) Understand the basic motivation behind decision trees.

### b) Construct regression and classification trees.

### c) Use bagging and random forests to improve accuracy.

### d) Use boosting to improve accuracy.

### e) Select appropriate hyperparameters for decision trees and related techniques.

**Random Forest Hyperparameters**

- Impurity measure - In most cases there is not much difference between selecting one method over the other.
  - Gini - Default
  - Entropy - More computationally intensive since it uses the logarithmic function
- Splitter - How a split is determined at each node.
  - Best - Default - Chooses the split based on the impurity measure and is computationally intensive
  - Random - Chooses a random feature and generally ends up with longer and less precise trees, but can reduce overfitting
- Max_Depth - Controlling this is the main way to combat overfitting
  - None - Default
- ntree - Number of trees in the forest
- mtry - Number of variables randomly sampled as candidates at each split
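A minimal sketch of setting `ntree` and `mtry`, assuming the `randomForest` package is installed and using the built-in `iris` data. The specific values 200 and 2 are arbitrary choices for illustration, not recommendations.

```r
library(randomForest)

set.seed(1)

# iris ships with R; Species is a three-class factor target
forest <- randomForest(
  Species ~ .,
  data = iris,
  ntree = 200,  # number of trees in the forest
  mtry = 2      # variables sampled as candidates at each split
)

print(forest)
```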

**GBM Hyperparameters**

## 8. Cluster and Principal Component Analyses

### a) Understand and apply K-means clustering.

**Summary of K-means clustering**

1. Pick *k* points randomly; these are the initial cluster centers
2. Assign each of the other points to the nearest cluster center
3. Calculate the mean of each cluster
4. Repeat steps 2 and 3 until the means no longer move
5. Repeat steps 1-4 several times with different random starting points
6. The results of each repetition are compared based on the *variance* in the distance between each observation and its cluster mean. The clustering with the **smallest variance** is the best clustering for a given *k*.

The higher *k* is, the lower the total variance will be. The ideal *k* tends to be the point after which adding more clusters yields much smaller reductions in variance. This is shown graphically in an elbow plot.
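An elbow plot can be sketched with base R `kmeans()`. The data here are simulated with three well-separated groups, and `nstart = 20` asks `kmeans()` to try 20 random starts and keep the best one, which corresponds to repeating the procedure from different starting points.

```r
set.seed(1)

# Simulated data: three groups of 50 two-dimensional points
points <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 to 6
ks <- 1:6
total_withinss <- sapply(ks, function(k) {
  kmeans(points, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is where the curve flattens out
plot(ks, total_withinss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

With data built like this, the curve should drop sharply up to *k* = 3 and flatten afterwards.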

**K-means vs hierarchical clustering**

### b) Understand and apply hierarchical clustering.

For each observation, find the observation most similar to it; the most similar pair becomes the first cluster. Repeat, but treat the first cluster as a combined unit when comparing. Keep repeating until there is only one cluster.

Dendrograms show similarities and the order in which clusters were formed.

Similarity needs to be defined:

- Commonly Euclidean Distance.

?dist - Can show other methods along with calculations

How clusters are compared (the linkage) also needs to be defined:

- Centroid - Distance between the averages of each cluster
- “Ward” - Like centroid, but also takes into account the variance within each cluster
- Single-linkage - Distance between the closest points of two clusters
- Complete-linkage - Distance between the furthest points of two clusters

?hclust - Can show other methods, but knowing the above helps significantly
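The two choices above (a distance and a linkage) map onto `dist()` and `hclust()` in base R. A minimal sketch on simulated data with two well-separated groups; the object names are hypothetical.

```r
set.seed(1)

# Simulated data: two groups of 10 two-dimensional points
points <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# Similarity: Euclidean distance between every pair of observations
distances <- dist(points, method = "euclidean")

# Linkage: how clusters are compared once they hold more than one point
tree_complete <- hclust(distances, method = "complete")
tree_single <- hclust(distances, method = "single")
tree_ward <- hclust(distances, method = "ward.D2")

# Cut the tree to get a chosen number of clusters
clusters <- cutree(tree_complete, k = 2)
print(table(clusters))

# Dendrogram showing the order in which clusters were merged
plot(tree_complete)
```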

Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-down) and they should give similar results. K-means clustering can simulate the divisive technique and is computationally much faster.

Pro:

- Shows all possible linkages between clusters
- Understand how much clusters differ based on dendrogram length
- No hyperparameters, the analyst can select the appropriate number of clusters for the business need
- Many methods to test and select which fits data best

Con:

- scalability, increasing observations will prevent interpretation
- computationally intensive

### c) Understand and apply principal component analysis.

**Summary of PCA Method**

- Two variables are graphically compared
- The average point is then calculated and plotted
- The graph is shifted so that this average point is at the origin
- PCA finds the best fitting line by **maximizing the sum of the squared distances from the projected points to the origin**
  - This is computationally easier than finding the minimum distances between each point and the line
  - The sum of squared distances is also called the “eigenvalue”

- This is repeated for all variables
- Line with the largest eigenvalue becomes PC1
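The steps above can be sketched with base R `prcomp()` on two simulated, correlated variables. Note one assumption about scaling: `prcomp()` reports variances, so the sum of squared projected distances is divided by *n* - 1 before it matches the squared `sdev` values.

```r
set.seed(1)

# Two simulated, correlated variables
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.3)
data_matrix <- cbind(x1, x2)

# center = TRUE shifts the average point to the origin
pca <- prcomp(data_matrix, center = TRUE, scale. = FALSE)

# Variance along each principal component
eigenvalues <- pca$sdev^2

# Sum of squared distances from the projected points to the origin, for PC1
ss_pc1 <- sum(pca$x[, 1]^2)

# Dividing by n - 1 gives the PC1 value reported by prcomp()
print(all.equal(ss_pc1 / (nrow(data_matrix) - 1), eigenvalues[1]))
print(eigenvalues)
```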

## Useful Links

http://uc-r.github.io/predictive

https://www.casact.org/pubs/monographs/papers/05-Goldburd-Khare-Tevet.pdf