This outline has not been updated since 2020 and may differ substantially from the current syllabus. I will remove this note when I update the content.

Misc.

Data exploration

Number of elements in Dataset - Before and after modification

Create a summary of the data

dataset %>% summary() or dataset %>% glimpse()

Discuss Target Variable

Mention what the flag means and the ratio of observations in each class

dataset %>% count(target_variable)

If the responses do not appear equally distributed then the classes would appear to be imbalanced
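A minimal sketch of the balance check described above, using base R so it runs without extra packages. The data frame and its `target_variable` values are made up for illustration.

```r
# Hypothetical data: target_variable is a 0/1 flag
dataset <- data.frame(target_variable = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0))

# Number of observations in each class
class_counts <- table(dataset$target_variable)

# Share of observations in each class; shares far from equal suggest imbalance
class_share <- prop.table(class_counts)

print(class_counts)
print(round(class_share, 2))
```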

Issue Resolution

Some candidates may find it easier to manipulate the data in Excel and then import the adjusted file into R. Note: Many Excel features, such as Pivot Tables, are disabled, which prevents Excel from being a good tool for data exploration.

Delete Values

Subset using a logical condition

dataset <- dataset[dataset$variable!=0,]

Replace Character Variables with Factors

If any variables show character as their class, replace them with factors:

dataset <- dataset %>% modify_if(is.character, as.factor)

Alternatively, you can convert characters to releveled factors in one step:

dataset <- dataset %>% mutate_if(is.character, fct_infreq)

Pre-Modeling

Level Reduction

Consider both n and mean of the target variable before combining groups.

dataset <- dataset %>% mutate(variable_bucket = case_when(variable < 50 ~ 0,variable <= 70 ~ 1,variable > 70 ~ 2))
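One way to look at both n and the mean of the target variable per level before combining groups is sketched below in base R. The `region` and `target` columns are hypothetical.

```r
# Hypothetical data: a factor-like variable and a 0/1 target
dataset <- data.frame(
  region = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"),
  target = c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)
)

# Count and target mean for each level
n_by_level    <- tapply(dataset$target, dataset$region, length)
mean_by_level <- tapply(dataset$target, dataset$region, mean)

level_summary <- data.frame(
  level       = names(n_by_level),
  n           = as.vector(n_by_level),
  target_mean = as.vector(mean_by_level)
)

# Levels with few observations and similar target means are candidates to combine
print(level_summary)
```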

Set Reference Levels

This does not affect predictions or any measure of model fit

This does affect decisions if hypothesis tests are used to consider removing specific factor levels, because each test compares a factor level to the base level

Explicitly set a reference level - If you want to test the other levels against one in particular

dataset$variable <- relevel(dataset$variable, ref = "newBaseLevel")

Set a reference level to have the most observations - If there are no obvious choices

Display the number of observations at each level:

dataset$column %>% as.factor() %>% summary()

Set the most frequent level of “column” to be the base level (Tidyverse required):

dataset <- dataset %>% mutate(column = fct_infreq(column))
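The claim that the reference level changes which comparisons the coefficients make, but not the predictions, can be checked with a small simulated example. This is only a sketch: the data, the column names, and the use of `lm` are assumptions made for illustration.

```r
set.seed(1)

# Simulated data with a three-level factor and one numeric predictor
dataset <- data.frame(
  group = factor(sample(c("a", "b", "c"), 60, replace = TRUE)),
  x     = rnorm(60)
)
dataset$y <- 1 + 0.5 * dataset$x + (dataset$group == "c") + rnorm(60, sd = 0.3)

# Same model fit twice, differing only in the base level of the factor
fit_a <- lm(y ~ x + group, data = dataset)            # base level "a"
dataset$group_c <- relevel(dataset$group, ref = "c")
fit_c <- lm(y ~ x + group_c, data = dataset)          # base level "c"

# Fitted values agree; only the coefficients (and their tests) differ
same_predictions <- isTRUE(all.equal(unname(fitted(fit_a)), unname(fitted(fit_c))))
coef_names_a <- names(coef(fit_a))
coef_names_c <- names(coef(fit_c))

print(same_predictions)
print(coef_names_a)
print(coef_names_c)
```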

Train/Test Datasets

Verify the mean of the target variable is similar between training and testing datasets
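A minimal sketch of that check with a simple random 70/30 split in base R. The simulated `target` column and the split proportion are assumptions; a stratified split from a modeling package could be used instead.

```r
set.seed(42)

# Simulated 0/1 target with roughly 30% positives
dataset <- data.frame(target = rbinom(1000, size = 1, prob = 0.3))

# Randomly assign 70% of the rows to training and the rest to testing
train_rows <- sample(seq_len(nrow(dataset)), size = 0.7 * nrow(dataset))
train <- dataset[train_rows, , drop = FALSE]
test  <- dataset[-train_rows, , drop = FALSE]

# The two means should be close if the split is representative
mean_train <- mean(train$target)
mean_test  <- mean(test$target)

print(c(train = mean_train, test = mean_test))
```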

Decision Trees

Generalized Linear Models

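Regularized Regression (Penalized Regression)

Standardize units - The penalty grows with the size of the coefficients, and coefficient size depends on the units of each variable; therefore, put variables on comparable scales before fitting (for example, if a variable is measured in Units, consider changing it to Dollars).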
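To see why units matter for a penalty on coefficient size, the sketch below fits the same simulated relationship with the predictor expressed in two different units. The variable names and numbers are made up, and an unpenalized `lm` is used only to show how the coefficient itself rescales.

```r
set.seed(7)

# Simulated predictor measured in single units, and an outcome that depends on it
units   <- runif(50, 1000, 5000)
outcome <- 10 + 0.002 * units + rnorm(50)

# Same predictor expressed in thousands of units
thousands <- units / 1000

coef_units     <- coef(lm(outcome ~ units))[["units"]]
coef_thousands <- coef(lm(outcome ~ thousands))[["thousands"]]

# The relationship is identical, but the coefficient is 1000 times larger,
# so a penalty on coefficient size would treat the two versions differently
print(c(units = coef_units, thousands = coef_thousands))
```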

Post-Modeling

Confusion Matrices

Sensitivity - True Positive Rate = True Positives / (True Positives + False Negatives)
Specificity - True Negative Rate = True Negatives / (False Positives + True Negatives)
Precision - True Positives / (True Positives + False Positives)
Recall - Sensitivity
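The formulas above can be worked through on a small example. The `actual` and `predicted` vectors are invented for illustration, with 1 as the positive class.

```r
# Hypothetical actual and predicted classes (1 = positive, 0 = negative)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Cells of the confusion matrix
tp <- sum(predicted == 1 & actual == 1)  # true positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
fp <- sum(predicted == 1 & actual == 0)  # false positives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (fp + tn)  # true negative rate
precision   <- tp / (tp + fp)
recall      <- sensitivity     # recall is another name for sensitivity

print(table(predicted = predicted, actual = actual))
print(c(sensitivity = sensitivity, specificity = specificity, precision = precision))
```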

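Predictions

Including type = "response" returns predictions on the scale of the response (probabilities for a binomial model); otherwise, predict() on a glm returns values on the scale of the link function (for example, log-odds for a logit link).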

# glm_model is the fitted glm object
sample.data %>% mutate(predicted_profit = predict(glm_model, newdata = sample.data, type = "response"))
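The difference between the two prediction scales can be illustrated with a logistic regression on the built-in `mtcars` data. The choice of data set and model formula is arbitrary and only for demonstration.

```r
# Logistic regression: transmission type (am) as a function of weight (wt)
model <- glm(am ~ wt, data = mtcars, family = binomial)

# Default predictions are on the link scale (log-odds for a logit link)
link_scale <- predict(model, newdata = mtcars)

# type = "response" gives predictions on the response scale (probabilities)
response_scale <- predict(model, newdata = mtcars, type = "response")

# Applying the inverse logit to the link-scale values recovers the probabilities
matches_inverse_link <- isTRUE(all.equal(unname(plogis(link_scale)), unname(response_scale)))

print(head(cbind(link_scale, response_scale)))
print(matches_inverse_link)
```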

6. Generalized Linear Models

a) Implement ordinary least squares regression in R and understand model assumptions.

b) Understand the specifications of the GLM and the model assumptions.

Linear Regression (Ordinary Least Squares Regression)
Logistic Regression
Regularized Regression
- Ridge
- Lasso
- Elastic Net

Logistic Regression

Regularized Regression

c) Create new features appropriate for GLMs.

d) Interpret model coefficients, interaction terms, offsets, and weights.

See the casact.org article listed under Useful Links

e) Select and validate a GLM appropriately.

f) Explain the concepts of bias, variance, model complexity, and the bias-variance trade-off.

The more closely the model is fit to the training data, the lower the bias will be; however, this tends to increase variance, which shows up as worse performance on the testing data.

Overfitting - The model is too complex and fits noise in the training data, resulting in low bias but high variance

Underfitting - The model is too simple to capture the pattern in the training data, resulting in high bias but low variance
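A rough illustration of this trade-off is to fit polynomials of increasing degree to simulated data and compare training and testing error. Everything here (the sine-shaped signal, noise level, split, and degrees) is an assumption chosen for the sketch. Training error can only go down as the degree increases, while testing error typically stops improving and may get worse.

```r
set.seed(123)

# Simulated data: a smooth signal plus noise
n <- 60
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)

train <- data.frame(x = x[1:40],  y = y[1:40])
test  <- data.frame(x = x[41:60], y = y[41:60])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit polynomials of increasing complexity and record both errors
degrees <- 1:8
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree     = d,
    train_rmse = rmse(train$y, predict(fit, newdata = train)),
    test_rmse  = rmse(test$y,  predict(fit, newdata = test)))
}))

print(round(results, 3))
```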

g) Select appropriate hyperparameters for regularized regression.

7. Decision Trees

a) Understand the basic motivation behind decision trees.

b) Construct regression and classification trees.

c) Use bagging and random forests to improve accuracy.

d) Use boosting to improve accuracy.

Random Forest Hyperparameters

Impurity measure - In most cases there is not much difference between selecting one method over the other.
- Gini - Default
- Entropy - More computationally intensive since it uses the logarithmic function
Splitter - How a split is determined at each node.
- Best - Default - chooses the one based on the impurity measures and is computationally intensive
- Random - Chooses a random feature and generally ends up with longer and less precise trees, but can reduce overfitting
Max_Depth - Controlling this is the main way to combat overfitting
- None - Default
ntree - Number of trees in the forest
mtry - Number of variables randomly sampled as candidates at each split

GBM Hyperparameters

8. Cluster and Principal Component Analyses

a) Understand and apply K-means clustering.

Summary of K-means clustering

1. Pick k points at random; these are the initial cluster centers
2. Assign each remaining point to the nearest cluster center
3. Calculate the mean of each cluster
4. Repeat steps 2 and 3 until the means no longer move
5. Repeat steps 1-4 with different random starting points
The results of each repetition are compared based on the total within-cluster variance (the squared distances between each observation and its cluster mean). The clustering with the smallest variance is the best clustering for a given k.

The higher k is, the lower the total variance will be. The ideal k tends to be the point after which adding more clusters gives much smaller reductions in variance. This is shown graphically in an elbow plot.
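The procedure and the elbow plot can be sketched with base R's `kmeans`, where `nstart` controls how many random restarts are compared and `tot.withinss` is the total within-cluster sum of squares. The simulated three-group data and the range of k are assumptions for illustration.

```r
set.seed(10)

# Simulated data: three well-separated groups in two dimensions
points <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)

# For each k, run k-means from 20 random starts and keep the best result
k_values <- 1:6
total_within_ss <- sapply(k_values, function(k) {
  kmeans(points, centers = k, nstart = 20)$tot.withinss
})

print(round(total_within_ss, 1))

# Elbow plot: look for the k after which the curve flattens
if (interactive()) {
  plot(k_values, total_within_ss, type = "b",
       xlab = "Number of clusters (k)",
       ylab = "Total within-cluster sum of squares")
}
```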

K-means vs hierarchical clustering

b) Understand and apply hierarchical clustering.

Start with each observation as its own cluster. Find the two most similar observations and merge them; they become the first cluster. Repeat, treating each merged cluster as a single unit when comparing. Keep repeating until there is only one cluster.

Dendrograms show similarities and the order in which clusters were formed.

Similarity needs to be defined:

Commonly Euclidean Distance.

?dist - Can show other methods along with calculations

How clusters are compared (the linkage) also needs to be defined:

Centroid - Distance between the averages of each cluster
- “Ward” - Like centroid, but also takes into account the variance within each cluster
Single-linkage - Distance between the closest pair of points in the two clusters
Complete-linkage - Distance between the furthest pair of points in the two clusters

?hclust - Can show other methods, but knowing the above helps significantly
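A minimal sketch of these choices using `dist` and `hclust` from base R. The simulated two-group data is an assumption; `"complete"`, `"single"`, and `"ward.D2"` are method names accepted by `hclust`.

```r
set.seed(5)

# Simulated data: two well-separated groups of 10 observations each
observations <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# Similarity: Euclidean distance between every pair of observations
distances <- dist(observations, method = "euclidean")

# Linkage: how clusters are compared when deciding what to merge next
tree_complete <- hclust(distances, method = "complete")
tree_single   <- hclust(distances, method = "single")
tree_ward     <- hclust(distances, method = "ward.D2")

# Cut the tree to obtain a chosen number of clusters
clusters <- cutree(tree_complete, k = 2)
print(table(clusters))

# The dendrogram shows the order and height of each merge
if (interactive()) plot(tree_complete)
```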

Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-down), and the two approaches often give similar results. K-means clustering can simulate the divisive technique and is computationally much faster.

Pro:

Shows all possible linkages between clusters
Understand how much clusters differ based on dendrogram length
No hyperparameters, the analyst can select the appropriate number of clusters for the business need
Many methods to test and select which fits data best

Con:

Scalability: with many observations the dendrogram becomes hard to interpret
Computationally intensive

c) Understand and apply principal component analysis.

Summary of PCA Method

Two variables are graphically compared
The average point is then calculated and plotted
The graph is shifted so that this average point is at the origin
PCA finds the best-fitting line through the origin by maximizing the sum of the squared distances from the projected points to the origin
- This is equivalent to, and computationally easier than, minimizing the distances between each point and the line
- The sum of squared distances for a fitted line is also called its “eigenvalue” (software such as prcomp reports it divided by n - 1)
This is repeated for all variables
The line with the largest eigenvalue becomes PC1
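The link between projected distances and eigenvalues can be checked with `prcomp` on the built-in `mtcars` data. The three columns are an arbitrary choice for this sketch, and the data is centered and scaled first.

```r
# Center and scale three numeric columns so the average point is at the origin
data_matrix <- scale(mtcars[, c("mpg", "wt", "hp")], center = TRUE, scale = TRUE)

pca <- prcomp(data_matrix)

# prcomp stores standard deviations; squaring them gives the eigenvalues
eigenvalues <- pca$sdev^2

# Cross-check against the eigenvalues of the covariance matrix
eigen_check <- eigen(cov(data_matrix))$values

# Project the points onto PC1 and sum their squared distances to the origin
scores_pc1 <- data_matrix %*% pca$rotation[, 1]
ss_pc1     <- sum(scores_pc1^2)

# Dividing by n - 1 gives the eigenvalue reported for PC1
eigenvalue_from_ss <- ss_pc1 / (nrow(data_matrix) - 1)

print(round(eigenvalues, 4))
print(round(eigenvalue_from_ss, 4))
```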

Useful Links

http://uc-r.github.io/predictive

https://www.casact.org/pubs/monographs/papers/05-Goldburd-Khare-Tevet.pdf

SOA Predictive Analytics Exam Notes