Topic 1. Categorical Variables
Topic 2. Regularization
Topic 3. Ridge Regression (L2 Regularization)
Topic 4. LASSO (L1 Regularization)
Topic 5. Logistic Regression
Topic 6. Logistic Function & Thresholding
Q1. A financing company uses household income ranges to model loan applications. The ranges are in $25,000 increments up to $200,000, and then the last range is anything above $200,000. If a family has a household income of $85,000, the dummy variable most likely assigned will be:
A. 0.
B. 1.
C. 2.
D. 3.
Explanation: D is correct.
This categorical variable has a natural order, so dummy variables are assigned (starting with 0) to each consecutive range: 0 for income from $0–$25,000, 1 for income from $25,000–$50,000, 2 for income from $50,000–$75,000, and 3 for income from $75,000–$100,000. A household income of $85,000 falls within the last of these ranges, so the dummy variable will be set to 3.
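A minimal sketch of this ordinal encoding in Python (the function name and the treatment of boundary incomes are illustrative assumptions):

```python
def income_category(income: float) -> int:
    """Map household income to an ordinal dummy variable.

    Buckets are $25,000 wide up to $200,000 (categories 0-7);
    anything above $200,000 falls in the open-ended top bucket (8).
    """
    if income > 200_000:
        return 8  # the open-ended top range
    return min(int(income // 25_000), 7)  # keep $200,000 itself in bucket 7

print(income_category(85_000))  # 3, i.e., the $75,000-$100,000 range
```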
Ridge regression chooses slope coefficients to minimize the expression RSS + λΣbⱼ². The first component of the expression is the regression objective function (i.e., the residual sum of squares [RSS]), and the second component is the shrinkage penalty for large slope coefficients.
Q2. Which of the following statements about shrinkage penalty terms in regularization is most accurate?
A. Penalty terms are not applied in model regularization.
B. The sum of the squares of the slope coefficients is the penalty term for LASSO.
C. Elastic net sums the penalty terms found in both ridge regression and LASSO.
D. The sum of the absolute values of the slope coefficients is the penalty term for ridge regression.
Explanation: C is correct.
Elastic net is a regularization technique where the loss function incorporates penalty terms from ridge regression and LASSO. The sum of the squares of the coefficients is the penalty term for ridge regression, while the sum of the absolute values of the coefficients is the penalty term for LASSO.
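A minimal sketch contrasting the three penalty terms using scikit-learn; the synthetic data and the alpha/l1_ratio settings are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty: sum of squared coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # penalty: sum of absolute coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # both penalties combined

# LASSO tends to shrink irrelevant coefficients exactly to zero,
# while ridge only shrinks them toward zero.
print("ridge:", ridge.coef_.round(3))
print("lasso:", lasso.coef_.round(3))
print("elastic net:", enet.coef_.round(3))
```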
Specify which category observation j will belong to using the following criteria: if the model's predicted probability for observation j is greater than or equal to the threshold Z, assign the observation to category 1; otherwise, assign it to category 0.
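A minimal sketch of the logistic function and this thresholding rule (the model output of -1.2 and the threshold values are illustrative assumptions):

```python
import math

def logistic(x: float) -> float:
    """Logistic (sigmoid) function: maps any real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(probability: float, z: float = 0.5) -> int:
    """Assign category 1 if the predicted probability meets the threshold Z."""
    return 1 if probability >= z else 0

p = logistic(-1.2)          # suppose the fitted model scores observation j at -1.2
print(p)                    # ~0.23
print(classify(p, z=0.5))   # 0 with a symmetric threshold
print(classify(p, z=0.10))  # 1 with a low threshold, as in Q3
```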
Q3. For a given logistic regression model, default is assigned a value of 1 and no default is assigned a value of 0. Because the cost of default is very high on loans the bank expects its consumers to pay back, a reasonable threshold level for Z should be:
A. 0.00.
B. 0.10.
C. 0.50.
D. 0.90.
Explanation: B is correct.
The threshold Z should be low to account for the asymmetry in risk: an unexpected default is far more costly to the bank than an unexpected repayment. Setting Z equal to zero is unrealistic (every applicant would be classified as a default), and a value of 0.50 implies risk symmetry, which is not the case here. A Z of 0.10 is therefore the most appropriate choice in this situation.
Topic 1. Decision Trees
Topic 2. Ensemble Learning
Topic 3. K-Nearest Neighbors (KNN)
Topic 4. Support Vector Machines (SVMs)
Every node of the tree poses a question about an observation's features, and each answer connects the node by a branch to another node (or to a terminal leaf).
Decision trees are useful for classification problems and continuous variable estimations and, as such, are sometimes referred to as classification and regression trees (CARTs).
Because CARTs are easy to interpret, they are sometimes referred to as “white-box models” as opposed to neural networks, which are considered “black-box models.”
The figure below shows an example of a decision tree that a bank may use to evaluate the default probability of a potential borrower.
Example: Predict whether a firm will meet its interest obligations during the next fiscal year.
Suppose the sample contains 16 firms, of which 10 met their interest obligations and 6 did not. The base Gini coefficient for this output is computed as follows:
Gini(base) = 1 − (10/16)² − (6/16)² = 0.46875
The sample is then split on whether sales decreased. 7 firms had sales decreases, and 5 of those firms met interest obligations and 2 did not:
Gini(decrease) = 1 − (5/7)² − (2/7)² = 0.40816
The remaining 9 firms had sales increases; 5 of those firms met interest obligations and 4 did not:
Gini(increase) = 1 − (5/9)² − (4/9)² = 0.49383
The weighted Gini measure is (7/16)(0.40816) + (9/16)(0.49383) = 0.45635.
Therefore, the information gain is equal to the base Gini measure of 0.46875 minus the weighted Gini measure of 0.45635. This results in an information gain of 0.0124.
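The arithmetic above can be verified with a short Python sketch (the class counts are taken from the worked example; the function name is illustrative):

```python
def gini(counts: list[int]) -> float:
    """Gini impurity for a node given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

base = gini([10, 6])      # all 16 firms: 10 met obligations, 6 did not
decrease = gini([5, 2])   # 7 firms with sales decreases
increase = gini([5, 4])   # 9 firms with sales increases
weighted = (7 / 16) * decrease + (9 / 16) * increase

print(round(base, 5))             # 0.46875
print(round(weighted, 5))         # 0.45635
print(round(base - weighted, 4))  # 0.0124
```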
The decision tree is complete when all features have been used or when a leaf is reached that is a pure set.
A key risk of decision trees is overfitting, which can be mitigated by setting stopping rules such that there is a maximum number of branches.
Q4. The Gini coefficient for a model is 0.375. If the weighted Gini of one of the features is 0.329, the information gain will be closest to:
A. 0.046.
B. 0.352.
C. 0.704.
D. 0.750.
Explanation: A is correct.
The information gain is the difference between the base Gini and the weighted Gini. For this model, the base Gini is 0.375 and the weighted Gini is 0.329. The difference is equal to 0.375 – 0.329 = 0.046.
Random forests apply bagging techniques but improve upon them by reducing the correlations between decision trees.
The number of features considered for each tree is often set equal to the square root of the total number of predictors available.
Ensembles perform best when the outputs of the individual models have low correlations with one another.
Boosting techniques include gradient boosting and adaptive boosting (AdaBoost).
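A minimal sketch of the random forest ideas above (bagged trees plus the square-root feature rule) using scikit-learn; the synthetic dataset and settings are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of bagged trees whose votes are averaged
    max_features="sqrt",  # consider sqrt(16) = 4 features per split
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```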
Q5. In relation to ensembles of learners, which of the following statements best describes the “wisdom of crowds”?
A. Masses of people are never wrong.
B. It is only evident using decision trees.
C. It protects against over or underfitting.
D. It offers the benefits of averaging many predictions.
Explanation: D is correct.
The wisdom of crowds is evident through ensembles of learners, where predictions from many different models are averaged to derive a best estimate. While individual models on their own are vulnerable to error, averaging the predictions of multiple models produces a more reliable estimate.
SVMs create the widest path using two parallel lines to separate the different observation classes.
Support vectors are the data points lying on the edges of the path, while the separation boundary represents the center of the path.
Although a model with two features is the most basic, the optimization framework and underlying principles are the same regardless of how many features are modeled.
The output is a hyperplane with the dimension count equal to the number of features minus one.
When the observations cannot be perfectly separated, there will be a tradeoff between the path width and the extent of misclassifications permitted within the path.
In a simplified example where assessing borrower default incorporates only two features (e.g., credit score and income), the goal of SVMs is to use these features to develop a line that graphically separates the two groups (e.g., default versus no default).
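A minimal sketch of such a two-feature SVM using scikit-learn (the synthetic credit-score/income data and the C setting are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Feature columns: credit score and income (in standardized units).
X_good = rng.normal(loc=[1.0, 1.0], scale=0.4, size=(50, 2))   # no default
X_bad = rng.normal(loc=[-1.0, -1.0], scale=0.4, size=(50, 2))  # default
X = np.vstack([X_good, X_bad])
y = np.array([0] * 50 + [1] * 50)

# A linear kernel finds the widest path (maximum margin) between classes;
# C controls the tradeoff between path width and misclassifications.
model = SVC(kernel="linear", C=1.0).fit(X, y)

print(model.support_vectors_)         # points lying on the edges of the path
print(model.coef_, model.intercept_)  # define the separation boundary
```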
Q6. In a two-feature support vector machine, the separation boundary is best described as:
A. the center of the path.
B. the lower bound of the path.
C. the upper bound of the path.
D. all data points lying on both edges of the path.
Explanation: A is correct.
In a two-feature support vector machine, the model creates the widest path using two parallel lines to separate the observations. Support vectors are the data points lying on the edges of the path, while the separation boundary represents the center of the path.
Topic 1. Neural Networks
Topic 2. Model Predictive Performance
Topic 3. ROC Curve
Topic 4. Logistic Regression vs Neural Networks
The feedforward network with backpropagation is the most common type of ANN.
Backpropagation describes how biases and weights are constantly updated through model iterations.
Weights (w) are applied to the inputs to determine the value of hidden layer nodes.
A constant bias (i.e., activation parameter) is then added, which represents how easy it is to get a node to “fire” (i.e., generate an output of 1).
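A minimal sketch of how a single hidden-layer node computes its value from weighted inputs plus a bias (the weights, bias, and inputs are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.array([0.5, 0.8])    # feature values for one observation
weights = np.array([0.4, -0.2])  # w: one weight per input
bias = 0.1                       # constant bias / activation parameter

node_value = sigmoid(inputs @ weights + bias)
print(node_value)  # ~0.535; a larger bias makes the node easier to "fire"
```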
| Accuracy | Precision | Recall (Sensitivity) | Error Rate |
|---|---|---|---|
| (TP + TN) / (TP + TN + FP + FN) | TP / (TP + FP) | TP / (TP + FN) | 1 − Accuracy |

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
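A minimal sketch computing these four metrics from confusion-matrix counts (illustrated with the counts that appear in Q9 below):

```python
def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the four performance metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "error_rate": 1 - accuracy,
    }

print(confusion_metrics(tp=35, fn=28, fp=12, tn=25))
# accuracy 0.60, precision ~0.74, recall ~0.56, error_rate 0.40
```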
Q7. Which of the following statements best describes when to stop the gradient descent algorithm in a neural network model?
A. When the value for the objective function increases for both the validation set and the training set.
B. When the value for the objective function decreases for both the validation set and the training set.
C. When the value for the objective function declines for the training set, even as it improves for the validation set.
D. When the value for the objective function declines for the validation set, even as it improves for the training set.
Explanation: D is correct.
To prevent overfitting, the objective function can be calculated for the validation set and the training set at the same time as training proceeds. Moving down the valley improves the objective function for both datasets at first, but the point to stop the gradient descent algorithm is where the objective function value declines for the validation set even as it continues to improve for the training set.
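A minimal sketch of this early-stopping rule using a one-parameter model fit by gradient descent; the synthetic data are illustrative, and this convex toy only demonstrates the stopping mechanics rather than genuine overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train, x_val = rng.normal(size=50), rng.normal(size=50)
y_train = 2.0 * x_train + rng.normal(scale=0.5, size=50)
y_val = 2.0 * x_val + rng.normal(scale=0.5, size=50)

w, lr = 0.0, 0.01
best_val = float("inf")
for epoch in range(500):
    grad = -2.0 * np.mean((y_train - w * x_train) * x_train)  # dMSE/dw
    w -= lr * grad                                            # one gradient step
    val_loss = np.mean((y_val - w * x_val) ** 2)
    if val_loss >= best_val:  # validation objective stopped improving:
        break                 # stop, even if training loss keeps falling
    best_val = val_loss

print(epoch, round(w, 3))
```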
One model may predict more defaults for one sample, while the other model may predict more defaults for the other sample. The sample with the highest true positive and true negative rates would therefore differ between the models, which is why the ROC curve, plotting the true positive rate against the false positive rate across all threshold values, is used to compare overall model performance.
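A minimal sketch comparing two models by the area under the ROC curve using scikit-learn (the labels and predicted probabilities are illustrative assumptions):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
probs_a = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3, 0.5, 0.2]
probs_b = [0.8, 0.3, 0.6, 0.9, 0.5, 0.2, 0.6, 0.4, 0.7, 0.1]

# AUC summarizes the ROC curve in one number: a larger area means the
# model ranks true positives above false positives more consistently.
print("model A AUC:", roc_auc_score(y_true, probs_a))
print("model B AUC:", roc_auc_score(y_true, probs_b))

# roc_curve returns the curve's points across every threshold Z.
fpr, tpr, thresholds = roc_curve(y_true, probs_a)
```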
Q8. In a confusion matrix established for a logistic regression model, which two performance metrics must sum to 100%?
A. Precision and recall.
B. Recall and error rate.
C. Accuracy and error rate.
D. Accuracy and precision.
Explanation: C is correct.
The numerator of the accuracy metric captures true positives and true negatives. The numerator of the error rate captures false positives and false negatives. Both metrics have all four outcomes in the denominator, which implies that when the metrics are added together, all four outcomes are in both the numerator and denominator. The sum must therefore be 100%.
Q9. A confusion matrix on a logistic regression model shows 35 true positives, 28 false negatives, 12 false positives, and 25 true negatives. The precision metric will show a percentage output of:
A. 40%.
B. 56%.
C. 60%.
D. 74%.
Explanation: D is correct.
Precision is equal to true positives (35) divided by the sum of true positives and false positives (35 + 12, or 47). 35/47 = 74%. There are 100 total outcomes (35 + 28 + 12 + 25). The error rate is the “false” outcomes (28 + 12) divided by total outcomes or 40%. Recall is the true positives (35) divided by the sum of the true positives and the false negatives (35 + 28, or 63). 35/63 = 56%. Accuracy is the “true” outcomes (35 + 25) divided by total outcomes or 60%.