Topic 1. Machine Learning vs. Classical Econometrics
Topic 2. Supervised, Unsupervised, and Reinforcement Learning Models
Topic 3. Data Preparation and Cleaning
Machine Learning (ML): Models recognize data patterns for practical applications.
Classical Econometrics: Economic and/or financial theory drives the data-generating process.
Q1. Compared to traditional statistical methodologies, machine learning provides all of the following benefits except:
A. greater flexibility.
B. there is no need to scale the data.
C. the ability to manage large volumes of data.
D. the capacity to capture non-linear relationships.
Explanation: B is correct.
Many machine learning models require the data to be scaled before use; common techniques for doing so include standardization and normalization. Relative to traditional statistical models, machine learning models provide greater flexibility, can manage large amounts of data, and can potentially capture non-linear relationships.
Supervised Learning: Used to predict the value of a variable (e.g., car value) or classify an observation (e.g., sports game outcome).
Unsupervised Learning: Involves pattern recognition in data with no specific target.
Reinforcement Learning: Incorporates a trial-and-error approach for decision-making in a changing environment.
Q2. The compliance manager at a bank uses machine learning approaches to review journal entries posted to the bank’s general ledger. In particular, she is concerned with employees using the wrong ledger accounts to record transactions. This type of machine learning is best categorized as:
A. supervised learning.
B. unsupervised learning.
C. reinforcement learning.
D. linear regression analysis.
Explanation: B is correct.
Unsupervised learning is used in situations like this one, where the compliance manager wishes to learn more about the data but is not using it for predictive purposes. While supervised learning and reinforcement learning are established methodologies, linear regression is not an applicable machine learning category.
Two primary methods to achieve scale consistency: standardization and normalization
Standardization: Rescales a variable to have zero mean and unit variance.
Preferred methodology for data covering a wide range (including outliers).
Normalization: Also called min-max transformation; creates a variable between zero and one, which will not usually have a zero mean or unit variance.
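As a quick sketch of the two scaling techniques (the feature values below are made up for illustration):

```python
import numpy as np

# Hypothetical feature values (e.g., loan amounts in $000s)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()   # zero mean, unit variance

# Normalization (min-max transformation): rescale to the [0, 1] interval
normalized = (x - x.min()) / (x.max() - x.min())
```

Note that the normalized values always fall in [0, 1], but their mean and variance depend on the original data, so they will not generally equal zero and one.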
Reasons for data cleaning:
Missing data
Outliers
Duplicate observations
Inconsistent recording
Unwanted (i.e., irrelevant) observations
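Each of these issues can be addressed in a few lines of pandas. The tiny dataset below is made up to exhibit all five problems, and the "outlier" step here is just a simple validity filter (real outlier treatment is usually more involved, e.g., z-score or winsorization rules):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data exhibiting each cleaning issue
df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT", "msft", "TEST"],
    "price":  [150.0,  150.0,  np.nan, 310.0,  -1.0],
})

df = df.drop_duplicates()                  # duplicate observations
df["ticker"] = df["ticker"].str.upper()    # inconsistent recording
df = df.dropna(subset=["price"])           # missing data
df = df[df["price"] > 0]                   # outliers / impossible values
df = df[df["ticker"] != "TEST"]            # unwanted (irrelevant) observations
```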
Topic 1. Principal Components Analysis
Topic 2. K-Means Clustering
Q3. The end goal of principal components analysis (PCA) is the use of which of the following to manage dimensionality?
A. A small number of correlated components.
B. A large number of correlated components.
C. A small number of uncorrelated components.
D. A large number of uncorrelated components.
Explanation: C is correct.
The goal of principal components analysis is dimensionality reduction, so a small number of components will be the output. The components should be uncorrelated, as correlated components do not independently add much value.
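A minimal PCA illustration with NumPy, on synthetic data in which five variables are all driven by one common factor:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 observations of 5 correlated variables
# driven by a single common factor plus small noise
factor = rng.standard_normal((200, 1))
X = factor + 0.1 * rng.standard_normal((200, 5))

Xc = X - X.mean(axis=0)                  # center each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # share of variance per component
components = Xc @ Vt.T                   # principal components (uncorrelated)
```

Here the first component captures nearly all of the variance, so the five correlated variables reduce to a single uncorrelated component with little loss of information.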
Q4. The optimal number of centroids can be found by choosing the:
A. value that produces the lowest possible inertia.
B. value that produces the highest possible inertia.
C. point where inertia declines at a faster pace as K increases.
D. point where inertia declines at a slower pace as K increases.
Explanation: D is correct.
The “elbow” is the point where inertia starts to decline at a slower pace as K increases; it represents the optimal number of centroids. Lower inertia is generally better; however, because inertia always falls as more centroids are added, continuing to increase K stops adding value beyond a certain point.
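A rough sketch of the elbow method using a bare-bones k-means in pure NumPy. The deterministic initialization (spreading starting centroids across the data sorted by the first coordinate) is a simplification for reproducibility, not standard practice:

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50):
    """Basic k-means; returns the final inertia (sum of squared
    distances from each point to its nearest centroid)."""
    # Simplified deterministic init for illustration only
    order = np.argsort(X[:, 0])
    centroids = X[order[np.linspace(0, len(X) - 1, k).astype(int)]].copy()
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

# Hypothetical data: three well-separated clusters in two dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

inertias = [kmeans_inertia(X, k) for k in range(1, 7)]
# Inertia drops steeply up to K = 3, then flattens: K = 3 is the elbow
```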
Topic 1. Underfitting and Overfitting
Topic 2. Training, Validation, and Test Data Sub-Sample
Overfitting: Occurs when a model is too complex, too large, or has too many parameters.
Underfitting: Occurs when a model is too simple and fails to capture relevant patterns.
Bias-Variance Tradeoff: The complexity of the machine learning model determines whether it is appropriately fitted, overfitted, or underfitted; more complex models tend to have lower bias but higher variance.
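The tradeoff can be seen by fitting polynomials of different complexity to the same data. The quadratic "true" relationship and noise level below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true relationship: a quadratic
    return 1 + 2 * x + 3 * x**2

x_train = rng.uniform(-1, 1, 20)
y_train = f(x_train) + 0.5 * rng.standard_normal(20)
x_test = rng.uniform(-1, 1, 200)
y_test = f(x_test) + 0.5 * rng.standard_normal(200)

def mse(degree, x_eval, y_eval):
    # Fit a least-squares polynomial on the training data,
    # then measure mean squared error on the given sample
    coefs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coefs, x_eval) - y_eval) ** 2)

# Degree 1 underfits (high bias); degree 9 overfits (high variance);
# degree 2 matches the true complexity
train_err = {d: mse(d, x_train, y_train) for d in (1, 2, 9)}
test_err = {d: mse(d, x_test, y_test) for d in (1, 2, 9)}
```

Training error always falls as the degree rises, but out-of-sample error is lowest near the true complexity — the underfitted linear model carries its bias into the test sample.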
Q5. The predictions that are generated from an underfitted model will likely have:
A. low bias and low variance.
B. low bias and high variance.
C. high bias and low variance.
D. high bias and high variance.
Explanation: C is correct.
An underfitted model excludes relevant factors and fails to capture relevant patterns. As a result, the predictions generated from such a model will have low variance but high bias.
Purpose: To test the fitted model by keeping part of the data sample out (holdout data) and see how well it predicts unseen observations.
Typical Division for ML Models:
Allocation (Typical): Two-thirds to the training set, with the remainder split between the validation and test sets.
Cross-sectional data: No natural order; allows random placement into sets.
Time-series data: Natural order; training data typically first, followed by validation, then test data.
Larger datasets have a lower risk of improper allocations.
Small Datasets: k-fold cross-validation can be utilized when data is limited.
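A sketch of these allocations in NumPy. The even split of the remainder between validation and test sets is an assumption; practice varies:

```python
import numpy as np

n = 900
rng = np.random.default_rng(0)

# Cross-sectional data: no natural order, so observations can be shuffled
idx = rng.permutation(n)
# Time-series data: keep the natural order instead, e.g., idx = np.arange(n),
# so training data comes first, then validation, then test

n_train = n * 2 // 3                  # two-thirds to the training set
n_val = (n - n_train) // 2            # remainder split between validation and test
train, val, test = np.split(idx, [n_train, n_train + n_val])

# k-fold cross-validation for small datasets: each fold serves once
# as validation data while the remaining folds train the model
k = 5
folds = np.array_split(rng.permutation(n), k)
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
```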
Q6. An analyst is choosing between two machine learning models. Which of the following datasets will the analyst most likely use to make the determination of which model to select?
A. Test set.
B. Training set.
C. Variance set.
D. Validation set.
Explanation: D is correct.
The validation data set is used to decide between alternative machine learning models. The test set determines the effectiveness of the model once it is already chosen. The training set is used to estimate model parameters. There is no such thing as a variance set in this context.
Topic 1. Reinforcement Learning
Topic 2. Natural Language Processing
Q7. An analyst applying a reinforcement learning model has assigned a probability to exploitation of 65%. As she completes more trials, she can reasonably expect that the probability will increase above:
A. 35% for exploration.
B. 65% for exploitation.
C. 50% for exploration.
D. 50% for exploitation.
Explanation: B is correct.
Reinforcement learning model algorithms will choose between the best action already identified (exploitation) and new actions (exploration). Exploitation has a probability of p, which is expected to rise with additional trials. Exploration has a probability of 1 – p and is expected to fall with additional trials. If the exploitation probability is already at 65%, the expectation is that it will rise further as more trials are conducted.
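The exploit/explore mechanism can be sketched as an epsilon-greedy bandit in which the exploration probability (1 − p) decays with each trial, so the exploitation probability p rises. The reward means, noise level, and decay rate below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = [0.2, 0.5, 0.8]        # hypothetical mean reward per action
counts = np.zeros(3)
values = np.zeros(3)                  # estimated value of each action

explore_prob = 0.35                   # 1 - p: starts at 35%, decays over trials
for trial in range(1000):
    if rng.random() < explore_prob:
        action = int(rng.integers(3))       # exploration: try a random action
    else:
        action = int(values.argmax())       # exploitation: best action so far
    reward = true_rewards[action] + 0.1 * rng.standard_normal()
    counts[action] += 1
    # incremental update of the running average reward for this action
    values[action] += (reward - values[action]) / counts[action]
    explore_prob *= 0.995             # exploitation probability p rises over time
```

By the end of the run, exploration is rare and the algorithm has settled on the highest-reward action.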
Q8. Natural language processing (NLP) is used to evaluate the MD&A (Management Discussion and Analysis) section of a company’s annual report. In removing the “stopwords,” the NLP algorithm will remove all of the following words except:
A. “or.”
B. “are.”
C. “have.”
D. “fallen.”
Explanation: D is correct.
Stopwords help sentences flow but otherwise carry little analytical value. Words like “or,” “are,” and “have” are considered stopwords. “Fallen” is not a stopword, as it has value in describing the direction of something (e.g., earnings, sales, etc.).
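A minimal illustration of stopword filtering. The stopword list here is a tiny made-up subset; NLP libraries such as NLTK or spaCy ship full lists:

```python
# Tiny illustrative stopword list; real libraries provide hundreds of entries
stopwords = {"or", "are", "have", "the", "and", "a", "of", "has"}

text = "Sales have fallen and margins are weaker"
tokens = [w for w in text.lower().split() if w not in stopwords]
# tokens -> ['sales', 'fallen', 'margins', 'weaker']
```

Note that the meaningful words — including “fallen” — survive the filter, while the connective stopwords are dropped.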