Abdullah Fathi
Set of techniques/processes used to discover relationships, recognize patterns, predict trends, and find associations in your data
What's the difference?
Data Science is an umbrella term for a more comprehensive set of fields that are focused on mining big data sets and discovering innovative new insights, trends, methods, and processes.
Data Analytics is a discipline based on gaining actionable insights to assist in a business's professional growth in an immediate sense.
It is part of a wider mission and could be considered a branch of data science.
Data Science: Sources broader insights centered on the questions that need asking and subsequently answering
Data Analytics: Process dedicated to providing solutions to problems, issues, or roadblocks that are already present
Both Data Science and Data Analytics can be used to enhance your business’s efficiency, vision, and intelligence
Use Data Science to uncover new insight and Data Analytics for current insight to ensure the sustainable progress of your business
Data Analyst | Data Engineer | Data Scientist |
---|---|---|
A Data Analyst analyses numeric data and uses it to help agencies, organisations, and companies make better decisions | A Data Engineer is involved in preparing data. They develop, construct, test, and maintain the complete architecture | A Data Scientist analyses and interprets complex data. They are data wranglers who organize big data |
Data Analyst | Data Engineer | Data Scientist |
---|---|---|
Data Warehousing | Data Warehousing & ETL | Statistical & Analytical Skills |
Adobe & Google Analytics | Advanced Programming Knowledge | Data Mining |
Programming Knowledge | Hadoop-based Analytics | ML & Deep Learning Principles |
Scripting & Statistical Skills | In-depth knowledge of SQL/database | In-depth programming knowledge (SAS/R/Python coding) |
Reporting & Data visualization | Data architecture & pipelining | Hadoop-based analytics |
SQL/database knowledge | ML concept knowledge | Data Optimization |
Spreadsheet knowledge | Scripting, reporting & data visualization | Decision making & soft skills |
A Data Analyst's primary skill set revolves around data acquisition, handling, and processing
A Data Engineer needs an intermediate-level understanding of programming to build thorough algorithms, along with a mastery of statistics and math
A Data Scientist needs to master both Data Analyst and Data Engineer skills: data, stats, and math, along with in-depth programming knowledge for ML and Deep Learning
Roles and responsibilities for data analyst, data engineer and data scientist are quite similar.
Data Analyst | Data Engineer | Data Scientist |
---|---|---|
Pre-processing and data gathering | Develop, test & maintain architectures | Responsible for developing Operational Models |
Emphasis on representing data via reporting and visualization | Understand programming and its complexity | Carry out data analytics and optimization using ML & Deep Learning |
Responsible for statistical analysis & data interpretation | Deploy ML & statistical models | Involved in strategic planning for data analytics |
Ensures data acquisition & maintenance | Building pipelines for various ETL operations | Integrate data & perform ad-hoc analysis |
Optimize Statistical Efficiency & Quality | Ensures data accuracy and flexibility | Bridge the gap between the stakeholders and customer |
The data scientist and data engineer roles are quite similar, but the data scientist is the one who has the upper hand in all data-related activities
Explains What Happened
Descriptive analytics juggles raw data from multiple data sources to give valuable insights into the past
Describing, summarizing, and identifying patterns through calculations of existing data, like mean, median, mode, percentage, frequency, and range
The baseline from which other data analysis begins.
It is only concerned with statistical analysis and absolute numbers; it can’t provide the reason or motivation for why and how those numbers developed
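The summary measures listed above can be sketched with Python's standard library. The purchase amounts below are hypothetical values chosen only for illustration; they are not the data behind the slides.

```python
import statistics

# Hypothetical purchase amounts in RM (illustrative only, not the source data)
amounts = [12, 25, 30, 30, 45, 60, 500]

summary = {
    "minimum": min(amounts),
    "maximum": max(amounts),
    "range": max(amounts) - min(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "mode": statistics.mode(amounts),
    "std_dev": statistics.stdev(amounts),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Note how a single large purchase pulls the mean well above the median, which is one reason to report several measures rather than the mean alone.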
Variable | Minimum | Maximum | Mean | Standard Deviation |
---|---|---|---|---|
Total Amount (RM) | 12 | 500 | 51 | 56 |
As we can see, each customer has spent an average of RM51, but some people have spent up to RM500
As we can see, most of the customers are between 30 and 40 years old
Explains Why Something Happened
Like descriptive analytics, diagnostic analytics also focuses on the past
Look for cause and effect to illustrate why something happened
The objective is to compare past occurrences to determine causes
Provided in the context of probability, likelihood, or a distributed outcome.
Scatter charts might help to discover dependencies between the output variables and the input variables
The chart above shows that as the value of Feature 1 increases, the value of Target decreases
The strongest correlation (-0.287) is between recency and conversion. Although weak in absolute terms, it is the largest among the variables examined, which indicates that we should study this variable more thoroughly
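A correlation like the one quoted above can be computed as a Pearson coefficient. This is a minimal sketch: the `recency` and `converted` values are hypothetical, and the slides do not say which correlation measure was actually used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: days since last purchase and a 0/1 conversion flag
recency = [2, 5, 9, 14, 20, 27, 33, 40]
converted = [1, 1, 0, 1, 0, 0, 1, 0]

r = pearson_r(recency, converted)
print(f"correlation(recency, conversion) = {r:.3f}")
```

A negative value means conversion tends to fall as recency grows; values near 0 indicate a weak linear relationship.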
What is Likely To Happen
Uses the findings of descriptive and diagnostic analytics to detect clusters and exceptions and to predict future trends
Based on Machine Learning or Deep Learning
The more data points you have, the more accurate the prediction is likely to be
No analytics will be able to tell you exactly what WILL happen in the future.
Predictive analytics puts in perspective what MIGHT happen, providing the respective probabilities given the variables that are being looked at
K-nearest neighbors is a very simple method used for classification and approximation
The outputs from the neural network depend on the inputs fed to it and the different parameters within the neural network
The graph above shows a neural network with four inputs (feature 1, 2, 3, and 4). When we introduce the values of the four features in the neural network, we get an output
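To make the dependence on inputs and parameters concrete, here is a minimal sketch of a forward pass through a network with four inputs. The layer sizes, weights, biases, and the sigmoid activation are all assumptions for illustration; the slides do not specify the network's architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: four inputs -> one hidden layer -> one output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical parameters: two hidden neurons, each with four input weights
hidden_weights = [[0.2, -0.4, 0.1, 0.5], [-0.3, 0.8, -0.6, 0.1]]
hidden_biases = [0.0, 0.1]
output_weights = [1.2, -0.7]
output_bias = 0.05

# Values of feature 1, 2, 3, and 4 (hypothetical)
output = forward([1.0, 0.5, -1.5, 2.0],
                 hidden_weights, hidden_biases, output_weights, output_bias)
print(output)
```

Changing either the feature values or any weight or bias changes the output, which is the point the slide makes.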
What Action Should be Taken
Combines all of your data and analytics, then outputs a model prescription: What action to take
Analyze multiple scenarios, predict the outcome of each, and decide which is the best course of action based on the findings
Prescribe what action to take to eliminate a future problem or take full advantage of a promising trend
Uses Machine Learning, Algorithms, Artificial Intelligence (AI)
In this phase, we not only predict what will happen in the future using our predictive model but also show the decision-maker the implications of each option
A good data analyst will spend around 60-80% of their time cleaning the data. Focusing on the wrong data points will severely impact your analytical result
A wrong data class prevents calculations from being performed
Missing data prevents functions from working properly
Outliers corrupt the output and produce bias
The size of the data requires too much computation
Deleting and exchanging methods have their drawbacks with small samples
Assumption: The data point is faulty
An outlier totally out of scale might be a wrong measurement
Outliers can be valid measurements and they might reveal hidden potentials
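Because an outlier may be either a faulty measurement or a valid one, a reasonable first step is to flag candidates for review rather than delete them. The sketch below uses the common 1.5 × IQR rule, which is an assumption here (the slides do not name a detection method), and the data values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical purchase amounts in RM
amounts = [12, 25, 30, 30, 45, 60, 500]

flagged = iqr_outliers(amounts)
print("flagged for review:", flagged)
```

The function only reports the suspicious points; deciding whether each one is a wrong measurement or a genuine signal remains a judgment call.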
3 Methods applied:
Exploratory Data Analysis (EDA): To discover relationships between measures in data and to gain insight into the trends, patterns, and relationships among various entities with the help of statistics and visualisation tools
Uni means one and variate means variable
Reflects how often an occurrence has taken place in the data. It gives a brief idea of the data and makes it easier to find a pattern
IQ Range | Number |
---|---|
118-125 | 3 |
126-133 | 7 |
134-141 | 4 |
142-149 | 2 |
150-157 | 1 |
Example: The list of IQ scores is: 118, 139, 124, 125, 127, 128, 129, 130, 130, 133, 136, 138, 141, 142, 149, 130, 154
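The frequency table above can be reproduced from the list of IQ scores. This is a small sketch; the class width of 8 and the starting bound of 118 are read off the table, and the helper name `bin_label` is made up for this example.

```python
from collections import Counter

iq_scores = [118, 139, 124, 125, 127, 128, 129, 130, 130, 133,
             136, 138, 141, 142, 149, 130, 154]

BIN_START = 118  # lower bound of the first class
BIN_WIDTH = 8    # each class covers 8 consecutive integer scores

def bin_label(score):
    """Map a score to its class label, e.g. 127 -> '126-133'."""
    low = BIN_START + ((score - BIN_START) // BIN_WIDTH) * BIN_WIDTH
    return f"{low}-{low + BIN_WIDTH - 1}"

frequency = Counter(bin_label(score) for score in iq_scores)

for label in sorted(frequency):
    print(label, frequency[label])
```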
The bar graph is very convenient for comparing categories of data or different groups of data. It helps to track changes over time. Best for visualizing discrete data (variables that can only take certain values)
Bar chart is a great way to display categorical variables in the x-axis. This type of graph denotes two aspects in the y-axis.
Similar to bar charts. Represent the group of variables with values in the y-axis
Mainly used to comprehend how a group is broken down into smaller pieces. The whole pie represents 100%.
Bi means two and variate means variable. Relationship between two variables
Represents individual pieces of data using dots. These plots make it easier to see if two variables are related to each other. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between two variables
Represents the strength of linear relationship between two numerical variables
Used for determining the association between categorical variables
A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.
You want to know whether the mean petal length of iris flowers differs according to their species. You find two different species of irises growing in a garden and measure 25 petals of each species. You can test the difference between these two groups using a t-test.
From the output table, we can see that the difference in means for our sample data is -4.084 (1.456 – 5.540), and the 95% confidence interval puts the true difference in means between -4.331 and -3.836. Because this interval does not include 0, the data are not consistent with there being no difference. Our p-value of 2.2e-16 is much smaller than 0.05, so we can reject the null hypothesis of no difference and say with a high degree of confidence that the true difference in means is not equal to zero.
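The test statistic behind such an output can be sketched by hand. This example assumes Welch's unequal-variance form of the two-sample t-test and uses hypothetical petal lengths, not the measurements behind the output table. It computes only the t statistic and the approximate degrees of freedom; the p-value would additionally require the t-distribution.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error contributed by each group
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical petal lengths in cm for two species
species_a = [1.4, 1.5, 1.3, 1.6, 1.4, 1.5]
species_b = [5.5, 5.8, 5.1, 5.6, 5.9, 5.4]

t_stat, dof = welch_t(species_a, species_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A t statistic far from zero, as here, corresponds to a very small p-value and therefore to rejecting the null hypothesis of equal means.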
Required when more than two variables have to be analyzed simultaneously. It is hard to visualize a relationship among 4 variables in a graph.
Classify different objects into clusters so that the similarity between two objects from the same group is maximal, and minimal otherwise
Reducing the dimensionality of a data table with a large number of interrelated measures.
Use various algorithms to build predictive models
Check the efficiency of our model
Are people who purchase tea more or less likely to purchase carbonated drinks?
People who bought also bought ...
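The tea and carbonated drinks question can be answered with the three standard association measures: support, confidence, and lift. The baskets below are hypothetical and exist only to show the arithmetic.

```python
# Hypothetical shopping baskets (illustrative only)
baskets = [
    {"tea", "biscuits"},
    {"tea", "carbonated drink"},
    {"carbonated drink", "chips"},
    {"tea", "milk"},
    {"carbonated drink", "chips", "biscuits"},
    {"tea", "biscuits", "milk"},
    {"carbonated drink"},
    {"tea", "carbonated drink", "chips"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Above 1: bought together more often than chance; below 1: less often."""
    return confidence(antecedent, consequent) / support(consequent)

tea, soda = {"tea"}, {"carbonated drink"}
print("support(tea)      =", support(tea))
print("confidence(tea -> soda) =", confidence(tea, soda))
print("lift(tea -> soda)       =", lift(tea, soda))
```

In this made-up data the lift is below 1, so tea buyers are less likely than the average shopper to buy carbonated drinks.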
Association Rule Learning is being used to help:
Which categories does this document belong to?
Statistical technique used to identify trends and cycles over time
Time series visualization
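One simple way to expose a trend in a time series is a moving average, which smooths out short-term fluctuations. This is only one of many time series techniques and is shown here as an assumption about what the slide covers; the monthly figures are hypothetical.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical monthly sales figures
monthly_sales = [10, 12, 9, 14, 15, 13, 18, 20, 17, 22, 24, 21]

trend = moving_average(monthly_sales, window=3)
print([round(value, 2) for value in trend])
```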
How well is our new return policy being received?
Wordcloud Visualization: The most frequent words that appear in the data
Visualization for sentiment analysis: Overall Sentiment
Visualization for sentiment analysis: Sentiment over time
Visualization for sentiment analysis: Sentiment by topic
List of unsupervised learning algorithms
CART (Classification And Regression Trees): Classification Trees and Regression Trees
X1 and X2 are the independent variables
Y is our dependent variable, which we cannot see because it is in another dimension (z-axis)
Calculate Mean/Average for each leaf
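The leaf-averaging step can be sketched as follows. The split thresholds and the training points are hypothetical; a real regression tree would learn the splits from the data instead of having them hard-coded.

```python
import statistics
from collections import defaultdict

# Hypothetical training points: (x1, x2, y)
points = [
    (10, 150, 300.0), (15, 120, 320.0),
    (12, 210, 540.0), (18, 260, 560.0),
    (25, 220, 65.0), (30, 250, 70.0),
    (40, 100, 1000.0), (45, 130, 1040.0),
]

def leaf_of(x1, x2):
    """Hard-coded, hypothetical splits that carve the (x1, x2) plane into four leaves."""
    if x1 < 20:
        return "A" if x2 < 200 else "B"
    return "C" if x1 < 35 else "D"

# Group the y values by leaf, then take the mean of each leaf
leaf_values = defaultdict(list)
for x1, x2, y in points:
    leaf_values[leaf_of(x1, x2)].append(y)
leaf_means = {leaf: statistics.mean(ys) for leaf, ys in leaf_values.items()}

def predict(x1, x2):
    """The prediction for a new point is the mean of the leaf it falls into."""
    return leaf_means[leaf_of(x1, x2)]

print(leaf_means)
print(predict(14, 140))
```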
We are not predicting based on just 1 tree; we are predicting based on a forest of trees. This improves the accuracy of the prediction because we take the average of many predictions
We can use other distances as well, such as Manhattan distance, but Euclidean distance is the one commonly used in geometry
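The two distance measures, and how a nearest-neighbour vote uses them, can be sketched like this. The training points, labels, and the choice of k = 3 are hypothetical.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_predict(training, query, k=3, distance=euclidean):
    """Majority label among the k training points closest to query."""
    nearest = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: ((x1, x2), category)
training = [
    ((1, 1), "red"), ((2, 1), "red"), ((1, 2), "red"),
    ((6, 5), "green"), ((7, 6), "green"), ((6, 7), "green"),
]

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
print(knn_predict(training, (2, 2)))
print(knn_predict(training, (6, 6), distance=manhattan))
```

Passing the distance as a parameter makes it easy to compare how the choice of metric affects the neighbours that get a vote.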
Mach1: 30 wrenches/hr
Mach2: 20 wrenches/hr
Out of all produced parts:
We can SEE that 1% are defective
Out of all defective parts:
We can SEE that 50% came from mach1 and 50% came from mach2
Question:
What is the probability that a part produced by mach2 is defective?
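The question can be worked through with Bayes' theorem, P(defect | mach2) = P(mach2 | defect) × P(defect) / P(mach2). The sketch below assumes both machines run for the same amount of time, so that P(mach2) follows directly from the production rates; the variable names are chosen for this example.

```python
from fractions import Fraction

# Share of all parts produced by mach2, from the rates 30/hr and 20/hr
# (assumes both machines run for the same number of hours)
p_mach2 = Fraction(20, 30 + 20)

p_defect = Fraction(1, 100)            # 1% of all parts are defective
p_mach2_given_defect = Fraction(1, 2)  # 50% of defective parts came from mach2

# Bayes' theorem
p_defect_given_mach2 = p_mach2_given_defect * p_defect / p_mach2

print(p_defect_given_mach2)         # 1/80
print(float(p_defect_given_mach2))  # 0.0125
```

So under these assumptions, 1.25% of the parts produced by mach2 are defective.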
Assign class based on probability
0.75 VS 0.25
0.75 > 0.25
CART (Classification And Regression Trees): Classification Trees and Regression Trees
We can apply K-Means for different purposes:
Deep Learning is a part of Machine Learning that finds better patterns; when data is unstructured, it is difficult for classical ML algorithms to find the pattern. Basically, it emulates the way humans gain certain types of knowledge
When the neural network starts training, it initially won't do a good job of predicting the correct output, because it has not actually learned anything yet. The whole purpose of training is to get the right weight and bias values for each node so that the NN generalizes well, so the actual learning happens when the NN corrects the weight and bias of each node
The error (difference between actual and predicted) is passed backward, cascading to the input layer. This cascade changes the weight and bias of each node in each layer. The entire process of cascading backward is called backpropagation, and this is how a neural network learns.
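The error-driven correction of weight and bias can be illustrated in its simplest form: gradient descent on a single linear neuron. This is a deliberately reduced sketch, not full backpropagation; a multi-layer network repeats the same chain-rule step layer by layer. The training samples, learning rate, and epoch count are hypothetical.

```python
def train(samples, learning_rate=0.05, epochs=500):
    """Fit prediction = weight * x + bias by repeatedly correcting the error."""
    weight, bias = 0.0, 0.0  # untrained starting values
    for _ in range(epochs):
        for x, target in samples:
            predicted = weight * x + bias
            error = predicted - target  # difference between predicted and actual
            # Gradients of the loss 0.5 * error**2 with respect to weight and bias
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias

# Hypothetical samples generated from target = 2 * x + 1
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weight, bias = train(samples)
print(f"weight = {weight:.3f}, bias = {bias:.3f}")
```

At the start the neuron predicts 0 for every input; after training, the weight and bias approach the values that generated the data.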
Example: Fruit image layers can learn different features of the object like texture, color, shape, size, etc
Machines that have cognitive intelligence; in short, machines that act like humans
Assume that AI is a car. Machine Intelligence is the fuel that runs it
The best example for understanding modern AI engineering is self-driving cars
Recruitment and Retention in Military
The graph below illustrates how unemployment is shrinking the recruit pool
Information describing past recruits:
Utilizing ML will help every recruiter better prioritize their time and quickly turn the two recruits they need to enlist per month into three, bringing in high-quality recruits able to contribute to the overall defense and security of the nation in the process.
We can also look at the recruiting issue as a marketing problem. What would a 1% increase in awareness accomplish?
This can be accomplished by utilizing information the military has already collected:
A step further: dig deeper into the data to understand which jobs each prospect may be interested in, and put those options in front of them, increasing their propensity to engage and ultimately enlist
With BDA, the military can effectively and efficiently overcome the obstacles of finding the best recruits to serve