Shefali Bansal
Tanya Sharma
Arvind Srinivasan
Aarushi Sharma
Arkav Banerjee
Simran Singh
The Team
This section deals with preprocessing the dataset to extract the features needed for modelling.
sklearn.feature_extraction.text.CountVectorizer
max_features : int or None
analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable
ngram_range : tuple (min_n, max_n)
max_df : float in range [0.0, 1.0] or int, default=1.0
min_df : float in range [0.0, 1.0] or int, default=1
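A minimal sketch of how these parameters fit together; the mini-corpus is hypothetical and the parameter values are illustrative, not the ones used in the project:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the review text column.
corpus = [
    "the food was great",
    "the service was slow",
    "great food, great service",
]

# Keep at most 1000 features, count word unigrams and bigrams,
# and ignore terms that appear in more than 90% of the documents.
vectorizer = CountVectorizer(
    max_features=1000,
    analyzer="word",
    ngram_range=(1, 2),
    max_df=0.9,
    min_df=1,
)
X = vectorizer.fit_transform(corpus)
print(X.shape)                        # (n_documents, n_features)
print(sorted(vectorizer.vocabulary_)[:5])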
Long Short Term Memory
To analyse the reviews from the Yelp dataset and classify them as positive, negative or neutral using the Long Short-Term Memory model.
Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network.
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate.
An LSTM unit can remove or add information with the help of structures known as gates, which let information through selectively.
Plain RNNs suffer from the problem of long-term dependencies.
Sometimes, we only need to look at recent information to perform the present task.
Sometimes, we need more context, i.e. older information; the gate equations below show how an LSTM keeps such information around.
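For reference, the standard LSTM update equations (general background, not specific to this project) make the gating explicit; sigma is the logistic sigmoid, the circled dot is element-wise multiplication, and W, U, b are learned parameters:

\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}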
This section deals with preprocessing the training dataset so that the data can be fitted to the model under focus.
First, check if any null values exist in the dataframe.
If found, drop any row with an inconsistency.
Remove everything other than alphabetical text, so that the type of data in a column of the dataframe is consistent (see the sketch after this list).
Then perform tokenisation.
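A minimal sketch of the null check and cleanup steps above, assuming a pandas DataFrame with a 'text' column; the file name and column name are hypothetical:

import re
import pandas as pd

df = pd.read_csv("yelp_reviews.csv")   # hypothetical file name

# Check for null values, then drop any row containing one.
print(df.isnull().sum())
df = df.dropna()

# Keep only alphabetical characters and spaces, and lowercase the text.
df["text"] = df["text"].apply(lambda s: re.sub(r"[^a-zA-Z ]", " ", s).lower())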
Keras represents each word as a number, with the most common word in a given dataset represented as 1, the second most common as 2, and so on.
This is useful because we often want to ignore rare words: the neural network usually cannot learn much from them, and they only add to the processing time.
If the data is tokenised with the more common words having lower numbers, we can easily train on only the N most common words in the dataset, and adjust N as necessary.
Tokenizer
keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~ ', lower=True, split=' ', char_level=False, oov_token=None)
Text tokenization utility class.
This class allows a text corpus to be vectorised, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.
Arguments
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorised.
0 is a reserved index that won't be assigned to any word.
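A short sketch of the Tokenizer in use on the cleaned reviews from the preprocessing sketch; num_words=10000 and maxlen=200 are illustrative choices of N and sequence length:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = df["text"].tolist()   # cleaned reviews from the preprocessing sketch

# Keep only the 10,000 most common words; rarer words are ignored.
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)

# Each review becomes a sequence of word indices (1 = most common word).
sequences = tokenizer.texts_to_sequences(texts)

# Pad or truncate every sequence to a fixed length so they can be batched.
X = pad_sequences(sequences, maxlen=200)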
This section explains how the model behaves when certain parameters that influence its accuracy or convergence are tuned.
The embedding layer's shape arithmetic: (nb_words, vocab_size) x (vocab_size, embedding_dim) = (nb_words, embedding_dim), i.e. the one-hot encoded words multiplied by the embedding matrix yield the embedded sequence.
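A minimal Keras sketch consistent with this shape arithmetic; the layer sizes (128-dimensional embeddings, 64 LSTM units) are assumptions for illustration, not necessarily the project's exact architecture:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 10000      # matches num_words in the tokenizer sketch
embedding_dim = 128     # illustrative
maxlen = 200            # matches the padding length above

model = Sequential()
# Maps each word index to a dense vector:
# (nb_words, vocab_size) x (vocab_size, embedding_dim) = (nb_words, embedding_dim)
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen))
model.add(LSTM(64))
# Three output classes: positive, negative, neutral (one-hot encoded labels).
model.add(Dense(3, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()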
To verify that the model accurately predicts the expected results on the test dataset, and to compare this against the validation accuracy obtained on the validation dataset and the training dataset.
As you can see, we are working on a sample of 10,000 reviews from the dataset.
The left graph shows loss; the right graph shows accuracy.
We got an accuracy of 66% when running an epoch across the whole dataset.
We looked for all the possible errors that might have occurred during the process and concluded with a few pointers.
Random Forests
To analyse the reviews from the Yelp dataset and classify them as positive, negative or neutral using the Random Forest model.
An ensemble learning method for classification and regression tasks.
Outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Random forests use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging".
Accuracy: typically better than a single decision tree.
Accuracy can be tuned via the max_depth parameter, which limits overfitting; other parameters such as max_leaf_nodes, min_samples_split and min_samples_leaf also help with pruning, to a certain extent.
Robust : doesn’t suffer the instability problems of decision trees.
Unlike a single decision tree, the root node and the splitting features are chosen at random.
It works in 2 stages:
1. Build the forest: grow many decision trees, each on a random sample of the data, with a random subset of features considered at each split.
2. Predict: run the input down every tree and aggregate the outputs by majority vote (classification) or averaging (regression).
This section deals with preprocessing the training dataset so that the data can be fitted to the model under focus.
First, check if any null values exist in the dataframe.
If found, drop any row with an inconsistency.
Now follow the steps described in the Feature Text Preprocessing section (a training sketch follows this list).
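A minimal sketch of training the Random Forest on the vectorised reviews; df, the 'text' and 'label' columns, and the parameter values are assumptions for illustration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Bag-of-words features, as in the Feature Text Preprocessing section.
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["text"])
y = df["label"]   # hypothetical column holding positive/negative/neutral

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_depth limits overfitting; n_estimators is the number of trees.
clf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))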
Multinomial Naive Bayes
To analyse the reviews from the Yelp dataset and classify them as positive, negative or neutral using the Multinomial Naive Bayes model.
A classification algorithm.
Follows a supervised learning approach.
Models a problem probabilistically.
Based on the assumption that all the attributes are conditionally independent.
Can solve problems involving categorical attributes.
Multinomial Naive Bayes is a specialised version of Naive Bayes designed for text documents. Whereas simple Naive Bayes would model a document as the presence or absence of particular words, Multinomial Naive Bayes explicitly models the word counts and adjusts the underlying calculations to deal with them.
So, to classify the reviews into three categories (positive, negative, neutral), Multinomial NB is used.
This section deals with preprocessing the training dataset so that the data can be fitted to the model under focus.
First, check if any null values exist in the dataframe.
If found, drop any row with an inconsistency.
Now follow the steps described in the Feature Text Preprocessing section (a training sketch follows this list).
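A minimal sketch of Multinomial NB on the same bag-of-words features; as before, df and its columns are assumed, and alpha=1.0 (Laplace smoothing) is the library default:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Multinomial NB models the word counts produced by CountVectorizer.
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["text"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = MultinomialNB(alpha=1.0)   # alpha: Laplace/Lidstone smoothing
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))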
To verify that the model accurately predicts the expected results on the test dataset, and to compare this against the validation accuracy obtained on the validation dataset and the training dataset.
Support Vector Machines
To analyse the reviews from the Yelp dataset and classify them as positive, negative or neutral using the Support Vector Machine model (a sketch follows).
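A minimal sketch using LinearSVC, a linear-kernel SVM that scales well to sparse text features; the choice of a linear kernel and C=1.0, like df and its columns, are assumptions, not necessarily what the team used:

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["text"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LinearSVC handles the three classes with a one-vs-rest scheme.
svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))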
Logistic Regression
Applied to a binary dependent variable.
The binary logistic regression model can be generalised to more than two levels of the dependent variable: categorical outputs with more than two values are modelled by multinomial logistic regression (see the sketch below).
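A minimal sketch of multinomial logistic regression over the same features; multi_class="multinomial" with the lbfgs solver is the standard scikit-learn way to fit this generalisation, while df and its columns remain assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X = CountVectorizer(max_features=5000).fit_transform(df["text"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# multi_class="multinomial" generalises binary logistic regression
# to the three review classes (positive, negative, neutral).
lr = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))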
To verify that the model accurately predicts the expected results on the test dataset, and to compare this against the validation accuracy obtained on the validation dataset and the training dataset.