Dimensionality reduction and classification over MRI anatomical images using PCA
What is Dimensionality Reduction?
- Too many features while the number of training examples is limited
E.g. 3D image processing: 256*256*256 = 16,777,216 features
- As the number of features or dimensions grows, the amount of data we need to generalise accurately grows exponentially.
- Sometimes, most of these features are correlated and hence redundant
- Feature selection & Feature extraction
Avoid curse of dimensionality:
- Reduce computational time and space required
- Ease of fitting models with high accuracy (bye to over-fitting)
- Possibility to visualize the data when reduced to 2D or 3D
Some dimensionality reduction techniques (a minimal sketch follows after this list):
- ICA (independent source finding)
- SVD (efficient computation of PCA)
- PCA (correlated variables to linearly uncorrelated variables)
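For orientation, a minimal scikit-learn sketch of the three techniques listed above; the data and component counts are made up for illustration and are not the values used later in this project.
import numpy as np
from sklearn.decomposition import PCA, FastICA, TruncatedSVD

X = np.random.rand(100, 1000)   # 100 hypothetical samples with 1000 features

pca_Z = PCA(n_components=10).fit_transform(X)           # linearly uncorrelated components
ica_Z = FastICA(n_components=10).fit_transform(X)       # independent sources
svd_Z = TruncatedSVD(n_components=10).fit_transform(X)  # SVD-based reduction
print(pca_Z.shape, ica_Z.shape, svd_Z.shape)            # (100, 10) each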
Data & Data Structure
- MRI brain images (volumetric representation)
- NIfTI (.nii) files
- Size: 17.3 GB
- 1113 samples (brains)
- Preprocessed version of data
- Skull stripping.
- Normalized.
The Human Connectome Project provides the 1113 subjects' T1-weighted structural brain images. These 3D MPRAGE images were obtained on Siemens 3T platforms using a 32-channel head coil.
Cropping
import nibabel as nib                    # assumed: the NIfTI files are read with nibabel
img = nib.load('brain_nii/585256.nii')   # example subject from the data set
data = img.get_data()                    # (256, 256, 256)
cropped = data[50:205, 60:225, 30:225]   # (155, 165, 195)
Resize
from scipy.ndimage import zoom
# (155, 165, 195) array -> 4,987,125 voxels; target (50, 50, 50) -> 125,000 voxels
resampled = zoom(cropped, (50/cropped.shape[0], 50/cropped.shape[1], 50/cropped.shape[2]))
Flatten
# (50, 50, 50) array -> 125,000 voxels
vector = resampled.flatten()             # (125000,) vector
[Diagram: Original 256x256x256 -> Cropped 155x165x195 -> Resized 50x50x50; the 1113 flattened volumes form a 1113x125000 matrix saved to file.npy]
Jupyter notebook: loading, cropping, resizing and saving
[Diagram: the flattened 1113x125000 matrix feeds the ML methods: PCA followed by SVM, plus SVM and KNN]
Principal Component Analysis (PCA)
What is PCA?
- Finds correlations between variables.
- Helps separate the samples of a dataset.
- Based on eigenvalues and eigenvectors (the axes of the subspace).
Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
Visually: with more than 3-4 columns, the data can no longer be plotted directly.
- PCA(1113x125000)
Reduces the number of features of the original data to a series of principal components that capture most of the variance (the axes of the subspace).
EigenBrains
- PCA(125000x1113)
Tries to find a set of "standardized brain ingredients".
Feature extraction
- Loading data
- Applying PCA choosing amount of components
- Plotting
- Reconstructing data from components
import numpy as np
to_save_load=np.load("to_save.npy")
to_save_load.shape
(1113, 125000)
Loading data
from sklearn.decomposition import PCA
pca = PCA(n_components=10)  # n_components can be an integer or a fraction of the variance to keep
principalComponents = pca.fit_transform(to_save_load)
(1113, 10)
Applying PCA
l = pca.explained_variance_ratio_
sum(l)#0.412558136188802
Plotting
import pandas as pd
subjects = pd.read_csv('subjects.csv')
columns = subjects[['Subject','class']]
labels= columns.set_index('Subject').T.to_dict('list')
import os
dirs = os.listdir( "brain_nii" )
Getting Colors
{585256: [-1], 585862: [-1], 586460: [-1], 587664: [-1], 588565: [1],...}
colors = []
for name in dirs:
    id, ext = name.split('.')
    try:
        colors.append(labels[int(id)][0])
    except KeyError:
        colors.append(0)   # unlabeled subject
[1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 1,...] #1 ,0(unlabeled) ,-1
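The scatter plot itself was not preserved; a hedged sketch of how the first two components could be plotted with the colors built above (it assumes the rows of principalComponents follow the same order as the files in brain_nii):
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
plt.scatter(principalComponents[:, 0], principalComponents[:, 1], c=colors, cmap='coolwarm', s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.colorbar(label='class (-1 / 0 / 1)')
plt.show()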
Data reconstruction
dim=50
approximation = pca.inverse_transform(principalComponents)
restored = approximation.reshape(-1,dim,dim,dim)
(1113, 50,50,50)
[Figure: reconstructions from 10, 100 and 500 components, explaining roughly 0.41, 0.59 and 0.81 of the variance respectively]
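Those cumulative variance figures can be reproduced by refitting PCA with each component count; a minimal sketch (values are approximate):
from sklearn.decomposition import PCA

for n in (10, 100, 500):
    p = PCA(n_components=n).fit(to_save_load)
    print(n, round(p.explained_variance_ratio_.sum(), 2))   # roughly 0.41, 0.59, 0.81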
Eigen brains
- What is an Eigen brain
- Applying PCA
- Plotting
EigenBrains??
Applying PCA
[Diagram: the 1113x125000 matrix is transposed to 125000x1113; PCA keeping 99.9% of the variance yields a 125000x809 matrix, which is transposed and each of the 809 rows reshaped to a 50x50x50 volume: eigen brain 1 ... eigen brain 809]
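The eigen-brain pipeline above fits in a few lines; a hedged sketch (variable names are hypothetical, and PCA(n_components=0.999) is assumed to be how the 99.9% variance threshold was specified):
import numpy as np
from sklearn.decomposition import PCA

X = np.load("to_save.npy")             # (1113, 125000)
pca = PCA(n_components=0.999)          # keep 99.9% of the variance
components = pca.fit_transform(X.T)    # transpose -> (125000, ~809)
eigen_brains = components.T.reshape(-1, 50, 50, 50)   # one 50x50x50 volume per component
print(eigen_brains.shape)              # (~809, 50, 50, 50)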
Plotting
[Figure: the first four eigen brains, with explained variance ratios 0.417, 0.069, 0.035 and 0.022]
Separate classes
class1 = frame[(frame['class'] == 1)]
print("Total class 1:", len(class1))
class2 = frame[(frame['class'] == 0)]
print("Total class 0:", len(class2))
Total class 1: 550
Total class 0: 558
Average brains
from sklearn.preprocessing import normalize
average_brain1 = normalize(class1).mean(0)
average_brain2 = normalize(class2).mean(0)
avg_vol1 = average_brain1.reshape(50, 50, 50)
print(avg_vol1.shape)
avg_vol2 = average_brain2.reshape(50, 50, 50)
print(avg_vol2.shape)
(50, 50, 50)
(50, 50, 50)
[Figure: average brain for class 1 (c1), average brain for class 2 (c2), and their difference (diff); the 3D volume rendering attempt failed]
Jupyter notebook: PCA, eigen brains
Classification
- Data pre-processing
- a bit more about the data.
- Resampling and hyperparameter tuning
- cross validation
- testing with different parameter values
- Model fittings
- underfitting and overfitting
- convergence
Post-PCA Data Preprocessing
After PCA:
- Data set of dimension 1113x1062 ('PCA.npy')
- Data in the form of an array (npy format)
Classification
- Data labels of dimension 1113x1 ('Y.npy')
- Label 1 represents positive brains
- Label -1 represents negative brains
- Label 0 represents unlabeled brains
Post-PCA Data Preprocessing
- Store the indices of the unlabeled examples and remove them from the data set
- Replace label -1 with 0 so that training uses dummy labels 0 and 1
- Concatenate the data set with the corresponding labels and rename the columns PCA1, PCA2, ..., PCA1062, CLASS
Cleaning data
- Remove unlabeled data
- Change labels
- Concatenate data with labels
- Evenly distribute data
Load data and labels
(remove unlabeled samples from the data set)
data=pd.DataFrame(np.load('PCA.npy'))
y=np.load('Y.npy')
y=pd.DataFrame(y)
dataset=pd.concat([data,y],axis=1)
index_remove=list(y[y[0]==0].index)
print(index_remove)
dataset=dataset.drop(index_remove)
[71, 239, 277, 330, 529]
Cleaning data
Replace label -1 with 0 so that training uses dummy labels 0 and 1
dataset.iloc[:, -1] = dataset.iloc[:, -1].replace(-1, 0)   # the label column is the last one
Cleaning data
Change labels
Rename the columns as PCA1, PCA2,....,PCA1062,CLASS
a=['PCA' + str(i+1) for i in range(1063)]
a[-1]='CLASS'
dataset.columns=a
Cleaning data
Evenly distribute data
Problem: Dataset is organized by example labels
class1 = dataset[(dataset['CLASS']== 1)]
print("Total class 1:", len(class1))
class0 = dataset[(dataset['CLASS']== 0)]
print("Total class 0:", len(class0))
df = []
for i in range(550):
    df.append(class1.iloc[i].values)
    df.append(class0.iloc[i].values)
for j in range(550, 558):              # the 8 remaining class-0 examples
    df.append(class0.iloc[j].values)
Final processed data
np.save('X_y.npy',df)
- Save the processed data in .npy
- Load the final data
data=pd.DataFrame(np.load('X_y.npy'))
data.iloc[:,-1]=data.iloc[:,-1].apply(lambda i: int(i))
- Assign value to independent and dependent variables
y=np.array(data.iloc[:,-1].replace(-1,0)).ravel()
X=np.array(data.iloc[:,:-1])
Resampling Methods
The process of repeatedly drawing samples from data and refitting a given model on each sample
Resampling methods help:
- Learn more about the fitted model
- Estimate the error or accuracy on future data more accurately
- Estimate the bias and standard deviation of the hypothesized parameters
- Avoid over-fitting
But:
- Computationally expensive
- Underestimate or overestimate the real error
Resampling Methods
Techniques:
- Cross Validation
+ Validation set
+ Leave-one-out cross-validation (LOOCV)
+ K-fold cross validation
- The Bootstrap
Resampling Methods
Validation Set
- Training set (70%): used to fit the model
- Validation set (20%): estimates how well the model will do on future data
- Test set (10%): final check of generalization
train_percentage = 0.7
validation_percentage = 0.2
test_percentage = 0.1
1108 examples
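The split code itself is not shown in the slides; a minimal sketch with scikit-learn's train_test_split, using the percentages above (random_state is arbitrary):
from sklearn.model_selection import train_test_split

# First hold out the 10% test set, then split the remaining 90% into 70/20
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2 / 0.9, random_state=0)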
Resampling Methods
Validation Set
Problem: High bias and variability in error/accuracy between the splits.
E.g. For Support Vector Machine using Linear Kernel
Random split 1:
Random split 2:
Random split 3:
A result that depends this heavily on a single split is potentially biased.
Resampling Methods
Leave-one-out cross-validation (LOOCV)
Leave-one-out cross-validation withholds only a single observation for the validation set.
Okay, for our data set the model would be fit more than 1,100 times!!!
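Purely for illustration (the project does not actually run it because of the cost), LOOCV with scikit-learn would look like this:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

loo = LeaveOneOut()                      # one model fit per labeled example
scores = cross_val_score(LogisticRegression(max_iter=500), X, y, cv=loo)
print(scores.mean())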
Resampling Methods
K-fold cross validation
- A technique that randomly divides the whole data set into k subsets, groups or folds (k can be freely chosen)
- Each of the k folds is used once as the test set while the remaining k-1 folds form the training set.
Resampling Methods
K-fold cross validation
Advantage:
- Less computationally expensive
- More accurate estimation for future data
- Easy to apply
- Applicable to large data
Disadvantage:
- Prone to underestimation of real error/overestimation of accuracy
- Not really effective for small data set
K-fold cross validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    print(train_index, test_index)
K-fold cross validation
Assign train_index,test_index to form
- X_train, y_train
- X_test, y_test
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Model fittings
- Logistic Regression with L1 regularization
- Support Vector Machine with Linear Kernel
- Support Vector Machine with Gaussian Kernel
- K-nearest neighbors (k-nn)
- Neural Network
Model fittings
Problems: Over-fitting vs Under-fitting
Bias:
- how close the model's predictions are to the training data
- a high-bias model pays very little attention to the training data and oversimplifies
=> high error on both training and test data
Variance:
- the variability of the model's predictions for a given data point, i.e. how spread out they are
- a high-variance model pays too much attention to the training data and does not generalize to new data
=> performs very well on training data but badly on test data
Model fittings
Under-fitting happens when a model is unable to capture the underlying pattern of the data => high bias, low variance
Over-fitting happens when a model captures the noise along with the underlying pattern; in other words, the model is fitted to the noise in the training data => low bias, high variance
Model fittings
Why over-fitting:
- Too complex models, too many features
- Too small training data set
- Too small regularization term
...
Why under-fitting:
- Too simple model while the data are too complex
- Too large regularization term
Model fittings
Logistic Regression
- Based on the sigmoid (logistic) function
- In a classification problem the model should output predictions in (0, 1); predicted values > 1 or < 0 make no sense
E.g. 0 as non-spam email, 1 as spam email
- Linear regression often does not work well here (classification is not actually a linear function)
=> binary classification problem
Logistic Regression
Apply the sigmoid (logistic) function: g(z) = \frac{1}{1 + e^{-z}}
Hypothesis (predicted output): h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
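A tiny numeric check of the sigmoid (values chosen arbitrarily):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5   -> on the decision boundary
print(sigmoid(4))    # ~0.98 -> predicted as class 1
print(sigmoid(-4))   # ~0.02 -> predicted as class 0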
Logistic Regression
Cost function:
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]
To address over-fitting (our data set has 1062 features), an L1 regularization term is added:
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ \cdots \right] + \frac{\lambda}{m} \sum_{j=1}^{n} |\theta_j|
Logistic Regression using regularization (L1 penalty)
Note:
- too large a lambda can lead to under-fitting
- too small a lambda can leave the model over-fitting
Logistic Regression
Apply Logistic Regression
from sklearn.linear_model import LogisticRegression
def log_reg(X_train, X_test, y_train, y_test, reg_term):
    classifier = LogisticRegression(solver='liblinear', penalty='l1', C=(1/reg_term),
                                    fit_intercept=True, max_iter=500)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)
Return the accuracy rate of the model over training and test set
Note:
C=1/(regularization term)
Logistic Regression
Define the best regularization term "lambda" for logistic regression model using cross validation
Number of folds: k=5
kf = KFold(n_splits=5)
List of regularization terms:
param_logreg=[0.001*10**i for i in range(10)]
Logistic Regression
Step 1: find the best range with the best performance over both training and test set
training_acc_col=[]
test_acc_col=[]
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Logistic Regression
Plot averaged training accuracy & test accuracy over the range of regularization terms after running through 5 folds
import matplotlib.pyplot as plt

avg_training_acc = np.mean(np.array(training_acc_col), axis=0)
avg_test_acc = np.mean(np.array(test_acc_col), axis=0)

x_i = [i for i in range(len(param_logreg))]
plt.figure(figsize=(10, 10))
plt.plot(x_i, avg_training_acc, marker='o', mfc='blue', color='r')
plt.plot(x_i, avg_test_acc, marker='X', mfc='red', color='black')
plt.xticks(x_i, param_logreg)
plt.legend(['Training Accuracy', 'Test Accuracy'])
plt.xlabel('The Value of Regularization Terms', size=20)
plt.ylabel('Accuracy Rate', size=20)
plt.show()
Logistic Regression
[Figure: averaged training and test accuracy over the coarse range of regularization terms]
Logistic Regression
Step 2: Find the exact value of the best regularization term (we choose from 1000 to 10000)
param_logreg = []
training_acc_col = []
test_acc_col = []
for i in range(1000, 10000, 500):
    param_logreg.append(i)
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Logistic Regression
Plotting the averaged accuracy
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_logreg))]
plt.figure(figsize=(10,10))
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
plt.xticks(x_i,param_logreg)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.legend(['Training Accuracy','Test Accuracy'])
# plt.xlim(min(param_logreg)/10,max(param_logreg))
plt.show()
Logistic Regression
[Figure: averaged training and test accuracy over regularization terms from 1000 to 10000]
Logistic Regression
Regularization term: log_reg_term=2500
Averaged training accuracy: 83.78%
Averaged test accuracy: 80.51%
Support Vector Machine
- A discriminative classifier formally defined by a separating hyperplane
- The algorithm outputs an optimal hyperplane which categorizes new examples (a line in two dimensions); see the toy example below
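A toy illustration of the idea on made-up 2-D points (not the project data):
import numpy as np
from sklearn.svm import SVC

X_toy = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X_toy, y_toy)
print(clf.coef_, clf.intercept_)       # the separating hyperplane w.x + b = 0
print(clf.predict([[3, 3], [6, 6]]))   # [0 1]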
Support Vector Machine
Cost function / objective function:
\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2
Instead of minimizing A + \lambda B (as in regularized logistic regression), the SVM convention minimizes CA + B.
With C = 1/\lambda, an SVM with a linear kernel and regularized logistic regression yield very similar hypothesized parameters.
C too large: over-fitting
C too small: under-fitting
Support Vector Machine
kf = KFold(n_splits=5)
Applying Support Vector Machine with Linear Kernel
Number of folds (iterations): k=5
List of regularization terms: param_linear_svc = [10**i for i in range(3, 13)]   # 10^3 ... 10^12, inferred from the plot labels below
=> Define the best regularization term
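The helper linear_svc used below is not defined in these slides; a plausible sketch by analogy with log_reg (whether the search value maps to C or to 1/C inside the original helper is an assumption):
from sklearn.svm import LinearSVC

def linear_svc(X_train, X_test, y_train, y_test, reg_term):
    # Assumption: the search value is used directly as C (SVM convention C*A + B);
    # the original helper may instead use C = 1/reg_term as log_reg does.
    classifier = LinearSVC(C=reg_term, max_iter=5000)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)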
Support Vector Machine
Define the best range of regularization terms
training_acc_col=[]
test_acc_col=[]
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Support Vector Machine
Plotting the averaged training and test accuracy after running through 5 folds
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_linear_svc))]
plt.figure(figsize=(10,10))
ax = plt.subplot(111)
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
xlabel=['10^'+str(i) for i in range(3,13)]
plt.xticks(x_i,xlabel)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.show()
Support Vector Machine
[Figure: averaged training and test accuracy over regularization terms 10^3 to 10^12]
Support Vector Machine
Find the exact value of the best regularization term (we choose from C = 10^7 to 10^8)
training_acc_col=[]
test_acc_col=[]
param_linear_svc=[i for i in range(10**7,10**8+1,int((10**7)))]
print('Regularization terms:',param_linear_svc)
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Support Vector Machine
Plotting the averaged training and test accuracy after running through 5 folds
Support Vector Machine
Regularization term: linear_svc_reg = 3*10**7
Averaged training accuracy: 86.98%
Averaged test accuracy: 80.69%
Support Vector Machine with Gaussian Kernel
The Gaussian kernel is a function whose value depends on the distance from the origin or from some other point.
Support Vector Machine with Gaussian Kernel
Gamma defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'.
E.g. a small sigma means a small variance: two points must be very close to be considered similar.
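For reference, the Gaussian (RBF) kernel is commonly written as:
K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right), \quad \gamma = \frac{1}{2\sigma^2}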
Support Vector Machine with Gaussian Kernel
Applying Support Vector Machine with Gaussian Kernel
Number of folds (iterations): k=5
kf = KFold(n_splits=5)
List of regularization terms:
Regs = [1, 10, 100, 1000]
Gammas = [10000, 100000, 1000000, 10000000, 100000000, 1000000000]
=> Define the best regularization term
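As with linear_svc, the helper gaussian_svc is not defined in the slides; a hedged sketch (how the grid values map onto C and gamma is an assumption):
from sklearn.svm import SVC

def gaussian_svc(X_train, X_test, y_train, y_test, reg_term, gamma_term):
    # Assumption: the grid values are passed straight through; the original helper
    # might rescale them instead (e.g. gamma = 1/gamma_term).
    classifier = SVC(kernel='rbf', C=reg_term, gamma=gamma_term)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)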
Support Vector Machine with Gaussian Kernel
training_acc_col = []
test_acc_col = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    print('FOLD:', fold)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in reg_gaussian_svc:
        training_acc_col.append([])
        test_acc_col.append([])
        for j in gamma_gaussian_svc:
            training_acc, test_acc = gaussian_svc(X_train, X_test, y_train, y_test, i, j)
            training_acc_col[-1].append(training_acc)
            test_acc_col[-1].append(test_acc)
Support Vector Machine with Gaussian Kernel
avg_training_acc=np.mean(np.array(training_acc_col).reshape((5,-1)),axis=0).reshape(4,6)
avg_test_acc=np.mean(np.array(test_acc_col).reshape((5,-1)),axis=0).reshape(4,6)
Plotting the averaged training and test accuracy after running through 5 folds
reg_gaussian_svc_i,gamma_gaussian_svc_i=np.meshgrid(np.array(reg_gaussian_svc),np.array(gamma_gaussian_svc),indexing='ij')
x_i=[i for i in range(len(reg_gaussian_svc))]
y_i=[i for i in range(len(gamma_gaussian_svc))]
x_ii,y_ii=np.meshgrid(x_i,y_i,indexing='ij')
plt.figure(figsize=(10,10))
plt.contourf(x_ii,y_ii,avg_training_acc)
plt.colorbar()
plt.xticks(x_i,reg_gaussian_svc)
plt.yticks(y_i,gamma_gaussian_svc)
plt.show()
plt.figure(figsize=(10,10))
plt.contourf(x_ii,y_ii,avg_test_acc)
plt.colorbar()
plt.xticks(x_i,reg_gaussian_svc)
plt.yticks(y_i,gamma_gaussian_svc)
plt.show()
Support Vector Machine with Gaussian Kernel
[Figure: contour plots of the averaged training and test accuracy over the C/gamma grid]
Support Vector Machine with Gaussian Kernel
best_reg=0
best_gamma=0
training_acc_col=[]
test_acc_col=[]
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in reg_gaussian_svc:
        for j in gamma_gaussian_svc:
            training_acc, test_acc = gaussian_svc(X_train, X_test, y_train, y_test, i, j)
            test_acc_col.append(test_acc)
avg_test_acc=np.mean(np.array(test_acc_col).reshape((5,-1)),axis=0).reshape(4,6)
best_test_accuracy=np.max(avg_test_acc)
i,j=int(np.where(avg_test_acc==np.max(avg_test_acc))[0]),int(np.where(avg_test_acc==np.max(avg_test_acc))[1])
best_reg=reg_gaussian_svc[i]
best_gamma=gamma_gaussian_svc[j]
Support Vector Machine with Gaussian Kernel
Regularization terms: Regs = [1, 10, 100, 1000], Gammas = [10000, 100000, 1000000, 10000000, 100000000, 1000000000]
Best test accuracy: 0.8014716073539603
Best regularization term: 1
Best gamma: 100000000
K-Nearest Neighbors (KNN)
The input consists of the k closest training examples in the feature space.
Modified pipeline:
- PCA should only be fitted on the training set of each fold
- the best hyperparameters found are only valid for that test fold!!!
Model Performance
[Results figures: model performance with PCA applied inside the generated folds, and with PCA + nested cross-validation]
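A hedged sketch of the "PCA inside the fold + nested cross-validation" idea (not the exact script used here; the pipeline, grid and fold counts are illustrative):
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# PCA is fitted only on the training portion of each fold because it sits inside the pipeline
pipe = Pipeline([('pca', PCA(n_components=0.999)), ('svc', SVC(kernel='linear'))])
param_grid = {'svc__C': [0.01, 1, 100]}                  # hypothetical grid

inner = KFold(n_splits=3, shuffle=True, random_state=0)  # parameter search
outer = KFold(n_splits=5, shuffle=True, random_state=0)  # unbiased performance estimate

search = GridSearchCV(pipe, param_grid, cv=inner)
scores = cross_val_score(search, X, y, cv=outer)
print(scores.mean())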
[Diagram: CNN input pipeline. Original 256x256x256 volumes are cropped to 155x165x195, giving an input tensor of shape 1108x155x165x195x1]
Layer (type) Output Shape Param #
=================================================================
conv1 (Conv3D) (None, 155, 165, 195, 8) 1728008
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 51, 55, 65, 8) 0
_________________________________________________________________
conv2 (Conv3D) (None, 51, 55, 65, 16) 2000016
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 25, 27, 32, 16) 0
_________________________________________________________________
conv3 (Conv3D) (None, 25, 27, 32, 32) 1728032
_________________________________________________________________
max_pooling3d_2 (MaxPooling3 (None, 12, 13, 16, 32) 0
_________________________________________________________________
conv3d (Conv3D) (None, 12, 13, 16, 64) 2048064
_________________________________________________________________
max_pooling3d_3 (MaxPooling3 (None, 6, 6, 8, 64) 0
_________________________________________________________________
conv3d_1 (Conv3D) (None, 6, 6, 8, 128) 1024128
_________________________________________________________________
max_pooling3d_4 (MaxPooling3 (None, 3, 3, 4, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 4608) 0
_________________________________________________________________
dense (Dense) (None, 2048) 9439232
_________________________________________________________________
dense_1 (Dense) (None, 512) 1049088
_________________________________________________________________
predictions (Dense) (None, 1) 513
=================================================================
Total params: 19,017,081
Trainable params: 19,017,081
Non-trainable params: 0
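The layer definitions are not in these slides; below is a hypothetical Keras reconstruction that reproduces the output shapes and parameter counts of the summary above. The (unusually large) cubic kernel sizes 60/25/15/10/5 were back-computed from the parameter counts, and the activations are guesses.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv3D(8, 60, padding='same', activation='relu',
                  input_shape=(155, 165, 195, 1), name='conv1'),
    layers.MaxPooling3D(pool_size=3),
    layers.Conv3D(16, 25, padding='same', activation='relu', name='conv2'),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(32, 15, padding='same', activation='relu', name='conv3'),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(64, 10, padding='same', activation='relu'),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(128, 5, padding='same', activation='relu'),
    layers.MaxPooling3D(pool_size=2),
    layers.Flatten(),
    layers.Dense(2048, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1, activation='sigmoid', name='predictions'),  # binary output assumed
])
model.summary()   # matches the shapes and ~19M parameters listed above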
[Diagram: alternative CNN input pipeline. Original 256x256x256 volumes are cropped to 200^3 and resized to 150^3, 100^3 or 50^3, giving an input tensor of shape 1108 x X^3]
Conclusions
- It is possible to reduce the dimension using simple operations like cropping and resizing; however, this also means we lose information.
- The aspect ratio is not preserved during the resize step (warping).
- PCA helps us reduce the number of features by finding a subspace in which the classes can be separated, using the principal components as new axes for our data.
- PCA can be used for different purposes depending on the application.
- The computation of PCA is fairly expensive and can be sped up by using the SVD algorithm.
- Visualization of volumes is expensive (GPU-based solutions help).
- Feature selection/extraction should be fitted ONLY on the training data within each cross-validation fold.
- Nested cross-validation is needed in order to eliminate the bias coming from the parameter search (learning from only a piece of the data).
- PCA should be applied with care: done wrong, it introduces bias (a leak of knowledge).
- Nested cross-validation is computationally expensive due to the inner and outer loops (parameter search + cross-validation).
- Work with cropped 3D volumes in future developments: about 5 million voxels instead of 17 million (a starting point for convolutional neural networks).
Official presentation file by Gabriel Munoz