E.g. 3D image processing: 256×256×256 = 16,777,216 features
Avoid the curse of dimensionality:
Some dimensionality reduction techniques:
Why dimensionality reduction...
The Human Connectome Project: 1113 subjects' T1-weighted (T1w) structural brain images. These 3D MPRAGE images were obtained on Siemens 3T platforms using a 32-channel head coil.
import nibabel as nib
from scipy.ndimage import zoom

img = nib.load('brain_nii/585256.nii')          # one subject's T1w volume
data = img.get_data()                           # (256, 256, 256) array
cropped = data[50:205, 60:225, 30:225]          # (155, 165, 195) array, 4987125 voxels
resampled = zoom(cropped, (50 / cropped.shape[0],
                           50 / cropped.shape[1],
                           50 / cropped.shape[2]))  # (50, 50, 50) array, 125000 voxels
vector = resampled.flatten()                    # (125000,) vector
[Figure: loading, cropping, resizing and saving — per subject: Original 256x256x256 -> Cropped 155x165x195 -> Resized 50x50x50 -> Flatten to 1x125000; stacking all 1113 subjects gives a 1113x125000 array saved to file.npy]
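A minimal sketch of that batch step, assuming every volume sits in brain_nii/ (the directory used by the labeling code below) and the output name file.npy from the diagram:

import os
import numpy as np
import nibabel as nib
from scipy.ndimage import zoom

rows = []
for name in sorted(os.listdir('brain_nii')):
    img = nib.load(os.path.join('brain_nii', name))
    data = img.get_data()                          # (256, 256, 256)
    cropped = data[50:205, 60:225, 30:225]         # (155, 165, 195)
    resampled = zoom(cropped, (50 / cropped.shape[0],
                               50 / cropped.shape[1],
                               50 / cropped.shape[2]))  # (50, 50, 50)
    rows.append(resampled.flatten())               # (125000,)

X = np.array(rows)                                 # (1113, 125000)
np.save('file.npy', X)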
Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
Reduce the number of features in the original data to a series of principal components that capture most of the variance (the axes of a subspace)
Try to find a set of "standardized brain ingredients".
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
principalComponents = pca.fit_transform(X)   # (1113, 125000) -> (1113, 10)
l = pca.explained_variance_ratio_
sum(l)   # 0.412558136188802 -- 10 components keep ~41% of the variance
import pandas as pd

subjects = pd.read_csv('subjects.csv')
columns = subjects[['Subject', 'class']]
labels = columns.set_index('Subject').T.to_dict('list')   # {subject_id: [label]}

import os
dirs = os.listdir("brain_nii")
{585256: [-1], 585862: [-1], 586460: [-1], 587664: [-1], 588565: [1],...}
colors = []
for name in dirs:
    id, ext = name.split('.')
    try:
        colors.append(labels[int(id)][0])
    except KeyError:                 # subject missing from subjects.csv -> unlabeled
        colors.append(0)
# [1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 1, ...]  -- 1, 0 (unlabeled), -1
dim = 50
approximation = pca.inverse_transform(principalComponents)   # back to 125000 features
restored = approximation.reshape(-1, dim, dim, dim)          # (1113, 50, 50, 50)
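To eyeball reconstruction quality, plot a slice of an original volume next to its restored counterpart; a minimal sketch (slice index 25, the middle of the volume, is an arbitrary choice):

import matplotlib.pyplot as plt

orig_volume = X[0].reshape(dim, dim, dim)        # first subject, before PCA
fig, axes = plt.subplots(1, 2)
axes[0].imshow(orig_volume[25], cmap='gray')     # middle axial slice, original
axes[0].set_title('Original')
axes[1].imshow(restored[0][25], cmap='gray')     # same slice after the PCA round trip
axes[1].set_title('Restored')
plt.show()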
[Figure: reconstructions with 10, 100, and 500 principal components — explained variance ~0.41, ~0.59, and ~0.81 respectively]
[Figure: eigenbrain pipeline — the 1113x125000 data matrix is transposed to 125000x1113; PCA(99.9) keeps 99.9% of the variance, giving 125000x809; transposing and reshaping yields 809 volumes of 50x50x50, from eigen brain 1 to eigen brain 809]
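A minimal sketch of that diagram, reusing the X above (passing a float below 1 asks sklearn's PCA to keep that fraction of the variance):

from sklearn.decomposition import PCA

pca_brains = PCA(0.999)                            # keep 99.9% of the variance
scores = pca_brains.fit_transform(X.T)             # (125000, 1113) -> (125000, 809)
eigen_brains = scores.T.reshape(-1, 50, 50, 50)    # 809 eigenbrains of 50x50x50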
[Figure: explained variance ratio of the first four principal components — 1: 0.417, 2: 0.069, 3: 0.035, 4: 0.022]
PCA, Eigen Brains
Post-PCA Data Preprocessing
After PCA: data set of dimension 1113x1062 ('PCA.npy'), stored as an array (.npy format)
Classification steps:
- Store the indices of the unlabeled examples and remove them from the data set
- Replace examples of label -1 with 0 for the classification training process (it is better to use dummy labels 0 and 1)
- Concatenate the data set with the corresponding labels and rename the columns PCA1, PCA2, ..., PCA1062, CLASS
(Remove unlabeled samples from the data set)
data = pd.DataFrame(np.load('PCA.npy'))    # (1113, 1062) PCA features
y = np.load('Y.npy')
y = pd.DataFrame(y)
dataset = pd.concat([data, y], axis=1)
index_remove = list(y[y[0] == 0].index)    # indices of the unlabeled examples
print(index_remove)                        # [71, 239, 277, 330, 529]
dataset = dataset.drop(index_remove)
Replace examples of label -1 with 0 for the classification training process (it is better to use dummy labels 0 and 1):
a = ['PCA' + str(i+1) for i in range(1063)]    # 1062 PCA columns + 1 label column
a[-1] = 'CLASS'
dataset.columns = a
dataset['CLASS'].replace(-1, 0, inplace=True)  # rename first so the label column exists as 'CLASS'
Problem: the data set is ordered by label, so the two classes must be interleaved.
class1 = dataset[dataset['CLASS'] == 1]
print("Total class 1:", len(class1))   # 550
class0 = dataset[dataset['CLASS'] == 0]
print("Total class 0:", len(class0))   # 558

df = []
for i in range(550):                   # alternate class-1 and class-0 examples
    df.append(class1.iloc[i].values)
    df.append(class0.iloc[i].values)
for j in range(550, 558):              # append the remaining class-0 examples
    df.append(class0.iloc[j].values)

np.save('X_y.npy', df)

data = pd.DataFrame(np.load('X_y.npy'))
data.iloc[:, -1] = data.iloc[:, -1].apply(lambda i: int(i))   # labels back to int
y = np.array(data.iloc[:, -1].replace(-1, 0)).ravel()
X = np.array(data.iloc[:, :-1])
Resampling: the process of repeatedly drawing samples from the data and refitting a given model on each sample.
Resampling methods help:
But:
+ Validation set
+ Leave-one-out cross-validation (LOOCV)
+ K-fold cross-validation
+ The bootstrap
(A minimal sklearn sketch of the first three follows this list.)
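A minimal sketch, assuming the X and y built above (the bootstrap is omitted):

from sklearn.model_selection import train_test_split, LeaveOneOut, KFold

# Validation set: a single random split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3)

# LOOCV: one fit per example
loo = LeaveOneOut()
print(loo.get_n_splits(X))   # equals len(X)

# K-fold: k fits, each holding out one fold
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    pass   # fit and score a model on each fold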
train_percentage = 0.7
validation_percentage = 0.2
test_percentage = 0.1
1108 examples
Problem: High bias and variability in error/accuracy between the splits.
E.g. for a Support Vector Machine with a linear kernel:
Random split 1:
Random split 2:
Random split 3:
A result that depends heavily on a single split is potentially biased.
Leave-one-out cross-validation withholds only a single observation for the validation set.
For our data set, that means the model is fit at least 1103 times!
Advantage:
Disadvantage:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
kf.get_n_splits(X)   # 5

# Each iteration yields the train/test index arrays for one fold
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Problems: Over-fitting vs Under-fitting
Bias:
=> high error on both training and test data
Variance:
=> performs very well on training data but badly on test data
Underfitting happens when a model is unable to capture the underlying pattern of the data => high bias, low variance
Overfitting happens when a model captures the noise along with the underlying pattern in the data; in other words, the model is over-trained on noisy data => low bias, high variance
Why over-fitting:
...
Why under-fitting:
E.g. 0 as non-spam email
1 as spam email
=> binary classification problem
Apply the sigmoid (logistic) function
Hypothesis or predicted output: the sigmoid applied to a linear combination of the features
Cost function: the cross-entropy loss (standard forms below)
To address overfitting, since our dataset includes up to 1062 features,
=> an L1 regularization term is added
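For reference, the standard forms (a reconstruction with regularization strength \lambda, not copied from the slides):

g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^{T} x)

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{m} \sum_{j=1}^{n} \lvert \theta_j \rvert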
Note:
Too large a lambda could lead to under-fitting
Too small a lambda could leave the model still over-fitting
Apply Logistic Regression
from sklearn.linear_model import LogisticRegression

def log_reg(X_train, X_test, y_train, y_test, reg_term):
    # liblinear supports the L1 penalty; C is the inverse of the regularization term
    classifier = LogisticRegression(solver='liblinear', penalty='l1',
                                    C=(1 / reg_term), fit_intercept=True,
                                    max_iter=500)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)
Returns the accuracy of the model on the training and test sets
Note:
C=1/(regularization term)
Define the best regularization term "lambda" for the logistic regression model using cross-validation
Number of folds: k=5
kf = KFold(n_splits=5)
List of regularization terms:
param_logreg = [0.001 * 10**i for i in range(10)]   # 0.001 ... 1e6
Step 1: find the range of regularization terms with the best performance over both training and test sets
training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plot averaged training accuracy & test accuracy over the range of regularization terms after running through 5 folds
avg_training_acc = np.mean(np.array(training_acc_col), axis=0)
avg_test_acc = np.mean(np.array(test_acc_col), axis=0)

x_i = [i for i in range(len(param_logreg))]
plt.figure(figsize=(10, 10))
plt.plot(x_i, avg_training_acc, marker='o', mfc='blue', color='r')
plt.plot(x_i, avg_test_acc, marker='X', mfc='red', color='black')
plt.xticks(x_i, param_logreg)
plt.legend(['Training Accuracy', 'Test Accuracy'])
plt.xlabel('The Value of Regularization Terms', size=20)
plt.ylabel('Accuracy Rate', size=20)
plt.show()
Step 2: Find the exact value of the best regularization term (we choose from 1000 to 10000)
param_logreg = [i for i in range(1000, 10000, 500)]

training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged accuracy
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_logreg))]
plt.figure(figsize=(10,10))
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
plt.xticks(x_i,param_logreg)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.legend(['Training Accuracy','Test Accuracy'])
# plt.xlim(min(param_logreg)/10,max(param_logreg))
plt.show()
Regularization term: log_reg_term=2500
Averaged training accuracy: 83.78%
Averaged test accuracy: 80.51%
Cost Function
Objective Function
If C = 1/λ, minimizing CA + B yields the same optimal parameters as minimizing A + λB, so an SVM with a linear kernel behaves much like regularized logistic regression (the loss functions differ, so the parameters are similar rather than exactly the same)
C too large: over-fitting (weak regularization)
C too small: under-fitting (strong regularization)
Instead of A + λB, we use CA + B
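For reference, the standard linear-SVM objective in the CA + B form (a reconstruction, with cost_1 and cost_0 the hinge losses for the two labels):

\min_{\theta} \; C \sum_{i=1}^{m} \Big[ y^{(i)} \,\mathrm{cost}_1(\theta^{T} x^{(i)}) + (1 - y^{(i)}) \,\mathrm{cost}_0(\theta^{T} x^{(i)}) \Big] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}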
kf = KFold(n_splits=5)
Applying Support Vector Machine with a Linear Kernel
Number of folds (iterations): k = 5
List of regularization terms: C = 10^3, 10^4, ..., 10^12
=> Define the best regularization term
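The helper linear_svc is not shown here; a minimal sketch mirroring log_reg, assuming sklearn's SVC with a linear kernel (C is passed directly, so larger values mean weaker regularization):

from sklearn.svm import SVC

def linear_svc(X_train, X_test, y_train, y_test, reg_term):
    classifier = SVC(kernel='linear', C=reg_term)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)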
Define the best range of regularization terms
param_linear_svc = [10**i for i in range(3, 13)]   # C = 10^3 ... 10^12, matching the plot labels

training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged training and test accuracy after running through 5 folds
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_linear_svc))]
plt.figure(figsize=(10,10))
ax = plt.subplot(111)
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
xlabel=['10^'+str(i) for i in range(3,13)]
plt.xticks(x_i,xlabel)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.show()
Find the exact value of the best regularization term (we choose from C = 10^7 to 10^8)
training_acc_col = []
test_acc_col = []
param_linear_svc = [i for i in range(10**7, 10**8 + 1, 10**7)]
print('Regularization terms:', param_linear_svc)
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged training and test accuracy after running through 5 folds
Regularization term: linear_svc_reg = 3*10**7
Averaged training accuracy: 86.98%
Averaged test accuracy: 80.69%
A Gaussian kernel is a function whose value depends on the distance from the origin or from some other point.
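The standard Gaussian (RBF) kernel form, for reference:

K(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}} \right)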