Why dimensionality reduction?
E.g. 3D image processing: 256*256*256 = 16,777,216 features per volume.
Avoid the curse of dimensionality.
Some dimensionality reduction techniques:
The Human Connectome Project: 1113 subjects' T1-weighted (T1w) structural brain images. These 3D MPRAGE images were acquired on Siemens 3T scanners using a 32-channel head coil.
import nibabel as nib
from scipy.ndimage import zoom

img = nib.load(nii_path)                    # nii_path: path to one subject's NIfTI file (hypothetical name)
data = img.get_data()                       # (256, 256, 256)
cropped = data[50:205, 60:225, 30:225]      # (155, 165, 195) -> 4,987,125 voxels
resampled = zoom(cropped, (50/cropped.shape[0],
                           50/cropped.shape[1],
                           50/cropped.shape[2]))   # (50, 50, 50) -> 125,000 voxels
vector = resampled.flatten()                # (125000,) vector
[Figure: "Loading, cropping, resizing and saving". Each 256x256x256 original volume is cropped to 155x165x195, resized to 50x50x50, and flattened; the 1113 flattened volumes are stacked into a 1113x125000 matrix and saved as file.npy.]
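A minimal sketch of that loading/cropping/resizing/saving loop, assuming the volumes live in the brain_nii folder used later in the labeling code, and reusing the crop and zoom steps above:

import os
import numpy as np
import nibabel as nib
from scipy.ndimage import zoom

rows = []
for name in sorted(os.listdir('brain_nii')):
    data = nib.load(os.path.join('brain_nii', name)).get_data()   # (256, 256, 256)
    cropped = data[50:205, 60:225, 30:225]                        # (155, 165, 195)
    resampled = zoom(cropped, (50/cropped.shape[0],
                               50/cropped.shape[1],
                               50/cropped.shape[2]))              # (50, 50, 50)
    rows.append(resampled.flatten())                              # 125,000 voxels
X = np.array(rows)                                                # (1113, 125000)
np.save('file.npy', X)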
[Figure: the saved 1113x125000 matrix (one flattened 50x50x50 volume per row) is loaded and fed to the ML methods: PCA, SVM, KNN.]
Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
Reduce the number of features in the original data to a series of principal components that capture most of the variance (the axes of a subspace).
Try to find a set of "standardized brain ingredients".
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
principalComponents = pca.fit_transform(X)   # (1113, 125000) -> (1113, 10)
l = pca.explained_variance_ratio_
sum(l)   # 0.412558136188802 -- 10 components retain about 41% of the variance
import os
import pandas as pd

subjects = pd.read_csv('subjects.csv')
columns = subjects[['Subject', 'class']]
labels = columns.set_index('Subject').T.to_dict('list')
# {585256: [-1], 585862: [-1], 586460: [-1], 587664: [-1], 588565: [1], ...}

dirs = os.listdir("brain_nii")
colors = []
for name in dirs:
    id, ext = name.split('.')
    try:
        colors.append(labels[int(id)][0])
    except KeyError:          # subject not found in subjects.csv
        colors.append(0)      # 0 marks unlabeled examples
# [1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 1, ...]  labels: 1, 0 (unlabeled), -1
dim = 50
approximation = pca.inverse_transform(principalComponents)   # back to (1113, 125000)
restored = approximation.reshape(-1, dim, dim, dim)          # (1113, 50, 50, 50)
[Figure: reconstructions from 10, 100, and 500 principal components, which retain about 0.41, 0.59, and 0.81 of the variance respectively.]
[Figure: computing eigenbrains. The 1113x125000 data matrix is transposed to 125000x1113; PCA keeping 99.9% of the variance reduces it to 125000x809; transposing and reshaping each of the 809 components to 50x50x50 yields eigenbrain 1 through eigenbrain 809.]
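A minimal sketch of that eigenbrain computation, assuming X is the 1113x125000 data matrix (pca_eb is a hypothetical name):

import numpy as np
from sklearn.decomposition import PCA

Xt = X.T                                           # (125000, 1113)
pca_eb = PCA(0.999)                                # keep 99.9% of the variance
reduced = pca_eb.fit_transform(Xt)                 # (125000, 809)
eigenbrains = reduced.T.reshape(-1, 50, 50, 50)    # 809 volumes of 50x50x50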
[Figure: explained variance ratios of eigenbrains 1 to 4: 0.417, 0.069, 0.035, 0.022.]
# frame: DataFrame of the flattened volumes with their 'class' labels (assumed to be defined earlier)
class1 = frame[frame['class'] == 1]
print("Total class 1:", len(class1))
class2 = frame[frame['class'] == 0]
print("Total class 0:", len(class2))
# Total class 1: 550
# Total class 0: 558

from sklearn.preprocessing import normalize
average_brain1 = normalize(class1).mean(0)
average_brain2 = normalize(class2).mean(0)
avg_vol1 = average_brain1.reshape(50, 50, 50)
print(avg_vol1.shape)   # (50, 50, 50)
avg_vol2 = average_brain2.reshape(50, 50, 50)
print(avg_vol2.shape)   # (50, 50, 50)
[Figure: the average brain of class 1 (c1), the average brain of class 2 (c2), and their difference (diff).]
PCA, Eigenbrains

Post-PCA Data Preprocessing
After PCA: data set of dimension 1113x1062, stored as a NumPy array ('PCA.npy').
Before classification:
- Store the indices of unlabeled examples and remove them from the data set
- Replace the -1 labels with 0 for the classification training process (dummy labels 0 and 1 are more convenient)
- Concatenate the data set with the corresponding labels and rename the columns PCA1, PCA2, ..., PCA1062, CLASS

Remove unlabeled samples from the data set:
import numpy as np

data = pd.DataFrame(np.load('PCA.npy'))
y = np.load('Y.npy')
y = pd.DataFrame(y)
dataset = pd.concat([data, y], axis=1)
index_remove = list(y[y[0] == 0].index)
print(index_remove)   # [71, 239, 277, 330, 529]
dataset = dataset.drop(index_remove)
Rename the columns, then replace the -1 labels with 0 for the classification training process:

a = ['PCA' + str(i+1) for i in range(1063)]      # 1062 PCA columns + 1 label column
a[-1] = 'CLASS'                                  # columns: PCA1, ..., PCA1062, CLASS
dataset.columns = a
dataset['CLASS'].replace(-1, 0, inplace=True)    # dummy labels 0 and 1
Problem: the data set is ordered by class label, so the examples of the two classes are interleaved before saving:
class1 = dataset[dataset['CLASS'] == 1]
print("Total class 1:", len(class1))
class0 = dataset[dataset['CLASS'] == 0]
print("Total class 0:", len(class0))

df = []
for i in range(550):                  # alternate class 1 / class 0 examples
    df.append(class1.iloc[i].values)
    df.append(class0.iloc[i].values)
for j in range(550, 558):             # append the remaining class 0 examples
    df.append(class0.iloc[j].values)

np.save('X_y.npy', df)
data = pd.DataFrame(np.load('X_y.npy'))
data.iloc[:, -1] = data.iloc[:, -1].astype(int)
y = np.array(data.iloc[:, -1].replace(-1, 0)).ravel()
X = np.array(data.iloc[:, :-1])
Resampling: the process of repeatedly drawing samples from the data and refitting a given model on each sample.
Resampling methods help to estimate test performance (model assessment) and to choose model flexibility (model selection).
But: they can be computationally expensive, since the model is refit many times.
+ Validation set
+ Leave-one-out cross-validation (LOOCV)
+ K-fold cross-validation
+ The Bootstrap
train_percentage = 0.7
validation_percentage = 0.2
test_percentage = 0.1
# 1108 labeled examples to split
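A minimal sketch of such a 70/20/10 split, assuming sklearn's train_test_split applied twice:

from sklearn.model_selection import train_test_split

# First carve off the 10% test set, then split the remainder 70/20.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2/0.9, random_state=0)   # 0.2 of the original data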
Problem: high bias and high variability in error/accuracy between the splits.
E.g. for a Support Vector Machine with a linear kernel:
Random split 1:
Random split 2:
Random split 3:
A result that depends on a single random split is potentially biased.
Leave-one-out cross-validation withholds only a single observation for the validation set.
For our data set, that means the model is fit once per example: over 1,100 fits!
Advantage: far less bias than a single validation split, and no randomness in the splits.
Disadvantage: computationally expensive, since the model must be refit n times.
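A minimal LOOCV sketch using sklearn's LeaveOneOut; the classifier here is an assumption (reusing the regularized logistic regression from later in these notes):

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

loo = LeaveOneOut()
clf = LogisticRegression(solver='liblinear', penalty='l1', C=1/2500)  # assumed model
scores = cross_val_score(clf, X, y, cv=loo)   # one fit per example
print(scores.mean())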
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
kf.get_n_splits(X)

Each split yields train_index and test_index, which index into X and y:

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Problems: over-fitting vs under-fitting
Bias: error from overly simplistic assumptions in the model => high error on both training and test data.
Variance: error from excessive sensitivity to the training data => performs very well on training data but badly on test data.
Underfitting happens when a model is unable to capture the underlying pattern of the data => high bias, low variance.
Overfitting happens when a model captures the noise along with the underlying pattern in the data, i.e. the model is trained too closely to noisy data => low bias, high variance.
Why over-fitting:
...
Why under-fitting:
E.g. 0 for a non-spam email, 1 for a spam email => a binary classification problem.
Apply the sigmoid (logistic) function:
    g(z) = 1 / (1 + e^(-z))
where z = theta^T x.
Hypothesis or predicted output:
    h_theta(x) = g(theta^T x) = 1 / (1 + e^(-theta^T x))
Cost function:
    J(theta) = -(1/m) * sum_{i=1..m} [ y_i * log(h_theta(x_i)) + (1 - y_i) * log(1 - h_theta(x_i)) ]
To address overfitting (our data set includes up to 1062 features), an L1 regularization term is added:
    J(theta) = -(1/m) * sum_i [ ... ] + (lambda/m) * sum_{j=1..n} |theta_j|
Note:
- Too large a lambda can lead to under-fitting.
- Too small a lambda can leave the model still over-fitting.
Apply Logistic Regression
from sklearn.linear_model import LogisticRegression

def log_reg(X_train, X_test, y_train, y_test, reg_term):
    classifier = LogisticRegression(solver='liblinear', penalty='l1',
                                    C=(1/reg_term), fit_intercept=True, max_iter=500)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)
Returns the accuracy of the model on the training and test sets.
Note: C = 1/(regularization term), so a larger lambda means stronger regularization.
Define the best regularization term "lambda" for logistic regression model using cross validation
Number of folds: k=5
kf = KFold(n_splits=5)
List of regularization terms:
param_logreg=[0.001*10**i for i in range(10)]
Step 1: find the range with the best performance over both training and test sets.

training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plot the averaged training and test accuracy over the range of regularization terms after running through 5 folds:

import matplotlib.pyplot as plt

avg_training_acc = np.mean(np.array(training_acc_col), axis=0)
avg_test_acc = np.mean(np.array(test_acc_col), axis=0)
x_i = [i for i in range(len(param_logreg))]
plt.figure(figsize=(10, 10))
plt.plot(x_i, avg_training_acc, marker='o', mfc='blue', color='r')
plt.plot(x_i, avg_test_acc, marker='X', mfc='red', color='black')
plt.xticks(x_i, param_logreg)
plt.legend(['Training Accuracy', 'Test Accuracy'])
plt.xlabel('The Value of Regularization Terms', size=20)
plt.ylabel('Accuracy Rate', size=20)
plt.show()
Step 2: find the exact value of the best regularization term (we search from 1000 to 10000).

param_logreg = list(range(1000, 10000, 500))
training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged accuracy
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_logreg))]
plt.figure(figsize=(10,10))
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
plt.xticks(x_i,param_logreg)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.legend(['Training Accuracy','Test Accuracy'])
# plt.xlim(min(param_logreg)/10,max(param_logreg))
plt.show()
Best regularization term: log_reg_term = 2500
Averaged training accuracy: 83.78%
Averaged test accuracy: 80.51%
Cost function: the hinge loss,
    cost_1(z) = max(0, 1 - z)   (for y = 1),   cost_0(z) = max(0, 1 + z)   (for y = 0)
Objective function:
    min_theta  C * sum_{i=1..m} [ y_i * cost_1(theta^T x_i) + (1 - y_i) * cost_0(theta^T x_i) ] + (1/2) * sum_{j=1..n} theta_j^2
If C = 1/lambda, minimizing the SVM form C*A + B and the regularized form A + lambda*B gives exactly the same parameters.
C too large: over-fitting (weak regularization).
C too small: under-fitting (strong regularization).
Instead of A + lambda*B, the SVM convention uses C*A + B.
Applying Support Vector Machine with Linear Kernel
Number of folds (iterations): k = 5

kf = KFold(n_splits=5)

List of regularization terms (reconstructed from the plot's axis labels, C ranging over 10^3 ... 10^12):

param_linear_svc = [10**i for i in range(3, 13)]

=> Define the best regularization term.
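The linear_svc helper is not shown in the slides; a minimal sketch, assuming sklearn's SVC with a linear kernel and the regularization parameter passed straight through as C (step 2 below refers to the parameter as C):

from sklearn.svm import SVC

def linear_svc(X_train, X_test, y_train, y_test, C):
    clf = SVC(kernel='linear', C=C)
    clf.fit(X_train, y_train)
    return clf.score(X_train, y_train), clf.score(X_test, y_test)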
Define the best range of regularization terms
training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged training and test accuracy after running through 5 folds:
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_linear_svc))]
plt.figure(figsize=(10,10))
ax = plt.subplot(111)
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
xlabel=['10^'+str(i) for i in range(3,13)]
plt.xticks(x_i,xlabel)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.show()
Find the exact value of the best regularization term (we choose from C = 10^7 to 10^8):

training_acc_col = []
test_acc_col = []
param_linear_svc = list(range(10**7, 10**8 + 1, 10**7))
print('Regularization terms:', param_linear_svc)
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged training and test accuracy after running through 5 folds:
Best regularization term: linear_svc_reg = 3*10**7
Averaged training accuracy: 86.98%
Averaged test accuracy: 80.69%
A Gaussian kernel is a function whose value depends on the distance from the origin or from some point:
    K(x, x') = exp(-||x - x'||^2 / (2*sigma^2)) = exp(-gamma * ||x - x'||^2)
Gamma defines how far the influence of a single training example reaches: low values mean 'far', high values mean 'close'.
E.g. a small sigma (i.e. a large gamma) means small variance: two points must be very close to be considered similar.
Applying Support Vector Machine with Gaussian Kernel
Number of folds (iterations): k = 5

kf = KFold(n_splits=5)

List of regularization terms and gammas:

reg_gaussian_svc = [1, 10, 100, 1000]
gamma_gaussian_svc = [10000, 100000, 1000000, 10000000, 100000000, 1000000000]

=> Define the best combination of regularization term and gamma.
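As with linear_svc, the gaussian_svc helper is not shown; a minimal sketch, assuming sklearn's SVC with an RBF kernel and the C and gamma values passed straight through (the original helper's exact parameter convention is an assumption):

from sklearn.svm import SVC

def gaussian_svc(X_train, X_test, y_train, y_test, C, gamma):
    clf = SVC(kernel='rbf', C=C, gamma=gamma)
    clf.fit(X_train, y_train)
    return clf.score(X_train, y_train), clf.score(X_test, y_test)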
training_acc_col = []
test_acc_col = []
for k, (train_index, test_index) in enumerate(kf.split(X)):
    print('FOLD:', k)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in reg_gaussian_svc:
        training_acc_col.append([])
        test_acc_col.append([])
        for j in gamma_gaussian_svc:
            print(j)
            training_acc, test_acc = gaussian_svc(X_train, X_test, y_train, y_test, i, j)
            training_acc_col[-1].append(training_acc)
            test_acc_col[-1].append(test_acc)
avg_training_acc=np.mean(np.array(training_acc_col).reshape((5,-1)),axis=0).reshape(4,6)
avg_test_acc=np.mean(np.array(test_acc_col).reshape((5,-1)),axis=0).reshape(4,6)
Plotting the averaged training and test accuracy after running through 5 folds:
reg_gaussian_svc_i,gamma_gaussian_svc_i=np.meshgrid(np.array(reg_gaussian_svc),np.array(gamma_gaussian_svc),indexing='ij')
x_i=[i for i in range(len(reg_gaussian_svc))]
y_i=[i for i in range(len(gamma_gaussian_svc))]
x_ii,y_ii=np.meshgrid(x_i,y_i,indexing='ij')
plt.figure(figsize=(10,10))
plt.contourf(x_ii,y_ii,avg_training_acc)
plt.colorbar()
plt.xticks(x_i,reg_gaussian_svc)
plt.yticks(y_i,gamma_gaussian_svc)
plt.show()
plt.figure(figsize=(10,10))
plt.contourf(x_ii,y_ii,avg_test_acc)
plt.colorbar()
plt.xticks(x_i,reg_gaussian_svc)
plt.yticks(y_i,gamma_gaussian_svc)
plt.show()
best_reg = 0
best_gamma = 0
training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in reg_gaussian_svc:
        for j in gamma_gaussian_svc:
            training_acc, test_acc = gaussian_svc(X_train, X_test, y_train, y_test, i, j)
            test_acc_col.append(test_acc)

avg_test_acc = np.mean(np.array(test_acc_col).reshape((5, -1)), axis=0).reshape(4, 6)
best_test_accuracy = np.max(avg_test_acc)
i, j = np.unravel_index(np.argmax(avg_test_acc), avg_test_acc.shape)
best_reg = reg_gaussian_svc[i]
best_gamma = gamma_gaussian_svc[j]
Regularization terms: Regs = [1, 10, 100, 1000], Gammas = [10000, 100000, 1000000, 10000000, 100000000, 1000000000]
Best test accuracy: 0.8014716073539603
Best regularization term: 1
Best gamma: 100000000
K-Nearest Neighbors (KNN)
In KNN, the input consists of the k closest training examples in the feature space.
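A minimal KNN sketch under the same 5-fold setup, assuming sklearn's KNeighborsClassifier (the value of k here is illustrative, not the original choice):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 is an arbitrary example
print(cross_val_score(knn, X, y, cv=kf).mean())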
Modified pipeline: PCA should be fitted only on the training set and then applied to the held-out fold; otherwise information leaks from the test fold, and the tuned "best" parameters are best only for that test fold!
=> PCA + nested cross-validation (see the sketch below).
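A minimal sketch of PCA + nested cross-validation with sklearn, using a Pipeline so PCA is refit on each training fold; the number of components and the grid of C values are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

pipe = Pipeline([('pca', PCA(n_components=100)),        # PCA fit inside each training fold
                 ('clf', LogisticRegression(solver='liblinear', penalty='l1'))])
grid = GridSearchCV(pipe, {'clf__C': [1/100, 1/2500, 1/10000]},
                    cv=KFold(n_splits=5))               # inner loop: tune C
scores = cross_val_score(grid, X, y, cv=KFold(n_splits=5))   # outer loop: assess
print(scores.mean())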
[Figure: CNN input. Each of the 1108 original 256x256x256 volumes is cropped to 155x165x195; the input tensor has shape 1108x155x165x195x1.]
Layer (type) Output Shape Param #
=================================================================
conv1 (Conv3D) (None, 155, 165, 195, 8) 1728008
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 51, 55, 65, 8) 0
_________________________________________________________________
conv2 (Conv3D) (None, 51, 55, 65, 16) 2000016
_________________________________________________________________
max_pooling3d_1 (MaxPooling3D) (None, 25, 27, 32, 16) 0
_________________________________________________________________
conv3 (Conv3D) (None, 25, 27, 32, 32) 1728032
_________________________________________________________________
max_pooling3d_2 (MaxPooling3D) (None, 12, 13, 16, 32) 0
_________________________________________________________________
conv3d (Conv3D) (None, 12, 13, 16, 64) 2048064
_________________________________________________________________
max_pooling3d_3 (MaxPooling3D) (None, 6, 6, 8, 64) 0
_________________________________________________________________
conv3d_1 (Conv3D) (None, 6, 6, 8, 128) 1024128
_________________________________________________________________
max_pooling3d_4 (MaxPooling3D) (None, 3, 3, 4, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 4608) 0
_________________________________________________________________
dense (Dense) (None, 2048) 9439232
_________________________________________________________________
dense_1 (Dense) (None, 512) 1049088
_________________________________________________________________
predictions (Dense) (None, 1) 513
=================================================================
Total params: 19,017,081
Trainable params: 19,017,081
Non-trainable params: 0
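The layer hyperparameters are not shown, but they can be inferred from the summary: conv1's 1,728,008 params imply 8 filters of size 60x60x60 over 1 input channel ((60^3*1 + 1)*8), the unchanged spatial shapes imply 'same' padding, and the pooled shapes imply pool sizes 3, 2, 2, 2, 2. A sketch that reproduces the summary (activations are assumptions; the original choices are not shown):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv3D(8, 60, padding='same', activation='relu',      # assumed activation
                  input_shape=(155, 165, 195, 1), name='conv1'),
    layers.MaxPooling3D(pool_size=3),
    layers.Conv3D(16, 25, padding='same', activation='relu', name='conv2'),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(32, 15, padding='same', activation='relu', name='conv3'),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(64, 10, padding='same', activation='relu'),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(128, 5, padding='same', activation='relu'),
    layers.MaxPooling3D(pool_size=2),
    layers.Flatten(),
    layers.Dense(2048, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1, activation='sigmoid', name='predictions'),   # assumed binary output
])
model.summary()   # reproduces the parameter counts above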
[Figure: alternative CNN inputs. Each of the 1108 original 256x256x256 volumes is cropped to 200^3, then resized to 150^3, 100^3, or 50^3, giving input tensors of shape 1108 x X^3 x 1.]