E.g. 3D image processing: 256×256×256 = 16,777,216 features
Avoid the curse of dimensionality:
Some dimensionality reduction techniques:
Why dimensionality reduction...
The Human Connectome Project: 1113 subjects' T1-weighted (T1w) structural brain images. These 3D MPRAGE images were obtained on Siemens 3T platforms using a 32-channel head coil.
import nibabel as nib
from scipy.ndimage import zoom

img = nib.load('brain_nii/585256.nii')          # one subject's T1w volume
data = img.get_data()                           # (256, 256, 256) array
cropped = data[50:205, 60:225, 30:225]          # (155, 165, 195) array, 4987125 voxels
resampled = zoom(cropped, (50 / cropped.shape[0],
                           50 / cropped.shape[1],
                           50 / cropped.shape[2]))  # (50, 50, 50) array, 125000 voxels
vector = resampled.flatten()                    # (125000,) vector
[Figure: loading, cropping, resizing and saving — per subject: Original 256x256x256 -> Cropped 155x165x195 -> Resized 50x50x50 -> Flatten to 1x125000; stacking all 1113 subjects gives a 1113x125000 array saved to file.npy]
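A minimal sketch of that batch step, assuming every volume sits in brain_nii/ (the directory used by the labeling code below) and the output name file.npy from the diagram:

import os
import numpy as np
import nibabel as nib
from scipy.ndimage import zoom

rows = []
for name in sorted(os.listdir('brain_nii')):
    img = nib.load(os.path.join('brain_nii', name))
    data = img.get_data()                          # (256, 256, 256)
    cropped = data[50:205, 60:225, 30:225]         # (155, 165, 195)
    resampled = zoom(cropped, (50 / cropped.shape[0],
                               50 / cropped.shape[1],
                               50 / cropped.shape[2]))  # (50, 50, 50)
    rows.append(resampled.flatten())               # (125000,)

X = np.array(rows)                                 # (1113, 125000)
np.save('file.npy', X)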
Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
Reduce the number of features in the original data to a series of principal components that capture most of the variance (the axes of a subspace)
Try to find a set of "standardized brain ingredients".
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
principalComponents = pca.fit_transform(X)   # (1113, 125000) -> (1113, 10)
l = pca.explained_variance_ratio_
sum(l)   # 0.412558136188802 -- 10 components keep ~41% of the variance
import pandas as pd

subjects = pd.read_csv('subjects.csv')
columns = subjects[['Subject', 'class']]
labels = columns.set_index('Subject').T.to_dict('list')   # {subject_id: [label]}

import os
dirs = os.listdir("brain_nii")
{585256: [-1], 585862: [-1], 586460: [-1], 587664: [-1], 588565: [1],...}
colors = []
for name in dirs:
    id, ext = name.split('.')
    try:
        colors.append(labels[int(id)][0])
    except KeyError:                 # subject missing from subjects.csv -> unlabeled
        colors.append(0)
# [1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 1, ...]  -- 1, 0 (unlabeled), -1
dim = 50
approximation = pca.inverse_transform(principalComponents)   # back to 125000 features
restored = approximation.reshape(-1, dim, dim, dim)          # (1113, 50, 50, 50)
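To eyeball reconstruction quality, plot a slice of an original volume next to its restored counterpart; a minimal sketch (slice index 25, the middle of the volume, is an arbitrary choice):

import matplotlib.pyplot as plt

orig_volume = X[0].reshape(dim, dim, dim)        # first subject, before PCA
fig, axes = plt.subplots(1, 2)
axes[0].imshow(orig_volume[25], cmap='gray')     # middle axial slice, original
axes[0].set_title('Original')
axes[1].imshow(restored[0][25], cmap='gray')     # same slice after the PCA round trip
axes[1].set_title('Restored')
plt.show()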
[Figure: reconstructions with 10, 100, and 500 principal components — explained variance ~0.41, ~0.59, and ~0.81 respectively]
[Figure: eigenbrain pipeline — the 1113x125000 data matrix is transposed to 125000x1113; PCA(99.9) keeps 99.9% of the variance, giving 125000x809; transposing and reshaping yields 809 volumes of 50x50x50, from eigen brain 1 to eigen brain 809]
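A minimal sketch of that diagram, reusing the X above (passing a float below 1 asks sklearn's PCA to keep that fraction of the variance):

from sklearn.decomposition import PCA

pca_brains = PCA(0.999)                            # keep 99.9% of the variance
scores = pca_brains.fit_transform(X.T)             # (125000, 1113) -> (125000, 809)
eigen_brains = scores.T.reshape(-1, 50, 50, 50)    # 809 eigenbrains of 50x50x50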
[Figure: explained variance ratio of the first four principal components — 1: 0.417, 2: 0.069, 3: 0.035, 4: 0.022]
PCA, Eigen Brains
Post-PCA Data Preprocessing
After PCA: data set of dimension 1113x1062 ('PCA.npy'), stored as an array (.npy format)
Classification steps:
- Store the indices of the unlabeled examples and remove them from the data set
- Replace examples of label -1 with 0 for the classification training process (it is better to use dummy labels 0 and 1)
- Concatenate the data set with the corresponding labels and rename the columns PCA1, PCA2, ..., PCA1062, CLASS
(Remove unlabeled samples from the data set)
data = pd.DataFrame(np.load('PCA.npy'))    # (1113, 1062) PCA features
y = np.load('Y.npy')
y = pd.DataFrame(y)
dataset = pd.concat([data, y], axis=1)
index_remove = list(y[y[0] == 0].index)    # indices of the unlabeled examples
print(index_remove)                        # [71, 239, 277, 330, 529]
dataset = dataset.drop(index_remove)
Replace examples of label -1 with 0 for the classification training process (it is better to use dummy labels 0 and 1):
a = ['PCA' + str(i+1) for i in range(1063)]    # 1062 PCA columns + 1 label column
a[-1] = 'CLASS'
dataset.columns = a
dataset['CLASS'].replace(-1, 0, inplace=True)  # rename first so the label column exists as 'CLASS'
Problem: the data set is ordered by label, so the two classes must be interleaved.
class1 = dataset[dataset['CLASS'] == 1]
print("Total class 1:", len(class1))   # 550
class0 = dataset[dataset['CLASS'] == 0]
print("Total class 0:", len(class0))   # 558

df = []
for i in range(550):                   # alternate class-1 and class-0 examples
    df.append(class1.iloc[i].values)
    df.append(class0.iloc[i].values)
for j in range(550, 558):              # append the remaining class-0 examples
    df.append(class0.iloc[j].values)

np.save('X_y.npy', df)

data = pd.DataFrame(np.load('X_y.npy'))
data.iloc[:, -1] = data.iloc[:, -1].apply(lambda i: int(i))   # labels back to int
y = np.array(data.iloc[:, -1].replace(-1, 0)).ravel()
X = np.array(data.iloc[:, :-1])
Resampling: the process of repeatedly drawing samples from the data and refitting a given model on each sample.
Resampling methods help:
But:
+ Validation set
+ Leave-one-out cross-validation (LOOCV)
+ K-fold cross-validation
+ The bootstrap
(A minimal sklearn sketch of the first three follows this list.)
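A minimal sketch, assuming the X and y built above (the bootstrap is omitted):

from sklearn.model_selection import train_test_split, LeaveOneOut, KFold

# Validation set: a single random split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3)

# LOOCV: one fit per example
loo = LeaveOneOut()
print(loo.get_n_splits(X))   # equals len(X)

# K-fold: k fits, each holding out one fold
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    pass   # fit and score a model on each fold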
train_percentage = 0.7
validation_percentage = 0.2
test_percentage = 0.1
1108 examples
Problem: High bias and variability in error/accuracy between the splits.
E.g. for a Support Vector Machine with a linear kernel:
Random split 1:
Random split 2:
Random split 3:
A result that depends heavily on a single split is potentially biased.
Leave-one-out cross-validation withholds only a single observation for the validation set.
For our data set, that means the model is fit at least 1103 times!
Advantage:
Disadvantage:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
kf.get_n_splits(X)   # 5

# Each iteration yields the train/test index arrays for one fold
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Problems: Over-fitting vs Under-fitting
Bias:
=> high error on both training and test data
Variance:
=> performs very well on training data but badly on test data
Underfitting happens when a model is unable to capture the underlying pattern of the data => high bias, low variance
Overfitting happens when a model captures the noise along with the underlying pattern in the data; in other words, the model is over-trained on noisy data => low bias, high variance
Why over-fitting:
...
Why under-fitting:
E.g. 0 as non-spam email
1 as spam email
=> binary classification problem
Apply the sigmoid (logistic) function
Hypothesis or predicted output: the sigmoid applied to a linear combination of the features
Cost function: the cross-entropy loss (standard forms below)
To address overfitting, since our dataset includes up to 1062 features,
=> an L1 regularization term is added
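For reference, the standard forms (a reconstruction with regularization strength \lambda, not copied from the slides):

g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^{T} x)

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{m} \sum_{j=1}^{n} \lvert \theta_j \rvert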
Note:
Too large a lambda could lead to under-fitting
Too small a lambda could leave the model still over-fitting
Apply Logistic Regression
from sklearn.linear_model import LogisticRegression

def log_reg(X_train, X_test, y_train, y_test, reg_term):
    # liblinear supports the L1 penalty; C is the inverse of the regularization term
    classifier = LogisticRegression(solver='liblinear', penalty='l1',
                                    C=(1 / reg_term), fit_intercept=True,
                                    max_iter=500)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)
Returns the accuracy of the model on the training and test sets
Note:
C=1/(regularization term)
Define the best regularization term "lambda" for the logistic regression model using cross-validation
Number of folds: k=5
kf = KFold(n_splits=5)
List of regularization terms:
param_logreg = [0.001 * 10**i for i in range(10)]   # 0.001 ... 1e6
Step 1: find the range of regularization terms with the best performance over both training and test sets
training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plot averaged training accuracy & test accuracy over the range of regularization terms after running through 5 folds
avg_training_acc = np.mean(np.array(training_acc_col), axis=0)
avg_test_acc = np.mean(np.array(test_acc_col), axis=0)

x_i = [i for i in range(len(param_logreg))]
plt.figure(figsize=(10, 10))
plt.plot(x_i, avg_training_acc, marker='o', mfc='blue', color='r')
plt.plot(x_i, avg_test_acc, marker='X', mfc='red', color='black')
plt.xticks(x_i, param_logreg)
plt.legend(['Training Accuracy', 'Test Accuracy'])
plt.xlabel('The Value of Regularization Terms', size=20)
plt.ylabel('Accuracy Rate', size=20)
plt.show()
Step 2: Find the exact value of the best regularization term (we choose from 1000 to 10000)
param_logreg = [i for i in range(1000, 10000, 500)]

training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_logreg:
        training_acc, test_acc = log_reg(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged accuracy
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_logreg))]
plt.figure(figsize=(10,10))
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
plt.xticks(x_i,param_logreg)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.legend(['Training Accuracy','Test Accuracy'])
# plt.xlim(min(param_logreg)/10,max(param_logreg))
plt.show()
Regularization term: log_reg_term=2500
Averaged training accuracy: 83.78%
Averaged test accuracy: 80.51%
Cost Function
Objective Function
If C = 1/λ, minimizing CA + B yields the same optimal parameters as minimizing A + λB, so an SVM with a linear kernel behaves much like regularized logistic regression (the loss functions differ, so the parameters are similar rather than exactly the same)
C too large: over-fitting (weak regularization)
C too small: under-fitting (strong regularization)
Instead of A + λB, we use CA + B
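For reference, the standard linear-SVM objective in the CA + B form (a reconstruction, with cost_1 and cost_0 the hinge losses for the two labels):

\min_{\theta} \; C \sum_{i=1}^{m} \Big[ y^{(i)} \,\mathrm{cost}_1(\theta^{T} x^{(i)}) + (1 - y^{(i)}) \,\mathrm{cost}_0(\theta^{T} x^{(i)}) \Big] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}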
kf = KFold(n_splits=5)
Applying Support Vector Machine with a Linear Kernel
Number of folds (iterations): k = 5
List of regularization terms: C = 10^3, 10^4, ..., 10^12
=> Define the best regularization term
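The helper linear_svc is not shown here; a minimal sketch mirroring log_reg, assuming sklearn's SVC with a linear kernel (C is passed directly, so larger values mean weaker regularization):

from sklearn.svm import SVC

def linear_svc(X_train, X_test, y_train, y_test, reg_term):
    classifier = SVC(kernel='linear', C=reg_term)
    classifier.fit(X_train, y_train)
    return classifier.score(X_train, y_train), classifier.score(X_test, y_test)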
Define the best range of regularization terms
param_linear_svc = [10**i for i in range(3, 13)]   # C = 10^3 ... 10^12, matching the plot labels

training_acc_col = []
test_acc_col = []
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged training and test accuracy after running through 5 folds
avg_training_acc=np.mean(np.array(training_acc_col),axis=0)
avg_test_acc=np.mean(np.array(test_acc_col),axis=0)
x_i=[i for i in range(len(param_linear_svc))]
plt.figure(figsize=(10,10))
ax = plt.subplot(111)
plt.plot(x_i,avg_training_acc,marker='o',mfc='blue',color='r')
plt.plot(x_i,avg_test_acc,marker='X',mfc='red',color='black')
xlabel=['10^'+str(i) for i in range(3,13)]
plt.xticks(x_i,xlabel)
plt.xlabel('The Value of Regularization Terms',size=20)
plt.ylabel('Accuracy Rate',size=20)
plt.show()
Find the exact value of the best regularization term (we choose from C = 10^7 to 10^8)
training_acc_col = []
test_acc_col = []
param_linear_svc = [i for i in range(10**7, 10**8 + 1, 10**7)]
print('Regularization terms:', param_linear_svc)
for train_index, test_index in kf.split(X):
    training_acc_col.append([])
    test_acc_col.append([])
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for i in param_linear_svc:
        training_acc, test_acc = linear_svc(X_train, X_test, y_train, y_test, i)
        training_acc_col[-1].append(training_acc)
        test_acc_col[-1].append(test_acc)
Plotting the averaged training and test accuracy after running through 5 folds
Regularization term: linear_svc_reg = 3*10**7
Averaged training accuracy: 86.98%
Averaged test accuracy: 80.69%
A Gaussian kernel is a function whose value depends on the distance from the origin or from some other point.
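The standard Gaussian (RBF) kernel form, for reference:

K(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}} \right)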