D. Vilsmeier, M. Sapinski and R. Singh
Can we use Machine Learning for IPM profile reconstruction?
What is Machine Learning?
Field of study that gives computers the ability to learn without being explicitly programmed.
- Arthur Samuel (1959)
What is Machine Learning really?
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Tom Mitchell (1997)
A Brief History of AI
Autonomously driving car
Playing Atari Games
Machine Learning Toolbox
Ionization Profile Monitors
Extracting the ions / electrons from the ionization region is expected to provide a one-dimensional projection of the transverse beam profile, but ...
Particles move on straight lines towards the detector
Trajectories are distorted due to various effects
IPM Profile Distortion
Interaction with beam fields:
+ instrumental effects such as camera tilt, point-spread functions of the optical system and multi-channel-plate granularity
Increase electric guiding field
Increased electric field 🡒 smaller extraction times 🡒 smaller displacement
Additional magnetic guiding field
Increased magnetic field 🡒 smaller gyroradius 🡒 smaller possible displacement
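As a rough illustration of the scaling above, the sketch below evaluates the non-relativistic gyroradius r = m·v⊥ / (|q|·B) for an electron. The transverse energy of 10 eV and the two field values are assumptions chosen only for illustration; they are not taken from the simulations in this talk.

```python
import math

ELECTRON_MASS = 9.1093837e-31       # kg
ELEMENTARY_CHARGE = 1.60217663e-19  # C


def gyroradius(transverse_energy_eV, b_field_T):
    """Non-relativistic gyroradius r = m * v_perp / (|q| * B), in metres."""
    energy_J = transverse_energy_eV * ELEMENTARY_CHARGE
    v_perp = math.sqrt(2.0 * energy_J / ELECTRON_MASS)
    return ELECTRON_MASS * v_perp / (ELEMENTARY_CHARGE * b_field_T)


# Assumed example values: 10 eV transverse energy, two field strengths.
r_low = gyroradius(10.0, 0.1)
r_high = gyroradius(10.0, 0.2)
print(f"{r_low * 1e6:.1f} um at 0.1 T, {r_high * 1e6:.1f} um at 0.2 T")
```

Doubling the field halves the radius, which is the "smaller possible displacement" stated above.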
6.5 TeV, 2.1e11 Protons, σ = (270, 360) μm, 4σz = 0.9ns
This type of oscillation also follows from analytical considerations, using a simplified (linear) model for the beam electric field. This assumption holds well for the region close to the beam center.
The corresponding equations of motion are (for an electron):
Analytical considerations (II)
One arrives at the following expressions for the velocity:
What about particles that have their velocity decreased?
Compare with particle that has its velocity increased:
Electron motion is different for two distinct regions:
Distortion through gyromotion
Gyromotion is parametrized via
Particle spirals around
and will be detected in the range
Provided that the moment of detection is random, the gyromotion of electrons gives rise to probabilities of being detected at a specific position.
Applying the limit results in a divergence at the "edges" of the motion, so the expression is only valid near the center. For real profiles, however, we can work with a discretized version of the above relation.
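A minimal sketch of such a discretized version, assuming the detected position is x = center + radius·cos(phase) with a uniformly random phase (this parametrization is an assumption, as are the function name and the example numbers). The per-bin probability then follows from the arcsine cumulative distribution, which stays finite even for the bins at the edges of the motion.

```python
import numpy as np


def gyro_bin_probabilities(center, radius, bin_edges):
    """Probability per bin for x = center + radius * cos(phase), phase uniform."""
    edges = np.asarray(bin_edges, dtype=float)
    # Clip the edges to the reachable range [center - radius, center + radius].
    u = np.clip((edges - center) / radius, -1.0, 1.0)
    # Cumulative distribution of the arcsine law on [-1, 1].
    cdf = 0.5 + np.arcsin(u) / np.pi
    return np.diff(cdf)


# Assumed example: 10 um bins, gyroradius of 50 um, all values in um.
edges = np.linspace(-100.0, 100.0, 21)
probs = gyro_bin_probabilities(center=0.0, radius=50.0, bin_edges=edges)
```

The bins next to the turning points collect more probability than the central ones, but each value is finite because the density is integrated over the bin width.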
Summary of displacements
- position at ionization
- gyration center (detector region)
- position of detection
Two types of displacement:
In principle, z-velocities are relatively small (shift ~ 10 μm; ionized electrons are mainly scattered in the transverse direction) and the longitudinal electric field can be neglected. The center shift due to the transverse electric field is only a few micrometers, which is small compared to the magnitude of the gyromotion.
However, describing the gyro-distortion via point-spread functions is not possible because the gyro-velocity increase is not uniform along the profile.
Simulating the process
The Virtual-IPM simulation tool was used to generate the data
|Parameter||Range||Step size|
|Bunch pop. [1e11]||1.1 -- 2.1 ppb||0.1 ppb|
|Bunch width (1σ)||270 -- 370 μm||5 μm|
|Bunch height (1σ)||360 -- 600 μm||20 μm|
|Bunch length (4σ)||0.9 -- 1.2 ns||0.05 ns|
4kV / 85mm
= 21,021 cases × 5 h per case ≈ 12 years of computing time. Good thing there are computing clusters :-)
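The case count can be checked by counting the grid points per parameter, reading the table columns as a range and a step size (the dictionary keys below are hypothetical names). The sketch also converts 5 h per case into years of single-core computing time.

```python
import math

# (start, stop, step) for each scanned parameter, read from the table above.
scan = {
    "bunch_population_1e11": (1.1, 2.1, 0.1),
    "bunch_width_um": (270, 370, 5),
    "bunch_height_um": (360, 600, 20),
    "bunch_length_ns": (0.9, 1.2, 0.05),
}


def n_points(start, stop, step):
    """Number of grid points including both end points."""
    return int(round((stop - start) / step)) + 1


n_cases = math.prod(n_points(*limits) for limits in scan.values())
cpu_years = n_cases * 5 / (24 * 365)
print(n_cases, round(cpu_years, 1))
```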
Analytical expression for the electric field of a two-dimensional Gaussian charge distribution
🡒 Longitudinal field is neglected; justifiable for highly relativistic beam
🡒 Magnetic field is considered
M. Bassetti and G.A. Erskine, "Closed expression for the electrical field of a two-dimensional Gaussian charge", CERN-ISR-TH/80-06, 1980
Double differential cross section for relativistic incident particles
A. Voitkiv et al, "Hydrogen and helium ionization by relativistic projectiles in collisions with small momentum transfer", J.Phys.B: At.Mol.Opt.Phys, vol.32, 1999
Uniform electric and magnetic guiding fields
Simulation outputs particle parameters as csv data frame
Summarize initial and final positions into 1 μm histograms ranging [-10, 10] mm
Rebin to 55 μm (silicon pixel detector) and 100 μm (optical acquisition system)
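The histogramming and rebinning steps could look roughly like the sketch below. The random positions stand in for the simulated particle data, and `rebin` is a hypothetical helper: since 20,000 fine bins are not a multiple of 55, it trims a few bins at the outer edges before summing, which is one possible choice and not necessarily what was done for this talk.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder for the simulated particle positions, in micrometres.
positions_um = rng.normal(0.0, 300.0, size=100_000)

# 1 um bins covering [-10, 10] mm.
fine_edges = np.arange(-10_000, 10_001, 1)
fine_counts, _ = np.histogram(positions_um, bins=fine_edges)


def rebin(counts, factor):
    """Sum groups of `factor` adjacent bins, trimming leftover bins at the edges."""
    n_bins = len(counts) // factor
    n_used = n_bins * factor
    start = (len(counts) - n_used) // 2
    return counts[start:start + n_used].reshape(n_bins, factor).sum(axis=1)


profile_55um = rebin(fine_counts, 55)    # silicon pixel detector
profile_100um = rebin(fine_counts, 100)  # optical acquisition system
```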
Bunch population and bunch length
|ppb||4σz||f ||...||f [n]||i ||...||i [n]|
Drop zero variance variables (as they don't provide any information)
Rescale data to have zero mean and unit variance:
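Both preprocessing steps amount to z = (x − μ) / σ applied per column, after removing columns with σ = 0. A minimal NumPy sketch follows; the function name and the toy arrays are assumptions. Mean and standard deviation are computed on the training data only and then reused for any other data set.

```python
import numpy as np


def standardize(X_train, X_other):
    """Drop zero-variance columns, then rescale to zero mean and unit variance."""
    X_train = np.asarray(X_train, dtype=float)
    X_other = np.asarray(X_other, dtype=float)
    std = X_train.std(axis=0)
    keep = std > 0  # zero-variance columns carry no information
    mean = X_train[:, keep].mean(axis=0)
    scale = std[keep]
    return (X_train[:, keep] - mean) / scale, (X_other[:, keep] - mean) / scale


# Toy data: the third column is constant and gets dropped.
X_train = np.array([[1.0, 5.0, 0.0], [2.0, 7.0, 0.0], [3.0, 9.0, 0.0]])
X_test = np.array([[2.0, 6.0, 0.0]])
Z_train, Z_test = standardize(X_train, X_test)
```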
Compute σ from profile
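One way to obtain σ from a binned profile is the count-weighted RMS width, sketched below. Whether the talk used the RMS width or a Gaussian fit is not stated, so this choice is an assumption, as are the helper name and the test profile.

```python
import numpy as np


def profile_sigma(bin_centers, counts):
    """Count-weighted RMS width of a binned profile."""
    centers = np.asarray(bin_centers, dtype=float)
    weights = np.asarray(counts, dtype=float)
    mean = np.average(centers, weights=weights)
    return np.sqrt(np.average((centers - mean) ** 2, weights=weights))


# Assumed test profile: Gaussian with sigma = 300 um sampled on 10 um bins.
centers = np.arange(-2000.0, 2001.0, 10.0)
counts = np.exp(-0.5 * (centers / 300.0) ** 2)
sigma = profile_sigma(centers, counts)
```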
Train, Test, Validate
Divide the machine "learning" into three stages and corresponding data sets:
Training: This data set is used to derive rules from the data, i.e. to fit the model (e.g. polynomial fit). Split size ~ 60%.
Validation: This data set is used to ensure that the model generalizes well to unseen data, between multiple iterations of hyper-parameter tuning (e.g. polynomial degree) and model selection. Split size ~ 20%.
Test: This data set is used for evaluating the performance of the final model (in order to prevent "tuning-bias"). Split size ~ 20%.
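A 60 / 20 / 20 split as described above can be sketched with a shuffled index array. The function name and the fixed seed are assumptions; the sample count of 21,021 is the number of simulated cases quoted earlier.

```python
import numpy as np


def split_indices(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    # Whatever remains after the first two cuts goes to the test set.
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(21021)
```

The test indices are set aside and only used once, after model selection on the validation set is finished.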
Mean squared error
Similar mean, variance and R2 (approx.)
Residuals plot is important
Visualization helps in assessing the quality of a model
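The numeric metrics used in the result tables of this talk (mean and standard deviation of the residuals, R2-score, mean squared error) can be computed as in the sketch below. The function name and the toy values are assumptions.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Residual mean and spread, R2-score and mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "res_mean": residuals.mean(),
        "res_std": residuals.std(),
        "r2": 1.0 - ss_res / ss_tot,
        "mse": np.mean(residuals ** 2),
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Since very different data sets can share almost the same values of these numbers, they complement a residuals plot rather than replace it.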
Multivariable linear regression
Multivariable linear regression - Results
|Mean squared error||0.25212|
Kernel ridge regression
Twofold extension of "classical" linear regression:
L2 "ridge" regularization:
Adding the norm of the weights to the loss function puts a penalty on the weights.
Strength of penalization can be adjusted.
Regularization term prevents overfitting
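For plain (non-kernel) ridge regression, the penalized least-squares problem has a closed-form solution, w = (XᵀX + αI)⁻¹ Xᵀy, assuming the squared L2 norm of the weights as penalty and no intercept term. The sketch below uses synthetic data and a hypothetical function name to show how the penalty strength α shrinks the weights.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data with known weights and a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_weak = ridge_fit(X, y, alpha=1e-3)    # almost ordinary least squares
w_strong = ridge_fit(X, y, alpha=1e3)   # heavily penalized weights
```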
The Kernel Trick
Features can't be separated by a linear function ...
... at least not in two-dimensional space.
Define a mapping, for example:
The space into which K maps may be very high dimensional, so computing the mapped vectors K(x) and K(y) is expensive.
The trick is to express the dot product K(x) · K(y) in terms of the initial vectors x and y 🡒 no need to compute K(x) or K(y) explicitly.
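A small numerical check of the trick, assuming a degree-2 polynomial mapping of a two-dimensional vector (the specific mapping is an illustrative choice and `feature_map` is a hypothetical name for it): the dot product in the mapped three-dimensional space equals (x · y)² computed directly from the original vectors.

```python
import numpy as np


def feature_map(v):
    """Explicit degree-2 mapping of a 2D vector into 3D."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


def poly_kernel(x, y):
    """Same dot product, computed without leaving the original space."""
    return np.dot(x, y) ** 2


x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
explicit = np.dot(feature_map(x), feature_map(y))
implicit = poly_kernel(x, y)
```

For higher degrees or the radial basis function kernel the mapped space grows very large or becomes infinite dimensional, while the kernel evaluation stays a single dot product or distance.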
Kernel ridge regression - Results
Radial basis function kernel
degree = 2
|Poly kernel||RBF kernel|
|Mean squared error||0.15074||0.23331|
Support Vector Machine
intercept of hyperplane
Fitting and predicting only depends on dot products of feature vectors, so we can use the kernel trick
Support Vector Regression
The fitted model is similar to Kernel Ridge Regression, but the loss function is different: KRR uses a squared-error loss, whereas SVR uses an ε-insensitive loss.
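The difference between the two loss functions can be sketched as follows. The value ε = 0.1 and the function names are assumptions for illustration. Residuals smaller than ε cost nothing under the ε-insensitive loss, and larger ones grow linearly instead of quadratically.

```python
import numpy as np


def squared_loss(residual):
    """Loss used by (kernel) ridge regression."""
    return np.asarray(residual, dtype=float) ** 2


def epsilon_insensitive_loss(residual, epsilon=0.1):
    """Loss used by support vector regression: zero inside the epsilon tube."""
    return np.maximum(0.0, np.abs(np.asarray(residual, dtype=float)) - epsilon)


residuals = np.array([-1.0, -0.05, 0.0, 0.05, 1.0])
krr_loss = squared_loss(residuals)
svr_loss = epsilon_insensitive_loss(residuals)
```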
Support Vector Regression - Results
|Mean squared error||0.13527|
Artificial Neural Networks
Inspired by the human brain, many "neurons" linked together
Apply non-linearity, e.g. sigmoid, tanh
What does "deep" mean?
Usually two or more hidden layers.
Using fully connected layers, the number of parameters can grow quite large (e.g. one of the MNIST top classifiers uses 6 layers = 11,972,510 parameters).
All those parameters are adjusted using the training data.
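The parameter count of a fully connected network follows directly from the layer sizes: each layer contributes (inputs + 1) × outputs weights and biases. In the sketch below the MNIST layer sizes are an assumption, chosen because they reproduce the number quoted above; the second list uses the layer sizes of the Keras network shown later in this talk.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Assumed architecture of the MNIST classifier (input, five hidden, output).
mnist_mlp = [784, 2500, 2000, 1500, 1000, 500, 10]
# Network from this talk: 49 predictors, four hidden layers, one output.
ipm_mlp = [49, 200, 170, 140, 110, 1]

n_mnist = count_dense_parameters(mnist_mlp)
n_ipm = count_dense_parameters(ipm_mlp)
print(n_mnist, n_ipm)
```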
How deep networks learn
Define a loss function that assesses how good the network's predictions are, e.g. mean squared error:
Compute the gradient of the loss with respect to the network parameters (backpropagation) and pass it to an optimizer, e.g. Gradient Descent or Adam (uses "momentum").
Update the network's parameters by applying the gradient with a learning rate:
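The three steps can be sketched for the simplest possible "network", a linear model with a mean squared error loss, where the gradient is available in closed form. The data, the learning rate and the function name are assumptions; a real network would obtain the gradient via backpropagation and usually use mini-batches.

```python
import numpy as np


def gradient_descent_step(w, X, y, learning_rate):
    """One update w <- w - learning_rate * d(MSE)/dw for a linear model."""
    predictions = X @ w
    gradient = 2.0 / len(y) * X.T @ (predictions - y)
    return w - learning_rate * gradient


# Noise-free synthetic data with known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w

w = np.zeros(2)
losses = []
for _ in range(200):
    losses.append(np.mean((X @ w - y) ** 2))  # loss before the update
    w = gradient_descent_step(w, X, y, learning_rate=0.1)
```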
Deep Learning Frameworks
from functools import partial

from keras.initializers import VarianceScaling
from keras.layers import Dense
from keras.models import Sequential

# Dense layer with variance-scaling weight initialization.
IDense = partial(Dense, kernel_initializer=VarianceScaling())

# Create feed-forward network.
model = Sequential()
# Since this is the first hidden layer we also need to specify
# the shape of the input data (49 predictors).
model.add(IDense(200, activation='relu', input_shape=(49,)))
model.add(IDense(170, activation='relu'))
model.add(IDense(140, activation='relu'))
model.add(IDense(110, activation='relu'))
# The network's output (beam sigma). This uses linear activation.
model.add(IDense(1))
Fully-connected (dense) feed-forward (sequential) architecture
D. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization", arXiv:1412.6980, 2014
Other optimizers, for example Stochastic Gradient Descent, are also possible.
After each epoch, compute the loss on the validation data in order to detect overfitting
Could apply early stopping
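Early stopping can be sketched independently of any framework: training stops once the validation loss has not improved for a given number of epochs (the "patience"). The function name, the patience value and the loss history below are assumptions for illustration.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return len(validation_losses) - 1, best_epoch


# Assumed validation loss per epoch: improves first, then overfitting sets in.
history = [1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.40, 0.45]
stopped_at, best_epoch = early_stopping_epoch(history, patience=3)
```

In practice the model parameters from the best epoch are kept, not those from the epoch at which training stopped.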
ANN - Results
|Mean squared error||0.10959|
Summary of model performance
|μ (res.)||σ (res.)||R2-score||MSE|
Values are given in units of
Performance on test data
|Mean squared error||0.10815|
Measuring space-charge induced profile distortion
Measurements taken at CERN / SPS / BGI-H
|Electric field||4 kV / 85 mm|
|Magnetic field||0.2 T (50A)|
At the maximum magnetic field no profile distortion occurs, but reducing the magnet current weakens the field enough for the distortion to appear
MCP + Phosphor + Camera
|Beam width (1σ)||0.835 mm|
|Beam height (1σ)||0.451 mm|
|Bunch length (4σ)||1.6 ns|
Taking into account:
Wire Scanners were used to obtain undistorted reference profiles; however, only the sigma values from the fits were available
Bad luck with the RNG
What about the PSF?
Trying a 520 μm point-spread function instead:
This resolves the deviations for all magnetic field values.
However such a large PSF is difficult to explain.
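To get a feeling for the size of this effect: if both the beam profile and the PSF are Gaussian, the widths add in quadrature. The sketch below checks this numerically, assuming that 520 μm denotes the Gaussian σ of the PSF and that it acts on the 0.835 mm beam width quoted above; both are assumptions, as the talk does not state them.

```python
import numpy as np


def gaussian(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2)


def rms_width(x, y):
    """Weighted RMS width of a sampled profile."""
    mean = np.average(x, weights=y)
    return np.sqrt(np.average((x - mean) ** 2, weights=y))


x = np.arange(-6000.0, 6001.0, 5.0)  # positions in um
profile = gaussian(x, 835.0)         # undistorted beam profile
psf = gaussian(x, 520.0)             # assumed Gaussian PSF

# Convolve with the normalized PSF; "same" keeps the original grid.
blurred = np.convolve(profile, psf / psf.sum(), mode="same")

measured = rms_width(x, blurred)
expected = np.hypot(835.0, 520.0)    # sqrt(835^2 + 520^2)
print(round(measured, 1), round(expected, 1))
```

Under these assumptions the PSF would widen the profile by roughly 18 %, which illustrates why a PSF of this size is hard to justify as a purely instrumental effect.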
Hyper-parameter tuning is tedious but there are helpful tools
Collect more data from measurements
Test the method against measured reference profiles (e.g. from Wire Scanners)
Test the method with "unclean" data (e.g. noise added, PSF applied)
from sklearn.model_selection import GridSearchCV
from autosklearn.regression import AutoSklearnRegressor
Investigate the effect of different profile binnings, i.e. different number of predictors
All models are wrong but some are useful.
- George Box (1978)
Piqued your interest?
Icons by icons8.
Please find more information at https://ipmsim.gitlab.io/IPMSim/
The data presented in this talk was generated with Virtual-IPM