NRx(20) With Demapper:
- This is a configuration where we only have a trainable neural receiver, and the transmission pipeline is fixed, i.e, no trainable parameters, default conventional transmission pipeline. During training, the speed of the user is randomly sampled from 0-20 when getting the channel impulse response. The output of the Neural Receiver is then passed onto a Demapper to get the LLRs.
2) NRx(20) Without Demapper:
- This is an identical configuration as above, just without the demapper. The Nural receiver is expected to produce the LLRs directly.
3) Tx-Rx:
- A light-weight neural modulator is added to the transmitter side, which takes in a 14x128 Resource Grid and gives out another 14x128 modulated Grid. The Receiver side configuration is identical to (1): speed randomly sampled from 0-20, neural_Receiver -> Demapper-> LLRs.
Baselines: Baselines are Perfect CSI, where there is no estimation at the receiver's side, and perfect information is available, and LS Estimation, which is a baseline from Sionna's tutorial that deepOFDM work also uses.
After our last meeting, Nasim and I spoke and discussed that perhaps we should start with training the model with speed=0 and seeing what happens when we change speed at test time. That is our next configuration,
4) NRx(0) with Demapper: Identical configuration as (1), but during training, we only get channel impulse responses with the user speed being 0. Then, during test time, we change the channel model to have speeds of 10,20, etc, and see what happens to the clusters.
NRx(20) with demapper
Demapper
LLRs
NRx(20) wo demapper
LLRs
NRx(20)
Demapper
Quick Sanity Check
For Traning both Tx and Rx
TX
RX
NRX(0) setting
Latent from NRx
Next, we wanted to see what happens if we have "Trainable Mapper" in the NRX(0) setting
Next, we wanted to see what happens if we have "Trainable Mapper" in the NRX(0) setting
Learned Symbols
Next, we wanted to see what happens if we have "Trainable Mapper" in the NRX(0) setting
Next, we wanted to see what happens if we have "Trainable Mapper" in the NRX(0) setting
Also when we have mapper = trainable, loss curve appears to be lot smoother
With Trainable Mapper
NRx(0) setting
Also, when running the eval loop, the eval stopped at ebno=-10, because it satisfied the low error rate threshold, which was met for other configs only at ebno=10-14
Trained for 20k steps with normal loss
No trainable component in TX
Neural Receiver
Note: This TSNE might not have one to one Correspondence. Randomness comes from input data being generated
To Make the clusters more coherent
Loss = Information Rate - 0.1*(Intra-class variance)
Metric to measure cluster similarity
Assume Clustered as Gaussians with
Centroid movement
ch. 0m/s
ch. 20m/s
Change in spread
Total
Metric to measure cluster similarity
Metric to measure cluster similarity
Tried on speed 80m/s
Metric to measure cluster similarity
Tried on speed 80m/s
What other parameters of the channel should be changed for us to observe a more drastic difference in performance?
Setting
Sp. 0
Cluster Centers mu_0
Cluster loss: L_ref
Setting
Sp. 10
Cluster Centers mu_0
Cluster loss: L_ref
Sp. 0
Setting
Sp. 10
Cluster Centers mu_0
Cluster loss: L_ref
Every N steps:
- Assign each point to its nearest
center
- Recompute cluster centeres
Setting
Sp. 10
Cluster Centers mu_0
Cluster loss: L_ref
Every N steps:
- Assign each point to its nearest
center
- Recompute cluster centeres
Every step, update the following loss
Setting
Sp. 10
Cluster Centers mu_0
Cluster loss: L_ref
Every N steps:
- Assign each point to its nearest
center
- Recompute cluster centeres
Every step, update the following loss
Note: Rate is NOT in the loss function; it is only measured
Setting
Sp. 10
Cluster Centers mu_0
Cluster loss: L_ref
Every N steps:
- Assign each point to its nearest
center
- Recompute cluster centeres
Every step, update the following loss
Note: Rate is NOT in the loss function; it is only measured
Setting
Sp. 10
Cluster Centers mu_0
Cluster loss: L_ref
Every N steps:
- Assign each point to its nearest
center
- Recompute cluster centeres
Every step, update the following loss
Stop iff:
Now run SSL....
Now run SSL....
Speed 0 to 20
Now run SSL....
Speed 0 to 20
Before SSL
After SSL
Now run SSL....
Speed 0 to 60
Before SSL
After SSL
Now run SSL....
However, the Bit Error Rate is not improving before and after SSL
Now run SSL....
However, the Bit Error Rate is not improving before and after SSL
Thoughts:
1. Should we "pin" the cluster centers to the original cluster centers, as the model's decision boundary had no incentive to change?
2. "Few-shot" updates to the model after SSL? Any advantages over standard supervised learning approaches?
Reason for discrepancy in previous plots: The model's ber and baselines ber were evaluated with different step,s and the plotting function did not take into account that. Now I have updated the plotting function and am running all evaluations from a single config file, so that evaluations all share the same parameters
LLRs
L1: Cluster-Loss
L2: Plabel Loss
BER became slightly worse after training
BER became slightly worse after training
Hypothesis: backbone updates shifted the clusters -> pseudo-labels became wrong -> training on wrong labels corrupted the backbone further.
BER became slightly worse after training
Hypothesis: backbone updates shifted the clusters -> pseudo-labels became wrong -> training on wrong labels corrupted the backbone further.
Soln: Break the feedback loop by detaching L2 gradients from the backbone, so L2 only updates the final classifier layer.
LLRs
L1: Cluster-Loss
L2: Plabel Loss
Stop gradient
BER became slightly worse after training
Inference still degrades
L1 might affect abstract features already learnt when we update the full backbone. Update only the latter layers
loss that modifies the full backbone based on clustering (L1) tends to be harmful.
What if we used only L2? How much effect does only using L2 have?
Before and After SSL curves are identical
Inference: We *MUST* update the backbone for gains, however too much cluster shift is causing us harms
Before and After SSL curves are identical
Motivation: Saw some papers that used Entropy as a common objective for test time adaptation
However, How does it compare to baselilnes?
However, How does it compare to baselilnes?
Text
Speed 20.0 m/s (relative to speed=0):
Average normalized shift: 0.088 (spread-widths)
Average spread ratio: 1.332
→ Clusters expanded by 33.2%
Speed 40.0 m/s (relative to speed=0):
Average normalized shift: 0.307 (spread-widths)
Average spread ratio: 2.075
→ Clusters expanded by 107.5%
Speed 60.0 m/s (relative to speed=0):
Average normalized shift: 0.604 (spread-widths)
Average spread ratio: 2.818
→ Clusters expanded by 181.8%
Delay spread : 150ns
Speed: 0->20
Delay spread : 300ns
Speed: 0->20
Delay spread : 700ns
Speed: 0->20
If I change channel type to D->A,
Everything breaks and SSL does not help
Speed: 0->20
Next step:
- Understand when SSL works?
- Run an array of experiments by varying
1. Speed
2. Delay spread
3. Channel type
Independently
- Look at misclassification rate and relative cluster movement to see if we can say if SSL can fix or not
Next step:
- Understand when SSL works?
- Run an array of experiments by varying
1. Speed
2. Delay spread
3. Channel type
Next step:
- Understand when SSL works?
- Run an array of experiments by varying
1. Speed
2. Delay spread
3. Channel type
Before
After
Base Channel is trained on channel configuration :
(TDL-D,150ns delay spread, speed 0)
- if just the channel type changes from D->A,B,C even with speed 0 same delay spread, we get gains of 0.96, 0.86 and 1.04 respectively, which says before and after ssl there is not much of a difference in BER
- however, When channel changes from D-> A, B, C with the speed changing to 20, with same delay spread, we get gains of 1.76,1.48 ad 2.01
Base Channel is trained on channel configuration :
(TDL-D,150ns delay spread, speed 0)
Within the same channel type (D->D), the performance is satisfactory
| Model | Delay sp | speed | gain |
|---|---|---|---|
| D | 150 | 0 | 0.00 |
| D | 300 | 0 | 0.00 |
| D | 1000 | 0 | 0.00 |
| D | 150 | 20 | 3.45 |
| D | 300 | 20 | 4.23 |
| D | 1000 | 20 | 3.18 |
| D | 150 | 60 | 3.13 |
| D | 300 | 60 | 2.39 |
| D | 1000 | 60 | 2.34 |
Base Channel is trained on channel configuration :
(TDL-D,150ns delay spread, speed 0)
Within the same channel type (D->D), the performance is satisfactory
- Works well when the channel type does not change
- If the channel type changes, struggles in general.
- Across all channels, struggles with high mobility settings
Experiments with Linear Mode Connectivity
Linear Mode Connectivity
Experiments with Linear Mode Connectivity
Goal: To empirically see if LMC hypothesis holds.
LMC can be empirically verified by seeing if this is the case
Example from: Kim B, Ahn C, Baddar WJ, Kim K, Lee H, Ahn S, Han S, Suh S, Yang E. Test-time ensemble via linear mode connectivity: A path to better adaptation. InThe Thirteenth International Conference on Learning Representations 2025.
Experiments with Linear Mode Connectivity
Goal: To empirically see if LMC hypothesis holds.
In our setting
TDL-D, 150ns, speed 0
TDL-D, 150ns, speed 20
Experiments with Linear Mode Connectivity
Goal: To empirically see if LMC hypothesis holds.
TDL-D, 150ns, speed 0
TDL-D, 150ns, speed 20
LMC Holds
Experiments with Linear Mode Connectivity
Goal: To empirically see if LMC hypothesis holds.
TDL-D, 150ns, speed 0
TDL-D, 150ns, speed 60
LMC Holds
Experiments with Linear Mode Connectivity
Goal: To empirically see if LMC hypothesis holds.
TDL-D, 150ns, speed 0
TDL-A, 150ns, speed 60
Does it hold even when channel changes?
yes
Experiments with Linear Mode Connectivity
Next steps:
See how to exploit LMC for test time adaptation?
Only one major prior work , published in ICLR 2025
Kim B, Ahn C, Baddar WJ, Kim K, Lee H, Ahn S, Han S, Suh S, Yang E. Test-time ensemble via linear mode connectivity: A path to better adaptation. InThe Thirteenth International Conference on Learning Representations 2025.
Only one major prior work , published in ICLR 2025
Experiments with Linear Mode Connectivity
Next steps:
See how to exploit LMC for test time adaptation?
Initial Idea (Might change)
Hypernetwork
Catch: Cost of LMC \ finetuning
It is known that NNs loose their ability to "adapt" in nonstationary environments (plasticity)
How much is that cost?
I.E if we were to train a model from scratch on a new environment, how much better would we do?
Catch: Cost of LMC \ finetuning
It is known that NNs loose their ability to "adapt" in nonstationary environments (plasticity)
How much is that cost?
I.E if we were to train a model from scratch on a new environment, how much better would we do?
Can we close the gap?
I Think yes. I feel this issue is because the
Network being "not" plastic enough.
There are some known tricks that will
improve plasticity
Catch: Cost of LMC \ finetuning
It is known that NNs loose their ability to "adapt" in nonstationary environments (plasticity)
How much is that cost?
I.E if we were to train a model from scratch on a new environment, how much better would we do?
Can we close the gap?
I Think yes. I feel this issue is because the
Network being "not" plastic enough.
There are some known tricks that will
improve plasticity
(for ex. see Lyle C, Zheng Z, Khetarpal K, Van Hasselt H, Pascanu R, Martens J, Dabney W. Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762. 2024 Feb 29. )
Catch: Cost of LMC \ finetuning
It is known that NNs loose their ability to "adapt" in nonstationary environments (plasticity)
How much is that cost?
I.E if we were to train a model from scratch on a new environment, how much better would we do?
Can we close the gap?
However, I am skeptical, as these works mention loss of plasticity is due to:
1. Growth in parameter norm, because of which the gradient norm is overshadowed
2. The preactivations shift to "non-linear" regions of the activations
3. Plus other factors
Catch: Cost of LMC \ finetuning
It is known that NNs loose their ability to "adapt" in nonstationary environments (plasticity)
Can we close the gap?
However, I am skeptical, as these works mention loss of plasticity is due to:
1. Growth in parameter norm, because of which the gradient norm is overshadowed
2. The preactivations shift to "non-linear" regions of the activations
3. Plus other factors
We do not observe 1
C1
C2
Ck
Model Bank
Base
NRx
Adapted to Channel k
C1
C2
Ck
Model Bank
NRx
Hypernetwork
C1
C2
Ck
Model Bank
NRx
Hypernetwork
Because C1..Ck are LMC connected,
sum_i alpha_i c_i will always be better?
Online, no training at test time
C1
C2
Cn
All channel models
Results
Train
Test
80/20 split
Results
Results
Results
Results
1: tdl-C_300ns_speed5,
4: tdl-B_100ns_speed10
8: tdl-C_300ns_speed20
Results
C1
C2
Cn
All channel models
Train on D, E (line of sight) and B (non line of sight)
Split
Speed: 10/20/40
Delay spread: 100ns
Test on other channel types
Speed: 60
Delay spread: 300ns/150ns
Results
Results
Results
bank[1] (tdl-D,150ns,5m/s)
bank[3] (tdl-B,150ns,10m/s)
bank[8] (tdl-B,100ns,10m/s)
Results
C1
C2
Ck
Model Bank
NRx
Hypernetwork