CTC loss is powerful, but it needs language model rescoring to beat similarly sized encoder-decoder (Enc-Dec) architectures.
CNN encoders can match LSTM performance if the networks are sufficiently deep.
The popular vision techniques of batch normalization (BN) and residual connections work well in speech too.
1-D CNNs work better than 2-D CNNs and dilated CNNs.
Multitask learning with lower-level supervision is promising.
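As a reference for the first point, the CTC objective sums the probability of a label sequence over all frame-level alignments, and can be computed with the standard forward (alpha) recursion. A minimal NumPy sketch, assuming unnormalized per-frame probabilities and blank index 0 (not the exact implementation used in this project):

```python
import numpy as np

def ctc_forward(probs, labels, blank=0):
    """Probability of `labels` under CTC, given per-frame distributions
    `probs` of shape (T, V), summed over all valid alignments."""
    T, V = probs.shape
    # Extend the label sequence with blanks: [b, l1, b, l2, ..., b]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skipping a blank is only allowed between distinct non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # A valid path must end in the last label or the final blank
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
```

In practice this recursion is done in log space for numerical stability; the plain-probability version above is only for clarity.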
Add additional data to the n-gram language model, such as the Fisher transcripts.
Integrate an RNN language model.
Normalize by the letter prior.
Investigate different character inventories.
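Several of these items (an external LM, letter-prior normalization) come together at rescoring time. A hypothetical shallow-fusion scoring function is sketched below; the function name, the hypothesis tuple layout, and all weight values are illustrative assumptions, not tuned values from this project:

```python
def fusion_score(ctc_logp, lm_logp, prior_logp, length,
                 lm_weight=0.5, prior_weight=0.3, length_bonus=0.1):
    """Hypothetical fused score for one hypothesis: the CTC acoustic
    log-prob, plus a weighted external LM log-prob, minus a weighted
    letter-prior log-prob (so frequent letters are not doubly rewarded),
    plus a per-letter length bonus. All weights are illustrative."""
    return (ctc_logp + lm_weight * lm_logp
            - prior_weight * prior_logp + length_bonus * length)

def rescore(hyps, **kw):
    """Sort (text, ctc_logp, lm_logp, prior_logp) hypotheses by fused score."""
    return sorted(
        hyps,
        key=lambda h: fusion_score(h[1], h[2], h[3], len(h[0]), **kw),
        reverse=True)
```

With suitable weights, a hypothesis the acoustic model slightly disprefers can still win once the LM term is added, which is the whole point of rescoring.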
Lower-layer supervision seems to be an interesting idea and works well in practice. It will be interesting to see in which scenarios it actually helps, and whether it makes deeper networks easier to train.
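One common way to realize this (a general sketch, not necessarily this project's exact recipe) is to attach an auxiliary head, e.g. a phone-level CTC loss, to an intermediate layer and add its loss to the top-level loss with a small weight:

```python
def multitask_loss(main_loss, aux_losses, aux_weight=0.5):
    """Combine the top-layer loss with losses from heads attached to
    lower layers. `aux_weight` is an illustrative hyperparameter; it is
    often reduced or dropped once training stabilizes."""
    return main_loss + aux_weight * sum(aux_losses)
```

The auxiliary heads are discarded at test time; they exist only to give the lower layers a more direct training signal.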
The filter patterns being learnt are interesting, and it would be worthwhile to carry out a qualitative analysis. One idea would be to use deconvolutions to look at the activated regions.
Setting up an automation system early saves a lot of time in the future.
It's better to keep track of experiments (especially during tuning) with a spreadsheet rather than a doc.
Writing down major findings once a week helps keep track of progress, and uncovers flaws.
I used C++ after 8 months, and feel I need to learn it better.
Talking to others about your ideas helps your own clarity + uncovers major flaws.
Blogging is a lot of fun!
Thank you TTIC! The Force is strong with this one!