Conclusions

CTC loss is powerful, but needs a language model rescoring to beat similar sized Enc-Dec structures.
CNN encoders can match LSTM performance, if networks are sufficiently deep.
The popular vision techniques of BN and Residual Connections work well in Speech too.
1-D CNNs work better than 2-D CNNs and dilated CNNs.
Multitask learning with lower level supervision is promising.

Encoders for CTC