We can represent q (and w, the policy) by neural networks and maximise this! Mohamed and Rezende do so by alternating between two steps: maximising w.r.t. q, which is a maximum-likelihood problem, and maximising w.r.t. w, for which they derive an expression for the functional derivative of the above. Both steps use SGD.
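To make the alternation concrete, here is a minimal sketch on a toy problem. It rests on several assumptions that are not from the paper: the objective is taken to be the variational lower bound E[log q(a|s') - log w(a)] for a single fixed state, q and w are tables of logits rather than neural networks, and each step uses an exact gradient rather than a sampled SGD update. The w-step is plain gradient ascent on the logits, not the update Mohamed and Rezende derive. The names `bound`, `alternate` and the channel `P` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bound(w_logits, q_logits, P):
    """Assumed objective: E_{w(a) p(s'|a)}[log q(a|s') - log w(a)]."""
    w = softmax(w_logits)                      # w(a), shape (A,)
    log_q = np.log(softmax(q_logits, axis=0))  # log q(a|s'), shape (A, S)
    joint = w[:, None] * P                     # w(a) p(s'|a)
    return float((joint * (log_q - np.log(w)[:, None])).sum())

def alternate(P, steps=2000, lr=0.5):
    """Alternate gradient-ascent steps on q (max likelihood) and on w."""
    A, S = P.shape
    w_logits = np.zeros(A)
    q_logits = np.zeros((A, S))
    for _ in range(steps):
        # q-step: ascend E[log q(a|s')] with w held fixed.
        w = softmax(w_logits)
        q = softmax(q_logits, axis=0)
        joint = w[:, None] * P
        q_logits += lr * (joint - joint.sum(axis=0, keepdims=True) * q)
        # w-step: ascend the full bound with q held fixed.
        q = softmax(q_logits, axis=0)
        g = (P * np.log(q)).sum(axis=1) - np.log(w)
        w_logits += lr * w * (g - w @ g)
    return w_logits, q_logits

# Toy channel p(s'|a): two reliable actions and one noisy action.
P = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
w_logits, q_logits = alternate(P)
final = bound(w_logits, q_logits, P)
print("bound:", final, "w:", softmax(w_logits))
```

In this toy channel the noisy third action carries no information about the next state, so the w-step should shift probability towards the two reliable actions, while the bound can never exceed log 2, the largest mutual information this channel allows.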