Hello, thank you for the excellent work on VITS.
My question is about the volume-preserving design of the normalizing flow in the prior encoder (Section 2.5.2 of the paper).
Why did you choose this design over a more expressive non-volume-preserving flow? Was it primarily for training stability, for simplicity, or because empirical results showed no significant performance gain?
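
To make sure I am asking about the right thing, here is a minimal sketch of the distinction as I understand it. It is a toy illustration and not the VITS code: `coupling_forward`, `toy_shift`, and `toy_log_scale` are hypothetical names, and the conditioners are simple stand-ins for a neural network. With only a shift, the log-determinant of the Jacobian is zero (volume-preserving); adding a learned log-scale makes it nonzero.

```python
import math

def coupling_forward(x, shift_fn, log_scale_fn=None):
    """One coupling step on an even-length list x. Returns (y, log_det)."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    shift = shift_fn(x_a)
    if log_scale_fn is None:
        # Additive coupling: y_b = x_b + shift, so log|det J| = 0.
        y_b = [b + m for b, m in zip(x_b, shift)]
        log_det = 0.0
    else:
        # Affine coupling: y_b = x_b * exp(log_s) + shift,
        # so log|det J| = sum(log_s).
        log_s = log_scale_fn(x_a)
        y_b = [b * math.exp(s) + m for b, s, m in zip(x_b, log_s, shift)]
        log_det = sum(log_s)
    return x_a + y_b, log_det

# Toy conditioners (assumptions for illustration only).
toy_shift = lambda a: [0.5 * v for v in a]
toy_log_scale = lambda a: [0.1 * v for v in a]

x = [1.0, 2.0, 3.0, 4.0]
y_vp, log_det_vp = coupling_forward(x, toy_shift)                    # volume-preserving
y_nvp, log_det_nvp = coupling_forward(x, toy_shift, toy_log_scale)   # non-volume-preserving
```

In this sketch the first call leaves the density's volume term untouched, while the second contributes a data-dependent log-determinant to the likelihood.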
I would greatly appreciate any insights into this design choice. Thank you.