Continuous Voice Transcription and Enhancement

How would I train a model that could do continuous voice enhancement (e.g. something like pix2pix, but for audio and running continuously) and continuous voice transcription? Could you guys please link some starting papers? I’m not trying to create a breakthrough model; I just want to deploy such models for practical use.