MGE-LDM: Simultaneous Music Generation and Extraction via Joint Latent Modeling

Music and Audio Research Group (MARG), IPAI, AIIS,
Department of Intelligence and Information,
Seoul National University

Abstract

We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories.
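To make the inpainting formulation concrete, below is a minimal sketch of how extraction or imputation could be cast as conditional inpainting in the latent space: known tracks are held fixed as clean context while the missing track is denoised from noise. The three-track layout, the denoiser signature model(x, t, cond), and the sampler update are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of latent inpainting for extraction/imputation, assuming a
# hypothetical three-track denoiser `model(latents, timesteps, clap_cond)`.
import torch

def inpaint_missing_track(model, known, known_mask, clap_cond, num_steps=50):
    """
    known:      (B, 3, C, T) latents for [mixture, submixture, source] tracks;
                entries where known_mask is False are ignored.
    known_mask: (B, 3) boolean, True for tracks given as context.
    clap_cond:  (B, 3, D) per-track CLAP embeddings (from text or audio).
    """
    x = torch.randn_like(known)                      # start all tracks from noise
    for step in reversed(range(num_steps)):
        t = torch.full((known.shape[0], 3), step, device=known.device)
        # Known tracks are supplied clean: clamp their latents and timesteps.
        t = torch.where(known_mask, torch.zeros_like(t), t)
        x = torch.where(known_mask[..., None, None], known, x)
        x = model(x, t, clap_cond)                   # one reverse-diffusion update
    # Return the context tracks unchanged and the generated track(s) filled in.
    return torch.where(known_mask[..., None, None], known, x)
```

In this view, query-driven separation conditions the source track on a text-derived CLAP embedding while the mixture is given as context, and source imputation is the complementary case where the existing sources are context and a new one is generated.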

Proposed Method

Figure: Overall architecture. (a) Training pipeline: we train a three-track latent diffusion model on mixtures, submixtures, and sources. Each track is perturbed independently and conditioned on its own timestep and CLAP embedding. (b) Inference pipeline: at test time, task-specific latents are either generated or inpainted given the available context and text prompts, and the resulting latents are decoded into waveforms.
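The sketch below illustrates the training step described in (a): each of the three tracks is perturbed with its own independently sampled timestep and denoised under per-track timestep and CLAP conditioning. The helper names (`encode`, the `alphas` noise schedule, and the `model` signature) and the epsilon-prediction loss are assumptions for illustration.

```python
# Illustrative training step for the three-track latent diffusion model, assuming
# per-track latents of shape (B, C, T), a noise schedule `alphas` of shape (T_max,),
# and a hypothetical denoiser `model(x_t, t, cond)` that predicts the added noise.
import torch
import torch.nn.functional as F

def training_step(model, mix_lat, sub_lat, src_lat, clap_cond, alphas):
    # Stack tracks: (B, 3, C, T) for [mixture, submixture, source].
    x0 = torch.stack([mix_lat, sub_lat, src_lat], dim=1)
    B = x0.shape[0]
    # Sample an independent timestep for each track (the key point in panel (a)).
    t = torch.randint(0, alphas.shape[0], (B, 3), device=x0.device)
    a = alphas[t][..., None, None]                   # (B, 3, 1, 1) for broadcasting
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise     # perturb each track independently
    pred = model(x_t, t, clap_cond)                  # per-track timestep + CLAP conditioning
    return F.mse_loss(pred, noise)                   # standard denoising objective
```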

Audio Demos for Slakh2100 (Bass, Drums, Guitar, Piano only)

Beyond Slakh: Adding MUSDB18 and MoisesDB