MGE-LDM: Simultaneous Music Generation and Extraction via Joint Latent Modeling

Music and Audio Research Group (MARG), IPAI, AIIS,
Department of Intelligence and Information,
Seoul National University

Abstract

We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories.
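To make the inpainting formulation concrete, below is a minimal sketch of how extraction or imputation could be cast as conditional inpainting in the latent space: known tracks are held fixed as clean context while the missing track is denoised from noise. The three-track layout, the denoiser signature model(x, t, cond), and the sampler update are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of latent inpainting for extraction/imputation, assuming a
# hypothetical three-track denoiser `model(latents, timesteps, clap_cond)`.
import torch

def inpaint_missing_track(model, known, known_mask, clap_cond, num_steps=50):
    """
    known:      (B, 3, C, T) latents for [mixture, submixture, source] tracks;
                entries where known_mask is False are ignored.
    known_mask: (B, 3) boolean, True for tracks given as context.
    clap_cond:  (B, 3, D) per-track CLAP embeddings (from text or audio).
    """
    x = torch.randn_like(known)                      # start all tracks from noise
    for step in reversed(range(num_steps)):
        t = torch.full((known.shape[0], 3), step, device=known.device)
        # Known tracks are supplied clean: clamp their latents and timesteps.
        t = torch.where(known_mask, torch.zeros_like(t), t)
        x = torch.where(known_mask[..., None, None], known, x)
        x = model(x, t, clap_cond)                   # one reverse-diffusion update
    # Return the context tracks unchanged and the generated track(s) filled in.
    return torch.where(known_mask[..., None, None], known, x)
```

In this view, query-driven separation conditions the source track on a text-derived CLAP embedding while the mixture is given as context, and source imputation is the complementary case where the existing sources are context and a new one is generated.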

Proposed Method

Figure: Overall architecture. (a) Training pipeline: we train a three-track latent diffusion model on mixtures, submixtures, and sources. Each track is perturbed independently and conditioned on its own timestep and CLAP embedding. (b) Inference pipeline: at test time, task-specific latents are either generated or inpainted given the available context and text prompts, and the resulting latents are decoded into waveforms.
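The sketch below illustrates the training step described in (a): each of the three tracks is perturbed with its own independently sampled timestep and denoised under per-track timestep and CLAP conditioning. The helper names (`encode`, the `alphas` noise schedule, and the `model` signature) and the epsilon-prediction loss are assumptions for illustration.

```python
# Illustrative training step for the three-track latent diffusion model, assuming
# per-track latents of shape (B, C, T), a noise schedule `alphas` of shape (T_max,),
# and a hypothetical denoiser `model(x_t, t, cond)` that predicts the added noise.
import torch
import torch.nn.functional as F

def training_step(model, mix_lat, sub_lat, src_lat, clap_cond, alphas):
    # Stack tracks: (B, 3, C, T) for [mixture, submixture, source].
    x0 = torch.stack([mix_lat, sub_lat, src_lat], dim=1)
    B = x0.shape[0]
    # Sample an independent timestep for each track (the key point in panel (a)).
    t = torch.randint(0, alphas.shape[0], (B, 3), device=x0.device)
    a = alphas[t][..., None, None]                   # (B, 3, 1, 1) for broadcasting
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise     # perturb each track independently
    pred = model(x_t, t, clap_cond)                  # per-track timestep + CLAP conditioning
    return F.mse_loss(pred, noise)                   # standard denoising objective
```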

Audio Demos for Slakh2100 (Bass, Drums, Guitar, Piano only)

Beyond Slakh: Adding MUSDB18 and MoisesDB