@@ -43,10 +42,7 @@ <h2>InspireMusic: A Unified Framework for Controlled High-Fidelity Long-Form Mus
<p><b>Alibaba Group</b></p>
</div>
<p><b>Abstract</b>
- We introduce <b>InspireMusic</b>, a unified framework designed to generate high-fidelity music, songs, and audio, which integrates an autoregressive transformer with a super-resolution flow-matching model.
- This framework enables to generate high-fidelity long-form audio at 48kHz from both text and audio modalities. Our model differs from previous approaches, we utilize dual audio tokenizers: a high-bitrate compression audio tokenizer contains richer semantic information,
- thereby reducing training costs and enhancing efficiency, and an acoustic codec that preserves fine-grained acoustic details during flow-matching model training. This combination enables us to achieve high-quality audio generation with long-form coherence.
- Then an autoregressive transformer model based on Qwen2.5 to predict 75Hz audio tokens. Next, we employ a super resolution flow matching model to learn the latent features of the audio from 150Hz music tokenzier, and finally, we output high-quality audio waveforms through a Vocoder. This framework represents a significant advancement in music generation by directly modeling raw audio, ensuring both diversity and high-fidelity output.
+ We introduce <b>InspireMusic</b>, a framework that integrates super-resolution and a large language model for high-fidelity long-form music generation. This unified framework generates high-fidelity music, songs, and audio by combining an autoregressive transformer with a super-resolution flow-matching model, enabling controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches in that we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence. An autoregressive transformer model based on Qwen2.5 then predicts audio tokens, and a super-resolution flow-matching model generates high-sampling-rate audio with fine-grained details learned from an acoustic codec model.
</p>
</p>
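To make the revised abstract concrete, the following is a minimal Python sketch of the four-stage pipeline it describes. All names (`tokenizer`, `ar_model`, `srfm`, `vocoder`, and their methods) are hypothetical placeholders for illustration, not the repository's actual API.

```python
# Minimal sketch of the InspireMusic-style pipeline described in the abstract.
# Every name below is a hypothetical placeholder, not the project's real API.

def generate_music(text_prompt, tokenizer, ar_model, srfm, vocoder, audio_prompt=None):
    # 1) Tokenize: a single-codebook, high-bitrate compression tokenizer turns
    #    an optional audio prompt into discrete semantic tokens.
    prompt_tokens = tokenizer.encode(audio_prompt) if audio_prompt is not None else []

    # 2) Autoregressive stage: a Qwen2.5-based transformer predicts audio tokens
    #    via next-token prediction, conditioned on the text and audio prompts.
    audio_tokens = ar_model.generate(text=text_prompt, audio_prefix=prompt_tokens)

    # 3) Super-resolution flow matching: map the low-rate tokens to latent
    #    features carrying fine-grained acoustic detail at a higher sampling rate.
    latents = srfm.sample(condition=audio_tokens)

    # 4) Vocoder: synthesize the final high-sampling-rate waveform.
    return vocoder(latents)
```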
@@ -55,10 +51,10 @@ <h2>InspireMusic: A Unified Framework for Controlled High-Fidelity Long-Form Mus
- Long-form music generation.
</ul>
<ul>
- - High audio quality, support 48kHz, 24kHz.
+ - High audio quality.
</ul>
<ul>
- - A unified high efficiency music generation framework.
+ - A unified music, song and audio generation framework.
</ul>
</p>
</p>
@@ -100,16 +96,16 @@ <h2 id="InspireMusic-overview" style="text-align: center;">Overview of InspireMu
<p style="text-align: center;">
<b>Figure 1.</b> An overview of the InspireMusic framework. We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing high-quality 48kHz long-form audio. InspireMusic consists of three key components:
- - **Dual Audio Tokenizers**:
- The framework first converts raw audio waveforms into discrete tokens that are efficiently processed by the autoregressive model. We employ two tokenizers: WavTokenizer converts 24kHz audio into 75Hz discrete tokens, while Hifi-Codec transforms 48kHz audio into 150Hz latent features suited for our flow matching model.
+ - **Audio Tokenizers**:
+ Convert the raw audio waveform into discrete audio tokens that the autoregressive transformer model can efficiently process and train on. Audio waveforms at a lower sampling rate are converted into discrete tokens via a high-bitrate compression audio tokenizer.
- **Autoregressive Transformer**:
- This component is trained using a next-token prediction approach on both text and audio tokens, enabling it to generate coherent and contextually relevant audio sequences.
+ The autoregressive transformer model uses Qwen2.5 as its backbone; taking text and audio tokens as input, it is trained with a next-token prediction approach, enabling it to generate coherent and contextually relevant token sequences.
- **Super-Resolution Flow Matching Model**:
- An ODE-based diffusion model, specifically a super-resolution flow matching (SRFM) model, maps the lower-resolution audio tokens to latent features with a higher sampling rate. A vocoder then generates the final audio waveform from these enhanced latent features.
+ An ODE-based flow-matching model, specifically a super-resolution flow matching (SRFM) model, maps the generated tokens to latent features with fine-grained acoustic details learned from higher-sampling-rate audio, ensuring that high-fidelity acoustic information flows through the model pipeline. A vocoder then generates the final audio waveform from these enhanced latent features.
- InspireMusic supports a range of tasks including text-to-music, music continuation, music reconstruction, and music super-resolution.-- [<a href="https://arxiv.org/abs/">Paper</a>]-->
+ InspireMusic supports a range of tasks including text-to-music, music continuation, music reconstruction, and music super-resolution.-- [<a href="http://arxiv.org/abs/2503.00084">Technical Report</a>]-->
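To illustrate the SRFM component described above, the following is a minimal Euler-discretized flow-matching sampler in PyTorch. The velocity network `velocity_net` and its conditioning interface are assumptions for illustration; the actual InspireMusic implementation may differ.

```python
import torch

@torch.no_grad()
def srfm_sample(velocity_net, token_condition, latent_shape, num_steps=50, device="cpu"):
    """Integrate the flow-matching ODE dx/dt = v(x_t, t, cond) with Euler steps,
    from Gaussian noise at t=0 to high-resolution latent features at t=1.
    `velocity_net` and `token_condition` are assumed placeholders."""
    x = torch.randn(latent_shape, device=device)  # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt, device=device)  # current time
        v = velocity_net(x, t, token_condition)   # predicted velocity field
        x = x + v * dt                            # Euler step toward the data
    return x  # latent features handed to the vocoder
```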