Google’s MusicLM is an AI model that generates high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar”. The model can be conditioned on text alone or on text plus a melody (for example, a hummed or whistled tune), and it generates 24 kHz audio that remains consistent over several minutes. In evaluations, MusicLM outperforms previous systems in both audio quality and adherence to the text description.

To support future research, Google has publicly released MusicCaps, a dataset of 5.5k music-text pairs with rich text descriptions written by human experts. The model goes beyond short clips: it can capture nuances such as instrumental riffs, melodies, and moods from long, descriptive prompts. However, some generated samples sound distorted, and the system’s ability to generate vocals, including choral harmonies, is limited.
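While MusicLM itself has not been released, the MusicCaps dataset is publicly available. Below is a minimal sketch of loading and browsing it, assuming it is hosted on the Hugging Face Hub as google/MusicCaps with a single train split; the ytid and caption column names are taken from the public dataset card and should be verified against the version you download:

```python
# Minimal sketch: browse MusicCaps captions via the Hugging Face `datasets`
# library. Assumes the dataset lives at "google/MusicCaps" and exposes
# "ytid" (YouTube clip ID) and "caption" (expert-written description) columns.
from datasets import load_dataset

musiccaps = load_dataset("google/MusicCaps", split="train")

print(f"{len(musiccaps)} music-text pairs")  # ~5.5k per the release

# Inspect a few expert captions, e.g. to study the kind of rich,
# free-text prompts MusicLM was evaluated against.
for row in musiccaps.select(range(3)):
    print(row["ytid"], "->", row["caption"][:100])
```

Note that, per the dataset card, the audio itself is not bundled: each row references a YouTube clip by ID plus start/end timestamps, so reproducing the paired audio requires fetching those clips separately.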