Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario remains under-explored. The challenges of dance-driven multi-instrument music (MIDI) generation are twofold: 1) there is no publicly available dataset pairing multi-instrument MIDI with video, and 2) the correlation between music and video is weak. To tackle these challenges, we build the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce Dance2MIDI, a multi-instrument MIDI generation framework conditioned on dance video. Specifically, 1) to capture the relationship between dance and music, we employ a Graph Convolutional Network to encode the dance motion, which allows us to extract features related to dance movement and dance style; 2) to generate a harmonious rhythm, we use a Transformer model with a cross-attention mechanism to decode the drum track sequence; and 3) we model the generation of the remaining tracks, conditioned on the drum track, as a sequence understanding and completion task, employing a BERT-like model that comprehends the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.

The audio and video examples below illustrate the generated results in detail. The music embedded in the showcased videos is generated by Dance2MIDI. The red audio bar denotes the audio of the original video, the green audio bar denotes the results generated by the CMT model, and the blue audio bar denotes the results generated by the Dance2Music model.
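To make point 1) of the abstract concrete, the following is a minimal sketch, not the authors' released code, of how a Graph Convolutional Network can turn per-frame joint coordinates into dance-movement and dance-style features. The skeleton size, feature dimensions, adjacency matrix, and the two pooling heads are illustrative assumptions.

```python
# Sketch only: GCN over 2D joint coordinates, producing frame-wise movement
# features and a clip-level style feature. All sizes are assumptions.
import torch
import torch.nn as nn

NUM_JOINTS = 17          # assumed COCO-style skeleton
FEAT_DIM = 64

class GraphConv(nn.Module):
    """One graph-convolution layer: X' = relu(A_hat @ X @ W)."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)          # normalized adjacency (J, J)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                          # x: (batch, frames, J, in_dim)
        x = torch.einsum("ij,btjf->btif", self.adj, x)
        return torch.relu(self.linear(x))

class MotionEncoder(nn.Module):
    """Two branches over GCN features: per-frame movement and clip-level style."""
    def __init__(self, adj):
        super().__init__()
        self.gcn = nn.Sequential(GraphConv(2, 32, adj), GraphConv(32, FEAT_DIM, adj))
        self.movement_head = nn.Linear(NUM_JOINTS * FEAT_DIM, FEAT_DIM)  # per frame
        self.style_head = nn.Linear(NUM_JOINTS * FEAT_DIM, FEAT_DIM)     # pooled over time

    def forward(self, joints):                     # joints: (batch, frames, J, 2)
        h = self.gcn(joints)                       # (batch, frames, J, FEAT_DIM)
        h = h.flatten(2)                           # (batch, frames, J * FEAT_DIM)
        movement = self.movement_head(h)           # frame-wise movement features
        style = self.style_head(h.mean(dim=1))     # clip-level style feature
        return movement, style

# Example: encode a 600-frame clip; the identity adjacency is a placeholder
# for the real skeleton graph.
adj = torch.eye(NUM_JOINTS)
encoder = MotionEncoder(adj)
movement, style = encoder(torch.randn(1, 600, NUM_JOINTS, 2))
```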
Our framework consists of three main components: the Context Encoder, the Drum Rhythm Generator, and the Multi-Track BERTGen. In the Context Encoder, we first apply joint point extraction to obtain the spatial coordinates of the human joints in the dance video. Two separate branches then extract dance style features and dance movement features, which are concatenated to form the conditional control information that guides the generation of the corresponding MIDI music. In MIDI music, drums play a pivotal role in establishing the fundamental rhythm patterns that underlie a composition; in composition and improvisation, composers typically begin a new piece by designing the rhythm of the drum track and then incrementally layering additional instrumental tracks on top of it. Accordingly, the Drum Rhythm Generator first produces the drum track in an autoregressive manner, establishing the rhythmic foundation of the music. We then enrich the composition with note information from the other tracks and instruments. We formulate this step as a sequence completion task, in which a BERT-like model completes the remaining tracks while developing a comprehensive understanding of the entire musical piece.
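As a rough illustration of the Drum Rhythm Generator described above, here is a minimal sketch, under assumed token and feature sizes, of an autoregressive Transformer decoder whose cross-attention attends to the concatenated dance features from the Context Encoder. The vocabulary, special tokens, dimensions, and greedy decoding loop are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch only: drum-event tokens are decoded autoregressively; the decoder's
# cross-attention reads the fused dance features as memory.
import torch
import torch.nn as nn

VOCAB = 512          # assumed size of the drum-event vocabulary
D_MODEL = 256
BOS, EOS = 0, 1      # assumed special tokens

class DrumRhythmGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, drum_tokens, dance_feats):
        # drum_tokens: (batch, T) token ids; dance_feats: (batch, frames, D_MODEL)
        tgt = self.embed(drum_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(drum_tokens.size(1))
        h = self.decoder(tgt, memory=dance_feats, tgt_mask=mask)  # cross-attend to dance
        return self.out(h)                                        # next-token logits

    @torch.no_grad()
    def generate(self, dance_feats, max_len=256):
        tokens = torch.full((dance_feats.size(0), 1), BOS, dtype=torch.long)
        for _ in range(max_len):
            logits = self.forward(tokens, dance_feats)[:, -1]
            nxt = logits.argmax(-1, keepdim=True)                 # greedy for brevity
            tokens = torch.cat([tokens, nxt], dim=1)
            if (nxt == EOS).all():
                break
        return tokens

# Example: condition on 600 frames of fused dance features.
gen = DrumRhythmGenerator()
drums = gen.generate(torch.randn(1, 600, D_MODEL), max_len=32)
```

In the full framework, the Multi-Track BERTGen stage would then take the generated drum tokens as fixed context and complete the masked positions of the remaining instrument tracks, in the spirit of masked-language-model completion.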
D2MIDI is the first dance-to-MIDI multi-instrument dataset, which has several important features:
1) High-quality solo dance video: it comprises high-quality solo dance videos that have been carefully curated from internet sources to exclude low-quality footage and videos featuring multiple dancers.
2) Multi-instrument and polyphonic MIDI: the dataset contains multi-instrumental and polyphonic MIDI transcriptions that are temporally synchronized with the corresponding dance videos.
3) Multi-style and large-scale: the dataset is both multi-style and large-scale, encompassing a diverse range of dance styles across 71,754 clips.
We employ a sliding-window method to sample data from the videos. Each window spans 600 frames, equivalent to a 30-second dance video, and the window slides forward by 40 frames at a time. This process yields a total of 71,754 data pairs covering six dance types: classical, hip-hop, ballet, modern, latin, and house. No two data pairs share the same music. Each data pair in D2MIDI lasts 30 seconds, which is long enough to guarantee music with a rhythmic structure. The MIDI in each pair contains up to 13 instrument types: Acoustic Grand Piano, Celesta, Drawbar Organ, Acoustic Guitar (nylon), Acoustic Bass, Violin, String Ensemble 1, SynthBrass 1, Soprano Sax, Piccolo, Lead 1 (square), Pad 1 (new age), and Drum.
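The sliding-window sampling described above can be summarized in a few lines; the helper name and clip representation below are illustrative.

```python
# Sketch of the 600-frame / 40-frame-stride window sampling used to build D2MIDI.
WINDOW = 600   # frames per clip (30 s at 20 fps)
STRIDE = 40    # frames between consecutive window starts

def sample_clips(num_frames):
    """Return (start, end) frame indices of every full 600-frame window."""
    return [(s, s + WINDOW) for s in range(0, num_frames - WINDOW + 1, STRIDE)]

# Example: a 1,000-frame video yields 11 overlapping 30-second clips.
print(sample_clips(1000)[:3])   # [(0, 600), (40, 640), (80, 680)]
```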
This website draws heavy design inspiration from the excellent EDGE site.