The Neural Drum Machine

This repository presents additional material and experiments accompanying the paper “Neural Drum Machine”, submitted to the 2019 International Conference on Computational Creativity. Follow this link to the arXiv page.

In this work, we introduce an audio synthesis architecture for real-time generation of drum sounds. The generative model is based on a conditional convolutional Wasserstein autoencoder (WAE) that learns to generate Mel-scaled magnitude spectrograms of short percussion samples. A Multi-Head Convolutional Neural Network (MCNN) implicitly performs phase reconstruction by estimating the audio signal directly from the magnitude spectrogram. The system is trained on a dataset of drum sounds spanning 11 categories, and it can synthesize sounds in real time on a CPU. We also describe a Max/MSP-based interface designed to interact with the model. With this setup, the system can be easily integrated into a studio-production environment. The interface provides simple control over the sound generation process, which lets the user quickly explore the space of possible drum timbres.

Global architecture of our system. The generative model (1) learns to reconstruct spectrograms from a parameter space. The second part of the system (2) is dedicated to spectrogram inversion: it generates an audio signal from a Mel spectrogram. Finally, the software interface (3) allows a user to interact with the model and to generate sound from the parameter space.

Here, we directly embed the following elements:

Reconstruction results for all parts of our model
Audio generated by sampling the latent space
Video demonstration of our plugin in Ableton Live

Reconstruction Results

**Bongo**
Input	WAE Output	Output

**Kick**
Input	WAE Output	Output

Click here if you want to see more reconstruction results …

Audio Generations

Here, we present audio generated from the latent space. We sample a 64-dimensional Gaussian and give the resulting latent code \(z\) to our decoder, along with a class label \(c\). This should generate a sound corresponding to that class label. However, listening to the resulting samples shows that this conditioning mechanism is sometimes not strong enough. For instance, samples generated with the class label corresponding to hihats actually sound more like claps. Sampling the prior therefore does not appear to be a reliable way to generate samples of a given class.
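The sampling step described above can be sketched as follows. This is a minimal illustration and not the code used for the paper: the one-hot encoding of \(c\), the concatenation of \(z\) and \(c\), the class index, and the function name `sample_decoder_input` are all assumptions made for the example. Only the latent size (64) and the number of categories (11) come from the text.

```python
import random

LATENT_DIM = 64   # size of the latent code z, as stated in the text
NUM_CLASSES = 11  # number of drum categories in the dataset


def sample_decoder_input(class_index, rng):
    """Return (z, c): a latent code drawn from N(0, I) and a one-hot label.

    Hypothetical helper: the actual conditioning format of the decoder
    is not specified here, so one-hot encoding is an assumption.
    """
    if not 0 <= class_index < NUM_CLASSES:
        raise ValueError("class_index out of range")
    # Each latent coordinate is an independent standard normal draw.
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    # One-hot vector with a single 1.0 at the requested class.
    c = [1.0 if i == class_index else 0.0 for i in range(NUM_CLASSES)]
    return z, c


rng = random.Random(0)  # fixed seed so the draw is repeatable
z, c = sample_decoder_input(class_index=3, rng=rng)

# Assumption: the label is simply concatenated to the latent code
# before being passed to the decoder.
decoder_input = z + c
```

In a real setup, `decoder_input` would be converted to a tensor and passed to the trained decoder, whose output spectrogram is then inverted to audio by the MCNN.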

Class	Sample 1	Sample 2
Kick
Snare
Closed Hihat
Bongo

Video Demonstration

Here, we showcase the latest version of our plugin in a studio production environment. This plugin is a VST Instrument developed with JUCE, and can be loaded in any Digital Audio Workstation. The model has been compiled with LibTorch, so a Python server is no longer required.

To generate sound through the interface, we provide several controllers. First, the XY pad lets the user control two latent dimensions, which are selected randomly. A selector lets the user define the range of the pad. Finally, a menu lets the user choose the type of sound to generate, which amounts to changing the conditioning label \(c\).
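As a rough illustration of how such a pad could drive the latent code, the sketch below maps pad coordinates to two randomly chosen dimensions. It is not the plugin's implementation: the function names, the assumption that pad coordinates lie in [0, 1], the symmetric range [-pad_range, pad_range], and the zero default for the other dimensions are all hypothetical.

```python
import random

LATENT_DIM = 64  # size of the latent code z


def pick_pad_dimensions(rng):
    """Choose two distinct latent dimensions for the pad's X and Y axes."""
    return tuple(rng.sample(range(LATENT_DIM), 2))


def pad_to_latent(x, y, dims, pad_range, base=None):
    """Map pad coordinates in [0, 1] to values in [-pad_range, pad_range].

    Only the two selected dimensions are modified; the others are copied
    from `base` (or set to 0.0 when no base code is given).
    """
    z = list(base) if base is not None else [0.0] * LATENT_DIM
    for coord, dim in zip((x, y), dims):
        coord = min(max(coord, 0.0), 1.0)      # clamp to the pad area
        z[dim] = (2.0 * coord - 1.0) * pad_range
    return z


dims = pick_pad_dimensions(random.Random(0))
# Pad centered horizontally and pushed to the top edge, with range 3.0.
z = pad_to_latent(0.5, 1.0, dims, pad_range=3.0)
```

Passing a previously encoded sample as `base` would let the pad explore variations around that sound rather than around the origin of the latent space.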

This video gives more details on the encoding mechanism that allows users to re-use their favorite samples.