This is the companion website of the paper
Vector Quantized Contrastive Predictive Coding for Template-based
Music Generation by Hadjeres and Crestel.
In this paper, we proposed a flexible method for generating variations of discrete sequences in which tokens can be grouped into basic units, such as sentences in a text or bars in music. More precisely, given a template sequence, we aim to produce novel sequences that share perceptible similarities with the original template, without relying on any annotation. The novelty of our approach is to cast the problem of generating variations as a representation learning problem.
We introduce:
- a self-supervised encoding technique, named Vector-Quantized Contrastive Predictive Coding (VQ-CPC), which learns a meaningful assignment of the basic units over a discrete set of codes, together with mechanisms for controlling the information content of these learnt discrete representations;
- an appropriate decoder architecture which can generate sequences from the compressed representations learned by the encoder.
VQ-CPC consists in introducing a quantization bottleneck into the Contrastive Predictive Coding (CPC) objective. Contrary to approaches like VQ-VAE, which aim at perfect reconstructions and are trained by minimizing a likelihood-based loss, we choose instead to highly compress our representations: we consider codebooks orders of magnitude smaller than those of such approaches (typically 8 or 16 codes) and maximize a different, contrastive objective. We show that our approach yields a meaningful clustering of the basic units and that this clustering can then be used for generation. Moreover, we highlight the influence that the sampling distribution of the negative samples in the CPC objective has on the learnt clusters.
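To make the two ingredients concrete, here is a minimal, dependency-free sketch of a quantization step followed by the contrastive (InfoNCE-style) scoring used in CPC. The function names, the dot-product score, and the toy vectors are illustrative assumptions, not the paper's actual architecture (which uses learned neural networks and a straight-through estimator for the quantizer):

```python
import math

def nearest_code(z, codebook):
    """Quantization bottleneck: snap a continuous encoding to the index of
    its nearest codebook vector (a real implementation would backpropagate
    through this with a straight-through estimator)."""
    return min(range(len(codebook)),
               key=lambda k: sum((zi - ci) ** 2 for zi, ci in zip(z, codebook[k])))

def info_nce(pred, positive, negatives):
    """Contrastive objective: the prediction should score the true future
    code higher than the negative samples. Returns the negative
    log-softmax probability assigned to the positive."""
    def score(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [score(pred, positive)] + [score(pred, n) for n in negatives]
    m = max(logits)  # numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

With a codebook of 8 or 16 entries, `nearest_code` is what forces the representation through a tiny discrete alphabet; minimizing `info_nce` (rather than a reconstruction loss) is what distinguishes VQ-CPC from a VQ-VAE.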
We applied our technique to the corpus of J.S. Bach chorales to derive a generative model particularly well-suited for generating variations of a given input chorale. Our experiments can be reproduced using our GitHub repository.
The results of our experiments are presented in the following sections.
Clusters
The encoder simply learns to map subsequences of a time series to a label belonging to a discrete alphabet. In other words, an encoder defines a clustering of the space of subsequences. This clustering is learned in a selfsupervised manner, by optimising a contrastive objective.
In our experiments, we considered Bach chorales and defined a structuring element as one beat of a four-voice chorale. Note that fixed-length structuring elements are not a requirement: variable-length units could be used in other applications such as natural language processing.
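The notion of a structuring element and its cluster assignment can be sketched in a few lines. This is a toy illustration, not the repository's code: the chorale is assumed to be a list of (soprano, alto, tenor, bass) pitch tuples, and the `encode` function below is a hypothetical deterministic stand-in for the learned encoder:

```python
def structuring_elements(chorale, beat_len):
    """Chop a chorale, given as a list of (soprano, alto, tenor, bass)
    time steps, into non-overlapping beat-sized structuring elements."""
    return [chorale[t:t + beat_len] for t in range(0, len(chorale), beat_len)]

def encode(element, num_codes=16):
    """Hypothetical stand-in for the learned encoder: any deterministic map
    from a structuring element to one of `num_codes` discrete labels.
    The trained VQ-CPC encoder plays exactly this role."""
    return hash(tuple(map(tuple, element))) % num_codes
```

A cluster, as displayed in the animations below, is then simply the set of all structuring elements that `encode` maps to the same label.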
In the following animated pictures, each frame represents one cluster, and each bar represents one structuring element belonging to that cluster. A limited number of clusters and elements are displayed on this site; more examples can be downloaded here: clusters.zip.
In our article, we explored three different training schemes: VQ-CPC-Uniform, where the negative samples are drawn uniformly from the dataset; VQ-CPC-SameSeq, where the negative samples are drawn from the same chorale as the positive samples; and what we termed a Distilled VQ-VAE, inspired by Hierarchical Autoregressive Image Models with Auxiliary Decoders by De Fauw et al.
Each of them led to a different type of clustering which we display below:
Clusters obtained with the VQ-CPC-Uniform model
The negative examples in the contrastive objective are sampled uniformly among all chorales. Since chorales have been written in all possible key signatures and we used transposition as a data augmentation, an easy way to discriminate the positive examples from the negatives is to look at the accidentals. Hence, the clusters are often composed of elements lying in the same key or in related keys.
Clusters obtained with the VQ-CPC-SameSeq model
The negative examples in the contrastive objective are sampled from the same sequence as the positive example, but at different locations (either before or after the position of the positive). The contrastive objective is then similar to learning to sort the elements of the score in chronological order. In this setting, the key is no longer a discriminative feature of the positive example; instead, the harmonic function is an informative indicator of the position of a chord within a phrase. Hence, clusters tend to contain elements sharing similar harmonic functions.
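The two negative-sampling strategies contrasted above differ only in where the negatives are drawn from. A minimal sketch, assuming a corpus is a list of chorales and a chorale is a list of structuring elements (the function names are our own, not the repository's):

```python
import random

def negatives_uniform(corpus, num, rng=random):
    """VQ-CPC-Uniform: draw each negative from any position of any
    chorale in the corpus."""
    return [rng.choice(rng.choice(corpus)) for _ in range(num)]

def negatives_same_seq(chorale, positive_idx, num, rng=random):
    """VQ-CPC-SameSeq: draw negatives from the same chorale as the
    positive, at other positions (before or after it)."""
    others = [u for i, u in enumerate(chorale) if i != positive_idx]
    return rng.sample(others, num)
```

Because `negatives_same_seq` never lets key distinguish positive from negatives (all candidates share the chorale's key), the encoder is pushed toward position-dependent features such as harmonic function, which matches the clusters observed above.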
Clusters obtained with the Distilled VQ-VAE model
With the Distilled VQ-VAE model, the discrete codes are trained to minimize a likelihood-based loss. As a result, the encoder tends to focus on capturing the key of the fragments, as was the case with the VQ-CPC codes under uniform negative sampling. However, we observe that the range of the soprano voice is also captured: the maximal range of the soprano part within a given cluster is no greater than a sixth. This behaviour can be explained by the fact that, in Bach chorales, the soprano voice tends to be more regular than the other voices (it often moves in conjunct motion).
Generating variations
Once the encoders are trained, we can train a decoder to reconstruct the original chorale given its sequence of codes. Because we limited the total number of different codes to 16, perfect reconstruction is almost impossible. This makes it possible to generate variations of a template chorale simply by computing its sequence of codes and then decoding it. The decoded chorale shares perceptual similarities with the template chorale, and these similarities depend on what information is contained in the codes.
In the following, we provide variations of a 6-bar template chorale for the three different methods we considered.
Variations obtained from VQ-CPC-Uniform codes
Example #1:

Example #2:

Variations obtained from VQ-CPC-SameSeq codes
Example #1:

Example #2:

Variations obtained from Distilled VQ-VAE codes
Example #1:

Example #2:

Variations of a template chorale
We provide additional examples of variations. In this section, we generate variations of a full template chorale by using the models from the preceding section together with a sliding window.
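The sliding-window step amounts to encoding overlapping spans of the full chorale so that each span fits the fixed context length the models were trained on. A minimal sketch, with hypothetical `window` and `hop` parameters (the actual values used in our experiments are set in the repository):

```python
def sliding_windows(units, window, hop):
    """Slide a fixed-size window over a full chorale's structuring
    elements; each window can then be encoded and decoded exactly as the
    6-bar templates of the previous section."""
    return [units[i:i + window]
            for i in range(0, len(units) - window + 1, hop)]
```

Overlapping windows (`hop` smaller than `window`) let consecutive decoded spans share context, which helps the generated variation stay coherent across window boundaries.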
Variations of a full chorale with VQ-CPC-SameSeq
Original chorale:

Variation #1:

Variation #2:

Variations of a full chorale with Distilled VQ-VAE
Original chorale:

Variation #1:

Variation #2:
