cam-companion

Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation

Abstract
Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.

Training Process of CAM

The causal Backbone receives as input a sequence of continuous embeddings augmented with noise. It outputs z_t, which the Sampler uses as conditioning to denoise a noise-corrupted version of x_t.
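The noise-augmentation step can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the upper bound `sigma_max` and the per-sequence uniform distribution of noise levels are assumptions made here for concreteness.

```python
import numpy as np

def noise_augment(x, sigma_max=0.1, rng=None):
    """Corrupt a batch of embedding sequences with random-strength Gaussian noise.

    x: array of shape (batch, seq_len, dim).
    sigma_max: hypothetical upper bound on the noise level; each sequence in
    the batch draws its own level sigma ~ U(0, sigma_max), so the model sees
    a range of corruption strengths during training.
    """
    if rng is None:
        rng = np.random.default_rng()
    # One noise level per sequence, broadcast over time steps and channels.
    sigma = rng.uniform(0.0, sigma_max, size=(x.shape[0], 1, 1))
    return x + sigma * rng.standard_normal(x.shape)
```

During training, the backbone would receive `noise_augment(x)` instead of the clean embeddings `x`, making it robust to the imperfect embeddings it will condition on at inference time.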

Audio Examples

We train all models on the task of unconditional generation of single instrument samples.

Baselines:

GIVT: Generative Infinite-Vocabulary Transformers, trained with 32 modes.
GIVT+Noise: GIVT with 32 modes, trained with our proposed noise augmentation technique.
MAR: the model from Autoregressive Image Generation without Vector Quantization, using the configuration with raster order and causal direction.
MAR RF: the MAR model trained with the Rectified Flow framework instead of the noise-prediction objective with a linear noise schedule.
RF: Non-autoregressive diffusion model trained using the Rectified Flow framework.
CAM: Ours.
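For reference, the Rectified Flow objective used by the MAR RF and RF baselines can be sketched as below. The velocity-prediction interface `model(x_t, t)` and the uniform sampling of t are illustrative assumptions; the baselines' actual training details may differ.

```python
import numpy as np

def rectified_flow_loss(model, x1, rng=None):
    """Rectified Flow training objective (sketch).

    Regress the constant velocity x1 - x0 along the straight path
    x_t = (1 - t) * x0 + t * x1, where x0 is Gaussian noise and x1 a data
    embedding. `model(x_t, t)` is an assumed velocity-prediction interface.
    """
    if rng is None:
        rng = np.random.default_rng()
    x0 = rng.standard_normal(x1.shape)                      # noise endpoint
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1                           # linear interpolation
    v_target = x1 - x0                                      # straight-line velocity
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)
```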

We show random unconditional samples generated by all models. For all autoregressive models, we generate samples with double the length of the total context of the model.
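Such an autoregressive rollout, including the low-level noise injected at inference to match the noise-augmented training distribution, can be sketched as follows. The `backbone` and `sampler` callables and the noise level `sigma_inf` are placeholders for illustration.

```python
import numpy as np

def generate(backbone, sampler, context, num_steps, sigma_inf=0.05, rng=None):
    """Autoregressive generation sketch.

    At each step the backbone summarizes the (noise-injected) history into a
    conditioning vector z_t, and the sampler denoises to produce the next
    embedding x_t, which is appended to the sequence. sigma_inf is a
    hypothetical low inference-time noise level.
    """
    if rng is None:
        rng = np.random.default_rng()
    seq = list(context)
    for _ in range(num_steps):
        hist = np.stack(seq)
        # Inject low-level noise so the history matches the noisy
        # embeddings the backbone saw during training.
        hist = hist + sigma_inf * rng.standard_normal(hist.shape)
        z = backbone(hist)      # conditioning vector z_t
        x_next = sampler(z)     # denoised next embedding x_t
        seq.append(x_next)
    return np.stack(seq[len(context):])
```

Because each generated embedding re-enters the context, even small per-step errors would otherwise compound; the noise-augmented training makes the backbone tolerant of this feedback.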

(Audio sample grid: GIVT, GIVT+Noise, MAR, MAR RF, RF, CAM (Ours))