Musicians have long searched for a way to automate composition. With the development of artificial intelligence and deep learning, many applications have emerged that offer ideas and insights to songwriters. By aligning notes and beats correctly, deep learning models can uncover melodies that prior musicians rarely or never used. In this hackathon project, we propose converting audio into image-based files so that convolutional neural networks (CNNs) can be applied, and we demonstrate that a model can reconstruct melody-based images after training for a certain number of epochs.

A generative adversarial network (GAN) estimates generative models via an adversarial process. Other solutions, such as RNNs or CRNN + GAN, also exist and are capable of generating high-quality music pieces with a “Jazz” flavor. Because of their model characteristics, these solutions are constrained by issues such as overfitting, high variance, or high computational cost. After comparing them, we judged the GAN to be the most stable and desirable choice for the task of Jazz music composition.

Methodology

Datasets

We collected our music dataset from the Kaggle community. The dataset consists of classical piano MIDI files containing compositions by 19 famous composers. To pursue better performance, we restricted the song selection to Mozart’s works, which are identified by their left and right piano parts.

Image Conversion and Preprocessing

Melody notes are converted to black-and-white pixels (values from 0 to 255), with a fixed size of 106 X 100 for each image. The 106 rows represent the different notes in the music, and the 100 columns represent the length of the melody. The intuition is illustrated below.
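
As a rough illustration of this mapping, the following minimal sketch fills a 106 X 100 grid from a list of notes. The note format (pitch row, start step, duration in steps) and the function name notes_to_image are assumptions made for this example only; they are not taken from the project's code.

```python
# Sketch only: the (pitch_row, start_step, duration_steps) note format
# is a hypothetical representation chosen for illustration.
N_PITCHES = 106  # rows: one per distinct note, as described in the text
N_STEPS = 100    # columns: melody length in time steps


def notes_to_image(notes, n_pitches=N_PITCHES, n_steps=N_STEPS):
    """Return a list-of-rows image with 255 where a note sounds, else 0."""
    image = [[0] * n_steps for _ in range(n_pitches)]
    for pitch, start, duration in notes:
        if not 0 <= pitch < n_pitches:
            continue  # ignore pitches that fall outside the image
        # Clip the note to the image width before painting it.
        for t in range(max(start, 0), min(start + duration, n_steps)):
            image[pitch][t] = 255
    return image


# Three made-up notes: rows 60, 64 and 67, played one after another.
example_notes = [(60, 0, 4), (64, 4, 4), (67, 8, 8)]
image = notes_to_image(example_notes)
```

Each row of the result is one note and each column one time step, so a held note appears as a horizontal run of white pixels.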

Each image is then resized to 106 X 106 pixels, a square shape that is convenient for the neural network architecture. We also scaled the pixel values of each image to the range -1 to 1, matching the activation function used in the model. The process is visualized below.
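
The two preprocessing steps can be sketched as follows. The text does not say how the resize is done, so this example assumes simple zero-padding of the time axis rather than interpolation. The helper names pad_to_square and scale_to_unit_range are hypothetical.

```python
def pad_to_square(image, size=106, fill=0):
    """Pad each row on the right with `fill` until the image is size x size.

    Assumption: zero-padding stands in for the unspecified resize step.
    """
    if len(image) != size or any(len(row) > size for row in image):
        raise ValueError("expected `size` rows, each at most `size` wide")
    return [list(row) + [fill] * (size - len(row)) for row in image]


def scale_to_unit_range(image):
    """Map pixel values linearly from [0, 255] to [-1.0, 1.0]."""
    return [[(pixel - 127.5) / 127.5 for pixel in row] for row in image]


# A blank 106 X 100 image with a single white pixel, for demonstration.
raw = [[0] * 100 for _ in range(106)]
raw[60][0] = 255

square = pad_to_square(raw)
scaled = scale_to_unit_range(square)
```

A range of -1 to 1 is the usual choice when a generator ends in a tanh layer. Whether that is the activation used here is an assumption.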

Deep Convolutional Generative Adversarial Network

The model has two main components: a Generator and a Discriminator. The Generator aims to generate new, plausible melody-based images from a D-dimensional (D = 100) latent space, and the Discriminator classifies images as real or fake. GANs are trained adversarially until the Discriminator can no longer identify generated examples as “fake”, that is, in our case, as non-melody-based images.
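
To make the adversarial objective concrete, here is a minimal sketch of the standard binary cross-entropy GAN losses. It uses the common non-saturating generator loss, which is assumed here and may differ from what the project used. The Discriminator scores in the example are made-up values, not results from our training runs.

```python
import math


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predictions)


def discriminator_loss(d_real, d_fake):
    """The Discriminator wants real images scored 1 and generated ones 0."""
    real_loss = binary_cross_entropy(d_real, [1.0] * len(d_real))
    fake_loss = binary_cross_entropy(d_fake, [0.0] * len(d_fake))
    return real_loss + fake_loss


def generator_loss(d_fake):
    """The Generator wants its images scored 1 (non-saturating form)."""
    return binary_cross_entropy(d_fake, [1.0] * len(d_fake))


# Illustrative scores for a confident Discriminator.
d_real = [0.9, 0.95]   # probabilities assigned to real images
d_fake = [0.05, 0.1]   # probabilities assigned to generated images
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

With these scores the Discriminator loss is small and the Generator loss is large. That is the situation in which the Discriminator easily separates generated images from real ones.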

The Discriminator model (left part) is a convolutional neural network that takes 106 X 106 images as input. The Generator model (right part) is an up-sampling neural network that takes a 100-dimensional random noise vector as input and reconstructs an output that resembles melody-based images.

Results

During training, the Generator came to produce plausible melody-based images as the number of training epochs increased (see Melody training section). During the first thirty epochs, the Generator failed to reconstruct images because neither the Discriminator nor the Generator was well trained yet, which also corresponds to the low loss in that period (see Model evaluation section). This can be interpreted as follows: at the beginning, the Discriminator can easily distinguish “fake” random noise from “real” melody-based images, because the Generator has not yet learned the information needed to reconstruct the images. As training continued, we observed that the model began to generate meaningful notes.

To demonstrate performance, we generated a sample melody-based image by feeding a random noise vector into our trained Generator model (see Melody generation section). The melody-based image is converted back to a MIDI audio file, and its score is shown for evaluation.
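
The first half of that conversion, going from a generated image back to a list of notes, could look like the sketch below. The threshold of 0.0 and the (pitch row, start step, duration) note format are assumptions for illustration. Writing the notes out as a MIDI file would additionally require a MIDI library and is not shown.

```python
def image_to_notes(image, threshold=0.0):
    """Extract (pitch_row, start_step, duration_steps) from a [-1, 1] image.

    A pixel counts as "on" when its value exceeds `threshold`; each
    horizontal run of "on" pixels in a row becomes one note.
    """
    notes = []
    for pitch, row in enumerate(image):
        start = None  # time step where the current run began, if any
        for t, value in enumerate(row):
            is_on = value > threshold
            if is_on and start is None:
                start = t
            elif not is_on and start is not None:
                notes.append((pitch, start, t - start))
                start = None
        if start is not None:  # run extends to the last column
            notes.append((pitch, start, len(row) - start))
    return notes


# Tiny 3 x 6 stand-in for a generated image, with values in [-1, 1].
generated = [[-1.0] * 6 for _ in range(3)]
generated[1][1:4] = [1.0, 1.0, 1.0]  # one note lasting three steps
generated[2][4:6] = [0.5, 0.5]       # one note held until the end
notes = image_to_notes(generated)
```

Thresholding matters because generated pixels are rarely exactly -1 or 1, so the chosen cut-off decides which faint pixels become audible notes.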

Discussion

In this study, we framed music composition as an image generation task by mapping notes to melody-based images. We successfully generated melody-based images using a Deep Convolutional Generative Adversarial Network. However, this model architecture makes it difficult to train on multiple music tracks together. Thus, we either combined multiple tracks into one at the image level or, whenever possible, selected the main melody track (piano) as training data. Future studies should focus on leveraging multiple tracks and feeding them into the model simultaneously.