Summer 2018 Research Log

Purpose

Here I will keep a log of progress on my 2018 Monroe summer research project. The first entry in this journal is my abstract for my research project. In short, my project aims to use machine learning to compose coherent and pleasant-sounding music.

Update 09/24/18: Here's a sample of the music to come:

LSTM Generated Music

Abstract

Wednesday, July 18, 2018

Traditional neural networks use underlying patterns in learning data with the goal of classifying unseen data. For example, training a neural network to learn the differences between cats and dogs involves feeding the network labeled dog pictures as well as labeled cat pictures. In general, these neural networks work well for classification, but given training data, they cannot produce data of a similar nature. Recurrent neural networks (RNNs) are a type of neural network that analyze patterns in sequenced data for the purpose of generating similar looking sequences of data. For instance, an RNN trained on the complete works of Shakespeare produces text with a similar syntax.

Although RNNs are effective tools for learning short patterned sequences, such as English words, they are less effective at learning longer term underlying patterns. Long short-term memory recurrent neural networks (LSTM RNNs) seek to enhance RNNs by learning longer term patterns, with applications in areas such as speech synthesis and language modeling. For my summer research project, I will train a LSTM RNN on a large dataset of MIDI files (files for musical notation) to test whether the network is capable of composing high quality music. More specifically, I will test whether the network produces music comparable to human-made compositions, and in the process determine the musical elements that contribute to the best sounding music. First the model will learn genres individually, and afterwards it will learn music indiscriminatingly. In the beginning, I will also only emphasize musical rhythm and pitch. Once the model performs well on note sequences, I will extend it to account for more subtle elements, such as sequences of sequences of notes (motifs), dynamics, and chord progressions. Expect to hear some music in the future!

First Update

Wednesday, August 8, 2018

This week I began my work creating a pipeline to train a machine learning model to compose music. So far, I have read several papers where researchers used recurrent neural networks to compose music, some of which containing deficiencies. Some papers used RNNs instead of LSTMs, which have weaker long term memory capabilities. Other papers used datasets of less than 100 pieces of music. Some of the older papers (early 2000s) limited the size of the neural networks in training due to technological limitations. At the same time, I found some projects online that provide some promising results. This web page provides several music samples from a trained RNN on samples from classical music (still trained on a pretty small dataset). The approach explained on the website is different than the approach I am taking, as it uses a multi-dimensional RNN to establish multi-dimensional musical relationships like chords.

At the moment, I am writing code to process the ~300,000 midi files I collected from the Internet (and remove any duplicates). It is not feasible to train any model on raw midi data, so I am making a tool to convert midi files into an intermediate format, which will later be configured into raw vectors to be fed into the LSTM. Music with multiple parts is much more complicated to preprocess into a one dimensional vector format (hence the use of a multi-dimensional RNN on the web page above), so I am focusing on preprocessing only single part music. Since a large percentage of the MIDI data contains multi-part music, I am trying to extract only the melody from each MIDI file, which is not a straightforward process. There is a precedent for extracting melodies from multi-part music, including using MIDI metadata and heuristics on note data, but there are obvious edge cases in which there is no single melody, or where a melody transitions between parts.

I have some liberty in producing the intermediate format. Namely, I decide how the information from the MIDIs is encoded. One approach is to encode a sequence of notes with their corresponding pitches and rhythm, though I can add other information by finding structure in the music. For example, if the same seven notes play twice in a row, I can encode this series of notes as seven notes with a repeat sign at the end. The same can be done for musical sequences (repeated notes transposed by some constant step). The hope is that by encoding inherent musical structures in the MIDI data, the LSTM will be able to generalize more effectively; it may determine, for instance, that repeat signs usually occur towards the end of a song, and usually span a few measures only.

In encoding pitches, rather than encoding the absolute pitch (i.e. C4), I am encoding the relative pitch within the key (the tonic, dominant, subdominant, etc.). The model will learn which intervals typically precede and succeed other intervals, and as a result will not rely too heavily on arbitrary pitches.

One difficulty in training a deep learning models is overfitting data rather than generalizing. I will play with the amount of data I feed into the model and see how it responds, and I may end up writing code to determine whether melodies are plagiarized.

That's it for this post. I will post again when I have more progress.

Second Update

Wednesday, August 15, 2018

Since the last post, I have made significant progress in my code to preprocess midi files into an intermediate format. I've written an algorithm to filter out percussive notes from midis and to place each note in its corresponding key, based on metadata located within the midi file. With this information, I came up with a robust way to read in only the pitched notes of a midi (i.e. non-percussive notes) while simultaneously giving the note its relative pitch (tonic, supertonic, etc.). The relative pitch can be thought of as the relative placement of the note within its scale; the tonic is the first degree of the scale, the supertonic is the second degree of the scale, and every other degree of the scale is similarly identified. The scale varies based on the key signature, but the same seven identifiers are used for the placement of notes within their scales. For example, in the key of C, C is the tonic and G is the dominant, and in the key of D, D is the tonic and A is the dominant.

Relative pitches become useful in the context of training a recurrent neural network because they compress 128-dimensional data (every possible pitch as delineated in the MIDI standard) into 11-dimensional data (the chromatic scale) while still preserving intervals between notes. Learning is quicker and simpler for a recurrent neural network when training on lower dimensional data, and as such it is expected that the neural network will converge quicker with lower dimensional data.

In addition to encoding notes as relative pitches, my preprocessing code converts note timing data in the MIDIs to familiar note values (quarter notes, half notes, etc.). My plan is to use the rhythms as well as the pitches to train the recurrent neural network, which grows the aforementioned 11-dimensional data linearly by the number of note values to include. For example, if I only include whole notes, half notes, and quarter notes, this would triple the dimension of the data, since 11 dimensions are needed for each note value (i.e. tonic quarter note, tonic half note, tonic whole note are each separate dimensions). Given the fact that most MIDIs I've come across only use 32nd notes as the smallest note value, this would multiply the 11-dimensional vector by six, and a 66-dimensional vector is still not very large in the context of typical LSTM usage. This vector will grow slightly to add a begin and end song token and two tokens for begin repeat and end repeat respectively, but the majority of the vector will be dedicated to encoding the notes and their rhythms.

In the future, I want to look into using a multi-dimensional recurrent neural network, which works the same as a regular RNN except it can encode two orthogonal dimensions independently. I would encode relative pitches in one dimension as an 11D vector, and rhythms in the other dimension with more than six dimensions, allowing for more complex rhythms like dotted notes and triplets.

In my last post, I explained that the vast majority of the MIDI files in my dataset contained more than one simultaneous part. This complicates learning for a LSTM because it provides several more dimensions to encode, where no two dimensions are necessarily aligned. For example, a grand staff containing a piano piece likely will not have both parts with identical rhythms, so the idea of sequencing becomes complicated. Traditionally, each step of data fed into a recurrent neural network is presumed to be equally divided across time. Training a RNN on ASCII characters works well because each character has equal time value. Rhythms, however, do not have equal time value, so sequencing rhythms in multiple parts become impossible. The solution to this problem, which I have seen in one other place before, is to encode pitches at slices of time at some minimum rhythmic step. For instance, a whole note played in the treble clef may get encoded as eight eighth notes, where an additional token is used to indicate the start of a note (so that eight consecutive eighth notes are encoded differently than a single whole note). I considered using this method, but I felt that this left too much ambiguity for the LSTM to sort through; I felt as though the LSTM would have trouble learning the nuances of more than one part if the only data it had were slices of time. Thus, for the purpose of this project, I decided to eradicate polyphony and convert each MIDI file to a monophonic sequence of notes.

The process of converting polyphony to monophony required a notion of which notes in a song are most important, which is typically referred to as the melody of a piece. Extracting the melody from a polyphonic piece has precedent, and I was able to read several algorithms from the paper "Manipulation of Music For Melody Matching" written by Alexandra Uitdenbogerd and Justin Zobel. In their paper, they tested four different algorithms to extract melodies from songs and tested their effectiveness qualitatively with human subjects. Human subjects listened to a piece of music, and then they listened to each of the four extracted melodies, and they indicated the extracted melody that most closely matched the perceived melody from the piece of music. The results indicate that the simplest algorithm performed the best, which was to construct a melody from the highest pitched note at any given time step.

I implemented this algorithm myself, and the results are promising. While this algorithm does not work in all cases, and while it also adds some extra, odd notes to the melody (which only make sense if you know the original piece), the algorithm serves as a baseline and can be improved in several ways. For the moment, I am using this technique, as it provides a monophonic sequence of notes on which to train a LSTM. Here are a couple of pairs containing the original piece with its extracted melody (the slight tempo differences are unintententional). With the Snow Mountain theme (about a minute in) you hear some of the limitations of the melody extraction and note quantization algorithm I use.

Brandenburg Concerto No. 3 (Polyphonic)

Brandenburg Concerto No. 3 (Melody)

Snow Mountain (Polyphonic)

Snow Mountain (Melody)

The next step in my research is to choose a subset of the 300,000 MIDIs to train (since training on all 300,000 is infeasible and likely will not produce good results) and to remove duplicate songs from this subset. I will likely choose a subset of a specific genre or type, since the LSTM will likely learn more coherent patterns this way.

Expect to hear some LSTM generated melodies in the next post!

Final Update (Music Samples)

Monday, September 24, 2018

More than a month has passed since the last post, which is a testament to the difficulty of training neural networks. However, after a month of experimentation, I am back to report successful results!

My last post mentioned my overall thought process in training the LSTM from MIDIs: pick a dataset of midis to train on, extract all of the melodies from these midis in the form of monophonic note sequences, and train the LSTM on these note sequences. My original plan was to encode each note as an 11 * N dimensional vector, where N is the number of distinct rhythms encoded (half note, whole note, etc). Given that this is my first time training a recurrent neural network, I incorrectly made the assumption that encoding vectors this sparsely would not impact the performance of the models, since LSTMs have been used effectively in a variety of tasks involving high dimensional data. The nature of this high dimensional data, however, is such that each dimension encodes a distinct category. Using a 26 dimensional vector to encode the English alphabet, for example, is a logical application of higher dimensional space, since each dimension encodes a unique category. My proposal, however, is less logical, since for N rhythm types, it encodes N orthogonal categories for each note, which renders a sparse representation of notes that should be encoded into a more compact representation to better leverage the learning capabilities of an LSTM.

As a result of this sparse representation, my initial results were unsuccessful. The generated songs had no coherency, and more often that not they contained the same series of repeated notes - indicating a lack of training. In reality, the neural network was too large for the given dataset, and in addition to this the training time was insufficient. Several other attempts were made with this high dimensional representation on different datasets. I tried reducing the number of dimensions to 67, and I even tried encoding relative gaps between notes rather than relative pitches, and the results were still mediocre. I needed a way of compacting the sparse vector representation while preserving the relationships between notes and their rhythms, and I soon came across the idea of word embeddings from the natural language processing community.

Typically, in natural language processing tasks, one represents words as N dimensional vectors, where N is the number of words in the vocabulary. Since the vocabulary of words in many cases is relatively large, this representation presents a similar problem to mine; the vector is too sparse to prove efficient, especially in the case of machine learning. To mitigate this, word embeddings are used, which are mappings from N dimensional vectors to a lower dimensional vectors (say 300D, for example) that attempt to preserve the relative semantic meanings of words. Word embeddings are created by running algorithms over large bodies of text to find mappings that map words frequently used together closer to one another. For instance, many word embeddings map colors close together in the lower dimensional subspace, since colors are typically interchangable in natural language.

In the case of language modelling via recurrent neural networks, word embeddings need not be developed separately from neural network training; very often, word embeddings are developed as the model trains to produce a mapping that makes the most sense for the specific text used. Using this idea, I attempted to input the 250+ dimensional vectors into a word embedding layer in my neural network that changed over time based on vectors in context. I tried several different mappings, all of which were less than 60D, in the hope that in the ideal case, the embedding layer would learn would map all identically pitched notes together. For instance, encoding the dominant pitches close together would entail mapping a dominant quarter note, a dominant half note, etc. to the same dimension. While I knew that it would be unlikely that such a mapping would emerge naturally, I trained several models with this approach.

The following is the best I was able to squeeze out of embeddings. It is a short segment trained on a relatively small dataset of Bach's pieces:

Bach Dataset with Embeddings (poor)

This sample shows signs of poor learning by the recurrent neural network, and it serves as a testament of the importance of properly encoding input data. Word embeddings did not seem to work as I intended, even given days of training time on small datasets. At this point, I felt that encoding notes sparsely just to compress them via an embedding was the main issue holding back the LSTM from learning. In a perfect world, any recurrent neural network can learn patterns of sequences fairly well, so the fact that mine failed to do so indicated that my vector encoding needed a new approach.

The fundamental flaw in my design was the encoding of rhythm. As mentioned before, the high dimensional vector representation considers each note/rhythm pair as distinct; a tonic whole note is encoded differently than a tonic half note. The dimensionality is further increased when incorporating dotted notes and triplets, and in general the dimensionality increases linearly by each additional rhythmic pattern. In order to generate a more compact representation, I had to remove the rhythmic component completely from the vector, and to do so I took inspiration from research done by Daniel Johnson in which he composed polyphonic music.

In his research, Johnson encoded each note as a begin note vector followed by a series of N vectors, where N is the length of the note in time units. In this approach, each vector (except the begin note vector) is an equal slice of time, and the length of an individual note corresponds directly to the number of consecutive identical note tokens. For instance, if "_" represents the begin note vector and 'A', 'B', 'C' represent three distinctly pitched note vectors, the following would be a possible short song:

 _AAAA_BB_BB_B_CCCCCCCC_BB_BB_AAAA

The only ambiguity in this approach is the length of a single slice of time (a time unit) in musical terms; this is entirely a matter of choice, but for my experiments I chose eighth notes and sixteenth notes, as I'll explain later.

Encoding notes into slices of time provides the advantage that the dimensionality of the vector only varies linearly by the number of pitches to encode, which in turn provides a much lower dimensional representation than before. Another advantage is that encoding slices of time in this way leverages the capabilities of a LSTM recurrent neural network more effectively; the network's sole purpose is to model sequences with long term dependencies, and this allows it to do so in a very natural way.

Given that the dimensionality of the vectors had reduced drastically, I was able to encode more pitches into the vectors while still maintaining a low dimensionality. As such, I modified my code to transpose an extracted melody to the key of C and then wrote a mechanism to compress that melody into two consecutive octaves. Transposition puts all of the melodies into a common key, creating a more uniform dataset that allows the LSTM to learn more effectively. Compressing the melody into two octaves was an arbitrary decision based on the intuition that most musical melodies fall within two octaves, and it rendered the dimensionality of the note vectors proportional to 25 (12 * 2 for two chromatic scales, and 1 for an uppermost tonic). Lastly, I filtered out all pitches that were lower than Q1 - ( 1.5 * IQR ) where Q1 represents the first quartile of pitches, and IQR represents the inter-quartile range of pitches; this essentially reduced cases where a rest in the melody of a piece would be filled with notes from the bass part.

After fully implementing the lower dimensional approach, I had 29 dimensional vectors (1 dimension for start song token, 1 dimension for end song token, 1 dimension for begin note token, and 25 dimensions for the pitches).

The aforementioned time unit was entirely my own choice, and I decided to use 32nd notes as the minimum unit of time for each note; with this scheme, a single whole note requires 32 consecutive note vectors. I began training on a dataset of video game music with this approach, and I noticed a slight improvement from my earlier results, as I noticed certain intervals occurring more frequently (such as major fifths), but still not satisfactory music. Here's a short sample of the type of music generated from that attempt:

Generated from the Video Game Dataset (poor)

There were obviously still problems with my approach. After my first attempt, I tried using sixteenth notes as the time unit on the same dataset, but it produced similarly bad results. My conclusions at the time were that the LSTM architecture was too deep for the dataset, and that the dataset itself was flawed. Part of my suspicion regarding the dataset stemmed from the fact that transposing melodies to the key of C required a knowledge of the original key of the notes at each timestep, and I relied on MIDI metadata for this which is not always accurate. A more effective approach would have been to determine the key of a piece algorithmically, which is an avenue for future work. I also suspected that the dataset itself was too varied for the LSTM architecture to effectively learn patterns; I needed a much larger architecture and a much larger dataset to accommodate the variety of genres in the video game dataset.

These suspicions I held lead me to try training on a much smaller dataset that I created from the Nottingham Database, with over 1000 songs and a size of about 440,000 note vectors. The video game dataset, to contrast, had over 15,000 pieces and a length of 20 million note vectors. The Nottingham Database provided a good source for high quality abc files (which I converted to MIDI), with accurate key signature metadata and a rather uniform genre (all pieces were folk tunes of some sort). Here is an example of the type of music in this dataset:

Example Piece from the Nottingham Dataset

After training on this dataset with a much smaller LSTM using eighth notes as the unit of time, the results were much better than anything that I had trained before. Here are a few of my favorite generated samples:

Generated from the Nottingham Dataset (1)

Generated from the Nottingham Dataset (2)

Generated from the Nottingham Dataset (3)

Generated from the Nottingham Dataset (4)

The fourth recording from Nottingham was generated from a separate LSTM than the first three; while the first three are from an LSTM with a single layer, the fourth comes from an LSTM with two layers with a similar number of trainable parameters as the first three. I experimented briefly with this multi-layer model configuration to test whether it produced better results, but more testing is needed in the future to know for sure.

These results are much more interesting than my earlier attempts! Notably, in these generated samples one hears a sequence of musically coherent notes; certain regular intervals are heavily favored, there are some repeated sections, and in each piece there is a tendency to return to the tonic, which is equivalent to resolving a musical phrase. This seemed quite impressive for the relatively small size of the dataset and the LSTM, so I decided to train several more LSTM networks on different datasets to generate different genres of music.

My next attempt trained an LSTM on a dataset of 145 medieval songs at the sixteenth note level. Despite the dataset's small size, there were some interesting pieces generated. Undoubtedly, the LSTM would have benefited from more data, but it was interesting that it generated what it did.

Generated from the Medieval Dataset (1)

Generated from the Medieval Dataset (2)

My final and favorite attempt involved training a LSTM on the Bach dataset from before using the new vector representation of notes (with each time step equal to one sixteenth note). I trained for 100 epochs, but the model likely would have produced adequate results before 100 epochs. After 100 epochs, many songs generated closely resembled melody lines from Bach pieces! To give an appreciation of the learning process, however, I provide a sample the type of music the LSTM generates before training.

Generated from the Bach Dataset (epoch 0)

Not very musical! These were the best of the melodies generated after 100 epochs:

Generated from the Bach Dataset (1)

Generated from the Bach Dataset (2)

Generated from the Bach Dataset (3)

Generated from the Bach Dataset (4)

Generated from the Bach Dataset (5)

Generated from the Bach Dataset (6)

That's it for this post. If you read all of these posts and found any of this interesting, send me an email at ayachnes@email.wm.edu; I would love to hear any feedback or suggestions for future work. I'd like to thank the college and my research advisor for their making this research possible; without them, I would not have had the opportunity to do this project (a project which has interested me for years).

Analysis & Future Work

I am very pleased with the results from my final experiments, but there are many issues with the generated pieces that merit discussion. First, I should state that the LSTM architectures performed well on the task: given a dataset of sequences, they produced sequences of a similar nature. Thus, the problems with the generated music lie in the datasets, or more specifically, in the production of these datasets.

One of the core assumptions of this project was that a recurrent neural network would be able to generate music given a sequence of notes. The task for the neural network is simple: predict the next note given the last N. This poses several problems, however. While the generated music contains a plausible sequence of notes, it contains very few abstract musical elements; the ideas of a musical phrase, repetition, and sequences completely evade the LSTM networks, which is expected given their relatively small sizes in these experiments. Providing these networks with more tokens such as a barline token (to encode a fixed meter), a begin/end repeat token, and a begin/end sequence token would provide the LSTM with very similar sequences that result in much richer music. Additionally, allowing the networks to predict more than one note at a time would greatly improve their compositions.

My argument, and a good starting point for future research, is that more advanced preprocessing of the music beforehand to insert begin/end repeats, barlines, and sequence tokens will result in richer datasets and better music. Additionally, incorporating an algorithm to assign a piece its most likely key will allow for much larger and robust datasets. For the moment, I am leaving these as open questions. In the future, I plan to pursue these and record my results similarly to this research.

I also want to experiment with different machine learning architectures. One major limitation of this project was the fact that all music was transformed into monophonic sequences. While effective in training recurrent neural networks, this constraint produces rather boring music. Multi-dimensional recurrent neural networks could be used to produce much more interesting music. I want to experiment with adding more layers to the LSTMs, analyzing any differences in the produced music. Lastly, I want to experiment with generative adversarial networks, which produce sequences by using two neural networks simultaneously and have the potential to generate more realistic sequences of notes.

That's all for now. If I continue this research in the future, I will be sure to update this post with the link to the new web page.