Here I will keep a log of progress on my 2018 Monroe summer research project. The first entry in this journal is my abstract for my research project. In short, my project aims to use machine learning to compose coherent and pleasant-sounding music. Update 09/24/18: Here's a sample of the music to come:
Traditional neural networks learn underlying patterns in training data with the goal of classifying unseen data. For example, training a neural network to learn the differences between cats and dogs involves feeding the network labeled dog pictures as well as labeled cat pictures. In general, these neural networks work well for classification, but given training data, they cannot produce data of a similar nature. Recurrent neural networks (RNNs) are a type of neural network that analyzes patterns in sequenced data for the purpose of generating similar-looking sequences of data. For instance, an RNN trained on the complete works of Shakespeare produces text with a similar syntax. Although RNNs are effective tools for learning short patterned sequences, such as English words, they are less effective at learning longer-term underlying patterns. Long short-term memory recurrent neural networks (LSTM RNNs) seek to enhance RNNs by learning longer-term patterns, with applications in areas such as speech synthesis and language modeling. For my summer research project, I will train an LSTM RNN on a large dataset of MIDI files (files containing musical note data) to test whether the network is capable of composing high-quality music. More specifically, I will test whether the network produces music comparable to human-made compositions, and in the process determine the musical elements that contribute to the best-sounding music. First, the model will learn genres individually, and afterward it will learn music indiscriminately. At first, I will also emphasize only musical rhythm and pitch. Once the model performs well on note sequences, I will extend it to account for more subtle elements, such as sequences of sequences of notes (motifs), dynamics, and chord progressions. Expect to hear some music in the future!
This week I began my work creating a pipeline to train a machine learning model to compose music. So far, I have read several papers where researchers used recurrent neural networks to compose music, some of which contain deficiencies. Some papers used plain RNNs, which have weaker long-term memory capabilities than LSTMs. Other papers used datasets of fewer than 100 pieces of music. Some of the older papers (early 2000s) limited the size of the neural networks in training due to the hardware limitations of the time. At the same time, I found some projects online that provide some promising results. This web page provides several music samples from an RNN trained on classical music (still a fairly small dataset). The approach explained on the website is different from the approach I am taking, as it uses a multi-dimensional RNN to establish multi-dimensional musical relationships like chords. At the moment, I am writing code to process the ~300,000 MIDI files I collected from the Internet (and remove any duplicates). It is not feasible to train any model on raw MIDI data, so I am making a tool to convert MIDI files into an intermediate format, which will later be converted into raw vectors to be fed into the LSTM. Music with multiple parts is much more complicated to preprocess into a one-dimensional vector format (hence the use of a multi-dimensional RNN on the web page above), so I am focusing on preprocessing only single-part music. Since a large percentage of the MIDI data contains multi-part music, I am trying to extract only the melody from each MIDI file, which is not a straightforward process. There is a precedent for extracting melodies from multi-part music, including using MIDI metadata and heuristics on note data, but there are obvious edge cases in which there is no single melody, or where a melody transitions between parts. I have some liberty in producing the intermediate format.
Namely, I decide how the information from the MIDIs is encoded. One approach is to encode a sequence of notes with their corresponding pitches and rhythms, though I can add other information by finding structure in the music. For example, if the same seven notes play twice in a row, I can encode this series of notes as seven notes with a repeat sign at the end. The same can be done for musical sequences (a repeated series of notes transposed by some constant interval). The hope is that by encoding inherent musical structures in the MIDI data, the LSTM will be able to generalize more effectively; it may determine, for instance, that repeat signs usually occur towards the end of a song, and usually span only a few measures. In encoding pitches, rather than encoding the absolute pitch (i.e. C4), I am encoding the relative pitch within the key (the tonic, dominant, subdominant, etc.). The model will learn which intervals typically precede and succeed other intervals, and as a result will not rely too heavily on arbitrary pitches. One difficulty in training deep learning models is overfitting the data rather than generalizing. I will play with the amount of data I feed into the model and see how it responds, and I may end up writing code to determine whether generated melodies are plagiarized. That's it for this post. I will post again when I have more progress.
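As an addendum to this post, here is a rough sketch of the repeat-sign idea: collapsing an immediately repeated phrase into one copy plus a repeat token. The ':|' token, the greedy longest-first search, and the minimum phrase length are all illustrative choices, not my final encoding.

```python
def compress_repeats(notes, min_len=4):
    """Replace an immediately repeated phrase with one copy followed by
    a ':|' repeat token. Greedy single-pass sketch: at each position,
    try the longest candidate phrase first."""
    out = []
    i = 0
    while i < len(notes):
        replaced = False
        for k in range((len(notes) - i) // 2, min_len - 1, -1):
            # A phrase of length k repeats if the next k notes match it.
            if notes[i:i + k] == notes[i + k:i + 2 * k]:
                out.extend(notes[i:i + k])
                out.append(":|")
                i += 2 * k
                replaced = True
                break
        if not replaced:
            out.append(notes[i])
            i += 1
    return out

melody = ["C", "D", "E", "G", "C", "D", "E", "G", "A"]
print(compress_repeats(melody))  # ['C', 'D', 'E', 'G', ':|', 'A']
```

A real encoder would also need to handle transposed sequences and overlapping repeats, which this sketch ignores.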
Since the last post, I have made significant progress on my code to preprocess MIDI files into an intermediate format. I've written an algorithm to filter out percussive notes from MIDIs and to place each note in its corresponding key, based on metadata located within the MIDI file. With this information, I came up with a robust way to read in only the pitched notes of a MIDI file (i.e. non-percussive notes) while simultaneously giving each note its relative pitch (tonic, supertonic, etc.). The relative pitch can be thought of as the relative placement of the note within its scale; the tonic is the first degree of the scale, the supertonic is the second degree of the scale, and every other degree of the scale is similarly identified. The scale varies based on the key signature, but the same seven identifiers are used for the placement of notes within their scales. For example, in the key of C, C is the tonic and G is the dominant, and in the key of D, D is the tonic and A is the dominant. Relative pitches become useful in the context of training a recurrent neural network because they compress 128-dimensional data (every possible pitch as delineated in the MIDI standard) into 11-dimensional data (the chromatic scale) while still preserving the intervals between notes. Learning is quicker and simpler for a recurrent neural network on lower-dimensional data, so the network is expected to converge faster. In addition to encoding notes as relative pitches, my preprocessing code converts note timing data in the MIDIs to familiar note values (quarter notes, half notes, etc.). My plan is to use the rhythms as well as the pitches to train the recurrent neural network, which grows the aforementioned 11-dimensional data linearly with the number of note values included.
For example, if I only include whole notes, half notes, and quarter notes, this would triple the dimension of the data, since 11 dimensions are needed for each note value (i.e. tonic quarter note, tonic half note, and tonic whole note are each separate dimensions). Given that the smallest note value in most MIDIs I've come across is the 32nd note, this would multiply the 11-dimensional vector by six, and a 66-dimensional vector is still not very large in the context of typical LSTM usage. This vector will grow slightly to add begin-song and end-song tokens and two tokens for begin repeat and end repeat respectively, but the majority of the vector will be dedicated to encoding the notes and their rhythms. In the future, I want to look into using a multi-dimensional recurrent neural network, which works the same as a regular RNN except that it can encode two orthogonal dimensions independently. I would encode relative pitches in one dimension as an 11D vector, and rhythms in the other dimension with more than six dimensions, allowing for more complex rhythms like dotted notes and triplets. In my last post, I explained that the vast majority of the MIDI files in my dataset contain more than one simultaneous part. This complicates learning for an LSTM because it provides several more dimensions to encode, where no two dimensions are necessarily aligned. For example, a grand staff containing a piano piece likely will not have both parts with identical rhythms, so the idea of sequencing becomes complicated. Traditionally, each step of data fed into a recurrent neural network is presumed to be equally divided across time. Training an RNN on ASCII characters works well because each character has equal time value. Rhythms, however, do not have equal time value, so sequencing rhythms in multiple parts becomes impossible. The solution to this problem, which I have seen in one other place before, is to encode pitches at slices of time at some minimum rhythmic step.
For instance, a whole note played in the treble clef may get encoded as eight eighth notes, where an additional token is used to indicate the start of a note (so that eight consecutive eighth notes are encoded differently than a single whole note). I considered using this method, but I felt that it left too much ambiguity for the LSTM to sort through; I felt as though the LSTM would have trouble learning the nuances of more than one part if the only data it had were slices of time. Thus, for the purpose of this project, I decided to eradicate polyphony and convert each MIDI file to a monophonic sequence of notes. The process of converting polyphony to monophony requires a notion of which notes in a song are most important, which is typically referred to as the melody of a piece. Extracting the melody from a polyphonic piece has precedent, and I was able to draw on several algorithms from the paper "Manipulation of Music For Melody Matching" by Alexandra Uitdenbogerd and Justin Zobel. In their paper, they tested four different algorithms for extracting melodies from songs and evaluated their effectiveness qualitatively with human subjects. Subjects listened to a piece of music, then to each of the four extracted melodies, and indicated which extraction most closely matched the melody they perceived in the piece. The results indicate that the simplest algorithm, constructing the melody from the highest-pitched note sounding at any given time, performed the best. I implemented this algorithm myself, and the results are promising. While this algorithm does not work in all cases, and while it also adds some extra, odd notes to the melody (which only make sense if you know the original piece), it serves as a baseline and can be improved in several ways. For the moment, I am using this technique, as it provides a monophonic sequence of notes on which to train an LSTM.
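The winning algorithm is simple enough to sketch in a few lines. This is my own minimal rendering, using a simplified (onset, pitch) note representation that ignores durations:

```python
def extract_melody(notes):
    """'Skyline' melody extraction: at each onset time, keep only the
    highest pitch sounding. `notes` is a list of (onset, pitch) pairs,
    with pitches as MIDI note numbers."""
    highest = {}
    for onset, pitch in notes:
        if onset not in highest or pitch > highest[onset]:
            highest[onset] = pitch
    # Return the surviving pitches in time order.
    return [highest[t] for t in sorted(highest)]

# A C-major chord followed by a single note: only the chord's top note survives.
notes = [(0, 60), (0, 64), (0, 67), (1, 62)]
print(extract_melody(notes))  # [67, 62]
```

The real preprocessing operates on MIDI note events rather than bare pairs, but the core idea is the same: at each time step, the highest pitch wins.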
Here are a couple of pairs, each containing an original piece and its extracted melody (the slight tempo differences are unintentional). In the Snow Mountain theme (about a minute in), you can hear some of the limitations of the melody extraction and note quantization algorithms I use.
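For concreteness, the relative-pitch conversion and the vector dimension math from this post can be sketched as follows. This is a simplified sketch restricted to the seven diatonic degrees of a major key (the encoding described above uses 11 pitch categories and handles key metadata); all names are hypothetical.

```python
# Diatonic scale degrees of a major key, as semitone offsets from the tonic.
DEGREE_NAMES = {
    0: "tonic", 2: "supertonic", 4: "mediant", 5: "subdominant",
    7: "dominant", 9: "submediant", 11: "leading tone",
}

def relative_pitch(note, tonic):
    """Map a MIDI note number (0-127) to its scale degree, given the
    tonic's MIDI note number. Chromatic notes outside the major scale
    return None in this simplified sketch."""
    return DEGREE_NAMES.get((note - tonic) % 12)

# One-hot layout: one dimension per (degree, rhythm) pair, plus a few
# special tokens, so the vector grows linearly with the rhythm count.
DEGREES = list(DEGREE_NAMES.values())
RHYTHMS = ["whole", "half", "quarter", "eighth", "sixteenth", "thirty-second"]
SPECIALS = ["<start>", "<end>", "|:", ":|"]
DIM = len(DEGREES) * len(RHYTHMS) + len(SPECIALS)

def one_hot(degree, rhythm):
    """Encode a (degree, rhythm) pair as a one-hot vector."""
    vec = [0] * DIM
    vec[DEGREES.index(degree) * len(RHYTHMS) + RHYTHMS.index(rhythm)] = 1
    return vec

# In C major (tonic C4 = MIDI 60), G4 (67) is the dominant; in D major
# (tonic 62), A4 (69) plays the same role.
print(relative_pitch(67, 60), relative_pitch(69, 62))  # dominant dominant
print(DIM, one_hot("tonic", "quarter").index(1))       # 46 2
```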
More than a month has passed since the last post, which is a testament to the difficulty of training neural networks. However, after a month of experimentation, I am back to report successful results! My last post outlined my overall process for training the LSTM from MIDIs: pick a dataset of MIDIs to train on, extract all of the melodies from these MIDIs in the form of monophonic note sequences, and train the LSTM on these note sequences. My original plan was to encode each note as an 11 * N dimensional vector, where N is the number of distinct rhythms encoded (half note, whole note, etc.). Given that this is my first time training a recurrent neural network, I incorrectly assumed that encoding vectors this sparsely would not impact the performance of the models, since LSTMs have been used effectively in a variety of tasks involving high-dimensional data. The nature of this high-dimensional data, however, is such that each dimension encodes a distinct category. Using a 26-dimensional vector to encode the English alphabet, for example, is a logical application of higher dimensional space, since each dimension encodes a unique category. My proposal, however, is less logical, since for N rhythm types it encodes N orthogonal categories for each pitch, yielding a sparse representation of notes that would be better served by a more compact encoding to leverage the learning capabilities of an LSTM. As a result of this sparse representation, my initial results were unsuccessful. The generated songs had no coherency, and more often than not they contained the same series of repeated notes, indicating a lack of training. In reality, the neural network was too large for the given dataset, and the training time was insufficient as well. Several other attempts were made with this high-dimensional representation on different datasets.
I tried reducing the number of dimensions to 67, and I even tried encoding relative gaps between notes rather than relative pitches, but the results were still mediocre. I needed a way of compacting the sparse vector representation while preserving the relationships between notes and their rhythms, and I soon came across the idea of word embeddings from the natural language processing community. Typically, in natural language processing tasks, one represents words as N-dimensional vectors, where N is the number of words in the vocabulary. Since the vocabulary in many cases is relatively large, this representation presents a problem similar to mine: the vectors are too sparse to be efficient, especially in the case of machine learning. To mitigate this, word embeddings are used, which are mappings from N-dimensional vectors to lower-dimensional vectors (say, 300D) that attempt to preserve the relative semantic meanings of words. Word embeddings are created by running algorithms over large bodies of text to find mappings that place words frequently used together closer to one another. For instance, many word embeddings map colors close together in the lower-dimensional subspace, since colors are typically interchangeable in natural language. In the case of language modelling via recurrent neural networks, word embeddings need not be developed separately from neural network training; very often, word embeddings are learned as the model trains, producing a mapping that makes the most sense for the specific text used. Using this idea, I fed the 250+ dimensional vectors into a word embedding layer in my neural network that changed over time based on the vectors in context. I tried several different mappings, all of which were less than 60D, in the hope that, in the ideal case, the embedding layer would learn to map all identically pitched notes close together.
For instance, encoding the dominant pitches close together would entail mapping a dominant quarter note, a dominant half note, etc. to nearby points in the embedded space. While I knew that it would be unlikely for such a mapping to emerge naturally, I trained several models with this approach. The following is the best I was able to squeeze out of embeddings. It is a short segment trained on a relatively small dataset of Bach's pieces:
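For readers unfamiliar with embedding layers, the mechanics are just a learned lookup table that maps sparse token ids to dense vectors. Here is a minimal numpy sketch; the sizes echo the 250+ dimensional vocabulary and sub-60D targets mentioned above, and the random table stands in for one learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 250, 60  # sparse one-hot size -> dense size
# The embedding is just a lookup table: row i is the dense vector for
# token i. During training, gradients update these rows so tokens used
# in similar contexts drift toward one another.
embedding = rng.normal(size=(vocab_size, embed_dim))

def embed(token_ids):
    """Map a sequence of integer token ids to their dense vectors."""
    return embedding[token_ids]

sequence = [3, 17, 3]  # e.g. note tokens; repeated notes share a row
dense = embed(sequence)
print(dense.shape)                         # (3, 60)
print(np.array_equal(dense[0], dense[2]))  # True
```

In a real model, this table is a trainable layer (e.g. Keras's `Embedding`) whose rows are updated by backpropagation along with the rest of the network.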
I am very pleased with the results from my final experiments, but there are many issues with the generated pieces that merit discussion. First, I should state that the LSTM architectures performed well on the task: given a dataset of sequences, they produced sequences of a similar nature. Thus, the problems with the generated music lie in the datasets, or more specifically, in the production of these datasets. One of the core assumptions of this project was that a recurrent neural network would be able to generate music given a sequence of notes. The task for the neural network is simple: predict the next note given the last N. This poses several problems, however. While the generated music contains a plausible sequence of notes, it contains very few abstract musical elements; the ideas of a musical phrase, repetition, and sequences completely evade the LSTM networks, which is expected given their relatively small sizes in these experiments. Providing these networks with more tokens, such as a barline token (to encode a fixed meter), begin/end repeat tokens, and begin/end sequence tokens, would expose the LSTM to recurring structural patterns and result in much richer music. Additionally, allowing the networks to predict more than one note at a time would greatly improve their compositions. My argument, and a good starting point for future research, is that more advanced preprocessing of the music beforehand, inserting begin/end repeats, barlines, and sequence tokens, will result in richer datasets and better music. Additionally, incorporating an algorithm to assign each piece its most likely key would allow for much larger and more robust datasets. For the moment, I am leaving these as open questions. In the future, I plan to pursue them and record my results as I have here. I also want to experiment with different machine learning architectures. One major limitation of this project was the fact that all music was transformed into monophonic sequences.
While this constraint made training recurrent neural networks tractable, it produces rather boring music. Multi-dimensional recurrent neural networks could be used to produce much more interesting music. I also want to experiment with adding more layers to the LSTMs and analyze any differences in the produced music. Lastly, I want to experiment with generative adversarial networks, which train two neural networks against each other (a generator and a discriminator) and have the potential to generate more realistic sequences of notes. That's all for now. If I continue this research in the future, I will be sure to update this post with a link to the new web page.
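Finally, for anyone picking up this research, the training task at the heart of these experiments (predict the next note given the last N) amounts to building sliding-window pairs over each melody. The function name and window size below are illustrative:

```python
def make_training_pairs(sequence, window=4):
    """Turn one melody into (context, target) pairs for next-note
    prediction: the model sees the last `window` notes and must
    predict the note that follows."""
    pairs = []
    for i in range(len(sequence) - window):
        pairs.append((sequence[i:i + window], sequence[i + window]))
    return pairs

melody = ["C", "D", "E", "F", "G", "A"]
for context, target in make_training_pairs(melody):
    print(context, "->", target)
# ['C', 'D', 'E', 'F'] -> G
# ['D', 'E', 'F', 'G'] -> A
```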