#3827 MIDI File Format and Layout

Programming with MIDI in Python | Getting started and sending MIDI Messages

https://www.youtube.com/watch?v=JYslZkc90GI

Code a Wavetable Synth with Python in 6 Minutes Tutorial [Synth #002]

https://www.youtube.com/watch?v=zBFeT8fkjfI

Genetic Algorithm in Python generates Music (code included)

https://www.youtube.com/watch?v=aOsET8KapQQ

Drop the DAW – Sound Design in Python - Isaac Roberts - ADC20

https://www.youtube.com/watch?v=Q40qEg8Yq5c

#3827 What is the file format and layout of MIDI files?

A Standard MIDI File (SMF), often with a .mid extension, is a binary file format designed to store musical performance data. Unlike audio files (like MP3 or WAV) that store recorded sound, a MIDI file is essentially a set of digital instructions—like sheet music for a computer—that tells a synthesizer or other MIDI-compatible device what notes to play, when to play them, on what instrument, and with what dynamics.

The file layout is based on a simple, flexible structure called chunks.

🎹 The Chunk-Based Structure

A MIDI file is just a series of "chunks" concatenated together. Every chunk follows the same basic layout:

  1. Chunk Type (4 bytes): An ASCII identifier that says what kind of chunk this is.

  2. Data Length (4 bytes): A 32-bit integer (big-endian) that specifies the exact size (in bytes) of the data that follows.

  3. Chunk Data (variable size): The actual data for the chunk, with a length matching the Data Length field.

A MIDI file must begin with a Header Chunk and is then followed by one or more Track Chunks.
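
To make the chunk layout concrete, here is a minimal Python sketch (standard library only) that walks a .mid file chunk by chunk; the file name song.mid in the usage comment is just a placeholder.

```python
import struct

def read_chunks(path):
    """Yield (chunk_type, data) pairs from a Standard MIDI File."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)                    # 4-byte ASCII type + 4-byte big-endian length
            if len(header) < 8:
                break                             # end of file
            chunk_type, length = struct.unpack(">4sI", header)
            data = f.read(length)                 # exactly 'length' bytes of chunk data
            yield chunk_type.decode("ascii"), data

# Hypothetical usage: list every chunk in a file
# for ctype, data in read_chunks("song.mid"):
#     print(ctype, len(data), "bytes")
```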


### 1. The Header Chunk (MThd)

Every MIDI file starts with exactly one Header Chunk. It provides the global information for the entire file.

  • Chunk Type: MThd (ASCII for "Header")

  • Data Length: Always 6 (bytes)

  • Chunk Data (6 bytes): This data is broken into three 16-bit (2-byte) fields:

    • Format (2 bytes): Specifies the overall file type.

      • Format 0: The file contains a single track that merges all MIDI channels.

      • Format 1: The file contains one or more simultaneous tracks (e.g., one track for piano, one for drums) that are played back together. This is the most common format.

      • Format 2: The file contains one or more independent, sequential tracks. This format is rare.

    • Number of Tracks (2 bytes): The total count of Track Chunks (MTrk) that follow this header.

    • Division (2 bytes): This is crucial as it defines the file's musical timing. It has two possible formats:

      • Ticks per Quarter Note: If the most significant bit is 0, the remaining 15 bits give the number of "ticks" (the smallest time unit) per quarter note. Common values are 96, 120, and 480.

      • SMPTE Time: If the most significant bit is 1, timing is defined in SMPTE terms: the upper byte holds a (negative) frames-per-second code and the lower byte holds ticks per frame, used for syncing to video. Both cases are decoded in the sketch below.
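
Here is a minimal sketch of decoding the MThd payload, assuming the 6 data bytes have already been read (for example with the read_chunks helper sketched earlier):

```python
import struct

def parse_header(data):
    """Decode the 6-byte MThd payload: format, track count, and division."""
    fmt, ntracks, division = struct.unpack(">HHH", data)     # three big-endian 16-bit fields
    if division & 0x8000:                        # top bit set: SMPTE timing
        frames_per_second = 256 - (division >> 8)   # upper byte is a negative two's-complement value
        ticks_per_frame = division & 0xFF
        timing = ("smpte", frames_per_second, ticks_per_frame)
    else:                                        # top bit clear: metrical timing
        timing = ("ticks_per_quarter_note", division)
    return fmt, ntracks, timing
```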


### 2. The Track Chunk (MTrk)

After the header, one or more Track Chunks contain the actual musical performance data.

  • Chunk Type: MTrk (ASCII for "Track")

  • Data Length: A 4-byte integer specifying the length of the track data.

  • Chunk Data (variable size): This is a continuous stream of MIDI Events, where each event is preceded by a delta-time.

Delta-Time: The "When"

A track chunk doesn't store the absolute time for an event. Instead, it stores the delta-time, which is the number of "ticks" (as defined in the header's Division field) to wait since the previous event before executing the current one.

  • If two notes happen at the same time, the first event will have a delta-time (e.g., 120 ticks), and the second event will have a delta-time of 0.

  • This delta-time is stored in a special, compact format called a variable-length quantity (VLQ), which uses as few bytes as possible to store the number.
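
The VLQ encoding is easy to decode in a few lines; here is a minimal sketch:

```python
def read_vlq(data, pos):
    """Decode one variable-length quantity starting at data[pos].

    Each byte contributes its low 7 bits; a set high bit means another
    byte follows. Returns (value, position_after_the_quantity).
    """
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                  # high bit clear: this was the last byte
            return value, pos

# Example: 0x81 0x48 decodes to (1 << 7) | 0x48 = 200 ticks
# assert read_vlq(bytes([0x81, 0x48]), 0) == (200, 2)
```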

MIDI Events: The "What"

After the delta-time, the file contains one or more event bytes. These are the instructions for the synthesizer.

Example of a Track Data Stream: [delta-time] [event] [delta-time] [event] [delta-time] [event] ...

There are three main categories of events:

  1. MIDI Channel Events: These are the most common and correspond to musical actions. They are "channel-specific," meaning they apply to one of the 16 MIDI channels (which are typically each assigned a different instrument).

    • Note On (Command 0x9n): Starts playing a note. Includes the note's pitch (0-127) and velocity (volume, 0-127).

    • Note Off (Command 0x8n): Stops playing a note. Includes the note's pitch and release velocity. (A "Note On" event with velocity 0 is often used as a "Note Off").

    • Program Change (Command 0xCn): Selects an instrument (e.g., 0 for Acoustic Grand Piano, 24 for Acoustic Guitar) for that channel.

    • Control Change (Command 0xBn): Adjusts parameters like sustain pedal, modulation, or volume for that channel.

    (The n in the command hex code represents the 4-bit channel number, from 0-15; a byte-level sketch of these events follows this list).

  2. Meta-Events (Command 0xFF): These are non-musical events that provide information about the track. They are not sent to the synthesizer.

    • Set Tempo (0x51): Defines the tempo in microseconds per quarter note.

    • Time Signature (0x58): Sets the time signature (e.g., 4/4, 3/4).

    • Track Name (0x03): Gives a name to the track (e.g., "Piano" or "Drums").

    • End of Track (0x2F): A mandatory event that must be the very last event in every MTrk chunk.

  3. System Exclusive (SysEx) Events (Command 0xF0 or 0xF7): These are manufacturer-specific messages, often used to send special commands to a particular brand or model of synthesizer.
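
As a byte-level illustration of category 1, here is a small sketch that packs and unpacks channel-event status bytes (the helper names are just for illustration):

```python
NOTE_OFF, NOTE_ON, CONTROL_CHANGE, PROGRAM_CHANGE = 0x8, 0x9, 0xB, 0xC

def note_on(channel, pitch, velocity):
    """Build the three raw bytes of a Note On event: status, pitch, velocity."""
    return bytes([(NOTE_ON << 4) | (channel & 0x0F), pitch & 0x7F, velocity & 0x7F])

def split_status(status_byte):
    """Split a status byte into (command, channel), e.g. 0x93 -> (0x9, 3)."""
    return status_byte >> 4, status_byte & 0x0F

# note_on(0, 60, 100) -> b'\x90<d'  (Note On, channel 0, middle C, velocity 100)
# split_status(0xC5)  -> (0xC, 5)   (Program Change on channel 5)
```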

This chunk-based system makes the MIDI format highly flexible, allowing different software to parse the data it understands (like notes and tempo) while safely ignoring chunks or meta-events it doesn't recognize.


This video provides a great visual, in-depth explanation of how MIDI data is structured into files.


How can generative AI be used to generate MIDI files?

Generative AI creates new MIDI files by learning the statistical patterns and structures from a massive database of existing MIDI music. It's a process of "learning by example," much like how a language model learns to write text by reading the internet.

The entire process can be broken down into three main stages:

1. Data Preparation: Turning MIDI into a "Language"

Before an AI can learn from music, the MIDI files must be translated into a simple, sequential format it can understand. A MIDI file, with its simultaneous notes, timing, and instruments, is too complex for an AI to read directly.

  • Tokenization: The most common method is to convert each MIDI file into a single, long sequence of "events" or "tokens." This process, often called an event-based representation, turns the music into something like a sentence (a small sketch follows this list).

  • Example Events: A musical phrase is broken down into a vocabulary of events like:

    • INSTRUMENT=PIANO

    • NOTE_ON=60 (Middle C)

    • VELOCITY=90 (How hard the note is played)

    • TIME_SHIFT=0.5s (Wait for half a second)

    • NOTE_OFF=60 (Release Middle C)

  • Datasets: This "tokenization" is performed on massive databases, with the Lakh MIDI Dataset (containing over 170,000 files) being a popular choice.
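
As a rough sketch of the tokenization idea, assuming the third-party mido library (pip install mido); the token names mirror the examples above, but the scheme is deliberately simplified:

```python
import mido  # third-party library, assumed installed: pip install mido

def tokenize(path):
    """Flatten a MIDI file into a single sequence of event tokens (simplified scheme)."""
    tokens = []
    for msg in mido.MidiFile(path):                 # iteration yields messages with .time in seconds
        if msg.time > 0:
            tokens.append(f"TIME_SHIFT={msg.time:.3f}s")
        if msg.type == "note_on" and msg.velocity > 0:
            tokens.append(f"VELOCITY={msg.velocity}")
            tokens.append(f"NOTE_ON={msg.note}")
        elif msg.type in ("note_off", "note_on"):   # note_on with velocity 0 acts as note_off
            tokens.append(f"NOTE_OFF={msg.note}")
    return tokens
```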

2. Model Training: Learning the "Grammar" of Music

Once the music is in a sequential, text-like format, AI models are trained to "read" it. The training objective is simple: predict the next event in the sequence.

Given the sequence NOTE_ON=60, TIME_SHIFT=0.5s, the model's job is to predict that NOTE_ON=64 (the note E) is a highly probable next event, while INSTRUMENT=TRUMPET is very unlikely.

Different AI architectures are used for this:

  • Recurrent Neural Networks (RNNs/LSTMs): These were early models (like Google's MelodyRNN) that are good at learning short-term sequential patterns, making them effective for generating simple melodies.

  • Transformers: This is the same architecture that powers large language models like ChatGPT. Models like Google's Music Transformer or OpenAI's MuseNet are far better at understanding long-range structure. This allows them to generate music with coherent melodies, harmonies, and repeating themes over several minutes, rather than just note-to-note phrases.

  • Generative Adversarial Networks (GANs): This approach (e.g., MidiNet) uses two models. A Generator creates new MIDI sequences, and a Discriminator acts as a "referee," trying to distinguish the AI's fake music from real music from the training data. The Generator's goal is to get good enough to "fool" the Discriminator, which pushes it to create highly realistic compositions.
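
The architectures above differ enormously in capacity, but the objective is the same. A toy count-based (Markov) model, which is not a neural network, makes "predict the next event" concrete:

```python
import random
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count how often each token follows each other token in the training data."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Sample the next token in proportion to how often it followed 'prev' in training."""
    tokens, weights = zip(*counts[prev].items())
    return random.choices(tokens, weights=weights)[0]

# Hypothetical usage, with token sequences produced by a tokenizer:
# model = train_bigram(list_of_token_sequences)
# predict_next(model, "NOTE_ON=60")    # -> perhaps "NOTE_ON=64"
```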

3. Generation: "Writing" a New MIDI File

After training, the AI model is a powerful statistical "composer." To generate a new MIDI file:

  1. Provide a Prompt: You give the model a "prompt," which can be a few starting notes, a chord progression, or even a text description (e.g., "sad piano melody in C minor").

  2. Predict and Repeat: The model takes the prompt and predicts the single most likely next musical event (e.g., NOTE_ON=72).

  3. Autoregressive Generation: This new event is added to the sequence, which is then fed back into the model to predict the next event. This process is repeated one token at a time:

    • [Prompt] -> predicts Event A

    • [Prompt, Event A] -> predicts Event B

    • [Prompt, Event A, Event B] -> predicts Event C ...and so on, for thousands of events.

  4. Convert Back to MIDI: This final, long sequence of text-like tokens is post-processed and converted back into the standard .mid file format, which can then be played in any Digital Audio Workstation (DAW) or music player.
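
A minimal sketch of steps 2-4, reusing the simplified token scheme from earlier and the third-party mido library for the final conversion; the seconds-to-ticks conversion assumes MIDI's default tempo of 120 BPM:

```python
import mido  # third-party library, assumed installed

def generate(predict_next_fn, prompt, n_events=500):
    """Autoregressive loop: repeatedly append the model's predicted next token."""
    sequence = list(prompt)
    for _ in range(n_events):
        sequence.append(predict_next_fn(sequence))   # any function returning the next token
    return sequence

def tokens_to_midi(tokens, path, ticks_per_beat=480):
    """Convert NOTE_ON / NOTE_OFF / TIME_SHIFT tokens back into a .mid file."""
    mid = mido.MidiFile(ticks_per_beat=ticks_per_beat)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    delta = 0
    for token in tokens:
        name, value = token.split("=")
        if name == "TIME_SHIFT":
            beats = float(value.rstrip("s")) / 0.5   # default tempo 120 BPM: 0.5 s per quarter note
            delta += int(beats * ticks_per_beat)
        elif name == "NOTE_ON":
            track.append(mido.Message("note_on", note=int(value), velocity=90, time=delta))
            delta = 0
        elif name == "NOTE_OFF":
            track.append(mido.Message("note_off", note=int(value), velocity=0, time=delta))
            delta = 0
    mid.save(path)

# Hypothetical usage, with the bigram model from the previous sketch:
# seq = generate(lambda s: predict_next(model, s[-1]), ["NOTE_ON=60"], n_events=200)
# tokens_to_midi(seq, "generated.mid")
```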


Why would someone apply genetic algorithms to generate music?

Someone would apply genetic algorithms (GAs) to music generation primarily for creative exploration and collaboration, rather than just imitation.

Unlike Transformer models (which predict the next note based on what they've seen before), a genetic algorithm is an optimization process. It "evolves" music toward a specific goal, which makes it a powerful tool for discovering novel musical ideas that you might not find otherwise.

Here’s a breakdown of why this approach is so appealing.

🧬 How It Works (The Core Concept)

A genetic algorithm mimics the process of natural selection:

  1. Population: The AI generates an initial "population" of random, short musical ideas (melodies, rhythms, etc.).

  2. Fitness: This is the most important part. Each musical idea is given a "fitness score" to determine how "good" it is.

  3. Selection: The "fittest" musical ideas (the ones that sound best) are "selected" to "reproduce."

  4. Crossover & Mutation: The selected ideas are combined ("crossover") to create new "offspring" melodies, which share characteristics of their parents. A few random changes ("mutation") are thrown in to introduce novelty.

  5. Repeat: This process repeats for many "generations," with the music progressively "evolving" to become better and better.
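
Here is a tiny, self-contained Python sketch of that loop, with a deliberately toy fitness function (matching a fixed target phrase) standing in for a human rating or a music-theory critic:

```python
import random

SCALE = [60, 62, 64, 65, 67, 69, 71, 72]        # C major scale as MIDI note numbers
TARGET = [60, 64, 67, 72, 67, 64, 60, 64]       # toy goal: a broken-chord phrase

def fitness(melody):
    """Toy fitness: how many notes already match the target phrase."""
    return sum(a == b for a, b in zip(melody, TARGET))

def crossover(a, b):
    cut = random.randrange(1, len(a))            # splice two parents at a random point
    return a[:cut] + b[cut:]

def mutate(melody, rate=0.1):
    return [random.choice(SCALE) if random.random() < rate else note for note in melody]

def evolve(generations=200, pop_size=50):
    population = [[random.choice(SCALE) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]    # selection: keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

# evolve() -> a list of MIDI pitches that converges toward TARGET over the generations
```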

## 🎵 Why Use This Method?

There are two main reasons a composer or researcher would choose a GA, and both center on the fitness function.

1. The "Human-in-the-Loop" (Interactive Evolution)

This is the most common and powerful application. The human user is the fitness function.

  • How it works: The AI presents you with 10 short melodies. You listen and give them a rating (e.g., "I like #3 and #8").

  • Why it's great: The algorithm takes your preferences, "breeds" those melodies together, and presents you with 10 new melodies that are "evolved" based on your taste.

  • The benefit: It turns music generation into a collaborative process. You're not just giving a prompt and getting a final song; you are actively guiding the AI's creative search. It's a tool for augmenting your own creativity, helping you explore musical possibilities and escape creative ruts.

2. The Automated Critic (Computational Fitness)

In this approach, you define the "fitness" in code. This allows you to generate music that is "optimally" good according to a set of rules.

  • How it works: You create a fitness function that programmatically scores the music. This function could be:

    • Rule-Based: "Award points" for following music theory (e.g., staying in key, using common chord progressions) and "penalize" for dissonance or clashing rhythms, as in the sketch after this list.

    • AI-Based: A separate neural network (like an RNN) is trained on a dataset of "good" music (e.g., Bach chorales). This network then acts as the "critic," giving a high score to new melodies that "feel" like Bach.

  • The benefit: This allows you to automatically generate a large volume of high-quality, "correct" music that fits a specific style or set of constraints without you having to listen to every single generation.
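
A minimal sketch of the rule-based option, scoring a melody (as a list of MIDI pitches) with two toy music-theory rules; a real system would use many more rules or a trained critic:

```python
C_MAJOR_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}   # pitch classes of the C major scale

def rule_based_fitness(melody):
    """Score a melody with simple, hand-written music-theory rules."""
    score = 0
    for pitch in melody:
        if pitch % 12 in C_MAJOR_PITCH_CLASSES:
            score += 1                            # reward notes that stay in key
    for a, b in zip(melody, melody[1:]):
        if abs(a - b) > 12:
            score -= 2                            # penalize leaps larger than an octave
    return score

# rule_based_fitness([60, 64, 67, 61]) -> 3 (three in-key notes, one out-of-key, no large leaps)
```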


## Genetic Algorithms vs. Other AI Models

Here’s a simple comparison:

| Method | Approach | Best For... |
| --- | --- | --- |
| Genetic Algorithms (GAs) | Evolving & Optimizing | Creative collaboration, discovering novel ideas, and refining music toward a specific user's taste. |
| Transformers (LLMs) | Predicting & Imitating | Mimicking existing styles and generating long, coherent pieces based on a massive training dataset. |
| Recurrent Neural Nets (RNNs) | Sequential Prediction | Generating shorter, simpler melodies and phrases that follow learned patterns. |

In short, you use a Transformer when you want to say, "Write me a song that sounds like Mozart." You use a genetic algorithm when you want to say, "Let's work together to discover a new melody that I find interesting."


This video shows a genetic algorithm in Python being used to generate music.


From <https://gemini.google.com/app/>  Google Gemini 2.5 Pro
