How (and Why) Are MP3 Files So Damn Small?

A technician works on one of Google's enormous data centers.

The MP3 format was first released 21 years ago this summer. It goes almost without saying that it, and the other compressed audio formats that followed, changed the music world forever.

Today, there are tens of millions of Americans who grew up with a music marketplace almost entirely different from the one that exists now. And there are tens of millions more who simply can’t remember a time before exhaustive libraries of music were everywhere, ready to roll at the click of a mouse.

Although data-compressed audio has been around long enough to vote, drive, serve in the armed forces and buy a bottle of bourbon, there are few audio engineers, never mind casual listeners, who can quite describe how these lossy formats work.

Many among us also remain puzzled as to why these files have stayed so small in a world where bandwidth and processor speeds continue to grow exponentially. But there are answers, and they are not beyond understanding.

Just How Much Smaller Are They?

In a streaming world, it’s useful to talk about audio files in terms of their “bit rate.” This is a term we can use to describe how many bits of data a file spits out every second.

It’s pretty easy to calculate the bit rate of standard full-resolution “CD quality sound.” Just take the number of samples the file uses per second (44,100 of them) and multiply that by the number of bits used by each sample (16, in this case).

A little simple arithmetic, and we can see that 44,100 samples per second x 16 bits per sample = 705,600 bits per second.

But that’s just for one channel of audio. Chances are your music is going to be in stereo, which means we’ve got to multiply by 2. When all is said and done, a standard, uncompressed, full resolution “PCM” audio file uses 1,411,200 bits per second, or about 1,411 kbps.
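For readers who would rather see the arithmetic as code, here is a minimal Python sketch of the calculation above. The function and constant names are only illustrative; the figures are the ones already given in the text.

```python
# Uncompressed PCM bit rate = sample rate x bit depth x channels.
SAMPLE_RATE_HZ = 44_100   # samples per second, per channel (CD audio)
BIT_DEPTH = 16            # bits per sample
CHANNELS = 2              # stereo


def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Return the bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bit_depth * channels


mono_bps = pcm_bit_rate(SAMPLE_RATE_HZ, BIT_DEPTH, 1)
stereo_bps = pcm_bit_rate(SAMPLE_RATE_HZ, BIT_DEPTH, CHANNELS)

print(mono_bps)    # 705600
print(stereo_bps)  # 1411200
```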

It doesn’t take a math genius to realize that this is a whole lot bigger than what we’re used to seeing from MP3s and other lossy compressed files. At 128kbps to 320kbps, today’s compressed audio files are anywhere from 1/4th to 1/11th the size of a single track on a standard resolution CD.
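The ratios can be checked with a few lines of Python. This is only a sketch: the helper names are made up for illustration, and the four-minute track length is an assumption chosen to make the file sizes concrete.

```python
CD_BIT_RATE_KBPS = 1411.2  # 44,100 samples x 16 bits x 2 channels / 1000


def size_ratio(compressed_kbps: float, source_kbps: float = CD_BIT_RATE_KBPS) -> float:
    """How many times smaller the compressed stream is than the source."""
    return source_kbps / compressed_kbps


def file_size_mb(kbps: float, seconds: float) -> float:
    """Approximate file size in megabytes (1 MB = 1,000,000 bytes)."""
    bits = kbps * 1000 * seconds
    return bits / 8 / 1_000_000


TRACK_SECONDS = 240  # assumed four-minute song, for illustration only

for kbps in (128, 256, 320):
    ratio = size_ratio(kbps)
    size = file_size_mb(kbps, TRACK_SECONDS)
    print(f"{kbps} kbps: about 1/{ratio:.1f} of CD size, {size:.1f} MB")

print(f"CD audio: {file_size_mb(CD_BIT_RATE_KBPS, TRACK_SECONDS):.1f} MB")
```

Under those assumptions, a four-minute track comes to roughly 42 MB as CD audio and a little under 4 MB at 128 kbps.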

How Do They Compare?

Unlike “lossless” compressed formats such as FLAC or .zip, which shrink a file by encoding the same data in a more efficient way for storage and transfer, “lossy” formats such as MP3, AAC and Ogg Vorbis really do throw huge swaths of data away.

What’s truly amazing is that despite this fact, MP3s still sound as good as they do. Although the earliest encoders and bit rates often sounded pretty terrible — a true triumph of convenience and cost over quality — the story is much different today, and has been for some time.

On principle, the idea of lossless compressed files like FLAC may still seem attractive to some listeners. Unfortunately, these files are only marginally smaller than the originals, require extra processing, and are not especially suited to real-time streaming at the moment. It also appears that once MP3s reach a certain resolution, FLAC doesn’t offer a perceptible increase in fidelity.

Although it is definitely possible for trained listeners to hear the difference at some resolutions, studies show that most musicians and many audio nerds have trouble telling even today’s relatively low-res 128kbps MP3 codecs apart from standard resolution formats. Not too bad for a file 1/11th the size. (You can test yourself at http://mp3ornot.com. Although some of us can get it right 10 times out of 10, even we have to admit it’s a pretty subtle difference.)

When we increase the bit rate to 256kbps, most trained listeners can’t tell the MP3 apart from the source in a blind test. And at 320kbps? There is currently no evidence of even trained listeners telling an MP3 apart from any higher resolution file in a properly controlled listening test. (That includes super-duper-high-resolution files like 24/192 WAV. Not bad for a file 4x to 5x smaller than a standard-resolution one!)

Even at old and outmoded resolutions like 128 kbps, today’s MP3s are arguably higher in fidelity than AM/FM radio, cassette, vinyl, and essentially any other historical audio format. At higher bit rates, it’s no contest at all. We are now beyond the days of “convenience vs quality” when it comes to consumer audio formats.

How Do They Get So Small?

A technician walks the halls of a Google data center.

The process of compressing audio files is interesting in itself.

We can’t reasonably make an audio file smaller by just throwing data away willy-nilly. If we were to try to halve the size of a track that way, say by throwing away 8 of the 16 bits in every sample, almost anyone could hear the difference.

If we did this, we wouldn’t just lose half of our available amplitude values. We’d end up cutting them exponentially, from 65,536 possible values per sample way down to 256. This is something you’d almost definitely hear by way of a dramatically increased noise floor. All that sacrifice, and we’re still not coming anywhere near the data savings of even the largest MP3s.
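To put rough numbers on that noise floor, here is a small Python sketch that quantizes a test tone to 16 and to 8 bits and measures the resulting signal-to-noise ratio. It assumes a simple rounding quantizer with no dither and a 997 Hz full-scale sine as the test signal; both are choices made for illustration, not anything prescribed by a particular format.

```python
import math


def levels(bits: int) -> int:
    """Number of distinct values a sample of this bit depth can take."""
    return 2 ** bits


def dynamic_range_db(bits: int) -> float:
    """Ratio between the largest and smallest step, in decibels."""
    return 20 * math.log10(levels(bits))


def quantize(sample: float, bits: int) -> float:
    """Round a sample in [-1.0, 1.0] to the nearest step of a `bits`-bit grid."""
    scale = 2 ** (bits - 1) - 1
    return round(sample * scale) / scale


def snr_db(bits: int, n: int = 44_100, freq_hz: float = 997.0) -> float:
    """Measured signal-to-noise ratio of a full-scale sine after quantization."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = math.sin(2 * math.pi * freq_hz * i / 44_100)
        err = quantize(x, bits) - x
        signal_power += x * x
        noise_power += err * err
    return 10 * math.log10(signal_power / noise_power)


for bits in (16, 8):
    print(bits, levels(bits), round(dynamic_range_db(bits), 1), round(snr_db(bits), 1))
```

In decibel terms the range drops from about 96 dB to about 48 dB, which means the quantization noise sits roughly 48 dB closer to the music. That is the raised noise floor described above.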

The key to effective lossy compression lies in something called “perceptual coding.” This is basically a fancy way of saying “exploiting all the ways in which our brains don’t work quite right.”

Much like a film camera works by exploiting a quirk of our perception, recording just 24 snapshots per second to create a fluid image and throwing the rest away, an MP3 works because our ear and brain are simply incapable of processing all the acoustic information around us. When we remove this information, we do not miss it, because it is information that we are not equipped to process by nature.

Just like with film and video frame rates, there are two questions to ask: What’s the minimum we can get away with and still make people happy, and what’s the maximum resolution that will confer some kind of advantage?

Below a certain point, we are not likely to create a satisfying aesthetic experience. Above a certain point, the human mind and body can simply not process any additional information, making any addition of resolution an exercise in senseless self-indulgence and misleading marketeering.

So, in order to design an effective audio codec, engineers and programmers have to know what we can and can’t hear, and under what circumstances. The three most important quirks of our hearing in this context are Temporal Masking, Simultaneous Masking, and the Absolute Hearing Threshold.

To understand simultaneous masking, imagine standing next to a roaring jackhammer and dropping a pin to the ground. Do you think you’d be able to hear the pin drop? If you answered “no”, then congratulations: You are not lying, and/or not stupid.

Temporal masking is similar: If you set a firecracker off next to your ear (please do not do this) and then dropped a pin immediately after, do you think you’d hear that pin? If you dropped it quickly enough, the answer would once again be a clear and certain “no”.

In both cases, it’s not that the pin doesn’t make a sound. It does, and with equipment sensitive enough, we could measure that sound. It’s just that our systems of perception are incapable of hearing it, much in the same way we are incapable of seeing 24 frames of film per second as a series of still images. With all confidence, we can safely throw away the sound of that pin dropping without affecting anyone’s listening experience in the slightest.
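The jackhammer-and-pin idea can be sketched as a toy Python filter. To be clear, this is not the psychoacoustic model any real MP3 encoder uses. The margin, the frequency neighbourhood, the flat hearing floor and the example sound levels are all hypothetical numbers picked for illustration.

```python
from typing import List, NamedTuple


class Tone(NamedTuple):
    freq_hz: float
    level_db: float


# Hypothetical thresholds, for illustration only.
MASKING_MARGIN_DB = 30.0   # how far below a louder neighbour a sound must be to vanish
NEIGHBOUR_RATIO = 1.5      # frequencies within this ratio count as "nearby"
HEARING_FLOOR_DB = 0.0     # flat stand-in for the absolute hearing threshold


def audible(tones: List[Tone]) -> List[Tone]:
    """Keep only the tones this toy model says a listener would notice."""
    kept = []
    for tone in tones:
        if tone.level_db < HEARING_FLOOR_DB:
            continue  # too quiet to hear even in silence
        masked = any(
            other is not tone
            and max(other.freq_hz, tone.freq_hz) / min(other.freq_hz, tone.freq_hz)
            <= NEIGHBOUR_RATIO
            and other.level_db - tone.level_db >= MASKING_MARGIN_DB
            for other in tones
        )
        if not masked:
            kept.append(tone)
    return kept


scene = [
    Tone(1000, 100),  # the "jackhammer"
    Tone(1200, 20),   # the "pin", close in frequency and far quieter
    Tone(8000, 30),   # a quiet sound far away in frequency
    Tone(3000, -5),   # below the hearing floor
]
print(audible(scene))
```

With these made-up numbers, only the jackhammer and the distant 8,000 Hz tone survive. The pin is masked, and the sub-threshold tone is dropped outright.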
