Loading Audio in Node.js

19 May 2021

Working with audio as a developer can unlock many awesome features, and a lot of fun. You can generate music, analyze audio using machine learning, build audio visualizers, music information retrieval systems, and much more. It's an extremely fun field. But working with audio can be tricky - how is sound represented on a computer? How can we manipulate that sound? And how do we serialize sound data to disk?

Pulse Code Modulation Encoding

This post won't be a deep dive into audio encoding - it's a practical guide to loading audio in Node.js and getting it into a state that you can work with. Generally, digital signal processing (which here means "working with audio data using code") operates on a kind of audio data called Pulse Code Modulation (or "PCM" for short). There's a lot of fancy theory and maths behind PCM encoding - but until you're ready to dive into Wikipedia, you can think of it as "a long list of numbers that represent the change in air pressure over time that makes up a sound". This is, after all, what a microphone measures and converts into numbers.

Samples

Each number in the list that makes up a sound is called a "sample". The sample can be represented on disk as one of several kinds of numbers - floating point numbers, integers, or other representations. The number of bits used to represent each sample affects its precision - for example, a 16 bit integer can take 65,536 distinct values, while an 8 bit integer can take only 256. The number of bits in each sample is referred to as the "bit depth".
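As a rough illustration of bit depth, here is a minimal sketch that counts the values available at a given bit depth and converts signed 16 bit integer samples into floating point numbers. The helper names `levels` and `int16ToFloat32` are made up for this example, and scaling into the range -1 to 1 is a common convention for floating point audio rather than a requirement.

```javascript
// Number of distinct values a sample can take at a given bit depth
const levels = (bitDepth) => 2 ** bitDepth;

// Convert signed 16 bit integer samples to floats in the range [-1, 1).
// int16ToFloat32 is a hypothetical helper, not part of any library.
function int16ToFloat32(int16Samples) {
  const floats = new Float32Array(int16Samples.length);
  for (let i = 0; i < int16Samples.length; i++) {
    // 32768 is 2^15, the magnitude of the most negative 16 bit value
    floats[i] = int16Samples[i] / 32768;
  }
  return floats;
}

console.log(levels(8)); // 256
console.log(levels(16)); // 65536

const converted = int16ToFloat32(Int16Array.from([-32768, 0, 16384]));
console.log(Array.from(converted)); // [ -1, 0, 0.5 ]
```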

Sample Rate

Another important attribute of PCM encoded audio is the "sample rate". This refers to the rate at which samples should be played in order for the sound to be at the right speed. For reasons outside the scope of this post, the sample rate dictates the highest frequency component that can be represented in a sound: half the sample rate. For the purposes of most audio intended for human listening, it's important to store audio at a sample rate slightly higher than double the maximum frequency that humans can hear. Since humans can't really hear audio over 20,000 Hz, a standard sample rate has emerged at 44,100 Hz. The "Hz" unit here refers to hertz, which means "cycles per second" - or, for a sample rate, "samples per second". Sometimes you can encounter audio with a higher or lower sample rate - audio for movies can be up to 192,000 Hz, and signals representing things that aren't meant for human hearing (for example, geological sonar scans) might not need as many as 44,100 samples per second.
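To tie samples and sample rate together, here is a small sketch that generates one second of a 440 Hz sine tone as a list of samples, and computes the highest frequency a given sample rate can represent. The names `makeSineWave` and `highestFrequency` are invented for this example.

```javascript
// Generate a sine tone as a list of samples (a PCM signal).
// makeSineWave is a hypothetical helper, not part of any library.
function makeSineWave(frequency, durationSeconds, sampleRate) {
  const length = Math.round(durationSeconds * sampleRate);
  const signal = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    const time = i / sampleRate; // time of this sample, in seconds
    signal[i] = Math.sin(2 * Math.PI * frequency * time);
  }
  return signal;
}

// The highest representable frequency is half the sample rate
const highestFrequency = (sampleRate) => sampleRate / 2;

const tone = makeSineWave(440, 1, 44100);
console.log(tone.length); // 44100 samples for one second of audio
console.log(highestFrequency(44100)); // 22050
```

Playing those 44,100 samples back at 22,050 samples per second would take two seconds and sound an octave lower, which is why the sample rate has to travel with the audio.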

Loading PCM audio from disk

Several audio file formats store PCM encoded audio directly - wav and aiff are examples.

Luckily, other developers have implemented great libraries that handle the complexities of parsing wav files for you. I recommend node-wav, by Andreas Gal. It's got a simple API, and uses the metadata at the start of the wav file to automatically choose the correct sample rate, bit depth, and number encoding. From the readme, here is a code example.

let fs = require("fs");
let wav = require("node-wav");

let buffer = fs.readFileSync("file.wav");
let result = wav.decode(buffer);
console.log(result.sampleRate);
console.log(result.channelData); // array of Float32Arrays

The result.channelData property contains one signal per channel, each of which you can use as a standard JavaScript Float32Array. The result object also exposes the sample rate, which you will likely need to know for many operations.
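As a sketch of what you can do with that object, the snippet below works out the channel count and duration. It uses a hand-built stand-in with the same `sampleRate` and `channelData` fields shown in the readme example, so it runs without a wav file; `describeAudio` and `fakeResult` are names made up for this illustration.

```javascript
// Summarize a decoded audio object of the shape { sampleRate, channelData }.
// describeAudio is a hypothetical helper, not part of node-wav.
function describeAudio(decoded) {
  const channels = decoded.channelData.length;
  const samplesPerChannel = channels > 0 ? decoded.channelData[0].length : 0;
  return {
    channels,
    samplesPerChannel,
    durationSeconds: samplesPerChannel / decoded.sampleRate,
  };
}

// Stand-in for a decoded file: half a second of stereo silence
const fakeResult = {
  sampleRate: 44100,
  channelData: [new Float32Array(22050), new Float32Array(22050)],
};

const summary = describeAudio(fakeResult);
console.log(summary);
// { channels: 2, samplesPerChannel: 22050, durationSeconds: 0.5 }
```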

If you're using Meyda to analyze audio that you load in this way, you will need to make sure that the sample rate of the audio matches the sample rate that Meyda is set to use. Otherwise you'll end up with audio features that are incorrect, and based on a skewed frequency scale. You can either match the Meyda sample rate to the wav sample rate, or you can resample the audio to fit a standard sample rate (e.g. 44,100 Hz or 48,000 Hz). Resampling audio is a complicated topic beyond the scope of this article, but if you have trouble finding information online, let me know and I may find time to write an article.
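To see where the skew comes from: frequency analysis usually splits a buffer of samples into "bins", and the frequency a bin stands for depends on the sample rate. The sketch below uses the standard relationship for a discrete Fourier transform; `binToFrequency` is a made-up helper for illustration and is not Meyda's API.

```javascript
// Frequency (in Hz) represented by a given bin of a Fourier transform
// computed over bufferSize samples. Hypothetical helper for illustration.
function binToFrequency(binIndex, bufferSize, sampleRate) {
  return (binIndex * sampleRate) / bufferSize;
}

// The same bin means different frequencies at different sample rates
console.log(binToFrequency(10, 512, 44100)); // 861.328125
console.log(binToFrequency(10, 512, 22050)); // 430.6640625
```

If the analysis assumes 44,100 Hz but the file was recorded at 22,050 Hz, every frequency-based feature is reported at double its true value.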

AIFF files also store PCM audio data, but differ from WAV files in that they have a different header format for storing metadata. node-wav doesn't support AIFF files, and I haven't found a package I would recommend to do so. If you need to analyze AIFF files, I would suggest using a utility like ffmpeg to transcode the audio to wav.
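If you need to decide which files to send through ffmpeg, one option is to look at the first few bytes of the file. WAV files begin with the ASCII text "RIFF" and carry "WAVE" at byte 8, while AIFF files begin with "FORM" and carry "AIFF" (or "AIFC") at byte 8. This is only a sketch: `detectPcmContainer` is a name invented for this example, and it checks the container, not whether the contents can actually be decoded.

```javascript
// Guess the container format from the first 12 bytes of a file.
// Accepts a Uint8Array (a Node Buffer also works, since it is a Uint8Array).
// detectPcmContainer is a hypothetical helper for illustration.
function detectPcmContainer(bytes) {
  if (bytes.length < 12) return "unknown";
  const ascii = (start, end) =>
    String.fromCharCode(...bytes.subarray(start, end));
  const chunkId = ascii(0, 4);
  const formType = ascii(8, 12);
  if (chunkId === "RIFF" && formType === "WAVE") return "wav";
  if (chunkId === "FORM" && (formType === "AIFF" || formType === "AIFC")) {
    return "aiff";
  }
  return "unknown";
}

// Fake 12 byte headers; bytes 4 to 7 hold a size field that we ignore here
const encoder = new TextEncoder();
console.log(detectPcmContainer(encoder.encode("RIFF\0\0\0\0WAVE"))); // wav
console.log(detectPcmContainer(encoder.encode("FORM\0\0\0\0AIFF"))); // aiff
```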

What about non-PCM audio formats?

But what about audio file formats like mp3, ogg, and flac? The difference between these formats and wav is that the audio is compressed on disk. mp3 and ogg (usually Ogg Vorbis) use what's called "lossy" compression - that means they change the actual sound in ways that are hopefully imperceptible to most listeners in order to get better compression. flac, meanwhile, is a format that implements lossless compression. This means that it encodes audio on disk in a more efficient format than storing each sample as a full integer or floating point number, but without modifying the audio itself.
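A quick calculation shows why compression is attractive. The sketch below estimates how many bytes of sample data uncompressed PCM audio needs, ignoring the small file header; `pcmByteSize` is a name invented for this example.

```javascript
// Bytes needed to store the samples of uncompressed PCM audio.
// pcmByteSize is a hypothetical helper for illustration.
function pcmByteSize(durationSeconds, sampleRate, channels, bitDepth) {
  const bytesPerSample = bitDepth / 8;
  return durationSeconds * sampleRate * channels * bytesPerSample;
}

// One minute of 16 bit stereo audio at 44,100 Hz
console.log(pcmByteSize(60, 44100, 2, 16)); // 10584000 bytes, roughly 10 MB
```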

Encoding agnostic signal processing code

It's best to write signal processing code that works with one representation of audio, and reuse it by converting the audio - rather than having one implementation of your signal processing code for each audio encoding. We can achieve code reusability by converting all audio to a common format for signal processing, so that your code only has to think about one representation. Programs and libraries that convert between an encoded format and raw audio are called "codecs", which comes from "COder/DECoder". In order to support a particular file format in your program, you will need to make sure that you have the right codec. Luckily, you don't need to understand each audio format and implement a codec yourself - you can use packages to do this. So when you're writing your signal processing code, you should write code that works on raw signals, not encoded or compressed ones. In many cases, in JavaScript, signals are represented as Float32Arrays - and unless you have specific requirements where this causes a limitation for you, I would recommend sticking to writing code that assumes signals are in Float32Arrays.
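Here is a minimal sketch of what "encoding agnostic" code can look like: a function that mixes any number of channels down to mono. It only knows about Float32Arrays, so it works the same whether the audio started life as a wav, an mp3, or anything else. `mixToMono` is a made-up name for this example, and it assumes every channel has the same length.

```javascript
// Average several channels into one mono signal.
// Takes an array of Float32Arrays of equal length (one per channel).
// mixToMono is a hypothetical helper, not part of any library.
function mixToMono(channelData) {
  if (channelData.length === 0) return new Float32Array(0);
  const length = channelData[0].length;
  const mono = new Float32Array(length);
  for (const channel of channelData) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channelData.length;
    }
  }
  return mono;
}

const left = Float32Array.from([1, 0.5]);
const right = Float32Array.from([0, 0.5]);
console.log(Array.from(mixToMono([left, right]))); // [ 0.5, 0.5 ]
```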

Loading alternative encodings from disk

While there are some implementations of mp3 decoders in JavaScript, I would actually recommend calling out to another technology to do the transcoding. ffmpeg is a long running open source project that excels in media encoding. It can translate between many different media encodings, and I'm confident that it covers a huge portion of transcoding needs. In Node, you can call out to ffmpeg using the child_process API.

import { execFile } from "child_process";
import { readFileSync } from "fs";
import { mkdtemp } from "fs/promises";
import path from "path";
import os from "os";
import wav from "node-wav";

// Create a temporary directory to store transcoded audio
const TEMP_DIR = await mkdtemp(path.join(os.tmpdir(), "transcoder-storage-"));

function transcodeToWav(filename) {
  return new Promise((resolve, reject) => {
    // Use only the base name so the output always lands inside TEMP_DIR
    const outputFilename = path.join(TEMP_DIR, `${path.basename(filename)}.wav`);
    // Call out to ffmpeg without a shell, so spaces or special characters
    // in filenames can't break the command
    execFile(
      "ffmpeg",
      ["-i", filename, outputFilename],
      (error, stdout, stderr) => {
        if (error) {
          reject(error);
          return;
        }
        resolve({ filename: outputFilename, stdout, stderr });
      }
    );
  });
}

try {
  const result = await transcodeToWav("./164064__cclaretc__rooster.mp3");
  // result.filename is the path of the transcoded audio.
  // We can now use node-wav as described above to read the audio
  const buffer = readFileSync(result.filename);
  const decodedAudio = wav.decode(buffer);
  console.log(decodedAudio.sampleRate);
  console.log(decodedAudio.channelData); // array of Float32Arrays
} catch (error) {
  console.error("Transcoding or decoding failed:", error);
}

I'm using a modern version of Node.js which allows imports, top level await in .mjs files, and exposes the fs/promises interface, but this code can be refactored to work on older versions of Node if you need.
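ffmpeg can also resample while it transcodes, which is one way to get every file to a known sample rate before analysis. The sketch below only builds the argument list, which you could pass to child_process.execFile alongside the "ffmpeg" command. `buildTranscodeArgs` and its defaults are assumptions made for this example; `-y` (overwrite output), `-i` (input file), `-ar` (audio sample rate) and `-ac` (audio channel count) are ffmpeg options.

```javascript
// Build an ffmpeg argument list that transcodes a file and resamples it.
// buildTranscodeArgs is a hypothetical helper; the defaults are arbitrary.
function buildTranscodeArgs(inputFilename, outputFilename, options = {}) {
  const { sampleRate = 44100, channels = 1 } = options;
  return [
    "-y", // overwrite the output file if it already exists
    "-i", inputFilename,
    "-ar", String(sampleRate), // output sample rate in Hz
    "-ac", String(channels), // number of output channels
    outputFilename,
  ];
}

const args = buildTranscodeArgs("rooster.mp3", "rooster.wav");
console.log(args.join(" "));
// -y -i rooster.mp3 -ar 44100 -ac 1 rooster.wav
```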

One thing to bear in mind is that in order for this to work, you need to have a copy of ffmpeg on the system that you're running the code on. Luckily, there's a package for that - ffmpeg-static is a dependency that you can include in your project that installs a statically linked copy of ffmpeg and exports the path to that binary, which you can run in place of the bare ffmpeg command. You can use it to ensure that ffmpeg is always available to your code. Check it out!

But what about the web?

While in theory it might be possible to compile ffmpeg with Emscripten and run it in a web worker (I certainly assume someone has done this), it's not necessarily practical to use the same technique as in Node to transcode audio on the web. The good news is that the W3C has chartered a working group to focus on web codecs. While this is at the time of writing still in early stages, the working group is powering ahead on designing and proposing an API to enable media transcoding on the web, and hopefully that will become available to us in the near future.

What did we learn?

In this blog post, I covered the basics of Pulse Code Modulation encoding, how to load wav files from disk, the difference between wav files and other audio file formats, transcoding other file formats to wav for loading in Node, and how transcoding might soon work on the web as well. I hope these explanations have been useful to you. If anything is unclear, or you have more questions, please let me know on Twitter! Thanks for reading.