AI-Split Audio Tracks Aren’t Perfect, But They’re Getting There

Photo Credit: James Kovin

Music start-up Lalal.ai offers an AI audio splitter that separates a track's vocals and instrumentals into individual stems. With every track it processes, its splitting capabilities improve; it's (literally) a work in progress.

The following piece was developed in collaboration with Lalal.ai, a proud partner of Digital Music News.

So what exactly is AI audio splitting? In short, it's machine learning applied to separating a mixed recording into its component stems. One of the earlier players in this arena is Lalal.ai, which is pursuing a model designed to supplant static DAW plugins and conventional splitting software that isn't built to automatically improve over time.

By its very nature, the Lalal.ai technology isn't perfect. Instead, it's constantly evolving: the algorithm that does the splitting is trained on thousands of tracks, and each round of training refines the underlying AI. That means users who upload audio to the service help train the model toward higher-quality splits.

The sound quality and mastering of each individual track also play a big role in how accurate an AI audio split will be.

Lalal.ai believes that its AI-driven improvement trajectory will easily outdo existing splitter solutions over time. That includes Spleeter, a tool created by Deezer that lets DJs perform vocal reduction and isolation. Spleeter is a great program in the hands of the right audio engineer, but it's not quite ready for primetime.

Spleeter is a Python program operated from the command line. It always outputs 44.1 kHz/16-bit WAV files, no matter what format is uploaded. That means users need to convert the Spleeter output back to the original format before it can be used.
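
For reference, a typical two-stem split with Spleeter's Python API looks something like this (the file names here are placeholders):

```python
# Two-stem separation (vocals + accompaniment) via Spleeter's Python API.
# Requires `pip install spleeter`; the file paths are placeholders.
from spleeter.separator import Separator

# Load the pretrained two-stem model
separator = Separator('spleeter:2stems')

# Writes vocals.wav and accompaniment.wav (44.1 kHz WAV) under output/song/
separator.separate_to_file('song.mp3', 'output/')
```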

By contrast, Lalal.ai is a user-friendly, AI-powered online audio splitter. It doesn't need to be installed, and no third-party plugins or DAW are required. Just drag and drop an audio file in the browser, then download the separated vocal and instrumental tracks. The audio format you feed Lalal.ai is the format you receive once it's done processing.

So how is an AI audio-splitter trained, exactly?

Information modeling is the main way Lalal.ai learns to split audio tracks. The neural network is fed training examples: tracks that have already been split by a human. The more accurate the resulting model, the better the machine learning process can distinguish vocal audio from instrumental audio.

Learning is about optimizing the model against the knowledge currently available for the best outcome. This model optimization process is called machine learning when computers do the work (and good old-fashioned learning when humans do it).
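
As a rough sketch of what that optimization loop looks like in practice (the tiny network and random tensors below are stand-ins, not Lalal.ai's actual model or data), supervised source separation nudges a model toward reproducing the human-made reference stems:

```python
# Illustrative supervised training step for source separation (PyTorch).
# The small network and random tensors are stand-ins for a real model and dataset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 2048))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

for step in range(100):
    mixture = torch.randn(16, 1024)       # stand-in features of the mixed track
    reference = torch.randn(16, 2048)     # stand-in human-split stems (vocals + instrumental)
    predicted = model(mixture)            # the model's guess at the two stems
    loss = loss_fn(predicted, reference)  # how far the guess is from the human split
    optimizer.zero_grad()
    loss.backward()                       # compute gradients of the loss
    optimizer.step()                      # nudge the model toward better splits
```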

Which brings us to the core difference between regular audio-splitting technology and AI-powered approaches.

Much like its human counterparts, an AI trained on one specific task gets better at it over time. The computer is constantly modeling what it sees, and as that model expands, so does its knowledge.

Whereas conventional audio splitting programs like Deezer's Spleeter help humans split audio with a fixed model, AI programs that keep training learn and get better over time as more data is added. That means services like Lalal.ai can improve their stem splitting as more data becomes available.

Creating a working AI mechanism requires two things: a model and a model-optimization process, i.e., learning. The Lalal.ai technology has both. It can process stereo sound from various input audio formats and transform it into two stems. First, the network generates a data fragment describing where the vocal and instrumental parts sit within the original input signal.

After that, the data fragment is passed to another algorithm that converts the input signal into separate stems. While the process is simple to describe, it takes several billion math operations to place those data indicators well enough for a good result.
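
One common way to implement this kind of two-stage separation (a plausible sketch, not necessarily Lalal.ai's exact pipeline) is to have the network predict a time-frequency mask, which a second stage applies to the mixture's spectrogram before converting each stem back into audio:

```python
# Sketch of mask-based separation: a predicted time-frequency mask
# (here a dummy placeholder) is applied to the mixture's spectrogram.
import numpy as np
from scipy.signal import stft, istft

rate = 44100
mixture = np.random.randn(rate * 2)              # stand-in for a 2-second mixed track

# Stage 1: transform the mixture into a spectrogram
freqs, times, spectrogram = stft(mixture, fs=rate, nperseg=2048)

# A trained network would predict this mask; here it's a random placeholder
vocal_mask = np.random.rand(*spectrogram.shape)  # values in [0, 1] per time-frequency bin

# Stage 2: apply the mask and invert back to audio for each stem
_, vocals = istft(spectrogram * vocal_mask, fs=rate, nperseg=2048)
_, instrumental = istft(spectrogram * (1 - vocal_mask), fs=rate, nperseg=2048)
```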

Much like small children, who need to perform an act several times before they learn to do it correctly, neural networks require repetition. Lalal.ai needs to process hundreds of thousands of audio tracks before it can learn to separate vocal stems from instrumental stems. Training data can easily reach gigabytes of songs.