BandIt Plus (speech, music, effects)

BandIt Plus is a model for separating tracks into speech, music, and effects. It can be useful for television or film clips. The model was prepared by the authors of the paper "A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation" and published in their GitHub repository. It was trained on the Divide and Remaster (DnR) dataset and, at the moment, has the best quality metrics among similar models.

Quality table

Algorithm name | SDR Speech | SDR Music | SDR Effects
BandIt Plus | 15.62 | 9.21 | 9.69

(Metrics measured on the DnR dataset, test split.)
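For reference, the SDR (signal-to-distortion ratio) numbers in tables like this one measure, per stem, how small the residual error is relative to the reference signal, in dB. Below is a minimal NumPy sketch of the plain SDR definition; the exact evaluation scripts behind the MVSep leaderboards may differ in details, so treat it as an illustration only.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Plain signal-to-distortion ratio in dB (higher is better)."""
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))

# Toy check: an almost perfect estimate gets a high SDR.
rng = np.random.default_rng(0)
speech_ref = rng.standard_normal(44100)                      # 1 second of "reference speech"
speech_est = speech_ref + 0.01 * rng.standard_normal(44100)  # nearly identical estimate
print(f"SDR: {sdr(speech_ref, speech_est):.2f} dB")          # roughly 40 dB
```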

BandIt v2 (speech, music, effects)

Bandit v2 is a model for cinematic audio source separation into 3 stems: speech, music, and effects/sfx. It was trained on the DnR v3 dataset.

More information in the official repository: https://github.com/kwatcharasupat/bandit-v2
Paper: https://arxiv.org/pdf/2407.07275


MVSep DnR v3 (speech, music, effects)

MVSep DnR v3 is a cinematic model for splitting tracks into 3 stems: music, sfx, and speech. It is trained on the huge multilingual DnR v3 dataset. Its quality metrics on the test data turned out to be better than those of the similar multilingual model Bandit v2. The model is available in 3 variants: one based on the SCNet architecture, one based on the MelBand Roformer architecture, and an ensemble of these two models. See the table below:

Algorithm name | music (SDR) | sfx (SDR) | speech (SDR)
SCNet Large | 9.94 | 11.35 | 12.59
Mel Band Roformer | 9.45 | 11.24 | 12.27
Ensemble (Mel + SCNet) | 10.15 | 11.67 | 12.81
Bandit v2 (for reference) | 9.06 | 10.82 | 12.29

(SDR metric on the DnR v3 leaderboard.)

DrumSep (4-6 stems: kick, snare, cymbals, toms, ride, hh, crash)

The model separates the drum track into 4, 5, or 6 stems. The base 4 stems are 'kick', 'snare', 'cymbals', and 'toms'. In the 5-stem models, 'hh' is separated from 'cymbals'; in the 6-stem models, 'cymbals' is split into 'hh', 'ride', and 'crash'.

A total of 8 models are available:
1) The DrumSep model from the GitHub repository. It is based on the HDemucs architecture and splits drums into 4 stems.
2) A model based on the MDX23C architecture, prepared by @jarredou and @aufr33. The model splits drums into 6 stems.
3) A model based on the SCNet XL architecture, which splits drums into 5 tracks.
4) A model based on the SCNet XL architecture, which splits drums into 6 tracks.
5) A model based on the SCNet XL architecture, which splits drums into 4 tracks.
6) An ensemble of 4 models (1 MDX23C + 3 SCNet XL).
7) A model based on the MelBand Roformer architecture, which splits drums into 4 tracks.
8) A model based on the MelBand Roformer architecture, which splits drums into 6 tracks.

All models work only with the drum track. If other instruments or vocals are present in the track, the model will not work correctly. Therefore, the algorithm has two modes of operation. In the first (default) mode, the best model for drums, MVSep Drums, is first applied to the track, extracting only the drum part. Then, the DrumSep model is applied. If your track consists only of drums, it makes sense to use the second mode, where the DrumSep model is applied directly to the uploaded audio.

Quality table (SDR metric, higher is better):

Algorithm name | kick | snare | toms | cymbals | hh | ride | crash
DrumSep model by inagoy (HDemucs, 4 stems) | 14.13 | 8.42 | 5.67 | 5.63 | --- | --- | ---
DrumSep model by aufr33 and jarredou (MDX23C, 6 stems) | 18.32 | 13.60 | 13.25 | --- | 6.71 | 5.38 | 7.56
DrumSep SCNet XL (5 stems) | 20.21 | 15.05 | 16.28 | 7.05 | 8.56 | --- | ---
DrumSep SCNet XL (6 stems) | 20.24 | 14.80 | 15.93 | --- | 6.74 | 5.02 | 7.63
DrumSep SCNet XL (4 stems) | 20.50 | 14.69 | 15.92 | 10.08 | --- | --- | ---
Ensemble of 4 models (3 * SCNet + MDX23C) | 20.59 | 15.11 | 16.41 | --- | 7.19 | 5.59 | 7.85
DrumSep Mel Band Roformer (4 stems) | 22.22 | 17.09 | 15.86 | 11.87 | --- | --- | ---
DrumSep Mel Band Roformer (6 stems) | 20.21 | 15.33 | 15.48 | --- | 8.79 | 6.96 | 8.79

('---' marks stems that a given model does not produce.)

Quality table (L1 Freq metric, higher is better):

Algorithm name | kick | snare | toms | cymbals | hh | ride | crash
DrumSep model by inagoy (HDemucs, 4 stems) | 74.34 | 62.20 | 73.52 | 68.87 | --- | --- | ---
DrumSep model by aufr33 and jarredou (MDX23C, 6 stems) | 78.20 | 71.27 | 84.22 | --- | 80.84 | 86.74 | 79.41
DrumSep SCNet XL (5 stems) | 81.56 | 73.16 | 87.85 | 80.65 | 75.44 | --- | ---
DrumSep SCNet XL (6 stems) | 81.63 | 72.75 | 87.46 | --- | 79.97 | 85.73 | 78.67
DrumSep SCNet XL (4 stems) | 81.69 | 72.90 | 88.43 | 73.64 | --- | --- | ---
Ensemble of 4 models (3 * SCNet + MDX23C) | 81.91 | 73.41 | 88.24 | --- | 81.12 | 86.91 | 79.41
DrumSep Mel Band Roformer (4 stems) | 84.97 | 77.78 | 90.13 | 78.16 | --- | --- | ---
DrumSep Mel Band Roformer (6 stems) | 81.82 | 75.63 | 88.93 | --- | 85.66 | 90.50 | 82.18

Quality table (Fullness metric, higher is better):

Algorithm name | kick | snare | toms | cymbals | hh | ride | crash
DrumSep model by inagoy (HDemucs, 4 stems) | 13.61 | 18.80 | 20.86 | 15.80 | --- | --- | ---
DrumSep model by aufr33 and jarredou (MDX23C, 6 stems) | 18.67 | 17.85 | 18.29 | --- | 12.95 | 15.76 | 14.92
DrumSep SCNet XL (5 stems) | 18.40 | 30.94 | 29.64 | 13.28 | 15.15 | --- | ---
DrumSep SCNet XL (6 stems) | 32.03 | 29.43 | 36.04 | --- | 13.64 | 14.05 | 15.05
DrumSep SCNet XL (4 stems) | 29.87 | 30.53 | 48.35 | 17.48 | --- | --- | ---
Ensemble of 4 models (3 * SCNet + MDX23C) | 23.89 | 30.06 | 36.19 | --- | 14.23 | 18.34 | 15.43
DrumSep Mel Band Roformer (4 stems) | 19.45 | 23.09 | 40.32 | 16.44 | --- | --- | ---
DrumSep Mel Band Roformer (6 stems) | 15.22 | 25.98 | 42.33 | --- | 19.53 | 20.51 | 19.39

Quality table (Bleedless metric, higher is better):

Algorithm name | kick | snare | toms | cymbals | hh | ride | crash
DrumSep model by inagoy (HDemucs, 4 stems) | 48.04 | 18.25 | 33.85 | 14.65 | --- | --- | ---
DrumSep model by aufr33 and jarredou (MDX23C, 6 stems) | 53.25 | 38.81 | 56.08 | --- | 10.52 | 8.17 | 14.55
DrumSep SCNet XL (5 stems) | 53.33 | 26.00 | 51.72 | 7.97 | 12.66 | --- | ---
DrumSep SCNet XL (6 stems) | 36.82 | 28.82 | 40.28 | --- | 7.43 | 8.25 | 11.93
DrumSep SCNet XL (4 stems) | 44.34 | 29.05 | 28.87 | 16.35 | --- | --- | ---
Ensemble of 4 models (3 * SCNet + MDX23C) | 51.58 | 32.20 | 46.38 | --- | 8.32 | 8.51 | 14.26
DrumSep Mel Band Roformer (4 stems) | 69.11 | 57.86 | 51.44 | 50.52 | --- | --- | ---
DrumSep Mel Band Roformer (6 stems) | 74.12 | 52.23 | 46.14 | --- | 35.19 | 31.70 | 36.12

@jarredou prepared a new DrumSep validation dataset. It consists of 150 short tracks. Drumkits 001 to 017 (5 tracks per kit, each with a different playing style) are acoustic drums; kits 018 to 082 (1 track per kit) are electronic drums. The dataset targets 5-stem drum separation: ['kick', 'snare', 'toms', 'hh', 'cymbals']. For 6-stem models, 'ride' and 'crash' were summed into 'cymbals'; for 4-stem models, 'hh' and 'cymbals' were summed into 'cymbals'.
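As an illustration of the stem summing described above, here is a small NumPy sketch (toy zero arrays stand in for real audio; stem names follow the lists above):

```python
import numpy as np

# Toy 6-stem model output for one track (equal-length waveforms).
stems6 = {name: np.zeros(44100) for name in ["kick", "snare", "toms", "hh", "ride", "crash"]}

# 6-stem output -> 5-stem evaluation: 'ride' and 'crash' are summed into 'cymbals'.
stems5 = {
    "kick": stems6["kick"],
    "snare": stems6["snare"],
    "toms": stems6["toms"],
    "hh": stems6["hh"],
    "cymbals": stems6["ride"] + stems6["crash"],
}

# 5-stem output -> 4-stem evaluation: 'hh' is summed into 'cymbals'.
stems4 = {
    "kick": stems5["kick"],
    "snare": stems5["snare"],
    "toms": stems5["toms"],
    "cymbals": stems5["hh"] + stems5["cymbals"],
}
```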

Quality table (SDR metric, higher is better):

Algorithm name | kick | snare | toms | cymbals | hh
DrumSep model by inagoy (HDemucs, 4 stems) | 10.52 | 6.05 | 4.68 | 5.03 | ---
DrumSep model by aufr33 and jarredou (MDX23C, 6 stems) | 14.54 | 9.79 | 10.63 | 3.19 | 6.08
DrumSep SCNet XL (5 stems) | 17.89 | 12.56 | 14.14 | 3.63 | 6.15
DrumSep SCNet XL (6 stems) | 17.74 | 12.43 | 14.24 | 3.39 | 5.91
DrumSep SCNet XL (4 stems) | 17.61 | 12.37 | 13.40 | 7.48 | ---
DrumSep Mel Band Roformer (4 stems) | 18.67 | 13.55 | 13.60 | 8.76 | ---
DrumSep Mel Band Roformer (6 stems) | 17.46 | 12.64 | 13.69 | 5.05 | 7.06

(On this dataset all models are evaluated on at most 5 stems, so the 'ride' and 'crash' columns are omitted; '---' marks stems a model does not produce.)

Quality table (L1 Freq metric, higher is better):

Algorithm name | kick | snare | toms | cymbals | hh
DrumSep model by inagoy (HDemucs, 4 stems) | 48.68 | 30.27 | 42.44 | 39.26 | ---
DrumSep model by aufr33 and jarredou (MDX23C, 6 stems) | 56.95 | 38.31 | 54.65 | 47.47 | 47.39
DrumSep SCNet XL (5 stems) | 61.56 | 43.06 | 60.76 | 48.19 | 47.49
DrumSep SCNet XL (6 stems) | 61.46 | 42.42 | 60.55 | 47.32 | 46.43
DrumSep SCNet XL (4 stems) | 61.59 | 42.91 | 60.46 | 44.65 | ---
DrumSep Mel Band Roformer (4 stems) | 65.24 | 47.13 | 63.50 | 49.77 | ---
DrumSep Mel Band Roformer (6 stems) | 63.58 | 46.14 | 62.94 | 53.98 | 51.83

Quality table (Log WMSE metric, higher is better):

Algorithm name | kick | snare | toms | cymbals | hh
DrumSep model by inagoy (HDemucs, 4 stems) | 12.76 | 11.70 | 11.41 | 19.27 | ---
DrumSep model by aufr33 and jarredou (MDX23C, 6 stems) | 16.47 | 15.13 | 16.89 | 23.18 | 22.32
DrumSep SCNet XL (5 stems) | 19.54 | 17.69 | 20.12 | 23.59 | 22.39
DrumSep SCNet XL (6 stems) | 19.41 | 17.57 | 20.21 | 23.38 | 22.17
DrumSep SCNet XL (4 stems) | 19.29 | 17.52 | 19.44 | 21.54 | ---
DrumSep Mel Band Roformer (4 stems) | 20.27 | 18.62 | 19.63 | 22.74 | ---
DrumSep Mel Band Roformer (6 stems) | 19.16 | 17.77 | 19.71 | 24.94 | 23.23

Whisper (extract text from audio)

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It has several versions; on MVSep we use the largest and most precise one, Whisper large-v3. The Whisper large-v3 model was trained on several million hours of audio. It is a multilingual model and detects the language automatically. To apply the model to your audio, you have 2 options:
1) "Apply to original file" - the Whisper model is applied directly to the file you submit.
2) "Extract vocals first" - the BS Roformer model is applied first to extract vocals. This removes unnecessary noise and can improve Whisper's output.

The original model has some problems with transcription timings. This was fixed by @linto-ai, whose transcription is used by default (option "New timestamped"). You can return to the original timings by choosing the option "Old by whisper".

More info on the model can be found here: https://huggingface.co/openai/whisper-large-v3 and here: https://github.com/openai/whisper
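If you want to reproduce a similar pipeline locally, a minimal sketch using the Hugging Face transformers pipeline for Whisper large-v3 looks roughly like this (note that MVSep's default timestamps come from the @linto-ai implementation, while this sketch uses the stock chunk timestamps):

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "cpu"
)

# "vocals.wav" plays the role of the (optionally vocal-extracted) input file.
result = asr("vocals.wav", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```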


Parakeet (extract text from audio)

Parakeet by NVIDIA is a modern automatic speech recognition (ASR) model designed for accurate and efficient conversion of English speech to text. Unlike Whisper, this model works only with English speech, but it delivers higher-quality results for English. It also generates quite accurate timestamps. Quality metric: WER of 6.03 on the Hugging Face Open ASR Leaderboard.

Model page: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
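For local experiments, the model page shows usage via the NVIDIA NeMo toolkit; a minimal sketch (assuming a recent nemo_toolkit[asr] install) looks roughly like this:

```python
# pip install -U "nemo_toolkit[asr]"   (assumption: a recent NeMo release)
import nemo.collections.asr as nemo_asr

# Load the pretrained Parakeet TDT 0.6B v2 checkpoint from Hugging Face.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")

# Transcribe a 16 kHz mono WAV file with English speech.
outputs = asr_model.transcribe(["speech.wav"])
# Recent NeMo versions return hypothesis objects with a .text field;
# older versions may return plain strings instead.
print(outputs[0].text)
```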


Medley Vox (Multi-singer separation)

Medley Vox is a dataset for testing algorithms that separate multiple singers within a single music track. The authors of Medley Vox also proposed a neural network architecture for separating singers, but unfortunately did not publish the weights. Later, their training process was reproduced by Cyru5, who trained several models and published the weights in the public domain. The trained neural network is now available on MVSep.


MVSep Multichannel BS (vocals, instrumental)

MVSep Multichannel BS is a model for extracting vocals from multichannel audio (5.1, 7.1, etc.). The emphasis is on avoiding format conversion and quality loss: after processing, the model returns multichannel audio in the same format and at the same sample rate as the file sent to the server.


MVSep Male/Female separation

A model for separating male and female voices within a single vocal track. The track should contain only voices, no music.

Quality metrics

Algorithm name | SDR Male | SDR Female | L1_Freq Male | L1_Freq Female
BSRoformer by Sucial (SDR: 6.52) | 6.82 | 6.23 | 40.99 | 40.62
BSRoformer by aufr33 (SDR: 8.18) | 8.47 | 7.89 | 46.65 | 44.73
SCNet XL (SDR: 11.83) | 12.08 | 11.58 | 50.50 | 51.51
MelRoformer (2025.01) (SDR: 13.03) | 13.39 | 12.68 | 57.61 | 56.76

(Metrics measured on the Male/Female validation dataset.)

 


Demucs3 Model (vocals, drums, bass, other)

The Demucs3 algorithm splits a track into 4 stems (bass, drums, vocals, other). It was the winner of the Music Demixing Challenge 2021.

Link: https://github.com/facebookresearch/demucs/tree/v3

Quality table

Algorithm name | SDR Bass | SDR Drums | SDR Other | SDR Vocals | SDR Instrumental | SDR Vocals (Synth) | SDR Instrumental (Synth)
Demucs3 (Model A) | 9.50 | 8.97 | 4.40 | 7.21 | 13.52 | --- | ---
Demucs3 (Model B) | 10.69 | 10.27 | 5.35 | 8.13 | 14.44 | 9.78 | 9.48

(The first five metric columns are measured on the Multisong dataset; the last two on the Synth dataset.)

Note: Model A was trained only on MUSDB18 training data, so its quality is worse than that of Demucs3 Model B. Demucs3 Model A and Model B have the same architecture but different weights.
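The open-source Demucs v3 release can also be run locally. A minimal sketch, under the assumption that Model A and Model B correspond to the public `mdx` (MUSDB18-only) and `mdx_extra` (extra training data) checkpoints of the demucs package:

```python
# pip install demucs
import subprocess

# MUSDB18-only checkpoint (Model A analogue, assumption).
subprocess.run(["demucs", "-n", "mdx", "my_song.mp3"], check=True)

# Checkpoint trained with extra data (Model B analogue, assumption).
subprocess.run(["demucs", "-n", "mdx_extra", "my_song.mp3"], check=True)
```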


Vit Large 23 (vocals, instrum)

VitLarge23 is an experimental model based on Vision Transformers. In terms of metrics, it is slightly inferior to MDX23C, but it may work better in some cases.

Quality table

Algorithm name | SDR Vocals (Multisong) | SDR Instrumental (Multisong) | SDR Vocals (Synth) | SDR Instrumental (Synth) | SDR Vocals (MDX23 Leaderboard)
Vit Large 23 (512px) v1 | 9.78 | 16.09 | 12.33 | 12.03 | 10.47
Vit Large 23 (512px) v2 | 9.90 | 16.20 | 12.38 | 12.08 | ---

MVSep MelBand Roformer (vocals, instrum)

Mel Band Roformer is a model proposed by ByteDance employees for the Sound Demixing Challenge 2023, where they took first place on Leaderboard C. Unfortunately, the model was not made publicly available; the architecture was reproduced from the scientific article by the developer @lucidrains on GitHub. The vocal model was trained from scratch on our internal dataset, and we have not yet been able to match the authors' metrics.

Quality table

Algorithm name | SDR Vocals (Multisong) | SDR Instrumental (Multisong) | SDR Vocals (Synth) | SDR Instrumental (Synth) | SDR Vocals (MDX23 Leaderboard)
Mel Band Roformer v1 (vocals) | 9.07 | --- | 11.76 | --- | ---

LarsNet (kick, snare, cymbals, toms, hihat)

The LarsNet model divides the drum stem into 5 types: 'kick', 'snare', 'cymbals', 'toms', 'hihat'. The model is from this GitHub repository and was trained on the StemGMD dataset. The model has two operating modes. In the first (default) mode, the Demucs4 HT model is applied to the track first, extracting only the drum part; then the LarsNet model is used. If your track consists only of drums, it makes sense to use the second mode, where the LarsNet model is applied directly to the uploaded audio. Unfortunately, the separation quality is subjectively inferior to the DrumSep model.


Stable Audio Open Gen

Audio generation based on a given text prompt. The generation uses the Stable Audio Open 1.0 model. Audio is generated in stereo at a sample rate of 44.1 kHz, with a duration of up to 47 seconds. The quality is quite high. It is better to write prompts in English. A sketch for running the model locally follows the example prompts below.

Example prompts:
1) Sound effects generation: cats meow, lion roar, dog bark
2) Sample generation: 128 BPM tech house drum loop
3) Specific instrument generation: A Coltrane-style jazz solo: fast, chaotic passages (200 BPM), with piercing saxophone screams and sharp dynamic changes
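For local generation, a minimal sketch using the diffusers integration of Stable Audio Open 1.0 (the official stable-audio-tools repository is an alternative; parameter values here are illustrative only) could look like this:

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    prompt="128 BPM tech house drum loop",
    num_inference_steps=100,
    audio_end_in_s=10.0,  # length of the generated clip in seconds
    generator=torch.Generator("cuda").manual_seed(0),
).audios[0]

# The pipeline returns a (channels, samples) tensor at the VAE sampling rate (44.1 kHz).
sf.write("drum_loop.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```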


MVSep MultiSpeaker (MDX23C)

MVSep MultiSpeaker (MDX23C) tries to isolate the loudest voice from all other voices. It uses the MDX23C architecture and is still under development.


Aspiration (by Sucial)

The algorithm separates the "whispering" (aspiration) component of vocals. The model was created by SUC-DriverOld. More details here.

The Aspiration model separates out:
1) Audible breaths
2) Hissing and buzzing of fricative consonants ('s' and 'f')
3) Plosives: voiceless bursts of air produced while singing consonants (like /p/, /t/, /k/).


AudioSR (Super Resolution)

AudioSR (Versatile Audio Super-resolution at Scale) is an algorithm that restores high frequencies. It works on all types of audio (e.g., music, speech, dog barking, rain, ...). It was initially trained on mono audio, so results on stereo can be less stable.

Metric on Super Resolution Checker for Music Leaderboard (Restored): 25.3195
Authors' paper: https://arxiv.org/pdf/2309.07314
Original repository: https://github.com/haoheliu/versatile_audio_super_resolution
Original inference script prepared by @jarredou: https://github.com/jarredou/AudioSR-Colab-Fork
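A rough sketch of running AudioSR locally through its command-line entry point (assuming `pip install audiosr` provides an `audiosr` command with an `-i` input flag, as described in the original repository; check the repository or the Colab fork above for the current options):

```python
import subprocess

# Upsample a single file; output goes to the tool's default output directory.
subprocess.run(["audiosr", "-i", "my_track.wav"], check=True)
```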


FlashSR (Super Resolution)

FlashSR is an audio super resolution algorithm for restoring high frequencies. It is based on the paper "FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation".

Metric on Super Resolution Checker for Music Leaderboard (Restored): 22.1397
Original repository: https://github.com/jakeoneijk/FlashSR_Inference
Inference script by @jarredou: https://github.com/jarredou/FlashSR-Colab-Inference


Matchering (by sergree)

Matchering is a novel tool for audio matching and mastering. It follows a simple idea - you take TWO audio files and feed them into Matchering:

  • TARGET (the track you want to master, you want it to sound like the reference)
  • REFERENCE (another track, like some kind of "wet" popular song, you want your target to sound like it)

This algorithm matches both of these tracks and provides you with a mastered TARGET track that has the same RMS, frequency response (FR), peak amplitude, and stereo width as the REFERENCE track.

It is based on code by @sergree.
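The same processing can be reproduced locally with the matchering Python package by @sergree; a minimal sketch following its README:

```python
# pip install matchering
import matchering as mg

mg.process(
    target="my_track.wav",            # the track you want to master
    reference="reference_track.wav",  # the track whose sound you want to match
    results=[mg.pcm16("my_track_mastered.wav")],
)
```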


SOME (Singing-Oriented MIDI Extractor)

SOME (Singing-Oriented MIDI Extractor) is a MIDI extractor that can convert a singing voice into a MIDI sequence. The model was trained only on Chinese vocals, so it might not work well for other languages.

Original page: https://github.com/openvpi/SOME


VibeVoice (Voice Cloning)

VibeVoice is a model for generating natural conversational dialogues from text, with the ability to use a reference voice for cloning purposes.

Key features:

  • Two models: small and large
  • Up to 90 minutes of generated audio
  • Language support: 2 languages are supported: English (default) and Chinese
  • Voice cloning: ability to upload a reference audio recording

How to use the model

  • The text must be only in English or Chinese; quality is not guaranteed for other languages. The maximum text length is 5000 characters. Avoid special characters.
  • The reference voice recording should be 5 to 15 seconds long. If your track is longer, it will be automatically trimmed at the 15-second mark (a trimming sketch follows this list).
  • The reference track should contain only the voice and nothing else. If there are background sounds or music, use the "Extract vocals first" option.
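If you want to trim the reference recording yourself before uploading, here is a minimal sketch using the soundfile package on a local WAV file (file names are just examples):

```python
import soundfile as sf

MAX_SECONDS = 15  # the server trims anything beyond this point anyway

audio, sample_rate = sf.read("reference_voice.wav")
audio = audio[: MAX_SECONDS * sample_rate]  # keep at most the first 15 seconds
sf.write("reference_voice_trimmed.wav", audio, sample_rate)
```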

How to generate a reference track?

We need phonetic diversity (all sounds of the language) and lively intonation. A text of about 35–40 words, read calmly, takes roughly 15 seconds.

Here are three options in English for different tasks:

Option 1: Universal (Balanced & Clear)

The best choice for general use. Contains complex sound combinations to tune clarity.

"To create a perfect voice clone, the AI needs to hear a full range of phonetic sounds. I am speaking clearly, taking small pauses, and asking: can you hear every detail? This short sample captures the unique texture and tone of my voice."

Option 2: Conversational (Vlog & Social Media)

For voiceovers in videos, YouTube, or blogs. Read vividly, with a smile, changing the pitch of your voice.

"Hey! I’m recording this clip to test how well the new technology works. The secret is to relax and speak exactly like I would to a friend. Do you think the AI can really copy my style and energy in just fifteen seconds?"

Option 3: Professional (Business & Narration)

For presentations, audiobooks, or official announcements. Read confidently, slightly slower, emphasizing word endings.

"Voice synthesis technology is rapidly changing how we communicate in the digital age. It is essential to speak with confidence and precision to ensure high-quality output. This brief recording provides all the necessary data for a professional and accurate digital clone."


Tips for recording:

  1. Pronunciation: Try to articulate word endings clearly (especially t, d, s, ing). Models "love" clear articulation.

  2. Flow: Don't read like a robot. In English, intonation is important: the voice should "float" up and down a bit rather than staying on a single note.

  3. Breathing: If you pause at a comma or period, don't be afraid to take an audible breath. This will add realism to the clone.


VibeVoice (TTS)

VibeVoice (TTS) is a model for generating natural conversational dialogues from text, capable of creating dialogues with up to 4 speakers and durations of up to 90 minutes.

Key Features:

  • Two models: small and large
  • Up to 4 speakers in a single recording
  • Up to 90 minutes of generated audio
  • Language support: officially supports 2 languages: English (default) and Chinese, but it has been verified to work decently for other languages as well.

How to use the model

The text must be in English or Chinese; quality is not guaranteed for other languages. The maximum text length is 5000 characters. Avoid special characters. The text must be formatted specifically to indicate speakers:

Correct format:

Speaker 1: Hello! How are you today?
Speaker 2: I'm doing great, thanks for asking!
Speaker 1: That's wonderful to hear.
Speaker 3: Hey everyone, sorry I'm late!

Incorrect format:

Hello! How are you today?
I'm doing great!

Important:

  • Each line must start with Speaker N: (where N is a number from 1 to 4)
  • Speaker numbering: Speaker 1, Speaker 2, Speaker 3, Speaker 4
  • You can use from 1 to 4 speakers
  • Case does not matter: Speaker 1: = speaker 1: = SPEAKER 1:

If you need a monologue, you do not need to specify a speaker.
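Before submitting a long script, it can be convenient to check the speaker formatting locally. A small sketch in plain Python, mirroring the rules above:

```python
import re

# Matches "Speaker N:" at the start of a line, case-insensitive, N from 1 to 4.
SPEAKER_RE = re.compile(r"^\s*speaker\s*([1-4])\s*:", re.IGNORECASE)

script = """Speaker 1: Welcome to our podcast, everyone!
Speaker 2: Thanks for having us!
speaker 1: Let's start with introductions."""

for number, line in enumerate(script.splitlines(), start=1):
    match = SPEAKER_RE.match(line)
    if match:
        print(f"line {number}: speaker {match.group(1)}")
    else:
        print(f"line {number}: WARNING - missing 'Speaker N:' prefix")
```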

Example scenarios:

Monologue (1 speaker):

Speaker 1: Today I want to talk about artificial intelligence.
Speaker 1: It's changing our world in incredible ways.
Speaker 1: From healthcare to entertainment, AI is everywhere.

Dialogue (2 speakers):

Speaker 1: Have you tried the new restaurant downtown?
Speaker 2: Not yet, but I've heard great things about it!
Speaker 1: We should go there this weekend.
Speaker 2: That sounds like a perfect plan!

Group conversation (3-4 speakers):

Speaker 1: Welcome to our podcast, everyone!
Speaker 2: Thanks for having us!
Speaker 3: It's great to be here.
Speaker 4: I'm excited to share our thoughts today.
Speaker 1: Let's start with introductions.
