
Add CAM++ speaker verification/embedding model #3

Merged: BrandonWeng merged 4 commits into FluidInference:main from hamzaq2000:cam++-coreml on Sep 23, 2025

Conversation

@hamzaq2000 (Contributor)

CAM++ is an efficient speaker embedding model that I use in my diarization pipeline, Senko.

This PR adds a conversion script for it from torch to CoreML, as well as a test script to verify correctness and benchmark inference speed vs torch.
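A correctness check of this kind typically compares the torch and CoreML embeddings for the same audio via cosine similarity. A minimal numpy sketch (the vectors below are made up for illustration; the real test script would feed actual model outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings standing in for the torch and CoreML outputs;
# after a faithful conversion they should agree almost exactly.
torch_emb = np.array([0.10, 0.50, -0.30, 0.80])
coreml_emb = np.array([0.1001, 0.4999, -0.3002, 0.7998])
assert cosine_similarity(torch_emb, coreml_emb) > 0.999
```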

@@ -0,0 +1,210 @@
# https://github.com/modelscope/3D-Speaker/tree/main/speakerlab/models/campplus

Should we add a dependency here instead of copying the model?

Any specific changes you had to make to keep it CoreML-friendly?

@hamzaq2000 (Contributor, Author)

That would be cumbersome, imo.
Yeah, I did put comments in camplusplus_coreml.py for the changes that were made.

@BrandonWeng (Member)

It looks like they might not have a pip package, so copying is fine as well. Next time you could even just build off a cloned repo if that's easier.

Though it would be good to include the commit hash you worked off, for future reference.

@hamzaq2000 (Contributor, Author)

Ok, added the commit hash.


input_type = ct.TensorType(
name="input_features",
shape=(BATCH_SIZE, FIXED_FRAMES, FEATURE_DIM),


Maybe make BATCH_SIZE a RangeDim, so that inference can be done with a dynamic batch size.

@hamzaq2000 (Contributor, Author) commented Sep 22, 2025

Dynamic batch size hurts performance from my testing.

coreml_model.output_description["embeddings"] = f"Speaker embeddings: ({BATCH_SIZE}, {EMBEDDING_DIM})"

# Save the model
output_path = "./models/camplusplus_batch16.mlpackage"


nit: replace 16 with BATCH_SIZE

@hamzaq2000 (Contributor, Author)

Oh good catch, I'll add that.
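The fix amounts to interpolating the constant into the path instead of hardcoding the number:

```python
BATCH_SIZE = 16

# Derive the filename from the constant so the path stays correct
# if BATCH_SIZE ever changes.
output_path = f"./models/camplusplus_batch{BATCH_SIZE}.mlpackage"
assert output_path == "./models/camplusplus_batch16.mlpackage"
```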


warnings.filterwarnings('ignore')

def extract_fbank_features(waveform, sample_rate=16000):

Not for this PR: is it possible to move this into CoreML as a separate model? Otherwise, anyone who wants to implement this in Swift has to implement the fbank computation in Swift too.

@hamzaq2000 (Contributor, Author) commented Sep 22, 2025

In my Senko pipeline, I do fbank extraction efficiently in C++. So perhaps I can link that in here? Not sure if that should be part of Mobius though; what do you think?
I kept this test script in pure Python just as an example of how to use the CoreML model, not for production deployment.

@BrandonWeng (Member)

Yeah, please link Senko. It would be useful to show how the model could be used.

I think what Bharat is asking is whether it's possible to freeze the fbank operations into a CoreML model as well. Could be beneficial for Senko too, so you can strip out the C++ code.

But I would say it's optional, not a blocker for this PR. Fbank is probably simple enough to vibe code in Swift, and from what we've seen, FFT/STFT operations don't benefit much from being in CoreML; it might be faster to use Accelerate via Swift.

@hamzaq2000 (Contributor, Author)

Ok, linked Senko.

Very interesting; didn't know FFT/STFT operations aren't sped up much by CoreML. I have looked at Accelerate in the past. I think optimizing Fbank extraction using that will be my next optimization target. If I end up doing that, I'll create another PR to contribute that here as well.
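For anyone weighing the Swift/Accelerate route, the core fbank computation is fairly small. A toy numpy sketch (illustrative only: the Kaldi-style fbank features CAM++ actually expects also involve dithering, pre-emphasis, and different windowing/framing details):

```python
import numpy as np

def log_mel_fbank(waveform, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
    """Toy log-mel filterbank extractor: frame, FFT, mel-warp, log."""
    # Frame the signal with a Hann window and take the power spectrum.
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop:i * hop + n_fft] * np.hanning(n_fft)
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2

    # Triangular mel filterbank, equally spaced on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    return np.log(spec @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)

# One second of noise at 16 kHz -> 98 frames of 80 mel bins.
feats = log_mel_fbank(np.random.randn(16000).astype(np.float32))
assert feats.shape == (98, 80)
```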


@Bharat0091

Thanks for this PR @hamzaq2000. I gave some high-level comments; once we resolve these, I will review in detail.

@BrandonWeng (Member) left a review; his inline comments appear above.
@hamzaq2000 (Contributor, Author) commented Sep 23, 2025

Ok, added the 3D-Speaker commit hash.

Linked the Senko fbank_extractor C++ code as well, for production deployment.

Wanted to ask: since the 3D-Speaker code will be part of the repo, shall I also create a THIRD_PARTY_LICENSES file in the root of the repo, with 3D-Speaker credit and the license text? Or if that's too much, I can make convert.py clone the 3D-Speaker repo and use the CAM++ model definition from that.

@BrandonWeng (Member) commented Sep 23, 2025

Good question - a copy of their license in the root of models/emb/cam++/ should be sufficient. The idea is that each folder is its own isolated environment/copy

@hamzaq2000 (Contributor, Author)

Great, added the attribution and license text in models/emb/cam++/THIRD_PARTY_LICENSES.

@BrandonWeng BrandonWeng merged commit 7a92545 into FluidInference:main Sep 23, 2025