Add CAM++ speaker verification/embedding model #3
BrandonWeng merged 4 commits into FluidInference:main
@@ -0,0 +1,210 @@
# https://github.com/modelscope/3D-Speaker/tree/main/speakerlab/models/campplus
Should we add a dependency here instead of copying the model?
Were there any specific changes you had to make to keep it CoreML-friendly?
That would be cumbersome, imo.
Yeah, I put comments in camplusplus_coreml.py describing the changes that were made.
It looks like they might not have a pip package, so copying is fine as well. Next time you could even just build off a cloned repo if that's easier.
It would be good to include the commit hash you worked from, though, for future reference.
Ok, added the commit hash.
input_type = ct.TensorType(
    name="input_features",
    shape=(BATCH_SIZE, FIXED_FRAMES, FEATURE_DIM),
Maybe make BATCH_SIZE a RangeDim, so that inference can be done with a dynamic batch size.
Dynamic batch size hurts performance from my testing.
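With a fixed batch dimension, the caller has to pad the final partial batch and discard the padded outputs afterwards. The following is only a minimal sketch of that bookkeeping in plain Python; `fixed_batches` is a hypothetical helper, and `BATCH_SIZE = 16` is an assumption inferred from the `batch16` suffix of the saved model name, not something confirmed in this thread.

```python
from typing import Any, Iterator, List, Sequence, Tuple

BATCH_SIZE = 16  # assumption: inferred from "camplusplus_batch16.mlpackage"


def fixed_batches(
    items: Sequence[Any], pad_item: Any, batch_size: int = BATCH_SIZE
) -> Iterator[Tuple[List[Any], int]]:
    """Yield (batch, num_valid) pairs; every batch has exactly batch_size items."""
    for start in range(0, len(items), batch_size):
        chunk = list(items[start:start + batch_size])
        num_valid = len(chunk)
        # Pad the last chunk so the model always sees the same input shape.
        chunk.extend([pad_item] * (batch_size - num_valid))
        yield chunk, num_valid


# Stand-in for 37 feature segments; real inputs would be fbank feature arrays.
segments = [f"segment_{i}" for i in range(37)]
batches = list(fixed_batches(segments, pad_item=None))
valid_counts = [num_valid for _, num_valid in batches]
```

After inference, only the first `num_valid` embeddings of each batch would be kept; the rest correspond to padding.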
models/emb/cam++/coreml/convert.py
Outdated
coreml_model.output_description["embeddings"] = f"Speaker embeddings: ({BATCH_SIZE}, {EMBEDDING_DIM})"

# Save the model
output_path = "./models/camplusplus_batch16.mlpackage"
Oh good catch, I'll add that.
warnings.filterwarnings('ignore')

def extract_fbank_features(waveform, sample_rate=16000):
Not for this PR, but is it possible to move this into CoreML as a separate model? Otherwise, anyone who wants to use this from Swift has to implement the fbank computation in Swift.
In my Senko pipeline, I do fbank extraction efficiently in C++. So perhaps I can link that in here? Not sure if that should be part of Mobius though; what do you think?
I kept this test script in pure Python just as an example of how to use the CoreML model; it is not meant for production deployment.
Yeah, please link Senko. It would be useful to show how the model could be used.
I think what Bharat is asking is whether it's possible to freeze the fbank operations into a CoreML model as well. That could be beneficial for Senko too, since you could strip out the C++ code.
But I would say it's optional, not a blocker for the PR. Fbank is probably simple enough to vibe code in Swift, and from what we've seen, FFT/STFT operations don't benefit much from being in CoreML; it might be faster to use Accelerate via Swift.
Ok, linked Senko.
Very interesting; didn't know FFT/STFT operations aren't sped up much by CoreML. I have looked at Accelerate in the past. I think optimizing Fbank extraction using that will be my next optimization target. If I end up doing that, I'll create another PR to contribute that here as well.
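For readers unfamiliar with what the fbank step involves, here is a rough, standard-library-only sketch of log mel filterbank features for a single frame. It is not the PR's `extract_fbank_features` and not Kaldi-compatible: the frame size, bin count, Hz-domain triangle slopes, and the lack of pre-emphasis and dithering are all simplifying assumptions, and every function name here is hypothetical.

```python
import cmath
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_filterbank(num_bins: int, n_fft: int, sample_rate: int) -> List[List[float]]:
    """Triangular filters with edges evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # num_bins + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(mel_max * i / (num_bins + 1)) for i in range(num_bins + 2)]
    filters = []
    for m in range(1, num_bins + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            freq = k * sample_rate / n_fft
            if left < freq <= center:
                row.append((freq - left) / (center - left))
            elif center < freq < right:
                row.append((right - freq) / (right - center))
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def power_spectrum(frame: List[float]) -> List[float]:
    """Naive O(n^2) DFT of a Hamming-windowed frame; written for clarity, not speed."""
    n = len(frame)
    windowed = [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_fbank_frame(frame: List[float], filters: List[List[float]], floor: float = 1e-10) -> List[float]:
    """Apply each mel filter to the power spectrum and take the log energy."""
    power = power_spectrum(frame)
    return [math.log(max(sum(w * p for w, p in zip(row, power)), floor)) for row in filters]


SAMPLE_RATE = 16000  # matches the default in extract_fbank_features
N_FFT = 256          # assumption, kept small so the naive DFT stays fast
NUM_BINS = 8         # assumption, far fewer bins than a real front end would use

filters = mel_filterbank(NUM_BINS, N_FFT, SAMPLE_RATE)
tone = [math.sin(2.0 * math.pi * 1000.0 * i / SAMPLE_RATE) for i in range(N_FFT)]
features = log_fbank_frame(tone, filters)
```

In a production port (C++, or Swift with Accelerate as suggested above), the naive DFT would be replaced by an FFT routine and the filterbank matrix would be precomputed once.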
Thanks for this PR, @hamzaq2000. I gave some high-level comments. Once we resolve these, I will review in detail.
Ok, added the 3D-Speaker commit hash. Linked Senko fbank_extractor C++ code as well, for production deployment. Wanted to ask, since 3D-Speaker code will be part of the repo, shall I also create a
Good question - a copy of their license in the root of
Great, added the attribution and license text in
CAM++ is an efficient speaker embedding model that I use in my diarization pipeline, Senko.
This PR adds a conversion script for it from torch to CoreML, as well as a test script to verify correctness and benchmark inference speed vs torch.
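As an illustration of what "verify correctness and benchmark inference speed" can look like, here is a small sketch using only the standard library. It is not the PR's test script: the helper names are hypothetical and the embedding values are made-up stand-ins for one PyTorch output and one CoreML output.

```python
import math
import time
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def mean_latency_ms(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> float:
    """Average wall-clock time per call, after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iters


# Made-up stand-ins: one embedding from the reference model, one from the converted model.
reference = [0.12, -0.53, 0.88, 0.07]
converted = [0.1201, -0.5299, 0.8798, 0.0702]

similarity = cosine_similarity(reference, converted)
worst_error = max_abs_diff(reference, converted)
# In a real benchmark, the lambda would wrap a model's prediction call instead.
latency = mean_latency_ms(lambda: cosine_similarity(reference, converted))
```

Cosine similarity is a natural metric here because speaker verification typically compares embeddings by cosine score, so small element-wise drift matters less than a change in direction.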