How CallShield uses Mistral AI models for real-time phone scam detection.
| Voxtral Mini | Mistral Large | |
|---|---|---|
| Model ID | voxtral-mini-latest |
mistral-large-latest |
| Purpose | Native audio analysis | Text transcript analysis |
| Input type | Raw audio bytes (base64-encoded) | Transcript text |
| API method | Raw urllib POST to https://api.mistral.ai/v1/chat/completions |
Mistral SDK (mistralai Python package) |
| Unique detections | Vocal stress patterns, background noise anomalies, robocall/IVR audio signatures, speech cadence irregularities, silence gaps | Semantic intent analysis, social engineering language patterns, known scam script matching, information extraction requests |
Both models return structured JSON and contribute to a combined scam score: 60% audio weight (Voxtral Mini) + 40% text weight (Mistral Large).
| Mode | Calls per interaction |
|---|---|
| Audio upload | 1 Voxtral call per uploaded file |
| Streaming | 1 Voxtral call per chunk (chunks arrive via WebSocket, max 512 KB each, max 60 chunks per session) |
Voxtral Mini is called via a raw urllib POST request executed through asyncio's thread-pool executor (loop.run_in_executor). This avoids SDK limitations with multimodal audio payloads.
POST https://api.mistral.ai/v1/chat/completions
Authorization: Bearer {MISTRAL_API_KEY}
Content-Type: application/json
| Parameter | Value |
|---|---|
model |
voxtral-mini-latest |
temperature |
0.3 |
top_p |
0.9 |
response_format |
{"type": "json_object"} |
timeout |
120 seconds |
During live streaming analysis, chunk scores are combined using exponential weighting to prioritize the most recent audio while retaining context:
combined = 0.7 * current_chunk_score + 0.3 * previous_cumulative_score
This ensures the score responds quickly to new scam indicators without being thrown off by a single anomalous chunk.
| Scenario | Expected latency |
|---|---|
| Single chunk (streaming) | 1.5 -- 2.8 seconds (warm instance) / up to 4.5s on cold start |
| Full file upload (short clip) | 2 -- 4 seconds |
| Scenario | Approximate cost |
|---|---|
| 1 chunk (~5s of audio) | ~$0.002--$0.004 |
| 1 minute of streaming (~12 chunks) | ~$0.024--$0.048 |
| Single file upload (1 min) | ~$0.004--$0.008 |
Costs are approximate and depend on Mistral AI pricing at time of use.
Mistral Large is called once per transcript when the user submits a text transcript via the Paste Transcript tab. It analyzes the textual content of the conversation for social engineering patterns, known scam scripts, and semantic red flags that complement the audio-level analysis from Voxtral Mini.
Mistral Large is called via the official Mistral SDK (mistralai Python package):
from mistralai import Mistral
client = Mistral(api_key=MISTRAL_API_KEY)
response = client.chat.complete(
model="mistral-large-latest",
messages=[...],
response_format={"type": "json_object"},
)| Parameter | Value |
|---|---|
model |
mistral-large-latest |
response_format |
{"type": "json_object"} |
| Scenario | Approximate cost |
|---|---|
| Single transcript (short call, ~1 min) | ~$0.003--$0.006 |
| Single transcript (longer call, ~5 min) | ~$0.010--$0.020 |
A traditional scam-detection pipeline requires two separate API calls: first a speech-to-text service, then an LLM for analysis. Voxtral Mini collapses this into a single call with native audio understanding.
| Traditional STT + LLM Pipeline | Voxtral Native Audio (CallShield) | |
|---|---|---|
| API calls required | 2 (STT service + LLM) | 1 (Voxtral Mini) |
| Audio signal | Vocal cues lost after transcription | Full audio signal preserved and analyzed |
| Detectable signals | Text semantics only | Text semantics + vocal stress, pace, background noise, robocall audio signatures |
| Typical latency | ~5--8 seconds (STT latency + LLM latency) | ~2--4 seconds (single model inference) |
| Pipeline complexity | Higher -- must orchestrate two services, handle intermediate transcript format | Lower -- single request, single response |
| Cost | Two billable API calls | One billable API call |
| Failure modes | Transcription errors cascade into incorrect analysis | No transcription step to introduce errors |
The native audio approach is strictly superior for scam detection because vocal patterns (urgency in tone, robotic cadence, background call-center noise) are first-class signals rather than artifacts lost in transcription.
CallShield uses seven specific prompt engineering techniques to maximize detection accuracy while minimizing false positives.
Every prompt begins with an explicit instruction to avoid over-flagging legitimate calls:
"You are a phone scam detection expert. Be precise and conservative in your assessments. Legitimate calls from banks, doctors, family members, and businesses are common -- do NOT flag normal conversations as scams. Only raise the scam score when concrete indicators are present."
This preamble significantly reduces false positives on benign calls such as appointment reminders, family check-ins, and routine business calls.
The model scores each call across seven independent dimensions, each on a 0.0--1.0 scale:
| Dimension | What it measures |
|---|---|
| Urgency | Artificial time pressure ("act now", "limited time", "your account will be closed") |
| Authority Impersonation | Claims to be IRS, police, bank, government, tech support |
| Information Extraction | Requests for SSN, credit card, bank account, passwords, PINs |
| Emotional Manipulation | Fear, guilt, excitement, or romantic manipulation tactics |
| Vocal Patterns | Robotic speech, pre-recorded audio, unnatural cadence, call-center background noise |
| Known Scam Scripts | Matches to known scam templates (IRS, tech support, grandparent, romance, lottery) |
| Robocall/IVR | Automated phone tree systems, "press 1" prompts, pre-recorded menu-driven flows |
The seven dimension scores are aggregated into a single overall scam score.
The prompt includes explicit calibration ranges so the model produces consistent, interpretable scores:
| Score range | Classification | Description |
|---|---|---|
| 0.0 -- 0.2 | Normal | Routine conversation with no suspicious indicators |
| 0.2 -- 0.4 | Minor suspicious elements | One or two mild indicators that could be coincidental |
| 0.4 -- 0.6 | Concerning | Multiple indicators present; warrants caution |
| 0.6 -- 0.8 | Strong scam indicators | Clear pattern of manipulation or impersonation |
| 0.8 -- 1.0 | Clear scam | Textbook scam with multiple strong indicators |
Each analysis returns a severity level tied to the overall score:
| Severity | Triggered when | User-facing meaning |
|---|---|---|
| low | Score < 0.4 | Call appears normal or has only minor oddities |
| medium | Score 0.4 -- 0.7 | Caution advised; some scam indicators detected |
| high | Score > 0.7 | Likely scam; user should consider ending the call |
Audio chunks that are effectively silent are detected before being sent to Voxtral Mini:
- RMS threshold: An RMS amplitude value below 500 classifies the chunk as silence.
- Silent chunks are skipped rather than sent to the model, saving cost and avoiding meaningless analysis.
- This prevents dead air or hold music from being misinterpreted as suspicious.
The prompt includes three calibration examples so the model understands the expected scoring scale:
| Example scenario | Expected score | Rationale |
|---|---|---|
| IRS impersonation scam -- "This is the IRS. You owe back taxes and a warrant has been issued for your arrest. Press 1 to speak to an agent immediately." | 0.95 | Authority impersonation (IRS) + urgency (arrest warrant) + known scam script + robocall/IVR pattern |
| Love letter from a partner -- "Hey babe, just wanted to say I love you and I hope you have a great day at work. Call me tonight!" | 0.0 | No scam indicators whatsoever; entirely normal personal call |
| Medicare robocall -- "Hello, this is an important message about your Medicare benefits. Your coverage may be changing. Press 1 to speak with a benefits coordinator." | 0.75 | Authority impersonation (Medicare) + robocall/IVR pattern + known scam script, but lower urgency than the IRS example |
Both models are required to return responses in a strict JSON schema via the json_object response format. This ensures deterministic parsing and consistent downstream processing:
{
"scam_score": 0.72,
"severity": "high",
"confidence": 0.85,
"indicators": [
"Authority impersonation: caller claims to be from the IRS",
"Urgency: threatens immediate arrest",
"Information extraction: requests Social Security number"
],
"recommendation": "This call shows strong indicators of a scam. Do not provide personal information.",
"dimensions": {
"urgency": 0.9,
"authority_impersonation": 0.95,
"information_extraction": 0.8,
"emotional_manipulation": 0.6,
"vocal_patterns": 0.5,
"known_scam_scripts": 0.9,
"robocall_ivr": 0.3
}
}Estimated token consumption per operation:
| Operation | Approximate tokens |
|---|---|
| Audio upload (single file) | ~2,400 tokens |
| Stream chunk (single) | ~2,000 tokens |
| 1-minute stream (~12 chunks) | ~25,000 tokens |
| Transcript analysis (Mistral Large) | ~1,500 tokens |
Typical full analysis (1-minute streaming call with transcript): ~25,000 tokens (Voxtral streaming) + ~1,500 tokens (Mistral Large transcript) = ~26,500 tokens total.
CallShield combines two Mistral AI models to provide layered scam detection:
- Voxtral Mini analyzes raw audio natively -- capturing vocal cues, robocall signatures, and speech patterns that would be lost in transcription.
- Mistral Large analyzes the text transcript -- catching semantic manipulation, known scam scripts, and social engineering language.
- The two scores are combined with 60/40 audio/text weighting to produce a final verdict that is more accurate than either model alone.