Backends¶
tobira supports multiple inference backends through the BackendProtocol abstraction. Choose based on your accuracy, latency, and resource requirements.
Recommended Base Model¶
For new deployments, we recommend DeBERTa-v3 over BERT. DeBERTa-v3 uses disentangled attention and an enhanced mask decoder, achieving higher classification accuracy at comparable model size.
| Model | ID | Parameters | Languages | Notes |
|---|---|---|---|---|
| mDeBERTa-v3 (recommended) | microsoft/mdeberta-v3-base |
86M | Multilingual | Best for mixed-language email |
| DeBERTa-v3 Japanese | ku-nlp/deberta-v3-base-japanese |
86M | Japanese | Japanese-only environments |
| BERT Japanese v3 (legacy) | tohoku-nlp/bert-base-japanese-v3 |
111M | Japanese | Current default for backward compatibility |
The BERT/ONNX backends accept any HuggingFace model name via the model_name configuration field. No code changes are needed to switch models — just update your tobira.toml.
Migrating from BERT to DeBERTa-v3¶
-
Update
model_namein yourtobira.toml: -
If using ONNX, re-export and re-quantize:
-
Run
tobira doctorto verify the new model loads correctly.
Note
Existing fine-tuned BERT models continue to work. The default model_name in code remains tohoku-nlp/bert-base-japanese-v3 for backward compatibility.
Comparison¶
| Backend | Model Size | Latency | Hardware | Accuracy |
|---|---|---|---|---|
| FastText | ~10 MB | ~1 ms | CPU only | Medium |
| ONNX | ~110 MB (quantized) | ~30 ms | CPU only | High |
| BERT/DeBERTa | ~340-440 MB | ~200 ms | GPU recommended | High |
| Ollama | 1-70 GB | ~500 ms (GPU) | GPU recommended | High |
| LLM API | Remote | ~300 ms | Network | Highest |
| Ensemble | Varies | Varies | Varies | Highest |
| Two-Stage | Combined | ~1-30 ms | CPU | High |
FastText¶
Lightweight n-gram model. Best for high-throughput environments where speed matters most.
When to use: First-stage filter, resource-constrained environments, very high email volumes.
ONNX¶
Quantized model running on ONNX Runtime. Best balance of accuracy and speed on CPU. Works with both BERT and DeBERTa-v3 models.
[backend]
type = "onnx"
model_path = "/var/lib/tobira/model_int8.onnx"
model_name = "microsoft/mdeberta-v3-base" # recommended; legacy: tohoku-nlp/bert-base-japanese-v3
When to use: Production deployments on CPU-only servers, best accuracy-to-cost ratio.
BERT / DeBERTa (PyTorch)¶
Full Transformer model via HuggingFace Transformers. Highest accuracy for fine-tuned models. Supports any AutoModelForSequenceClassification-compatible model including BERT and DeBERTa-v3.
[backend]
type = "bert"
model_name = "microsoft/mdeberta-v3-base" # recommended; legacy: tohoku-nlp/bert-base-japanese-v3
device = "cuda"
When to use: GPU-equipped servers, when maximum fine-tuned accuracy is needed.
Ollama¶
Local LLM inference via Ollama. Supports any Ollama-compatible model.
When to use: When you want LLM-level understanding without external API calls.
LLM API¶
Cloud LLM via OpenAI-compatible API. Works with OpenAI, Anthropic (via proxy), and other providers.
[backend]
type = "llm_api"
model = "gpt-4o-mini"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."
timeout = 30
When to use: Highest accuracy needed, acceptable latency and cost, data privacy allows cloud processing.
Ensemble¶
Combines multiple backends using weighted average or majority vote.
[backend]
type = "ensemble"
strategy = "weighted_average"
weights = [0.3, 0.7]
[[backend.backends]]
type = "fasttext"
model_path = "/var/lib/tobira/fasttext-spam.bin"
[[backend.backends]]
type = "onnx"
model_path = "/var/lib/tobira/model_int8.onnx"
When to use: Maximum accuracy through model agreement, critical environments.
Two-Stage Filter¶
Routes emails through a fast first-stage model, sending only uncertain results to a precise second-stage model. Reduces load on the expensive model by 80-90%.
[backend]
type = "two_stage"
[backend.first_stage]
type = "fasttext"
model_path = "/var/lib/tobira/fasttext-spam.bin"
[backend.second_stage]
type = "onnx"
model_path = "/var/lib/tobira/model_int8.onnx"
[backend.grey_zone]
low = 0.3
high = 0.7
When to use: High-volume environments where you want both speed and accuracy.
The grey zone defines the score range where the first-stage result is uncertain:
- Score < 0.3 (low): Classified as ham by first stage, no second stage needed
- Score > 0.7 (high): Classified as spam by first stage, no second stage needed
- 0.3 <= Score <= 0.7: Grey zone, forwarded to second stage for precise classification