# Calibration and Fine-Tuning
Social Inference Engine uses per-signal-type temperature scaling to correct systematic overconfidence and underconfidence. Calibration state is updated online — each analyst correction improves the next inference.
## How calibration works
LLMs are systematically miscalibrated: they output high confidence on common patterns regardless of actual accuracy. A model that consistently outputs confidence = 0.92 for churn_risk does not achieve 92% accuracy on that type.
Temperature scaling multiplies the raw logits by a learned scalar T before the softmax. When T < 1.0, the distribution flattens and overconfident outputs are dampened. When T > 1.0, the distribution sharpens and underconfident outputs are boosted.
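A minimal sketch of this scaling, assuming the multiplicative convention described above (the function name and example logits are illustrative, not part of the engine's API):

```python
import math

def scaled_softmax(logits, T):
    """Multiply raw logits by a per-type scalar T, then softmax.

    T < 1.0 flattens the distribution (dampens overconfidence);
    T > 1.0 sharpens it (boosts underconfidence).
    """
    z = [l * T for l in logits]
    m = max(z)                               # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

raw = [2.0, 0.5, -1.0]
damped = scaled_softmax(raw, 0.79)   # churn_risk-style dampening: top prob drops
sharp = scaled_softmax(raw, 1.12)    # trend_to_content-style sharpening: top prob rises
```

With T below 1.0 the winning class keeps less probability mass than at T = 1.0; above 1.0 it keeps more, which is exactly the over/underconfidence correction described above.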
The scalar for each signal type is initialised from the seed dataset and updated online via gradient descent after each analyst correction. One update step takes 6–8 µs and requires no service restart.
## Training workflow
```bash
# The seed dataset is in training/seed_examples.jsonl
# Each line is: {"text": "...", "signal_type": "lead_opportunity", "platform": "reddit"}
# Validate the format
python training/validate_dataset.py --file training/seed_examples.jsonl
```

The seed dataset ships with 107 examples across all 10 signal types. Add your own examples to improve calibration for your specific use case.
```bash
python training/calibrate.py --epochs 5
```

Runs temperature-scaling calibration on the seed dataset and updates training/calibration_state.json. Takes ~30 seconds on 107 examples.
```bash
# Prepare training data for OpenAI fine-tuning
python training/prepare_training_data.py

# Submit fine-tuning job
python training/fine_tune.py --base-model gpt-4o-mini

# Export the model ID to .env
echo "FINE_TUNED_MODEL_ID=ft:gpt-4o-mini:…" >> .env
```
Fine-tuning targets (non-frontier tier): Macro F1 ≥ 0.82 · ECE ≤ 0.05 · False-action rate ≤ 0.08 · Abstention rate 5–15%. The job runs on OpenAI infrastructure and takes 20–60 minutes.
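The ECE target can be checked with the standard binned expected-calibration-error computation; a sketch (the bin count and function name are illustrative):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model (90% confidence, 90% accurate) scores 0.0; a model that says 0.9 but is always right scores 0.1, which would miss the ≤ 0.05 target.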
```bash
# Via API
curl -X POST http://localhost:8000/api/v1/signals/{id}/feedback \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"predicted_type": "feature_request_pattern", "true_type": "churn_risk"}'
```

Each feedback submission triggers one gradient-descent step on the ConfidenceCalibrator. The temperature scalar for the corrected type is updated in memory immediately and flushed to disk. No restart required.
## Current temperature scalars
Calibrated on 107 seed examples · State as of 2026-03-24
| Signal Type | Temperature Scalar | Calibrated | Samples |
|---|---|---|---|
| lead_opportunity | 0.92 | Yes | 18 |
| competitor_weakness | 0.88 | Yes | 14 |
| influencer_amplification | 1.05 | Yes | 9 |
| churn_risk | 0.79 | Yes | 21 |
| misinformation_risk | 0.85 | Yes | 11 |
| support_escalation | 0.83 | Yes | 15 |
| product_confusion | 1.08 | Yes | 8 |
| feature_request_pattern | 0.97 | Yes | 6 |
| launch_moment | 0.94 | Yes | 3 |
| trend_to_content | 1.12 | Yes | 2 |
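As a rough illustration of what these scalars do to a single score, assuming a two-class logit mapping (a simplification; the engine scales the full logit vector, and this helper is not part of its API):

```python
import math

def scale_confidence(p, T):
    """Scale a binary confidence through its logit: sigmoid(T * logit(p))."""
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-T * logit))

# churn_risk (T = 0.79): a raw 0.92 confidence is pulled down toward honesty
calibrated = scale_confidence(0.92, 0.79)

# trend_to_content (T = 1.12): a hesitant 0.60 is pushed up slightly
boosted = scale_confidence(0.60, 1.12)
```

Note the two least-trained rows (launch_moment, trend_to_content) rest on only 3 and 2 samples, so treat their scalars as provisional until more corrections arrive.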