Every text unit in Gildea is independently verified against source evidence before being served through the API.
Trust contract
Only verdict=pass data is served. If a text unit fails verification, it’s excluded from API responses. You never see unverified or failed text units.
Sibling text units can have mixed results. If 3 or more sibling argument sentences or claims pass, the siblings with a review or fail verdict are pruned. Thesis and summary text is always served complete.
Reading the verification trail
When you fetch GET /v1/signals/{id}, each unit carries a position_in_source field: its position within the parent in the original article (1-indexed, ascending). Gaps in the sequence are deliberate. If an argument’s surviving sentences are at positions 1, 3, and 5, the sentences at positions 2 and 4 did not make the verified set.
This is position_in_source rather than display_order precisely because the integer reflects ground truth about the source, not a contiguous render order. An agent reconstructing the author’s argument flow gets both the order and a count of what was filtered out, in a single field.
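As a rough illustration of reading those gaps, the sketch below restores source order and lists the positions missing between 1 and the highest surviving position. It assumes each unit has already been parsed into a dict with a position_in_source key; the helper names and the sample sentences are hypothetical. Note that a gap-based count cannot see units pruned after the last surviving position.

```python
def in_source_order(units):
    """Sort units by their position in the original article."""
    return sorted(units, key=lambda u: u["position_in_source"])


def pruned_positions(units):
    """Return positions missing between 1 and the highest surviving position.

    Units pruned after the last surviving position are not detectable
    from the gaps alone.
    """
    present = {u["position_in_source"] for u in units}
    if not present:
        return []
    return [p for p in range(1, max(present) + 1) if p not in present]


# Hypothetical sample data: three surviving sentences of one argument.
sentences = [
    {"text": "Third surviving sentence.", "position_in_source": 5},
    {"text": "First surviving sentence.", "position_in_source": 1},
    {"text": "Second surviving sentence.", "position_in_source": 3},
]

ordered = in_source_order(sentences)
missing = pruned_positions(sentences)
```

Here `missing` holds 2 and 4, matching the example above, and `ordered` lists the sentences at positions 1, 3, and 5.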
How verification works
Each text unit goes through three layers of verification:
1. Evidence matching
Every text unit is paired with verbatim evidence passages from the source article. Evidence retrieval combines LLM extraction with dense semantic search to find the best supporting passages.
2. Factual consistency scoring
Each text-evidence pair is scored using one of two modes, depending on the text unit type:
| Mode | Used for | What it measures |
|---|---|---|
| nli | Claims | Strict entailment — does the evidence logically entail the claim? Claims make specific factual assertions, so they need a stricter standard. |
| regression | Sentences | Factual consistency — is the sentence consistent with the evidence? Sentences summarize or interpret, so the standard is whether they’re faithful to the source, not whether they’re directly entailed. |
The two modes use different scoring methods and different pass thresholds. This is why you should not compare raw scores across modes — a 0.75 in NLI and a 0.75 in regression don’t mean the same thing. Use the verdict as the primary trust signal.
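One way to avoid mixing scores across modes by accident is to bucket them by scoring_mode before doing any comparison. The following is a minimal sketch, assuming units are dicts carrying the compact verification object; the function name and the sample values are illustrative only.

```python
from collections import defaultdict


def scores_by_mode(units):
    """Group primary scores by scoring mode so modes are never compared directly."""
    grouped = defaultdict(list)
    for unit in units:
        verification = unit["verification"]
        grouped[verification["scoring_mode"]].append(verification["primary_score"])
    return dict(grouped)


# Hypothetical sample data: the same raw score appears in both modes.
units = [
    {"verification": {"final_verdict": "pass", "scoring_mode": "nli", "primary_score": 0.75}},
    {"verification": {"final_verdict": "pass", "scoring_mode": "regression", "primary_score": 0.75}},
    {"verification": {"final_verdict": "pass", "scoring_mode": "nli", "primary_score": 0.81}},
]

grouped = scores_by_mode(units)
```

Any ranking or statistics would then be computed within a single bucket, with final_verdict remaining the primary trust signal.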
3. Deterministic safety checks
A battery of rule-based checks runs on every pair, regardless of score. These catch failure modes that neural scoring alone can miss — contradictions, quantity mismatches, entity anchoring failures, epistemic drift, and more.
Any failed check prevents a pass verdict until the issue is resolved — either through better evidence retrieval or human review.
Human review
Text units that pass automated verification are served directly. Units that fall into a gray zone are flagged for human review:
- Verdict override — reviewers can approve or reject with an audit trail
- Evidence correction — reviewers can correct evidence passages, triggering automatic re-scoring
- Theme review — themes are reviewed and corrected
The was_human_reviewed field in verification metadata indicates whether a human reviewed the text unit.
Verification detail levels
Default (compact)
Every text unit includes compact verification:
{
  "verification": {
    "final_verdict": "pass",
    "verification_path": "auto_pass",
    "was_human_reviewed": false,
    "scoring_mode": "nli",
    "primary_score": 0.7421
  }
}
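A client might load the compact object into a small typed record. The sketch below assumes the unit arrives as JSON shaped like the sample above; the CompactVerification class and parse_compact function are hypothetical names, not part of any Gildea SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactVerification:
    final_verdict: str
    verification_path: str
    was_human_reviewed: bool
    scoring_mode: str
    primary_score: float


def parse_compact(unit: dict) -> CompactVerification:
    """Build a typed record from a unit's compact verification object."""
    v = unit["verification"]
    return CompactVerification(
        final_verdict=v["final_verdict"],
        verification_path=v["verification_path"],
        was_human_reviewed=v["was_human_reviewed"],
        scoring_mode=v["scoring_mode"],
        primary_score=v["primary_score"],
    )


# The compact sample from this page, as a JSON string.
raw = """
{
  "verification": {
    "final_verdict": "pass",
    "verification_path": "auto_pass",
    "was_human_reviewed": false,
    "scoring_mode": "nli",
    "primary_score": 0.7421
  }
}
"""

parsed = parse_compact(json.loads(raw))
```

A frozen dataclass is used here so the verification record cannot be modified after parsing; a plain dict would work just as well.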
Full detail
Add verification_detail=full for complete metadata including scoring breakdown, reason codes, and threshold policy:
| Field | Description |
|---|---|
| scoring_method | The scoring pipeline used (versioned for reproducibility) |
| scoring_components_ran | Which scoring components were executed |
| scores.primary_score | The main factual consistency score (higher = more consistent with evidence) |
| scores.entailment_forward | Does the evidence support the claim/sentence? (0–1) |
| scores.contradiction | Does the evidence contradict the claim/sentence? (0–1, lower is better) |
| score_computed | Whether each component produced a score (useful for debugging) |
| reason_codes | Why this verdict was reached (e.g. HUMAN_OVERRIDE, RESCORED_AFTER_EDIT, AUTO_PASS) |
| thresholds.threshold_set_id | Which scoring policy was applied (policies are versioned) |
{
  "verification": {
    "final_verdict": "pass",
    "verification_path": "human_override_pass",
    "was_human_reviewed": true,
    "reason_codes": ["HUMAN_OVERRIDE", "RESCORED_AFTER_EDIT"],
    "scoring_mode": "nli",
    "scoring_method": "alignscore-large-nli-v1-cascade",
    "scoring_components_ran": ["entailment_forward", "contradiction"],
    "scores": {
      "primary_score_name": "entailment_forward",
      "primary_score": 0.7421,
      "entailment_forward": 0.7421,
      "contradiction": 0.0412
    },
    "score_computed": {
      "entailment_forward": true,
      "contradiction": true
    },
    "thresholds": {
      "threshold_set_id": "stage7_policy_2026-02-11",
      "values": {}
    }
  }
}
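For debugging or audit logging, the full-detail object could be condensed along the lines of the sketch below. It assumes the payload shape shown above. The helper names are hypothetical, and treating "ran but produced no score" as worth flagging is an interpretation of score_computed, not documented behavior.

```python
def uncomputed_components(verification):
    """List components that ran but did not report a computed score (assumed meaning)."""
    ran = verification.get("scoring_components_ran", [])
    computed = verification.get("score_computed", {})
    return [name for name in ran if not computed.get(name, False)]


def audit_line(verification):
    """Condense verdict, path, reason codes, and policy id into one line."""
    verdict = verification["final_verdict"]
    path = verification["verification_path"]
    codes = ", ".join(verification.get("reason_codes", [])) or "none"
    policy = verification.get("thresholds", {}).get("threshold_set_id", "unknown")
    return f"{verdict} via {path} (reasons: {codes}; policy: {policy})"


# The full-detail sample from this page, as a Python dict.
sample = {
    "final_verdict": "pass",
    "verification_path": "human_override_pass",
    "was_human_reviewed": True,
    "reason_codes": ["HUMAN_OVERRIDE", "RESCORED_AFTER_EDIT"],
    "scoring_mode": "nli",
    "scoring_method": "alignscore-large-nli-v1-cascade",
    "scoring_components_ran": ["entailment_forward", "contradiction"],
    "scores": {
        "primary_score_name": "entailment_forward",
        "primary_score": 0.7421,
        "entailment_forward": 0.7421,
        "contradiction": 0.0412,
    },
    "score_computed": {"entailment_forward": True, "contradiction": True},
    "thresholds": {"threshold_set_id": "stage7_policy_2026-02-11", "values": {}},
}

summary = audit_line(sample)
not_computed = uncomputed_components(sample)
```

Both helpers use .get with defaults so that they also tolerate the compact form, where the full-detail fields are absent.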
Verification paths
| Path | Description |
|---|---|
| auto_pass | Passed all automated checks (score + deterministic) |
| human_override_pass | Flagged by automation, approved by human reviewer |