Every text unit in Gildea is independently verified against source evidence before being served through the API.

Trust contract

Only verdict=pass data is served. A text unit that fails verification is excluded from API responses, so you never see unverified or failed text units. For sibling text units with mixed results, if 3 or more sibling argument sentences or claims pass, the siblings with review or fail verdicts are pruned. Thesis and summary text is always served complete.

Reading the verification trail

When you fetch GET /v1/signals/{id}, each unit carries a position_in_source field: its position within the parent in the original article (1-indexed, ascending). Gaps in the sequence are deliberate. If an argument’s surviving sentences are at positions 1, 3, and 5, the sentences at 2 and 4 didn’t make the verified set. The field is named position_in_source rather than display_order because the integer records where the unit sat in the source, not a contiguous render order. An agent reconstructing the author’s argument flow gets both the order and a count of what was filtered out between surviving units, in a single field.
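As a minimal sketch of reading those gaps, the snippet below sorts units by position_in_source and lists the positions that are absent. The unit dictionaries and their text key are hypothetical stand-ins for a response; only position_in_source comes from the description above.

```python
def pruned_positions(positions):
    """Return source positions missing between 1 and the highest surviving one."""
    present = set(positions)
    if not present:
        return []
    return [p for p in range(1, max(present) + 1) if p not in present]


# Hypothetical units under one argument; "text" is an illustrative key.
units = [
    {"position_in_source": 5, "text": "Fifth sentence."},
    {"position_in_source": 1, "text": "First sentence."},
    {"position_in_source": 3, "text": "Third sentence."},
]

# Restore the author's order, then look for gaps.
ordered = sorted(units, key=lambda u: u["position_in_source"])
missing = pruned_positions(u["position_in_source"] for u in ordered)
print(missing)  # [2, 4]
```

Units pruned after the last surviving position cannot be detected this way, because nothing in the sequence marks where the parent originally ended.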

How verification works

Each text unit goes through three layers of verification:

1. Evidence matching

Every text unit is paired with verbatim evidence passages from the source article. Evidence retrieval combines LLM extraction with dense semantic search to find the best supporting passages.

2. Factual consistency scoring

Each text-evidence pair is scored using one of two modes, depending on the text unit type:
| Mode | Used for | What it measures |
| --- | --- | --- |
| nli | Claims | Strict entailment: does the evidence logically entail the claim? Claims make specific factual assertions, so they need a stricter standard. |
| regression | Sentences | Factual consistency: is the sentence consistent with the evidence? Sentences summarize or interpret, so the standard is whether they’re faithful to the source, not whether they’re directly entailed. |
The two modes use different scoring methods and different pass thresholds. This is why you should not compare raw scores across modes — a 0.75 in NLI and a 0.75 in regression don’t mean the same thing. Use the verdict as the primary trust signal.
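One way to follow this advice on the client side is to gate on the verdict and, if scores are inspected at all, keep them grouped by scoring_mode. The sketch below assumes verification objects shaped like the compact example on this page; the sample scores are made up for illustration.

```python
from collections import defaultdict


def is_trusted(verification):
    """Treat the verdict, not the raw score, as the trust signal."""
    return verification.get("final_verdict") == "pass"


def scores_by_mode(verifications):
    """Group primary scores by scoring_mode so they are only compared like with like."""
    grouped = defaultdict(list)
    for v in verifications:
        grouped[v["scoring_mode"]].append(v["primary_score"])
    return dict(grouped)


# Illustrative values only; not real API output.
sample = [
    {"final_verdict": "pass", "scoring_mode": "nli", "primary_score": 0.75},
    {"final_verdict": "pass", "scoring_mode": "regression", "primary_score": 0.75},
    {"final_verdict": "pass", "scoring_mode": "nli", "primary_score": 0.91},
]

grouped = scores_by_mode(sample)
print(grouped)
```

The two 0.75 values land in separate buckets, which keeps any ranking or averaging within a single mode.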

3. Deterministic safety checks

A battery of rule-based checks runs on every pair, regardless of score. These catch failure modes that neural scoring alone can miss — contradictions, quantity mismatches, entity anchoring failures, epistemic drift, and more. Any failed check prevents a pass verdict until the issue is resolved — either through better evidence retrieval or human review.
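To make the idea concrete, here is a toy quantity-mismatch check. It is not Gildea's implementation: the function names, the regular expression, the threshold value, and the use of a greater-or-equal comparison are all assumptions chosen for illustration. It shows only the general pattern of a rule that can block a pass regardless of the neural score.

```python
import re

# Matches integers and simple decimals; a deliberately crude, hypothetical rule.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def quantity_mismatch(text, evidence):
    """Return numbers that appear in the text unit but not in its evidence."""
    evidence_numbers = set(NUMBER.findall(evidence))
    return [n for n in NUMBER.findall(text) if n not in evidence_numbers]


def pass_allowed(score, threshold, failed_checks):
    """A pass needs a sufficient score AND no failed deterministic checks."""
    return score >= threshold and not failed_checks


claim = "Revenue grew 12% in 2024."
evidence = "The company reported that revenue grew 21% in 2024."

failed = quantity_mismatch(claim, evidence)
print(failed)                           # ['12']
print(pass_allowed(0.9, 0.5, failed))   # False
```

Even with a high hypothetical score of 0.9, the unmatched figure keeps the pair from passing, which mirrors the behaviour described above.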

Human review

Text units that pass automated verification are served directly. Units that fall into a gray zone are flagged for human review, which covers:
  • Verdict override: reviewers can approve or reject, with an audit trail
  • Evidence correction: reviewers can correct evidence passages, which triggers automatic re-scoring
  • Theme review: themes are reviewed and corrected
The was_human_reviewed field in verification metadata indicates whether a human reviewed the text unit.

Verification detail levels

Default (compact)

Every text unit includes compact verification:
{
  "verification": {
    "final_verdict": "pass",
    "verification_path": "auto_pass",
    "was_human_reviewed": false,
    "scoring_mode": "nli",
    "primary_score": 0.7421
  }
}
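A client might load this compact object into a small typed structure. The sketch below uses the field names from the example above; the CompactVerification class and parse_compact helper are hypothetical names, not part of any Gildea SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactVerification:
    final_verdict: str
    verification_path: str
    was_human_reviewed: bool
    scoring_mode: str
    primary_score: float


def parse_compact(payload):
    """Build a CompactVerification from a dict holding a 'verification' object."""
    v = payload["verification"]
    return CompactVerification(
        final_verdict=v["final_verdict"],
        verification_path=v["verification_path"],
        was_human_reviewed=v["was_human_reviewed"],
        scoring_mode=v["scoring_mode"],
        primary_score=v["primary_score"],
    )


raw = """
{
  "verification": {
    "final_verdict": "pass",
    "verification_path": "auto_pass",
    "was_human_reviewed": false,
    "scoring_mode": "nli",
    "primary_score": 0.7421
  }
}
"""

compact = parse_compact(json.loads(raw))
print(compact.final_verdict, compact.scoring_mode, compact.primary_score)
```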

Full detail

Add verification_detail=full for complete metadata including scoring breakdown, reason codes, and threshold policy:
| Field | Description |
| --- | --- |
| scoring_method | The scoring pipeline used (versioned for reproducibility) |
| scoring_components_ran | Which scoring components were executed |
| scores.primary_score | The main factual consistency score (higher = more consistent with evidence) |
| scores.entailment_forward | Does the evidence support the claim/sentence? (0–1) |
| scores.contradiction | Does the evidence contradict the claim/sentence? (0–1, lower is better) |
| score_computed | Whether each component produced a score (useful for debugging) |
| reason_codes | Why this verdict was reached (e.g. HUMAN_OVERRIDE, RESCORED_AFTER_EDIT, AUTO_PASS) |
| thresholds.threshold_set_id | Which scoring policy was applied (policies are versioned) |
{
  "verification": {
    "final_verdict": "pass",
    "verification_path": "human_override_pass",
    "was_human_reviewed": true,
    "reason_codes": ["HUMAN_OVERRIDE", "RESCORED_AFTER_EDIT"],
    "scoring_mode": "nli",
    "scoring_method": "alignscore-large-nli-v1-cascade",
    "scoring_components_ran": ["entailment_forward", "contradiction"],
    "scores": {
      "primary_score_name": "entailment_forward",
      "primary_score": 0.7421,
      "entailment_forward": 0.7421,
      "contradiction": 0.0412
    },
    "score_computed": {
      "entailment_forward": true,
      "contradiction": true
    },
    "thresholds": {
      "threshold_set_id": "stage7_policy_2026-02-11",
      "values": {}
    }
  }
}
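The sketch below shows one way to request and read full detail, under several assumptions: the base URL is a placeholder, the signal ID is invented, and verification_detail is treated as a query-string parameter. It also assumes that primary_score_name names a key inside scores, which matches the example above but is not stated as a guarantee.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder host, not the real endpoint


def signal_url(signal_id, full_detail=False):
    """Build the GET /v1/signals/{id} URL, optionally asking for full detail."""
    url = f"{BASE_URL}/v1/signals/{signal_id}"
    if full_detail:
        url += "?" + urlencode({"verification_detail": "full"})
    return url


def primary_score(verification):
    """Look up the primary score through primary_score_name (assumed to be a key)."""
    scores = verification["scores"]
    return scores[scores["primary_score_name"]]


def components_without_score(verification):
    """Components that ran but did not report a computed score."""
    computed = verification.get("score_computed", {})
    ran = verification.get("scoring_components_ran", [])
    return [name for name in ran if not computed.get(name, False)]


# Subset of the full-detail example above.
verification = {
    "scoring_components_ran": ["entailment_forward", "contradiction"],
    "scores": {
        "primary_score_name": "entailment_forward",
        "primary_score": 0.7421,
        "entailment_forward": 0.7421,
        "contradiction": 0.0412,
    },
    "score_computed": {"entailment_forward": True, "contradiction": True},
}

print(signal_url("sig_123", full_detail=True))
print(primary_score(verification))
print(components_without_score(verification))
```

An empty list from components_without_score means every component that ran also produced a score, which is the case for the example object.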

Verification paths

| Path | Description |
| --- | --- |
| auto_pass | Passed all automated checks (score + deterministic) |
| human_override_pass | Flagged by automation, approved by human reviewer |
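For auditing, a client could tally how many served units arrived by each path. This is a minimal sketch: the helper names are hypothetical, and the fallback for unrecognised paths is a defensive assumption in case other values exist beyond the two listed here.

```python
from collections import Counter

# Descriptions copied from the list of verification paths above.
PATH_DESCRIPTIONS = {
    "auto_pass": "Passed all automated checks (score + deterministic)",
    "human_override_pass": "Flagged by automation, approved by human reviewer",
}


def describe_path(path):
    """Return the documented description, or a fallback for unknown paths."""
    return PATH_DESCRIPTIONS.get(path, "Unknown verification path")


def path_counts(verifications):
    """Count verification objects by their verification_path."""
    return Counter(v["verification_path"] for v in verifications)


# Illustrative verification objects, not real API output.
sample = [
    {"verification_path": "auto_pass"},
    {"verification_path": "auto_pass"},
    {"verification_path": "human_override_pass"},
]

for path, count in path_counts(sample).items():
    print(f"{path}: {count} ({describe_path(path)})")
```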