Getting Started with DBpedia Spotlight — DBpedia Spotlight API

When to use this API

Use it when you need to extract named entities from unstructured text and link them to a knowledge graph. DBpedia Spotlight takes a prose string, identifies mentions of people, places, and organizations, and resolves each mention to a canonical DBpedia URI derived from Wikipedia. Its less obvious strength is the disambiguation layer: "Apple" in a sentence about quarterly earnings resolves to dbpedia.org/resource/Apple_Inc., not the fruit, because the API scores candidates against the surrounding context. For domain-limited NER tasks (medical records, legal documents), a specialized model will usually outperform it; Spotlight shines when you want Wikipedia-scale general-knowledge linking without standing up your own NLP pipeline.

Linking free text to structured knowledge

"What entities are mentioned in this news summary, and where can I learn more?" /annotate is the full pipeline: spot entity surface forms, map to DBpedia candidates, disambiguate to a single URI per mention. No auth, no key — pass the text and a confidence threshold.

# The path segment after the host is the language model; "en" selects English.
curl -s -H "Accept: application/json" \
  "https://api.dbpedia-spotlight.org/en/annotate?text=Berlin+is+the+capital+of+Germany&confidence=0.5" | head -c 10000

{
  "@text": "Berlin is the capital of Germany",
  "@confidence": "0.5",
  "@policy": "whitelist",
  "Resources": [
    {
      "@URI": "http://dbpedia.org/resource/Berlin",
      "@support": "96203",
      "@types": "Wikidata:Q515,Schema:City,DBpedia:City",
      "@surfaceForm": "Berlin",
      "@offset": "0",
      "@similarityScore": "0.9987",
      "@percentageOfSecondRank": "7.28E-4"
    },
    {
      "@URI": "http://dbpedia.org/resource/Germany",
      "@support": "187027",
      "@types": "Wikidata:Q6256,Schema:Country,DBpedia:Country",
      "@surfaceForm": "Germany",
      "@offset": "25",
      "@similarityScore": "0.9908",
      "@percentageOfSecondRank": "5.55E-3"
    }
  ]
}

The @support value is the number of Wikipedia links pointing at the entity's article: 187,027 for Germany, 96,203 for Berlin. It is the prior-probability backbone of the disambiguation model: entities with higher support tend to win contested matches, which is why a plain mention of "Paris" without any context lands on the French capital rather than Paris, Texas. The @percentageOfSecondRank is the runner-up's score as a fraction of the winner's; 7.28E-4 (about 0.00073) for Berlin means the runner-up reached less than 0.1% of the top score, unambiguous by any measure. (Yes, the probe example is deliberately dull; the real test of this API is on sentences where entity boundaries are unclear or common nouns blur into proper ones.)

The text mentions two entities: Berlin (linked to dbpedia.org/resource/Berlin, typed as a DBpedia City) and Germany (linked to dbpedia.org/resource/Germany, typed as a DBpedia Country). Both were disambiguated with high confidence — the nearest alternative for Berlin scored less than 0.1% of Berlin's own score.
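As a rough illustration of consuming that response, here is a minimal Python sketch using only the standard library. It works on a trimmed copy of the payload shown above rather than calling the API. The helper name parse_resources and the 0.1 ambiguity cut-off are illustrative assumptions, not part of Spotlight; the only thing taken from the response itself is that every value, numbers included, arrives as a string.

```python
import json

# Trimmed copy of the /annotate response shown above.
SAMPLE = json.loads("""
{
  "@text": "Berlin is the capital of Germany",
  "Resources": [
    {"@URI": "http://dbpedia.org/resource/Berlin", "@support": "96203",
     "@surfaceForm": "Berlin", "@offset": "0",
     "@similarityScore": "0.9987", "@percentageOfSecondRank": "7.28E-4"},
    {"@URI": "http://dbpedia.org/resource/Germany", "@support": "187027",
     "@surfaceForm": "Germany", "@offset": "25",
     "@similarityScore": "0.9908", "@percentageOfSecondRank": "5.55E-3"}
  ]
}
""")


def parse_resources(payload, ambiguity_threshold=0.1):
    """Turn the Resources array into dicts with real numeric types.

    ambiguity_threshold is a hypothetical cut-off chosen for this sketch,
    not a Spotlight setting.
    """
    parsed = []
    # Defensive default in case a payload carries no Resources key.
    for res in payload.get("Resources", []):
        second_rank = float(res["@percentageOfSecondRank"])
        parsed.append({
            "uri": res["@URI"],
            "surface_form": res["@surfaceForm"],
            "offset": int(res["@offset"]),
            "support": int(res["@support"]),
            "similarity": float(res["@similarityScore"]),
            "second_rank_ratio": second_rank,
            # Flag links whose runner-up scored close to the winner.
            "ambiguous": second_rank >= ambiguity_threshold,
        })
    return parsed


entities = parse_resources(SAMPLE)
for e in entities:
    print(e["surface_form"], e["uri"], e["second_rank_ratio"], e["ambiguous"])
```

Neither entity in this sample is flagged, which matches the reading above: both runner-up ratios are far below any sensible cut-off.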

Handling ambiguous entity references in text

"The text says 'Apple' — does this mean the company or the fruit?" /candidates runs spotting and candidate mapping but stops short of committing to a single URI. What comes back is the top candidate per surface form along with the prior and contextual scores that a full disambiguation step would have used — making it the right endpoint when you want to audit the API's reasoning before acting on a link.

curl -s -H "Accept: application/json" \
  "https://api.dbpedia-spotlight.org/en/candidates?text=Apple+is+a+technology+company+in+California" | head -c 10000

{
  "annotation": {
    "@text": "Apple is a technology company in California",
    "surfaceForm": [
      {
        "@name": "Apple",
        "@offset": "0",
        "resource": {
          "@label": "Apple Inc.",
          "@uri": "Apple_Inc.",
          "@priorScore": "7.86E-5",
          "@contextualScore": "0.4041",
          "@finalScore": "0.9965",
          "@support": "18834",
          "@percentageOfSecondRank": "0.0028"
        }
      },
      {
        "@name": "California",
        "@offset": "33",
        "resource": {
          "@label": "California",
          "@uri": "California",
          "@priorScore": "8.36E-4",
          "@contextualScore": "0.1728",
          "@finalScore": "0.9999",
          "@support": "200194",
          "@percentageOfSecondRank": "2.70E-5"
        }
      }
    ]
  }
}

The @priorScore for Apple Inc. is 7.86×10⁻⁵, near zero, because without context "Apple" in the Wikipedia corpus heavily favors the fruit. The @finalScore of 0.9965 is what results after the phrase "technology company" shifts the posterior. This prior-vs-final gap is the signal to watch when building a pipeline that needs to know how much to trust a linked entity: a high final score that leans on a stronger prior (California's 8.36×10⁻⁴ is roughly ten times Apple Inc.'s) says less about the context than a high final score that overcame a weak prior.

The word "Apple" in that text refers to Apple Inc. with 99.65% confidence — context ("technology company") overrode a prior that would have favored the fruit. California resolved with near-certainty. Both URIs are relative to dbpedia.org/resource/.
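The prior-vs-final comparison can be made concrete with a short Python sketch over a trimmed copy of the /candidates response above. The function name summarize_candidates and the context_lift measure (orders of magnitude between final and prior score) are assumptions made for illustration; Spotlight does not report such a figure itself.

```python
import json
import math

# Trimmed copy of the /candidates response shown above.
SAMPLE = json.loads("""
{"annotation": {"@text": "Apple is a technology company in California",
  "surfaceForm": [
    {"@name": "Apple", "@offset": "0",
     "resource": {"@label": "Apple Inc.", "@uri": "Apple_Inc.",
                  "@priorScore": "7.86E-5", "@contextualScore": "0.4041",
                  "@finalScore": "0.9965", "@support": "18834"}},
    {"@name": "California", "@offset": "33",
     "resource": {"@label": "California", "@uri": "California",
                  "@priorScore": "8.36E-4", "@contextualScore": "0.1728",
                  "@finalScore": "0.9999", "@support": "200194"}}
  ]}}
""")

# /candidates returns URIs relative to this prefix.
RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def summarize_candidates(payload):
    """Pair each surface form with its prior, final score, and their gap."""
    rows = []
    for sf in payload["annotation"]["surfaceForm"]:
        res = sf["resource"]
        prior = float(res["@priorScore"])
        final = float(res["@finalScore"])
        rows.append({
            "name": sf["@name"],
            "uri": RESOURCE_PREFIX + res["@uri"],
            "prior": prior,
            "final": final,
            # Hypothetical measure: how many orders of magnitude the
            # final score sits above the prior.
            "context_lift": math.log10(final / prior),
        })
    return rows


rows = summarize_candidates(SAMPLE)
for row in rows:
    print(f'{row["name"]}: prior={row["prior"]:.2e} '
          f'final={row["final"]:.4f} lift={row["context_lift"]:.2f}')
```

On this sample the lift for Apple comes out about one order of magnitude larger than for California, which is the gap the paragraph above describes.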

Finding entity mentions without disambiguation overhead

"I just want to know which phrases in this text could be entity names — I'll link them myself." /spot runs only the first stage: it identifies surface forms without mapping them to DBpedia URIs. Faster, and useful when you're piping output to your own NER model downstream.

curl -s -H "Accept: application/json" \
  "https://api.dbpedia-spotlight.org/en/spot?text=London+is+a+major+city+in+Europe" | head -c 10000

{
  "annotation": {
    "@text": "London is a major city in Europe",
    "surfaceForm": [
      { "@name": "London", "@offset": "0" },
      { "@name": "city",   "@offset": "18" },
      { "@name": "Europe", "@offset": "26" }
    ]
  }
}

"city" appears as a spotted surface form alongside "London" and "Europe" — a common noun, not a proper one. Spotlight's spotting phase works by dictionary lookup against Wikipedia anchor texts, and "city" is a Wikipedia anchor somewhere in the corpus, so it gets flagged. This means /spot output is substantially noisier than /annotate output; the disambiguation step that /spot skips is also the step that filters low-confidence common-noun matches. Use /spot only if you need the raw candidate set for downstream processing, not as a fast alternative to /annotate.

The spotted surface forms in that text are "London" (offset 0), "city" (offset 18), and "Europe" (offset 26). Note that "city" is a common noun flagged via Wikipedia anchor lookup — it would likely be filtered out by the full /annotate pipeline.
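If you do hand /spot output to your own linker, a sketch like the following can turn the surface forms into text spans and sanity-check them. It assumes @offset is a character index into @text, which holds for this ASCII sample but is not confirmed here for other input. The capitalization filter is a deliberately crude, English-only heuristic added for illustration; it is not how Spotlight filters matches.

```python
import json

# Copy of the /spot response shown above.
SAMPLE = json.loads("""
{"annotation": {"@text": "London is a major city in Europe",
  "surfaceForm": [
    {"@name": "London", "@offset": "0"},
    {"@name": "city", "@offset": "18"},
    {"@name": "Europe", "@offset": "26"}
  ]}}
""")


def spotted_spans(payload):
    """Return (start, end, name) tuples, checking each offset against the text."""
    annotation = payload["annotation"]
    text = annotation["@text"]
    spans = []
    for sf in annotation["surfaceForm"]:
        start = int(sf["@offset"])  # offsets arrive as strings
        end = start + len(sf["@name"])
        if text[start:end] != sf["@name"]:
            raise ValueError(f"offset {start} does not match {sf['@name']!r}")
        spans.append((start, end, sf["@name"]))
    return spans


def looks_like_proper_noun(name):
    # Hypothetical heuristic: keep forms that start with an uppercase letter.
    # It misfires on sentence-initial common nouns and lowercase brand names.
    return name[:1].isupper()


spans = spotted_spans(SAMPLE)
proper = [name for _, _, name in spans if looks_like_proper_noun(name)]
print(spans)
print(proper)
```

The filter drops "city" here, but only because the sample is so clean; a real pipeline would need something sturdier.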

Pitfalls

The API returns XML by default. Always pass Accept: application/json as a request header, or you'll receive XML with no warning. This isn't prominently documented.
/annotate uses a different response shape from /candidates and /spot. /annotate returns a top-level Resources array; the other two return an annotation wrapper containing a surfaceForm array. A single response parser will break on one or the other.
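Most JSON keys carry an @ prefix (@URI, @surfaceForm, @offset, @support); only the container keys (Resources, annotation, surfaceForm, resource) do not. The prefix resembles the usual XML-attribute-to-JSON mapping rather than JSON-LD, whose keywords are a fixed set such as @id and @type. Expect "@URI", not "uri", and expect every value, numbers included, to arrive as a string.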
The public api.dbpedia-spotlight.org endpoint is shared infrastructure: under load it is slow and occasionally unresponsive. For production use, self-host with the Docker images the project ships.
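One way to absorb the first three pitfalls in a single place is sketched below in Python, using only the standard library. The function names (fetch_json, strip_at, mentions) and the 10-second timeout are assumptions for illustration. fetch_json takes the full request URL as an argument and is defined but not called here, so the example runs offline against small payloads cut down from the responses above.

```python
import json
import urllib.request

# /candidates returns URIs relative to this prefix.
RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def fetch_json(url, timeout=10.0):
    """GET a Spotlight URL with an explicit JSON Accept header.

    The timeout guards against the shared endpoint stalling; 10 s is arbitrary.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def strip_at(value):
    """Recursively remove a leading '@' from dictionary keys."""
    if isinstance(value, dict):
        return {(k[1:] if k.startswith("@") else k): strip_at(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [strip_at(v) for v in value]
    return value


def mentions(payload):
    """Flatten either response shape into dicts with name, offset, and uri."""
    data = strip_at(payload)
    if "Resources" in data:  # /annotate shape
        return [{"name": r["surfaceForm"], "offset": int(r["offset"]),
                 "uri": r["URI"]} for r in data["Resources"]]
    out = []
    # /candidates and /spot shape; /spot entries carry no resource.
    for sf in data.get("annotation", {}).get("surfaceForm", []):
        resource = sf.get("resource")
        uri = RESOURCE_PREFIX + resource["uri"] if resource else None
        out.append({"name": sf["name"], "offset": int(sf["offset"]), "uri": uri})
    return out


# Cut-down payloads in the three shapes shown earlier.
ANNOTATE = {"Resources": [{"@URI": "http://dbpedia.org/resource/Berlin",
                           "@surfaceForm": "Berlin", "@offset": "0"}]}
CANDIDATES = {"annotation": {"surfaceForm": [
    {"@name": "Apple", "@offset": "0", "resource": {"@uri": "Apple_Inc."}}]}}
SPOT = {"annotation": {"surfaceForm": [{"@name": "city", "@offset": "18"}]}}

for sample in (ANNOTATE, CANDIDATES, SPOT):
    print(mentions(sample))
```

Normalizing at the boundary like this keeps the rest of a pipeline free of "@" keys and string-typed numbers, at the cost of hiding which endpoint produced a given record.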

One-line summary for the user

I can identify named entities in free text — people, places, organizations — and link each one to a DBpedia (Wikipedia-backed) URI with a disambiguation confidence score, using an unauthenticated GET.