---
A few more points that didn't quite fit in my main post:
My citation verifier is not a wrapper around a language model. It is deterministic. It takes one or more identifiers, looks them up in the authoritative registries (Crossref, NCBI eutils, DataCite, arXiv, ADS, WHO IRIS), and then compares the title and author(s) recorded there against the ones in your citation.
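To give a feel for the lookup step, here is a minimal Python sketch against Crossref only. It is an illustration rather than the verifier's actual code: the function names (crossref_url, extract_record, lookup) are made up for this example, and it assumes the public Crossref REST endpoint at api.crossref.org/works/ with its usual "message" envelope containing "title" and "author" fields.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint: the public Crossref REST API "works" route.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the lookup URL for a DOI (slashes are kept, other characters escaped)."""
    return CROSSREF_WORKS + quote(doi.strip())


def extract_record(payload: dict) -> dict:
    """Pull the first title and the author names out of a Crossref-shaped response."""
    msg = payload.get("message", {})
    titles = msg.get("title") or [""]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in msg.get("author", [])
    ]
    return {"title": titles[0], "authors": authors}


def lookup(doi: str, timeout: float = 10.0) -> dict:
    """Fetch one DOI over the network and return its title and authors."""
    with urlopen(crossref_url(doi), timeout=timeout) as resp:
        return extract_record(json.load(resp))
```

The parsing is kept separate from the network call so it can be exercised on saved responses. The other registries would each need their own small adapter producing the same title/authors shape.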
I do normalise the tricky things: HTML markup, Unicode characters, punctuation, letter case, stop words and so on. A similarity score is then calculated from token overlap and edit distance. This is harder than it looks! The biggest difficulty was choosing reasonable thresholds: too sensitive and you flag legitimate variations; too loose and you fail to catch fabrications. I tuned the thresholds on the validation fixture, but I deliberately publish the confidence level the verifier produces rather than claiming a hard pass/fail binary.
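A rough Python sketch of what "normalise, then score" can look like. Everything here is an assumption made for illustration: the stop-word list, the choice of Jaccard for token overlap, Levenshtein for edit distance, and the 50/50 weighting are placeholders, not the verifier's real rules or tuned values.

```python
import html
import re
import unicodedata

# Illustrative stop-word list only; a real list would be longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}


def normalise(text: str) -> list[str]:
    """Reduce a title to lowercase, accent-free, punctuation-free tokens."""
    text = html.unescape(text)                   # &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)         # drop markup such as <i>...</i>
    text = unicodedata.normalize("NFKD", text)   # split accents from base letters
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def token_overlap(a: list[str], b: list[str]) -> float:
    """Jaccard overlap of the two token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to the range 0..1, where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def title_score(claimed: str, registered: str, w_tokens: float = 0.5) -> float:
    """Blend token overlap and edit similarity; the weight is an arbitrary placeholder."""
    ta, tb = normalise(claimed), normalise(registered)
    chars = edit_similarity(" ".join(ta), " ".join(tb))
    return w_tokens * token_overlap(ta, tb) + (1 - w_tokens) * chars
```

Returning the score itself, rather than comparing it to a cut-off inside the function, matches the point above about reporting a confidence level instead of a binary verdict.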
The verifier actually performed worse the first time I ran a blind eval, with 5.3% of real citations flagged as mismatches. The cause was extremely simple: I hadn't allowed for author names recorded with initials first. After fixing that, I drew a new citation set (so the fix couldn't have been tuned to the first test set) and re-ran. That is the result published above, which flags 1.8% of real citations as false positives. I've published both runs and the receipts, not just the latter.
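For illustration, this is the kind of author-name canonicalisation that handles the initials-first case. It is a hypothetical sketch, not the fix I shipped: author_key is an invented name, and it assumes names arrive either as "Family, Given" or with the family name as the last whitespace-separated token.

```python
import re


def author_key(name: str) -> tuple[str, str]:
    """Map a name to (family name, first initial), both case-folded."""
    name = name.strip()
    if "," in name:
        # "Smith, John A." style: family name comes before the comma.
        family, given = (part.strip() for part in name.split(",", 1))
    else:
        parts = name.split()
        if not parts:
            return ("", "")
        # Assumption: the last token is the family name ("J. A. Smith").
        family, given = parts[-1], " ".join(parts[:-1])
    first_letter = re.search(r"[^\W\d_]", given)  # first alphabetic character, if any
    initial = first_letter.group(0) if first_letter else ""
    return (family.casefold(), initial.casefold())
```

With this, "J. A. Smith", "Smith, John A." and "John Smith" all map to the same key. It does not handle multi-word family names written without a comma ("van der Berg") or the "Smith JA" style, which would each need their own rule.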
The web SaaS addresses one of the two potential problems with citation verification: 'real DOI but wrong title', which can be mechanically checked against the underlying system. 'Real article but doesn't support the claim' is far harder. Addressing that requires reading both the claim and the paper, and I'm deliberately not trying to solve that problem. The furthest automation can easily go at that level appears to be something like: 'the abstract of the cited article does not appear to contain any of the concepts in the claim'. Sometimes useful, but easy to overstate.
The web SaaS is closed source because of ongoing hosting and service costs, which the anonymous free tier subsidises.
Yes, I am aware there are other tools that solve different problems: Retraction Watch for withdrawn papers; Unpaywall for open access; Scite for context analysis of citations. However, none directly answers the question behind what Topaz et al. identified as the most common pattern of fabrication: "Is this citation real and correctly attributable to this identifier?"
Areas for ongoing work: addressing the edge cases and expanding the validation corpus. Later, possibly a streaming / batch verifier for large reference lists, or a conservative semantic-layer flag based on abstract-vs-claim concept overlap. Both of those carry significant risks of over-promising, particularly the latter.
Keen to hear thoughts on the project.