When a scraper delivers a row that passes every schema and validation check, it feels like success. The HTTP response is 200. The JSON is valid. The fields are present. But what if the values inside those fields are entirely fabricated?
This isn’t about broken selectors or shifting page layouts. It’s about a scraper that returns structurally perfect data that is semantically false. The model extracted a rating of 7 on a 5-star scale. It returned a future-dated review. It labeled an unverified review as "verified". And it did all of this while looking absolutely correct at every stage of the pipeline.
The problem isn’t the scraper—it’s the assumption that a clean schema guarantees clean data. It doesn’t.
The Hidden Danger of Structured Output Modes
When you enable a structured output mode in a large language model API (for example, a response_format of type json_schema), the model’s output is constrained to a JSON object that matches your schema. If every field is required and none is declared nullable, the model cannot leave a field empty. So when it is uncertain about a rating, a date, or a status, it doesn’t return null. It invents a value that fits the schema.
This behavior is well-documented. A May 29 Dev.to post by Paul SANTUS titled “LLMs suck at generating large, structured data” explains it clearly: structured output modes help with syntax, not semantics. The model will fill in uncertain fields rather than risk an incomplete response, even if the result is factually incorrect.
The schema demands completeness. The model delivers. But what it delivers isn’t always true.
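One way to ease that pressure is to let the schema itself accept null, so the model has a legitimate way to say "unknown". The sketch below is illustrative only: the schema, its field names, and the make_nullable helper are assumptions, not part of any particular API, and whether a given provider accepts type arrays such as ["integer", "null"] depends on that provider. A nullable field does not guarantee the model will use null; it only makes an honest answer possible.

```python
import copy


def make_nullable(schema: dict) -> dict:
    """Return a copy of a JSON Schema object whose properties also accept null."""
    out = copy.deepcopy(schema)
    for prop in out.get("properties", {}).values():
        declared = prop.get("type")
        if isinstance(declared, str) and declared != "null":
            prop["type"] = [declared, "null"]
        elif isinstance(declared, list) and "null" not in declared:
            prop["type"] = declared + ["null"]
    return out


# Hypothetical review schema; the field names are illustrative.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "rating": {"type": "integer"},
        "review_date": {"type": "string"},
        "verified": {"type": "boolean"},
    },
    "required": ["rating", "review_date", "verified"],
    "additionalProperties": False,
}

# Fields stay required, but each may now be null instead of a guess.
NULLABLE_REVIEW_SCHEMA = make_nullable(REVIEW_SCHEMA)
```

Keeping the fields required while allowing null means a missing value is still an explicit decision rather than an omission, which makes it easy to count downstream.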
How Silent Corruption Slips Through Validation
In a real-world scraping pipeline, the consequences can be severe. Consider these common failure classes that bypass traditional validation:
- Out-of-range values: A scraper extracts a rating of 7 on a 5-star scale because the model misread the context or invented a number.
- Future-dated reviews: A model normalizes a free-text date like "last Tuesday" and returns a date that hasn’t occurred yet.
- False verification flags: The word "verified" appears somewhere on the page, and the model assumes the review was verified—even though no evidence supports it.
- Mismatched counts: The page displays 40 reviews, but the extracted object claims 500. One of those numbers is pure fiction.
- Language inconsistencies: A review labeled as being from the US contains text written in German. The model mapped metadata incorrectly.
In each case, the data looks valid. It’s valid JSON. It has the right fields. But the values are wrong—and the schema check doesn’t catch them because it only validates structure, not truth.
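To make the gap concrete, here is a minimal sketch of a structure-only check applied to a record that exhibits several of the failure classes above. The field names and the passes_shape_check helper are hypothetical stand-ins for whatever schema validator a pipeline actually uses.

```python
from datetime import date, timedelta

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "rating": int,
    "review_date": str,
    "verified": bool,
    "country": str,
    "text": str,
}


def passes_shape_check(record: dict) -> bool:
    """Structure-only check: every required field exists with the expected type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


tomorrow = (date.today() + timedelta(days=1)).isoformat()

bad_record = {
    "rating": 7,              # out of range on a 5-star scale
    "review_date": tomorrow,  # dated in the future
    "verified": True,         # no supporting evidence on the page
    "country": "US",
    "text": "Sehr gutes Produkt, schnelle Lieferung.",  # German text
}

print(passes_shape_check(bad_record))  # True
```

Every field is present and correctly typed, so the record passes, even though four of its five values are suspect.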
A Practical Sanity Gate for Value-Level Validation
To catch these silent failures, you need a second layer of validation: one that checks the values, not just the shape. A 60-line sanity gate—implemented as a post-processing filter—can catch many obvious errors before they reach your database.
Here’s a simplified example of such a gate in Python:
    import re
    from datetime import datetime


    def validate_review_data(data):
        """Return a list of value-level problems found in one extracted review."""
        errors = []

        # Rating must be a number on the 1-5 scale; missing or non-numeric fails.
        rating = data.get("rating")
        is_number = isinstance(rating, (int, float)) and not isinstance(rating, bool)
        if not is_number or not 1 <= rating <= 5:
            errors.append(f"Invalid rating: {rating!r}")

        # Review date must parse as YYYY-MM-DD and must not be in the future.
        review_date_str = data.get("review_date")
        if review_date_str:
            try:
                review_date = datetime.strptime(review_date_str, "%Y-%m-%d")
                if review_date > datetime.now():
                    errors.append(f"Future-dated review: {review_date_str}")
            except (ValueError, TypeError):
                errors.append(f"Invalid date format: {review_date_str}")

        # A verified flag needs supporting evidence captured from the page.
        if data.get("verified", False) and not data.get("verification_evidence"):
            errors.append("Verified flag set without evidence")

        # Crude script check: flags US reviews with no run of three or more
        # ASCII letters (for example Cyrillic or CJK text). It cannot tell
        # German from English; that needs a real language detector.
        text = data.get("text") or ""
        if data.get("country") == "US" and not re.search(r"[a-zA-Z]{3,}", text):
            errors.append("US country but no Latin-script text detected")

        return errors

This gate catches rule violations, such as a rating of 7 or a future date, but it’s not foolproof. A rating of 4 when the true value is 2 will still slip through. The gate enforces hard boundaries; it does not detect subtle inaccuracies. For nuanced validation, you need external data sources, cross-checks, or human review.
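One inexpensive cross-check is to confirm that each extracted number actually appears in the page text it was supposedly extracted from. The following is a rough sketch under stated assumptions: the function names, the field names, and the sample page text are all hypothetical, and the matching is deliberately naive (it does not handle thousands separators or numbers written as words).

```python
import re


def number_appears_in_source(value, source_text: str) -> bool:
    """True if `value` occurs in source_text as a standalone number."""
    # Reject matches that are part of a longer number such as 40 or 4.5.
    pattern = r"(?<![\d.,])" + re.escape(str(value)) + r"(?!\d|[.,]\d)"
    return re.search(pattern, source_text) is not None


def ungrounded_fields(record: dict, source_text: str,
                      numeric_fields=("rating", "review_count")) -> list:
    """List the numeric fields whose values cannot be found in the source text."""
    return [
        field
        for field in numeric_fields
        if record.get(field) is not None
        and not number_appears_in_source(record[field], source_text)
    ]


# Hypothetical page text and extracted record.
page_text = "Rated 4.5 out of 5 based on 40 reviews"
extracted = {"rating": 4.5, "review_count": 500}

print(ungrounded_fields(extracted, page_text))  # ['review_count']
```

A check like this would flag the 40-versus-500 mismatch, but it only narrows the problem: if both 4 and 2 appear somewhere on the page, a wrong rating of 4 is still considered grounded.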
Why Production Scraping Demands More Than Schema Checks
I’ve run scraping pipelines in production for years. Across 2,190 actor runs, the Trustpilot scraper alone has processed nearly 1,000 pages. The sheer volume reveals patterns that controlled experiments miss.
But here’s the caveat: I don’t have precise hallucination rates. Without instrumented, controlled experiments, any percentage claim would be misleading. What I do have is the experience of cleaning up after silent failures—hundreds of times—by hand. The examples above aren’t transcribed incidents; they’re representative classes of errors that I’ve reconstructed to illustrate the problem.
The takeaway? Schema checks are necessary, but not sufficient. They ensure your scraper is receiving and parsing data correctly. But they don’t validate whether that data is true.
Until your pipeline includes value-level validation, your clean JSON might still be dangerously wrong.