MongoDB’s schema flexibility accelerates application development, but it frequently derails data pipelines. Engineers who move documents from MongoDB into Delta Lake using Apache Spark soon discover that a single malformed record can halt an entire batch, corrupt table consistency, and trigger late-night debugging sessions. After rebuilding this exact pipeline more than 10 times, one engineer decided to build a solution that should have existed from the start: a lightweight bridge that turns fragile ETL into a predictable process.
Why MongoDB breaks ETL pipelines
MongoDB’s “schema-free” design lets each document define its own structure, which is ideal for application code but hazardous for pipelines. Three recurring issues consistently break ingest jobs.
Polymorphic fields that confuse Spark
Some collections mix data types within the same field. A status field might store a string in older records and an integer in newer ones, or a value field may appear as a number, boolean, or nested object depending on which application wrote it. When Spark samples a subset of documents to infer the schema, it locks in a single type for each field. Any document that violates that type triggers a runtime error:
AnalysisException: Cannot cast StringType to IntegerType
Engineers typically work around this by forcing every field to StringType, which removes type safety from the raw Delta table and forces downstream jobs to re-cast values repeatedly.
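For illustration, here is a minimal PySpark sketch of that defensive pattern; the sample rows and column names are stand-ins, not the original pipeline:

# A minimal sketch of the defensive workaround described above: force every
# column to StringType so no single document can break the batch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("defensive-cast").getOrCreate()

# Stand-in for documents read from MongoDB; 'status' is polymorphic upstream.
df = spark.createDataFrame(
    [("a1", "active"), ("a2", "3")],
    ["_id", "status"],
)

# Cast every column to string: the batch survives, but type safety is gone
# and every downstream job has to re-cast values.
stringly_typed = df.select([col(c).cast("string").alias(c) for c in df.columns])
stringly_typed.show()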
Nested structs that drift across batches
Arrays of structs often contain inconsistent subfields. A nested address object may include a zipCode in some records but omit it in others, or a metadata field may appear or disappear between versions. Each job ends up rebuilding every struct manually to coerce missing fields to null and drop unexpected ones. The same boilerplate recurs across collections:
from pyspark.sql.functions import coalesce, col, lit, struct

def rebuild_struct(df, field, schema):
    # Rebuild a nested struct column against a target schema: each expected
    # subfield is kept (null-filled and cast as needed); anything else is dropped.
    return df.withColumn(
        field,
        struct([
            coalesce(col(f"{field}.{f}"), lit(None).cast(t)).alias(f)
            for f, t in schema.items()
        ])
    )

Engineers repeat this pattern for every nested object, a tedious process that still risks silent data loss.
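As a brief usage sketch, assuming a hypothetical address struct and a DataFrame df that already contains it:

# Hypothetical target shape for the nested 'address' struct; type names are
# anything Column.cast() accepts.
address_schema = {"street": "string", "city": "string", "zipCode": "string"}

# Documents whose address lacks a zipCode get a typed null instead of breaking
# the batch, provided the inferred batch schema still includes the field.
df = rebuild_struct(df, "address", address_schema)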
Silent failures that surface too late
When pipelines don’t crash outright, malformed documents are coerced or dropped without audit trails. Problems only appear hours or days later, when downstream jobs fail or produce incorrect aggregates. Teams lack a clear contract defining which fields must exist and which types they should hold, so errors propagate unnoticed until it’s too late.
Why observability tools fall short
Tools like Elementary provide table-level alerts when freshness or schema drift occurs. They confirm that something went wrong but don’t pinpoint the exact document responsible. The typical triage workflow reads like a detective novel:
- An alert fires on a failed freshness check
- An engineer scans the Spark logs for a cast error
- The engineer traces the error back to MongoDB
- Isolating the bad document among 100,000 records takes hours
- Even after finding it, fixing the schema mismatch requires manual casting or schema evolution, which may not be possible mid-batch
The entire process is manual, error-prone, and delays writes while the rest of the batch waits in limbo.
A two-step bridge to safe ingestion
The open-source library nosql-delta-bridge replaces fragile scripts with a repeatable, document-level validation process. It operates in two stages: schema inference from trusted historical data, followed by validated ingestion that routes bad documents to a dead-letter queue.
Step 1: Infer a schema contract from trusted samples
Instead of letting Spark infer the schema from a random sample, the bridge builds a contract from documents you already trust. A single command reads a sample of known-good records and generates a schema file:
bridge infer historical.json --output payments.schema.json

The inference engine resolves type conflicts using a configurable strategy: by default, it selects the widest type and marks fields as nullable.
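The exact resolution rules are the library's concern; purely as an illustration of the widest-type idea, a toy resolver might look like the following, where the promotion order is an assumption, not the library's actual table:

# Illustrative only, not nosql-delta-bridge's code: resolve a field observed
# with several types to the widest one; anything unreconcilable falls back to string.
WIDENING_ORDER = ["boolean", "long", "double", "string"]

def widest_type(observed_types):
    ranks = [WIDENING_ORDER.index(t) for t in observed_types if t in WIDENING_ORDER]
    return WIDENING_ORDER[max(ranks)] if ranks else "string"

print(widest_type({"long", "double"}))            # double
print(widest_type({"long", "double", "string"}))  # string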
Step 2: Ingest with validation and automatic DLQ
The ingestion step validates each document against the schema contract. Valid records land in Delta Lake; invalid ones go to a dead-letter queue with detailed rejection reasons:
bridge ingest incoming.json ./delta/payments \
  --schema payments.schema.json \
  --dlq rejected.ndjson

The output reports counts and reasons in real time:
incoming.json · 1,000 documents · schema: payments.schema.json
written: 994 → delta/payments
rejected: 6 → rejected.ndjson
Each rejected document carries metadata explaining why it failed:
{
  "_id": "abc123",
  "amount": "99.90",
  "_dlq_reason": "cast failed on 'amount': expected double, got string",
  "_dlq_stage": "coerce",
  "_dlq_ts": "2025-04-28T14:32:01Z"
}

No log archaeology. No manual hunting. The problem is isolated and labeled at ingestion time.
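Under the hood this is a validate-and-route pattern. Below is a minimal sketch of that pattern in plain Python, assuming a toy contract rather than the real payments.schema.json format; only the DLQ metadata field names mirror the example above:

import json
from datetime import datetime, timezone

# Toy contract for the sketch: field -> (expected type, required). This is an
# assumption, not the library's schema file format.
CONTRACT = {"_id": (str, True), "amount": (float, True)}

def validate(doc):
    # Return (document, None) if it satisfies the contract, else (None, reason).
    out = dict(doc)
    for field, (typ, required) in CONTRACT.items():
        if field not in doc:
            if required:
                return None, f"missing required field '{field}'"
            continue
        value = doc[field]
        if typ is float and isinstance(value, int) and not isinstance(value, bool):
            out[field] = float(value)  # safe widening, e.g. 99 -> 99.0
        elif not isinstance(value, typ):
            return None, (f"cast failed on '{field}': expected {typ.__name__}, "
                          f"got {type(value).__name__}")
    return out, None

valid, rejected = [], []
for doc in [{"_id": "ok1", "amount": 42}, {"_id": "abc123", "amount": "99.90"}]:
    coerced, reason = validate(doc)
    if reason is None:
        valid.append(coerced)          # would be written to the Delta table
    else:
        doc.update({"_dlq_reason": reason, "_dlq_stage": "coerce",
                    "_dlq_ts": datetime.now(timezone.utc).isoformat()})
        rejected.append(doc)

# Rejected documents land in an NDJSON dead-letter file with their reasons.
with open("rejected.ndjson", "a") as dlq:
    dlq.writelines(json.dumps(d) + "\n" for d in rejected)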
Scenarios handled out of the box
- Field type mismatch (castable) → Document written after coercion
- Field type mismatch (not castable) → Document routed to DLQ with reason
- Missing required field → Document rejected with clear error
- New field not in schema → Configurable: reject or evolve schema
- Full type migration (all docs changed type) → Zero written, all routed to DLQ with warning
- Nested struct with missing subfield → Filled with null, document written
- Array of mixed types → Configurable: cast to widest type or reject
Why Python instead of Spark
The MongoDB Connector for Apache Spark is the standard choice, but it requires a cluster, which is overkill for teams managing smaller MongoDB collections. nosql-delta-bridge uses delta-rs, the Rust implementation of the Delta Lake protocol with Python bindings, so it runs locally, in a Docker container, or on a small VM. Engineers can clone the repository and run the examples in minutes.
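To make the cluster-free claim concrete, here is a rough sketch of a local write with delta-rs's Python bindings (the deltalake package); the records and table path are placeholders:

import pandas as pd
from deltalake import write_deltalake

# e.g. the documents that passed validation in the sketch above
records = pd.DataFrame([{"_id": "ok1", "amount": 42.0}])

# Append to a local Delta table; no Spark cluster required.
write_deltalake("./delta/payments", records, mode="append")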
For teams already running large-scale Spark workloads, the library’s design allows schema inference and coercion logic to be reused independently, making it a modular addition to existing pipelines.
Where this fits in your data stack
This tool targets the ingestion layer, where schema drift and silent failures often originate. It complements downstream observability solutions by isolating problems at the source, reducing alert fatigue and shortening incident response times. If your pipeline currently relies on defensive casting, manual struct rebuilding, or late-stage debugging, a lightweight bridge to Delta Lake can turn fragile jobs into reliable, auditable processes.
The next time a document’s unexpected shape threatens to break your batch, the solution won’t be a midnight log crawl—it’ll be a clear rejection reason and a path forward.