How a Local AI Agent Learned to Monitor Your Cat Door

A simple iPhone app that once watched toast is now learning to track a cat’s comings and goings—all with a single multimodal AI model running locally. The experiment, centered on Google’s Gemma 4, exposes the gap between a working demo and a reliable, general-purpose watcher.

From Toast to Cats: The Promise of a Single Watcher

OIC, short for Oh, I See, began as a narrow tool that alerted users when their toast turned golden. Its architecture—a camera loop, a lightweight vision model, and a notification flow—was purpose-built for one task. The developer behind OIC wanted to know if that same loop could handle different scenarios with minimal changes. Could a small multimodal model like Gemma 4 replace a patchwork of dedicated detectors?

The next logical step was tracking a household pet. Cats that roam outdoors often return on their own, but owners still waste time searching when the cat is already home. A local visual watcher that logs a cat’s exits and entries could eliminate unnecessary worry. The goal was to repurpose OIC’s core logic for this new use case without relying on cloud services.

The Core Loop That Never Changes

Every watcher, whether for toast or a cat door, follows the same four-step loop:

Capture a scene using the phone’s camera.
Analyze the frame to detect meaningful events.
Decide if an event occurred based on user-defined rules.
Update the app’s state and notify the user if necessary.
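The four steps above can be sketched as a single function that takes each stage as a parameter. This is a minimal illustration, not OIC's actual code: the names `WatcherState`, `run_watcher_step`, `fake_capture`, `fake_analyze`, and `cat_door_rule` are hypothetical, and the capture and analysis stages are stubs standing in for the camera and the model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class WatcherState:
    """Hypothetical app state: logged events and queued notifications."""
    events: List[str] = field(default_factory=list)
    notifications: List[str] = field(default_factory=list)


def run_watcher_step(
    capture: Callable[[], bytes],
    analyze: Callable[[bytes], str],
    decide: Callable[[str], Optional[str]],
    state: WatcherState,
) -> WatcherState:
    """Run one pass of the capture -> analyze -> decide -> update loop."""
    frame = capture()                  # 1. capture a scene
    observation = analyze(frame)       # 2. analyze the frame
    event = decide(observation)        # 3. decide whether an event occurred
    if event is not None:              # 4. update state and notify
        state.events.append(event)
        state.notifications.append(f"Watcher event: {event}")
    return state


def fake_capture() -> bytes:
    # Stub standing in for the phone camera.
    return b"frame-bytes"


def fake_analyze(frame: bytes) -> str:
    # Stub standing in for the vision model's description of the frame.
    return "cat entering"


def cat_door_rule(observation: str) -> Optional[str]:
    # Illustrative user-defined rule for the cat-door scenario.
    if "cat" in observation and "entering" in observation:
        return "cat_entered"
    if "cat" in observation and "leaving" in observation:
        return "cat_left"
    return None


state = run_watcher_step(fake_capture, fake_analyze, cat_door_rule, WatcherState())
```

Because every stage is passed in, swapping the toast rule for the cat-door rule changes one argument and leaves the loop itself untouched.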

What varies is not the loop itself but the what and the how of detection. A toaster watcher looks for browning bread, while a cat-door watcher distinguishes between feline silhouettes entering or leaving. The developer reasoned that if a compact model like Gemma 4 could follow natural-language instructions, OIC could shift from a collection of single-purpose tools to a flexible agent.
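One way to picture the shift from single-purpose tools to a flexible agent is to treat each watcher as data: a natural-language instruction plus the set of answers the app accepts. The sketch below is an assumption about how that could look. `WatcherSpec`, its fields, and both prompt texts are invented for illustration and are not taken from OIC.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WatcherSpec:
    """Hypothetical description of one watcher, expressed as data."""
    name: str
    instruction: str          # natural-language task given to the model
    labels: Tuple[str, ...]   # the only answers the app will accept

    def build_prompt(self) -> str:
        # Constrain the reply so the decision step can match it exactly.
        options = ", ".join(self.labels)
        return f"{self.instruction}\nAnswer with exactly one of: {options}."


TOAST = WatcherSpec(
    name="toast",
    instruction="Look at the bread in the toaster and judge how browned it is.",
    labels=("not_ready", "golden", "burnt"),
)

CAT_DOOR = WatcherSpec(
    name="cat_door",
    instruction="Look at the cat door and say whether a cat is passing through it.",
    labels=("no_cat", "cat_entering", "cat_leaving"),
)

prompt = CAT_DOOR.build_prompt()
```

Under this design, adding a third watcher means writing a new `WatcherSpec`, not a new detector.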

What Already Worked Before Gemma

Before integrating Gemma 4, OIC was already a functional app with a fully operational toast watcher. The developer had:

A working iPhone app with an active camera feed.
A pipeline that processed frames to detect toast readiness.
A notification system to alert the user when the toast was ready.

This baseline proved the architecture was sound. It also set a high bar: the new Gemma-powered watcher had to match the reliability of the existing toast detector without sacrificing local-first principles.

The Gaps That Remained After the Experiment

The goal was to extend OIC’s architecture to support multiple watchers, starting with the cat-door scenario. The developer got the runtime and the surrounding plumbing in place, but two critical milestones remained out of reach:

1. End-to-End Multimodal Inference on Device

The developer successfully integrated a local Gemma GGUF runtime using llama.cpp on iOS. The model loaded, and the app’s plumbing (file handling, session tracing, and state management) was in place. However, the final step, passing a camera frame into the model and receiving a structured interpretation, was never verified end to end. Without verified inference results, the cat-door watcher remained a prototype.
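To make "structured interpretation" concrete, here is a hedged sketch of the validation a watcher might apply to the model's text reply before trusting it. The JSON schema (`event`, `confidence`), the `ALLOWED_EVENTS` set, and the function name are assumptions made for illustration. The article does not describe the actual output format, and no llama.cpp calls are shown.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical label set for the cat-door watcher.
ALLOWED_EVENTS = {"no_cat", "cat_entering", "cat_leaving"}


@dataclass
class Interpretation:
    event: str
    confidence: float


def parse_interpretation(raw: str) -> Optional[Interpretation]:
    """Return a validated Interpretation, or None if the reply is unusable."""
    # Models often wrap JSON in extra prose, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    event = data.get("event")
    confidence = data.get("confidence")
    if not isinstance(event, str) or event not in ALLOWED_EVENTS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return Interpretation(event=event, confidence=float(confidence))
```

Returning `None` for anything malformed lets the loop skip a frame instead of acting on a reply it cannot interpret.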

2. A Repeatable, Stable Watcher Loop

Local AI on mobile introduces layers of complexity beyond model performance. Frame capture, image preprocessing, and output formatting must all work in harmony. Even minor inconsistencies—such as mismatched image representations or unstable timing logs—could derail the entire pipeline. The developer found that model startup alone did not guarantee a functional watcher; the system had to handle dozens of edge cases before it could be trusted.
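As an example of a mismatched image representation, suppose the camera delivers tightly packed 32-bit BGRA pixels while the model's preprocessing expects packed RGB. Both formats are assumptions chosen for illustration, since the article does not name the formats involved. The hypothetical helper below converts between them and refuses buffers whose size does not match the stated dimensions.

```python
def bgra_to_rgb(buffer: bytes, width: int, height: int) -> bytes:
    """Convert tightly packed BGRA pixels to packed RGB, dropping alpha.

    Assumes no row padding; a real camera buffer may carry extra bytes per row.
    """
    expected = width * height * 4
    if len(buffer) != expected:
        raise ValueError(
            f"expected {expected} bytes for {width}x{height} BGRA, got {len(buffer)}"
        )
    rgb = bytearray(width * height * 3)
    rgb[0::3] = buffer[2::4]  # red channel is the third byte of each BGRA pixel
    rgb[1::3] = buffer[1::4]  # green stays in the middle
    rgb[2::3] = buffer[0::4]  # blue is the first byte of each BGRA pixel
    return bytes(rgb)


# One pixel with B=10, G=20, R=30, A=255.
pixel = bgra_to_rgb(bytes([10, 20, 30, 255]), width=1, height=1)
```

Failing loudly on a size mismatch turns a silent colour or stride bug into an error that shows up in the logs.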

The Hidden Challenges of Local AI Deployment

Some of the toughest obstacles had nothing to do with AI and everything to do with mobile app development:

Model size management: Bundling large GGUF files directly into the app would bloat the download size beyond acceptable limits.
Storage discipline: Keeping model files separate from app resources prevented accidental overwrites during updates.
Transfer workflows: Users needed a way to download models without disrupting the app’s performance.
Traceability: Without granular logs, it was impossible to distinguish between camera delays and model inference stalls.
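The traceability point can be illustrated with per-stage timing. The sketch below records how long each named stage takes so that a slow camera and a slow model show up as different entries. `StageTrace` and its methods are hypothetical names, and the `sleep` calls merely simulate work.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


class StageTrace:
    """Collects wall-clock durations per pipeline stage."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.clock = clock  # injectable so the trace can be tested deterministically
        self.durations: Dict[str, List[float]] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.durations.setdefault(name, []).append(self.clock() - start)

    def slowest(self) -> str:
        """Name of the stage with the largest total time (needs >= 1 stage)."""
        return max(self.durations, key=lambda n: sum(self.durations[n]))


trace = StageTrace()
with trace.stage("capture"):
    time.sleep(0.001)   # simulated camera work
with trace.stage("inference"):
    time.sleep(0.02)    # simulated model work
```

Logging `trace.durations` once per loop pass would show whether a stall came from frame capture or from inference.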

These challenges underscored a key insight: local AI on mobile is as much a deployment problem as it is a modeling problem. A model that runs in isolation doesn’t automatically translate into a reliable, user-facing feature.
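The storage and transfer concerns listed above suggest keeping model files in their own directory and only moving a download into place once it is known to be complete. The following sketch assumes a SHA-256 checksum is published alongside the model. The checksum step, the function name, and the directory layout are illustrative assumptions and do not describe how OIC handles its models.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def install_model(downloaded: Path, models_dir: Path, expected_sha256: str) -> Path:
    """Verify a downloaded model file, then move it into models_dir."""
    digest = hashlib.sha256()
    with downloaded.open("rb") as fh:
        # Hash in 1 MiB chunks so large model files never sit fully in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch; refusing to install model")

    models_dir.mkdir(parents=True, exist_ok=True)
    target = models_dir / downloaded.name
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(downloaded, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    download = Path(tmp) / "model.gguf"
    download.write_bytes(b"placeholder weights")
    checksum = hashlib.sha256(b"placeholder weights").hexdigest()
    installed = install_model(download, Path(tmp) / "models", checksum)
    installed_ok = installed.exists() and not download.exists()
```

Keeping the partial download outside the models directory means the watcher never tries to load a half-written file.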

The Path Forward: From Prototype to Product

The experiment didn’t deliver a fully functional cat-door watcher, but it laid the groundwork for future iterations. The refactored architecture now supports multiple watcher types, and the Gemma model is confirmed to load on the device through llama.cpp, even though inference on camera frames is still unverified. The next steps involve:

Verifying the full pipeline from frame capture to model output.
Refining the user interface to switch between watchers seamlessly.
Optimizing model loading times and reducing app footprint.

For developers eyeing local AI agents, this experiment serves as a cautionary tale: a working demo is only the first mile. The real journey begins when the model leaves the lab and enters a user’s pocket.

AI summary

Explore how Gemma 4 is used in a local iPhone app, how the OIC tool evolved from a narrow task into a general-purpose watcher, and the limits of local AI on mobile.