TL;DR (for executives in a hurry)
- On-device AI is the headline trend of 2026 mobile development for a concrete reason: it makes AI features private by default, instant, offline-capable and free of per-request API costs.
- The platforms now ship it for you. Apple Intelligence exposes on-device Foundation Models to iOS apps on supported devices; Google exposes Gemini Nano through AICore on capable Android devices. You no longer need to be an ML team to use a local model.
- Think hybrid, not either/or. A small on-device model handles the common, privacy-sensitive 80% (summaries, smart replies, classification, transcription, redaction) and escalates the hard 20% to a cloud model, with consent and redaction at the boundary.
- The hard part is not the model; it is the device tail. A flagship runs a 3-billion-parameter model comfortably; a three-year-old mid-range Android does not. Device-tier detection and a graceful fallback are the architecture, not a nice-to-have.
- For a focused feature, budget roughly 4–8 weeks and €25–60k. The biggest line item is QA across real hardware. See our mobile app development service for how we run that.
What "on-device AI" actually means in 2026
On-device AI (also called edge AI or local inference) means the model runs on the phone's own silicon (Apple's Neural Engine, Qualcomm's Hexagon NPU, the TPU in Google's Tensor chips) rather than on a server you call over the network. The practical consequences are why product teams in the US and EU are asking about it:
- The data never leaves the device. The photo, message, voice note or health record is processed locally. Nothing is uploaded, so there is nothing to intercept, log or subpoena.
- It works offline. On a plane, in a tunnel, in a hospital basement — the feature still works because the model is already on the phone.
- It is instant. No network round-trip means responses start in tens of milliseconds, not after a second of latency.
- It has no marginal cost. There is no per-token API bill. Ten users or ten million users cost the same in inference: nothing.
That last point quietly changes the economics of AI features. Cloud LLM bills scale linearly with usage; on-device inference does not. For a consumer app with millions of daily AI interactions, moving the common case on-device can turn an unbounded variable cost into zero.
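To make the linear-versus-flat point concrete, here is a back-of-the-envelope sketch. The request volume, tokens per request and per-token price are illustrative assumptions, not quotes from any provider, and the function names are ours.

```python
def monthly_cloud_cost(requests_per_day: int,
                       tokens_per_request: int,
                       price_per_million_tokens: float,
                       days: int = 30) -> float:
    """Cloud inference cost grows linearly with request volume."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million_tokens


def monthly_hybrid_cost(requests_per_day: int,
                        tokens_per_request: int,
                        price_per_million_tokens: float,
                        on_device_share: float,
                        days: int = 30) -> float:
    """Only the share escalated to the cloud is billed per token."""
    cloud_requests = round(requests_per_day * (1 - on_device_share))
    return monthly_cloud_cost(cloud_requests, tokens_per_request,
                              price_per_million_tokens, days)


# Assumed workload: 1M requests/day, 800 tokens each, 0.50 per million tokens.
all_cloud = monthly_cloud_cost(1_000_000, 800, 0.50)
hybrid = monthly_hybrid_cost(1_000_000, 800, 0.50, on_device_share=0.8)
```

Under these assumed numbers the all-cloud bill is 12,000 currency units a month and the hybrid bill is 2,400. The absolute figures matter less than the shape: the cloud bill grows with every request, while the on-device share adds nothing to it.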
What changed in 2024–2026
If your mental model is "you need an ML team to run a model on a phone," it is two years out of date. Three things changed.
Apple Intelligence put a model in every iOS app
Since Apple opened its on-device Foundation Models framework to third-party developers, any iOS app running on an Apple Intelligence-capable device can call a roughly 3-billion-parameter system model with a few lines of Swift, with guided generation, tool calling and structured output included, all running on the Neural Engine. You get a capable local model without shipping, updating or paying for one yourself. For most "summarise this," "rewrite that," "extract these fields" features, this is now the default starting point on iOS.
Google made Gemini Nano a system service on Android
On the Android side, Gemini Nano runs through AICore as a managed system component. Apps request on-device inference through ML Kit GenAI APIs — summarisation, proofreading, rewrite, image description — and the OS handles the model. As with Apple, the model is shared by the system, so it does not bloat your APK, and it is kept current by platform updates.
Open models got small enough — and the runtimes got good
Outside the built-in system models, a wave of small open models (in the 1–4B class, 4-bit quantised) now run well on phones through mature runtimes: Core ML and MLX on iOS, LiteRT (the renamed TensorFlow Lite) and the MediaPipe LLM stack on Android, and cross-platform engines like ExecuTorch, MLC LLM and llama.cpp. These let you ship your own fine-tuned model when the system model is not enough — at the cost of carrying the weights and the engineering to keep them fast.
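The "carrying the weights" cost is easy to estimate: weight storage is roughly parameters times bits per weight, divided by eight. The sketch below covers weights only. Runtime overhead and the KV cache come on top and depend on the runtime, so treat the result as a lower bound. The function name is ours.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# A 3B model at 4-bit versus the same model at 16-bit.
quantised = weight_footprint_gb(3.0, 4)    # 1.5 GB
full = weight_footprint_gb(3.0, 16)        # 6.0 GB
```

That factor of four is why 4-bit quantisation is what makes the 1–4B class practical on a phone: the weights land between roughly 0.5 GB and 2 GB instead of 2 GB to 8 GB.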
On-device vs cloud: the real trade-off
This is the decision that matters, and it is not ideological. On-device and cloud are tools with different jobs.
| Dimension | On-device model | Cloud model (GPT / Claude / Gemini Pro) |
|---|---|---|
| Privacy | Data never leaves the phone | Data sent to a third-party processor |
| Offline | Works with no connection | Requires connectivity |
| Latency | Tens of ms to first token | Network round-trip + queue |
| Marginal cost | Zero per request | Per-token, scales with usage |
| Capability ceiling | 1–4B params — good, not frontier | Frontier reasoning, huge context |
| Knowledge freshness | Frozen at model ship date | Can be current / retrieval-backed |
The honest answer for most apps is hybrid: route each request to the cheapest tier that can handle it. On-device handles summarisation, smart replies, classification, entity extraction, transcription, redaction and semantic search over local data — the high-volume, privacy-sensitive, latency-sensitive work. The cloud handles the long tail that genuinely needs frontier reasoning or fresh knowledge. We design that routing layer as a first-class part of the architecture, the same way we'd design a caching layer — more on the engineering in our AI, ML & Data service.
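A minimal sketch of that routing decision, written in Python for brevity rather than in Swift or Kotlin. Every name here (`Request`, `DeviceState`, `route`, the task labels, the context budget) is hypothetical; a real router would also weigh battery, thermals and model quality per task.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on_device"
    CLOUD = "cloud"
    DECLINE = "decline"  # degrade gracefully: neither tier is usable


# Tasks the article lists as suited to a small local model.
LOCAL_TASKS = {"summarise", "smart_reply", "classify", "extract_entities",
               "transcribe", "redact", "semantic_search"}


@dataclass
class Request:
    task: str
    input_tokens: int
    needs_fresh_knowledge: bool = False


@dataclass
class DeviceState:
    local_model_available: bool
    local_context_tokens: int  # assumed context budget of the local model
    online: bool
    cloud_consent: bool        # user agreed to cloud escalation


def route(req: Request, dev: DeviceState) -> Tier:
    """Pick the cheapest tier that can handle the request."""
    fits_locally = (
        dev.local_model_available
        and req.task in LOCAL_TASKS
        and req.input_tokens <= dev.local_context_tokens
        and not req.needs_fresh_knowledge
    )
    if fits_locally:
        return Tier.ON_DEVICE
    if dev.online and dev.cloud_consent:
        return Tier.CLOUD
    return Tier.DECLINE
```

The ordering is the design choice: the local tier is tried first, and the cloud is reachable only when the device is online and the user has consented. Everything else falls through to a graceful decline rather than a silent upload.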
The on-device AI stack, by platform
Here is what we actually reach for, depending on the target.
iOS
- Apple Intelligence Foundation Models — the default for text generation, summarisation, structured extraction and tool use on supported devices. No model to ship.
- Core ML + MLX — for custom models: vision, audio, or a fine-tuned LLM you convert and run on the Neural Engine / GPU.
- Vision, Natural Language, Speech, Sound Analysis — mature first-party frameworks for OCR, classification, on-device transcription and more, all local.
Android
- Gemini Nano via AICore + ML Kit GenAI: the default managed path for summarise / proofread / rewrite / image-describe on capable devices.
- LiteRT + MediaPipe LLM Inference: for running your own quantised models (Gemma and others) with GPU or NPU acceleration.
- Vendor NPU SDKs: Qualcomm and others expose their own SDKs when you need to squeeze the hardware. NNAPI itself is deprecated as of Android 15, so treat it as a legacy path.
Cross-platform (React Native / Flutter)
- ExecuTorch (PyTorch's on-device runtime) and MLC LLM give you one model that runs on both platforms.
- llama.cpp bindings remain the pragmatic choice for shipping a specific open model with full control.
- You still bridge to the native frameworks above for the best performance-per-watt — a recurring theme in our React Native vs Flutter comparison: the cross-platform layer is your UI, the AI lives close to the metal.
What you can actually ship today
Concrete features we have built or scoped on-device, with no cloud dependency for the core path:
- Summarise & smart reply — long threads, emails, documents condensed locally; suggested replies generated without uploading the conversation.
- Offline transcription & translation — voice notes and meetings transcribed on-device; useful in healthcare, legal and field work where audio must not leave the phone.
- On-device redaction — detect and blur faces, license plates, card numbers and PII in images before anything is shared or uploaded.
- Semantic search over personal data — search your own notes, photos and messages by meaning, with embeddings computed and stored locally.
- Smart camera & document capture — real-time classification, OCR and field extraction (receipts, IDs, forms) with no network.
- Personalisation that stays private — ranking, suggestions and on-device profiles that never become a server-side dossier.
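To illustrate the semantic-search item above: once each note or message has an embedding stored locally, search is a nearest-neighbour lookup by cosine similarity. This sketch assumes the embeddings already exist (the embedding model is out of scope) and uses a brute-force scan, which is a simplification we chose for clarity. The function names and the toy two-dimensional vectors are ours.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec: list[float],
           index: dict[str, list[float]],
           top_k: int = 3) -> list[str]:
    """Return the ids of the top_k most similar items, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy local index: id -> embedding. Real embeddings have hundreds of dimensions.
index = {"note_a": [1.0, 0.0], "note_b": [0.0, 1.0], "note_c": [1.0, 1.0]}
results = search([1.0, 0.1], index, top_k=2)
```

Nothing in this path touches the network: the vectors, the query and the ranking all stay on the device.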
Privacy, GDPR and the EU AI Act
This is where on-device AI is more than a performance trick — it is a compliance posture, which is exactly why it resonates so strongly in the European market.
- GDPR data minimisation, by construction. If personal data is processed only on the user's device and never transmitted, you remove a whole class of obligations: no cross-border transfer, no third-country safeguards, far less to retain, log or disclose. It is one of the cleanest ways to demonstrate privacy by design and by default.
- No third-party processor for the core path. Sending user text to a cloud LLM makes that provider a processor you must contract, document and disclose. Keep it on-device and that relationship — and its risk — simply does not exist.
- The EU AI Act still applies. On-device does not exempt you. Transparency duties (telling users they're interacting with AI), prohibited practices, and high-risk classifications are about the use case, not where inference runs. What on-device removes is the cross-border and processor risk, not your AI Act duties. We covered the framework in our EU AI Act checklist.
The practical pattern: do the privacy-sensitive work on-device, and if you escalate to the cloud, escalate redacted, minimised data with explicit consent — never the raw record.
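A minimal sketch of that boundary, assuming a deliberately simple redactor. It uses regular expressions plus a Luhn checksum rather than an on-device model, covers only email addresses and card numbers, and every name in it (`redact`, `prepare_for_cloud`, the placeholder tokens) is hypothetical.

```python
import re
from typing import Optional

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to avoid redacting arbitrary long numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace card numbers and email addresses with placeholders."""
    def card_sub(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD]" if luhn_ok(digits) else match.group()

    text = CARD.sub(card_sub, text)
    return EMAIL.sub("[EMAIL]", text)


def prepare_for_cloud(text: str, consent: bool) -> Optional[str]:
    """Return the redacted payload, or None when the user has not consented."""
    if not consent:
        return None
    return redact(text)
```

The point of the shape is that consent is checked before anything else and the raw text never appears in the return value, so the raw record cannot reach the network call by accident.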
Cost, timeline and team
Real numbers from how we scope this work for US and EU clients in 2026:
- One focused on-device feature (summarisation, smart replies, offline transcription, or redaction): ~4–8 weeks, ~€25–60k. Team: 1 mobile engineer with on-device ML experience, part-time ML support, QA across a device matrix.
- An AI-first app with several on-device features plus a hybrid cloud-escalation layer: ~3–5 months, scoped per feature.
- The dominant cost is QA, not the model. Built-in system models are free to call; the work is verifying behaviour, performance and battery across the long tail of real Android hardware, plus the fallback path for unsupported devices.
For full benchmarks across the whole build, see our 2026 mobile app development cost guide. The on-device-specific advice: budget explicitly for a real-device test lab, and decide your minimum supported tier before you write a line of inference code.
Implementation checklist
The sequence we follow when adding on-device AI to a mobile app:
- Define the job. One sentence: "summarise threads," "transcribe offline," "redact PII." Vague AI ambitions are where budgets die.
- Try the system model first. Apple Intelligence on iOS, Gemini Nano on Android. If it's good enough, you're nearly done.
- Set the device floor. Pick the minimum tier you'll support on-device, and design the cloud (or graceful-degrade) fallback for everything below it.
- Pick the model only if needed. If the system model falls short, choose a small open model and quantise to 4-bit; measure size, latency and battery, not just accuracy.
- Build the routing layer. On-device first, cloud escalation for the hard or stale cases, with consent and redaction at the boundary.
- Test on real hardware. Emulators lie about NPU performance and battery. Use a physical device matrix spanning flagship to mid-range.
- Measure battery and thermals. Sustained inference heats phones. Profile it; throttle or batch where needed.
- Disclose and consent. Tell users when AI is involved and what (if anything) leaves the device — both good UX and AI Act hygiene.
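The "set the device floor" step in the checklist above can be sketched as a small tier-selection function. The fields, the 1.5 GB model size and the 4x RAM headroom factor are placeholder assumptions, not measured thresholds; on a real app these inputs would come from the platform's availability checks and your own device-matrix results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    ram_gb: float
    has_npu: bool
    system_model_available: bool  # e.g. the OS reports a built-in model


def pick_tier(profile: DeviceProfile,
              model_gb: float = 1.5,
              ram_headroom: float = 4.0) -> str:
    """Choose the inference path for this device, best option first."""
    if profile.system_model_available:
        return "system_model"
    # Only ship our own weights to devices with an NPU and RAM to spare.
    if profile.has_npu and profile.ram_gb >= model_gb * ram_headroom:
        return "bundled_model"
    return "cloud_fallback"


flagship = pick_tier(DeviceProfile(ram_gb=8, has_npu=True, system_model_available=True))
mid_range = pick_tier(DeviceProfile(ram_gb=4, has_npu=True, system_model_available=False))
```

Deciding these thresholds before writing inference code keeps the fallback path a designed feature rather than an afterthought.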
When cloud AI still wins
On-device is a default, not a religion. We ship cloud-first when one of these is true:
- Frontier reasoning — complex multi-step analysis, coding, or nuanced judgement that a 3B model can't reliably do.
- Large context — reasoning over a 200-page document or a long history that won't fit a small local model.
- Fresh knowledge — answers that must reflect today's data, pricing or inventory, via retrieval or live tools.
- Shared, server-side state — when the intelligence is inherently about other users' data, not the one on this phone.
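The large-context case above is the easiest one to check mechanically. The sketch below uses the common rule of thumb of about four characters per token for English text and a hypothetical 4,096-token local context window; both are assumptions, and a real build should count tokens with the model's own tokenizer.

```python
import math


def estimate_tokens(text_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not exact)."""
    return math.ceil(text_chars / chars_per_token)


def fits_local_context(text_chars: int,
                       context_tokens: int,
                       reserved_output_tokens: int = 512) -> bool:
    """True if the input leaves room for the reply inside the local window."""
    return estimate_tokens(text_chars) <= context_tokens - reserved_output_tokens


# A 200-page document at an assumed 3,000 characters per page.
long_doc_chars = 200 * 3_000
long_doc_fits = fits_local_context(long_doc_chars, context_tokens=4096)   # False
short_note_fits = fits_local_context(8_000, context_tokens=4096)          # True
```

Under these assumptions the 200-page document comes to about 150,000 tokens, far beyond the local window, which is why it belongs in the cloud tier or needs to be chunked first.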
The winning architecture in 2026 is hybrid: on-device for the private, instant, high-volume common case; cloud for the heavy, occasional one. Getting that boundary right — and the consent and redaction at it — is the actual engineering. It's the core of how we build mobile apps with AI for clients across the US and EU.
FAQ
What is on-device AI in a mobile app?
The model runs on the phone's own chip (Apple Neural Engine, Android NPU) instead of a cloud server. Input never leaves the device, it works offline, and there's no per-request bill. In 2026 it's exposed through Apple Intelligence on iOS and Gemini Nano on Android, plus open models via Core ML, LiteRT, ExecuTorch, MLC and llama.cpp.
On-device AI vs cloud AI — which should I use?
On-device for privacy, offline, latency and zero marginal cost: summaries, smart replies, classification, transcription, redaction. Cloud for frontier reasoning, large context or fresh knowledge. Most production apps are hybrid — on-device for the common 80%, cloud for the hard 20%.
How big a model can a phone run in 2026?
Comfortably 1–4 billion parameters at 4-bit on a recent flagship (iPhone 15 Pro and later, Pixel 8/9, Galaxy S24/S25). The built-in system models are around the 3B class. Mid-range Android targets smaller models or falls back to cloud, so device-tier detection is part of the design.
Is on-device AI better for GDPR and the EU AI Act?
Usually yes for GDPR: processing entirely on-device sharply cuts transfer, processor and retention exposure — clean data minimisation. The EU AI Act still applies by use case (transparency, banned/high-risk rules), so on-device reduces cross-border risk but doesn't exempt you.
How much does it cost to add on-device AI to an app?
A focused feature runs roughly 4–8 weeks and €25–60k with a senior team, including model selection, a device-tier fallback and QA on real hardware. The biggest cost driver is testing across the Android device tail, not the model itself.
Does on-device AI work offline and on older phones?
Offline: yes, that's the point. Older phones: flagships from the last 2–3 years handle 1–4B models; older and mid-range devices need smaller models or a cloud fallback. A correct build detects the tier at runtime and routes accordingly.
How we'd decide for your app
Give us 30 minutes and the one feature you have in mind, and we'll tell you whether it belongs on-device, in the cloud, or split across both — with a realistic cost and timeline for your team and market. No slides, no upsell. We ship both, and we don't care which one you pick, as long as it's the right one.
Last updated 2 June 2026. Model classes and frameworks reflect Apple Intelligence Foundation Models, Google Gemini Nano / AICore, Core ML, LiteRT and ExecuTorch as available in mid-2026. Device performance measured on iPhone 15 Pro, Pixel 9 and a mid-range Android reference device. Methodology available on request.


