How big a model can a phone actually run in 2026?

Comfortably 1–4 billion parameters quantised to 4-bit on a modern flagship (iPhone 15 Pro and newer, Pixel 8/9, Galaxy S24/S25). That is enough for summarisation, rewriting, classification, structured extraction and retrieval-augmented answers over local data. Apple's and Google's built-in system models are roughly in the 3-billion-parameter class. On mid-range Android you target smaller 0.5–2B models or fall back to cloud, which is why device-tier detection is part of the architecture, not an afterthought.

How much does it cost to add on-device AI to a mobile app?

For a focused feature (on-device summarisation, smart replies or offline transcription), budget roughly 4–8 weeks and EUR 25,000–60,000 with a senior team, including model selection, quantisation, a device-tier fallback path and QA across real hardware. A broader AI-first app with several on-device features plus a cloud escalation layer runs higher. The biggest cost driver is not the model but testing across the long tail of Android devices.

On-Device AI in Mobile Apps: 2026 Guide

Q: What is on-device AI in a mobile app?

On-device AI means the model runs directly on the phone's own silicon — the Neural Engine on iPhone or the NPU/Tensor core on Android — instead of sending data to a cloud server. The input (text, photo, audio) never leaves the device, inference works offline, and there is no per-request API bill. In 2026 this is exposed through Apple Intelligence Foundation Models on iOS and Google Gemini Nano via AICore on Android, plus open models run through Core ML, LiteRT (formerly TensorFlow Lite), ExecuTorch, MLC or llama.cpp.

Q: On-device AI vs cloud AI — which should I use?

Use on-device AI when privacy, offline support, latency or per-request cost matter: summarisation, smart replies, classification, on-device search, redaction, transcription and image clean-up. Use a cloud model (GPT, Claude, Gemini Pro) when you need frontier reasoning, very large context, or up-to-the-minute knowledge. Most production apps in 2026 are hybrid: a small on-device model handles the common, privacy-sensitive 80% and silently escalates the hard 20% to the cloud.

Q: Is on-device AI better for GDPR and the EU AI Act?

Usually yes. If personal data is processed entirely on the user's device and never transmitted, you sharply reduce GDPR exposure: there is often no transfer, no third-country issue, and far less to disclose or retain. It is one of the cleanest ways to honour data minimisation. The EU AI Act still applies based on the use case (transparency duties, banned practices, high-risk categories), so on-device does not exempt you — but it removes a whole class of cross-border and third-party-processor risk.

Q: Does on-device AI work offline and on older phones?

Offline: yes. That is one of its main advantages, because the model is already on the device. Older phones: it depends. Flagships from the last two to three years handle 1–4B models well; mid-range and older devices need smaller models or a graceful cloud fallback. A correct implementation detects the device tier at runtime and routes accordingly, so users on a 2026 flagship and users on a three-year-old mid-ranger both get a working experience.

Anna Kowalski, Senior Mobile Engineer, YuSMP Group · iOS, Android and cross-platform AI features since 2015

TL;DR — what is on-device AI in 2026?

On-device AI runs the model on the phone's own silicon. Features stay private by default, keep working offline, respond instantly, and cost nothing per request. In 2026 Apple Intelligence and Google's Gemini Nano put it inside every app. The pattern that wins is hybrid: a small local model handles the common 80%, and the cloud takes the hard 20%.

On-device AI leads the 2026 mobile agenda for a concrete reason: it makes features private by default, instant and offline-capable, with no per-request API cost.
The platforms now hand you the model. Apple Intelligence opens its on-device Foundation Models to any iOS app, and Google exposes Gemini Nano through AICore on Android. You no longer need an ML team to run a local model.
Treat it as hybrid rather than either/or. A small on-device model covers the common, privacy-sensitive 80%: summaries, smart replies, classification, transcription, redaction. Anything harder escalates quietly to a cloud model.
What makes this hard is the device tail, not the model. A flagship runs a 3-billion-parameter model comfortably. A three-year-old mid-range Android won't. That puts device-tier detection and a graceful fallback into the architecture, not the nice-to-have column.
For a focused feature, budget roughly 4–8 weeks and €25–60k. The biggest line item is QA across real hardware. See our mobile app development service for how we run that.

What does "on-device AI" actually mean in 2026?

On-device AI (you'll also see it called edge AI or local inference) means the model runs on the phone's own silicon rather than on a server you reach over the network. That silicon is Apple's Neural Engine, Qualcomm's Hexagon NPU or the TPU inside Google's Tensor chips. The practical consequences explain why product teams across the US and EU keep asking about it:

The data never leaves the device. The photo, message, voice note or health record is processed locally. Nothing is uploaded, so there is nothing to intercept, log or subpoena.
It works offline. On a plane, in a tunnel, in a hospital basement — the feature still works because the model is already on the phone.
It is instant. No network round-trip means responses start in tens of milliseconds, not after a second of latency.
It has no marginal cost. There is no per-token API bill. Ten users or ten million users cost the same in inference: nothing.

That last point rewrites the economics of AI features. A cloud LLM bill scales linearly with usage; on-device inference doesn't move at all. In a consumer app with millions of AI interactions a day, pushing the common case on-device turns an unbounded variable cost into zero.
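The arithmetic behind that claim can be sketched in a few lines. This is an illustration only: the traffic volume, tokens per request and price per million tokens are made-up placeholders, not quotes from any provider, and `monthly_cloud_cost` is a hypothetical helper.

```python
def monthly_cloud_cost(requests_per_day, tokens_per_request,
                       price_per_million_tokens, on_device_share=0.0):
    """Monthly cloud bill for the share of traffic that still goes to the cloud.

    All inputs are assumptions chosen for illustration.
    """
    cloud_requests = requests_per_day * 30 * (1.0 - on_device_share)
    tokens = cloud_requests * tokens_per_request
    return tokens * price_per_million_tokens / 1_000_000


# Hypothetical consumer app: 2M AI interactions a day, 500 tokens each,
# 1.00 currency unit per million tokens.
all_cloud = monthly_cloud_cost(2_000_000, 500, 1.0)
hybrid = monthly_cloud_cost(2_000_000, 500, 1.0, on_device_share=0.8)

print(f"all cloud:      {all_cloud:,.0f} per month")
print(f"80% on-device:  {hybrid:,.0f} per month")
```

With these placeholder inputs the bill falls from 30,000 to 6,000 per month. The cloud line keeps growing with usage, while the on-device share adds nothing to it.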

What changed in 2024–2026

If your mental model is "you need an ML team to run a model on a phone," it is two years out of date. Three things changed.

[Image: close-up of a mobile system-on-chip and neural processing unit on a circuit board.] The enabler is silicon. Every recent flagship ships a dedicated neural processing unit (Apple Neural Engine, Qualcomm Hexagon, Google Tensor) fast enough to run multi-billion-parameter models in real time.

Apple Intelligence put a model in every iOS app

Apple opened its on-device Foundation Models framework to outside developers, so any iOS app can now call a roughly 3-billion-parameter system model in a few lines of Swift. Guided generation, tool calling and structured output come with it, and everything runs on the Neural Engine. You get a capable local model without shipping, updating or paying for one. For most "summarise this," "rewrite that" or "extract these fields" jobs, that is the default starting point on iOS today.

Google made Gemini Nano a system service on Android

On Android, Gemini Nano runs through AICore as a managed system component. Apps ask for on-device inference through the ML Kit GenAI APIs — summarisation, proofreading, rewrite, image description — and the OS owns the model. As on iOS, the system shares that model, so it never bloats your APK and platform updates keep it current.

Open models got small enough — and the runtimes got good

Outside the built-in system models, a wave of small open models (in the 1–4B class, 4-bit quantised) now run well on phones through mature runtimes: Core ML and MLX on iOS, LiteRT (the renamed TensorFlow Lite) and the MediaPipe LLM stack on Android, and cross-platform engines like ExecuTorch, MLC LLM and llama.cpp. These let you ship your own fine-tuned model when the system model is not enough — at the cost of carrying the weights and the engineering to keep them fast.
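A rough way to see why 4-bit quantisation matters is to estimate the size of the weights alone: parameters times bits per weight, divided by eight. The sketch below does only that. It ignores the KV cache, activations and the small overhead of quantisation scales, so real memory use will be higher. `weight_size_gb` is a hypothetical helper name.

```python
def weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for params in (1, 3, 4):
    fp16 = weight_size_gb(params, 16)
    q4 = weight_size_gb(params, 4)
    print(f"{params}B params: 16-bit {fp16:.1f} GB, 4-bit {q4:.1f} GB")
```

A 3B model drops from about 6 GB at 16-bit to about 1.5 GB at 4-bit, which is the difference between not fitting and fitting alongside the rest of the app on a phone.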

On-device vs cloud: the real trade-off

This is the decision that matters, and it is not ideological. On-device and cloud are tools with different jobs.

Dimension	On-device model	Cloud model (GPT / Claude / Gemini Pro)
Privacy	Data never leaves the phone	Data sent to a third-party processor
Offline	Works with no connection	Requires connectivity
Latency	Tens of ms to first token	Network round-trip + queue
Marginal cost	Zero per request	Per-token, scales with usage
Capability ceiling	1–4B params — good, not frontier	Frontier reasoning, huge context
Knowledge freshness	Frozen at model ship date	Can be current / retrieval-backed

For most apps the honest answer is hybrid: send each request to the cheapest tier that can handle it. On-device takes the high-volume, privacy- and latency-sensitive work — summarisation, smart replies, classification, entity extraction, transcription, redaction, semantic search over local data. The cloud picks up the long tail that really needs frontier reasoning or fresh knowledge. We treat that routing layer as a first-class part of the architecture, the way we'd treat a caching layer; there is more on the engineering in our AI, ML & Data service.
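As a rough illustration of "the cheapest tier that can handle it", a routing layer can be reduced to a handful of checks. The sketch below is platform-neutral Python rather than Swift or Kotlin, and every name in it (`Request`, `LOCAL_TASKS`, `LOCAL_CONTEXT_TOKENS`, `route`) is hypothetical. The 4,096-token limit is a placeholder, not a property of any particular system model.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on_device"
    CLOUD = "cloud"


@dataclass
class Request:
    task: str                      # e.g. "summarise", "classify"
    input_tokens: int              # rough size of the prompt
    needs_fresh_knowledge: bool = False


# Assumption: tasks the local model has been validated for in this app.
LOCAL_TASKS = {"summarise", "smart_reply", "classify", "extract",
               "transcribe", "redact", "search"}
# Assumption: placeholder context budget for the local model.
LOCAL_CONTEXT_TOKENS = 4096


def route(req: Request, device_supports_local: bool) -> Tier:
    """Pick the cheapest tier that can plausibly handle the request."""
    if not device_supports_local:
        return Tier.CLOUD          # device below the supported floor
    if req.task not in LOCAL_TASKS:
        return Tier.CLOUD          # needs frontier reasoning
    if req.needs_fresh_knowledge:
        return Tier.CLOUD          # local model is frozen at ship date
    if req.input_tokens > LOCAL_CONTEXT_TOKENS:
        return Tier.CLOUD          # will not fit the local context
    return Tier.ON_DEVICE


print(route(Request("summarise", 1200), device_supports_local=True))
print(route(Request("summarise", 90000), device_supports_local=True))
```

Keeping the rules in one function makes the boundary easy to test and to adjust as the supported device floor changes.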

The on-device AI stack, by platform

Here is what we actually reach for, depending on the target.

iOS

Apple Intelligence Foundation Models — the default for text generation, summarisation, structured extraction and tool use on supported devices. No model to ship.
Core ML + MLX — for custom models: vision, audio, or a fine-tuned LLM you convert and run on the Neural Engine / GPU.
Vision, Natural Language, Speech, Sound Analysis — mature first-party frameworks for OCR, classification, on-device transcription and more, all local.

Android

Gemini Nano via AICore + ML Kit GenAI — the default managed path for summarise / proofread / rewrite / image-describe on capable devices.
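LiteRT + MediaPipe LLM Inference — for running your own quantised models (Gemma and others) with GPU or NPU acceleration.
NNAPI / vendor NPUs — Qualcomm and others expose their own SDKs when you need to squeeze the hardware. Google deprecated NNAPI in Android 15, so new work generally targets the vendor SDKs or LiteRT delegates rather than NNAPI itself.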

Cross-platform (React Native / Flutter)

ExecuTorch (PyTorch's on-device runtime) and MLC LLM give you one model that runs on both platforms.
llama.cpp bindings remain the pragmatic choice for shipping a specific open model with full control.
You still bridge to the native frameworks above for the best performance-per-watt. It is a recurring theme in our React Native vs Flutter comparison: the cross-platform layer is your UI, while the AI lives close to the metal.

What you can actually ship today

Concrete features we have built or scoped on-device, with no cloud dependency for the core path:

Summarise & smart reply — long threads, emails, documents condensed locally; suggested replies generated without uploading the conversation.
Offline transcription & translation — voice notes and meetings transcribed on-device; useful in healthcare, legal and field work where audio must not leave the phone.
On-device redaction — detect and blur faces, license plates, card numbers and PII in images before anything is shared or uploaded.
Semantic search over personal data — search your own notes, photos and messages by meaning, with embeddings computed and stored locally.
Smart camera & document capture — real-time classification, OCR and field extraction (receipts, IDs, forms) with no network.
Personalisation that stays private — ranking, suggestions and on-device profiles that never become a server-side dossier.
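To make the semantic-search item above concrete, here is a minimal sketch of ranking local items by cosine similarity. The three-dimensional vectors are hand-written toy values. In a real app they would come from an on-device embedding model and have hundreds of dimensions, and the index would live in local storage. `cosine` and `search` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=2):
    """Return the top_k texts whose vectors are closest to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy local index: text -> embedding (placeholder values, not model output).
index = {
    "Dentist appointment Tuesday 9am": [0.9, 0.1, 0.0],
    "Grocery list: milk, eggs": [0.0, 0.2, 0.9],
    "Call the clinic about results": [0.8, 0.3, 0.1],
}

query = [1.0, 0.2, 0.0]  # stands in for the embedding of "doctor visits"
print(search(query, index))
```

Nothing here touches the network: both the vectors and the comparison stay on the device, which is the point of the feature.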

[Image: a person using a smartphone alongside a laptop, with data staying on the personal device.] The selling point users understand: "your data stays on your phone." For privacy-led products, like LiMP, the consumer VPN we built, that is not just a feature; it is the brand.

Privacy, GDPR and the EU AI Act

Here on-device AI stops being a performance trick and becomes a compliance posture. That is exactly why it lands so well in the European market.

GDPR data minimisation, by construction. If personal data is processed only on the user's device and never transmitted, you remove a whole class of obligations: no cross-border transfer, no third-country safeguards, far less to retain, log or disclose. It is one of the cleanest ways to demonstrate privacy by design and by default.
No third-party processor for the core path. Sending user text to a cloud LLM makes that provider a processor you must contract, document and disclose. Keep it on-device and that relationship — and its risk — simply does not exist.
The EU AI Act still applies. On-device does not exempt you. Transparency duties (telling users they're interacting with AI), banned practices, and high-risk classifications are about the use case, not where inference runs. What on-device removes is the cross-border and processor risk, not your AI Act duties. We covered the framework in our EU AI Act checklist.

The practical pattern: do the privacy-sensitive work on-device, and if you escalate to the cloud, escalate redacted, minimised data with explicit consent — never the raw record.
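One way to picture that boundary is a small gate that refuses to send anything without consent and strips obvious identifiers first. The sketch below is deliberately simple. The two regular expressions only catch plain email addresses and 13 to 16 digit card-like numbers, whereas a production build would more likely rely on an on-device entity-recognition model. `redact` and `prepare_for_cloud` are hypothetical names.

```python
import re

# Assumption: simple patterns for illustration; they will miss many real cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Replace obvious emails and card-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = CARD.sub("[CARD]", text)
    return text


def prepare_for_cloud(text: str, user_consented: bool):
    """Return the redacted payload, or None when escalation is not allowed."""
    if not user_consented:
        return None  # stay on-device or degrade gracefully
    return redact(text)


raw = "Mail anna@example.com, card 4111 1111 1111 1111 expired"
print(prepare_for_cloud(raw, user_consented=True))
print(prepare_for_cloud(raw, user_consented=False))
```

The raw record never reaches the network call: the only value the caller can send is the one this function returns.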

Cost, timeline and team

Real numbers from how we scope this work for US and EU clients in 2026:

One focused on-device feature (summarisation, smart replies, offline transcription, or redaction): ~4–8 weeks, ~€25–60k. Team: 1 mobile engineer with on-device ML experience, part-time ML support, QA across a device matrix.
An AI-first app with several on-device features plus a hybrid cloud-escalation layer: ~3–5 months, scoped per feature.
The dominant cost is QA, not the model. Built-in system models are free to call; the work is verifying behaviour, performance and battery across the long tail of real Android hardware, plus the fallback path for unsupported devices.

For full benchmarks across the whole build, see our 2026 mobile app development cost guide. The on-device-specific advice: budget explicitly for a real-device test lab, and decide your minimum supported tier before you write a line of inference code.
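Deciding the minimum supported tier up front can be as simple as writing the rules down as code before any inference work starts. The sketch below assumes the app can read three facts about the device at runtime. The RAM thresholds and tier names are placeholders to be replaced with numbers from your own device lab, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Device:
    ram_gb: float
    has_npu: bool
    has_system_model: bool   # e.g. a platform-provided on-device model


def pick_tier(d: Device) -> str:
    """Map a device to an inference tier. Thresholds are assumptions."""
    if d.has_system_model:
        return "system_model"      # nothing to ship, call the OS model
    if d.has_npu and d.ram_gb >= 8:
        return "local_3b"          # bundled model in the ~3B class
    if d.ram_gb >= 6:
        return "local_small"       # smaller 0.5-2B model
    return "cloud_fallback"        # below the on-device floor


for device in (Device(12, True, True), Device(8, True, False),
               Device(6, False, False), Device(4, False, False)):
    print(device, "->", pick_tier(device))
```

Each branch then becomes a row in the QA matrix, which keeps the test plan and the runtime behaviour in step.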

Implementation checklist

The sequence we follow when adding on-device AI to a mobile app:

Define the job. One sentence: "summarise threads," "transcribe offline," "redact PII." Vague AI ambitions are where budgets die.
Try the system model first. Apple Intelligence on iOS, Gemini Nano on Android. If it's good enough, you're nearly done.
Set the device floor. Pick the minimum tier you'll support on-device, and design the cloud (or graceful-degrade) fallback for everything below it.
Pick the model only if needed. If the system model falls short, choose a small open model and quantise to 4-bit; measure size, latency and battery, not just accuracy.
Build the routing layer. On-device first, cloud escalation for the hard or stale cases, with consent and redaction at the boundary.
Test on real hardware. Emulators lie about NPU performance and battery. Use a physical device matrix spanning flagship to mid-range.
Measure battery and thermals. Sustained inference heats phones. Profile it; throttle or batch where needed.
Disclose and consent. Tell users when AI is involved and what (if anything) leaves the device — both good UX and AI Act hygiene.
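For the battery and thermals step in the checklist above, the "throttle or batch" idea can be sketched as a function that shrinks the batch and adds a pause as the device heats up. The four levels loosely mirror the thermal states that mobile operating systems report. The batch sizes and pause lengths are placeholders to be tuned from profiling, and `plan_batch` is a hypothetical name.

```python
from enum import IntEnum


class Thermal(IntEnum):
    NOMINAL = 0
    FAIR = 1
    SERIOUS = 2
    CRITICAL = 3


def plan_batch(pending_jobs: int, thermal: Thermal, max_batch: int = 8):
    """Return (jobs_to_run_now, pause_seconds_before_next_batch)."""
    if thermal >= Thermal.CRITICAL:
        return 0, 30.0                                 # stop and let the device cool
    if thermal == Thermal.SERIOUS:
        return min(pending_jobs, 1), 5.0               # one job at a time
    if thermal == Thermal.FAIR:
        return min(pending_jobs, max_batch // 2), 1.0  # half-size batches
    return min(pending_jobs, max_batch), 0.0           # full speed


for state in Thermal:
    print(state.name, plan_batch(10, state))
```

The useful property is that inference degrades gradually instead of the OS throttling the whole app at once.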

When cloud AI still wins

On-device is a default, not a religion. We ship cloud-first when one of these is true:

Frontier reasoning — complex multi-step analysis, coding, or nuanced judgement that a 3B model can't reliably do.
Large context — reasoning over a 200-page document or a long history that won't fit a small local model.
Fresh knowledge — answers that must reflect today's data, pricing or inventory, via retrieval or live tools.
Shared, server-side state — when the intelligence is inherently about other users' data, not just the data on this phone.

So the architecture that wins in 2026 is hybrid: on-device for the private, instant, high-volume common case, and cloud for the heavy, occasional one. The real engineering is getting that boundary right, along with the consent and redaction that sit on it. That is the core of how we build mobile apps with AI for clients across the US and EU.

FAQ

What is on-device AI in a mobile app?

The model runs on the phone's own chip (Apple Neural Engine, Android NPU) instead of a cloud server. Input never leaves the device, it works offline, and there's no per-request bill. In 2026 it's exposed through Apple Intelligence on iOS and Gemini Nano on Android, plus open models via Core ML, LiteRT, ExecuTorch, MLC and llama.cpp.

On-device AI vs cloud AI — which should I use?

On-device for privacy, offline, latency and zero marginal cost: summaries, smart replies, classification, transcription, redaction. Cloud for frontier reasoning, large context or fresh knowledge. Most production apps are hybrid — on-device for the common 80%, cloud for the hard 20%.

How big a model can a phone run in 2026?

Comfortably 1–4 billion parameters at 4-bit on a recent flagship (iPhone 15 Pro and newer, Pixel 8/9, Galaxy S24/S25). The built-in system models are around the 3B class. Mid-range Android targets smaller models or falls back to cloud, so device-tier detection is part of the design.

Is on-device AI better for GDPR and the EU AI Act?

Usually yes for GDPR: processing entirely on-device sharply cuts transfer, processor and retention exposure — clean data minimisation. The EU AI Act still applies by use case (transparency, banned/high-risk rules), so on-device reduces cross-border risk but doesn't exempt you.

How much does it cost to add on-device AI to an app?

A focused feature runs roughly 4–8 weeks and €25–60k with a senior team, including model selection, a device-tier fallback and QA on real hardware. The biggest cost driver is testing across the Android device tail, not the model itself.

Does on-device AI work offline and on older phones?

Offline: yes, that's the point. Older phones: flagships from the last 2–3 years handle 1–4B models; older and mid-range devices need smaller models or a cloud fallback. A correct build detects the tier at runtime and routes accordingly.

How we'd decide for your app

Give us 30 minutes and the one feature you have in mind, and we'll tell you whether it belongs on-device, in the cloud, or split across both — with a realistic cost and timeline for your team and market. No slides, no upsell. We ship both, and we don't care which one you pick, as long as it's the right one.

Last updated 2 June 2026. Model classes and frameworks reflect Apple Intelligence Foundation Models, Google Gemini Nano / AICore, Core ML, LiteRT and ExecuTorch as available in mid-2026. Device performance measured on iPhone 15 Pro, Pixel 9 and a mid-range Android reference device. Methodology available on request.
