A Kotlin Multiplatform wrapper around Google's LiteRT-LM for running Gemma-family models on-device.
Dual-licensed: AGPL-3.0 for open-source / research use; commercial license available for proprietary distribution — see COMMERCIAL.md.
Shipping a production on-device LLM on Android is significantly harder than the LiteRT-LM samples make it look. You need:
- A clean abstraction over the LiteRT-LM Java SDK so your app code stays platform-independent
- A model-management layer that handles 2GB+ artifact downloads with resume + SHA-256 validation
- Hardware-tier logic that picks the right Gemma variant for the device (and refuses gracefully on under-spec hardware)
- Awareness of OEM quirks: Realme Dynamic RAM Expansion, Xiaomi Memory Extension, and OPPO all inflate MemTotal and silently push under-spec devices into the wrong tier
- A function-calling layer that converts your typed Kotlin schema into the OpenAPI JSON LiteRT-LM expects
- Stateful, KV-cache-reusing chat sessions so multi-turn memory is lossless and time-to-first-token stays flat as a conversation grows — instead of re-sending the whole history every turn
- An on-device text embedder (MediaPipe USE-Lite) behind a clean EmbeddingEngine, so you can build retrieval-augmented generation (RAG), meaning chat grounded in the user's own documents, with no cloud vector service
- All of the above shaped to run identically on Android and iOS so you can share code across both apps
This library solves all of these. The bundled sample-app/ — NativeLM — is a
real, shipped product: on-device document chat (answers grounded in your own
PDFs and notes, organized into Projects) plus general chat with conversation
history, built on top of the engine — so you can see exactly what running Gemma
on-device looks like.
A private, fully on-device AI app for Android — no account, no network, no telemetry. Chat with your own documents (on-device RAG, organized into Projects) or just chat — everything runs locally on Gemma via this engine.
Also supported by the engine: an on-device text embedder (EmbeddingEngine,
MediaPipe USE-Lite) for retrieval-augmented generation — NativeLM's "chat with
your documents" is built on it (extract → chunk → embed → ObjectBox HNSW vector
index → relevance-gated retrieval → grounded answer with citations); function
calling (typed Kotlin ToolSchema.Definition → OpenAPI JSON → constrained output
as EngineState.ToolCallEmitted); vision (image input on multimodal Gemma 4);
and real native cancellation of in-flight generation.
| Platform | Core engine | Hardware acceleration | Status |
|---|---|---|---|
| Android (API 24+) | Production | GPU / NPU via LiteRT delegate selection | Production-vetted on flagship + mid-tier devices |
| iOS (arm64 + Apple Silicon sim) | Architecture-ready | Planned: Metal GPU acceleration via LiteRT-LM Swift APIs | Roadmap |
The common module (lib/src/commonMain) carries the engine state machine, model-catalog typing, Ktor-backed download manager, and function-calling schema conversion. iOS-side native bindings are on the roadmap using LiteRT-LM's Swift APIs.
Works with plain Android (non-KMP) apps. Even though the library is published as a Kotlin Multiplatform artifact, the Gradle Module Metadata routes Android consumers directly to the litertlm-kmp-android AAR variant, so you don't need to apply the kotlinMultiplatform plugin or restructure your project. A standard com.android.application module with Kotlin (and optionally Compose) is enough.
In your root settings.gradle.kts:
dependencyResolutionManagement {
    repositories {
        google()
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
In your app module's build.gradle.kts:
dependencies {
    implementation("com.github.sagar-develop:litertlm-kmp:v0.10.0")
}
The library compiles against modern Android tooling:
| Requirement | Required value |
|---|---|
| minSdk | 24 (Android 7.0) |
| compileSdk | 34 or higher |
| Gradle | 8.0+ |
| Android Gradle Plugin | 8.0+ |
| Kotlin | 2.0+ (project must be on K2) |
| android.useAndroidX | true in gradle.properties (default for new projects) |
If your project predates these, upgrade your toolchain before adding the dependency.
The library declares ACCESS_NETWORK_STATE in its own manifest, which merges into your app — no action needed there.
Your app's manifest needs INTERNET (you almost certainly already have it):
<uses-permission android:name="android.permission.INTERNET" />
If you use the optional SpeechRecognizer surface for voice input, also add:
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<queries>
    <intent>
        <action android:name="android.speech.RecognitionService" />
    </intent>
</queries>
No additional ProGuard/R8 rules are required. The library's public API is annotation-free at the consumer surface, and its native dependencies (LiteRT-LM, MediaPipe) ship their own consumer ProGuard rules via their AARs.
The library is DI-agnostic. The sample-app/ module shows the manual instantiation path — simplest way to integrate from a vanilla Android project:
import android.content.Context
import com.sagar.aicore.AndroidHardwareProvider
import com.sagar.aicore.AndroidPlatformFolders
import com.sagar.aicore.KtorModelManager
import com.sagar.aicore.LiteRtLmLocalAiEngine
import io.ktor.client.HttpClient

class MyEngineHolder(context: Context) {
    // Always hold the application context so the holder cannot leak an Activity.
    private val httpClient = HttpClient()
    private val hardware = AndroidHardwareProvider(context.applicationContext)
    private val folders = AndroidPlatformFolders(context.applicationContext)
    val modelManager = KtorModelManager(httpClient, folders)
    val engine = LiteRtLmLocalAiEngine(hardware)
}
Hold one instance for the app lifetime (typically in your Application subclass or your existing DI graph). If you use Hilt, declare these as @Singleton @Provides bindings; if you use Koin, use the equivalent single { ... } definitions. If you use kotlin-inject (the library's own DI graph), see AiEngineComponent and AndroidAiEngineComponent for the ready-made interface.
val engine = myEngineHolder.engine
engine.initializeEngine(modelPath = "/data/data/your.app/files/gemma-4-E2B-it.litertlm")
engine.generateStream(
    AiEngineRequest(
        formattedPrompt = "Explain how RoPE positional encodings work.",
        temperature = 0.7f,
        maxTokens = 1024,
    )
).collect { state ->
    when (state) {
        is EngineState.TokenGenerated -> print(state.data)
        is EngineState.Error -> error(state.fault.message ?: "Engine fault")
        else -> Unit
    }
}
You define the schema once in Kotlin. The library converts it to the OpenAPI 3.0 JSON LiteRT-LM expects, asks the model to call the function rather than reply in free text, and surfaces the parsed arguments back as a Map<String, Any?>.
val toolSchema = ToolSchema.Definition(
    name = "extract_event_details",
    description = "Extract structured event details from a sentence.",
    parameters = listOf(
        ToolParameter("title", ToolParameterType.StringT, "Event title.", required = true),
        ToolParameter("duration_minutes", ToolParameterType.IntegerT, "Length in minutes.", required = true),
    ),
)
engine.generateStream(
    AiEngineRequest(
        formattedPrompt = "Schedule a 30-minute kickoff for Project Apollo on Tuesday.",
        requireStructuredOutput = true,
        toolSchema = toolSchema,
    )
).collect { state ->
    if (state is EngineState.ToolCallEmitted) {
        println("Extracted: ${state.arguments}")
        // → {title=Project Apollo kickoff, duration_minutes=30.0}
    }
}
How it works under the hood:
- ToolSchemaConverter.toOpenApiJson() walks your typed Definition and emits canonical OpenAPI 3.0 JSON ({"name": "...", "parameters": {"type": "object", "properties": {...}, "required": [...]}}).
- LiteRtLmLocalAiEngine.runStructured(...) registers the JSON as a LiteRT-LM OpenApiTool with automaticToolCalling = false, then sends the prompt with the system instruction "you MUST call the tool".
- The model is constrained at the token level to emit a valid tool call rather than free text. Each call comes back via message.toolCalls[] as (name, arguments: Map<String, Any?>).
- The library re-emits each call as EngineState.ToolCallEmitted for your consumer to read.
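The schema-to-JSON conversion step can be pictured with a small standalone sketch. It is not the library's ToolSchemaConverter: the types ParamType, Param and ToolDef and the function toOpenApiJson below are hypothetical stand-ins, and the JSON is assembled by hand only to keep the example dependency-free. It also emits a "description" field next to "name" and "parameters", which is an assumption of this sketch.

```kotlin
// Hypothetical stand-ins for the library's schema types; names and shapes are assumptions.
sealed interface ParamType {
    object Str : ParamType
    object Integer : ParamType
    data class Arr(val items: ParamType) : ParamType
}

data class Param(val name: String, val type: ParamType, val description: String, val required: Boolean)

data class ToolDef(val name: String, val description: String, val parameters: List<Param>)

// Minimal JSON string quoting: escapes backslashes and double quotes only.
// Control characters are not handled; a real converter would use a JSON library.
fun quote(s: String): String = "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

// Emits the schema fields for one type, recursing into array item types.
fun typeFields(t: ParamType): String = when (t) {
    ParamType.Str -> "\"type\":\"string\""
    ParamType.Integer -> "\"type\":\"integer\""
    is ParamType.Arr -> "\"type\":\"array\",\"items\":{${typeFields(t.items)}}"
}

// Builds an object schema with "properties" and "required" from the typed definition.
fun toOpenApiJson(tool: ToolDef): String {
    val properties = tool.parameters.joinToString(",") { p ->
        "${quote(p.name)}:{${typeFields(p.type)},\"description\":${quote(p.description)}}"
    }
    val required = tool.parameters.filter { it.required }.joinToString(",") { quote(it.name) }
    return "{\"name\":${quote(tool.name)},\"description\":${quote(tool.description)}," +
        "\"parameters\":{\"type\":\"object\",\"properties\":{$properties},\"required\":[$required]}}"
}

fun main() {
    val tool = ToolDef(
        name = "extract_event_details",
        description = "Extract structured event details from a sentence.",
        parameters = listOf(
            Param("title", ParamType.Str, "Event title.", required = true),
            Param("duration_minutes", ParamType.Integer, "Length in minutes.", required = true),
        ),
    )
    println(toOpenApiJson(tool))
}
```

Keeping the type mapping in one recursive function is what lets nested arrays fall out for free; the library's own converter may structure this differently.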
A few gotchas worth knowing:
- Numeric types come back as Double. JSON has no integer/float distinction at the wire level, so an IntegerT parameter still arrives as Double. Coerce with (it as Number).toInt().
- Snake_case is preferred for param names. LiteRT-LM also accepts camelCase, but snake_case round-trips cleaner with the JSON schema vocabulary.
- Arrays nest. ToolParameterType.ArrayT(ToolParameterType.StringT) becomes {"type": "array", "items": {"type": "string"}}; see ToolSchemaConverterTest for the round-trip cases.
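The Double coercion noted above is easy to centralize. The following is a minimal sketch, assuming the arguments arrive as a plain Map<String, Any?>; EventDetails, requireInt, requireString and parseEventDetails are hypothetical helpers written for this example, not part of the library.

```kotlin
// Hypothetical target type for the extract_event_details example.
data class EventDetails(val title: String, val durationMinutes: Int)

// Reads a required integer argument. Accepts any Number because values may arrive as Double,
// and rejects fractional values instead of silently truncating them.
fun Map<String, Any?>.requireInt(key: String): Int {
    val raw = this[key] as? Number ?: error("Missing or non-numeric argument: $key")
    val asDouble = raw.toDouble()
    require(asDouble % 1.0 == 0.0) { "Argument $key is not a whole number: $raw" }
    return asDouble.toInt()
}

// Reads a required string argument.
fun Map<String, Any?>.requireString(key: String): String =
    this[key] as? String ?: error("Missing or non-string argument: $key")

// Maps the loosely typed argument map onto a typed value.
fun parseEventDetails(arguments: Map<String, Any?>): EventDetails =
    EventDetails(
        title = arguments.requireString("title"),
        durationMinutes = arguments.requireInt("duration_minutes"),
    )

fun main() {
    val args = mapOf<String, Any?>("title" to "Project Apollo kickoff", "duration_minutes" to 30.0)
    println(parseEventDetails(args)) // EventDetails(title=Project Apollo kickoff, durationMinutes=30)
}
```

Failing loudly on a fractional value is a design choice of this sketch; an app that prefers rounding can swap the require for roundToInt().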
Multimodal Gemma 4 (E2B / E4B) accepts an image alongside the text prompt. Attach an Attachment.Image to the request — the library bundles it as a LiteRT-LM Content.ImageBytes next to the prompt text. Engines that don't support vision ignore the attachment rather than failing, so the same call is safe across engines; gate your UI on engine.descriptor.supportsVision if you want to hide the affordance when unsupported.
val jpegBytes: ByteArray = /* a photo or screenshot */
engine.generateStream(
    AiEngineRequest(
        formattedPrompt = "Summarize the text visible in this image.",
        attachments = listOf(Attachment.Image(bytes = jpegBytes, mimeType = "image/jpeg")),
    )
).collect { state ->
    if (state is EngineState.TokenGenerated) print(state.data)
}
The engine is initialized with visionBackend = Backend.CPU() and maxNumImages = 1. The .litertlm bundle you load must include vision-encoder weights (the standard Gemma 4 E2B / E4B artifacts do); a text-only build fails at init when a vision backend is set. Audio attachments are accepted by the API but not yet wired to inference.
The engine ships an on-device text embedder, so you can ground answers in the user's own documents — no cloud vector service.
import com.sagar.aicore.MediaPipeEmbeddingEngine

val embeddings = MediaPipeEmbeddingEngine(context)
embeddings.initialize(modelPath = "/data/.../universal_sentence_encoder.tflite")
val vector: FloatArray = embeddings.embed("Your document chunk here")
// → 100-dim float vector, ready for nearest-neighbor search
NativeLM (sample-app/) builds a complete on-device RAG pipeline on this primitive:
extract a PDF/text source (PDFBox) → chunk it (page-aware) → embed each
chunk → store in an ObjectBox HNSW vector index scoped per Project → at chat
time, embed the question, pull the relevance-gated top-k, fence the context,
and answer grounded with citations. See the
rag/ and
data/db/ packages for the
end-to-end reference.
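The "relevance-gated top-k" step can be illustrated without any index at all. The sketch below substitutes a brute-force cosine scan for the ObjectBox HNSW index that NativeLM uses, so it shows only the gating and ranking logic. Chunk, cosine, retrieve, and the 0.5 threshold are assumptions made for the example.

```kotlin
import kotlin.math.sqrt

// Hypothetical chunk record: an id, the source text, and its embedding.
class Chunk(val id: Int, val text: String, val vector: FloatArray)

// Cosine similarity; returns 0 for zero-length vectors to avoid dividing by zero.
fun cosine(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if (normA == 0f || normB == 0f) return 0f
    return dot / (sqrt(normA) * sqrt(normB))
}

// Scores every chunk, drops those below minScore (the relevance gate), and keeps the best k.
fun retrieve(query: FloatArray, chunks: List<Chunk>, k: Int, minScore: Float): List<Pair<Chunk, Float>> =
    chunks.map { it to cosine(query, it.vector) }
        .filter { it.second >= minScore }
        .sortedByDescending { it.second }
        .take(k)

fun main() {
    // Toy 2-dim vectors stand in for real embeddings.
    val chunks = listOf(
        Chunk(1, "Invoices are due in 30 days.", floatArrayOf(1f, 0f)),
        Chunk(2, "The office is closed on Fridays.", floatArrayOf(0f, 1f)),
        Chunk(3, "Late invoices incur a fee.", floatArrayOf(0.8f, 0.6f)),
    )
    val hits = retrieve(floatArrayOf(1f, 0f), chunks, k = 2, minScore = 0.5f)
    hits.forEach { (chunk, score) -> println("${chunk.id}: $score  ${chunk.text}") }
}
```

The gate matters as much as the ranking: when no chunk clears the threshold, the result is empty and the app can answer without document context rather than grounding on irrelevant text.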
Under R8/minification, the MediaPipe + Flogger + protobuf surfaces the embedder needs ship in the engine's
consumer-rules.pro, so a consumer's release build keeps them automatically.
// `modelManager` from MyEngineHolder above
modelManager.downloadModel(
    url = "https://your-cdn/gemma-4-E2B-it.litertlm",
    modelName = "gemma-4-E2B-it.litertlm",
    expectedSha256 = "...", // optional, fails atomically on mismatch
).collect { state ->
    when (state) {
        is DownloadState.Downloading -> updateProgressBar(state.progress)
        is DownloadState.Success -> launchEngine(state.localPath)
        is DownloadState.Error -> showError(state.message)
        else -> Unit
    }
}
The sample-app's NativeLmViewModel shows the full real-world flow: download → init → open a stateful chat session → stream turns. Read it end-to-end for a working reference.
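For readers curious what "resumable" and "fails atomically on mismatch" involve, here is a minimal JVM sketch. It uses java.security.MessageDigest rather than Okio, and sha256Hex, finalizeDownload and resumeRangeHeader are hypothetical helpers, not the KtorModelManager internals.

```kotlin
import java.io.File
import java.security.MessageDigest

// Streams the file through SHA-256 so a multi-gigabyte artifact is never held in memory.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(64 * 1024)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Promotes the temp file to its final name only when the hash matches; otherwise discards it,
// so the destination path never holds a corrupt model. Returns false on mismatch or failed rename.
// Note: renameTo behaviour when the destination already exists is platform-dependent.
fun finalizeDownload(partial: File, destination: File, expectedSha256: String?): Boolean {
    if (expectedSha256 != null && !sha256Hex(partial).equals(expectedSha256, ignoreCase = true)) {
        partial.delete()
        return false
    }
    return partial.renameTo(destination)
}

// Value for an HTTP Range header that resumes after the bytes already on disk, or null to start over.
fun resumeRangeHeader(partial: File): String? =
    if (partial.exists() && partial.length() > 0) "bytes=${partial.length()}-" else null

fun main() {
    val partial = File.createTempFile("model", ".part").apply { writeText("abc") }
    val destination = File(partial.parentFile, partial.name.removeSuffix(".part") + ".bin")
    val expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad" // SHA-256 of "abc"
    println(resumeRangeHeader(partial)) // bytes=3-
    println(finalizeDownload(partial, destination, expected))
    destination.delete()
}
```

Writing to a temp file and renaming only after verification is what makes the failure atomic: a crash or a bad hash leaves at most a stray partial file, never a half-valid model at the final path.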
The sample-app/ module is NativeLM — a Compose app that exercises the whole
library: branded onboarding → model management → a chat surface with stateful
KV-cache sessions and conversation history. Models are downloaded on demand
from Hugging Face with your own token (never bundled), and a previously
selected model auto-loads from disk on later launches.
./gradlew :sample-app:assembleDebug
adb install -r sample-app/build/outputs/apk/debug/sample-app-debug.apk
adb shell am start -n com.nativelm.app/.MainActivity
A signed, R8-minified release build is also wired (:sample-app:assembleRelease); see sample-app/README.md.
The repo does not ship binary model weights — Gemma's license permits redistribution but each consumer hosts their own; NativeLM downloads directly from Hugging Face.
- Onboarding → Model Management.
- Paste a Hugging Face read token (Settings → Access Tokens on huggingface.co) into the token field. It's stored encrypted on-device (EncryptedSharedPreferences).
- Download Gemma 4 E2B (~2.6 GB, for 6 GB+ devices); the download is resumable and SHA-256 validated. Tap Set active to load it into memory.
- Chat. Type a prompt and watch token-by-token streaming with live tokens/sec + TTFT. The model file persists, so later launches load it directly.
The download URLs + per-model metadata live in
NativeLmModelCatalog
(an app-supplied ModelCatalog over the engine's typed descriptors).
- Streaming: the assistant bubble fills in token-by-token; a quiet TTFT · N tok/s line shows live throughput (≈20 tok/s on a recent flagship for Gemma 4 E2B, CPU backend).
- Multi-turn memory: tell it a fact, then ask about it several turns later; it recalls correctly without the app re-sending history (the KV-cache session holds the context). Time-to-first-token stays flat as the chat grows.
- Conversation history: the drawer lists past conversations (auto-titled by the model); switching one re-prefills its history ("Building understanding…") then reuses the cache.
- Stop truly interrupts the native decode loop, not just the UI.
flowchart TB
app["Your app (Compose / SwiftUI / …)"]
sel["RAM-tier selection · HardwareProvider<br/>picks LLM + embedder + reranker for the device"]
lae["LocalAiEngine<br/>LiteRT-LM / Gemma 4"]
ee["EmbeddingEngine<br/>EmbeddingGemma / USE-Lite (ONNX, telemetry-free)"]
ret["DocumentRetriever<br/>vector + BM25 + rerank (RAG)"]
mm["ModelManager<br/>resumable download + SHA-256"]
app --> sel
sel --> lae
sel --> ee --> ret
sel --> mm
classDef n fill:#f5f3ef,stroke:#7FA980,color:#1C1B1A;
class app,sel,lae,ee,ret,mm n;
See ARCHITECTURE.md for the full design — Mermaid diagrams of the engine/product split and the RAG pipeline, the device-tier policy, and why OEM RAM-expansion detection is necessary.
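To make the device-tier idea concrete, here is a small standalone sketch. It is not the library's HardwareProvider: DeviceTier, parseMemTotalKb and selectTier are hypothetical, the E4B threshold is an illustrative assumption, and how much RAM an OEM expansion feature has added (expansionKb) is taken as an input because detecting it is device-specific.

```kotlin
// Hypothetical tiers for this sketch; the library's actual tier model may differ.
enum class DeviceTier { UNSUPPORTED, E2B, E4B }

// Extracts MemTotal (reported in kB) from /proc/meminfo-style text; null if the line is absent.
fun parseMemTotalKb(meminfo: String): Long? =
    meminfo.lineSequence()
        .firstOrNull { it.startsWith("MemTotal:") }
        ?.removePrefix("MemTotal:")
        ?.trim()
        ?.removeSuffix("kB")
        ?.trim()
        ?.toLongOrNull()

// Picks a tier from physical RAM, i.e. reported MemTotal minus any known OEM expansion.
// Thresholds are assumptions; only the pairing of E2B with "6 GB+" devices comes from the text above.
fun selectTier(memTotalKb: Long, expansionKb: Long = 0): DeviceTier {
    val physicalGb = (memTotalKb - expansionKb).coerceAtLeast(0) / (1024.0 * 1024.0)
    return when {
        physicalGb >= 10.0 -> DeviceTier.E4B
        physicalGb >= 5.0 -> DeviceTier.E2B // a "6 GB" device reports somewhat less than 6 GiB
        else -> DeviceTier.UNSUPPORTED
    }
}

fun main() {
    val meminfo = "MemTotal:        7900000 kB\nMemFree:          512000 kB"
    val reported = parseMemTotalKb(meminfo) ?: error("MemTotal not found")
    // A nominal 4 GB device whose MemTotal is inflated by a 4 GiB expansion feature.
    println(selectTier(reported))                          // E2B (wrong tier if taken at face value)
    println(selectTier(reported, expansionKb = 4_194_304)) // UNSUPPORTED
}
```

The two calls in main show why the correction matters: the same reported figure lands in a usable tier or is refused depending on whether the expansion is subtracted.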
If your team is migrating from cloud LLM APIs to on-device inference, designing a Kotlin Multiplatform AI stack, or needs a commercial license for proprietary distribution:
Typical engagements:
- Commercial licensing (see COMMERCIAL.md)
- Architectural advisory: KMP module layout, agent patterns on top of LocalAiEngine, cloud-to-edge migration playbooks
- Custom implementations: fine-tune integration, multi-model orchestration, RAG pipelines, function-calling schemas tuned to your domain
- v0.1: initial library release. Android target production-ready, iOS targets compile but native engine bindings deferred.
- v0.2: sample-app/ Compose Android app with live CPU + RAM + tokens/sec metrics overlay. Library restructured into :lib subproject; published as com.sagar:litertlm-kmp.
- v0.2.4: multimodal vision. Image attachments flow through EngineConfig.visionBackend + Content.ImageBytes; descriptor.supportsVision is now true.
- v0.3.0: stateful KV-cache chat sessions (openChatSession / ChatSession) with lossless multi-turn memory and no history re-sending; real native cancellation (cancel()); explicit EngineConfig.backend selection (on-device benchmarking picked CPU(6)); SamplerConfig temperature/seed plumbed through. The NativeLM showcase app gains conversation history (ObjectBox), model-generated titles, and a signed, R8-minified release build (engine ships its own consumer ProGuard rules).
- v0.4.0: on-device document chat. Fully local document RAG in the NativeLM app: import a PDF/text source, and the app extracts → chunks → embeds (MediaPipe USE-Lite) → stores in an ObjectBox HNSW vector index → retrieves project-scoped, relevance-gated context → answers grounded with citations. Projects (notebooks) keep each chat scoped to its own sources; default chat stays general. Engine: consumer-rules.pro now keeps the MediaPipe text-embedder + Flogger + protobuf so RAG embedding survives R8 minification.
- v0.5.0: on-device OCR (scanned PDFs + images) and hybrid keyword + vector retrieval in NativeLM; tap-to-open-source citations with in-page highlight and pinch-to-zoom in the PDF viewer.
- v0.6.0: NativeLM Studio, an on-device document studio that turns a project's sources into artifacts via a map-reduce pass: Briefing, FAQ, Key Topics, Study Guide, Timeline, Mind Map, plus Audio Overview and Podcast rendered with on-device Text-to-Speech.
- Future: iOS native engine via LiteRT-LM's Swift Metal-accelerated APIs; a benchmark suite (tokens/sec, RAM ceiling, battery drain) across a device matrix.
The engine library (com.sagar:litertlm-kmp) and the NativeLM app share one version line; the latest release is v0.10.0 (see lib/build.gradle.kts and CHANGELOG.md for the full history).
litertlm-kmp/
├── lib/ ← the published library
│ ├── src/commonMain/ ← engine interfaces, ModelManager, ToolSchemaConverter
│ ├── src/androidMain/ ← LiteRT-LM JNI, MediaPipe text embedder, OEM-aware HardwareProvider
│ └── src/iosMain/ ← iOS PlatformFolders (full engine actuals — v0.3)
├── sample-app/ ← Compose Android app demonstrating the library
│ ├── src/main/kotlin/com/sagar/litertlmsample/
│ │ ├── metrics/ ← CpuMonitor, MemoryMonitor, TokenRateMonitor
│ │ ├── llm/ ← EngineHolder + SampleViewModel
│ │ └── ui/ ← Chat / FunctionCall / MetricsOverlay
│ └── local.properties.template ← model URL config; copy to local.properties
├── ARCHITECTURE.md ← module layout + design rationale
└── COMMERCIAL.md ← dual-licensing terms
Dual-licensed under the GNU Affero General Public License v3.0 (LICENSE) for open-source / research use, and under a commercial license for proprietary distribution (COMMERCIAL.md).
Copyright © 2026 Sagar Gupta.
- LiteRT-LM (Apache-2.0) — Google's on-device LLM runtime
- MediaPipe (Apache-2.0) — text-embedder bindings
- Ktor (Apache-2.0) — HTTP client for model downloads
- Okio (Apache-2.0) — streaming file I/O + SHA-256
- kotlin-inject (Apache-2.0) — compile-time DI
- Jetpack Compose (Apache-2.0) — sample-app UI
- Napier (Apache-2.0) — KMP-friendly logging



