
PocketLlama

On-device LLM inference for Android using llama.cpp. Runs GGUF models locally with CPU, Vulkan, or OpenCL backends. Built for benchmarking inference performance across backends and quantizations on Snapdragon hardware.

Group project for Mobile Application Software Development, Tsinghua University, Spring 2026.

While we have put a lot of work into it, this is still a demo and proof-of-concept project; be cautious about using it in production scenarios. The authors accept no responsibility for problems or issues caused by the app, but you are welcome to report issues or submit PRs on GitHub.


Stack

  • Language: Kotlin (app + JNI wrapper), C++ (native inference)
  • Inference: llama.cpp (vendored in lib/src/main/cpp/llama-source/)
  • Build: Gradle + CMake, Android NDK 29
  • Min SDK: 35
  • GPU backends: OpenCL, Vulkan (see Backends)
  • Model format: GGUF

Project Structure

app/                        Android app (UI, MainActivity)
lib/                        JNI wrapper library
  src/main/cpp/
    ai_chat.cpp             C++ JNI bridge into llama.cpp
    CMakeLists.txt          Native build config (backend flags)
  src/main/java/com/arm/aichat/
    InferenceEngine.kt      Public Kotlin interface
    internal/
      InferenceEngineImpl.kt  Singleton JNI wrapper
    gguf/
      GgufMetadataReader.kt   Pure-Kotlin GGUF metadata parser
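
The division of labor can be sketched as follows: Kotlin calls a handful of external functions, which ai_chat.cpp implements on top of llama.cpp. The sketch below is illustrative only; the method and library names are assumptions, not the repository's actual API (see InferenceEngine.kt for that).

// Hypothetical sketch of the Kotlin/JNI split; the real API lives in
// InferenceEngine.kt and InferenceEngineImpl.kt and may differ.
interface InferenceEngine {
    fun loadModel(path: String, gpuLayers: Int)
    fun generate(prompt: String, onToken: (String) -> Unit)
    fun unloadModel()
}

internal object InferenceEngineImpl : InferenceEngine {
    init {
        // Loads the shared library that CMake builds from ai_chat.cpp.
        System.loadLibrary("ai-chat") // assumed library name
    }

    // Native functions implemented in ai_chat.cpp via JNI.
    private external fun nativeLoadModel(path: String, gpuLayers: Int): Boolean
    private external fun nativeGenerate(prompt: String): String
    private external fun nativeUnload()

    override fun loadModel(path: String, gpuLayers: Int) {
        check(nativeLoadModel(path, gpuLayers)) { "Failed to load $path" }
    }

    override fun generate(prompt: String, onToken: (String) -> Unit) =
        onToken(nativeGenerate(prompt))

    override fun unloadModel() = nativeUnload()
}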

Building

Open in Android Studio (Hedgehog or later) or build from CLI:

./gradlew assembleDebug

Requires NDK 29. If it is not auto-detected, set the NDK path in local.properties:

ndk.dir=/path/to/ndk/29.x.x

Backends

Configured via gradle.properties:

ENABLE_VULKAN=false
ENABLE_OPENCL=true

Only one should be enabled at a time. To use CPU-only, disable both.
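
These properties presumably reach the native build as CMake flags. A minimal sketch of that wiring in Gradle Kotlin DSL, assuming llama.cpp's upstream option names GGML_VULKAN and GGML_OPENCL (the repository's CMakeLists.txt may use different names and plumbing):

// Illustrative sketch for lib/build.gradle.kts; not the repo's actual wiring.
android {
    defaultConfig {
        externalNativeBuild {
            cmake {
                // Forward the gradle.properties switches to the native build.
                arguments(
                    "-DGGML_VULKAN=${if (project.property("ENABLE_VULKAN") == "true") "ON" else "OFF"}",
                    "-DGGML_OPENCL=${if (project.property("ENABLE_OPENCL") == "true") "ON" else "OFF"}"
                )
            }
        }
    }
}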

Status on Snapdragon 8 Elite (Adreno 830):

Backend   Status
CPU       Working
OpenCL    Working — slower than CPU for token generation (bandwidth-limited at batch_size=1), faster for prefill
Vulkan    Compiles but produces incorrect output — known driver issue with GL_KHR_cooperative_matrix on Adreno 830

For OpenCL, libOpenCL.so must be present at app/src/main/jniLibs/arm64-v8a/libOpenCL.so. Pull it from your device:

adb pull /vendor/lib64/libOpenCL.so app/src/main/jniLibs/arm64-v8a/

Running

  1. Build and install the APK on your device
  2. Download or copy a .gguf model onto your device; make sure its size and quantization suit your device's memory, otherwise the app may crash from running out of memory (see the header-check sketch below)
  3. Launch PocketLlama; you can attach a debugger or watch adb logcat if needed for debugging (see below)
  4. Tap Select GGUF File and pick a local .gguf model (or download one from HuggingFace)
  5. Configure inference parameters such as offloaded GPU layers (0 = CPU only, max = all layers on GPU)
  6. Tap Load Model, then start chatting

To speed up reloads when changing configuration within the same run, models are copied to app-internal storage on first load and reused on subsequent loads.
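
If you want to sanity-check a model file before loading it, the GGUF container begins with a fixed little-endian header: the magic bytes "GGUF", a version, then tensor and metadata-entry counts. The Kotlin sketch below reads just that header; it follows the public GGUF spec (version 2 and later) and is independent of the repo's GgufMetadataReader:

import java.io.DataInputStream
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Minimal GGUF header check based on the public GGUF spec; an independent
// sketch, not the repo's GgufMetadataReader. Returns (version, tensorCount,
// metadataKvCount), or null if the file is not GGUF.
fun readGgufHeader(file: File): Triple<Long, Long, Long>? {
    DataInputStream(file.inputStream().buffered()).use { input ->
        val magic = ByteArray(4).also { input.readFully(it) }
        if (String(magic, Charsets.US_ASCII) != "GGUF") return null
        // GGUF fields are little-endian; DataInputStream is big-endian,
        // so read raw bytes and decode via a reordered ByteBuffer.
        // Note: GGUF v1 used 32-bit counts; this assumes v2+ (64-bit).
        val rest = ByteArray(4 + 8 + 8).also { input.readFully(it) }
        val buf = ByteBuffer.wrap(rest).order(ByteOrder.LITTLE_ENDIAN)
        return Triple(buf.int.toLong(), buf.long, buf.long)
    }
}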

Recommended models

Qwen3-4B quants from bartowski/Qwen_Qwen3-4B-GGUF:

  • Q8_0 — highest quality, ~4 GB
  • Q4_K_M — good balance, ~2.5 GB
  • Q2_K — smallest, ~1.4 GB

Inference Parameters

Parameter         Default  Description
GPU layers        0        Number of transformer layers offloaded to GPU
Temperature       0.3      Sampling temperature
Max reply tokens  1024     Maximum generated tokens per response
Batch size        128      Prompt processing batch size (n_ubatch)

Changing GPU layers while a model is loaded requires a reload (you will be prompted to do so).
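
In code, these settings map naturally onto a small value type. A hypothetical Kotlin shape, with defaults taken from the table above (the repository may model this differently):

// Hypothetical holder for the inference settings; defaults match the table.
data class InferenceParams(
    val gpuLayers: Int = 0,         // transformer layers offloaded to GPU
    val temperature: Float = 0.3f,  // sampling temperature
    val maxReplyTokens: Int = 1024, // cap on generated tokens per response
    val batchSize: Int = 128        // prompt-processing batch size (n_ubatch)
)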


Generation Stats

After each response, tap the details link on any assistant message to see:

  • Prefill time + tok/s (prompt processing speed)
  • Token count + generation tok/s
  • Total time + overall tok/s
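
All three rates are the same division, tokens over elapsed seconds. A hypothetical sketch of the arithmetic (field names are assumptions, not the app's actual types):

// Hypothetical timing record; tok/s is tokens divided by elapsed seconds.
data class GenerationStats(
    val prefillTokens: Int, val prefillMs: Long,
    val generatedTokens: Int, val generationMs: Long,
) {
    val prefillTokPerSec get() = prefillTokens * 1000.0 / prefillMs
    val generationTokPerSec get() = generatedTokens * 1000.0 / generationMs
    val overallTokPerSec get() =
        (prefillTokens + generatedTokens) * 1000.0 / (prefillMs + generationMs)
}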

Debugging

All llama.cpp log output is tagged ai-chat:

adb logcat -s "ai-chat"

To confirm GPU offload is active:

adb logcat -s "ai-chat" | grep -E "n_gpu_layers|offload"

To check GPU memory allocation:

adb shell dumpsys gpu | grep Proc
