scaleinfer

Modular ML inference pipeline for real-time recommendation systems.

Companion open-source implementation for the paper:

Scalable ML Inference for Real-Time Recommendations.
International Journal of Information Technology and Computer Engineering (IJITCE),
Volume 13, Issue 4, 2025. ISSN 2347-3657. DOI: 10.62647/IJITCE2025V13I4PP1-8

Overview

Building a recommendation system that scores items in real time requires assembling several components: candidate retrieval, feature assembly, model scoring, and re-ranking. This library provides each stage as an independent, pluggable component and an orchestrator that chains them with per-stage latency tracking and an optional latency budget.

The architecture is based on production systems serving millions of users at sub-100ms latency, including the Intuit SER system (97% latency reduction from 1.2s to 40ms) and Expedia's ranking pipeline.
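
The chaining of stages with per-stage latency tracking and an optional budget can be illustrated without the library. The following is a minimal sketch, not the actual `scaleinfer.pipeline` implementation: `run_stages`, `StageTimings`, and the injectable `clock` parameter are hypothetical names chosen for this example, and the budget is only checked after the fact rather than enforced.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

# A stage is a (name, callable) pair; each callable takes the previous output.
Stage = Tuple[str, Callable[[Any], Any]]


@dataclass
class StageTimings:
    """Wall-clock time spent in each stage, in milliseconds."""
    per_stage_ms: Dict[str, float] = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        return sum(self.per_stage_ms.values())


def run_stages(
    stages: List[Stage],
    payload: Any,
    budget_ms: Optional[float] = None,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Any, StageTimings, bool]:
    """Run stages in order, timing each one.

    Returns the final payload, the timings, and a flag that is True when a
    budget was given and the summed stage time exceeded it.
    """
    timings = StageTimings()
    for name, fn in stages:
        start = clock()
        payload = fn(payload)
        timings.per_stage_ms[name] = (clock() - start) * 1000.0
    over_budget = budget_ms is not None and timings.total_ms > budget_ms
    return payload, timings, over_budget


if __name__ == "__main__":
    # Toy stages standing in for retrieve -> score -> rank.
    toy_stages = [
        ("retrieve", lambda q: [q, q + 1, q + 2]),
        ("score", lambda items: [(i, i * 0.1) for i in items]),
        ("rank", lambda scored: sorted(scored, key=lambda s: -s[1])),
    ]
    ranked, timings, over = run_stages(toy_stages, 1, budget_ms=100.0)
    print(ranked, timings.per_stage_ms, over)
```

Passing the clock as a parameter keeps the timing logic testable with a fake clock. Whether the library reports a budget overrun, truncates work, or raises is not stated here, so the flag above is only one possible design.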

Modules

Module	Purpose
`scaleinfer.retrieve`	Candidate generation via cosine similarity or precomputed lists
`scaleinfer.features`	Thread-safe feature store with static hydration and dynamic computation
`scaleinfer.score`	Model scoring with LRU caching and pluggable score functions
`scaleinfer.rank`	Score-based ranking with optional MMR diversity
`scaleinfer.pipeline`	End-to-end orchestrator with per-stage latency tracking
`scaleinfer.optimize`	Pipeline profiler with p50/p95/p99 breakdown per stage
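
To make the first row of the table concrete, here is a minimal sketch of brute-force candidate generation by cosine similarity. It is a generic NumPy illustration, not the code behind `scaleinfer.retrieve`; the function name `top_k_cosine` and its signature are assumptions made for this example.

```python
import numpy as np


def top_k_cosine(query, item_embeddings, item_ids, k=10):
    """Return the k items most similar to the query as (item_id, similarity).

    query: 1-D vector of length d.
    item_embeddings: 2-D array of shape (n_items, d), one row per item.
    item_ids: sequence of n_items identifiers, aligned with the rows.
    """
    q = np.asarray(query, dtype=np.float64)
    m = np.asarray(item_embeddings, dtype=np.float64)

    # Cosine similarity = dot product divided by the product of the norms.
    # The small floor avoids division by zero for all-zero vectors.
    denom = np.maximum(np.linalg.norm(m, axis=1) * np.linalg.norm(q), 1e-12)
    sims = (m @ q) / denom

    k = min(k, len(item_ids))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(item_ids[i], float(sims[i])) for i in order]


if __name__ == "__main__":
    ids = ["a", "b", "c"]
    emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k_cosine([1.0, 0.0], emb, ids, k=2))
```

A full sort is fine for small catalogs. For large ones, `np.argpartition` or an approximate nearest-neighbor index is the usual replacement.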

Installation

pip install scaleinfer

Or with uv:

uv add scaleinfer

Quick Start

import numpy as np
from scaleinfer.retrieve.backends import InMemoryRetriever
from scaleinfer.features.assembler import FeatureStore, FeatureAssembler
from scaleinfer.score.scorer import ModelScorer, LRUCache
from scaleinfer.rank.ranker import Ranker
from scaleinfer.pipeline.pipeline import RecommendationPipeline

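# Example inputs: placeholders, substitute your own catalog and embeddings.
# The shapes and types below are assumptions made for illustration.
item_ids = [f"item_{i:03d}" for i in range(1, 1001)]
item_embeddings = np.random.rand(1000, 64).astype(np.float32)
query_vector = np.random.rand(64).astype(np.float32)

# Build the retrieval index
retriever = InMemoryRetriever()
retriever.add_items(item_ids, item_embeddings)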

# Set up feature store
store = FeatureStore()
store.bulk_set({"item_001": {"popularity": 0.9, "recency": 0.7}})
assembler = FeatureAssembler(feature_store=store)

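# Configure scoring with caching.
# `my_model` is a placeholder for your trained model; its `predict` must match
# the score function signature that ModelScorer expects.
scorer = ModelScorer(score_fn=my_model.predict, cache=LRUCache(maxsize=50000))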

# Configure ranking
ranker = Ranker(top_k=20, diversity_weight=0.2)

# Assemble and run the pipeline
pipeline = RecommendationPipeline(
    retriever=retriever,
    feature_assembler=assembler,
    scorer=scorer,
    ranker=ranker,
    latency_budget_ms=100,
)
result = pipeline.recommend(query_vector)
print(result.summary())
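
The `diversity_weight` argument suggests a trade-off between relevance and redundancy in the MMR (maximal marginal relevance) step. The sketch below shows one common formulation of greedy MMR in plain Python. It is an assumption that `Ranker` maps `diversity_weight` to the MMR trade-off this way; `mmr_rerank` and `cosine` are hypothetical helpers written for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_rerank(
    scores: Dict[str, float],
    embeddings: Dict[str, Sequence[float]],
    top_k: int,
    diversity_weight: float,
) -> List[str]:
    """Greedy MMR: repeatedly pick the item maximizing
    (1 - w) * relevance - w * (max similarity to already selected items).
    """
    remaining = list(scores)  # a list keeps tie-breaking deterministic
    selected: List[str] = []

    while remaining and len(selected) < top_k:

        def mmr_value(item: str) -> float:
            redundancy = max(
                (cosine(embeddings[item], embeddings[s]) for s in selected),
                default=0.0,  # nothing selected yet, so no redundancy penalty
            )
            return (1 - diversity_weight) * scores[item] - diversity_weight * redundancy

        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    scores = {"a": 0.9, "b": 0.85, "c": 0.6}
    embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
    print(mmr_rerank(scores, embeddings, top_k=2, diversity_weight=0.0))
    print(mmr_rerank(scores, embeddings, top_k=2, diversity_weight=0.5))
```

With a weight of 0 the result is a pure score sort. With a weight of 0.5 the near-duplicate `b` is passed over in favor of the less relevant but dissimilar `c`.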

Profiling

from scaleinfer.optimize.profiler import PipelineProfiler

profiler = PipelineProfiler(pipeline)
report = profiler.run(query_vectors=test_queries)
print(report.summary())
print("Bottleneck:", report.bottleneck_stage())
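
For readers who want to see what a p50/p95/p99 breakdown per stage involves, the following standard-library sketch computes those percentiles from raw latency samples. It does not reproduce `PipelineProfiler`; the helper names are hypothetical, and picking the bottleneck as the stage with the highest p95 is an assumption, since the criterion used by `report.bottleneck_stage()` is not documented here.

```python
from statistics import quantiles
from typing import Dict, List


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Return p50, p95 and p99 of a list of latency samples (milliseconds).

    quantiles(..., n=100) yields the 99 cut points between percentiles 1..99,
    so index 49 is p50, index 94 is p95 and index 98 is p99.
    Requires at least two samples on most Python versions.
    """
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def summarize(stage_samples: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute the percentile breakdown for every stage."""
    return {stage: latency_percentiles(s) for stage, s in stage_samples.items()}


def bottleneck_stage(summary: Dict[str, Dict[str, float]], metric: str = "p95") -> str:
    """Name of the stage with the largest value of the chosen percentile."""
    return max(summary, key=lambda stage: summary[stage][metric])


if __name__ == "__main__":
    # Synthetic samples: "score" is twice as slow as "retrieve".
    samples = {
        "retrieve": [float(x) for x in range(101)],
        "score": [2.0 * x for x in range(101)],
    }
    summary = summarize(samples)
    print(summary)
    print("Bottleneck:", bottleneck_stage(summary))
```

Tail percentiles matter more than the mean for a latency budget, because a single slow stage at p99 can break the budget even when its median looks healthy.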

Development

uv sync --all-extras
uv run pytest tests/ -v --cov=src
uv run isort src/ tests/ && uv run black src/ tests/

Citation

If you use this library in your research, please cite the paper:

Scalable ML Inference for Real-Time Recommendations.
International Journal of Information Technology and Computer Engineering (IJITCE),
Volume 13, Issue 4, 2025. DOI: 10.62647/IJITCE2025V13I4PP1-8

License

Apache 2.0
