ONGOING PROJECT

LLM Inference Optimization System

A real-time content summarization system that ingests live data from NewsAPI and Twitter, then generates intelligent summaries using locally deployed LLaMA 2 (7B) via Ollama — with an emphasis on solving practical LLM deployment challenges including context window management, inference efficiency, batching, and caching.

NLP LLMs LLaMA 2 (7B) LoRA PyTorch FastAPI NewsAPI Twitter API Ollama

Project Status

This project is currently under active development.

The system is being designed, prototyped, and iteratively improved, with a focus on identifying real engineering bottlenecks and resolving them with principled solutions. The project centers on LLM deployment engineering and system-level optimization, targeting practical performance improvements for real-world usage.

Project Objective

The goal of this project is to build a production-ready text summarization platform that can:

  • Collect real-time textual data from NewsAPI (news articles) and Twitter API (social posts)
  • Generate concise, coherent summaries using LLaMA 2 (7B parameters) served locally via Ollama
  • Optimize inference performance through token-aware truncation, intelligent batching, and response caching strategies
  • Provide sentiment insights alongside generated summaries

System Architecture

Data Pipeline

  • Real-time content ingestion from NewsAPI (news articles) and Twitter API (social posts)
  • Text preprocessing: cleaning, normalization, and tokenization
  • Deduplication and basic entity extraction
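As a rough sketch, the deduplication step could hash normalized text so near-identical posts collapse to one entry. This assumes SHA-256 over lowercased, whitespace-collapsed content; the function names are illustrative, not taken from the actual codebase:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical posts hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())

def dedupe(items: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct (normalized) text."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in items:
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```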

Modeling Approach

  • LLaMA 2 (7B) as the core foundation model for abstractive summarization, served locally via Ollama
  • Optional LoRA adapters for parameter-efficient domain adaptation (planned future extension)
  • Separate sentiment analysis module using transformer-based classifiers

Backend & API

  • FastAPI-based backend for model inference and request handling
  • REST endpoints for summarization and sentiment analysis
  • Dockerized deployment for reproducibility

Engineering Challenges & Solutions

During development, two significant bottlenecks were identified and addressed:

Problem 1 — Memory & Context Window Limitations

The system operates with a fixed 2048-token context window, which restricts the amount of input text that can be processed at once. Instead of using a token-aware truncation strategy, the initial implementation simply took the first 1500 characters — causing important content to be discarded. Large posts were often cut mid-sentence, harming coherence and semantic continuity. There was no intelligent chunking mechanism to handle long-form inputs properly.

Solution

  • Replaced character-based truncation with a token-aware truncation strategy using the model's actual tokenizer, respecting the 2048-token limit without cutting mid-sentence
  • Implemented a sliding window chunking mechanism that splits long documents into overlapping segments, summarizes each, then merges outputs — preserving semantic continuity
  • Added sentence-boundary detection so truncation always ends at a clean sentence, improving input coherence for LLaMA 2
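The three steps above can be sketched as follows. A whitespace token count stands in for the model's actual tokenizer (the real system would count LLaMA 2 tokens), and the sentence splitter, function names, and overlap size are illustrative assumptions:

```python
import re

def sentences(text: str) -> list[str]:
    """Naive sentence-boundary detection on ., !, ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def token_count(text: str) -> int:
    """Stand-in for the model tokenizer; counts whitespace-separated tokens."""
    return len(text.split())

def truncate_to_tokens(text: str, limit: int) -> str:
    """Keep whole sentences until adding the next one would exceed the limit."""
    kept, used = [], 0
    for sent in sentences(text):
        n = token_count(sent)
        if used + n > limit:
            break
        kept.append(sent)
        used += n
    return " ".join(kept)

def sliding_chunks(text: str, limit: int, overlap: int) -> list[str]:
    """Split into overlapping sentence windows, each within the token limit."""
    sents = sentences(text)
    chunks, start = [], 0
    while start < len(sents):
        end, used = start, 0
        while end < len(sents) and used + token_count(sents[end]) <= limit:
            used += token_count(sents[end])
            end += 1
        if end == start:  # single sentence over the limit: keep it whole anyway
            end += 1
        chunks.append(" ".join(sents[start:end]))
        if end >= len(sents):
            break
        start = max(end - overlap, start + 1)  # re-use trailing sentences for continuity
    return chunks
```

Each chunk would then be summarized independently and the partial summaries merged in a second pass.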

Problem 2 — No Batching or Caching of Summaries

Each summary request was sent directly to Ollama without checking for previously processed results. Identical queries triggered a full re-summarization from scratch, leading to redundant computation and high latency. Multiple summaries were also processed sequentially rather than in batches, compounding the inefficiency.

Solution

  • Introduced an in-memory cache (with optional Redis backend) keyed by a hash of the input text — repeated queries are served instantly without re-invoking the model
  • Implemented batch processing so multiple summarization requests are grouped and sent to Ollama together, reducing total inference time significantly
  • Added TTL-based cache expiry to ensure stale summaries are refreshed for time-sensitive content from live APIs
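A minimal sketch of the caching side, assuming an in-memory dict keyed by a SHA-256 of the input text with an injectable clock for TTL expiry. The Redis backend and the real Ollama call are out of scope here; `summarize_fn` is a placeholder for the model invocation:

```python
import hashlib
import time
from typing import Callable

class TTLCache:
    """In-memory summary cache keyed by a hash of the input text."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text: str,
                       summarize_fn: Callable[[str], str]) -> str:
        """Return a cached summary if still fresh; otherwise compute and store it."""
        key = self._key(text)
        entry = self._store.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl:
            return entry[1]  # cache hit: skip re-invoking the model
        summary = summarize_fn(text)
        self._store[key] = (self.clock(), summary)
        return summary
```

Batching then amounts to collecting cache-miss requests into a queue and flushing them to the model as a group once the queue fills or a short timeout elapses.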

Current Progress

  • Literature review on abstractive summarization and LoRA fine-tuning completed
  • Data ingestion pipeline from NewsAPI and Twitter API operational
  • LLaMA 2 (7B) integrated via Ollama for local inference
  • Context window and batching/caching issues identified and resolved
  • FastAPI backend scaffolded with summarization and sentiment endpoints

Planned Validation & Performance Assessment

Once the system stabilizes, the platform will be validated using:

  • ROUGE metrics for internal summarization quality verification
  • BERTScore for semantic coherence assessment
  • Latency measurements before and after caching implementation
  • Throughput comparison with and without batch processing
  • Memory usage and resource profiling during inference
  • Qualitative analysis of summaries generated from live NewsAPI and Twitter content
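To illustrate the first metric, here is a minimal ROUGE-1 recall sketch using only the standard library. Production evaluation would use an established package such as `rouge-score`; this just shows the metric's definition (fraction of reference unigrams recovered by the candidate summary):

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: overlapping unigram count / reference unigram count."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())
```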

Future Roadmap

  • Complete LoRA fine-tuning on domain-specific summarization datasets
  • Integrate sentiment analysis into the summarization pipeline
  • Add multi-source topic clustering before summarization
  • Deploy a working demo with public API access
  • Extend support to multilingual content