Technologies for a Local-First MCP Host Application

This part covers the technologies I used to develop the custom host application and MCP server for my research. When selecting the stack, I prioritized freely available technologies that could run locally and without cost. At the same time, I wanted the host application to stay close to real-world usage, where users may not have sufficiently powerful hardware, so I also supported paid LLM providers that offer free trial access. I cared a lot about that constraint because I did not want the whole project to depend on ideal hardware or a paid setup from day one.

Once the architectural direction was clear, the next practical question was the stack.

Ollama

Ollama is an open-source tool that simplifies working with large language models and serves as the local inference layer in this project. It allows downloading, managing, running, and communicating with LLMs on a local machine. Interaction is possible through the system command line and through API calls.

For this project, Ollama made it possible to use multiple local models inside the system. It offers strong privacy benefits, is free to use, and is easy to set up. Its configuration also allows tuning options such as batch processing and parallel execution, which can improve performance when sufficient memory is available.

A list of available models can be found on the official Ollama website, and models can be downloaded locally with the ollama pull command.

Once a model is installed, it can be used in several ways:

CLI usage
With ollama run, a model starts in an interactive terminal session where prompts can be entered directly and responses are returned conversationally.

REST API usage
Ollama exposes a local API at http://localhost:11434/api/ that accepts POST requests. Prompts are included in the request body, similar to most commercial LLM providers.

The two main endpoints are:
- /api/generate: accepts a prompt and returns a raw generated response
- /api/chat: accepts role-based conversational context and returns a raw response

SDK usage
Using an SDK for the chosen programming language simplifies development and provides a cleaner integration surface. In this project, I used methods such as:
- agent.generate(request): accepts a prompt and returns a structured response
- agent.chat(request): accepts role-based context and returns a structured response
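
To make the REST path concrete, here is a minimal sketch of a non-streaming call to the chat endpoint described above. It is not the project's actual code: buildChatRequest, chatOnce, and the FetchLike type are hypothetical names I introduce for illustration, and the base URL assumes Ollama is listening on its default local port. The fetch function is passed in as a parameter so the sketch stays free of runtime-specific globals.

```typescript
// Message shape accepted by the chat endpoint: a role plus text content.
type ChatMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
};

// Only the request options this sketch needs.
interface RequestOptions {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal fetch-compatible signature (assumption: the runtime's fetch fits it).
type FetchLike = (
  url: string,
  init: RequestOptions
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Assumed default address of a local Ollama instance.
const OLLAMA_BASE_URL = "http://localhost:11434";

// Hypothetical helper: builds the URL and options for POST /api/chat.
function buildChatRequest(
  model: string,
  messages: ChatMessage[]
): { url: string; init: RequestOptions } {
  return {
    url: `${OLLAMA_BASE_URL}/api/chat`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks for a single JSON object instead of streamed chunks.
      body: JSON.stringify({ model, messages, stream: false }),
    },
  };
}

// Hypothetical helper: sends one chat request and returns the assistant text.
async function chatOnce(
  fetchFn: FetchLike,
  model: string,
  messages: ChatMessage[]
): Promise<string> {
  const { url, init } = buildChatRequest(model, messages);
  const response = await fetchFn(url, init);
  if (!response.ok) {
    throw new Error(`Ollama request failed with status ${response.status}`);
  }
  const data = (await response.json()) as { message: { content: string } };
  return data.message.content;
}
```

In Bun or a recent Node.js release, the global fetch can be passed as fetchFn, for example chatOnce(fetch, "gemma:2b", [{ role: "user", content: "Hello" }]). Separating request construction from sending also makes the request body easy to inspect without a running model.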

Regardless of the access method, conversational context is passed to the model as a list of messages. Each message contains a role and corresponding content. This structure allows the model to interpret not only the latest prompt, but also the broader flow and purpose of the interaction.

The main message roles are:

system: provides instructions and high-level guidance
user: represents user prompts
assistant: represents LLM responses, including any requests to call a tool
tool: represents the results of tool operations

A simplified example of a message history looks like this:

[
  {
    "role": "system",
    "content": "You are a helpful assistant that can use tools."
  },
  {
    "role": "user",
    "content": "List the files in the project directory."
  },
  {
    "role": "tool",
    "content": "{\"files\":[\"package.json\",\"src/index.ts\",\"README.md\"]}"
  },
  {
    "role": "assistant",
    "content": "The project directory contains package.json, src/index.ts, and README.md."
  }
]

Figure: Message history as JSON context.
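
The roles and message history above map naturally onto a small set of TypeScript types. The following is a sketch under my own naming, not the project's actual code: HistoryMessage, appendMessage, and toolResultMessage are hypothetical names. It shows how a host could build the history step by step without mutating earlier state.

```typescript
// The four roles that can appear in a message history.
type Role = "system" | "user" | "assistant" | "tool";

interface HistoryMessage {
  role: Role;
  content: string;
}

// Hypothetical helper: returns a new history with one more message appended,
// leaving the original array untouched.
function appendMessage(
  history: readonly HistoryMessage[],
  role: Role,
  content: string
): HistoryMessage[] {
  return [...history, { role, content }];
}

// Hypothetical helper: serializes a tool result so it can travel as content.
function toolResultMessage(result: unknown): HistoryMessage {
  return { role: "tool", content: JSON.stringify(result) };
}

const initialHistory: HistoryMessage[] = [
  { role: "system", content: "You are a helpful assistant that can use tools." },
];

const withUserPrompt = appendMessage(
  initialHistory,
  "user",
  "List the files in the project directory."
);

const toolMessage = toolResultMessage({
  files: ["package.json", "src/index.ts", "README.md"],
});

const fullHistory: HistoryMessage[] = [...withUserPrompt, toolMessage];
```

Restricting role to a union type means a typo such as "assitant" is rejected at compile time instead of silently reaching the model.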

Based on the message history, Ollama responds to new prompts with a JSON object that includes both metadata (duration, input and output token counts) and the assistant message itself, with role and content fields. That structured response is what makes downstream tool orchestration possible.

{
  "model": "gemma:2b",
  "created_at": "2025-09-30T08:19:38.867366Z",
  "message": {
    "role": "assistant",
    "content": "I am a helpful file assistant that can assist you with various tasks such as finding information, creating documents, and managing files."
  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 475632875,
  "load_duration": 90065416,
  "prompt_eval_count": 38,
  "prompt_eval_duration": 70484917,
  "eval_count": 26,
  "eval_duration": 314256666
}

Figure: Ollama response JSON returned by the gemma:2b model.
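
The metadata in that response is easier to read once converted. Ollama reports its duration fields in nanoseconds, so generation speed can be derived from eval_count and eval_duration. The sketch below types a subset of the response and computes a few summary values; ChatResponseSubset and summarizeResponse are hypothetical names of my own, and the sample values are copied from the figure above.

```typescript
// Subset of the response fields shown above. Durations are in nanoseconds.
interface ChatResponseSubset {
  model: string;
  message: { role: string; content: string };
  done: boolean;
  total_duration: number;
  prompt_eval_count: number;
  eval_count: number;
  eval_duration: number;
}

const NANOSECONDS_PER_SECOND = 1e9;
const NANOSECONDS_PER_MILLISECOND = 1e6;

// Hypothetical helper: derives human-readable metrics from the raw metadata.
function summarizeResponse(res: ChatResponseSubset) {
  return {
    totalMs: res.total_duration / NANOSECONDS_PER_MILLISECOND,
    inputTokens: res.prompt_eval_count,
    outputTokens: res.eval_count,
    // Output tokens generated per second of evaluation time.
    tokensPerSecond:
      (res.eval_count / res.eval_duration) * NANOSECONDS_PER_SECOND,
  };
}

// Values taken from the gemma:2b response in the figure.
const sampleResponse: ChatResponseSubset = {
  model: "gemma:2b",
  message: { role: "assistant", content: "I am a helpful file assistant." },
  done: true,
  total_duration: 475632875,
  prompt_eval_count: 38,
  eval_count: 26,
  eval_duration: 314256666,
};

const summary = summarizeResponse(sampleResponse);
console.log(summary);
```

For the sample above this works out to roughly 476 ms of total time and a little under 83 output tokens per second.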

TypeScript

Although Python would probably have been a simpler choice for an AI-heavy project, I deliberately chose TypeScript instead. As a typed superset of JavaScript, it gives the project static types and stronger tooling, which made the architecture easier to reason about and caught many mistakes during development rather than at runtime. Because TypeScript compiles down to JavaScript, the same code runs on any JavaScript-compatible runtime. I do not think that choice was only personal preference either; for a modular CLI with multiple moving parts, the type system paid for itself quickly.

For the MCP architecture itself, the implementation uses the official @modelcontextprotocol/sdk library. Local model interaction happens through the ollama package, while remote model providers are accessed through the ai package (Vercel AI SDK), which provides a clean abstraction layer and an agent orchestration model on top of multiple providers.

To validate message content and tool input shapes, the project uses Zod, combined with zod-to-json-schema to convert validators into the JSON Schema format expected by MCP servers. For storing and distributing remote files, I integrated Bunny.net CDN through their official @bunny.net/storage-sdk and @bunny.net/edgescript-sdk libraries.
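
To illustrate what "the JSON Schema format expected by MCP servers" means in practice, here is a dependency-free sketch. It deliberately does not call Zod or zod-to-json-schema: the schema object is written by hand to show the kind of output such a conversion produces, and validateInput is a tiny stand-in for real validation, not a replacement for it. The list_files tool and its fields are hypothetical.

```typescript
// Minimal subset of JSON Schema: flat objects with primitive fields only.
type PrimitiveType = "string" | "number" | "boolean";

interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: PrimitiveType; description?: string }>;
  required: string[];
}

// Hypothetical tool definition in the name / description / inputSchema shape
// that an MCP server advertises for each tool.
const listFilesTool: {
  name: string;
  description: string;
  inputSchema: ObjectSchema;
} = {
  name: "list_files",
  description: "List the files in a directory.",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "Directory to list" },
      recursive: { type: "boolean" },
    },
    required: ["path"],
  },
};

// Stand-in validator: returns a list of problems, empty when the input fits.
// Fields not declared in the schema are ignored.
function validateInput(
  schema: ObjectSchema,
  input: Record<string, unknown>
): string[] {
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in input)) errors.push(`missing required field "${key}"`);
  }
  for (const [key, value] of Object.entries(input)) {
    const spec = schema.properties[key];
    if (spec && typeof value !== spec.type) {
      errors.push(`field "${key}" should be ${spec.type}`);
    }
  }
  return errors;
}
```

Calling validateInput(listFilesTool.inputSchema, { path: "." }) returns an empty array, while an empty input object yields one error for the missing path field. A schema library earns its place by handling everything this stand-in skips, such as nested objects, arrays, and unions.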

Bun

Bun is the runtime, package manager, and bundler used to develop, test, and run the project. It is designed as a faster and simpler alternative to Node while staying compatible with most of the Node ecosystem.

A few characteristics make Bun especially well-suited to this kind of project:

a built-in bundler that can package source files and dependencies into a single artifact, simplifying distribution
a built-in test runner for unit and integration tests inside the same toolchain
a package manager that installs from the npm registry and is broadly compatible with Node.js, so most existing libraries and tooling work unchanged

Bun's stated goal is full Node compatibility, which made migrating and reusing existing libraries low-friction. Combined with its faster startup and execution characteristics, it felt like a good fit for a CLI-oriented agent host that needs to start quickly and run consistently.

Docker

Docker is the open-source container platform used to package and isolate the host application. It allows applications and their dependencies to be bundled into containers that behave the same way across environments, regardless of the host operating system or server configuration.

Containers simplify development, testing, and deployment, and make resource management more efficient than full virtual machines for a project of this size. In this work, Docker is used so the host application and its services can be installed and run as containers, lowering the barrier for anyone who wants to try the system without manually configuring a local environment.

Excalidraw

Excalidraw is the open-source web tool used for diagrams and visualizations across the project. It is widely used by engineers and product teams for rapid, sketch-style diagramming, and supports real-time collaboration.

Its informal visual style and low friction make it a good fit for technical documentation and architectural sketches that need to evolve as a system is being designed. All diagrams and figures in this research were drawn with Excalidraw.

Why this stack

Looking at the choices together, a pattern becomes clear. Each technology in the stack was selected to keep the system local-first, openly accessible, and easy to extend:

Ollama for local LLM inference
TypeScript for typed, maintainable agent and server logic
Bun for a fast, modern runtime and bundler
Docker for portable installation
Excalidraw for clear, lightweight diagramming

The result is a stack that lets the host application run for free on a personal machine while still being capable of integrating with paid remote providers when needed. That balance between local-first development and cloud-ready integration is what made the combination practical for the rest of the system. If I changed one assumption during the research, it was this: flexibility mattered more than theoretical stack purity.

With the stack in place, the next step was to turn the design into a working system.