MCP Agents for File Workflows: Limits and What Comes Next

An evaluation of MCP agents for file workflows, covering usability limits, model speed, tool scaling, and what still blocks day-to-day adoption.

After implementing the system and running it through a technical evaluation and a developer experience survey, the last thing left was to step back and ask the obvious question: is this actually usable?

By the end of the research, that had become the only question that really mattered.

My answer is not a clean yes or no. The architecture works, but the practical experience is still gated by infrastructure, model speed, and tool count. Those issues are addressable, but not solved yet. If I had to reduce the whole project to one sentence, it would be this: the idea is viable, but the current ergonomics are not there yet.

This post captures the closing findings from the research, the directions I would prioritize next, and the limitations that frame the whole effort.

Closing findings

The clearest barrier in the current state of the system is execution speed, which depends heavily on how the agent is run and on the underlying infrastructure.

When running locally on hardware with limited resources, both LLM startup time and per-prompt processing time were long enough to noticeably slow down the developer workflow. The cost compounded when handling several files at once: changes happened slowly enough that the workflow lost its sense of flow and progress.

Remote execution through a free tier started out reasonably stable, but degraded quickly under sustained or multi-user load. The free tier's slower behavior meant the agent could not keep up with realistic demand. A paid subscription would likely lift performance to a usable level, but that would change the basic premise of free, accessible automation that the project was built around.

The number of available MCP tools turned out to be the most decisive limiting factor. Even at twenty tools, the application became almost unusable: the model spent a disproportionate amount of time deliberating which tool and execution strategy to choose. That is not a problem unique to this system. It is a structural property of how current LLMs handle large tool surfaces inside a single context.

Putting these factors together, I see the current implementation as a prototype or experimental tool, not as a daily-use development utility. The improvements needed are clear:

  • significantly shorter startup and execution times
  • optimization for working with larger tool inventories
  • careful, intentional use of more capable cloud services where it makes sense

Until those constraints are addressed, an MCP-based agent workflow will continue to feel slower and less reliable than the established manual approaches it is meant to replace. That is the gap to close, and I think it is a product problem as much as a model problem.
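One way to approach the tool-inventory item in the list above is to narrow the catalog before the model ever sees it. The following is a minimal sketch, not the project's implementation: it assumes a hypothetical `Tool` record and uses plain keyword overlap to shortlist tools for a prompt. The tool names and the stopword list are invented for illustration.

```python
from dataclasses import dataclass

# Small, hand-picked stopword list (an assumption; tune for real prompts).
STOPWORDS = {"a", "an", "the", "to", "of", "from", "and"}


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def _words(text: str) -> set[str]:
    # Lowercase, replace non-alphanumerics with spaces, drop stopwords.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split()) - STOPWORDS


def shortlist_tools(prompt: str, tools: list[Tool], limit: int = 5) -> list[Tool]:
    """Return up to `limit` tools whose name/description overlap the prompt most."""
    prompt_words = _words(prompt)
    scored = []
    for tool in tools:
        overlap = len(prompt_words & _words(tool.name + " " + tool.description))
        if overlap > 0:
            scored.append((overlap, tool))
    # Highest overlap first; ties broken by name so the result is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [tool for _, tool in scored[:limit]]


# Hypothetical inventory, not the project's real tool names.
TOOLS = [
    Tool("upload_file", "Upload a local file to cloud storage"),
    Tool("delete_file", "Delete a file from cloud storage"),
    Tool("resize_image", "Resize an image to the given width and height"),
    Tool("list_folder", "List the contents of a folder"),
]

selected = shortlist_tools("resize the image to 200px width", TOOLS, limit=2)
print([t.name for t in selected])  # prints ['resize_image']
```

Keyword overlap is crude and will miss paraphrases, so a real system would more likely use embeddings or grouped tool namespaces. The point is only that the model chooses among a handful of candidates instead of the full inventory.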

Future development

The clearest direction for improvement is on the model side: smaller, domain-adapted LLMs. Two compression techniques are the obvious starting point: pruning, which removes weights that contribute little to the model's output, and quantization, which reduces the precision of the remaining weights, for example by going from 32-bit to 8-bit representations.

Both techniques make local execution faster and lower the hardware bar. A domain-adapted model, on top of that, understands the specific terminology, common tasks, and typical user patterns of its target use case far better than a general-purpose model of the same size. For file management, that could translate into more accurate tool selection with fewer wasted reasoning steps.
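To make the two compression techniques concrete, here is a toy sketch on a handful of made-up weights. It assumes magnitude-based pruning and symmetric 8-bit quantization, which are common variants but only two of many. Real toolchains operate on whole tensors and usually calibrate or fine-tune afterwards.

```python
def prune_by_magnitude(weights: list[float], threshold: float) -> list[float]:
    """Zero out weights whose absolute value is below `threshold`."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(values: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integer representation."""
    return [v * scale for v in values]


weights = [0.6, -0.02, 0.25, 0.01, -1.0]  # made-up values for illustration
pruned = prune_by_magnitude(weights, threshold=0.05)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)

print(pruned)     # prints [0.6, 0.0, 0.25, 0.0, -1.0]
print(quantized)  # prints [76, 0, 32, 0, -127]
```

Each quantized value fits in one byte instead of four, which is where the memory saving of going from 32-bit to 8-bit comes from. The price is a small rounding error per weight, bounded here by half of `scale`.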

Beyond the model layer, installation experience is a clear opportunity. The current process requires setting up two Docker containers and editing configuration together with environment variable files. A simple CLI installer that asked for the few key parameters and handled the rest would dramatically lower the barrier to entry, and would probably move the survey's installation score quite a bit.
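A rough sketch of what such an installer's core could look like: ask a few questions, fall back to defaults, and render an environment file. The parameter names (`LLM_PROVIDER`, `LLM_MODEL`, `LLM_API_KEY`) are hypothetical placeholders, not the project's actual variables.

```python
from typing import Callable

# Hypothetical parameters; the real project's variable names may differ.
QUESTIONS = [
    ("LLM_PROVIDER", "LLM provider", "local"),
    ("LLM_MODEL", "Model name", ""),
    ("LLM_API_KEY", "API key (leave empty for local models)", ""),
]


def collect_settings(ask: Callable[[str], str] = input) -> dict[str, str]:
    """Ask one question per parameter, using the default on empty input."""
    settings = {}
    for key, label, default in QUESTIONS:
        suffix = f" [{default}]" if default else ""
        answer = ask(f"{label}{suffix}: ").strip()
        settings[key] = answer or default
    return settings


def render_env(settings: dict[str, str]) -> str:
    """Render settings as KEY=value lines, one per setting."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Scripted answers stand in for a user at the terminal.
scripted = iter(["", "example-model", ""])
settings = collect_settings(ask=lambda _prompt: next(scripted))
print(render_env(settings), end="")
```

In interactive use, `collect_settings()` would be called without arguments so it reads from the terminal, and the rendered text would be written to the environment file the containers load. Passing `ask` as a parameter keeps the question logic testable without a terminal.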

Inside the running application, dynamic configuration would also help. Today, the only way to switch model or agent provider is to quit the CLI by typing "bye" and start it again. A natural enhancement would be commands to swap the active model, agent, or provider mid-session. Combined with richer CLI libraries for prompt handling, clearer menus, and better interactivity in the terminal UI, the developer experience could feel substantially more fluid without changing the underlying architecture at all.
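As an illustration of mid-session switching, the sketch below routes each input line either to a command handler or to the agent. Only the "bye" keyword comes from the current CLI. The `/model` and `/provider` commands and the `Session` class are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Session:
    provider: str
    model: str
    running: bool = True


def handle_line(session: Session, line: str) -> str:
    """Apply one line of user input and return the text to show the user."""
    text = line.strip()
    if text.lower() == "bye":
        session.running = False
        return "Goodbye."
    if text.startswith("/"):
        # Hypothetical commands: "/model <name>" and "/provider <name>".
        command, _, argument = text[1:].partition(" ")
        argument = argument.strip()
        if command in ("model", "provider") and argument:
            setattr(session, command, argument)
            return f"{command} set to {argument}"
        return "Usage: /model <name> or /provider <name>"
    # Anything else would be forwarded to the agent as a normal prompt.
    return f"[{session.provider}/{session.model}] would handle: {text}"


session = Session(provider="local", model="small-model")
for line in ["/model bigger-model", "list my files", "bye"]:
    print(handle_line(session, line))
```

Because the session state lives in one object, swapping the model does not require tearing down the process. The next prompt simply picks up the new value.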

Together, these directions form a practical roadmap: faster and smaller models, smoother installation, and a more interactive runtime experience. None of that is especially glamorous, but it is the kind of work that would actually make the system usable.

Limitations

Despite the demonstrated functionality, the system carries several technical and user-facing limitations worth being explicit about.

CDN providers throttle the number of API calls per second, which can interrupt transfers or temporarily make the cloud server unresponsive. Agents have to be implemented with deliberate pacing between sequential tool calls to avoid running into these limits in normal use.
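The pacing described above can be sketched as a small helper that enforces a minimum interval between consecutive calls. This is an assumption about how one might implement it, not the project's code. The rate of 20 calls per second is an arbitrary example value, and real limits depend on the provider.

```python
import time
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


class Pacer:
    """Enforce a minimum interval between the starts of consecutive calls."""

    def __init__(
        self,
        max_calls_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def run_paced(calls: Iterable[Callable[[], T]], pacer: Pacer) -> list[T]:
    """Run the calls in order, pausing between them as the pacer requires."""
    results = []
    for call in calls:
        pacer.wait()
        results.append(call())
    return results


pacer = Pacer(max_calls_per_second=20)
start = time.monotonic()
results = run_paced([lambda i=i: f"call {i}" for i in range(3)], pacer)
print(results, f"{time.monotonic() - start:.2f}s elapsed")
```

A fixed interval is the simplest option. If the provider signals throttling explicitly, backing off on those responses would complement it.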

Some external LLM providers do not stream progress information back to the client. That leads to a degraded user experience where the user has no signal about what is happening or how far along the operation is. For interactive tools, that silence is its own usability cost.

The Groq free tier caps usage at six thousand tokens per period. Both the number of MCP tools and the volume of prompts directly affect how quickly that limit is reached. Once context grows large enough, execution slows down or aborts entirely, which makes large tool catalogs especially painful on free-tier deployments.
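To see why tool count eats into that budget, the sketch below estimates how many tokens the tool definitions alone consume per request. It rests on loud assumptions: a rule of thumb of about four characters per token instead of a real tokenizer, and an invented tool schema. Treat the numbers as order-of-magnitude only.

```python
import json

TOKEN_BUDGET = 6000   # the free-tier cap mentioned above
CHARS_PER_TOKEN = 4   # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def tool_overhead(tools: list[dict]) -> int:
    """Estimated tokens spent on tool definitions in every request."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


def remaining_budget(tools: list[dict], prompt: str) -> int:
    """Tokens left for the conversation after tools and prompt are counted."""
    return TOKEN_BUDGET - tool_overhead(tools) - estimate_tokens(prompt)


def example_tool(index: int) -> dict:
    # Hypothetical schema; real tool definitions are often longer.
    return {
        "name": f"tool_{index}",
        "description": "Hypothetical file operation with a short description.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }


for count in (5, 20):
    tools = [example_tool(i) for i in range(count)]
    left = remaining_budget(tools, "Move all PDFs into the archive folder")
    print(f"{count} tools: about {tool_overhead(tools)} tokens of overhead, {left} left")
```

The overhead grows linearly with the catalog and is paid on every request, before any user prompt or tool result is added.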

Testing was conducted with a relatively narrow, content-similar set of prompts. The results provide solid signal for the patterns I exercised, but they should not be over-generalized to every possible use of the system.

Finally, prompt interpretation depends on the chosen model and the structure of the prompt itself. With ambiguous or imprecise instructions, the agent can make decisions that diverge from what the user actually wanted, including the kind of merge-instead-of-split behavior I observed during the solution test. That is not a bug. It is a property of LLM-driven agents, and a reminder that prompt design is part of the user interface.

Where this leaves the work

The research showed that an MCP-based agent for file workflows is technically viable, that the architecture composes cleanly, and that developers respond positively when the system works. It also showed that the experience is fragile in exactly the places that matter most: speed, installation, and tool scaling. In a way, that was a better result than a vague success story.

That mix of promise and friction is, for me, the most useful kind of result. It points clearly at what to build next instead of pushing me toward either declaring victory or giving up on the approach.

Taken together, the project left me with a much clearer idea of where the real value is and where the real friction still sits.

For me, the more interesting question now is not whether agents can do this work, but how to design everything around them (runtime, model selection, installation, and feedback) so that doing it actually feels good. That is the direction I want to keep pushing in.