
Choosing an LLMOps stack: LangSmith, LangFuse, LangWatch compared

Three tools, overlapping pitches, very different strengths in practice. Where each one wins, and how to pick without locking yourself in.

4 min read · llmops / observability / evaluation

Every team that ships an LLM product eventually hits the same wall: what is this thing actually doing in production? That wall is what the LLMOps tooling category exists to solve. Three names come up over and over — LangSmith, LangFuse, and LangWatch — and on the surface they look interchangeable. They are not.

Here’s an honest comparison from using all three in real projects.

What they all do well

Before the differences, the overlap matters. All three give you:

  • Tracing: capture every step of an LLM call or agent run, including inputs, outputs, retrieved context, tool calls, and latencies.
  • Prompt management: version prompts, roll back to earlier versions, and compare them side by side.
  • Datasets and evals: build evaluation sets and run automated graders against them.
  • Dashboards: latency, cost, error rate over time.

If you only need the basics, all three will work. The question is which one fits your stack, your team, and your constraints.


LangSmith

Strengths. First-party tool from the LangChain team. If you’re using LangChain or LangGraph, instrumentation is essentially free — one environment variable and you have traces. The eval framework is mature, the UI is polished, and the prompt playground is the best of the three for iterative prompt work.
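For the LangChain/LangGraph case, the wiring really is that small. A minimal sketch, assuming a LangSmith account and a recent LangChain SDK; the exact environment variable names have shifted between SDK versions, so check the current docs:

```python
# Tracing to LangSmith from LangChain: set env vars, then call the model as usual.
# Variable names below follow recent docs and may differ in older SDKs
# (e.g. LANGCHAIN_TRACING_V2 / LANGCHAIN_API_KEY / LANGCHAIN_PROJECT).
#
#   export LANGSMITH_TRACING=true
#   export LANGSMITH_API_KEY=<your key>
#   export LANGSMITH_PROJECT=my-app        # optional: groups traces by project

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# With the env vars above, this call is traced automatically: prompt, response,
# token usage and latency all appear in the LangSmith UI with no extra code.
print(llm.invoke("Summarise LLMOps in one sentence.").content)
```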

Where it bites. It’s a hosted SaaS by default. There’s a self-hosted plan, but it sits behind enterprise licensing and the operational footprint is non-trivial. Pricing is per-trace and adds up faster than people expect at scale. If you have data-residency or air-gap requirements, this is the first thing to check.

Use it when. Your stack is already LangChain/LangGraph, you’re early enough that managed > self-hosted is the right tradeoff, and you want the fastest path from zero to “we can see what’s happening.”


LangFuse

Strengths. Open-source and self-hostable by default. Docker Compose and you’re running. The SDK is framework-agnostic — works with raw OpenAI/Anthropic calls, LlamaIndex, LangChain, anything. The pricing model on their cloud version is more predictable than per-trace billing, and the OSS version is genuinely usable, not crippled-on-purpose.
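As a sketch of what "framework-agnostic" looks like in practice, here is the drop-in OpenAI client wrapper from their docs; module paths can change between SDK versions, and the host URL is a placeholder for wherever you self-host:

```python
# Assumes the `langfuse` and `openai` packages and these env vars:
#   LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
#   LANGFUSE_HOST=https://langfuse.internal.example   # placeholder: your instance
# plus OPENAI_API_KEY for the underlying model call.

from langfuse.openai import OpenAI  # drop-in replacement for openai.OpenAI

client = OpenAI()

# The wrapped client records the request, response, token usage and latency as
# a Langfuse trace: no LangChain, no decorators, no other instrumentation code.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise LLMOps in one sentence."}],
)
print(completion.choices[0].message.content)
```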

Where it bites. The eval and dataset features are solid but a step behind LangSmith on polish. Some integrations need manual wiring where LangSmith has them out of the box. If you’re not comfortable running infrastructure, the self-host advantage doesn’t matter.

Use it when. You need on-prem or self-hosted, you have a non-LangChain stack, or you want to control your own data and avoid per-trace lock-in.


LangWatch

Strengths. Strongest of the three on evaluations and quality monitoring in production. The eval library is opinionated and well-curated — hallucination detection, jailbreak detection, off-topic detection, faithfulness — and they show up in dashboards as first-class signals rather than custom metrics you have to build. Good story for non-engineers (PM, support) who need to see how the product is behaving.

Where it bites. Smaller ecosystem than the other two; some integrations are newer. The pricing model has changed a few times. If you want a one-stop shop where tracing, evals, and prompt management are all equally mature, LangSmith and LangFuse are ahead on the tracing side.

Use it when. Production quality monitoring is your top priority, you want eval signals as first-class telemetry, or you have stakeholders who need to see “is the product okay” without reading traces.


The honest matrix

|                              | LangSmith                                  | LangFuse                          | LangWatch                     |
|------------------------------|--------------------------------------------|-----------------------------------|-------------------------------|
| Best for                     | LangChain teams who want managed           | Self-hosting / framework-agnostic | Production quality monitoring |
| Hosting                      | Managed (default), self-host (enterprise)  | Self-host (default), cloud option | Cloud                         |
| Eval depth                   | Strong                                     | Solid                             | Strongest                     |
| Tracing maturity             | Strongest                                  | Strong                            | Solid                         |
| Lock-in risk                 | LangChain-coupled                          | OSS, low                          | Cloud-coupled                 |
| Time to “first useful trace” | Minutes                                    | ~1 hour (self-host)               | Minutes                       |

A pragmatic decision rule

The wrong question is “which tool is best.” The right question is which constraint dominates:

  • If your constraint is data residency: LangFuse self-hosted.
  • If your constraint is speed of integration: LangSmith.
  • If your constraint is detecting bad outputs in production: LangWatch.

It’s also fine to use two. A common pattern: LangFuse for tracing and prompt versioning (because you self-host it cheaply and it owns your raw data), plus LangWatch for the production-quality evals on top. They don’t fight each other.

What you do not need on day one

A common failure mode is picking the most sophisticated tool, configuring it for six weeks, and never actually using it. You can ship a v1 with:

  • A single tracing tool, wired into the hot path.
  • Maybe 30 labelled evaluation items.
  • One offline eval that runs in CI.
  • One alert on cost or latency anomaly.

Everything beyond that, you add when a real incident teaches you what you needed.
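
To make the "one offline eval that runs in CI" item concrete, here is a minimal, vendor-agnostic sketch; the file name, the 0.8 threshold, and the answer() hook are all hypothetical stand-ins for whatever your application exposes:

```python
# offline_eval.py: a minimal CI eval gate over a small labelled set.
# Each line of eval_set.jsonl (hypothetical): {"question": "...", "must_contain": "..."}

import json
import sys


def answer(question: str) -> str:
    """Call your LLM app here; stubbed so the script stays self-contained."""
    raise NotImplementedError("wire this to your application entry point")


def main() -> int:
    with open("eval_set.jsonl") as f:
        cases = [json.loads(line) for line in f if line.strip()]

    passed = sum(
        case["must_contain"].lower() in answer(case["question"]).lower()
        for case in cases
    )
    score = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({score:.0%})")

    # Fail the CI job if quality drops below the agreed threshold.
    return 0 if score >= 0.8 else 1


if __name__ == "__main__":
    sys.exit(main())
```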


Need help putting an LLM system into production?

Get in touch