Microservices: The Costs Nobody Mentions Until You're Already Committed
Microservices solve real scaling problems. They also add distributed systems complexity that compounds at every service boundary. Understanding the real costs helps you decide when the tradeoff is worth it.
The problem microservices actually solve
Microservices are not an architecture you choose because they are modern. They are a solution to two specific problems:
- Deployment independence: different teams need to ship at different cadences without coordinating releases.
- Independent scaling: a specific component (e.g., a video transcoding service) needs resources that differ drastically from the rest of the system.
If neither of these applies to you, microservices will add complexity without delivering the benefits that justify it.
What changes at a service boundary
Moving a function call to a network call is not just a performance change — it changes the reliability model, the data consistency model, and the operational footprint of every operation that crosses the boundary.
Prerequisites
- HTTP and REST
- databases and transactions
- process isolation
Key Points
- Function calls are in-process and atomic. Network calls can fail, time out, or succeed on the server while the response is lost.
- Transactions across services require distributed coordination — two-phase commit, saga patterns, or accepting eventual consistency.
- Every new service needs its own deployment pipeline, monitoring, alerting, and on-call rotation.
- Service discovery, load balancing, and circuit breaking must be explicitly built or provided by infrastructure.
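The first key point above is worth making concrete. A sketch, using a simulated flaky call (the function names and failure model are invented for illustration): the server-side effect happens on every attempt, but the response is lost, so a naive retry loop silently repeats the side effect. This is why idempotency keys exist.

```python
class NetworkError(Exception):
    pass

def flaky_charge(amount, attempt_log, fail_times=2):
    """Simulated network call: the charge executes on the server every
    time, but the first `fail_times` responses are lost in transit."""
    attempt_log.append(amount)  # server-side effect happens regardless
    if len(attempt_log) <= fail_times:
        raise NetworkError("response lost")
    return "ok"

def charge_with_retry(amount, attempt_log, retries=3):
    """Naive retry loop with no idempotency key."""
    for _ in range(retries):
        try:
            return flaky_charge(amount, attempt_log)
        except NetworkError:
            continue
    raise NetworkError("gave up")

log = []
charge_with_retry(100, log)
print(len(log))  # → 3: one purchase became three server-side charges
```

In a real system the fix is to send a client-generated idempotency key with each logical operation, so the server can deduplicate retries.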
The distributed transaction problem
The single biggest pain point in microservices is data consistency across service boundaries. A relational database transaction gives you all-or-nothing semantics for free. Across services, you have none of that.
Consider an e-commerce checkout that spans three services: Order, Inventory, and Payment.
User clicks Buy
→ Order service creates order (status: pending)
→ Inventory service reserves items
→ Payment service charges card
→ Order service marks order (status: confirmed)
What happens if Payment succeeds but the connection to Order service drops before the status update? You have charged the customer but the order appears failed. What if Inventory reservation succeeds but Payment fails? You need to unreserve inventory.
There are three approaches, each with significant costs:
Two-phase commit (2PC): a coordinator locks all services and commits atomically. This works but the coordinator becomes a single point of failure and locks resources across services for the commit duration. Under high load, this is a scalability bottleneck. Most teams avoid it.
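The 2PC control flow can be sketched in a few lines (a toy model, not a production protocol — real implementations must also handle coordinator crashes and participant timeouts, which is where the pain lives):

```python
def two_phase_commit(participants):
    """Phase 1: every participant votes. Phase 2: commit only if all
    voted yes. Participants hold locks for the whole protocol, which is
    the scalability cost described above."""
    votes = [p.prepare() for p in participants]  # locks acquired here
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

class Participant:
    def __init__(self, ok=True):
        self.ok, self.state = ok, "idle"
    def prepare(self):
        self.state = "locked"   # resource stays locked until phase 2
        return self.ok
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

print(two_phase_commit([Participant(), Participant(ok=False)]))  # aborted
```

One "no" vote (or one unreachable participant) aborts everyone, and every participant sits locked while the coordinator waits.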
Saga pattern: each service performs its local transaction and publishes an event. If a downstream step fails, compensating transactions run in reverse. The Order service listens for payment failure and cancels the order.
Order created → [OrderCreated event]
↓
Inventory reserved → [InventoryReserved event]
↓
Payment charged → [PaymentSucceeded event]
↓
Order confirmed
Payment failed → [PaymentFailed event]
↓
Inventory unreserved
↓
Order cancelled
The saga is correct in theory. In practice you are now responsible for: designing idempotent compensating transactions, handling partial failures mid-saga, debugging state spread across multiple service logs, and dealing with the window where data is inconsistent between saga steps (a user can see "order pending" while payment is processing).
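The compensation flow above can be sketched as a minimal orchestrated saga (function names and step shapes are invented for illustration; real sagas also persist state so they survive process crashes mid-saga):

```python
def saga_checkout(order_id, steps):
    """Run each local transaction in order; on failure, run the
    compensations for completed steps in reverse."""
    compensations = []
    try:
        for action, compensate in steps:
            action(order_id)
            compensations.append(compensate)
    except Exception:
        for compensate in reversed(compensations):
            compensate(order_id)  # must be idempotent: may run twice
        return "cancelled"
    return "confirmed"

events = []

def charge_payment(order_id):
    raise RuntimeError("card declined")  # simulated downstream failure

steps = [
    (lambda o: events.append("order created"),
     lambda o: events.append("order cancelled")),
    (lambda o: events.append("inventory reserved"),
     lambda o: events.append("inventory unreserved")),
    (charge_payment, lambda o: None),
]
print(saga_checkout("order-1", steps))  # cancelled
print(events)  # compensations ran in reverse: unreserve, then cancel
```

Note what the sketch does not handle: a crash between `action` and appending the compensation, or a compensation that itself fails. Those cases are exactly the operational burden described above.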
Accept eventual consistency: for many operations this is the right answer. If Inventory updates lag by a few seconds, most users do not notice. Design your UX and business logic around consistency boundaries rather than forcing synchronous coordination.
The fallacies of distributed computing
Peter Deutsch's classic list of assumptions that engineers make when building distributed systems, all of which are wrong:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology does not change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Each service boundary you add is a new surface where these assumptions can fail independently. A monolith has one network boundary (the load balancer). Ten services have ten boundaries plus the connections between them.
When services should own their data
The rule that prevents most microservice disasters: each service owns its database, and no service reads another service's database directly.
This is painful because it means you cannot do a JOIN across service boundaries. A request that needs data from three services requires three API calls and manual stitching. But allowing services to read each other's databases couples their schemas — you cannot change one without coordinating with every service that reads it, which defeats the deployment independence goal entirely.
If you find yourself adding a read replica of Service A's database for Service B to query, you have coupling without the visibility of a formal API contract. This is worse than either a clean API dependency or a monolith.
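What "no cross-service JOIN" looks like in practice: the caller stitches results from each service's API. A sketch with hypothetical client functions standing in for real HTTP calls:

```python
# Instead of: SELECT ... FROM orders JOIN users JOIN inventory ...
# each service exposes an API and the caller assembles the view.

def get_order(order_id):
    """Hypothetical Order service client (stands in for an HTTP call)."""
    return {"id": order_id, "user_id": 7, "sku": "A1"}

def get_user(user_id):
    """Hypothetical User service client."""
    return {"id": user_id, "name": "Dana"}

def get_stock(sku):
    """Hypothetical Inventory service client."""
    return {"sku": sku, "available": 3}

def checkout_view(order_id):
    order = get_order(order_id)
    user = get_user(order["user_id"])   # second round trip
    stock = get_stock(order["sku"])     # third round trip
    return {"order": order["id"], "user": user["name"],
            "in_stock": stock["available"] > 0}

print(checkout_view(42))
```

Three round trips instead of one query is the cost; the benefit is that each service can change its schema without breaking the others, as long as the API contract holds.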
The operational reality
Every service you add requires:
- A CI/CD pipeline (build, test, deploy, rollback)
- Infrastructure: containers, load balancer, health checks, autoscaling
- Observability: metrics, logs, traces — and crucially, traces that span services
- An on-call rotation that understands the service's failure modes
Distributed tracing (Jaeger, AWS X-Ray, Datadog APM) becomes non-optional once requests span more than two services. Without it, debugging a slow request means tailing logs across five different services and correlating timestamps manually.
Request: GET /checkout (800ms total)
→ Order service: 50ms
→ Inventory service: 20ms
→ Payment service: 700ms ← that's the problem
→ External card processor: 650ms
With tracing, you get this waterfall in one view. Without it, you are guessing.
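The mechanism behind that waterfall is trace-context propagation: every outgoing call carries the same trace ID, so spans from different services can be joined into one view. A simplified sketch (real systems use the W3C `traceparent` header and a library such as OpenTelemetry rather than a hand-rolled `x-trace-id`):

```python
import uuid

def handle_request(headers, downstream_calls):
    """Reuse an incoming trace id, or start a new trace at the edge,
    and propagate it on every downstream call."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    spans = []
    for service, handler in downstream_calls:
        # the same trace id goes out on every call this request makes
        result = handler({"x-trace-id": trace_id})
        spans.append({"trace_id": trace_id, "service": service,
                      "result": result})
    return trace_id, spans

calls = [("inventory", lambda h: "reserved"),
         ("payment", lambda h: "charged")]
trace_id, spans = handle_request({}, calls)
assert all(s["trace_id"] == trace_id for s in spans)
```

Because every span shares the trace ID, the tracing backend can reassemble the request path and show which hop consumed the 700ms.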
Monolith vs microservices
The choice depends on team size, deployment needs, and whether you have the operational infrastructure to support distributed services.
Monolith:
- One deployment unit — all teams release together
- In-process function calls, ACID transactions, no network failures between components
- One database, one codebase, one set of logs
- Scales vertically easily; horizontal scaling requires stateless design
- Coupling between modules grows without discipline — but this is a team problem, not an architecture problem

Microservices:
- Independent deployments — teams ship without coordinating
- Network calls between services: latency, timeouts, partial failures
- Each service has its own data store, deployment, and observability
- Each component scales independently
- Requires distributed tracing, saga/eventual consistency, and a service mesh or explicit circuit breaking
Start with a monolith. Extract services when you have a concrete deployment independence or scaling problem that the monolith cannot solve without slowing the whole team. The question is not "are microservices better?" but "do we have the specific problem microservices solve, and do we have the engineering capacity to operate them?"
Service boundary decisions
Bad service boundaries are worse than no microservices at all. A common mistake is splitting by technical layer (a separate service for the data access layer, a separate service for business logic) rather than by domain capability.
Good service boundaries align with:
- Business capability: an Order service owns everything about orders. It does not just wrap an orders database — it enforces order lifecycle rules.
- Team ownership: the team responsible for the service has end-to-end control over its behavior and can deploy without coordinating with other teams.
- Data independence: the service owns its data. If a feature requires you to frequently reach into another service's data store, the boundary is wrong.
Jeff Bezos' "two-pizza team" rule maps to this: a service should be small enough that one team can own it fully. If your microservices require a separate "microservices coordination team" to manage dependencies, you have built a distributed monolith — the worst of both worlds.
Quick check: A team splits their monolith into five microservices. After six months, deployments are more frequent but debugging production incidents takes 3x longer than before. What is the most likely architectural gap?

Scenario: The services communicate synchronously over HTTP. Each service has its own logs in separate log streams. There is no distributed tracing.

A. The services are too small and should be consolidated
Incorrect. Service size is not the root cause here. The debugging problem is an observability problem. Consolidating services would help, but the core issue is missing distributed tracing.

B. Distributed tracing was not implemented, making cross-service request paths invisible
Correct. Without trace IDs propagated across service calls, a single user request that touches five services produces five separate log entries with no shared context. Finding the slow step requires correlating timestamps across systems manually. Distributed tracing (Jaeger, Zipkin, X-Ray, Datadog APM) is not optional in a microservices architecture — it is the primary tool for understanding request paths.

C. HTTP is too slow for inter-service communication
Incorrect. HTTP latency between services in the same data center or VPC is typically under 1ms. Protocol is rarely the bottleneck in debugging time.

D. The services need a shared database to simplify queries
Incorrect. Sharing a database couples services at the schema level and reintroduces the coordination problem microservices are meant to solve. This would make the architecture worse.

Hint: Think about what information you need to debug a slow request that crosses five services, and what infrastructure provides it.