Pico’s Programming: Building Microservices — Part 2
Book Review and Notes
Building Microservices by Sam Newman. This blog post isn’t really meant to be read front to back; it’s the collection of points from the book that I found interesting and wanted to keep in mind and share with others. See Part 1 for Chapters 1–5.
Chapter 6 — Workflow
This chapter touches on transactions and their place in a microservice architecture. After the obligatory descriptions of transactions and ACID, the author brings up sagas, a way to create multi-step business processes (a.k.a. workflows, a.k.a. long-lived transactions) in a microservice architecture.
The author discusses handling error cases when executing a saga and the options for creating undo (compensating) actions and either rolling back (backward recovery) or failing forward (waiting, retrying, manual intervention) depending on the scenario.
Warning: Just say no to distributed transactions. Conceptually, each microservice should have its own database. If you are holding locks on multiple databases while waiting for multiple services to complete, you’re gonna have a bad time.
Concept: Order the workflow so that the most volatile (failure-prone) actions happen early; that way there are fewer things to roll back if something goes wrong.
Two flavors of implementing sagas: Orchestrated and Choreographed.
Orchestrated sagas use a central orchestrator that calls each service in order via request / response and is responsible for invoking the undo action for each step if needed. Orchestrated sagas are pretty easy to understand, but they do have drawbacks. The total time to execute is the sum of each action’s execution time. If a service is down, the orchestrator has to be built to decide what to do. There is a higher degree of coupling, since the orchestrator is a client of every service. And there is a slippery slope where business logic migrates into the centralized orchestrator, turning it into a monolith that calls ‘dumb’ services doing little more than CRUD.
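Here’s a minimal sketch of the orchestrated style (my own illustration, not from the book): the service calls are stubbed with prints, and the step names are made up.

```python
# Orchestrated saga sketch: a central coordinator runs each step in order and,
# if one fails, runs the compensating (undo) actions for the steps that already
# completed. Service calls are stand-ins (prints); names are illustrative.
class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action                # forward call to a service
        self.compensation = compensation    # undo call if a later step fails

def run_saga(steps, order_id):
    completed = []
    for step in steps:
        try:
            step.action(order_id)
            completed.append(step)
        except Exception:
            # Backward recovery: undo already-completed steps in reverse order.
            for done in reversed(completed):
                done.compensation(order_id)
            raise

ORDER_SAGA = [
    SagaStep("reserve-stock", lambda oid: print(f"reserve stock for {oid}"),
                              lambda oid: print(f"release stock for {oid}")),
    SagaStep("take-payment",  lambda oid: print(f"charge card for {oid}"),
                              lambda oid: print(f"refund card for {oid}")),
    SagaStep("ship-order",    lambda oid: print(f"ship {oid}"),
                              lambda oid: print(f"cancel shipment for {oid}")),
]

run_saga(ORDER_SAGA, "order-123")
```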
Choreographed sagas are more fluid, as they use an event-based architecture. Services emit events to queues (for a single consumer) or topics (for multiple consumers). Other services subscribe to those topics and queues, process the event payload, execute their actions, and then emit their own events. Some benefits: business logic is forced to stay in the services; some services can be upgraded live while events queue up waiting to be processed; multiple services can act at the same time; and, my personal favorite, you can scale the number of service instances up and down based on queue size. Some drawbacks: event-based architectures are trickier to debug, so your tracing game has to be spot on, with a correlation (a.k.a. trace) ID inserted into each event and relevant log line. It is also hard to visualize an explicit business process the way you can with an orchestrated saga.
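And a toy, in-process stand-in for the choreographed style, again my own sketch: services subscribe to topics and emit events instead of being driven by a central orchestrator. A real system would use a broker (Kafka, RabbitMQ, etc.); the topic and event names are made up.

```python
# Choreographed saga sketch: an in-memory pub/sub so the example is self-contained.
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    # The correlation ID travels with every event so the flow can be traced later.
    for handler in subscribers[topic]:
        handler(event)

# Each "service" reacts to an event, does its work, and emits its own event.
def stock_service(event):
    print(f"[stock] reserved items for {event['order_id']} ({event['correlation_id']})")
    publish("stock-reserved", event)

def payment_service(event):
    print(f"[payment] charged card for {event['order_id']} ({event['correlation_id']})")
    publish("payment-taken", event)

subscribe("order-placed", stock_service)
subscribe("stock-reserved", payment_service)

publish("order-placed", {"order_id": "order-123", "correlation_id": "abc-123"})
```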
Chapter 7 — Build
Are you really doing Continuous Integration? Do you check in to mainline (master/main) once per day? Do you have a suite of tests to validate your changes? When the build is broken, is it the #1 priority of the team to fix it?
The author discusses the pros and cons of different branching models. For example, feature branches mean you are not continuously integrating your code with everyone else’s. The author sides with trunk-based development (like GitLab Flow on main) over feature-based development like GitFlow (master/develop/feature/bugfix). If you use long-lived branches, you’re gonna have a bad time.
As a personal experience, early in my career we used an approach with an integration period at the end of the development phase. That was a trainwreck of pre-branched releases, configuration managers, and merge conflicts galore. The end result was an integration period that was longer than the development period.
The author compares monorepo vs. multirepo (one repo per service / library). He asserts that if you are Google and can afford many engineers to maintain the tooling for a monorepo, go ahead. If you are a small group, say fewer than 20 developers, a monorepo is usually OK too. Above 20 developers you start to step on each other’s toes and probably don’t have the extra funds to hire someone to build tooling. If you don’t know where to start, the author’s default is one repo per service, plus a set of helper scripts to maintain all the repos. I’m not sold, given my experience with many repos.
Clarification: Continuous delivery is not the same as continuous deployment. Continuous delivery means every commit is built, tested, and treated as production ready, though actually deploying it may still be a manual decision. Continuous deployment goes a step further: any artifact that passes the battery of automated tests is deployed automatically.
Rule: The artifact that is built, tested, slow tested, integration tested, is the same one that is deployed. There will be no rebuilds.
The author describes an ideal build pipeline where the same build artifact passes through each stage (a toy sketch of this ‘build once’ idea follows the list):
- The artifact is compiled and ‘fast tests’ are run. (Instead of saying unit tests, the author uses ‘fast tests’ as a generic term.)
- The ‘slow tests’ are run against the artifact. (slow tests ≈ integration tests)
- Performance Tests
- Production
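A toy sketch of the ‘build once, promote the same artifact’ idea (not from the book; the stage functions are stubs): each stage receives the identical artifact, verified by digest, rather than rebuilding it.

```python
import hashlib
from pathlib import Path

def build_artifact() -> Path:
    artifact = Path("service-1.0.0.tar.gz")
    artifact.write_bytes(b"compiled service bytes")   # stand-in for a real build
    return artifact

def digest(artifact: Path) -> str:
    return hashlib.sha256(artifact.read_bytes()).hexdigest()

# Stage stubs mirroring the list above.
def run_fast_tests(artifact: Path) -> None: ...
def run_slow_tests(artifact: Path) -> None: ...
def run_performance_tests(artifact: Path) -> None: ...
def deploy_to_production(artifact: Path) -> None: ...

if __name__ == "__main__":
    artifact = build_artifact()
    expected = digest(artifact)
    for stage in (run_fast_tests, run_slow_tests,
                  run_performance_tests, deploy_to_production):
        # Enforce the "no rebuilds" rule: the bytes must not change between stages.
        assert digest(artifact) == expected, "artifact changed between stages!"
        stage(artifact)
```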
Chapter 8 — Deployment
This chapter was a bit of a slog as there were a lot of things to go over. Deployment of microservices is not straightforward as there are a lot of options.
Considerations:
- How many instances of a service should you deploy? Is it static or dynamic?
- What about deploying to multiple regions to handle catastrophic failures?
- How do you deploy your database? One master that receives writes and others that are read replicas?
- How many environments? Dev, QA, Staging, Production? How expensive is it to replicate the full production environment?
Principles of Microservice Deployment:
- Isolated Execution: One service instance should not impact other instances running nearby.
- Focus on automation: More microservices require more automation. Automation needs to be part of the culture.
- Infrastructure as Code: Store environment creation in version control so it can be recreated exactly.
- Zero-downtime deployment: Ensure that deploying a new version of a service does not result in downtime for other services or users.
- Desired State Management: Use a platform that maintains your services in a defined state. Self-repairing in case of outages or traffic increases.
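To make ‘desired state management’ concrete, here’s a toy reconciliation loop (my own illustration, not the book’s): compare the declared state to what is actually running and converge. Real platforms like Kubernetes or Nomad do this for you.

```python
import time

desired_state = {"order-service": 3}   # e.g. declared in version control
running = {"order-service": 1}         # what the platform currently observes

def reconcile():
    for service, want in desired_state.items():
        have = running.get(service, 0)
        if have < want:
            running[service] = have + 1
            print(f"starting {service} instance ({have + 1}/{want})")
        elif have > want:
            running[service] = have - 1
            print(f"stopping {service} instance ({have - 1}/{want})")

for _ in range(3):      # in reality this loop runs forever
    reconcile()
    time.sleep(0.1)     # stand-in for a real reconcile interval
```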
This wasn’t part of the chapter, but I’ve been exposed to 12-factor apps, which overlap with some of the author’s principles.
Deployment destination options (where you deploy the services):
- Physical machine (no virtualization)
- Virtual machines
- Container (runs on Physical or Virtual Machine or is managed by a container orchestration framework like Kubernetes)
- Application container (remember JEE?)
- Platform as a Service (PaaS)
- Function as a Service (FaaS)
The author goes through each of these options and their tradeoffs, and he declares a winner: Kubernetes. Kubernetes comes with a ton of operational overhead, so it may be worth outsourcing that to a managed offering. That said, the author also says “if it ain’t broke, don’t fix it”: if you have something that is working for your organization, or a simpler solution will do the job, go with that. There is no shame in not adopting Kubernetes in favor of something simpler.
Progressive Delivery
The author describes multiple ways to roll out new versions of software without breaking the services that depend on them (a small feature-toggle / canary sketch follows the list).
- Feature toggles (enable / disable features with configuration)
- Canary release (select a certain category of users to get directed to the new services)
- Parallel run (run both the old and new services and make sure they return the same responses); route more and more traffic to the new service as it proves to be stable.
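Here’s that sketch of a feature toggle combined with percentage-based canary routing (my own illustration; the flag name, percentage, and hashing scheme are made up):

```python
import hashlib

FEATURE_FLAGS = {
    "new_recommendation_engine": {"enabled": True, "canary_percent": 10},
}

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user so they get a consistent experience."""
    config = FEATURE_FLAGS.get(flag, {"enabled": False, "canary_percent": 0})
    if not config["enabled"]:
        return False
    # Hash the user ID into a bucket 0-99; buckets below the canary percentage
    # get the new behavior.
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < config["canary_percent"]

def recommendations(user_id: str) -> str:
    if is_enabled("new_recommendation_engine", user_id):
        return "call the new recommendation service"   # canary path
    return "call the old recommendation service"       # default path

print(recommendations("user-42"))
```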
Resources:
- Pulumi — Infrastructure as Code (IaC), general purpose
- Nomad — desired state management that is not Kubernetes
Chapter 9 — Testing
The author presents different models of thinking about tests.
Brian Marick’s testing quadrant
1. Acceptance Testing: Business Facing; Supports Programming. Did we build the right thing?
2. Exploratory Testing: Business Facing; Critiques product. Usability: How can I break the system?
3. Unit Testing: Technology Facing; Supports Programming. Did we build it right?
4. Property Testing: Technology Facing; Critiques product. Response time; scalability; performance; security.
Mike Cohn’s test pyramid.
- UI Tests — Increased scope of tests, increased confidence, fewest number of tests, requires a full deployment, slower
- Service tests — Test a service in isolation; scoped to the behavior of a single exposed service; require dependent services to be running or stubbed out.
- Unit Tests — Smallest scope, largest number of tests, and fast! Tests individual methods.
Takeaway: The test pyramid is the ideal, but a lot of testing ends up looking like a snowcone. There are few unit tests at the bottom (which run fast) and a lot of UI and Service tests (which run slow). This tends to make for a very slow feedback process for developers as they wait… and wait… and wait for tests to finish.
Takeaway: Everyone’s definition of tests seems to be different, make sure those terms are defined so folks are sharing a common vocabulary.
The author discusses end-to-end tests and has opinions. End-to-end tests tend to be flaky and brittle, since a change to any component in the chain can break the test. Because they require every service to be running, end-to-end tests consume a lot of resources and pull away from the ideal that each service is independently deployable. It also raises the questions: who writes and maintains these end-to-end tests? The people who write the code, the people most invested in it working, or a separate test organization that lacks the background on the use cases? How long can end-to-end tests run? 1 hour? 1 day? 1 week? Does a failing end-to-end test stop all future deployments of every service?
Concept: Contract tests and consumer-driven contracts (CDCs). Consider how a service can break its consumers. One way is a structural change: the service’s API changes shape and no longer works for the consumers. This can be caught quickly if schemas and validation are in use. The other way is a semantic change: the behavior changes but the API does not. This is where CDCs come into play. A CDC is a test that encapsulates how a consumer uses a service and what it expects back. The CDC runs in the producing service’s test suite, so if the CDC test fails, the developers know they are about to break a downstream consumer. I kind of like this idea, but it seems a little crunchy. Who writes the test? How do you agree on the contract?
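To make the idea concrete, here’s a minimal hand-rolled consumer-driven contract check (a sketch, not how Pact or Spring Cloud Contract actually work). It assumes a hypothetical order service reachable at ORDER_SERVICE_URL; in practice you’d use a tool like Pact from the resources below.

```python
import requests  # pip install requests

ORDER_SERVICE_URL = "http://localhost:8080"   # hypothetical provider endpoint

# The consumer's expectations: the fields (and types) it actually relies on.
CONSUMER_EXPECTATIONS = {
    "id": str,
    "status": str,
    "total_cents": int,
}

def test_order_contract():
    """Run in the provider's test suite so a breaking change fails their build."""
    response = requests.get(f"{ORDER_SERVICE_URL}/orders/123")
    assert response.status_code == 200
    body = response.json()
    for field, expected_type in CONSUMER_EXPECTATIONS.items():
        assert field in body, f"consumer expects field '{field}'"
        assert isinstance(body[field], expected_type), f"'{field}' should be {expected_type.__name__}"
```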
Production testing includes: Canary releases, Smoke Tests, Fake User Behavior
Testing Metrics
Mean Time to Repair (MTTR) or Mean Time Between Failures (MTBF)? The author points out that which metric you prefer depends on whether you are a continuous releaser or a periodic releaser. The continuous releaser optimizes for MTTR, fixing bugs as they arise, which fosters a fail-forward attitude. The periodic releaser (weekly, monthly, semi-annually) focuses on MTBF: since deployments are infrequent, there is a heavy emphasis on regression testing and eliminating bugs before they are released.
Cross-functional tests (a.k.a. nonfunctional requirements): latency, number of concurrent users supported, accessibility, security, service level objectives. These CFRs are, for the most part, only testable in production. Some performance tests also fall under CFRs: a test of one service may actually exercise a chain of calls (Service A calls Service B, which calls Service C), and those latencies add up into unresponsiveness for the user.
Summary: Optimize for fast feedback. Avoid end-to-end tests in favor of Consumer Driven Contracts. Beware the testing Snowcone. Understand the tradeoff between MTTR and MTBF and where efforts should be spent.
Resources:
https://pact.io — consumer-driven contract testing tool
Spring Cloud Contract — another contract testing tool, but limited to JVM-based systems.
Chapter 10 — From Monitoring to Observability
Monitoring is something you do. Observability is the extent to which you can understand what the system is doing from its external outputs. Logs, events, and metrics might help you make the system observable, but be sure to focus on making it understandable rather than just throwing in lots of tools.
Building blocks of Observability:
- Log aggregation — author notes: a log aggregation tool is a prerequisite for implementing a microservice architecture.
- Metrics aggregation: raw numbers from the microservices and infrastructure to help detect problems, drive capacity planning, and perhaps trigger scaling
- Distributed tracing — tracking a flow of calls across multiple microservices
- Are you doing OK? Error budgets, SLAs, SLOs, to make sure you’re meeting the needs of your customers
- Alerting: What should you alert on? What does a good alert look like?
- Semantic Monitoring: Thinking differently about the health of a system. What information should wake us up at 3am to address it?
Warning — Alert fatigue: An overload of alerts was a contributing factor in the Three Mile Island nuclear accident. When there are too many alerts, they stop being meaningful.
Alerts shall be:
- Relevant: make sure the alert is of value
- Unique: not a duplicate of another alert
- Timely: quick enough to make use of it
- Prioritized: which alert should be addressed first
- Understandable: Clear and readable
- Diagnostic: Clear what is wrong
- Advisory: Outlines what actions need to be taken
- Focusing: Draw attention to the most important issues.
Log Aggregation:
- Don’t make the logs a dumping ground, they need to be useful.
- Add correlation IDs as soon as possible (a small logging sketch follows this list).
- Timestamps in logs from different services may not line up (clocks skew between machines), so you cannot fully trust log ordering across services. That is one reason distributed tracing is important.
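A minimal sketch of stamping a correlation ID onto every log line with Python’s standard logging module (my own illustration; the header name and service name are made up):

```python
import logging
import uuid

logging.basicConfig(
    format="%(asctime)s %(levelname)s [correlation_id=%(correlation_id)s] %(message)s",
    level=logging.INFO,
)

def get_logger(correlation_id: str) -> logging.LoggerAdapter:
    # The adapter injects the correlation ID into every record it emits.
    return logging.LoggerAdapter(logging.getLogger("order-service"),
                                 {"correlation_id": correlation_id})

# On an incoming request: reuse the caller's ID if present, otherwise mint one.
incoming_header = None   # e.g. request.headers.get("X-Correlation-ID")
correlation_id = incoming_header or str(uuid.uuid4())

log = get_logger(correlation_id)
log.info("order received")            # every line now carries the same ID
log.info("calling payment service")   # pass the ID along in outbound calls too
```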
Metrics Aggregation:
The author discusses low-cardinality versus high-cardinality metrics. The gist: as you add more and more tags (labels) to a metric, a system built for low-cardinality data like Prometheus creates an additional time series for every unique combination of label values, which increases storage costs and slows down queries. Storing the metrics themselves is dirt cheap; storing lots of high-cardinality tags gets expensive.
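A short sketch of how label (tag) cardinality multiplies time series, using the official prometheus_client Python library (the metric and label names are mine):

```python
from prometheus_client import Counter, start_http_server  # pip install prometheus-client

# Each unique combination of label values becomes its own time series:
# 5 methods x 50 paths = 250 series. Add a user_id label with a million users
# and you have hundreds of millions of series -- the cardinality trap.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled by this service",
    ["method", "path"],   # keep labels low-cardinality; never use user IDs here
)

def handle_request(method: str, path: str) -> None:
    REQUESTS.labels(method=method, path=path).inc()

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    handle_request("GET", "/orders")
    handle_request("POST", "/orders")
```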
Distributed Tracing:
The author notes that you need spans implemented in the services, and then you need to export them to a central tool like Jaeger for viewing. His advice: adopt OpenTelemetry.
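A minimal example of creating spans with the OpenTelemetry Python SDK, printing them to the console (a sketch; in a real deployment you’d swap the console exporter for an OTLP exporter pointed at a collector or Jaeger):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")   # tracer name is illustrative

with tracer.start_as_current_span("handle_order") as span:
    span.set_attribute("order.id", "123")
    # A nested span; across services the trace context would be propagated in
    # HTTP headers so the child span lands in the same trace.
    with tracer.start_as_current_span("charge_payment"):
        pass   # call the payment service here
```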
Semantic Monitoring:
Is the system behaving the way we expect? Are the rates consistent and are the errors within acceptable limits?
Testing in Production (revisited)
- synthetic transactions a.k.a. fake user interactions
- A/B testing: roll out two versions of the service and see which works “better”, e.g., fewer errors, more user adoption or interaction.
- Canary release: roll out the new service to a portion of users and check for differences in error rates, etc.
- Parallel run: execute two equivalent implementations with the same functionality side by side. Multiplex the inputs to both, verify they produce the same outputs, and then judge the performance (a toy sketch follows this list).
- Smoke Tests: run tests against production to make sure things are working.
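Here’s that toy sketch of a parallel run (my own; old_price / new_price stand in for real service calls): multiplex the same request to both implementations, trust the old one’s answer, and record any mismatches.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("parallel-run")

def old_price(order_total_cents: int) -> int:
    return order_total_cents + 500            # legacy shipping calculation

def new_price(order_total_cents: int) -> int:
    return order_total_cents + 500            # candidate implementation

def price_with_parallel_run(order_total_cents: int) -> int:
    trusted = old_price(order_total_cents)    # the old path stays the source of truth
    try:
        candidate = new_price(order_total_cents)
        if candidate != trusted:
            log.warning("parallel run mismatch: old=%s new=%s", trusted, candidate)
    except Exception:
        log.exception("new implementation failed; still serving the old result")
    return trusted

print(price_with_parallel_run(2000))
```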
Selecting Tools
If your tools are so hard to work with that only experienced operations people can use them, you limit the number of people who can participate in production activities. Likewise, if you pick tools so expensive that their use is prohibited anywhere other than critical production environments, developers will not get exposure to those tools until it is too late.
Resources:
honeycomb.io — observability
lightstep.com (ServiceNow Cloud Observability) — observability
datadoghq.com — observability, log management
fluentd.org — log forwarding agent
elastic.co/elasticsearch — document storage and search. Author thinks there are now better options as ES abandoned the open source movement after taking so much from it.
https://opensearch.org/ — Open Distro for Elasticsearch stopped at ES v7; it is now OpenSearch
humio.com (purchased by CrowdStrike) — log management
prometheus.io — metrics aggregation for low cardinality tags
jaegertracing.io — Tracing viewer
opentelemetry.io — Pick a product that is committed to supporting the OpenTelemetry API as it has broad industry support.
Closing
This post only covers chapters 6–10.