If the proliferation of microservices is slowing down your Java teams, this guide will show you how to restore predictability and resilience through clear responsibilities and control mechanisms.
In a mid-sized SaaS company in the healthcare sector, a recent acquisition further complicated an already outdated Java microservices architecture. Attempts to integrate with legacy services created more problems than they solved. As the load increased, the system began to fail randomly, and IaaS and external SaaS service costs skyrocketed, leading customers to threaten to terminate contracts due to instability.
Technical leaders had promised a robust microservices environment that would accelerate development cycles. But instead of delivering new features quickly, teams became bogged down in debugging failures and reactively troubleshooting problems.
This situation is not unique; it’s a common symptom of microservice sprawl. Your job as a leader is to transform this architectural style into the robust tool it was designed to be.
This guide presents three leadership strategies to achieve this goal:
- Run diagnostics to identify signs of uncontrolled microservice growth.
- Stabilise the system with strategies and standards to improve uptime.
- Optimise the architecture by creating a consolidated structure that meets the business's needs.
Each action plan helps your teams focus on the work needed to reach the next stage. The action plans overlap, because some teams will advance their services through the stages faster than others.
Figure 1: Diagnosis and stabilisation with practical steps to achieve an optimal architecture.
These guidelines provide the governance and discipline needed to build a Java microservices architecture that can be maintained and evolved stably. They also help avoid the wholesale code rewrites that inexperienced teams often resort to.
Prepare Your Teams for Action
Disruptions are never pleasant, but microservices are distributed systems, and partial failures are inevitable. Development teams are on the front lines, but the entire organisation feels the impact. Prepare your organisation for change by:
- Temporarily assigning high-level technical specialists to a system stabilisation team.
- Assigning architects to serve as the brains of this stabilisation team.
Then brief everyone on the stages that all the teams will go through:
Figure 2: Transition of an organisation from a crisis to a stable state through the application of developed action scenarios.
A Stabilisation Pod as the Way Forward
A system in crisis makes it difficult to develop new functionality effectively. Teams react to problems rather than proactively planning their work. To break out of this situation, first assign key personnel to a stabilisation team.
The stabilisation team collaborates with development teams, prioritising issues within each action plan. This approach reduces the workload, allowing the development of critical functionalities to continue and helping to optimise the development process once a stable state is achieved.
The teams will then effectively develop new functionalities, leveraging the knowledge and discipline gained from working with the action plans.
Request the Topology Diagram
Ask your architects to draw a system topology diagram. This is a test.
- If they provide the information quickly, it shows they fully understand how the system works.
- If they can't, you've made it clear that they need to develop that understanding.
You don't need to understand the topology in every detail. Use it as a clear, management-level diagram showing how the playbooks transform the system.
In this article, we'll analyse a simple topology that illustrates the typical problems encountered when scaling a microservices architecture. You'll see how the topology changes with each playbook.
Figure 3: An example topology containing all the typical problems associated with the growth of a microservices architecture.
The topology of your own system will be much more complex. Apply these playbooks to each subsystem, using a "divide and conquer" approach.
Diagnose to Find the Signs of Microservices Sprawl
Diagnosis transforms pain into actionable data, largely through tools that require little or no code changes.
Diagnose Playbook Scope and Goals
The objectives of this diagnostic guide are:
- Risk: Identification of critical services and their areas of impact.
- Reliability: Identification of saturation points that slow down transaction volume.
- Version mismatch: Identification of the most significant risks in the software bill of materials (SBOM).
- Cost: Detailed analysis of the function and cost-benefit ratio of each service.
During the diagnostic process, only three types of low-risk code changes should be made:
- Add internal metrics: Your observability and monitoring platform (OMP) already provides many useful signals. Supplement them with count and execution-time metrics for internal operations that fall outside the OMP's scope. Together, these help identify the root cause of problems.
- Add useful logs: Invest in small code changes that add or improve log clarity. Metrics show what happened; good logs explain why it happened (see the sketch after this list).
- Fix local errors: These are rarely the root cause. However, fixing them provides quick wins that reduce noise and make it easier to detect deeper problems.
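For illustration, here is a minimal sketch of the first two kinds of change using Micrometer, which is bundled with Spring Boot Actuator. The service, metric, and log details (InvoiceRecalculationService, invoice.recalculation) are hypothetical examples, not services from the topology above.

```java
// Minimal sketch: count/execution-time metric plus a clarifying log line.
// Assumes Micrometer via Spring Boot Actuator; all names are hypothetical.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class InvoiceRecalculationService {

    private static final Logger log = LoggerFactory.getLogger(InvoiceRecalculationService.class);

    private final Timer recalculationTimer;

    public InvoiceRecalculationService(MeterRegistry registry) {
        // Count and duration of an internal operation the OMP cannot see on its own.
        this.recalculationTimer = Timer.builder("invoice.recalculation")
                .description("Duration of internal invoice recalculation")
                .register(registry);
    }

    public void recalculate(String invoiceId) {
        recalculationTimer.record(() -> {
            // A log line that explains why, with the identifiers needed for root-cause analysis.
            log.info("Recalculating invoice {} after pricing update", invoiceId);
            // ... existing business logic ...
        });
    }
}
```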
Make it clear to everyone that the diagnose playbook is not the place to play redesign hero or to fix version mismatches ad hoc. Doing so can create new problems and make the root cause harder to identify.
Limit the team's actions to documenting design flaws and version discrepancies. This record becomes the basis for the later playbooks.
Risk: Find Hotspot Microservices and their Blast Radius
The first step in mitigating risks is to classify potentially dangerous services. To do this, use two reliable sources of information:
- Dashboards
- Deployment problems
Dashboards
Ask your teams to provide dashboards that show when service disruptions occur:
- If they already have good dashboards, this will help them identify problems more quickly.
- If they don't, you've made it clear that they need to build high-quality dashboards.
The dashboards should display the following metrics:
- Error Count: Errors reveal the source of the problem. If your performance monitoring system (APM) doesn’t provide this information by default, a quick, no-code solution is to create log-based metrics. To save costs, logs are often moved to cold storage after a short period. Log-based metrics generate long-term, low-cost signals that are crucial for root cause analysis.
- Latency: Slowdowns reveal the source of the problem. Your APM system will show the execution duration of common events, such as REST API calls. Include the p95 latency metric for RESTful web services. Add the duration of critical internal events to complete the picture.
- Saturation: Saturation occurs when slowdowns accumulate. This manifests as exponential growth in queue lengths, CPU utilisation, memory usage, and thread resources.
- Health check failures: These indicate that services are failing under load. Enable Spring Boot's built-in health checks, monitor the availability and status of each service instance, and have the DevOps team create alerts for them (see the sketch after this list).
- Traffic: Transaction volume puts pressure on system bottlenecks. Compare it with the previous metrics to identify patterns.
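To illustrate the health-check item above, here is a minimal sketch of a custom Spring Boot Actuator health indicator. It assumes spring-boot-starter-actuator is on the classpath; the queue-depth counter and the 10,000 threshold are hypothetical.

```java
// Minimal sketch of a custom health check that reports saturation before the service falls over.
import java.util.concurrent.atomic.AtomicLong;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class WorkQueueHealthIndicator implements HealthIndicator {

    // Hypothetical in-process queue-depth counter; in a real service the code that
    // enqueues and dequeues work would keep this up to date.
    private final AtomicLong queueDepth = new AtomicLong();

    @Override
    public Health health() {
        long depth = queueDepth.get();
        // Mark the instance DOWN when the backlog is clearly unhealthy, so dashboards and alerts see it early.
        return (depth > 10_000 ? Health.down() : Health.up())
                .withDetail("queueDepth", depth)
                .build();
    }
}
```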
Use your observability and monitoring platform (OMP) dashboards and error tracking to determine the impact (blast radius) of each failing service.
In our example, the teams used dashboards and logs to identify these issues (marked in red):
- Service S14: Fails repeatedly, affecting services S17 and S20. S14's CPU usage spikes, health checks fail, and the service restarts; S17 and S20 return 404 errors while S14 is down.
- Services S2 and S3: These services are under high load and throwing a large number of errors. Root-cause analysis shows that they return database-related errors.
Deployment Problems
Another sign of microservices sprawl is trouble during the release of new software versions. Ask questions such as:
- Do gateway updates require extended system downtime?
- Do individual service deployments cause disruptions to connections with other services?
These are more serious symptoms of uncontrolled microservices sprawl. In our example, the discussion surfaced the following problems (marked in orange):
- Gateway: Contains business logic that applies rules to all customer-facing services (1, 2, 3, 6, 7, 8). Downtime increases because the services must be stopped during gateway updates.
- Services 17, 18, 20, 21: Require configuration adjustments to rediscover each other on the network, as the newly acquired services do not use the service discovery mechanism.
Reliability: Transaction Volume and Services Under Duress
From the customer's perspective, slow service is just as damaging as a complete outage. The following key indicators reveal services under duress:
- Drastic increase in resource consumption: high CPU/memory usage, increased number of connections.
- Transaction slowdown: transaction duration increases significantly.
- Queue growth: message queues grow exponentially and do not return to normal.
System slowdowns don’t have the same obvious symptoms as errors, but identifying them is no less critical.
In our example, the teams identified the following saturation points (marked in yellow):
- Service 1: Overloaded with REST requests from services 9 and 10.
- Service 5: Queue full of messages from services 6, 7, and 8.
Version Skew: Identify the Biggest Security Risks
Identify the most significant version mismatch issues through a security audit of the Software Bill of Materials (SBOM). Focus on the highest-priority security threats identified.
In our example, the security analysis found two services with serious security risks (marked in purple):
- Service 15: An early version of the REST library used to call external SaaS services.
- Service 19: An outdated parser library that could execute malicious code.
Cost: Find Consolidation Targets
The proliferation of microservices can occur when teams save time by implementing new, independent services instead of adding features to existing ones. A cost-benefit analysis helps identify which services should be consolidated. Evaluate each service on the following:
- Role: How many functions does it perform?
- Computational cost: How large is the machine it runs on?
- Pipeline efficiency: Does it have its own build and deployment pipeline, or does it share one?
- Deployment issues: Does it get forgotten when related services change and need to be redeployed?
In our example, the teams concluded that services 6, 7, and 8 were too expensive to own (marked in blue). Each of these services costs the company in the following ways:
- Each has its own build and deployment process, yet serves a narrow purpose.
- Their business logic is practically identical.
- They must be released as a single unit whenever any one of them changes.
Diagnosis Playbook Outputs
The following table summarises the diagnostic results. These results correspond to the colour coding in the topology diagram below the table.
| # | Problem | Diagnosis Summary | Action Taken |
| 1 | Crashes/errors (red) | S14's CPU heats up and the service restarts, causing S17 and S20 to throw errors; S2 and S3 throw data-related errors. | Added metrics and dashboards to identify the root cause |
| 2 | Deployment problems (orange) | The Gateway should not contain business logic; the acquired services lack service discovery. | Deferred to stabilisation |
| 3 | Slowdowns (yellow) | S1 is saturated by S9 and S10 REST calls; S5 has a queue backup. | Deferred to stabilisation |
| 4 | Priority version skew (purple) | S15 and S19 use libraries with severe security exploits. | Deferred to stabilisation |
| 5 | Cost of ownership (blue) | S6, S7, and S8 require team resources despite their small size. | Deferred to stabilisation |
This is what the topology diagram looks like after identifying the problems:
Figure 4: The topology diagram with all identified problems highlighted.
These metrics, dashboards, and event logs determine the next steps in the stabilisation playbook.
Right-Size to Get a Consolidated Architecture
The third playbook implements the recommendations of the stabilisation playbook, right-sizing the topology to mitigate the problems effectively.
Right-Size Playbook Scope and Goals
The right-size playbook's goals are as follows:
- Technological refactoring: making decisions about technological changes to improve service reliability and scalability.
- Right-sized microservices: making decisions about topological changes that correct the interdependencies between problematic services.
The right-size playbook may involve more extensive code changes than the earlier playbooks, but it must stay within this scope.
Note: This right-size playbook may seem brief compared to the diagnose and stabilise playbooks. Keep in mind, however, that right-sizing requires significant planning and time; it will be the longest playbook on the schedule.
Technology Refactoring
The results of the stabilisation playbook show which technologies need to be replaced. Here are some typical examples:
- Database type: A relational database was chosen initially, but the data is now accessed in a document-oriented way.
- Database cluster usage: Redirecting read traffic from the read/write node to read-only replicas.
- Caching implementation: Keeping frequently used data available through a high-performance caching layer (see the sketch after this list).
- Transition to asynchronous processing: Replacing fragile REST calls with message queues.
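For the caching option, a minimal sketch using Spring's caching abstraction might look like this. It assumes spring-boot-starter-cache plus a provider such as Caffeine, and @EnableCaching on a configuration class; the cache name and lookup are hypothetical.

```java
// Minimal sketch of a read-through cache with Spring's caching abstraction.
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class ProductCatalogService {

    @Cacheable("product-catalog")
    public String findProductName(long productId) {
        // The expensive lookup runs only on a cache miss; hot data is served
        // from the cache layer instead of hammering the database.
        return loadFromDatabase(productId);
    }

    private String loadFromDatabase(long productId) {
        return "product-" + productId; // placeholder for the real database or downstream call
    }
}
```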
In our example, the team stabilised the system by replacing the synchronous REST connections between S17/S20 and S14 with a reliable asynchronous message queue (a sketch follows this list):
- S14 no longer required horizontal scaling, which resulted in cost savings.
- S14 remained operational because it processed messages at a steady pace, which provided greater stability.
- S17 and S20 functioned flawlessly, sending messages instead of making REST requests.
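A minimal sketch of this kind of refactoring, assuming Spring for Apache Kafka (spring-kafka) is available, could look like the following. The topic name and string payload are hypothetical stand-ins for the S17/S20-to-S14 traffic.

```java
// Minimal sketch: publish work asynchronously instead of blocking on a REST call.
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class WorkRequestPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public WorkRequestPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void submit(String workRequestJson) {
        // Publish and return immediately; traffic surges accumulate in the broker, not in the caller.
        kafkaTemplate.send("s14-work-requests", workRequestJson);
    }
}

@Component
class WorkRequestConsumer {

    @KafkaListener(topics = "s14-work-requests", groupId = "s14")
    void onMessage(String workRequestJson) {
        // The downstream service drains the topic at a steady pace instead of failing under bursts.
    }
}
```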
Right-Size Closely-Related Microservices into Clusters
Small or new services may have hidden dependencies on legacy services, such as the following:
- API coupling: REST requests or messages sent for a specific purpose because “it works this way now.”
- Behavioural coupling: Assumptions that one service makes about the behaviour of another service, which cease to be valid when the behaviour changes.
- Operational or implementation coupling: Services that need to be restarted in a specific order because one depends on something that the other provides.
Services coupled in these ways benefit from being restructured into closer, contract-based relationships.
In our example, the teams determined that S1, S2, and S3 should be in the same cluster:
- The cluster uses a single database that consolidates their data.
- Thanks to the shared database, services S9 and S10 were no longer needed.
- Service S1 was no longer overloaded by REST data-synchronisation requests and functioned normally.
Right-Size Poor Cost Performers into Modular Monoliths
Creating new services quickly is easier now than ever before. Developers can create new services simply because they can, which contributes to the proliferation of microservices.
- Microservice platforms: Spring Boot enables developers to create new services quickly.
- Microservice templates: Templates allow developers to clone a new service with a standard set of features.
- Generative AI: Generative AI takes templates to the next level, generating the complete structure of a new service with features tailored to each application.
Services that have very similar characteristics benefit from being implemented together within a single service, known as a modular monolith.
In our example, independent teams developed and deployed each of the services S6, S7, and S8 (a sketch of the consolidated result follows this list):
- They used a new AI-based service to generate their microservices.
- They never compared the code and didn't realise that duplicate messages were being sent.
- The stabilisation team decided to merge their code into a modular monolith and consolidate the shared logic, eliminating the duplicate messages.
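As a rough sketch, the consolidated entry point could look like the following, assuming the three former services become packages (modules) inside one Spring Boot application. The package names are hypothetical.

```java
// Minimal sketch of a modular-monolith entry point: one deployable, one module per package.
package com.example.billing;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication(scanBasePackages = {
        "com.example.billing.invoicing",   // formerly service S6
        "com.example.billing.reminders",   // formerly service S7
        "com.example.billing.statements",  // formerly service S8
        "com.example.billing.shared"       // consolidated logic, e.g. a single de-duplicating message publisher
})
public class BillingApplication {

    public static void main(String[] args) {
        SpringApplication.run(BillingApplication.class, args);
    }
}
```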
Technology Introduction
During the system development process, it may become necessary to implement new technologies. Here are a few examples:
- Message transmission: Transmitting large volumes of messages requires a reliable platform to handle the load.
- Analytical model management: In data-intensive environments, a data science and machine learning (DS/ML) platform is needed to manage model training and deployment.
- Business rules management system: Organisations with numerous business rules need tools to manage and enforce them.
Implementing a new technology requires approval from the entire organisation.
In our example, Team 1 placed the business rules on the Gateway because they couldn’t find a better place to put them:
- The architects first needed to understand the problem.
- They recommended using a Domain Orchestration Service (DOS) to host the business rules.
- The DOS simplified the interaction between the Gateway and client services, allowing the business rules to be moved out of the Gateway.
Right-Size Playbook Outputs
The table below summarises the final state of the right-sized system.
| # | Problem | Right-Size Summary | Action Taken |
| 1 | Crashes/errors | S17/S20 send messages to S14 instead of REST calls, handling traffic surges gracefully. | Technology refactoring |
| 2 | Data errors | S1, S2, and S3 were combined into a cluster with a shared database, eliminating saturation. | Topology adjustment to a shared cluster |
| 3 | Message errors | S6, S7, and S8 were combined into a single service to fix the duplicated logic. | Topology adjustment to a modular monolith |
| 4 | Domain orchestration service | Business rules were moved out of the Gateway into the domain orchestration service, and calls were routed accordingly. | Topology adjustment to use a DOS between the Gateway and services |
Wrapping Up
These playbooks help your teams transform the system into a stable ecosystem capable of growth and change. Each playbook achieves objectives that set up the next one for success.
Team Development Habits to Transform
Haste, as the saying "haste makes waste" warns, is one of the main causes of microservices sprawl. In this article, we have examined several of its forms:
- The rush to implement new services creates hidden dependencies.
- Infrastructure built with AI is not tested before deployment.
- Software Bill of Materials (SBOM) updates are delayed to save time.
- Complex acquired systems are integrated into an outdated architecture.
Microservice architecture in Java requires continuous investment to maintain its optimisation and flexibility:
- Develop a phased schedule for updating the SBOM inventory.
- Define goals for major updates, such as Spring Boot and Java updates.
- Maintain security analysis to optimise the SBOM update cycle.
- Encourage architectural oversight of all new features.
Frequently Asked Questions
When should we consolidate Java microservices, and when should we keep them independent?
Consolidation is necessary when bounded contexts become blurred, changes affect multiple services or shared schemas, and synchronised versions indicate interdependence. In such cases, a modular monolith or a clustered core can provide better performance than fragmented microservices in Java. Otherwise, keep your microservices architecture independent and focused; avoid unintentionally reverting to a monolithic architecture.
What belongs in the API gateway vs service business logic?
The API gateway handles authentication, routing, rate limiting, and monitoring. Spring Cloud Gateway centralises these cross-cutting concerns. All business logic resides in microservices developed using Spring Boot and domain orchestration, rather than in the Gateway, allowing teams to deploy applications independently and reducing the impact of potential failures.
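As a rough illustration, a Spring Cloud Gateway configuration restricted to cross-cutting concerns might look like this. The route ids, paths, and lb:// service names are hypothetical.

```java
// Minimal sketch: the gateway only routes and applies filters; business rules live elsewhere.
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayRoutes {

    @Bean
    public RouteLocator routes(RouteLocatorBuilder builder) {
        return builder.routes()
                // Plain routing with a path rewrite; no domain decisions are made here.
                .route("orders", r -> r.path("/api/orders/**")
                        .filters(f -> f.stripPrefix(1))
                        .uri("lb://order-service"))
                // Requests that need business rules are forwarded to a domain orchestration service.
                .route("workflows", r -> r.path("/api/workflows/**")
                        .filters(f -> f.stripPrefix(1))
                        .uri("lb://domain-orchestration-service"))
                .build();
    }
}
```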
Do we still need a service registry if the platform handles load balancing?
Yes. Load balancing distributes traffic among available instances; it does not inform services about the location of other services. Service discovery through a service registry provides dynamic endpoints, health checks, and version control, enabling safe deployments and blue-green deployments even when the topology changes.
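For illustration, here is a minimal sketch of resolving instances through Spring Cloud's DiscoveryClient, assuming a registry client such as Eureka or Consul is already configured; the service id is hypothetical.

```java
// Minimal sketch: look up live instances from the service registry instead of hard-coding endpoints.
import java.util.List;

import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.discovery.DiscoveryClient;
import org.springframework.stereotype.Component;

@Component
public class PaymentEndpointResolver {

    private final DiscoveryClient discoveryClient;

    public PaymentEndpointResolver(DiscoveryClient discoveryClient) {
        this.discoveryClient = discoveryClient;
    }

    public List<ServiceInstance> paymentInstances() {
        // Endpoints are resolved dynamically, so topology changes and blue-green cutovers need no config edits.
        return discoveryClient.getInstances("payment-service");
    }
}
```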
When do we switch from REST to async messaging for better fault tolerance?
Switch to asynchronous communication when workloads are intermittent, long-running, or distributed across multiple services. Asynchronous communication and messaging improve fault tolerance, decouple retries and timeouts, and prevent long wait times for users. Use synchronous REST communication only for request-response user actions that require immediate feedback.
How do we keep data consistent across services without distributed transactions?
Design for eventual consistency. Implement idempotency keys, compensating actions, and versioned schemas; together these keep data consistent across services without relying on an unreliable two-phase commit protocol.
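As a rough sketch of the idempotency-key idea, the handler below runs a side effect at most once per key. A production version would persist the keys in a database table or Redis; the in-memory set is used here only to keep the example self-contained.

```java
// Minimal sketch: re-delivered messages carry the same key, so the side effect runs at most once.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.stereotype.Component;

@Component
public class IdempotentPaymentHandler {

    // Hypothetical key store; a durable store is required for real at-most-once behaviour across restarts.
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

    public void handle(String idempotencyKey, Runnable action) {
        // add() returns false if the key was already seen, so duplicates are silently dropped.
        if (processedKeys.add(idempotencyKey)) {
            action.run();
        }
    }
}
```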
