AWS: “The capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.”
Ability to handle unexpected errors and failures, recover from those failures, and maintain a consistent level of performance despite external challenges.
The main goal is to build robust components that can tolerate faults within their scope.
The more components a system has, the more likely it is that something will fail.
# AVAILABILITY
Availability expresses the amount of time a component is actually available, compared to the amount of time the component is supposed to be available.
High availability is a characteristic of a system that aims to ensure that the system is continuously operational and able to provide its services to users without interruption.
Traditional approaches aim to increase uptime, while modern approaches aim to reduce recovery time (and therefore downtime).
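The difference between the two approaches can be made concrete with the commonly used steady-state availability formula. The symbols MTTF (mean time to failure) and MTTR (mean time to recovery) do not appear in the slides and are introduced here only for illustration:

```latex
\[
\text{Availability}
  = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}
  = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
\]
```

Raising MTTF is the traditional lever, lowering MTTR is the modern one. As a worked example, MTTF = 999 hours and MTTR = 1 hour gives 999 / 1000 = 99.9% availability.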
Categorization of resilience design patterns by Uwe Friedrichsen
# RESILIENCE
Resilience4j is a lightweight fault tolerance library designed for functional programming.
This library was inspired by Netflix Hystrix.
Circuit Breaker
Rate Limiter
Retry
Bulkhead
Time Limiter
# PRESENTING CODE
Hystrix | Resilience4j |
---|---|
uses a thread-per-request model | uses a more lightweight and modular approach |
integrates with several other libraries and frameworks, such as Spring and Netflix OSS | designed to be more standalone |
includes a dashboard for monitoring the health and performance of your system | does not have a built-in dashboard but can be used with other monitoring tools |
The idea of a circuit breaker is to prevent calls to a remote service when we know that the call is likely to fail or time out. This way, we do not unnecessarily waste critical resources in either our service or the remote service.
If a service in the microservice ecosystem is failing, you need to fail fast by opening the circuit. No additional calls are then made to the failing service, and an exception is returned to the caller immediately.
Two types of circuit breaker:
1. count-based (evaluates the outcomes of the last N calls)
2. time-based (evaluates the outcomes of the calls made in the last N seconds)
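A circuit breaker has 3 states: closed (calls pass through), open (calls are rejected immediately), and half-open (a limited number of trial calls is allowed).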
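The transitions between the closed, open, and half-open states can be sketched in plain Java. This is a minimal illustration and not the Resilience4j implementation: the class `SimpleCircuitBreaker`, its constructor parameters, and the injectable clock are all hypothetical, and the failure rate is evaluated over consecutive batches of N calls rather than a true sliding window.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical count-based circuit breaker sketch (not the Resilience4j API). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // evaluate the failure rate every N calls
    private final double failureRateThreshold; // percentage, e.g. 60
    private final long waitInOpenMillis;       // time to stay OPEN before a trial call
    private final LongSupplier clock;          // injectable time source in milliseconds

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         long waitInOpenMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpenMillis = waitInOpenMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    /** Runs the action, or returns the fallback value when the circuit is open or the action fails. */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < waitInOpenMillis) {
                return fallback.get();   // fail fast: the remote call is skipped
            }
            state = State.HALF_OPEN;     // wait elapsed: allow one trial call
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            // The single trial call decides whether the circuit closes or reopens.
            if (failed) open(); else close();
            return;
        }
        calls++;
        if (failed) failures++;
        if (calls == windowSize) {
            if (100.0 * failures / windowSize >= failureRateThreshold) open(); else close();
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        calls = 0;
        failures = 0;
    }

    private void close() {
        state = State.CLOSED;
        calls = 0;
        failures = 0;
    }
}
```

Synchronizing the whole `call` method keeps the sketch short but serializes all calls; a production implementation would keep the state check and the bookkeeping outside the protected call.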
# STATES
# CIRCUIT BREAKER
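```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

// Methods of a service class (the class declaration is not shown on the slide).
@Override
@CircuitBreaker(name = "circuitBreakerApi", fallbackMethod = "fallback")
public String getCircuitBreaker() throws Exception {
    System.out.println("Hello World");
    // Always throw an exception so that the fallback method is triggered.
    throw new Exception("Error");
}

// Fallback: same return type as the protected method, plus a Throwable parameter.
public String fallback(Throwable t) {
    System.out.println("Fallback");
    return "Sorry ... Service not available!!!";
}
```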
# SERVICE
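```properties
# "path" below is shorthand for the prefix
# resilience4j.circuitbreaker.instances.circuitBreakerApi
path.registerHealthIndicator=true
# ringBufferSize* are the older property names; newer Resilience4j versions
# use slidingWindowSize and permittedNumberOfCallsInHalfOpenState instead.
path.ringBufferSizeInClosedState=44
path.ringBufferSizeInHalfOpenState=2
path.waitDurationInOpenState=30s
path.failureRateThreshold=60
```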
# CONFIGURATION
Rate limiting is a technique used to control the rate at which a service or system processes requests in order to prevent it from being overwhelmed or degraded.
It can be used to maintain performance and stability, and ensure that resources are used efficiently.
Incoming requests are analyzed and compared to a set of rules, and only a certain number of requests are allowed to pass through per unit of time.
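The "certain number of requests per unit of time" rule can be illustrated with a fixed-window counter. This is a minimal sketch under stated assumptions, not the Resilience4j rate limiter: the class `FixedWindowRateLimiter`, its parameters, and the injectable clock are hypothetical, and a call over the limit is rejected immediately instead of waiting for a permit.

```java
import java.util.function.LongSupplier;

/** Hypothetical fixed-window rate limiter: at most limitForPeriod calls per refresh period. */
class FixedWindowRateLimiter {
    private final int limitForPeriod;
    private final long refreshPeriodMillis;
    private final LongSupplier clock;   // injectable time source in milliseconds

    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long refreshPeriodMillis, LongSupplier clock) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodMillis = refreshPeriodMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns true if the call may proceed, false if it exceeds the limit for the current period. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= refreshPeriodMillis) {
            // The period has elapsed: start a new one with a fresh budget.
            windowStart = now;
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false;
    }
}
```

A caller would check `tryAcquire()` before handling a request and return an error (for HTTP services, typically status 429) when it yields `false`.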
# RATE LIMITER
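```properties
# "path" below is shorthand for the prefix
# resilience4j.ratelimiter.instances.rateLimiterApi
resilience4j.ratelimiter.metrics.enabled=true
path.register-health-indicator=true
# Allow 5 calls per 60-second refresh period
path.limit-for-period=5
path.limit-refresh-period=60s
# Do not wait for a permit: calls over the limit are rejected immediately
path.timeout-duration=0s
path.allow-health-indicator-to-fail=true
path.subscribe-for-events=true
path.event-consumer-buffer-size=50
```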
# CONFIGURATION
The retry pattern enables an application to handle transient failures when it tries to connect to a service, by transparently retrying a failed operation.
This minimizes the effects faults can have on the business tasks the application is performing.
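Transparent retrying with a growing wait between attempts can be sketched as follows. This is an illustration only, not the Resilience4j implementation: the class `SimpleRetry`, its parameters, and the injectable `sleeper` are hypothetical, and `maxAttempts` is assumed to include the initial call.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Hypothetical retry helper with exponential backoff. */
class SimpleRetry {

    /**
     * Runs the action up to maxAttempts times (the first call counts as attempt 1).
     * After each failed attempt it waits, then multiplies the wait by the given factor.
     */
    static <T> T execute(Supplier<T> action, int maxAttempts, long initialWaitMillis,
                         double multiplier, LongConsumer sleeper) {
        long wait = initialWaitMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e;                      // attempts exhausted: surface the last failure
                }
                sleeper.accept(wait);             // back off before the next attempt
                wait = (long) (wait * multiplier);
            }
        }
    }

    /** Sleeper for real use; tests can pass a recording lambda instead. */
    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

With 3 attempts, an initial wait of 2 seconds, and a multiplier of 2, a call such as `SimpleRetry.execute(action, 3, 2000, 2.0, SimpleRetry::sleep)` waits 2 seconds after the first failure and 4 seconds after the second, then gives up if the third attempt also fails.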
Strategies:
# RETRY
# CONFIGURATION
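```properties
# "path" below is shorthand for the prefix
# resilience4j.retry.instances.retryApi
# Up to 3 attempts, starting with a 2s wait that doubles after each failure
path.max-attempts=3
path.wait-duration=2s
path.enable-exponential-backoff=true
path.exponential-backoff-multiplier=2
resilience4j.retry.metrics.legacy.enabled=true
resilience4j.retry.metrics.enabled=true
```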
In a bulkhead architecture, elements of an application are isolated into pools so that if one fails, the others will continue to function.
For the bulkhead pattern to be implemented, all services need to work independently of each other.
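One common way to isolate a pool of calls is to cap concurrency with a semaphore, so that a slow dependency can only occupy a fixed number of callers. The sketch below is a minimal illustration and not the Resilience4j bulkhead: the class `SemaphoreBulkhead` and its parameters are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical semaphore-based bulkhead: limits the number of concurrent calls. */
class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the action if a permit becomes available within maxWaitMillis, otherwise rejects the call. */
    <T> T call(Supplier<T> action) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead is full");
        }
        try {
            return action.get();
        } finally {
            permits.release();   // always return the permit, even if the action throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each downstream dependency its own `SemaphoreBulkhead` instance means that exhausting one pool leaves the permits of the others untouched.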
Strategies:
# BULKHEAD
When to use bulkhead:
# BULKHEAD
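```properties
# "path" is shorthand for resilience4j.bulkhead.instances.bulkheadApi
resilience4j.bulkhead.metrics.enabled=true
# At most 3 calls run concurrently
path.max-concurrent-calls=3
# No unit given; Spring Boot typically reads a bare duration as milliseconds
path.max-wait-duration=1
```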
# CONFIGURATION
Time limiting sets the maximum amount of time we are willing to wait for an operation to complete. If the operation does not complete within that time, an exception is thrown.
This pattern ensures that users do not wait indefinitely and that requests do not hold the server's resources indefinitely.
The main goal is to not hold up resources for too long.
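A time limit on an operation can be sketched with a `Future` and a bounded `get`. This is a minimal illustration, not the Resilience4j time limiter: the class `SimpleTimeLimiter` and its parameters are hypothetical, and the sketch creates a new single-thread executor per call purely for brevity.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical time limiter: waits at most timeoutMillis for the action to finish. */
class SimpleTimeLimiter {

    static <T> T callWithTimeout(Supplier<T> action, long timeoutMillis,
                                 boolean cancelRunningFuture) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<T> task = action::get;
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            if (cancelRunningFuture) {
                future.cancel(true);   // interrupt the worker so it stops holding resources
            }
            throw new IllegalStateException(
                    "Operation timed out after " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the result", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Operation failed", e.getCause());
        } finally {
            executor.shutdown();       // no new tasks; the worker thread ends with its task
        }
    }
}
```

Cancelling only helps if the operation responds to interruption; a task that ignores interrupts keeps running in the background even after the caller has received the timeout exception.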
# TIME LIMITER
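```properties
# "path" is shorthand for resilience4j.timelimiter.instances.timeLimiterApi
resilience4j.timelimiter.metrics.enabled=true
# Fail calls that take longer than 2 seconds and cancel the underlying future
path.timeout-duration=2s
path.cancel-running-future=true
```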
# CONFIGURATION
Sources: