Friday 17th May 2024
Ho Chi Minh, Vietnam

1. Fault Tolerance in Microservices Architecture

Fault tolerance is the property that enables a system to continue operating properly when some of its components fail. In a microservices architecture, fault tolerance is especially important because services communicate over the network, and a large amount of internal and external traffic is generated for the services to work together.

To achieve fault tolerance in a microservices architecture, various patterns, principles, and techniques can be used, such as circuit breakers, retry mechanisms, and graceful degradation. These techniques allow the system to handle and recover from failures without affecting its overall functionality. Other common fault-tolerance patterns for microservices include bulkheads, timeouts, and thread pools.

2. Hystrix Circuit Breaker

The circuit breaker is a pattern that wraps requests to external services and detects when they fail. When failures are detected, the circuit breaker opens, and all subsequent requests immediately return an error instead of calling the unhealthy service. This keeps a service that is down or misbehaving from dragging down the services that depend on it. The breaker keeps rejecting calls until the service becomes healthy again.

Hystrix is a library that controls the interactions between microservices to provide latency tolerance and fault tolerance. When a call fails or is slow, it also makes sense to adjust the UI so that the user knows that something might not have worked as expected or might take more time.

3. Sample project

We will demonstrate how Hystrix is applied in a microservices system via a sample project. This project is a home loan service which consumes a KYC service to verify customers by username. You can imagine these services as part of a microservices-based banking system which can contain many other services.

We will use Spring Feign to make RESTful requests to the KYC service and integrate Hystrix into our service. The source code is straightforward: the VerificationServiceClient makes requests to the KYC API and is consumed by the VerificationUserService, which returns the verification result to customers via the VerificationUserEndpoint.

3.1. Tech stack

JDK 11, Gradle, Spring Boot, Spring Feign, Hystrix

Hystrix is enabled and configured via application.yml. The settings under hystrix.command.default apply to every Hystrix command unless overridden:

kyc-service:
  url: localhost:8082

feign:
  hystrix:
    enabled: true

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 3000
      circuitBreaker:
        requestVolumeThreshold: 5
        sleepWindowInMilliseconds: 2000
        errorThresholdPercentage: 40

3.2. How Hystrix works

Hystrix supports the notion of a fallback: a default code path that is executed when the circuit is open or the call fails (the response status is not 2xx, or the call times out). To enable fallbacks for a given @FeignClient, set the fallback attribute to the class that implements the fallback, as in the VerificationServiceClient below.

VerificationServiceClient.java

@FeignClient(value = "VerificationServiceClient", url = "${kyc-service.url}", fallback = VerificationServiceClientFallback.class)
public interface VerificationServiceClient {
  @PostMapping(value = "/kyc/{userName}", produces = MediaType.APPLICATION_JSON_VALUE)
  String verify(@PathVariable String userName);
}

VerificationServiceClientFallback.java

@Component
public class VerificationServiceClientFallback implements VerificationServiceClient{

  @Override
  public String verify(String userName) {
    return ApplicationUtil.FALLBACK_MESSAGE;
  }
}
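
The fallback idea itself does not depend on Hystrix or Feign. As a rough illustration only, the sketch below uses plain JDK classes to return a default value when a call throws or takes too long. The names FallbackSketch and callWithFallback are hypothetical and are not part of the sample project or of any library; Hystrix adds thread isolation and the circuit breaker on top of this basic behaviour.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: runs a call and substitutes a fallback on error or timeout. */
final class FallbackSketch {

  static String callWithFallback(Supplier<String> primary, String fallback, long timeoutMs) {
    return CompletableFuture.supplyAsync(primary)
        // If no result arrives in time, complete with the fallback value instead.
        .completeOnTimeout(fallback, timeoutMs, TimeUnit.MILLISECONDS)
        // If the primary call threw an exception, also return the fallback value.
        .exceptionally(error -> fallback)
        .join();
  }

  public static void main(String[] args) {
    // A healthy call returns its own result.
    System.out.println(callWithFallback(() -> "VERIFIED", "FALLBACK", 1000));
    // A failing call is replaced by the fallback.
    System.out.println(callWithFallback(() -> {
      throw new IllegalStateException("KYC service is down");
    }, "FALLBACK", 1000));
  }
}
```

Note that in this sketch a timed-out call keeps running in the background; only its result is ignored.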

Hystrix does not trip the circuit after a fixed number of failures. This means that a fallback can be executed while the circuit is still closed. The Hystrix circuit will open only if:

Within a rolling window of duration metrics.rollingStats.timeInMilliseconds (10 seconds by default), the percentage of requests that fail reaches or exceeds errorThresholdPercentage, provided that the number of requests through the circuit in that window is at least requestVolumeThreshold.

To see why the minimum volume matters, imagine there was no such threshold and the first call in a time window failed. You would have 1 failure out of 1 call, a 100% failure rate, which is higher than the 40% threshold configured above. So the circuit would open immediately.
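
The rule can be written down as a small function. This is a simplified model under stated assumptions, not Hystrix's actual code: TripRuleSketch and shouldOpen are hypothetical names, and the counts are assumed to cover one rolling statistics window.

```java
/** Hypothetical model of the decision to open the circuit. */
final class TripRuleSketch {

  static boolean shouldOpen(int totalRequests, int failedRequests,
                            int requestVolumeThreshold, int errorThresholdPercentage) {
    // Too little traffic in the window: never open, whatever the error rate is.
    if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
      return false;
    }
    // Integer percentage of failed requests in the window.
    int errorPercentage = failedRequests * 100 / totalRequests;
    return errorPercentage >= errorThresholdPercentage;
  }

  public static void main(String[] args) {
    // Values from application.yml: requestVolumeThreshold = 5, errorThresholdPercentage = 40.
    System.out.println(shouldOpen(5, 3, 5, 40)); // 3 of 5 failed: 60%
    System.out.println(shouldOpen(5, 1, 5, 40)); // 1 of 5 failed: 20%
    System.out.println(shouldOpen(2, 2, 5, 40)); // 100% failed, but only 2 requests
  }
}
```

With a volume threshold of 1, a call such as shouldOpen(1, 1, 1, 40) returns true, which is the "first call fails, circuit opens immediately" situation that the minimum volume is there to prevent.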

Now you may have another question: how is the circuit closed again after it has opened? There are actually three states: OPEN, CLOSED, and HALF_OPEN. Once the circuit breaker is OPEN and a certain amount of time has passed, it lets a single request through. This is the HALF_OPEN state. If that request succeeds, the circuit breaker is closed; otherwise, it returns to the OPEN state until that amount of time has passed again, and then it enters the HALF_OPEN state once more. You can specify the amount of time before the transition from OPEN to HALF_OPEN using the sleepWindowInMilliseconds property.
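
The three states can be modelled as a small state machine. The sketch below is an illustration of the transitions only, assuming a hypothetical class CircuitStateSketch with an injectable clock; it is not how Hystrix is implemented, and the decision to trip is left to the caller.

```java
import java.util.function.LongSupplier;

/** Hypothetical model of the CLOSED / OPEN / HALF_OPEN transitions. */
final class CircuitStateSketch {

  enum State { CLOSED, OPEN, HALF_OPEN }

  private final long sleepWindowMs;
  private final LongSupplier clock; // injectable so that time can be controlled in tests
  private State state = State.CLOSED;
  private long openedAt;

  CircuitStateSketch(long sleepWindowMs, LongSupplier clock) {
    this.sleepWindowMs = sleepWindowMs;
    this.clock = clock;
  }

  /** Called by whatever rule decides that the error rate is too high. */
  synchronized void trip() {
    state = State.OPEN;
    openedAt = clock.getAsLong();
  }

  /** Returns true if a request may go through to the remote service. */
  synchronized boolean allowRequest() {
    if (state == State.CLOSED) {
      return true;
    }
    if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMs) {
      state = State.HALF_OPEN; // let exactly one trial request through
      return true;
    }
    return false; // still inside the sleep window, or a trial request is in flight
  }

  /** A successful trial request closes the circuit. */
  synchronized void onSuccess() {
    if (state == State.HALF_OPEN) {
      state = State.CLOSED;
    }
  }

  /** A failed trial request re-opens the circuit and restarts the sleep window. */
  synchronized void onFailure() {
    if (state == State.HALF_OPEN) {
      state = State.OPEN;
      openedAt = clock.getAsLong();
    }
  }

  synchronized State state() {
    return state;
  }
}
```

A caller would create it with, for example, new CircuitStateSketch(2000, System::currentTimeMillis), check allowRequest() before each remote call, and report the outcome through onSuccess() or onFailure().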

To experiment with how Hystrix works, we will write an integration test with a few scenarios. We will use WireMock to mock the KYC API.

VerificationServiceClientTest.java

@SpringBootTest
@EnableConfigurationProperties
@ExtendWith(SpringExtension.class)
@ContextConfiguration(classes = { Application.class })
@AutoConfigureWireMock(port = 8082)
public class VerificationServiceClientTest {

  @Autowired
  private VerificationServiceClient verificationServiceClient;

  @BeforeEach
  public void before() {
    Hystrix.reset();
    HealthCountsStream.reset();
  }

Scenario 1: the fallback method will be triggered when the response status is not 2xx

  @Test
  public void fallBackWillBeTriggered_after_responseIsError() throws InterruptedException {
    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withStatus(500)));
    String response = verificationServiceClient.verify("thoai");
    Assertions.assertEquals(ApplicationUtil.FALLBACK_MESSAGE, response);
  }

Scenario 2: the fallback will not be triggered when the response is successful

  @Test
  public void fallBackWillNotBeTriggered_after_responseIsSuccess() throws InterruptedException {
    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withBody(ApplicationUtil.SUCCESS_MESSAGE)
        .withStatus(200)));
    String response = verificationServiceClient.verify("thoai");
    Assertions.assertEquals(ApplicationUtil.SUCCESS_MESSAGE, response);
  }

Scenario 3: the fallback will be triggered when the request times out.
– Given: hystrix circuit breaker is configured with timeoutInMilliseconds = 3000ms
– When: the response is not returned until after 4000ms
– Then: fallback will be triggered

  @Test
  public void fallBackWillBeTriggered_after_requestTimeOut() throws InterruptedException {

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withFixedDelay(4000)
        .withStatus(200)));
    String response = verificationServiceClient.verify("thoai");

    Assertions.assertEquals(ApplicationUtil.FALLBACK_MESSAGE, response);
  }

Scenario 4: the fallback will not be triggered when the response is slow but still arrives within timeoutInMilliseconds.
– Given: hystrix circuit breaker is configured with timeoutInMilliseconds = 3000ms
– When: the response is returned after 2000ms
– Then: fallback will not be triggered

  @Test
  public void fallBackWillNotBeTriggered_when_responseIsSlowButWithinTimeout() throws InterruptedException {

    // Delay the response by 2000ms, which is below the 3000ms Hystrix timeout.
    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withFixedDelay(2000)
        .withBody(ApplicationUtil.SUCCESS_MESSAGE)
        .withStatus(200)));
    String response = verificationServiceClient.verify("thoai");

    Assertions.assertEquals(ApplicationUtil.SUCCESS_MESSAGE, response);
  }

Scenario 5: the circuit breaker will be opened
– Given: hystrix circuit breaker is configured with requestVolumeThreshold = 5, sleepWindowInMilliseconds = 2000ms, errorThresholdPercentage = 40%
– When: 3 of 5 requests fail (60%, above the 40% threshold) and the test then waits 3000ms
– Then: the circuit will be opened

  @Test
  public void circuitWillBeOpened() throws InterruptedException {

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withStatus(200)));

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/camila"))
      .willReturn(WireMock.aResponse()
        .withStatus(500)));

    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("camila");

    Thread.sleep(3000);

    HystrixCircuitBreaker myCircuitBreaker = HystrixCircuitBreaker.Factory.getInstance(
      HystrixCommandKey.Factory.asKey("VerificationServiceClient#verify(String)"));

    Assertions.assertTrue(myCircuitBreaker.isOpen());
  }

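Scenario 6: the circuit is still closed when its state is checked immediately after the requests
– Given: hystrix circuit breaker is configured with requestVolumeThreshold = 5, sleepWindowInMilliseconds = 2000ms, errorThresholdPercentage = 40%
– When: 3 of 5 requests fail (60%, above the 40% threshold) and the circuit state is checked right away, without the 3000ms wait used in Scenario 5
– Then: the circuit will still be closed, because Hystrix evaluates the error rate from health statistics that are refreshed periodically rather than after every call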

  @Test
  public void circuitWillNotBeOpened_1() throws InterruptedException {

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withStatus(200)));

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/camila"))
      .willReturn(WireMock.aResponse()
        .withStatus(500)));

    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("camila");

    HystrixCircuitBreaker myCircuitBreaker = HystrixCircuitBreaker.Factory.getInstance(
      HystrixCommandKey.Factory.asKey("VerificationServiceClient#verify(String)"));

    Assertions.assertFalse(myCircuitBreaker.isOpen());
  }

Scenario 7: the circuit is still closed when the error rate is below the configured errorThresholdPercentage
– Given: hystrix circuit breaker is configured with requestVolumeThreshold = 5, sleepWindowInMilliseconds = 2000ms, errorThresholdPercentage = 40%
– When: 1 of 5 requests fails (20%, below the 40% threshold) and the test then waits 3000ms
– Then: the circuit will still be closed

  @Test
  public void circuitWillNotBeOpened_2() throws InterruptedException {

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withStatus(200)));

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/camila"))
      .willReturn(WireMock.aResponse()
        .withStatus(500)));

    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("thoai");

    Thread.sleep(3000);

    HystrixCircuitBreaker myCircuitBreaker = HystrixCircuitBreaker.Factory.getInstance(
      HystrixCommandKey.Factory.asKey("VerificationServiceClient#verify(String)"));

    Assertions.assertFalse(myCircuitBreaker.isOpen());
  }

Scenario 8: the circuit is still closed when the total number of requests is below the configured requestVolumeThreshold
– Given: hystrix circuit breaker is configured with requestVolumeThreshold = 5, sleepWindowInMilliseconds = 2000ms, errorThresholdPercentage = 40%
– When: 2 of 2 requests fail, but the total of 2 is below the requestVolumeThreshold of 5, and the test then waits 3000ms
– Then: the circuit will still be closed

  @Test
  public void circuitWillNotBeOpened_3() throws InterruptedException {

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/camila"))
      .willReturn(WireMock.aResponse()
        .withStatus(500)));

    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("camila");

    Thread.sleep(3000);

    HystrixCircuitBreaker myCircuitBreaker = HystrixCircuitBreaker.Factory.getInstance(
      HystrixCommandKey.Factory.asKey("VerificationServiceClient#verify(String)"));

    Assertions.assertFalse(myCircuitBreaker.isOpen());
  }

Scenario 9: the circuit is closed again
– Given: hystrix circuit breaker is configured with sleepWindowInMilliseconds = 2000ms, and the circuit is open
– When: 2000ms have passed and the next request succeeds
– Then: the circuit breaker transitions to closed again

  @Test
  public void circuitWillBeClosedAgain() throws InterruptedException {

    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/thoai"))
      .willReturn(WireMock.aResponse()
        .withBody(ApplicationUtil.SUCCESS_MESSAGE)
        .withStatus(200)));
    WireMock.stubFor(WireMock.post(WireMock.urlPathEqualTo("/kyc/camila"))
      .willReturn(WireMock.aResponse()
        .withStatus(500)));

    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("camila");
    verificationServiceClient.verify("thoai");
    verificationServiceClient.verify("camila");

    Thread.sleep(3000);

    HystrixCircuitBreaker myCircuitBreaker = HystrixCircuitBreaker.Factory.getInstance(
      HystrixCommandKey.Factory.asKey("VerificationServiceClient#verify(String)"));

    // At this point the circuit is already open.

    String response = verificationServiceClient.verify("thoai");
    // The fallback is returned because the open circuit rejects the call.
    Assertions.assertEquals(ApplicationUtil.FALLBACK_MESSAGE, response);

    Thread.sleep(2000);
    response = verificationServiceClient.verify("thoai");
    Assertions.assertEquals(ApplicationUtil.SUCCESS_MESSAGE, response);
    // After the 2000ms sleep window, a successful request closes the circuit again.
    Assertions.assertFalse(myCircuitBreaker.isOpen());
  }

4. Conclusion

We’ve looked at the Circuit Breaker pattern and how it helps achieve fault tolerance in a microservices architecture, using a sample project with several test scenarios. Hystrix also implements the bulkhead pattern, which limits the number of concurrent calls to a component and is another way to achieve fault tolerance; we will dive into bulkheads in another article. The sample project can be found on GitHub.
