Circuit breaking and retries on Kubernetes with Istio and Spring Boot

Posted Jun 5, 2020 · 8 min read

For any service mesh framework, the ability to handle communication failures between services is absolutely essential. This includes handling timeouts and HTTP error codes. In this article, I will show how to configure retries and circuit breaking with Istio. As in the previous article on using Istio Service Mesh on Kubernetes, we will analyze the communication between two simple Spring Boot applications deployed on Kubernetes. But instead of very basic examples, we will discuss more advanced topics.

Examples

To demonstrate the usage of Istio and Spring Boot, I created a GitHub repository with two sample applications: callme-service and caller-service. The address of this repository is https://github.com/piomin/sample-istio-services.git. As mentioned in the introduction, my first article on service mesh with Istio uses the same repository.

Architecture

The architecture of our example system is very similar to the one from the previous article. However, there are some differences. We do not use Istio components to inject faults or delays; instead, errors and delays are injected directly into the source code of the applications. Why? Now we will be able to handle them directly with the rules created for callme-service, without having to handle them on the client side as before. In addition, we run two instances of the callme-service application in version v2, in order to test how the circuit breaker works across instances of the same service (the same Deployment). The following diagram illustrates the architecture just described.

Spring Boot application

Let's start with the implementation of the sample applications. The application callme-service exposes two endpoints that return information about its version and instance ID. The endpoint GET /ping-with-random-error returns an HTTP 504 error code for approximately 50% of requests. The endpoint GET /ping-with-random-delay returns its response with a random delay between 0s and 3s. Here is the implementation of the @RestController on the callme-service side.

@RestController
@RequestMapping("/callme")
public class CallmeController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallmeController.class);
    private static final String INSTANCE_ID = UUID.randomUUID().toString();
    private Random random = new Random();

    @Autowired
    BuildProperties buildProperties;
    @Value("${VERSION}")
    private String version;

    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        int r = random.nextInt(100);
        if(r%2 == 0) {
            LOGGER.info("Ping with random error:name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.GATEWAY_TIMEOUT);
            return new ResponseEntity<>("Surprise "+ INSTANCE_ID +" "+ version, HttpStatus.GATEWAY_TIMEOUT);
        } else {
            LOGGER.info("Ping with random error:name={}, version={}, random={}, httpCode={}",
                    buildProperties.getName(), version, r, HttpStatus.OK);
            return new ResponseEntity<>("I'm callme-service" + INSTANCE_ID + "" + version, HttpStatus.OK);
        }
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() throws InterruptedException {
        int r = new Random().nextInt(3000);
        LOGGER.info("Ping with random delay:name={}, version={}, delay={}", buildProperties.getName(), version, r);
        Thread.sleep(r);
        return "I'm callme-service "+ version;
    }

}
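
The VERSION property injected in the controller above comes from an environment variable set on the Kubernetes Deployment. Each Deployment also carries a version label, which the Istio DestinationRule subsets shown later match on. Below is a rough sketch of what the callme-service v2 Deployment could look like; the container image name and some fields are assumptions, the actual manifests are available in the repository.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: callme-service-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: callme-service
      version: v2
  template:
    metadata:
      labels:
        app: callme-service
        version: v2          # matched by the DestinationRule subset v2
    spec:
      containers:
        - name: callme-service
          image: piomin/callme-service   # assumed image name
          ports:
            - containerPort: 8080
          env:
            - name: VERSION              # injected into the app via @Value("${VERSION}")
              value: "v2"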

The application caller-service also exposes two GET endpoints. It uses RestTemplate to call the corresponding GET endpoints exposed by callme-service. It returns the version of caller-service as well, but there is only a single Deployment of this application, labeled with version=v1.

@RestController
@RequestMapping("/caller")
public class CallerController {

    private static final Logger LOGGER = LoggerFactory.getLogger(CallerController.class);

    @Autowired
    BuildProperties buildProperties;
    @Autowired
    RestTemplate restTemplate;
    @Value("${VERSION}")
    private String version;


    @GetMapping("/ping-with-random-error")
    public ResponseEntity<String> pingWithRandomError() {
        LOGGER.info("Ping with random error:name={}, version={}", buildProperties.getName(), version);
        ResponseEntity<String> responseEntity =
                restTemplate.getForEntity("http://callme-service:8080/callme/ping-with-random-error", String.class);
        LOGGER.info("Calling:responseCode={}, response={}", responseEntity.getStatusCode(), responseEntity.getBody());
        return new ResponseEntity<>("I'm caller-service "+ version + ". Calling..." + responseEntity.getBody(), responseEntity.getStatusCode());
    }

    @GetMapping("/ping-with-random-delay")
    public String pingWithRandomDelay() {
        LOGGER.info("Ping with random delay:name={}, version={}", buildProperties.getName(), version);
        String response = restTemplate.getForObject("http://callme-service:8080/callme/ping-with-random-delay", String.class);
        LOGGER.info("Calling:response={}", response);
        return "I'm caller-service "+ version + ". Calling..." + response;
    }

}
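
The RestTemplate injected above has to be declared as a bean somewhere, typically in the main application class. A minimal sketch of such a declaration (the class name is illustrative, not taken from the repository):

@SpringBootApplication
public class CallerApplication {

    public static void main(String[] args) {
        SpringApplication.run(CallerApplication.class, args);
    }

    // plain RestTemplate bean used by CallerController to call callme-service
    @Bean
    RestTemplate restTemplate() {
        return new RestTemplate();
    }

}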

Handling retries in Istio

The Istio DestinationRule definition is the same as in my previous article about using Istio and Spring Boot in a service mesh on Kubernetes. Two subsets are created for the instances labeled version=v1 and version=v2. Retries and timeouts can be configured on the VirtualService. We can set the number of retry attempts and the retry conditions (as a list of enumerated strings). The configuration below also sets a 3s timeout for the whole request. Both of those settings are available in the HTTPRoute object. We also need to set a timeout per single attempt. In this case, I set it to 1 second. How does it work in practice? We will analyze it with a simple example.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: 5xx
      timeout: 3s
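
If you manage these manifests by hand, they can be applied like any other Kubernetes resource, for example (the file name below is illustrative):

$ kubectl apply -f callme-service-istio-rules.yaml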

Before deploying the sample applications, we should increase the logging level by enabling Istio access logs. The Envoy proxies then print access logs for all incoming requests and outgoing responses to their standard output. Analyzing these log entries will be particularly useful for detecting retry attempts.

$ istioctl manifest apply --set profile=default --set meshConfig.accessLogFile="/dev/stdout"

Now, let's send a test request to the HTTP endpoint GET /caller/ping-with-random-delay. It calls the randomly delayed callme-service endpoint GET /callme/ping-with-random-delay. Here are the request and the response for that call.
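
Assuming the applications are deployed and caller-service is port-forwarded to localhost:8080 (as described later in this article), the request can be sent with curl:

curl http://localhost:8080/caller/ping-with-random-delay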

The situation seems very clear, but let's check what happened behind the scenes. I have highlighted the retry attempts in the logs. As you can see, Istio performed two retries, because the first two attempts took longer than the perTryTimeout set to 1s. Both of those attempts were timed out by Istio, which can be verified in its access log. The third attempt succeeded, because it took around 400 milliseconds.

A timeout is not the only retry trigger available in Istio. In fact, we can retry on all 5XX, or even 4XX, response codes. The VirtualService used for testing error codes is simpler, because we do not configure any timeout.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20
      retries:
        attempts: 3
        retryOn: gateway-error,connect-failure,refused-stream

We call the HTTP endpoint GET /caller/ping-with-random-error, which in turn calls the endpoint GET /callme/ping-with-random-error exposed by callme-service. That endpoint returns HTTP 504 for approximately 50% of incoming requests. Here is a request that finished with an HTTP 200 OK code and a successful response.
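
This request can also be sent with curl; the -i flag prints the response status line, which makes it easy to see whether the final result after retries is 200 OK or 504:

curl -i http://localhost:8080/caller/ping-with-random-error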

The following logs explain what happened on the callme-service side. Since the first two attempts resulted in HTTP error codes, the request was retried two times.

Handling circuit breaking in Istio

A circuit breaker is configured on the DestinationRule object, inside a TrafficPolicy. First, we do not set up any retries as in the previous example, so they need to be removed from the VirtualService definition. We should also disable retries on the connectionPool inside TrafficPolicy. Now the most important part: to configure a circuit breaker in Istio, we use the OutlierDetection object. Istio's circuit breaker implementation is based on consecutive errors returned by the downstream service. The number of consecutive errors can be configured with the property consecutive5xxErrors or consecutiveGatewayErrors. The only difference between them is the set of HTTP errors they handle: consecutiveGatewayErrors applies only to 502, 503 and 504, while consecutive5xxErrors covers all 5XX codes. In the callme-service-destination configuration below, I set consecutive5xxErrors to 3. This means that after 3 consecutive errors, an instance of the application (a pod) is ejected from load balancing for 1 minute (baseEjectionTime=1m). Since we run two pods of callme-service in version v2, we also need to override the default value of maxEjectionPercent to 100%. The default value of this property is 10%, and it denotes the maximum percentage of hosts that can be ejected from the load-balancing pool.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: callme-service-destination
spec:
  host: callme-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
        maxRetries: 0
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 1m
      maxEjectionPercent: 100
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: callme-service-route
spec:
  hosts:
    - callme-service
  http:
    - route:
      - destination:
          host: callme-service
          subset: v2
        weight: 80
      - destination:
          host: callme-service
          subset: v1
        weight: 20

The fastest way to deploy both applications is with Jib and Skaffold. First, go to the callme-service directory and execute the skaffold dev command with the optional --port-forward parameter.

$ cd callme-service
$ skaffold dev --port-forward

Do the same for caller-service:

$ cd caller-service
$ skaffold dev --port-forward

Before sending some test requests, let's start a second instance of callme-service in version v2, because the Deployment sets the number of replicas to 1. To do that, we need to run the following command.

$ kubectl scale --replicas=2 deployment/callme-service-v2

Now, let's verify the deployment status on Kubernetes. There are 3 deployments.
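
One way to check it is simply listing the Deployments and the Pods:

$ kubectl get deployments
$ kubectl get pods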

After that, we are ready to send some test requests. We call the endpoint GET /caller/ping-with-random-error exposed by caller-service, which in turn calls the endpoint GET /callme/ping-with-random-error exposed by callme-service. That endpoint returns HTTP 504 for 50% of requests. I have set up port forwarding for caller-service on port 8080, so the command to call the application is:

curl http://localhost:8080/caller/ping-with-random-error
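
To trigger the circuit breaker, the request has to be repeated a number of times; a simple shell loop can be used for that:

for i in $(seq 1 20); do curl http://localhost:8080/caller/ping-with-random-error; echo; done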

Now, let's analyze the caller-service responses. I have highlighted the HTTP 504 error responses returned by the callme-service v2 instance with ID 98c068bb-8d02-4d2a-9999-23951bbed6ad. After three error responses in a row from that instance, it was immediately ejected from the load-balancing pool, which resulted in all subsequent requests being sent to the second instance of callme-service v2, with ID 00653617-58e1-4d59-9e36-3f98f9d403b8. Of course, there is still a single instance of callme-service v1, which receives 20% of the total traffic sent by caller-service.

OK, let's check what happens when the single instance of callme-service v1 returns 3 errors in a row. I have also highlighted those error responses in the logs visible below. Because there is only one instance of callme-service v1 in the pool, there is no other instance the incoming traffic can be redirected to. That is why Istio returns HTTP 503 for the next request sent to callme-service v1. The same response is returned for 1 minute after the circuit has been opened.

PS: This article is a translation, [original](https://link.zhihu.com/?target=https%3A//piotrminkowski.com/2020/06/03/circuit-breaker-and-retries-on-kubernetes-with-istio-and-spring-boot/amp/)