Workshop 8: Activity Retry + Timeout

Building Robust Fault-Tolerant Workflows



What we want to build

Implement robust activity retry and timeout configurations to handle failures gracefully.

Learn different retry strategies for different types of operations.


Expected Result

By the end of this workshop, you'll have:

  • Activities with custom retry policies
  • Different timeout strategies for different operation types
  • Proper failure handling and exponential backoff
  • Circuit breaker patterns for external services

Code Steps

Step 1: Configure Retry Policies

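import io.temporal.activity.ActivityOptions
import io.temporal.common.RetryOptions
import io.temporal.workflow.Workflow
import java.time.Duration

class ResilientWorkflowImpl : ResilientWorkflow {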

    // Quick operations - aggressive retries
    private val validationActivity = Workflow.newActivityStub(
        ValidationActivity::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(10))
            .setRetryOptions(
                RetryOptions.newBuilder()
                    .setInitialInterval(Duration.ofSeconds(1))
                    .setMaximumInterval(Duration.ofSeconds(10))
                    .setBackoffCoefficient(2.0)
                    .setMaximumAttempts(5)
                    .build()
            )
            .build()
    )

    // External API calls - conservative retries
    private val externalApiActivity = Workflow.newActivityStub(
        ExternalApiActivity::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofMinutes(2))
            .setRetryOptions(
                RetryOptions.newBuilder()
                    .setInitialInterval(Duration.ofSeconds(5))
                    .setMaximumInterval(Duration.ofMinutes(5))
                    .setBackoffCoefficient(3.0)
                    .setMaximumAttempts(3)
                    .build()
            )
            .build()
    )
}

Notice the different strategies: aggressive (short intervals, more attempts) for fast internal operations, conservative (longer intervals, fewer attempts) for external services
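To see what "aggressive" and "conservative" mean in wall-clock terms, here is a minimal sketch that lists the waits between attempts for a given policy. It assumes Temporal's documented backoff rule as we understand it (each wait is the previous one multiplied by the backoff coefficient, capped at the maximum interval, and the maximum attempts count includes the first attempt). `retrySchedule` is a hypothetical helper, not part of the Temporal SDK.

```kotlin
import java.time.Duration

// Hypothetical helper (not a Temporal API): lists the waits between attempts using
//   wait(n) = min(initialInterval * backoffCoefficient^(n - 1), maximumInterval)
// maximumAttempts includes the first attempt, so there are (maximumAttempts - 1) waits.
fun retrySchedule(
    initialInterval: Duration,
    maximumInterval: Duration,
    backoffCoefficient: Double,
    maximumAttempts: Int,
): List<Duration> {
    val waits = mutableListOf<Duration>()
    var nextMillis = initialInterval.toMillis().toDouble()
    repeat(maximumAttempts - 1) {
        // Cap each wait at the maximum interval
        val cappedMillis = minOf(nextMillis, maximumInterval.toMillis().toDouble())
        waits += Duration.ofMillis(cappedMillis.toLong())
        nextMillis *= backoffCoefficient
    }
    return waits
}
```

For the validation stub this gives waits of 1s, 2s, 4s and 8s. For the external API stub it gives 5s and then 15s, so with only 3 attempts the 5-minute maximum interval is never actually reached.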


Step 2: Handle Different Failure Types

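import io.temporal.failure.ApplicationFailure
import org.apache.http.conn.ConnectTimeoutException // assumes Apache HttpClient 4.x; adjust for your client
import org.springframework.stereotype.Component
import org.springframework.web.client.HttpStatusCodeException

@Component
class ExternalApiActivityImpl : ExternalApiActivity {

    // httpClient: the HTTP client used to reach the external service (not shown)

    override fun callExternalService(request: ApiRequest): ApiResponse {
        try {
            return httpClient.post(request)
        } catch (e: ConnectTimeoutException) {
            // Retriable - network issue; the activity's RetryOptions control the backoff
            throw ApplicationFailure.newFailure("Network timeout", "NETWORK_ERROR")
        } catch (e: HttpStatusCodeException) {
            // HttpStatusCodeException covers both HttpClientErrorException (4xx)
            // and HttpServerErrorException (5xx); catching only the former misses 5xx
            when (e.statusCode.value()) {
                400, 401, 403, 404 -> {
                    // Non-retriable - client error; the same request will fail again
                    throw ApplicationFailure.newNonRetryableFailure(
                        "Client error: ${e.statusText}",
                        "CLIENT_ERROR"
                    )
                }
                429, 500, 502, 503 -> {
                    // Retriable - rate limiting (429) or transient server error (5xx)
                    throw ApplicationFailure.newFailure(
                        "Transient HTTP error: ${e.statusText}",
                        "SERVER_ERROR"
                    )
                }
                // Other statuses: rethrow; Temporal retries ordinary exceptions by default
                else -> throw e
            }
        }
    }
}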

Error Classification Strategy:

  • 4xx client errors (400, 401, 403, 404) → Don't retry
  • 429 (rate limited) and 5xx errors (500, 502, 503) → Retry with backoff
  • Network timeouts → Retry with backoff
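The classification strategy above can be pulled out into a pure function so the decision is unit-testable without an HTTP client or a Temporal worker. This is a minimal sketch; `RetryDecision` and `classifyHttpStatus` are hypothetical names, and treating unlisted statuses as retriable is an assumption that mirrors the `else -> throw e` branch, since Temporal retries ordinary exceptions by default.

```kotlin
// Hypothetical decision type for the classification strategy above
enum class RetryDecision { RETRY, DO_NOT_RETRY }

fun classifyHttpStatus(status: Int): RetryDecision = when (status) {
    // Client errors: retrying the same request will not help
    400, 401, 403, 404 -> RetryDecision.DO_NOT_RETRY
    // Rate limiting and transient server errors: retry with backoff
    429, 500, 502, 503 -> RetryDecision.RETRY
    // Assumption: unlisted statuses keep the default behaviour, which is to retry
    else -> RetryDecision.RETRY
}
```

An alternative worth considering is to keep throwing retriable failures everywhere and list the non-retriable failure types (for example "CLIENT_ERROR") in the activity's `RetryOptions` via `setDoNotRetry`, so the policy lives next to the other retry settings.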

Step 3: Activity Heartbeats for Long Operations

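import io.temporal.activity.Activity
import io.temporal.client.ActivityCanceledException
import org.slf4j.LoggerFactory
import org.springframework.stereotype.Component

@Component
class LongRunningActivityImpl : LongRunningActivity {

    private val logger = LoggerFactory.getLogger(LongRunningActivityImpl::class.java)

    override fun processLargeFile(filePath: String): ProcessingResult {
        val context = Activity.getExecutionContext()
        val totalSteps = 100

        for (step in 1..totalSteps) {
            try {
                // Report progress via heartbeat. This is also how cancellation arrives:
                // heartbeat() throws ActivityCanceledException once cancellation is requested
                context.heartbeat(step)
            } catch (e: ActivityCanceledException) {
                logger.info("Activity cancelled at step $step")
                throw e // clean up partial work first, then rethrow
            }

            // Do actual work
            processFileChunk(filePath, step)

            Thread.sleep(1000) // Simulate work
        }

        return ProcessingResult("File processed successfully")
    }
}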

Heartbeat Pattern Benefits

Why Use Heartbeats:

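  • Progress tracking - Record how far a long operation has got (here, the current step)
  • Cancellation detection - Learn about workflow cancellation through the heartbeat call
  • Liveness - Each heartbeat resets the heartbeat timer; the start-to-close timeout still applies
  • Failure detection - Detect worker crashes within the heartbeat timeout instead of the full start-to-close timeout
  • Resource optimization - Stop and clean up work that nobody is waiting for anymore

Rule of thumb: heartbeat from any activity that can run longer than about 30 seconds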
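Progress tracking pays off most when a retried attempt can skip work already done. In the Java SDK the next attempt can read the last heartbeat details back with `getHeartbeatDetails()`; the sketch below simulates that without Temporal, so `ChunkedJob`, `lastHeartbeat` and `crashAt` are all hypothetical. It records the last *completed* step; with the Step 3 ordering (heartbeat before the chunk) you would restart at the recorded step itself rather than the one after it.

```kotlin
// Simulation only: `lastHeartbeat` stands in for the heartbeat details that a
// retried attempt would read back from Temporal.
class ChunkedJob(private val totalSteps: Int) {
    val processed = mutableListOf<Int>()
    var lastHeartbeat: Int? = null
        private set

    // Runs one attempt, starting after the last recorded step.
    // `crashAt` simulates a worker crash just before that step is processed.
    fun runAttempt(previousDetails: Int?, crashAt: Int? = null) {
        val firstStep = (previousDetails ?: 0) + 1
        for (step in firstStep..totalSteps) {
            if (step == crashAt) throw IllegalStateException("worker crashed at step $step")
            processed += step
            lastHeartbeat = step // record the last completed step
        }
    }
}
```

A first attempt that crashes at step 3 leaves `lastHeartbeat == 2`; passing that value to the next attempt processes steps 3 to 5 only, with no step done twice.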


How to Run

Configure heartbeat timeout:

private val longRunningActivity = Workflow.newActivityStub(
    LongRunningActivity::class.java,
    ActivityOptions.newBuilder()
        .setStartToCloseTimeout(Duration.ofMinutes(10))
        .setHeartbeatTimeout(Duration.ofSeconds(30))
        .build()
)

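Heartbeat timeout must be shorter than the start-to-close timeout, and the activity must call heartbeat() more often than the heartbeat timeout (Step 3 heartbeats roughly once per second against a 30-second timeout)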
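These two constraints are easy to break when timeouts are tuned separately. A minimal sketch of a sanity check, assuming you know roughly how often the activity code heartbeats; `timeoutProblems` is a hypothetical helper, not a Temporal API.

```kotlin
import java.time.Duration

// Hypothetical check for an activity's timeout settings; returns human-readable problems
fun timeoutProblems(
    startToClose: Duration,
    heartbeatTimeout: Duration,
    heartbeatEvery: Duration, // how often the activity code calls heartbeat()
): List<String> {
    val problems = mutableListOf<String>()
    if (heartbeatTimeout >= startToClose)
        problems += "heartbeat timeout should be shorter than start-to-close timeout"
    if (heartbeatEvery >= heartbeatTimeout)
        problems += "activity must heartbeat more often than the heartbeat timeout"
    return problems
}
```

The configuration above (10 minutes start-to-close, 30 seconds heartbeat timeout, a heartbeat every second) passes both checks.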


Retry Strategy Examples

Operation Type → Retry Strategy:

Operation       Initial Interval   Max Interval   Backoff Coefficient   Max Attempts
Validation      1s                 10s            2.0                   5
Database        500ms              30s            1.5                   15
External API    5s                 5m             3.0                   3
File I/O        2s                 1m             2.0                   10

Match retry strategy to operation characteristics and failure patterns
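One way to compare the rows is the total time spent waiting between attempts if every attempt fails. The sketch below encodes the table as data and sums the waits under the same capped-exponential rule used in Step 1; it excludes the time the attempts themselves take. `RetryPolicyRow`, `totalBackoffMillis` and `retryTable` are hypothetical names.

```kotlin
// One row of the retry strategy table (all times in milliseconds)
data class RetryPolicyRow(
    val operation: String,
    val initialMillis: Long,
    val maxIntervalMillis: Long,
    val backoff: Double,
    val maxAttempts: Int,
)

// Sum of all waits between attempts when every attempt fails;
// maxAttempts includes the first attempt, so there are (maxAttempts - 1) waits
fun totalBackoffMillis(p: RetryPolicyRow): Long {
    var next = p.initialMillis.toDouble()
    var total = 0.0
    repeat(p.maxAttempts - 1) {
        total += minOf(next, p.maxIntervalMillis.toDouble())
        next *= p.backoff
    }
    return total.toLong()
}

val retryTable = listOf(
    RetryPolicyRow("Validation", 1_000, 10_000, 2.0, 5),
    RetryPolicyRow("Database", 500, 30_000, 1.5, 15),
    RetryPolicyRow("External API", 5_000, 300_000, 3.0, 3),
    RetryPolicyRow("File I/O", 2_000, 60_000, 2.0, 10),
)
```

Under these assumptions validation gives up after 15s of waiting, the external API after 20s, and file I/O after about 5 minutes (302s), which shows how much the attempt count and cap shape the overall budget.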


💡 Key Patterns

Exponential Backoff:

  • Start small (sub-second to a few seconds) and multiply the interval by the backoff coefficient on each retry
  • Cap the maximum interval so delays don't grow without bound
  • Use jitter to prevent thundering herd
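A minimal sketch of exponential backoff with "full jitter": instead of waiting exactly the capped exponential delay, pick a random delay between zero and that value, so clients that failed together don't all retry at the same instant. This is application-level code (for example for client-side retries); `backoffWithJitter` is a hypothetical helper, and the random source is injectable so it can be seeded in tests.

```kotlin
import kotlin.math.pow
import kotlin.random.Random

// Full jitter: uniform random delay in [0, min(initial * coefficient^(attempt - 1), max)]
fun backoffWithJitter(
    attempt: Int,          // 1-based retry number
    initialMillis: Long,
    coefficient: Double,
    maxMillis: Long,
    random: Random = Random.Default,
): Long {
    require(attempt >= 1) { "attempt must be >= 1" }
    val exponential = initialMillis * coefficient.pow(attempt - 1)
    val capped = minOf(exponential, maxMillis.toDouble()).toLong()
    return random.nextLong(0, capped + 1) // upper bound is exclusive, so add 1
}
```

With a 1s initial interval, coefficient 2.0 and a 10s cap, the fifth retry's exponential delay (16s) is capped to 10s, and the actual wait is drawn uniformly from 0 to 10s.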

Circuit Breaker:

  • Fail fast when external service is down
  • Allow recovery through half-open state
  • Protect resources from cascading failures
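The workshop describes the circuit breaker states without showing code. The following is a minimal, single-threaded sketch of the closed / open / half-open cycle; `CircuitBreaker`, `CircuitOpenException` and the thresholds are hypothetical, and a production version would need thread safety. Its state lives in one process, so inside an activity each worker would trip independently.

```kotlin
class CircuitOpenException(message: String) : RuntimeException(message)

// Minimal circuit breaker sketch (not thread-safe); the clock is injectable for tests
class CircuitBreaker(
    private val failureThreshold: Int,
    private val openDurationMillis: Long,
    private val clock: () -> Long = System::currentTimeMillis,
) {
    enum class State { CLOSED, OPEN, HALF_OPEN }

    var state = State.CLOSED
        private set
    private var consecutiveFailures = 0
    private var openedAt = 0L

    fun <T> call(action: () -> T): T {
        if (state == State.OPEN) {
            if (clock() - openedAt >= openDurationMillis) {
                state = State.HALF_OPEN // allow a trial call through
            } else {
                throw CircuitOpenException("circuit open; failing fast")
            }
        }
        return try {
            val result = action()
            consecutiveFailures = 0
            state = State.CLOSED // success closes the circuit
            result
        } catch (e: Exception) {
            consecutiveFailures++
            // A failed trial call, or too many failures in a row, opens the circuit
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN
                openedAt = clock()
            }
            throw e
        }
    }
}
```

After `failureThreshold` consecutive failures, calls fail fast with `CircuitOpenException` without touching the external service; once `openDurationMillis` has passed, one trial call decides whether the circuit closes again or re-opens.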

🚀 Production Tips

Monitoring and Alerting:

  • Track retry counts by activity type
  • Alert on high failure rates
  • Monitor timeout patterns
  • Show heartbeat status on a dashboard

Testing:

  • Test timeout scenarios
  • Simulate network failures
  • Verify compensation logic
  • Load test retry behavior
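For "Simulate network failures", a small fake that fails a fixed number of times before succeeding makes retry behaviour deterministic in unit tests. The sketch pairs it with a plain local retry loop purely to exercise the fake; with Temporal you would instead register such a fake as the activity implementation in a test environment (not shown). `FlakyService` and `retrying` are hypothetical.

```kotlin
import java.net.SocketTimeoutException

// Test double: throws a network-style exception for the first N calls, then succeeds
class FlakyService(private val failuresBeforeSuccess: Int) {
    var calls = 0
        private set

    fun call(): String {
        calls++
        if (calls <= failuresBeforeSuccess) throw SocketTimeoutException("simulated timeout #$calls")
        return "ok"
    }
}

// Simple local retry loop (no backoff) used only to drive the fake
fun <T> retrying(maxAttempts: Int, action: () -> T): T {
    require(maxAttempts >= 1) { "maxAttempts must be >= 1" }
    var last: Exception? = null
    repeat(maxAttempts) {
        try {
            return action()
        } catch (e: Exception) {
            last = e
        }
    }
    throw last!!
}
```

A service that fails twice succeeds on the third attempt; one that fails five times exhausts three attempts and surfaces the last `SocketTimeoutException`.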

Building bulletproof distributed systems! 🎉
