Workshop 8: Activity Retry + Timeout
Building Robust Fault-Tolerant Workflows
What we want to build
Implement robust activity retry and timeout configurations to handle failures gracefully.
Learn different retry strategies for different types of operations.
Expected Results
By the end of this workshop, you'll have:
- ✅ Activities with custom retry policies
- ✅ Different timeout strategies for different operation types
- ✅ Proper failure handling and exponential backoff
- ✅ Circuit breaker patterns for external services
Code Steps
Step 1: Configure Retry Policies
class ResilientWorkflowImpl : ResilientWorkflow {

    // Quick operations - aggressive retries
    private val validationActivity = Workflow.newActivityStub(
        ValidationActivity::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(10))
            .setRetryOptions(
                RetryOptions.newBuilder()
                    .setInitialInterval(Duration.ofSeconds(1))
                    .setMaximumInterval(Duration.ofSeconds(10))
                    .setBackoffCoefficient(2.0)
                    .setMaximumAttempts(5)
                    .build()
            )
            .build()
    )

    // Continued on next slide...
External API Configuration
    // External API calls - conservative retries
    private val externalApiActivity = Workflow.newActivityStub(
        ExternalApiActivity::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofMinutes(2))
            .setRetryOptions(
                RetryOptions.newBuilder()
                    .setInitialInterval(Duration.ofSeconds(5))
                    .setMaximumInterval(Duration.ofMinutes(5))
                    .setBackoffCoefficient(3.0)
                    .setMaximumAttempts(3)
                    .build()
            )
            .build()
    )
}
Notice the different strategies: aggressive retries for fast internal operations, conservative retries for slow external calls.
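When an activity still fails after its final attempt, the failure surfaces in the workflow as an io.temporal.failure.ActivityFailure. A minimal sketch of a workflow method using the two stubs above — the method name and request/response types here are illustrative, not part of the workshop's interfaces:

    // Inside ResilientWorkflowImpl; processOrder/OrderRequest are hypothetical names
    override fun processOrder(request: OrderRequest): String {
        // Retried up to 5 times per the aggressive policy above
        validationActivity.validate(request)

        return try {
            // Retried up to 3 times per the conservative policy above
            externalApiActivity.callExternalService(ApiRequest(request.payload)).body
        } catch (e: ActivityFailure) {
            // Raised only after all attempts are exhausted, or immediately
            // for a non-retryable ApplicationFailure (see Step 2)
            "fallback-response"
        }
    }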
Step 2: Handle Different Failure Types
@Component
class ExternalApiActivityImpl : ExternalApiActivity {

    override fun callExternalService(request: ApiRequest): ApiResponse {
        try {
            return httpClient.post(request)
        } catch (e: ConnectTimeoutException) {
            // Retryable - transient network issue
            throw ApplicationFailure.newFailure("Network timeout", "NETWORK_ERROR")
        } catch (e: HttpStatusCodeException) {
            // Parent of HttpClientErrorException and HttpServerErrorException,
            // so both 4xx and 5xx responses land here
            when (e.statusCode.value()) {
                400, 401, 403, 404 -> {
                    // Non-retryable - client error; retrying won't change the outcome
                    throw ApplicationFailure.newNonRetryableFailure(
                        "Client error: ${e.statusText}",
                        "CLIENT_ERROR"
                    )
                }
                // Continued on next slide...
Error Classification Continued
                429, 500, 502, 503 -> {
                    // Retryable - rate limiting or transient server issue
                    throw ApplicationFailure.newFailure(
                        "Server error: ${e.statusText}",
                        "SERVER_ERROR"
                    )
                }
                else -> throw e
            }
        }
    }
}
Error Classification Strategy:
- ✅ 4xx client errors (400, 401, 403, 404) → Don't retry
- ✅ 429 and 5xx errors (500, 502, 503) → Retry with backoff
- ✅ Network timeouts → Retry aggressively
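The same classification can also be expressed declaratively on the retry policy, instead of (or in addition to) throwing newNonRetryableFailure from the activity. A sketch using the CLIENT_ERROR type string from Step 2:

    // Failures whose type matches a setDoNotRetry entry are never retried
    private val externalApiActivity = Workflow.newActivityStub(
        ExternalApiActivity::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofMinutes(2))
            .setRetryOptions(
                RetryOptions.newBuilder()
                    .setMaximumAttempts(3)
                    .setDoNotRetry("CLIENT_ERROR")
                    .build()
            )
            .build()
    )

Keeping the classification on the stub means activity code can throw plain ApplicationFailures while the caller decides what is retryable.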
Step 3: Activity Heartbeats for Long Operations
@Component
class LongRunningActivityImpl : LongRunningActivity {

    override fun processLargeFile(filePath: String): ProcessingResult {
        val context = Activity.getExecutionContext()
        val totalSteps = 100

        for (step in 1..totalSteps) {
            try {
                // Report progress; the heartbeat is also how the worker
                // learns about cancellation
                context.heartbeat(step)
            } catch (e: ActivityCanceledException) {
                // heartbeat() throws once the workflow requests cancellation
                logger.info("Activity cancelled at step $step")
                throw e
            }

            // Do actual work
            processFileChunk(filePath, step)

            Thread.sleep(1000) // Simulate work
        }

        return ProcessingResult("File processed successfully")
    }
}
Heartbeat Pattern Benefits
Why Use Heartbeats:
- ✅ Progress tracking - Monitor long-running operations
- ✅ Cancellation detection - Respond to workflow cancellation
- ✅ Timeout prevention - Keep activity alive during processing
- ✅ Failure detection - Detect worker crashes quickly
- ✅ Resource optimization - Clean up abandoned work
Use heartbeats for any activity taking more than 30 seconds
Configuring the Heartbeat Timeout
Configure heartbeat timeout:
private val longRunningActivity = Workflow.newActivityStub(
    LongRunningActivity::class.java,
    ActivityOptions.newBuilder()
        .setStartToCloseTimeout(Duration.ofMinutes(10))
        .setHeartbeatTimeout(Duration.ofSeconds(30))
        .build()
)
The heartbeat timeout must be shorter than the start-to-close timeout, and the activity should heartbeat well within each heartbeat-timeout window
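The heartbeat payload also pays off on retry: after a worker crash, the next attempt can read the last recorded details and resume instead of starting over. A minimal sketch, assuming the step-number details recorded in Step 3:

    val context = Activity.getExecutionContext()
    // Details recorded by the previous (failed) attempt, if any
    val startStep = context.getHeartbeatDetails(Int::class.javaObjectType).orElse(1)

    for (step in startStep..totalSteps) {
        context.heartbeat(step)
        processFileChunk(filePath, step)
    }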
Retry Strategy Examples
Operation Type → Retry Strategy:
| Operation    | Initial Interval | Max Interval | Backoff | Max Attempts |
|--------------|------------------|--------------|---------|--------------|
| Validation   | 1s               | 10s          | 2.0     | 5            |
| Database     | 500ms            | 30s          | 1.5     | 15           |
| External API | 5s               | 5m           | 3.0     | 3            |
| File I/O     | 2s               | 1m           | 2.0     | 10           |
Match retry strategy to operation characteristics and failure patterns
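As a worked example, the Database row translates directly into stub options, following the same pattern as Step 1. A sketch — DatabaseActivity is an assumed interface analogous to ValidationActivity, and the 30-second start-to-close timeout is an illustrative choice:

    // Database operations - fast initial retry, gentle backoff, many attempts
    private val databaseActivity = Workflow.newActivityStub(
        DatabaseActivity::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(30))
            .setRetryOptions(
                RetryOptions.newBuilder()
                    .setInitialInterval(Duration.ofMillis(500))
                    .setMaximumInterval(Duration.ofSeconds(30))
                    .setBackoffCoefficient(1.5)
                    .setMaximumAttempts(15)
                    .build()
            )
            .build()
    )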
💡 Key Patterns
Exponential Backoff (arithmetic sketched below):
- Start small (1-5 seconds) and grow exponentially
- Cap maximum wait time to prevent unbounded delays
- Use jitter to prevent thundering herd
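The delay before attempt n follows directly from these three knobs. A quick sketch of the arithmetic, with defaults mirroring the validation policy from Step 1 (jitter, where applied, randomizes these values slightly):

    import kotlin.math.min
    import kotlin.math.pow

    // Delay before retry attempt n (1-based), ignoring any jitter
    fun backoffMillis(
        attempt: Int,
        initialMs: Long = 1_000,   // setInitialInterval
        maxMs: Long = 10_000,      // setMaximumInterval
        coefficient: Double = 2.0  // setBackoffCoefficient
    ): Long = min((initialMs * coefficient.pow(attempt - 1)).toLong(), maxMs)

    // backoffMillis(1) == 1_000, backoffMillis(2) == 2_000,
    // backoffMillis(3) == 4_000, backoffMillis(5) == 10_000 (capped)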
Circuit Breaker (sketched below):
- Fail fast when external service is down
- Allow recovery through half-open state
- Protect resources from cascading failures
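The Temporal SDK doesn't ship a circuit breaker, so here is a deliberately minimal, hand-rolled sketch of the pattern that an activity could wrap around its external call; the class name and thresholds are illustrative:

    import java.time.Duration
    import java.time.Instant

    class CircuitBreaker(
        private val failureThreshold: Int = 5,
        private val openDuration: Duration = Duration.ofSeconds(30)
    ) {
        private var failures = 0
        private var openedAt: Instant? = null

        @Synchronized
        fun <T> call(block: () -> T): T {
            openedAt?.let {
                if (Instant.now().isBefore(it.plus(openDuration))) {
                    // Open: fail fast without touching the downstream service
                    throw IllegalStateException("Circuit open - failing fast")
                }
                // Half-open: openDuration has elapsed, let one trial call through
            }
            return try {
                val result = block()
                failures = 0       // Success closes the circuit
                openedAt = null
                result
            } catch (e: Exception) {
                failures++
                if (failures >= failureThreshold) openedAt = Instant.now()
                throw e
            }
        }
    }

Usage inside an activity: breaker.call { httpClient.post(request) }. Keep maximumAttempts low on such activities so Temporal's retries and the breaker's fail-fast behavior don't work against each other.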
🚀 Production Tips
Monitoring and Alerting:
- ✅ Track retry counts by activity type
- ✅ Alert on high failure rates
- ✅ Monitor timeout patterns
- ✅ Dashboard heartbeat status
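A low-effort source for retry-count metrics is the attempt number Temporal exposes inside every activity. A sketch; meterRegistry is an assumed Micrometer-style dependency, and any metrics client works:

    // At the top of an activity method; Temporal numbers attempts from 1
    val attempt = Activity.getExecutionContext().info.attempt
    if (attempt > 1) {
        logger.warn("Retrying callExternalService, attempt #$attempt")
        // meterRegistry is an assumed metrics client (e.g. Micrometer)
        meterRegistry.counter("activity_retries", "activity", "callExternalService").increment()
    }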
Testing:
- ✅ Test timeout scenarios
- ✅ Simulate network failures
- ✅ Verify compensation logic
- ✅ Load test retry behavior
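Temporal's test environment skips time, so even multi-minute backoff schedules run in milliseconds. A sketch of a retry test using TestWorkflowEnvironment — FlakyApiStub and its behavior are illustrative stand-ins for the workshop's real implementations:

    import io.temporal.client.WorkflowOptions
    import io.temporal.testing.TestWorkflowEnvironment

    // Fails the first two attempts, then succeeds - exercises the retry policy
    class FlakyApiStub : ExternalApiActivity {
        private var calls = 0
        override fun callExternalService(request: ApiRequest): ApiResponse {
            if (++calls < 3) throw RuntimeException("Simulated network failure")
            return ApiResponse("ok")
        }
    }

    fun retryBehaviorTest() {
        val testEnv = TestWorkflowEnvironment.newInstance()
        val worker = testEnv.newWorker("test-queue")
        worker.registerWorkflowImplementationTypes(ResilientWorkflowImpl::class.java)
        worker.registerActivitiesImplementations(FlakyApiStub())
        testEnv.start()

        val workflow = testEnv.workflowClient.newWorkflowStub(
            ResilientWorkflow::class.java,
            WorkflowOptions.newBuilder().setTaskQueue("test-queue").build()
        )
        // Invoke the @WorkflowMethod here and assert on the result; backoff
        // timers are auto-skipped, so the two retries complete almost instantly
        testEnv.close()
    }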
Building bulletproof distributed systems! 🎉