Infrastructure as Code

CloudFormation stack management with change sets and nested stacks

Master advanced CloudFormation patterns including change sets, rollbacks, and nested stack orchestration.

What this covers

Advanced CloudFormation concepts including change sets, stack rollbacks, nested stacks, cross-stack references, and automated deployment pipelines.

Implementation trail

  • Change set creation and review workflows
  • Change set execution and monitoring
  • Automated rollback and recovery procedures
  • Nested stack architecture patterns
  • Cross-stack parameter and output management
  • CI/CD integration with approval gates
  • Drift detection and remediation
  • Stack policy and termination protection

Understanding CloudFormation change sets

Change sets provide a preview of proposed changes before applying them to your stack, enabling safe infrastructure updates.

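  • Change sets list which resources will be added, modified, or removed, and whether a modification requires replacing the resource, before anything is executed.
  • They help prevent accidental resource deletion: nothing changes until a change set is explicitly executed, which leaves room for review and approval.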
  • Multiple change sets can be created for the same stack to compare different approaches.
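  • Each change records what triggered it (a parameter, a direct edit, or another resource's attribute), which helps trace dependencies between resources; CloudFormation works out the update order at execution time.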

Create and review change sets programmatically

Automate change set creation with proper error handling and comprehensive change analysis.

  • from datetime import datetime

    import boto3

    cfn = boto3.client('cloudformation')

    def create_change_set(stack_name, template_url, parameters):
        """Create a change set with comprehensive error handling"""
        changeset_name = f"{stack_name}-changeset-{int(datetime.now().timestamp())}"
        
        try:
            # Determine if this is a new stack or update
            stack = cfn.describe_stacks(StackName=stack_name)['Stacks'][0]
            # A stack left in REVIEW_IN_PROGRESS by an unexecuted CREATE change set
            # has no resources yet, so it still needs a CREATE change set
            if stack['StackStatus'] == 'REVIEW_IN_PROGRESS':
                changeset_type = 'CREATE'
            else:
                changeset_type = 'UPDATE'
        except cfn.exceptions.ClientError as e:
            if 'does not exist' in str(e):
                changeset_type = 'CREATE'
            else:
                raise
        
        # Create the change set
        response = cfn.create_change_set(
            StackName=stack_name,
            TemplateURL=template_url,
            Parameters=parameters,
            ChangeSetName=changeset_name,
            ChangeSetType=changeset_type,
            Capabilities=['CAPABILITY_IAM', 'CAPABILITY_NAMED_IAM'],
            IncludeNestedStacks=True,  # Include nested stack changes
            OnStackFailure='ROLLBACK'  # Rollback on failure
        )
        
        return wait_for_changeset_creation(stack_name, changeset_name)

    Function to create change sets with proper type detection and error handling.

  • Include nested stacks in change set analysis to see the full impact across your infrastructure.
  • Set appropriate capabilities (IAM, NAMED_IAM) based on the resources being created or modified.
  • Use descriptive change set names with timestamps for easy identification and tracking.
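
The function above calls wait_for_changeset_creation without defining it. A minimal sketch of such a helper follows, assuming it polls describe_change_set. The extra cfn, delay, max_attempts, and sleep parameters are assumptions added so the helper can be exercised without AWS access, and the StatusReason substrings used to recognise an empty change set should be checked against what your account actually returns.

```python
import time

# Assumed substrings of StatusReason for a change set that contains no changes.
NO_CHANGE_MARKERS = ("didn't contain changes", "No updates are to be performed")


def wait_for_changeset_creation(stack_name, changeset_name, cfn,
                                delay=5, max_attempts=60, sleep=time.sleep):
    """Poll describe_change_set until the change set is ready or has failed."""
    for _ in range(max_attempts):
        desc = cfn.describe_change_set(
            StackName=stack_name,
            ChangeSetName=changeset_name,
        )
        status = desc['Status']

        if status == 'CREATE_COMPLETE':
            return {'status': 'READY', 'changeset_name': changeset_name}

        if status == 'FAILED':
            reason = desc.get('StatusReason', '')
            # An empty change set is reported as FAILED; treat it as a no-op
            if any(marker in reason for marker in NO_CHANGE_MARKERS):
                return {'status': 'NO_CHANGES', 'changeset_name': changeset_name}
            raise RuntimeError(f"Change set {changeset_name} failed: {reason}")

        sleep(delay)  # Still CREATE_PENDING or CREATE_IN_PROGRESS

    raise TimeoutError(f"Change set {changeset_name} was not ready in time")
```

Returning NO_CHANGES instead of raising lets a pipeline skip the execution step cleanly when a template and its parameters are unchanged.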

Analyze change set impact and risks

Thoroughly review change sets to understand the impact and identify potential risks before execution.

  • def analyze_change_set(stack_name, changeset_name):
        """Analyze change set for risk assessment"""
        # Resource types that hold data which a removal or replacement can destroy
        data_resource_types = {'AWS::RDS::DBInstance', 'AWS::S3::Bucket', 'AWS::DynamoDB::Table'}
        
        risk_analysis = {
            'high_risk_changes': [],
            'resource_deletions': [],
            'replacement_changes': [],
            'data_loss_risk': False
        }
        
        # describe_change_set is paginated, so follow NextToken to see every change
        request = {'StackName': stack_name, 'ChangeSetName': changeset_name}
        while True:
            page = cfn.describe_change_set(**request)
            
            for change in page.get('Changes', []):
                resource_change = change.get('ResourceChange')
                if not resource_change:
                    continue  # Skip entries that do not describe a resource
                
                # Action, ResourceType, and Replacement live inside ResourceChange
                action = resource_change['Action']
                entry = {
                    'resource': resource_change['LogicalResourceId'],
                    'type': resource_change['ResourceType']
                }
                # Replacement is the string 'True', 'False', or 'Conditional'
                replaced = resource_change.get('Replacement') in ('True', 'Conditional')
                
                # Identify high-risk changes
                if action == 'Remove':
                    risk_analysis['resource_deletions'].append(entry)
                if replaced:
                    risk_analysis['replacement_changes'].append(entry)
                
                # Check for data loss risks
                if entry['type'] in data_resource_types and (action == 'Remove' or replaced):
                    risk_analysis['high_risk_changes'].append(entry)
                    risk_analysis['data_loss_risk'] = True
            
            if not page.get('NextToken'):
                break
            request['NextToken'] = page['NextToken']
        
        return risk_analysis

    Function to analyze change sets and identify high-risk operations that require special attention.

  • Pay special attention to resource replacements, which delete and recreate resources.
  • Identify changes that could cause data loss (RDS deletions, S3 bucket removals, etc.).
  • Review IAM permission changes to ensure they don't break existing applications.
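
To act on these review points automatically, the analysis result can be reduced to a coarse risk level and IAM-related changes can be listed separately for a closer look. The sketch below is one possible approach: classify_risk and find_iam_changes are hypothetical helper names, and the HIGH/LOW rule is an assumption that should be tuned to your own risk policy.

```python
def classify_risk(risk_analysis):
    """Reduce an analyze_change_set result to 'HIGH' or 'LOW' (assumed policy)."""
    if (risk_analysis.get('data_loss_risk')
            or risk_analysis.get('resource_deletions')
            or risk_analysis.get('replacement_changes')):
        return 'HIGH'
    return 'LOW'


def find_iam_changes(changes):
    """Return logical IDs of IAM resources touched by a change set.

    `changes` is the 'Changes' list from a describe_change_set response.
    """
    iam_changes = []
    for change in changes:
        resource_change = change.get('ResourceChange') or {}
        if resource_change.get('ResourceType', '').startswith('AWS::IAM::'):
            iam_changes.append(resource_change['LogicalResourceId'])
    return iam_changes
```

A level computed this way could be passed along as the risk_level field that an approval workflow branches on.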

Execute change sets with monitoring

Execute approved change sets with proper monitoring and immediate rollback capabilities.

  • import time

    def execute_change_set_with_monitoring(stack_name, changeset_name):
        """Execute change set with real-time monitoring"""
        
        # Execute the change set
        cfn.execute_change_set(
            StackName=stack_name,
            ChangeSetName=changeset_name
        )
        
        # Monitor execution progress
        while True:
            stack_status = cfn.describe_stacks(StackName=stack_name)['Stacks'][0]['StackStatus']
            
            # Check rollback states first: UPDATE_ROLLBACK_COMPLETE also ends
            # with _COMPLETE, but it means the update did not succeed
            if 'ROLLBACK' in stack_status:
                if stack_status.endswith('_COMPLETE'):
                    return {'status': 'ROLLED_BACK', 'final_status': stack_status}
                if stack_status.endswith('_FAILED'):
                    return {'status': 'ROLLBACK_FAILED', 'final_status': stack_status}
                return {'status': 'ROLLING_BACK', 'current_status': stack_status}
            elif stack_status.endswith('_COMPLETE'):
                return {'status': 'SUCCESS', 'final_status': stack_status}
            elif stack_status.endswith('_FAILED'):
                # Automatic rollback trigger
                return initiate_rollback(stack_name, f"Stack update failed: {stack_status}")
                
            # Check for stuck resources
            events = get_recent_stack_events(stack_name)
            if detect_stuck_resources(events):
                return initiate_rollback(stack_name, "Detected stuck resources during update")
                
            time.sleep(30)  # Wait before next check

    Function to execute change sets with continuous monitoring and automatic rollback triggers.

  • Monitor stack events in real-time to detect failures or stuck resources immediately.
  • Set up CloudWatch alarms for stack update duration to catch hung deployments.
  • Implement automatic rollback triggers for common failure scenarios.
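
The monitoring loop relies on two helpers, get_recent_stack_events and detect_stuck_resources, that are not shown. One way they might look is sketched below. The cfn parameter, the 30-minute threshold, and the rule "latest event is still *_IN_PROGRESS after the threshold" are all assumptions; the right threshold depends heavily on resource type (a database can legitimately take much longer than a security group).

```python
from datetime import datetime, timedelta, timezone


def get_recent_stack_events(stack_name, cfn, limit=50):
    """Return the newest stack events (the API lists events newest first)."""
    response = cfn.describe_stack_events(StackName=stack_name)
    return response['StackEvents'][:limit]


def detect_stuck_resources(events, now=None, max_age=timedelta(minutes=30)):
    """Return logical IDs whose latest event has been in progress too long."""
    now = now or datetime.now(timezone.utc)
    latest = {}
    for event in events:
        # The stack reports itself as an event too; it stays IN_PROGRESS for
        # the whole update, so it must not count as a stuck resource.
        if event.get('PhysicalResourceId') == event.get('StackId') and event.get('StackId'):
            continue
        # Events are newest first, so the first one seen per resource is its latest
        latest.setdefault(event['LogicalResourceId'], event)

    return [
        logical_id
        for logical_id, event in latest.items()
        if event['ResourceStatus'].endswith('_IN_PROGRESS')
        and now - event['Timestamp'] > max_age
    ]
```

Because the function returns a list, it works in a plain truthiness check and also tells the operator which resources triggered the rollback.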

Implement comprehensive rollback strategies

Design multiple rollback approaches for different failure scenarios and recovery requirements.

  • def initiate_rollback(stack_name, reason):
        """Initiate stack rollback with multiple strategies"""
        
        try:
            # Strategy 1: Cancel update and rollback
            # (only accepted while the stack is UPDATE_IN_PROGRESS)
            cfn.cancel_update_stack(StackName=stack_name)
            
            # Wait for rollback completion. The stack_update_complete waiter
            # treats UPDATE_ROLLBACK_COMPLETE as a failure, so wait on the
            # rollback waiter instead.
            waiter = cfn.get_waiter('stack_rollback_complete')
            waiter.wait(
                StackName=stack_name,
                WaiterConfig={'Delay': 30, 'MaxAttempts': 120}
            )
            
            return {'rollback_method': 'cancel_update', 'status': 'SUCCESS'}
            
        except cfn.exceptions.ClientError:
            # The cancel was rejected; decide based on the current stack status
            status = cfn.describe_stacks(StackName=stack_name)['Stacks'][0]['StackStatus']
            if status.endswith('ROLLBACK_IN_PROGRESS'):
                # Strategy 2: Continue rollback if already in progress
                return monitor_existing_rollback(stack_name)
            else:
                # Strategy 3: Force rollback with previous template
                return force_rollback_to_previous_version(stack_name, reason)

    Multi-strategy rollback function that adapts to different failure scenarios.

  • Use cancel_update_stack for active updates that can be safely cancelled.
  • Implement force rollback using previous known-good template versions for severe failures.
  • Maintain rollback history and templates in S3 for point-in-time recovery options.
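
The second strategy above depends on a monitor_existing_rollback helper that is not shown. A minimal sketch follows, assuming it simply polls the stack status until the rollback reaches a terminal state. The cfn, delay, max_attempts, and sleep parameters and the shape of the returned dictionary are assumptions.

```python
import time


def monitor_existing_rollback(stack_name, cfn, delay=30, max_attempts=120,
                              sleep=time.sleep):
    """Follow a rollback that is already running until it finishes."""
    for _ in range(max_attempts):
        stack = cfn.describe_stacks(StackName=stack_name)['Stacks'][0]
        status = stack['StackStatus']

        if 'ROLLBACK' not in status:
            # Nothing to monitor: the stack is not rolling back
            return {'rollback_method': 'existing_rollback',
                    'status': 'NOT_ROLLING_BACK', 'final_status': status}
        if status.endswith('ROLLBACK_COMPLETE'):
            return {'rollback_method': 'existing_rollback',
                    'status': 'SUCCESS', 'final_status': status}
        if status.endswith('ROLLBACK_FAILED'):
            return {'rollback_method': 'existing_rollback',
                    'status': 'FAILED', 'final_status': status}

        sleep(delay)  # Still rolling back or cleaning up

    return {'rollback_method': 'existing_rollback', 'status': 'TIMED_OUT'}
```

A FAILED result is the cue to move on to the recovery steps for rollbacks that cannot complete on their own.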

Handle rollback failures and recovery

Prepare for scenarios where automatic rollbacks fail and manual intervention is required.

  • def handle_rollback_failure(stack_name, failed_resources):
        """Handle cases where rollback itself fails"""
        
        recovery_options = []
        
        for resource in failed_resources:
            resource_type = resource['ResourceType']
            logical_id = resource['LogicalResourceId']
            
            if resource_type == 'AWS::EC2::Instance':
                # Option 1: Skip failed EC2 instances
                recovery_options.append({
                    'action': 'continue_rollback_with_skip',
                    'resources_to_skip': [logical_id],
                    'risk': 'Instance will remain in failed state'
                })
                
            elif resource_type in ['AWS::RDS::DBInstance', 'AWS::S3::Bucket']:
                # Option 2: Retain data resources
                recovery_options.append({
                    'action': 'retain_resource',
                    'resource': logical_id,
                    'risk': 'Resource will be orphaned from stack'
                })
        
        # Continue rollback with skipped resources
        cfn.continue_update_rollback(
            StackName=stack_name,
            ResourcesToSkip=[opt['resources_to_skip'][0] for opt in recovery_options 
                            if opt['action'] == 'continue_rollback_with_skip']
        )
        
        return recovery_options

    Function to handle rollback failures by skipping problematic resources or retaining data resources.

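  • Use continue_update_rollback with ResourcesToSkip for resources that can't be rolled back. The call is only accepted when the stack is in UPDATE_ROLLBACK_FAILED, and skipped resources are marked UPDATE_COMPLETE even though their real state may differ from the template.
  • Set DeletionPolicy: Retain (and UpdateReplacePolicy: Retain) on data resources that shouldn't be deleted during rollback.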
  • Document manual cleanup procedures for orphaned resources after failed rollbacks.
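
The failed_resources argument of handle_rollback_failure has to come from somewhere. One plausible source is the stack event history, as in the sketch below. find_failed_rollback_resources is a hypothetical helper, and it assumes events are passed newest first, as describe_stack_events returns them.

```python
def find_failed_rollback_resources(events):
    """Return resources whose most recent event is UPDATE_FAILED.

    `events` is the 'StackEvents' list from describe_stack_events (newest first).
    """
    latest = {}
    for event in events:
        # Keep only the newest event per logical resource
        latest.setdefault(event['LogicalResourceId'], event)

    return [
        {
            'LogicalResourceId': event['LogicalResourceId'],
            'ResourceType': event['ResourceType'],
            'Reason': event.get('ResourceStatusReason', ''),
        }
        for event in latest.values()
        if event['ResourceStatus'] == 'UPDATE_FAILED'
    ]
```

Keeping the failure reason alongside each resource makes it easier to decide whether skipping is acceptable or whether the underlying problem should be fixed first.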

Delete change sets and cleanup

Manage the change set lifecycle and clean up unused change sets to avoid clutter.

  • def cleanup_change_sets(stack_name, keep_recent=5):
        """Clean up old change sets, keeping only recent ones"""
        
        # List all change sets for the stack
        response = cfn.list_change_sets(StackName=stack_name)
        change_sets = sorted(
            response['Summaries'], 
            key=lambda x: x['CreationTime'], 
            reverse=True
        )
        
        # Keep only the most recent change sets
        to_delete = change_sets[keep_recent:]
        
        for changeset in to_delete:
            try:
                cfn.delete_change_set(
                    StackName=stack_name,
                    ChangeSetName=changeset['ChangeSetName']
                )
                print(f"Deleted change set: {changeset['ChangeSetName']}")
            except cfn.exceptions.ClientError as e:
                print(f"Failed to delete {changeset['ChangeSetName']}: {e}")
        
        return len(to_delete)

    Function to clean up old change sets while preserving recent ones for audit trails.

  • Delete unused change sets regularly to avoid hitting AWS limits (100 change sets per stack).
  • Preserve recent change sets for audit trails and rollback reference.
  • Export change set metadata to external systems before deletion for long-term audit requirements.
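
Exporting metadata before deletion can be as simple as serialising the summaries returned by list_change_sets. The sketch below writes them as JSON Lines. The function name, the chosen fields, and the output format are assumptions, and where the result is stored (S3, a log pipeline, an audit database) is left open.

```python
import json
from datetime import datetime

# Fields copied from each change set summary (an assumed, minimal selection)
EXPORT_FIELDS = ('ChangeSetName', 'ChangeSetId', 'StackName', 'Status',
                 'ExecutionStatus', 'CreationTime', 'Description')


def _encode(value):
    """Serialise datetimes, which boto3 returns for CreationTime."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialise {type(value).__name__}")


def export_change_set_metadata(summaries):
    """Return one JSON document per line for the given change set summaries."""
    lines = []
    for summary in summaries:
        record = {key: summary[key] for key in EXPORT_FIELDS if key in summary}
        lines.append(json.dumps(record, default=_encode, sort_keys=True))
    return '\n'.join(lines)
```

Calling this on the to_delete list just before the deletion loop keeps the audit record and the cleanup in step with each other.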

Implement change set approval workflows

Create approval processes that require stakeholder sign-off before executing infrastructure changes.

  • {
      "Comment": "Change set approval workflow",
      "StartAt": "CreateChangeSet",
      "States": {
        "CreateChangeSet": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:region:account:function:create-changeset",
          "Next": "AnalyzeRisk"
        },
        "AnalyzeRisk": {
          "Type": "Task", 
          "Resource": "arn:aws:lambda:region:account:function:analyze-changeset",
          "Next": "RequiresApproval"
        },
        "RequiresApproval": {
          "Type": "Choice",
          "Choices": [
            {
              "Variable": "$.risk_level",
              "StringEquals": "HIGH",
              "Next": "RequestApproval"
            },
            {
              "Variable": "$.risk_level", 
              "StringEquals": "LOW",
              "Next": "AutoExecute"
            }
          ],
          "Default": "RequestApproval"
        },
        "RequestApproval": {
          "Type": "Task",
          "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
          "Parameters": {
            "TopicArn": "arn:aws:sns:region:account:approval-topic",
            "Message": {
              "approval_request.$": "$.approval_request",
              "TaskToken.$": "$$.Task.Token"
            }
          },
          "ResultPath": "$.approval",
          "Next": "ExecuteChangeSet"
        },
        "AutoExecute": {
          "Type": "Pass",
          "Result": {"approved": true},
          "ResultPath": "$.approval",
          "Next": "ExecuteChangeSet"
        },
        "ExecuteChangeSet": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:region:account:function:execute-changeset",
          "End": true
        }
      }
    }

    Step Functions workflow that implements risk-based approval processes for change set execution.

  • Implement risk-based approval routing where high-risk changes require manual approval.
  • Use SNS with task tokens to pause workflow execution until approval is received.
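  • Auto-approve only low-risk changes, such as tag modifications or updates that neither replace nor delete resources. Parameter updates are not automatically low risk, because some of them trigger a replacement.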
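
The workflow stays paused at RequestApproval until something returns the task token to Step Functions. A minimal sketch of that callback is shown below, assuming the approver's decision reaches a small handler (for example a Lambda function behind an approval link). record_approval_decision, the approver argument, and the ChangeSetRejected error name are hypothetical; send_task_success and send_task_failure are the Step Functions client calls that resume or fail the waiting task.

```python
import json


def record_approval_decision(sfn, task_token, approved, approver):
    """Resume or fail the paused workflow based on the approver's decision.

    `sfn` is a Step Functions client, e.g. boto3.client('stepfunctions').
    """
    if approved:
        # The output becomes the result of the RequestApproval state
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({'approved': True, 'approver': approver}),
        )
        return 'APPROVED'

    sfn.send_task_failure(
        taskToken=task_token,
        error='ChangeSetRejected',
        cause=f'Change set rejected by {approver}',
    )
    return 'REJECTED'
```

Failing the task on rejection stops the execution before ExecuteChangeSet, so a rejected change set is never applied by this workflow.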

Monitor and audit change set operations

Implement comprehensive logging and monitoring for all change set operations and approvals.

  • CloudTrail records change set API calls such as CreateChangeSet, ExecuteChangeSet, and DeleteChangeSet; make sure a trail delivers these events to durable storage for audit compliance.
  • Create CloudWatch dashboards showing change set success rates, execution times, and rollback frequency.
  • Set up alerts for failed change sets, stuck rollbacks, and approval timeout scenarios.
  • Export change set metadata and approval decisions to external audit systems for compliance reporting.
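
Dashboards for success rate and rollback frequency need data points to plot. One option is to publish custom metrics after each deployment, as in the sketch below. The metric names, the outcome records with a status field, and the StackName dimension are all assumptions; the returned list has the shape expected by the MetricData argument of CloudWatch's put_metric_data.

```python
def build_change_set_metrics(outcomes, stack_name):
    """Build CloudWatch MetricData entries from change set execution outcomes.

    `outcomes` is a list of dicts with a 'status' key such as 'SUCCESS' or
    'ROLLED_BACK' (an assumed record format).
    """
    total = len(outcomes)
    succeeded = sum(1 for outcome in outcomes if outcome['status'] == 'SUCCESS')
    rolled_back = sum(1 for outcome in outcomes if outcome['status'] == 'ROLLED_BACK')
    dimensions = [{'Name': 'StackName', 'Value': stack_name}]

    metrics = [
        {'MetricName': 'ChangeSetExecutions', 'Value': total,
         'Unit': 'Count', 'Dimensions': dimensions},
        {'MetricName': 'ChangeSetRollbacks', 'Value': rolled_back,
         'Unit': 'Count', 'Dimensions': dimensions},
    ]
    if total:
        # Skip the rate when there were no executions to avoid dividing by zero
        metrics.append({'MetricName': 'ChangeSetSuccessRate',
                        'Value': 100.0 * succeeded / total,
                        'Unit': 'Percent', 'Dimensions': dimensions})
    return metrics


# Publishing (requires AWS credentials; the namespace is an assumption):
# boto3.client('cloudwatch').put_metric_data(
#     Namespace='Custom/CloudFormation', MetricData=metrics)
```

Alarms on ChangeSetRollbacks or on a falling ChangeSetSuccessRate can then cover the failed and stuck deployment scenarios listed above.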
