---
name: terraform-iac-data-engineering
description: Infrastructure-as-Code with Terraform for data engineering on AWS (S3, EC2, IAM)
triggers:
  - "set up terraform for data engineering"
  - "create AWS infrastructure with terraform"
  - "provision S3 and EC2 with IaC"
  - "terraform data engineering setup"
  - "manage AWS resources with terraform"
  - "infrastructure as code for data pipelines"
  - "deploy data infrastructure on AWS"
  - "terraform state management for data engineering"
---

# Terraform IaC for Data Engineering

> Skill by [ara.so](https://ara.so) — Data Skills collection.

This project provides Infrastructure-as-Code (IaC) patterns using Terraform specifically for data engineering workloads on AWS. It demonstrates how to provision and manage AWS resources (S3, EC2, IAM) needed for data pipelines and processing.

## What This Project Does

- Provisions AWS S3 buckets for data storage
- Creates EC2 instances for data processing workloads
- Manages IAM users, roles, and policies
- Demonstrates Terraform state management
- Provides reusable IaC patterns for data engineering infrastructure

## Installation

### Prerequisites

1. **Terraform CLI**
   ```bash
   # macOS
   brew install terraform
   
   # Linux
   wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
   unzip terraform_1.5.0_linux_amd64.zip
   sudo mv terraform /usr/local/bin/
   ```

2. **AWS CLI**
   ```bash
   # macOS
   brew install awscli
   
   # Linux
   curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
   unzip awscliv2.zip
   sudo ./aws/install
   ```

3. **Configure AWS CLI**
   ```bash
   aws configure
   # Enter your AWS Access Key ID
   # Enter your AWS Secret Access Key
   # Default region: us-east-1
   # Default output format: json
   ```

### Project Setup

```bash
git clone https://github.com/josephmachado/iac-for-data-engineering-terraform-.git
cd iac-for-data-engineering-terraform-
```

## Key Terraform Commands

### Initialize Terraform

```bash
# Initialize terraform (downloads providers, sets up backend)
terraform -chdir=terraform init

# Validate configuration files
terraform -chdir=terraform validate

# Format configuration files
terraform -chdir=terraform fmt
```

### Plan and Apply Infrastructure

```bash
# Preview changes before applying
terraform -chdir=terraform plan

# Apply infrastructure changes
terraform -chdir=terraform apply

# Auto-approve without confirmation (use with caution)
terraform -chdir=terraform apply -auto-approve
```

### Inspect Infrastructure

```bash
# List all resources in state
terraform -chdir=terraform state list

# Show details of a specific resource
terraform -chdir=terraform state show aws_s3_bucket.data_lake

# Show all output values
terraform -chdir=terraform output

# Show current state in JSON
terraform -chdir=terraform show -json
```

### Destroy Infrastructure

```bash
# Destroy all managed infrastructure
terraform -chdir=terraform destroy

# Destroy specific resource
terraform -chdir=terraform destroy -target=aws_instance.data_processor
```
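
Because `terraform destroy` removes every resource in state, buckets that hold data may warrant a guard. The following is a minimal sketch, assuming the `aws_s3_bucket.data_lake` resource defined in main.tf under Configuration Structure; `prevent_destroy` is a standard Terraform lifecycle argument, and whether it suits a given bucket is a judgment call.

```hcl
# Sketch: the data_lake bucket definition extended with a destroy guard.
resource "aws_s3_bucket" "data_lake" {
  bucket = "my-unique-data-lake-bucket-${var.environment}"

  lifecycle {
    # Terraform rejects any plan that would destroy this bucket,
    # including a full `terraform destroy`.
    prevent_destroy = true
  }
}
```

With the guard in place, tearing down the bucket requires removing the `lifecycle` block first, which makes accidental deletion of stored data less likely.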

## Configuration Structure

### Basic Terraform Configuration for Data Engineering

**main.tf** - Core infrastructure definition:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# S3 bucket for data storage
resource "aws_s3_bucket" "data_lake" {
  bucket = "my-unique-data-lake-bucket-${var.environment}"
  
  tags = {
    Name        = "Data Lake Bucket"
    Environment = var.environment
    Project     = "DataEngineering"
  }
}

# Enable versioning for data protection
resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Block public access
resource "aws_s3_bucket_public_access_block" "data_lake_public_access" {
  bucket = aws_s3_bucket.data_lake.id
  
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# EC2 instance for data processing
resource "aws_instance" "data_processor" {
  ami                  = var.ec2_ami
  instance_type        = var.ec2_instance_type
  iam_instance_profile = aws_iam_instance_profile.ec2_profile.name
  
  tags = {
    Name        = "DataProcessor"
    Environment = var.environment
  }
  
  user_data = <<-EOF
              #!/bin/bash
              sudo yum update -y
              sudo yum install -y python3 python3-pip
              pip3 install pandas boto3
              EOF
}

# IAM role for EC2 to access S3
resource "aws_iam_role" "ec2_s3_access_role" {
  name = "ec2-s3-access-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

# IAM policy for S3 access
resource "aws_iam_role_policy" "ec2_s3_policy" {
  name = "ec2-s3-policy"
  role = aws_iam_role.ec2_s3_access_role.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          aws_s3_bucket.data_lake.arn,
          "${aws_s3_bucket.data_lake.arn}/*"
        ]
      }
    ]
  })
}

# Instance profile wrapping the IAM role; EC2 instances receive roles through a profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-s3-profile"
  role = aws_iam_role.ec2_s3_access_role.name
}
```

**variables.tf** - Input variables:

```hcl
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string
  default     = "dev"
}

variable "ec2_ami" {
  description = "AMI ID for EC2 instance"
  type        = string
  default     = "ami-0c55b159cbfafe1f0" # Amazon Linux 2 example; AMI IDs are region-specific, so replace with one valid for aws_region
}

variable "ec2_instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t2.micro"
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
  default     = "data-eng"
}
```
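
The `environment` description lists three allowed values, but nothing enforces them. As a sketch, a `validation` block (available in Terraform 0.13 and later) could reject anything else at plan time; the error message wording here is illustrative.

```hcl
variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string
  default     = "dev"

  validation {
    # Fail early if the value is not one of the documented environments.
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}
```

Under this rule a value such as `production` would be rejected, so variable files need to use the short names.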

**outputs.tf** - Output values:

```hcl
output "s3_bucket_name" {
  description = "Name of the S3 data lake bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_lake.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 data processor"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of EC2 instance"
  value       = aws_instance.data_processor.public_ip
}
```

**terraform.tfvars** - Variable values (gitignore this file):

```hcl
aws_region        = "us-west-2"
environment       = "prod"
ec2_instance_type = "t3.medium"
bucket_prefix     = "my-company-data"
```
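
Overriding `aws_region` as above means the default `ec2_ami` ID may no longer exist, since AMI IDs differ per region. One option, sketched here, is to look the AMI up with the `aws_ami` data source instead of hard-coding it. The name filter `amzn2-ami-hvm-*-x86_64-gp2` is an assumption about the Amazon Linux 2 naming pattern and should be checked against the AMIs available in your account.

```hcl
# Look up the most recent Amazon Linux 2 AMI in the provider's region.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"] # assumed naming pattern
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# The instance then references the lookup instead of var.ec2_ami.
resource "aws_instance" "data_processor" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = var.ec2_instance_type
}
```

The trade-off is that `most_recent = true` can select a newer image on a later plan, which would replace the instance; pinning an ID per region avoids that at the cost of manual updates.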

## Common Data Engineering Patterns

### Multi-Environment Setup

**environments/dev/main.tf**:

```hcl
module "data_infrastructure" {
  source = "../../modules/data-infra"
  
  environment       = "dev"
  instance_type     = "t2.micro"
  enable_monitoring = false
}
```

**environments/prod/main.tf**:

```hcl
module "data_infrastructure" {
  source = "../../modules/data-infra"
  
  environment       = "prod"
  instance_type     = "t3.xlarge"
  enable_monitoring = true
  backup_enabled    = true
}
```
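
The `modules/data-infra` module itself is not shown in this document. As a hypothetical sketch, its `variables.tf` might declare the inputs used above, with defaults so that the dev configuration can omit `backup_enabled`. The defaults are assumptions.

```hcl
# modules/data-infra/variables.tf (hypothetical; not taken from the repository)

variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string
}

variable "instance_type" {
  description = "EC2 instance type for the data processor"
  type        = string
  default     = "t2.micro"
}

variable "enable_monitoring" {
  description = "Whether to enable detailed monitoring on the instance"
  type        = bool
  default     = false
}

variable "backup_enabled" {
  description = "Whether to create backup resources"
  type        = bool
  default     = false
}
```

Leaving `environment` without a default forces every caller to state which environment it targets.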

### S3 Bucket with Lifecycle Policies

```hcl
resource "aws_s3_bucket" "data_archive" {
  bucket = "data-archive-${var.environment}"
}

resource "aws_s3_bucket_lifecycle_configuration" "data_archive_lifecycle" {
  bucket = aws_s3_bucket.data_archive.id
  
  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }

  rule {
    id     = "delete-incomplete-uploads"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```
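
On a versioned bucket such as `data_lake`, overwritten and deleted objects persist as noncurrent versions and continue to incur storage cost. A possible companion rule, sketched below, expires those versions; the 90-day retention is an assumed value, not one from the source.

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  # Make sure versioning is configured before the noncurrent-version rule.
  depends_on = [aws_s3_bucket_versioning.data_lake_versioning]

  rule {
    id     = "expire-old-versions"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    noncurrent_version_expiration {
      noncurrent_days = 90 # assumed retention period
    }
  }
}
```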

### VPC Setup for Data Processing

```hcl
resource "aws_vpc" "data_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name = "data-engineering-vpc"
  }
}

resource "aws_subnet" "private_subnet" {
  vpc_id            = aws_vpc.data_vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "${var.aws_region}a"
  
  tags = {
    Name = "private-data-subnet"
  }
}

resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing instances"
  vpc_id      = aws_vpc.data_vpc.id
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }
}
```
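
The subnet above has no route to the internet, so instances in it cannot reach S3 by default. One way to close that gap, sketched here under the assumption that the resources above exist as written, is an S3 gateway endpoint attached to the subnet's route table. The `private_rt`, `private_assoc`, and `s3` resource names are illustrative.

```hcl
# Route table for the private subnet (no internet route)
resource "aws_route_table" "private_rt" {
  vpc_id = aws_vpc.data_vpc.id

  tags = {
    Name = "private-data-rt"
  }
}

resource "aws_route_table_association" "private_assoc" {
  subnet_id      = aws_subnet.private_subnet.id
  route_table_id = aws_route_table.private_rt.id
}

# Gateway endpoint: adds an S3 route to the table so traffic stays on the AWS network
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.data_vpc.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private_rt.id]
}
```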

### Remote State Configuration

**backend.tf**:

```hcl
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket-unique-name"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

Create state backend resources:

```hcl
resource "aws_s3_bucket" "terraform_state" {
  bucket = "terraform-state-bucket-unique-name"
}

resource "aws_s3_bucket_versioning" "terraform_state_versioning" {
  bucket = aws_s3_bucket.terraform_state.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  
  attribute {
    name = "LockID"
    type = "S"
  }
}
```
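
State files can contain sensitive values, so the state bucket is often hardened beyond versioning. A minimal sketch, assuming the `aws_s3_bucket.terraform_state` resource above; the choice of `AES256` (SSE-S3) rather than a KMS key is an assumption.

```hcl
# Encrypt state objects at rest by default
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state_encryption" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block every form of public access to the state bucket
resource "aws_s3_bucket_public_access_block" "terraform_state_public_access" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

These backend resources are typically created from a separate configuration, or applied with local state first, because the `backend "s3"` block cannot point at a bucket that does not exist yet.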

## Verification and Testing

### Verify S3 Bucket Creation

```bash
# List all S3 buckets
aws s3 ls

# Check specific bucket
aws s3 ls s3://my-unique-data-lake-bucket-dev/

# Upload test file
echo "test data" > test.txt
aws s3 cp test.txt s3://my-unique-data-lake-bucket-dev/
```

### Verify EC2 Instances

```bash
# List running instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress}' \
  --output table

# Get specific instance details (instance ID read from the Terraform output)
aws ec2 describe-instances \
  --instance-ids "$(terraform -chdir=terraform output -raw ec2_instance_id)"
```

### Verify IAM Roles

```bash
# List IAM roles
aws iam list-roles --query 'Roles[?contains(RoleName, `ec2-s3-access`)].RoleName'

# Get role policy
aws iam get-role-policy \
  --role-name ec2-s3-access-role \
  --policy-name ec2-s3-policy
```

## State Management

### Inspect State

```bash
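# List resource types and names from a local state file
# (with a remote backend, pipe `terraform -chdir=terraform state pull` into jq instead)
jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate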

# List resources in state
terraform -chdir=terraform state list

# Show resource details
terraform -chdir=terraform state show aws_s3_bucket.data_lake
```

### Import Existing Resources

```bash
# Import existing S3 bucket
terraform -chdir=terraform import aws_s3_bucket.data_lake my-existing-bucket

# Import existing EC2 instance
terraform -chdir=terraform import aws_instance.data_processor i-1234567890abcdef0
```
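
From Terraform 1.5 onward, the same import can also be declared in configuration with an `import` block, so it shows up in `terraform plan` and goes through review like any other change. A sketch, reusing the bucket name from the command above and assuming a matching `aws_s3_bucket.data_lake` resource block exists:

```hcl
# Declarative equivalent of `terraform import aws_s3_bucket.data_lake my-existing-bucket`
import {
  to = aws_s3_bucket.data_lake
  id = "my-existing-bucket"
}
```

After a successful apply, the `import` block can be removed or kept as a record.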

### Move Resources in State

```bash
# Rename resource in state
terraform -chdir=terraform state mv aws_s3_bucket.old_name aws_s3_bucket.new_name
```
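
From Terraform 1.1 onward, a rename can also be recorded in configuration with a `moved` block, which updates state on the next apply for everyone using the configuration. A sketch using the same placeholder names as the command above:

```hcl
# Declarative equivalent of `terraform state mv`
moved {
  from = aws_s3_bucket.old_name
  to   = aws_s3_bucket.new_name
}
```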

## Troubleshooting

### Common Issues

**Issue: Bucket name already exists**
```hcl
# Solution: Use unique bucket name with random suffix
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "data-lake-${var.environment}-${random_id.bucket_suffix.hex}"
}
```
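
The `random_id` resource comes from the `hashicorp/random` provider rather than the AWS provider. Terraform can install it implicitly on `terraform init`, but declaring it keeps the version pinned. A sketch; the `~> 3.0` constraint is an assumption.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0" # assumed version constraint
    }
  }
}
```

Re-run `terraform -chdir=terraform init` after adding the provider.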

**Issue: AWS credentials not found**
```bash
# Check AWS configuration
aws configure list

# Or supply credentials through environment variables
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"
```

**Issue: State file locked**
```bash
# Force unlock (use with caution)
terraform -chdir=terraform force-unlock <LOCK_ID>
```

**Issue: Resource already exists**
```bash
# Import existing resource
terraform -chdir=terraform import <resource_type>.<resource_name> <resource_id>

# Or stop tracking a resource in state (the resource itself is not deleted from AWS)
terraform -chdir=terraform state rm <resource_type>.<resource_name>
```

**Issue: Terraform version mismatch**
```hcl
# Specify required version in terraform block
terraform {
  required_version = ">= 1.5.0"
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```

### Debugging

```bash
# Enable debug logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Log to file
export TF_LOG_PATH=terraform-debug.log
terraform -chdir=terraform apply

# Disable logging
unset TF_LOG
unset TF_LOG_PATH
```

### Validate and Format

```bash
# Validate configuration
terraform -chdir=terraform validate

# Format all files
terraform -chdir=terraform fmt -recursive

# Check formatting without making changes
terraform -chdir=terraform fmt -check
```

## Best Practices

1. **Always use variables** for environment-specific values
2. **Enable S3 versioning** for state files and data buckets
3. **Use remote state** for team collaboration
4. **Tag all resources** with environment, project, and owner
5. **Implement lifecycle policies** for cost optimization
6. **Use modules** for reusable infrastructure patterns
7. **Store secrets in AWS Secrets Manager**, reference via data sources
8. **Run `terraform plan`** before apply
9. **Use workspaces or separate environment directories** for multiple environments
10. **Document your infrastructure** with comments and README files
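
Practices 4 and 7 can be sketched in configuration as follows. The provider's `default_tags` block applies tags to every taggable AWS resource, and a Secrets Manager data source reads a secret at plan time. The owner value and the secret name `data-eng/db-credentials` are hypothetical, and the secret is assumed to hold a JSON document.

```hcl
provider "aws" {
  region = var.aws_region

  # Tags applied to every taggable resource created through this provider
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "DataEngineering"
      Owner       = "data-platform-team" # hypothetical owner
    }
  }
}

# Read an existing secret rather than storing the value in .tf or .tfvars files
data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = "data-eng/db-credentials" # hypothetical secret name
}

locals {
  # Assumes the secret string is JSON
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}
```

Values read through data sources are still written to the state file, which is one more reason to encrypt and restrict access to remote state.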
