iac-terraform-data-engineering

Infrastructure-as-Code fundamentals for data engineers using Terraform to provision AWS resources (S3, EC2, IAM)

---
name: iac-terraform-data-engineering
description: Infrastructure-as-Code fundamentals for data engineers using Terraform to provision AWS resources (S3, EC2, IAM)
triggers:
  - "set up terraform for data engineering"
  - "create AWS infrastructure with terraform"
  - "provision S3 and EC2 using IaC"
  - "terraform for data pipelines"
  - "manage data infrastructure as code"
  - "deploy data engineering resources on AWS"
  - "terraform state management for data platforms"
  - "destroy terraform data infrastructure"
---

# IaC for Data Engineering with Terraform

> Skill by [ara.so](https://ara.so) — Data Skills collection.

This project provides Infrastructure-as-Code (IaC) templates and patterns for data engineers using Terraform to provision and manage AWS resources. It focuses on creating reproducible, version-controlled infrastructure for data platforms including S3 storage, EC2 compute instances, and IAM permissions.

## What This Project Does

- Provides Terraform configurations for common data engineering infrastructure on AWS
- Demonstrates IaC best practices for S3 buckets, EC2 instances, and IAM roles
- Shows state management and lifecycle operations for data infrastructure
- Teaches reproducible infrastructure provisioning for data pipelines

## Prerequisites

Before using this project, ensure you have:

1. **AWS Account** with administrator access (use an IAM user or role for this rather than working as the root user)
2. **Terraform CLI** installed ([installation guide](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli))
3. **AWS CLI** installed and configured ([setup guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
4. **AWS Credentials** configured via `aws configure`

## AWS IAM Setup

Create an IAM user with appropriate permissions:

1. **Create IAM User**: Navigate to AWS Console → IAM → Users → Create user
2. **Create Inline Policy**: Attach a custom policy to the user
3. **Grant Permissions**: For development/learning, grant full access to:
   - Amazon S3
   - Amazon EC2
   - AWS IAM

**⚠️ Security Note**: Full service access is NOT recommended for production. Use least-privilege policies in production environments.

## Project Structure

```
terraform/
├── main.tf           # Main Terraform configuration
├── variables.tf      # Input variables (if present)
├── outputs.tf        # Output values (if present)
└── terraform.tfstate # State file (generated)
```

## Key Terraform Commands

### Initialize Terraform

Initialize the working directory and download provider plugins:

```bash
terraform -chdir=terraform init
```

### Validate Configuration

Check that the configuration is syntactically valid and internally consistent (this does not contact AWS):

```bash
terraform -chdir=terraform validate
```

### Format Code

Automatically format Terraform files to canonical style:

```bash
terraform -chdir=terraform fmt
```

### Plan Infrastructure Changes

Preview what Terraform will create/modify/destroy:

```bash
terraform -chdir=terraform plan
```

### Apply Configuration

Create or update infrastructure:

```bash
terraform -chdir=terraform apply
```

Terraform will show a plan and ask for confirmation. Type `yes` to proceed.

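### Auto-approve (for automation)

Skip the interactive confirmation. Use this only in automated pipelines where the plan has already been reviewed: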

```bash
terraform -chdir=terraform apply -auto-approve
```

### Destroy Infrastructure

Remove all resources managed by Terraform:

```bash
terraform -chdir=terraform destroy
```
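One caveat worth knowing: `terraform destroy` fails on an S3 bucket that still contains objects unless the bucket sets `force_destroy = true`. A minimal sketch, assuming the `data_bucket` resource from the `main.tf` example in the Configuration section and a throwaway dev environment:

```hcl
# Dev/learning only: lets `terraform destroy` delete the bucket even when it
# still holds objects. Those objects are removed permanently.
resource "aws_s3_bucket" "data_bucket" {
  bucket        = "my-unique-data-engineering-bucket-12345"
  force_destroy = true
}
```

For buckets that hold real data, leave `force_destroy` at its default (`false`). Adding `lifecycle { prevent_destroy = true }` to the resource makes Terraform reject any plan that would delete it.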

## Configuration

### Basic Terraform Configuration Example

Before applying, modify `terraform/main.tf` to customize resource names:

```hcl
# terraform/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# S3 bucket for data storage
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-data-engineering-bucket-12345"
  
  tags = {
    Name        = "Data Engineering Bucket"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# EC2 instance for data processing
resource "aws_instance" "data_processor" {
  ami           = "ami-0c55b159cbfafe1f0" # Placeholder: AMI IDs are region-specific; use a current Amazon Linux 2 AMI for your region
  instance_type = "t2.micro"
  
  tags = {
    Name        = "Data Processor"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# IAM role for EC2 instance
resource "aws_iam_role" "ec2_s3_role" {
  name = "ec2-s3-access-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}
```
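The role above only defines who may assume it; it grants no permissions yet. The following is a minimal sketch of a scoped S3 policy, assuming the `ec2_s3_role` and `data_bucket` resources from the example. The policy name and the chosen actions are illustrative assumptions, not part of the original project:

```hcl
# Inline policy granting the role read/write access to a single bucket.
# The policy name and action list are assumptions; trim them to what your
# pipeline actually needs.
resource "aws_iam_role_policy" "ec2_s3_access" {
  name = "ec2-s3-data-bucket-access"
  role = aws_iam_role.ec2_s3_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "ListBucket"
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = aws_s3_bucket.data_bucket.arn
      },
      {
        Sid      = "ReadWriteObjects"
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.data_bucket.arn}/*"
      }
    ]
  })
}
```

`s3:ListBucket` applies to the bucket ARN, while object actions apply to `<bucket-arn>/*`, which is why the two statements use different `Resource` values. The instance still needs an instance profile that references the role before it can use these permissions (see Pattern 2).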

### Variables Configuration

Create `terraform/variables.tf` for reusable configurations:

```hcl
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "bucket_name" {
  description = "S3 bucket name for data storage"
  type        = string
  # Set via terraform.tfvars or -var flag
}
```

Use variables in `main.tf`:

```hcl
provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_bucket" {
  bucket = var.bucket_name
  
  tags = {
    Environment = var.environment
  }
}
```

Create `terraform/terraform.tfvars`:

```hcl
bucket_name = "my-unique-bucket-name-2026"
aws_region  = "us-west-2"
environment = "production"
```
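Because `terraform.tfvars` accepts any string for `environment`, a typo can go unnoticed until resources are misnamed. A sketch of a `validation` block that would replace the plain `environment` declaration in `variables.tf`; the list of allowed values is an assumption:

```hcl
# Replaces the plain "environment" declaration in variables.tf.
# The allowed values are an assumption; adjust them to your own environments.
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be one of: dev, staging, production."
  }
}
```

With this in place, `terraform plan` stops with the error message when the value falls outside the list.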

## State Management

### Inspect State

List all resources in the state:

```bash
terraform -chdir=terraform state list
```

Summarize resource types and names from the local state file (requires `jq`; applies only when state is stored locally):

```bash
jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate
```

### Remote State (Production Pattern)

For production, store state remotely in S3:

```hcl
# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

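Initialize the backend. Values passed with `-backend-config` override the matching settings in `backend.tf`, in this case the bucket name: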

```bash
terraform -chdir=terraform init -backend-config="bucket=${TERRAFORM_STATE_BUCKET}"
```
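The backend does not create its own storage: the state bucket and the lock table must exist before `terraform init` runs. A minimal bootstrap sketch, assuming the names used in `backend.tf` above and a separate configuration that keeps its own state locally:

```hcl
# Bootstrap resources for the S3 backend. Apply these from a separate
# configuration (with local state) before pointing the backend at them.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket" # must match "bucket" in backend.tf
}

# Versioning keeps earlier state revisions recoverable.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: the S3 backend expects a string partition key named "LockID".
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock" # must match "dynamodb_table" in backend.tf
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping the bootstrap in its own configuration avoids a circular dependency, since a configuration cannot store its state in a bucket that it has not yet created.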

## Verification Commands

### Verify S3 Bucket Creation

```bash
aws s3 ls
```

### Verify EC2 Instance

```bash
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`]|[0].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table
```

### Check Specific Resource

```bash
terraform -chdir=terraform state show aws_s3_bucket.data_bucket
```

## Common Patterns for Data Engineering

### Pattern 1: Data Lake with Multiple Buckets

```hcl
# Raw data bucket
resource "aws_s3_bucket" "raw_data" {
  bucket = "my-data-lake-raw-${var.environment}"
}

# Processed data bucket
resource "aws_s3_bucket" "processed_data" {
  bucket = "my-data-lake-processed-${var.environment}"
}

# Enable versioning for data lineage
resource "aws_s3_bucket_versioning" "raw_data_versioning" {
  bucket = aws_s3_bucket.raw_data.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Lifecycle rules for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "raw_data_lifecycle" {
  bucket = aws_s3_bucket.raw_data.id
  
  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
```
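Data lake buckets usually also need to be private and encrypted at rest. The sketch below adds both to the `raw_data` bucket from the pattern above. Choosing SSE-S3 (`AES256`) over a KMS key is an assumption made to keep the example short:

```hcl
# Block every form of public access to the raw data bucket.
resource "aws_s3_bucket_public_access_block" "raw_data" {
  bucket = aws_s3_bucket.raw_data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypt new objects by default with S3-managed keys (SSE-S3).
resource "aws_s3_bucket_server_side_encryption_configuration" "raw_data" {
  bucket = aws_s3_bucket.raw_data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

The same two resources can be repeated for `processed_data` by changing the bucket reference.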

### Pattern 2: EC2 with Data Processing Tools

```hcl
# Security group for data processor
resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing instances"
  
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Restrict in production
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 instance with user data for setup
resource "aws_instance" "data_processor" {
  ami           = var.ami_id # Declare "ami_id" in variables.tf
  instance_type = "t3.medium"
  
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.ec2_profile.name
  
  user_data = <<-EOF
              #!/bin/bash
              yum update -y
              yum install -y python3 python3-pip
              pip3 install pandas boto3
              EOF
  
  tags = {
    Name = "Data Processor Instance"
  }
}

# IAM instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-data-processor-profile"
  role = aws_iam_role.ec2_s3_role.name
}
```
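The pattern above references `var.ami_id`, and AMI IDs differ per region. One way to avoid hard-coding them is to look up an image with the `aws_ami` data source. In this sketch the `null` default, the name filter, and the `data_processor_ami` local are illustrative assumptions:

```hcl
# Optional override; when left null, the lookup below is used instead.
variable "ami_id" {
  description = "AMI ID for the data processor; leave null to use the latest Amazon Linux 2 image"
  type        = string
  default     = null
}

# Latest Amazon-owned Amazon Linux 2 image in the provider's region.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

locals {
  # coalesce returns the first non-null argument.
  data_processor_ami = coalesce(var.ami_id, data.aws_ami.amazon_linux_2.id)
}
```

The instance would then use `ami = local.data_processor_ami`. Note that `most_recent = true` can resolve to a newer image on a later run, which causes Terraform to plan a replacement of the instance; pin `ami_id` explicitly where that is undesirable.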

### Pattern 3: Outputs for Integration

```hcl
# terraform/outputs.tf
output "s3_bucket_name" {
  description = "Name of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}
```

Access outputs:

```bash
terraform -chdir=terraform output
terraform -chdir=terraform output -json | jq -r '.s3_bucket_name.value'
```

## Troubleshooting

### Issue: "Error acquiring the state lock"

**Cause**: Another Terraform process is running or a previous run didn't release the lock.

**Solution**:
```bash
# Force unlock (use with caution)
terraform -chdir=terraform force-unlock <LOCK_ID>
```

### Issue: "bucket name already exists"

**Cause**: S3 bucket names must be globally unique across all AWS accounts.

**Solution**: Change the bucket name in `main.tf` to something unique:
```hcl
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-name-${random_id.bucket_suffix.hex}"
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}
```
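`random_id` comes from the `hashicorp/random` provider, so `terraform init` must be run again after adding it. A sketch of declaring the provider explicitly next to the AWS one; the `~> 3.0` constraint is an assumption:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    # Needed for the random_id resource; the version constraint is an assumption.
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}
```

The generated suffix is stored in state, so the bucket name stays stable across later applies. With `byte_length = 4` the hex suffix is 8 characters long.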

### Issue: "insufficient IAM permissions"

**Cause**: The IAM user doesn't have required permissions.

**Solution**: Verify IAM policy includes necessary actions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "ec2:*",
        "iam:*"
      ],
      "Resource": "*"
    }
  ]
}
```

### Issue: State file out of sync

**Cause**: Manual changes made outside Terraform.

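**Solution**: Reconcile the state with the real infrastructure (`terraform refresh` is deprecated in favor of the `-refresh-only` mode):
```bash
terraform -chdir=terraform apply -refresh-only
```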

Or import existing resources:
```bash
terraform -chdir=terraform import aws_s3_bucket.data_bucket my-existing-bucket
```
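As a declarative alternative to the CLI command above, Terraform 1.5 and later support `import` blocks, which let the import show up in `terraform plan` before anything is written to state. A minimal sketch using the same placeholder bucket name:

```hcl
# Requires Terraform 1.5 or later. "my-existing-bucket" is a placeholder.
import {
  to = aws_s3_bucket.data_bucket
  id = "my-existing-bucket"
}

# The target resource must be declared in the configuration.
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-existing-bucket"
}
```

After a successful `terraform apply`, the `import` block can be removed from the configuration.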

## Workflow Example

Complete workflow for setting up data infrastructure:

```bash
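# 1. Configure AWS credentials (skip if `aws configure` is already set up)
#    These lines export values that already exist in your shell; do not paste secrets into scripts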
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}"
export AWS_DEFAULT_REGION="us-east-1"

# 2. Customize configuration
cd terraform
# Edit main.tf to set unique bucket name

# 3. Initialize Terraform
terraform init

# 4. Validate configuration
terraform validate

# 5. Format code
terraform fmt

# 6. Preview changes
terraform plan

# 7. Apply configuration
terraform apply

# 8. Verify resources
aws s3 ls
aws ec2 describe-instances --output table

# 9. When done, clean up
terraform destroy
```

## Best Practices for Data Engineering IaC

1. **Use variables** for environment-specific values
2. **Enable S3 versioning** for data lineage and recovery
3. **Tag all resources** for cost tracking and management
4. **Store state remotely** in S3 with encryption and locking
5. **Use modules** to organize reusable infrastructure components
6. **Never commit** `.tfstate` files or AWS credentials to version control
7. **Implement lifecycle rules** on S3 for cost optimization
8. **Use IAM roles** instead of access keys for EC2 instances
9. **Plan before apply** to review changes
10. **Destroy unused resources** to avoid unnecessary costs
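For the tagging practice in particular, the AWS provider's `default_tags` block applies a common tag set to every taggable resource, so individual `tags` maps only need resource-specific entries. A sketch assuming the `aws_region` and `environment` variables defined earlier; the `Project` tag is hypothetical:

```hcl
provider "aws" {
  region = var.aws_region

  # Applied to every resource that supports tags.
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Project     = "data-platform" # hypothetical project tag
    }
  }
}
```

Tags set directly on a resource take precedence over a default tag with the same key.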

Source

Creator's repository · aradotso/data-skills
