terraform-data-engineering-iac

Infrastructure-as-Code fundamentals for data engineering using Terraform to provision AWS resources (S3, EC2, IAM)

---
name: terraform-data-engineering-iac
description: Infrastructure-as-Code fundamentals for data engineering using Terraform to provision AWS resources (S3, EC2, IAM)
triggers:
  - "set up data engineering infrastructure with terraform"
  - "provision AWS resources for data pipelines"
  - "create S3 buckets and EC2 instances with IaC"
  - "terraform infrastructure for data engineering"
  - "manage data infrastructure as code"
  - "deploy AWS data engineering stack with terraform"
  - "automate data platform provisioning"
  - "terraform state management for data pipelines"
---

# Terraform Data Engineering IaC

> Skill by [ara.so](https://ara.so) — Data Skills collection.

This project demonstrates Infrastructure-as-Code (IaC) fundamentals for data engineering using Terraform. It provisions AWS resources commonly used in data pipelines including S3 buckets for data storage and EC2 instances for data processing workloads.

## What It Does

- **Provisions AWS S3 buckets** for data lake storage
- **Creates EC2 instances** for data processing and pipeline execution
- **Manages IAM policies** for secure resource access
- **Uses Terraform state** to track and manage infrastructure changes
- **Provides reproducible infrastructure** for data engineering environments

## Prerequisites

Before using this project, ensure you have:

1. AWS Account with appropriate permissions
2. Terraform CLI installed
3. AWS CLI installed and configured
4. IAM user with S3, EC2, and IAM permissions

## Installation

### 1. Install Terraform

```bash
# macOS
brew install terraform

# Linux
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# Verify installation
terraform version
```

### 2. Install AWS CLI

```bash
# macOS
brew install awscli

# Linux
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure AWS credentials
aws configure
```

### 3. Set Up IAM Permissions

Create an IAM user with the following managed policies:
- `AmazonS3FullAccess`
- `AmazonEC2FullAccess`
- `IAMFullAccess`

**Note:** For production, use fine-grained permissions instead of full access.

## Project Structure

```
terraform/
├── main.tf           # Main infrastructure definitions
├── variables.tf      # Input variables
├── outputs.tf        # Output values
└── terraform.tfstate # State file (generated)
```

## Key Terraform Commands

### Initialize Terraform

```bash
# Initialize backend and download providers
terraform -chdir=terraform init
```

### Validate Configuration

```bash
# Check syntax and validate configuration
terraform -chdir=terraform validate
```

### Format Code

```bash
# Auto-format HCL files
terraform -chdir=terraform fmt
```

### Plan Infrastructure Changes

```bash
# Preview what will be created/changed
terraform -chdir=terraform plan
```

### Apply Infrastructure

```bash
# Create or update infrastructure
terraform -chdir=terraform apply

# Auto-approve without confirmation (use carefully)
terraform -chdir=terraform apply -auto-approve
```

### Destroy Infrastructure

```bash
# Remove all managed infrastructure
terraform -chdir=terraform destroy

# Auto-approve destruction (use carefully)
terraform -chdir=terraform destroy -auto-approve
```

### State Management

```bash
# List all resources in state
terraform -chdir=terraform state list

# Show detailed resource information
terraform -chdir=terraform state show aws_s3_bucket.data_lake

# List resource types and names from the current state (local or remote backend)
terraform -chdir=terraform state pull | jq -r '.resources[] | [.type, .name] | join(",")'
```

## Configuration Examples

### Basic S3 Bucket for Data Storage

```hcl
# terraform/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "my-unique-data-lake-bucket-${var.environment}"
  
  tags = {
    Name        = "Data Lake Bucket"
    Environment = var.environment
    Project     = "data-engineering"
  }
}

resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

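  rule {
    id     = "archive_old_data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # With versioning enabled, this expires current versions only (adds a
    # delete marker); noncurrent versions remain until separately expired
    expiration {
      days = 365
    }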
  }
}
```

### EC2 Instance for Data Processing

```hcl
# terraform/main.tf (continued)
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "data_processor" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  tags = {
    Name        = "Data Processing Server"
    Environment = var.environment
  }

  user_data = <<-EOF
              #!/bin/bash
              # user_data runs as root at first boot, so sudo is unnecessary
              apt-get update
              apt-get install -y python3-pip
              pip3 install pandas boto3 apache-airflow
              EOF
}

resource "aws_eip" "data_processor_eip" {
  instance = aws_instance.data_processor.id
  domain   = "vpc"
}
```

### Variables Configuration

```hcl
# terraform/variables.tf
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}
```
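
Input variables can also carry validation rules, so that a mistyped environment name fails at plan time instead of producing a misnamed bucket. The sketch below is one possible variant of the `environment` variable above. The allowed values (`dev`, `staging`, `prod`) are an assumption, not something this project defines.

```hcl
# terraform/variables.tf (hypothetical variant of the "environment" variable)
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  # Assumed set of environments; adjust the list to match your own naming
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}
```

This block would replace the existing `environment` declaration rather than sit alongside it, since Terraform rejects duplicate variable declarations.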

### Outputs Configuration

```hcl
# terraform/outputs.tf
output "s3_bucket_name" {
  description = "Name of the S3 data lake bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "ec2_public_ip" {
  description = "Public IP of data processing EC2 instance"
  value       = aws_eip.data_processor_eip.public_ip
}

output "ec2_instance_id" {
  description = "Instance ID of data processor"
  value       = aws_instance.data_processor.id
}
```

## Common Patterns

### Multi-Environment Setup

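```bash
# Use workspaces to keep a separate state per environment
terraform -chdir=terraform workspace new staging
terraform -chdir=terraform workspace new production

# Or use variable files (paths resolve relative to the -chdir directory)
terraform -chdir=terraform apply -var-file="environments/dev.tfvars"
terraform -chdir=terraform apply -var-file="environments/prod.tfvars"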
```
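
A variable file only assigns values to variables that are already declared in `variables.tf`. As an illustration, `environments/dev.tfvars` might look like the following. The file contents are hypothetical; only the variable names come from this project.

```hcl
# environments/dev.tfvars (hypothetical contents)
aws_region    = "us-east-1"
environment   = "dev"
instance_type = "t3.medium"
```

A `prod.tfvars` would typically differ only in its values, for example a different `environment` and a larger `instance_type`.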

### Remote State with S3 Backend

```hcl
# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```
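
The S3 backend does not create its own storage: the bucket and lock table named in the `backend` block need to exist before `terraform init` runs. One way to bootstrap them is a small, separate Terraform configuration, sketched below. The resource labels are illustrative and the names simply mirror the backend example above. The lock table's partition key must be a string attribute called `LockID`.

```hcl
# bootstrap/main.tf (hypothetical separate configuration, applied once with local state)
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning keeps earlier copies of the state file for recovery
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend expects a string partition key named "LockID"
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping these resources in their own configuration avoids the circular dependency of a backend that stores state in a bucket it also manages.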

### IAM Role for EC2 with S3 Access

```hcl
resource "aws_iam_role" "data_processor_role" {
  name = "data-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "s3_access" {
  role       = aws_iam_role.data_processor_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_instance_profile" "data_processor_profile" {
  name = "data-processor-profile"
  role = aws_iam_role.data_processor_role.name
}

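# Extend the existing aws_instance.data_processor definition with the instance
# profile; do not declare the resource a second time in the same configuration
resource "aws_instance" "data_processor" {
  ami                  = data.aws_ami.ubuntu.id
  instance_type        = var.instance_type
  iam_instance_profile = aws_iam_instance_profile.data_processor_profile.name
}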
```
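
`AmazonS3FullAccess` grants access to every bucket in the account. A narrower alternative, sketched here under the assumption that the instance only needs to read and write objects in the `data_lake` bucket, is an inline policy scoped to that bucket. The policy name and the exact action list are illustrative choices, not part of the original project.

```hcl
# Hypothetical least-privilege alternative to the AmazonS3FullAccess attachment
data "aws_iam_policy_document" "data_lake_access" {
  # Listing applies to the bucket itself
  statement {
    sid       = "ListDataLakeBucket"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.data_lake.arn]
  }

  # Object reads and writes apply to keys inside the bucket
  statement {
    sid       = "ReadWriteDataLakeObjects"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["${aws_s3_bucket.data_lake.arn}/*"]
  }
}

resource "aws_iam_role_policy" "data_lake_access" {
  name   = "data-lake-access"
  role   = aws_iam_role.data_processor_role.id
  policy = data.aws_iam_policy_document.data_lake_access.json
}
```

If this approach is used, the `aws_iam_role_policy_attachment.s3_access` resource would be removed so that the broad managed policy no longer applies.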

## Verification Commands

### Verify S3 Buckets

```bash
# List all S3 buckets
aws s3 ls

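# Get the bucket's region (bucket name taken from the s3_bucket_name output)
aws s3api get-bucket-location --bucket "$(terraform -chdir=terraform output -raw s3_bucket_name)"

# List bucket contents
aws s3 ls "s3://$(terraform -chdir=terraform output -raw s3_bucket_name)/"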
```

### Verify EC2 Instances

```bash
# List running instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table

# Get specific instance details
aws ec2 describe-instances --instance-ids i-1234567890abcdef0
```

### Connect to EC2 Instance

```bash
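# SSH into the instance. This needs a key pair (key_name) on the instance and a
# security group that allows inbound port 22; the examples above configure neither
ssh -i ~/.ssh/my-key.pem ubuntu@$(terraform -chdir=terraform output -raw ec2_public_ip)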
```
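
For the SSH command to succeed, the instance needs a registered key pair and a security group that admits port 22. The following is a minimal sketch of both, assuming the instance runs in the default VPC and that a public key file exists locally. The two variables, the resource labels, and the `~/.ssh/my-key.pub` path are hypothetical.

```hcl
# Hypothetical inputs; neither variable exists in the project's variables.tf
variable "ssh_public_key_path" {
  description = "Path to the public half of the SSH key pair"
  type        = string
  default     = "~/.ssh/my-key.pub"
}

variable "ssh_allowed_cidr" {
  description = "CIDR range allowed to reach port 22, e.g. your own IP as x.x.x.x/32"
  type        = string
}

resource "aws_key_pair" "data_processor" {
  key_name = "data-processor-key"
  # file() does not expand "~", so pathexpand() resolves the home directory first
  public_key = file(pathexpand(var.ssh_public_key_path))
}

resource "aws_security_group" "data_processor_ssh" {
  name        = "data-processor-ssh"
  description = "Inbound SSH for the data processing instance"

  ingress {
    description = "SSH from the allowed range"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.ssh_allowed_cidr]
  }

  # Allow all outbound traffic so package installs in user_data still work
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Attributes to add to the existing aws_instance.data_processor resource:
#   key_name               = aws_key_pair.data_processor.key_name
#   vpc_security_group_ids = [aws_security_group.data_processor_ssh.id]
```

Leaving `ssh_allowed_cidr` without a default forces a deliberate choice at apply time instead of silently opening port 22 to the whole internet.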

## Troubleshooting

### Issue: Terraform Init Fails

```bash
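# Clear the provider cache and reinitialize
rm -rf terraform/.terraform
# Optional: also delete the lock file to re-select provider versions (this drops the pinned versions)
rm -f terraform/.terraform.lock.hcl
terraform -chdir=terraform init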
```

### Issue: State Lock Error

```bash
# Force unlock (use with caution)
terraform -chdir=terraform force-unlock LOCK_ID
```

### Issue: AWS Credentials Not Found

```bash
# Verify AWS configuration
aws configure list
aws sts get-caller-identity

# Set credentials explicitly (replace the placeholders with your own values)
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"
```

### Issue: Resource Already Exists

```bash
# Import existing resource into state
terraform -chdir=terraform import aws_s3_bucket.data_lake my-existing-bucket

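# Or recreate with a unique name (requires a bucket_suffix variable that is
# declared in variables.tf and referenced in the bucket name)
terraform -chdir=terraform apply -var="bucket_suffix=$(date +%s)"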
```
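
The `-var="bucket_suffix=..."` form only works if the configuration declares a matching variable; Terraform rejects values for undeclared variables. A minimal sketch of that wiring follows. The `bucket_suffix` variable and the way it is joined into the name are assumptions, shown as a variant of the existing `data_lake` bucket.

```hcl
# terraform/variables.tf (hypothetical addition)
variable "bucket_suffix" {
  description = "Optional suffix that keeps the bucket name globally unique"
  type        = string
  default     = ""
}

# terraform/main.tf (variant of the existing data_lake bucket)
resource "aws_s3_bucket" "data_lake" {
  # compact() drops the suffix when it is empty, so the default name is unchanged
  bucket = join("-", compact(["my-unique-data-lake-bucket", var.environment, var.bucket_suffix]))

  tags = {
    Name        = "Data Lake Bucket"
    Environment = var.environment
    Project     = "data-engineering"
  }
}
```

Changing the name of a bucket that already exists forces Terraform to destroy and recreate it, so a suffix is best chosen before the first apply.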

### Issue: Permission Denied

Check IAM policies and ensure your user has required permissions:

```bash
# Test S3 permissions
aws s3 ls

# Test EC2 permissions
aws ec2 describe-instances

# Test IAM permissions
aws iam list-users
```

### Debugging Terraform

```bash
# Enable debug logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Show detailed plan
terraform -chdir=terraform plan -out=tfplan
terraform -chdir=terraform show tfplan

# Reconcile state with actual infrastructure (replaces the deprecated `terraform refresh`)
terraform -chdir=terraform apply -refresh-only
```

## Best Practices

1. **Always use unique bucket names**: S3 bucket names must be globally unique
2. **Version your state files**: Enable S3 versioning for state file backups
3. **Use remote state**: Store state in S3 with locking via DynamoDB
4. **Tag all resources**: Apply consistent tagging for cost tracking and organization
5. **Use variables**: Parameterize configurations for reusability
6. **Run `terraform plan` first**: Review the proposed changes before every apply
7. **Destroy dev resources**: Tear down test infrastructure when it is no longer needed to avoid unnecessary costs
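
For the tagging practice in particular, the AWS provider can apply a common set of tags to every taggable resource it creates, which reduces the chance of a resource being left untagged. The block below is a sketch of the existing provider configuration extended with `default_tags`; the choice of which tags to set globally is an assumption.

```hcl
# terraform/main.tf (hypothetical variant of the existing provider block)
provider "aws" {
  region = var.aws_region

  # Applied to every taggable resource created through this provider
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "data-engineering"
    }
  }
}
```

Resource-level `tags` blocks can then be limited to values that are specific to one resource, such as `Name`.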

Source

Creator's repository · aradotso/data-skills
