NOT PORTED YET
- DevOps Bash tools for AWS, EKS, EC2 etc
- Multi-Session Console
- AWS CLI
- EC2
- IAM
- RDS - Relational Database Service
- Cloudfront
- ACM - AWS Certificate Manager
- Why move away from CloudWatch Logs and Metrics
- Troubleshooting
- Diagrams
- Memes
If using several AWS accounts as per best practice isolation, you may want to turn on Multi-Session Console support for up to 5 sessions as documented here:
https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/multisession.html
Note this will change existing URLs by adding a subdomain to distinguish sessions.
brew install awscli
or run this install script which auto-detects and handles Mac or Linux installs:
git clone https://github.com/HariSekhon/DevOps-Bash-tools bash-tools
bash-tools/install/install_aws_cli.sh
Follow the AWS CLI configuration doc.
Typically you'll use SSO config or access keys.
Using AWS CLI Environment Variables.
Set your environment variables in direnv.
See the AWS CLI environment variables reference.
Ready to rock AWS multi-environment switching direnv code including AWS account, region and EKS cluster:
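As a rough illustration of the kind of settings a direnv `.envrc` for this might export, here is a minimal sketch. It is not the actual DevOps-Bash-tools code: the profile, region and cluster values are placeholder assumptions, and `EKS_CLUSTER` is a hypothetical variable name used only for illustration (`AWS_PROFILE` and `AWS_DEFAULT_REGION` are standard AWS CLI environment variables).

```shell
# .envrc - direnv evaluates this file when you cd into the directory

# which AWS CLI profile (and therefore which AWS account) to use - placeholder value
export AWS_PROFILE="dev"

# default region for AWS CLI commands - placeholder value
export AWS_DEFAULT_REGION="eu-west-1"

# hypothetical variable naming the EKS cluster for this environment
export EKS_CLUSTER="dev-cluster"

# the kubectl context could then be set up with something like:
#   aws eks update-kubeconfig --name "$EKS_CLUSTER" --region "$AWS_DEFAULT_REGION"
```

Keeping one such file per environment directory means the account, region and cluster switch together when you change directory.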
If you have a lot of work AWS profiles due to AWS best practice of having separate accounts for everything...
You can use the interactive aws_profile.sh script from DevOps-Bash-tools to choose from an
easy menu list:
P.S. you can create easy GIF captures using the scripts in the DevOps-Bash-tools repo.
A common issue is failing to find resources in the UI or CLI.
Check your region in the top right of the UI or that your CLI is picking up the right region like so:
aws configure get region
and compare with:
aws ec2 describe-availability-zones --query "AvailabilityZones[0].RegionName" --output text
See EKS - Kubectl Access section.
http://aws.amazon.com/ec2/instance-types/
https://aws.amazon.com/ec2/pricing/on-demand/
DO NOT USE T-series (T3 / T2) burstable general purpose instance types for anything besides your own personal PoC.
They can seize up under heavy load and are not recommended for any production workloads.
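To spot burstable instances quickly, a small helper along these lines could be used. This is a sketch only: the function name `is_burstable_instance_type` is hypothetical, and the pattern assumes T-series type names start with `t` followed by a digit (eg. `t2.micro`, `t3a.large`).

```shell
# returns 0 if the given instance type looks like a T-series burstable type
is_burstable_instance_type() {
  case "$1" in
    t[0-9]*.*) return 0 ;;   # t2.micro, t3.medium, t3a.large ...
    *)         return 1 ;;
  esac
}

# example input - real values could come from:
#   aws ec2 describe-instances \
#     --query 'Reservations[*].Instances[*].[InstanceId,InstanceType]' --output text
for instance_type in t3.medium t2.micro m5.large; do
  if is_burstable_instance_type "$instance_type"; then
    echo "$instance_type: burstable T-series"
  else
    echo "$instance_type: not T-series"
  fi
done
```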
Find the EC2 instance ID:
aws ec2 describe-instances \
--query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value | [0],Placement.AvailabilityZone]' \
--output table
Debug if you're having issues rebooting a VM:
aws ec2 get-console-output --instance-id "$EC2_INSTANCE_ID" | jq -r .Output
Amazon Linux 2 has /tmp on the root partition.
Amazon Linux 2023 /tmp uses tmpfs which stores files in a ramdisk, limited by the EC2 instance's RAM.
To disable tmpfs and go back to using /tmp on underlying root partition,
follow the Linux - Disable tmpfs section.
Packer can be used to easily build AMIs in your shared CI/CD account.
To share these AMIs with other AWS accounts for different projects or environments (dev / staging / production) that may all need the same base image or EKS image, you can share them like so:
aws ec2 modify-image-attribute \
--image-id "$ami_id" \
--launch-permission "Add=[{UserId=$aws_account_id}]"
There is a script in DevOps-Bash-tools repo that allows you to share by name for predictable convenience:
aws_ec2_ami_share_to_account.sh "$ami_name_or_id" "$aws_account_id"
Clone an EC2 instance using this script from DevOps-Bash-tools repo:
aws_ec2_clone_instance.sh "$instance_name" "$new_instance_name"
OR
Manual breakdown of steps:
Create an AMI from it using this script from DevOps-Bash-tools repo:
aws_ec2_create_ami_from_instance.sh "$instance_id" "$ami_name"
List your AMIs:
aws ec2 describe-images --owners self --query 'Images[*].{ID:ImageId,Name:Name}' --output table
Check your AMI is finished creating with state Available:
aws ec2 describe-images --image-ids "$ami_id" --output table
Create a new EC2 instance from the AMI:
aws ec2 run-instances \
--image-id "$ami_id" \
--instance-type "$instance_type" \
--subnet-id "$subnet_id" \
--key-name "$ec2_key_pair" \
--security-group-ids "$security_group_id"
Describe the newly created instance:
aws ec2 describe-instances --instance-ids "$instance_id" --output table
Adding an EBS volume can also be useful for temporary space increases, eg. adding a big /tmp partition to allow some
migration loads in an Informatica agent, which can be removed later
(since you cannot shrink partitions later if you enlarge them instead).
Find out the zone the EC2 instance is in - you will need to create the EBS volume in the same zone:
aws ec2 describe-instances \
--query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value | [0],Placement.AvailabilityZone]' \
--output table
Set the Availability Zone environment variable to use in further commands:
AVAILABILITY_ZONE=eu-west-1a # make sure this is the same Availability Zone as the VM you want to attach it to
Choose a size in GB:
DISK_SIZE_GB=500
Create an EC2 EBS volume of 500 GB in the eu-west-1a zone where the VM is:
REGION="${AVAILABILITY_ZONE%?}" # auto-infer the region by removing last character
aws ec2 create-volume \
--size "$DISK_SIZE_GB" \
--region "$REGION" \
--availability-zone "$AVAILABILITY_ZONE" \
--volume-type gp3
output:
{
"AvailabilityZone": "eu-west-1a",
"CreateTime": "2024-08-02T11:55:18+00:00",
"Encrypted": false,
"Size": 500,
"SnapshotId": "",
"State": "creating",
"VolumeId": "vol-007e4d5f88a46fb6f",
"Iops": 3000,
"Tags": [],
"VolumeType": "gp3",
"MultiAttachEnabled": false,
"Throughput": 125
}
Set the VolumeId field to a variable to use in further commands:
VOLUME_ID="vol-007e4d5f88a46fb6f"
Create a description variable to use in the next command:
VOLUME_DESCRIPTION="informatica-prod-secure-agent-tmp-volume"
Name the new volume so you know what it is when you look at it in future in the UI:
aws ec2 create-tags \
--resources "$VOLUME_ID" \
--tags Key=Name,Value="$VOLUME_DESCRIPTION"
Attaching the volume can be done with zero downtime while the VM is running.
Look up the EC2 instance ID of the VM you want to attach it to:
aws ec2 describe-instances \
--query 'Reservations[*].Instances[*].[InstanceId, Tags[?Key==`Name`].Value | [0]]' \
--output table
Create a variable with the EC2 instance ID:
EC2_INSTANCE_ID="i-0a1234b5c6d7890e1"
Attach the new disk to the instance giving it a new device name, in this case /dev/sdb:
aws ec2 attach-volume --device /dev/sdb \
--instance-id "$EC2_INSTANCE_ID" \
--volume-id "$VOLUME_ID"
(you cannot specify /dev/nvme1 as the next disk you see on Nitro VMs, but if you specify /dev/sdb then it will
appear as /dev/nvme1n1 anyway)
Inside the VM - follow the Disk Management commands.
See if the new disk is available:
cat /proc/partitions
If you can't see it yet, run partprobe:
sudo partprobe
and then repeat the above cat /proc/partitions (it has also appeared after a few seconds on EC2 without this)
Create a new GPT partition table on the new disk:
sudo parted /dev/nvme1n1 --script mklabel gpt
Create a new partition that spans the entire disk:
sudo parted /dev/nvme1n1 --script mkpart primary 0% 100%
See the new partition:
cat /proc/partitions
Format the partition with XFS:
sudo mkfs.xfs /dev/nvme1n1p1
Verify the new formatting:
lsblk -f /dev/nvme1n1
Since device numbers can change on rare occasion, find and use the UUID instead:
lsblk -o NAME,UUID
Edit /etc/fstab:
sudo vi /etc/fstab
and add a line like this, substituting the UUID from the above commands:
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /tmp xfs defaults,nofail 0 2
If the mount point is /tmp make sure you shut down any processes that might be using it first, like the
Informatica agent.
Then mount it using this short form of the mount command which tests the fstab at the same time:
sudo mount /tmp
If you see this hint:
mount: (hint) your fstab has been modified, but systemd still uses
the old version; use 'systemctl daemon-reload' to reload.
then reload systemd:
sudo systemctl daemon-reload
Check the newly mounted partition and space is available:
df -Th /tmp
If you've just mounted a new /tmp make sure to set the sticky bit and world writable permissions for people and apps
to be able to use it:
sudo chmod 1777 /tmp
Start back up any processes that you shut down before mounting the disk.
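The fstab line used in the steps above can also be generated rather than typed by hand. The following is a minimal sketch: the function name `fstab_line` is hypothetical, and it only prints the line so you can review it before adding it to /etc/fstab.

```shell
# print an fstab entry for a UUID-identified filesystem
# args: <uuid> <mount_point> [fstype, defaults to xfs]
fstab_line() {
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "${3:-xfs}"
}

# on a real VM the UUID could be read with something like:
#   uuid="$(sudo blkid -s UUID -o value /dev/nvme1n1p1)"
uuid="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # placeholder value

fstab_line "$uuid" /tmp
```

Reviewing the printed line before appending it avoids a typo in /etc/fstab, which matters because a bad entry without nofail can stop the VM from booting.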
https://docs.aws.amazon.com/ebs/latest/userguide/recognize-expanded-volume-linux.html
Check the partition sizes by running this inside the EC2 VM shell:
lsblk
List EC2 EBS volumes using script in DevOps-Bash-tools repo:
aws_ec2_ebs_volumes.sh
or find it in the AWS Console UI:
open "https://$AWS_DEFAULT_REGION.console.aws.amazon.com/ec2/home?region=$AWS_DEFAULT_REGION#Volumes:"
Take a snapshot first using this script from DevOps-Bash-tools repo:
aws_ec2_ebs_create_snapshot_and_wait.sh "$volume_id" "before root partition expansion"
(this script automatically determines and prefixes the name of the EC2 instance to the description)
or manually create and keep checking for completion:
aws ec2 create-snapshot --volume-id "$volume_id" --description "myvm: before root partition expansion"
The snapshot may take a while. Watch its progress in the AWS Console UI here:
open "https://$AWS_DEFAULT_REGION.console.aws.amazon.com/ec2/home?region=$AWS_DEFAULT_REGION#Snapshots:"
or check for pending snapshots using AWS CLI:
aws ec2 describe-snapshots --query 'Snapshots[?State==`pending`].[SnapshotId,VolumeId,Description,State]' --output table
After the snapshot above is complete, run this script from DevOps-Bash-tools repo:
aws_ec2_ebs_resize_and_wait.sh "$volume_id" "$size_in_gb"
or manually:
aws ec2 modify-volume --volume-id "$volume_id" --size "$size_in_gb"
and then repeatedly manually monitor the modification:
aws ec2 describe-volumes-modifications --volume-ids "$volume_id"
Double check which partition you want to enlarge by running this inside the EC2 VM shell:
lsblk
If the partition is number 4, then
sudo growpart /dev/nvme0n1 4
output should look like this:
CHANGED: partition=4 start=1437696 old: size=417992671 end=419430366 new: size=627707871 end=629145566
Verify the new size:
lsblk
Check the filesystem sizes and types:
df -hT
If it's Ext4, extend the filesystem like so:
sudo resize2fs /dev/nvme0n1p4
If it's XFS, extend the filesystem like so, in this case for the / root filesystem:
sudo xfs_growfs -d /
output should look like this:
meta-data=/dev/nvme0n1p4 isize=512 agcount=86, agsize=610431 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=0 inobtcount=0
data = bsize=4096 blocks=52249083, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 52249083 to 78463483
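As a sanity check on the numbers in the outputs above, the sizes can be converted to GiB with shell arithmetic. This sketch assumes the growpart sizes are counted in 512-byte sectors (an assumption, as the output does not state the unit) and uses the bsize=4096 block size shown in the xfs_growfs output. The function names are hypothetical.

```shell
# convert a count of 512-byte sectors to whole GiB (rounded down)
sectors_to_gib() {
  echo $(( $1 * 512 / 1024 / 1024 / 1024 ))
}

# convert a count of filesystem blocks to whole GiB (rounded down)
# args: <blocks> [block_size_bytes, defaults to 4096]
blocks_to_gib() {
  echo $(( $1 * ${2:-4096} / 1024 / 1024 / 1024 ))
}

echo "partition:  $(sectors_to_gib 417992671) GiB -> $(sectors_to_gib 627707871) GiB"
echo "filesystem: $(blocks_to_gib 52249083) GiB -> $(blocks_to_gib 78463483) GiB"
```

Under those assumptions both lines report growth from 199 GiB to 299 GiB, so the partition and filesystem figures agree with each other.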
Verify the new filesystem size:
df -hT
This is only for non-root volumes.
For example if you want to replace the /tmp disk with a smaller one now that data migration is complete.
IMPORTANT: First shut down any software in the VM using the volume to avoid data corruption
Inside the VM, unmount the volume, eg:
umount /tmp
If you get an error like:
umount: /tmp: target is busy.
check:
lsof /tmp
or
fuser -mv /tmp
and kill those processes or ask users to log out if it's their shell session holding it.
If there is nothing left except:
USER PID ACCESS COMMAND
/tmp: root kernel mount /tmp
You may have to reboot the VM, in which case remove or comment out the disk's mount point entry (the /tmp
entry in this case) from /etc/fstab first to prevent a possible boot time error.
You can do the detachment but the volume will still be visible in an ls -l /tmp and may require a reboot to clear
the state and connection to the EBS volume.
WARNING: do not reboot the EC2 instance without commenting out the disk mount or setting the nofail option,
otherwise you will be forced to do a disk mount recovery using another EC2 instance as per the EC2 Disk Mount Recovery procedure from the troubleshooting section.
If you do that beware that a Reboot instance may not succeed and you may need a Force Instance Stop cold shutdown and
startup to clear the state as a regular Reboot may get stuck starting up before SSH comes up to do anything.
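Before rebooting, it may be worth checking which fstab entries lack the nofail option. A minimal sketch, assuming standard whitespace-separated fstab fields with mount options in the 4th field; the function name `fstab_missing_nofail` is hypothetical and the sample file stands in for /etc/fstab.

```shell
# print the mount points of fstab entries whose options (4th field) lack nofail
# args: [fstab_path, defaults to /etc/fstab]
fstab_missing_nofail() {
  awk 'NF >= 4 && $1 !~ /^#/ && $4 !~ /(^|,)nofail(,|$)/ { print $2 }' "${1:-/etc/fstab}"
}

# demo against a sample file rather than the live /etc/fstab
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# sample fstab
UUID=aaaa / xfs defaults,nofail 0 0
UUID=bbbb /data xfs defaults 0 2
EOF

fstab_missing_nofail "$sample"   # prints: /data
rm -f "$sample"
```

Any mount point it prints is one that could block boot if its disk is missing.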
From DevOps-Bash-tools list instances and their EBS volumes:
aws_ec2_ebs_volumes.sh
Detach the volume:
aws ec2 detach-volume --volume-id "$VOLUME_ID" --instance-id "$EC2_INSTANCE_ID" --device "$DEVICE"
List unattached EBS volumes:
aws ec2 describe-volumes --query 'Volumes[?Attachments==`[]`].[VolumeId]' --output table
Optionally delete the EBS volume if you're 100% sure you don't need it any more:
aws ec2 delete-volume --volume-id "$VOLUME_ID"
Putting AWS SSO roles into IAM policies can be tricky because what you see from
aws sts get-caller-identity
{
"UserId": "ABCDEFA1BCDEFABCD23EF:Hari@domain.com",
"Account": "123456789012",
"Arn": "arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b/Hari@domain.com"
}
is not what you need to put in to IAM.
Putting something like this:
arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b/Hari@domain.com
or even
arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b
will result in an AWS IAM error like this:
MalformedPolicy: Invalid principal in policy
Instead you need to put the base role like this:
arn:aws:iam::123456789012:role/AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b
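The conversion from the STS assumed-role ARN to the base IAM role ARN is mechanical, so it can be sketched with sed. The function name `sts_to_iam_role_arn` is hypothetical, and the pattern assumes the standard `aws` partition and the ARN layout shown above.

```shell
# convert arn:aws:sts::<account>:assumed-role/<role>/<session>
#      to arn:aws:iam::<account>:role/<role>
sts_to_iam_role_arn() {
  printf '%s\n' "$1" |
    sed -E 's|^arn:aws:sts::([0-9]+):assumed-role/([^/]+).*$|arn:aws:iam::\1:role/\2|'
}

sts_to_iam_role_arn "arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b/Hari@domain.com"
# prints: arn:aws:iam::123456789012:role/AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b
```

On a live session the input could come from `aws sts get-caller-identity --query Arn --output text`.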
You can also verify the role using this command:
aws iam get-role --role-name AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b
The easiest way to determine your currently logged in AWS SSO role is using this script from DevOps-Bash-tools:
aws_sso_role_arn.sh
or to see all the AWS SSO role ARNs in your currently authenticated account:
aws_sso_role_arns.sh
Note these are two separate scripts.
However, beware that S3 bucket policies need the fuller format of:
arn:aws:iam::123456789012:role/aws-reserved/sso.amazonaws.com/eu-west-1/AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b
You can get this from the ARN output by the command:
aws iam get-role --role-name AWSReservedSSO_DevOpsAdmins_1abcd2345e67fa8b
or by running this command and grepping its output:
aws iam list-roles --query 'Roles[*].Arn' --output text | tr '[:space:]' '\n'
or by just using this script from DevOps-Bash-tools:
aws_sso_role_arn_full.sh
Hosted SQL RDBMS like MySQL, PostgreSQL, Microsoft SQL Server etc.
AWS CLI doesn't have a convenient short form for just listing instances, but you can get one like this:
aws rds describe-db-instances | jq -r '.DBInstances[].DBInstanceIdentifier'
with their statuses in a table:
aws rds describe-db-instances --query "DBInstances[*].[DBInstanceIdentifier,DBInstanceStatus]" --output table
(notice this is using AWS CLI query not jq, hence the different query string format)
Reset the master password using the name returned from the above commands:
aws rds modify-db-instance \
--db-instance-identifier "$RDS_INSTANCE" \
--master-user-password "MyNewVerySecurePassword"
CDN.
Put it in front of S3 buckets serving public content, since the buckets themselves should be blocked from direct public access by a Control Tower guardrail against public S3 buckets.
Get the Cloudfront domain name that content is available at:
aws cloudfront list-distributions --query 'DistributionList.Items[*].DomainName' --output text
First ensure the certificate is in PEM format by converting it using OpenSSL, and consider combining the intermediate chain certificate for maximum compatibility.
Then import it to ACM:
aws acm import-certificate \
--certificate "file://$PWD/$name-cert.pem" \
--private-key "file://$PWD/$name-privatekey.pem" \
--certificate-chain "file://$PWD/$chain.pem"
You need the file://... path prefix otherwise you'll get this error:
Invalid base64: "$name-cert.pem"
If you get an error like this:
Invalid base64: "-----BEGIN PRIVATE KEY-----
...
Add the debug switch to the command:
--debug
It could be that the private key is in PKCS#1 instead of PKCS#8 format. Convert it like so:
openssl pkcs8 -topk8 -inform PEM -outform PEM -nocrypt -in "$name-privatekey.pem" -out "$name-privatekey-pkcs8.pem"
If this wasn't the case then the output file will be identical:
md5sum "$name"-privatekey.pem "$name"-privatekey-pkcs8.pem
Check for Windows carriage returns in the file format and if found...
Fix in place:
sed -i 's/\r$//' "$name"-*.pem "$chain.pem"
Or safer fix to new files:
for x in "$name"-*.pem "$chain.pem"; do
tr -d '\r' < "$x" > "${x%.pem}.fixed.pem"
done
Check the Base64 Encodings in the SSL doc.
If all of the above fails, it's possible that it's a bug in the AWS CLI as I've found that pasting into the UI works.
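The carriage return and key format checks above can be combined into one pre-flight check before importing. This is a sketch only: the function name `check_pem` is hypothetical, and it relies on the PEM header to tell the formats apart (`BEGIN RSA PRIVATE KEY` for PKCS#1, `BEGIN PRIVATE KEY` for PKCS#8).

```shell
# report Windows carriage returns and the private key format of a PEM file
check_pem() {
  pem_file="$1"
  cr="$(printf '\r')"   # a literal carriage return character
  if grep -q "$cr" "$pem_file"; then
    echo "$pem_file: contains Windows carriage returns"
  fi
  if grep -q -- '-----BEGIN RSA PRIVATE KEY-----' "$pem_file"; then
    echo "$pem_file: PKCS#1 private key"
  elif grep -q -- '-----BEGIN PRIVATE KEY-----' "$pem_file"; then
    echo "$pem_file: PKCS#8 private key"
  fi
}

# demo on a throwaway file with a PKCS#1 header and CRLF line endings
demo="$(mktemp)"
printf '%s\r\n' '-----BEGIN RSA PRIVATE KEY-----' 'placeholder' > "$demo"
check_pem "$demo"
rm -f "$demo"
```

In practice you would run it over the same files passed to the import command, eg. `check_pem "$name-privatekey.pem"`.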
Verify the import:
aws acm list-certificates --query "CertificateSummaryList[*].{ARN:CertificateArn,DomainName:DomainName}"
For the one you just imported:
CERTIFICATE_ARN="$(aws acm list-certificates --query "CertificateSummaryList[-1].CertificateArn" --output text | tee /dev/stderr)"
aws acm describe-certificate --certificate-arn "$CERTIFICATE_ARN"
Logs and metrics cannot be centralized in AWS CloudWatch. They can only be stored in the CloudWatch of the respective AWS account.
High costs for CloudWatch metrics and dashboards.
The Lucene query syntax of Elasticsearch is more user friendly than AWS CloudWatch Logs Insights.
Make sure you are not using T-series (T3 / T2) burstable general purpose instance types.
Change to another instance type if you are.
When Status becomes Storage Full on the RDS home page the DB instance writes stop working due to no space to
write DB redo logs for ACID compliance. Reads may still work during this time.
Solution: Ensure Enable storage autoscaling is ticked and modify the instance to increase the
Maximum Storage Threshold by a reasonable amount, no less than 20%.
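As a worked example of the 20% minimum, the new threshold can be computed with integer arithmetic, rounding up. The function name `min_new_max_storage_gb` is hypothetical and the 500 GB starting value is a placeholder.

```shell
# smallest whole number of GB that is at least 20% above the current threshold
min_new_max_storage_gb() {
  echo $(( ($1 * 120 + 99) / 100 ))
}

current_max_gb=500   # placeholder: the current Maximum Storage Threshold
new_max_gb="$(min_new_max_storage_gb "$current_max_gb")"
echo "raise Maximum Storage Threshold from ${current_max_gb} GB to at least ${new_max_gb} GB"

# the change could then be applied with something like:
#   aws rds modify-db-instance \
#     --db-instance-identifier "$RDS_INSTANCE" \
#     --max-allocated-storage "$new_max_gb" \
#     --apply-immediately
```

For a 500 GB threshold this gives 600 GB.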
The event log is basically useless to tell you what the actual problem is other than it fails to get the data files from S3:
Failed to create cache <name>. Data restoration from snapshot failed because failed to retrieve file from S3..
Ensure the S3 bucket policy grants access to <region>.elasticache-snapshot.amazonaws.com
and elasticache.amazonaws.com, and that the policy of the KMS key used to encrypt the bucket does the same.
In Terraform / Terragrunt it might look something like this for the S3 bucket policy:
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ServiceAccess",
"Effect": "Allow",
"Principal": {
"Service": [
"${local.aws_region}.elasticache-snapshot.amazonaws.com", # this is the one
"elasticache.amazonaws.com"
]
},
"Action": [
"s3:GetBucketAcl",
"s3:GetObject",
"s3:GetObjectAcl",
"s3:PutObject",
"s3:ListBucket",
"s3:ListMultipartUploadParts",
"s3:ListBucketMultipartUploads"
],
"Resource": [
"arn:aws:s3:::${local.name}",
"arn:aws:s3:::${local.name}/*",
# the above wildcard should be sufficient
#"arn:aws:s3:::${local.name}/my-iplookup-db-0001.rdb",
#"arn:aws:s3:::${local.name}/my-iplookup-db-0002.rdb",
#"arn:aws:s3:::${local.name}/my-iplookup-db-0003.rdb"
]
},
{
"Sid": "RoleAccess",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/eks-..." # allow access from EKS role
},
"Action": [
"s3:*"
],
"Resource": [
"arn:aws:s3:::${local.name}",
"arn:aws:s3:::${local.name}/*",
# the above wildcard should be sufficient
#"arn:aws:s3:::${local.name}/my-iplookup-db-0001.rdb",
#"arn:aws:s3:::${local.name}/my-iplookup-db-0002.rdb",
#"arn:aws:s3:::${local.name}/my-iplookup-db-0003.rdb"
]
}
]
}
EOF
and this for the KMS policy:
policy = {
"Version" : "2012-10-17",
"Id" : "key-default-1",
"Statement" : [
{
"Sid" : "Enable IAM User Permissions",
"Effect" : "Allow",
"Principal" : {
"AWS" : "arn:aws:iam::${local.aws_account_id}:root",
"Service": [
"elasticache.amazonaws.com",
"${local.aws_region}.elasticache-snapshot.amazonaws.com"
]
},
"Action" : "kms:*",
"Resource" : "*"
}
]
}
After EKS Spot pod migrations, the app pod sometimes comes up before the Vault pod comes up so its attempt to get the DB password from Vault fails and results in a blank DB password and later DB connection error.
In a Python Django app it may remain up but not functioning and its logs may contain Python tracebacks like this:
MySQLdb._exceptions.OperationalError: (1045, "Access denied for user 'myuser'@'x.x.x.x' (using password: YES)")
Restart the app deployment to restart the pod after the Vault pod has come up so that the pod re-fetches the correct DB password from Vault.
kubectl rollout restart deployment <app>
- Create an init container to accurately test for Vault availability before allowing the app pod to come up
- This can test Vault availability
- It can fetch DB password similar to what the app container does
- It can test that the fetched DB password actually works using a test connection to the DB
- The App itself could crash upon startup detection that the DB connection fails to cause the pod to crash and
auto-restart until the DB password is fetched and connected successfully
- The DB connection and implicitly the Vault password load could be tested by the entrypoint trying to connect to the DB before starting the app
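The availability checks in the options above could be built around a small retry loop in the init container or entrypoint. This is a sketch under assumptions: the function name `wait_for` is hypothetical, and the commented Vault check assumes `VAULT_ADDR` is set and that Vault's `/v1/sys/health` endpoint returning success means Vault is ready to serve the password.

```shell
# retry a command until it succeeds or the attempts run out
# usage: wait_for <max_attempts> <sleep_seconds> <command> [args...]
wait_for() {
  max_attempts="$1"
  sleep_secs="$2"
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$sleep_secs"
  done
}

# in an init container this might look like:
#   wait_for 60 5 curl -fsS -o /dev/null "$VAULT_ADDR/v1/sys/health"
# followed by a test DB connection using the fetched password

# trivial demo that succeeds on the first attempt
wait_for 3 0 true && echo "dependency is up"
```

Returning non-zero when the attempts run out makes the container fail, so Kubernetes restarts it rather than letting the app start with a blank password.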
This is sometimes necessary when a Linux VM isn't coming up due to some disk changes such as detaching and deleting a
volume that is still in /etc/fstab or some other configuration imperfection that is preventing the boot process from
completing to give you SSH access.
Use another EC2 instance in the same Availability Zone as the problematic VM, since that is where the EBS volume is physically located.
- Shut down the problem instance which isn't booting.
- Optional: mark the EBS volume with tags such as Name1=Problem to make it easier to find
- Detach the EBS volume from the problem instance
- Find the volume (optionally using the Problem search in the list of EBS volumes)
- Attach the EBS volume to your debug EC2 instance in the same Availability Zone as device /dev/sdf
- On the debug instance:
Find the new disk. It's usually the largest partition on the new disk:
cat /proc/partitions
Mount it:
sudo mount /dev/xvdf4 /mnt
Edit the fstab:
sudo vi /mnt/etc/fstab
Add the nofail option to the mount options (4th field) of all disk lines to ensure the Linux OS comes up even if it can't
find a disk (because for example you've detached it to replace it with a different EBS volume).
The lines should end up looking like this:
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx / xfs defaults,nofail 0 0
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /tmp xfs defaults,nofail 0 2
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /boot xfs defaults,nofail 0 0
UUID=xxxx-xxxx /boot/efi vfat defaults,uid=0,gid=0,umask=077,shortname=winnt,nofail 0 2
After editing and saving the /etc/fstab file, unmount the recovery disk:
sudo umount /mnt
- Detach the volume from the debug instance
- Attach the volume to the original instance
- Start the original instance which should now come up
- Remove the Problem tag from the volume
Partial port from private Knowledge Base page 2012+




