jedberg
Boot time is the number one factor in your success with auto-scaling. The smaller your boot time, the smaller your prediction window needs to be. For example, if your boot time is five minutes, you need to predict what your traffic will be in five minutes, but if you can boot in 20 seconds, you only need to predict 20 seconds ahead. By definition, your predictions will be more accurate the smaller the window is.

But! Autoscaling serves two purposes. One is to address load spikes. The other is to reduce costs by scaling down. What this solution does is trade away some of the cost savings: you prewarm the EBS volumes and then pay for them while they sit idle.

This feels like a reasonable tradeoff if you can justify the cost with better auto-scaling.

And if you're not autoscaling, it's still worth the cost if the alternative is having your engineers wait around for instances to boot.

necovek
From a technical perspective, Amazon has actually optimized this, but turned it into "serverless functions": their ultra-optimized image paired with Firecracker achieves ultra-fast boot-up of virtual Linux machines. IIRC from when Firecracker was introduced, they were booting VMs in sub-second times.

I wonder if Amazon would ever decide to offer booting the same image with the same hypervisor in EC2 as they do for lambdas?

amluto
I don’t use EC2 enough to have played with this, but a big part here is the population of the AMI into the per-instance EBS volume.

ISTM one could do much better with an immutable/atomic setup: set up an immutable read-only EBS volume, have each instance share that volume, and give each instance a per-instance volume that starts out blank.

Actually pulling this off looks like it would be limited by the rules of EBS Multi-Attach. One could have fun experimenting with an extremely minimal boot AMI that streams a squashfs or similar file from S3 and unpacks it.

edit: contemplating a bit, unless you are willing to babysit your deployment and operate under serious constraints, EBS multi-attach looks like the wrong solution. I think the right approach would be to build a very, very small AMI that sets up a rootfs using s3fs or a similar technology and optionally puts an overlayfs on top. Alternatively, it could set up a block device backed by an S3 file and optionally use it as a base layer of a device-mapper stack. There's plenty of room to optimize this.
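Roughly, that boot step could look something like the following (just a sketch, not anything from the article; it assumes the aws CLI is baked into the tiny AMI, and the bucket, key, and mount paths are all made up):

    # Hypothetical boot step: fetch a squashfs root from S3, mount it read-only,
    # and layer a writable tmpfs over it with overlayfs.
    import subprocess

    BUCKET = "example-bucket"           # assumed name
    KEY = "images/rootfs.squashfs"      # assumed key

    def run(*cmd):
        subprocess.run(cmd, check=True)

    run("aws", "s3", "cp", f"s3://{BUCKET}/{KEY}", "/rootfs.squashfs")

    # Read-only base layer from the squashfs image
    run("mkdir", "-p", "/mnt/base", "/mnt/rw", "/mnt/root")
    run("mount", "-t", "squashfs", "-o", "loop,ro", "/rootfs.squashfs", "/mnt/base")

    # Writable upper/work dirs in RAM (overlayfs needs them on the same filesystem)
    run("mount", "-t", "tmpfs", "tmpfs", "/mnt/rw")
    run("mkdir", "-p", "/mnt/rw/upper", "/mnt/rw/work")

    # Combine the layers; /mnt/root becomes the usable root filesystem
    run("mount", "-t", "overlay", "overlay",
        "-o", "lowerdir=/mnt/base,upperdir=/mnt/rw/upper,workdir=/mnt/rw/work",
        "/mnt/root")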

fduran
So I've created ~300k EC2 instances with SadServers, and my experience was that starting an EC2 VM from stopped took ~30 seconds and creating one from an AMI took ~50 seconds.

Recently I decided to actually look at boot times, since I store in the db when the servers are requested and when they become ready, and it turns out for me it's really bimodal: some take about 15-20s and many take about 80s. See graph: https://x.com/sadservers_com/status/1782081065672118367

Pretty baffled by this (same region, pretty much same everything), any idea why? Definitely going to try the trick from this article.

crohr
> while we can boot the Actions runner within 5 seconds of a job starting, it can take GitHub 10+ seconds to actually deliver that job to the runner

This. I went the same route with regards to boot time optimisations for [1] (cleaning up the AMI, cloud-init, etc.), and can boot a VM from cold in 15s (I can't rely on prewarming pools of machines -- even stopped -- since RunsOn doesn't share machines with multiple clients and this would not make sense economically).

But the official runner binary always takes around 8s to load and then get assigned a job by GitHub, which is more than half of the VM boot time :( At some point it would be great if GitHub could give us a leaner runner binary with less legacy stuff, tailored for ephemeral runners (that, or reverse-engineer the protocol).

[1] https://runs-on.com

develatio
Maybe AWS should actually look into this. I know comparing AWS to other (smaller) cloud providers is not totally fair given AWS's size, but creating/booting an instance on Hetzner, for example, takes a few seconds.
mnutt
They talk about the limitations of the EC2 autoscaler and mention calling LaunchInstances themselves, but are there any autoscaler service projects for EC2 ASGs out there? The AWS-provided one is slow (as they mention), annoyingly opaque, and has all kinds of limitations, like not being able to use Warm Pools with multiple instance types, etc.
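(For context, the stock warm pool itself is a one-call setup -- rough boto3 sketch below, with the ASG name and sizes made up -- it's the restrictions around it that bite.)

    # Sketch only: a pool of pre-initialized, stopped instances on an existing ASG.
    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.put_warm_pool(
        AutoScalingGroupName="ci-runners",              # hypothetical ASG name
        PoolState="Stopped",                            # keep warm instances stopped (only their EBS is billed)
        MinSize=10,                                     # always keep at least 10 instances warm
        InstanceReusePolicy={"ReuseOnScaleIn": False},  # terminate on scale-in rather than returning to the pool
    )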
everfrustrated
It's too bad that EBS doesn't natively support Copy-On-Write.

Snapshots are persisted to S3 (transparently to the user), but it means each new EBS volume spawned from one doesn't start at its full IOPS allocation.

I presume this is due to EBS volumes being AZ-specific, so to be able to launch an AMI-seeded EBS volume in any AZ, it needs to go via S3 (which is multi-AZ).

Nextgrid
I don't get why they're using EBS here to begin with. EBS trades off cost and performance for durability. It's slow because it's a network-attached volume that's most likely also replicated under the hood. You use this for data that you need high durability for.

It looks like their use case fetches all the data it needs from the network (in the form of the GH Actions runner getting the job from GitHub, and then pulling down Docker containers, etc.).

What they need is a minimal Linux install (Arch Linux would be good for this) in a squashfs or similar, and the only thing on EBS should be an HTTP-aware boot loader like iPXE, or a kernel+initrd capable of pulling down the squashfs from S3 and running it from memory. Local "scratch space" storage for the build jobs can be provided by the ephemeral NVMe drives, which are also direct-attach and much faster than EBS.
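Setting up that ephemeral scratch space at boot is only a few commands. A rough sketch (it keys off the NVMe model string that EC2 instance-store devices report; the mount point is made up):

    # Hypothetical boot step: find the instance-store NVMe drive, format it, mount it as scratch.
    import json, subprocess

    def run(*cmd, **kw):
        return subprocess.run(cmd, check=True, **kw)

    out = run("lsblk", "-J", "-d", "-o", "NAME,MODEL", capture_output=True, text=True)
    devices = json.loads(out.stdout)["blockdevices"]

    # Instance-store NVMe reports "Amazon EC2 NVMe Instance Storage";
    # EBS volumes report "Amazon Elastic Block Store".
    scratch = [d["name"] for d in devices if "Instance Storage" in (d.get("model") or "")]

    if scratch:
        dev = f"/dev/{scratch[0]}"
        run("mkfs.ext4", "-F", dev)       # ephemeral anyway, so reformat on every boot
        run("mkdir", "-p", "/scratch")
        run("mount", dev, "/scratch")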

solatic
Makes me wonder why Depot isn't moving to on-prem hardware. When you're reselling compute with a better API, you give up a substantial proportion of your profits to the hyperscaler while offering worse performance (due to being held hostage to the hyperscaler's design decisions, like lazy loading root EBS from S3).

Surely an optimized approach here looks something like booting customer CI workloads directly from the hypervisor, using an ISO/squashfs/etc. stored directly on the hypervisor, where the only networked disks are the ones with the customers' BuildKit caches?

maccard
I don't use GHA as some of our code is stored in Perforce, but we've faced the same challenges with EC2 instance startup times on our self-managed runners on a different provider.

We would happily pay someone like Depot for "here's the AMI I want to run & autoscale, can you please do it faster than AWS?"

We hit this problem with containers too - we'd _love_ to just run all our CI on something like Fargate and have it automatically scale and respond to our demand, but the response times and rate limiting are just _so slow_ that we end up starting/stopping instances with a lambda instead, which feels so 2014.

bingemaker
Curious, how do you measure the time taken for those 4 steps listed in the "What takes so long?" section?
uavoperator
This is really only tangentially related to the article, but

>If AWS responds that there is no current capacity for m7a instances, the instance is updated to a backup type (like m7i) and started again

Any ideas why m7i would be chosen as the backup type rather than the other way around? m7a seems to be more expensive than m7i, so maybe there's some performance advantage or something else I'm missing that makes AMD-based instances preferable to Intel ones?
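(For what it's worth, the fallback itself is presumably just catching the capacity error on start and retrying with the backup type. A rough boto3 sketch, not their actual code; the types come from the quote, the .large size is made up:)

    # Hypothetical fallback: try the preferred type, fall back on InsufficientInstanceCapacity.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    def start_with_fallback(instance_id, preferred="m7a.large", backup="m7i.large"):
        for instance_type in (preferred, backup):
            # The instance type can only be changed while the instance is stopped.
            ec2.modify_instance_attribute(
                InstanceId=instance_id,
                InstanceType={"Value": instance_type},
            )
            try:
                ec2.start_instances(InstanceIds=[instance_id])
                return instance_type
            except ClientError as e:
                if e.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                    raise
        raise RuntimeError("no capacity for preferred or backup instance type")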

waiwai933
I believe this is similar to EC2 Fast Launch which is available for Windows AMIs, but I don't know exactly how that works under the hood.

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/win-a...

cmckn
You can enable fast restore on the EBS snapshot that backs your AMI: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-sn...

It’s not cheap, but it speeds things up.
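For reference, it's one API call per snapshot (a sketch; the snapshot ID and AZs are made up):

    # Enable EBS fast snapshot restore on the snapshot backing an AMI.
    import boto3

    ec2 = boto3.client("ec2")
    ec2.enable_fast_snapshot_restores(
        SourceSnapshotIds=["snap-0123456789abcdef0"],    # the snapshot behind your AMI
        AvailabilityZones=["us-east-1a", "us-east-1b"],  # billed per snapshot, per AZ
    )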

suryao
This is a very cool optimization.

I make a similar product offering fast GitHub Actions runners [1], and we've been down this rabbit hole of boot-time optimization.

Eventually, we realized that the best solution is to actually build scale. There are two factors in your favor then: 1) Spikes are less pronounced and the workloads are a lot more predictable. 2) That predictability means you have a decent estimate of the workload to expect at any given time, which is good enough for maintaining an efficient warm pool.

This enables us to simplify the stack and avoid high-maintenance optimizations while delivering a great user experience.

We have some pretty heavy use customers that enable us to do this.

[1] https://www.warpbuild.com

immibis
There's something to be said about building a tower of abstractions and then trying to tear it back down. We used to just run a compiler on a machine. Startup time: 0.001 seconds. Then we'd run a Docker container on a machine. Startup time: 0.01 seconds. Fine, if you need that abstraction. Now apparently we're booting full VMs to run compilers - startup time: 5 seconds. But that's not enough, because we're also allocating a bunch of resources in a distributed network - startup time: 40 seconds.

Do we actually need all this stuff, or does it suffice to get one really powerful server (price less than $40k) and run Docker on it?

paulddraper
> From a billing perspective, AWS does not charge for the EC2 instance itself when stopped, as there's no physical hardware being reserved; a stopped instance is just the configuration that will be used when the instance is started next. Note that you do pay for the root EBS volume though, as it's still consuming storage.

Shutdown standbys are absolutely the way to do it.

Does AWS offer anything for this? It's very tedious to set this up yourself.
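For the record, the manual version of the flow in that quote is only a handful of calls (a rough boto3 sketch; the AMI ID and instance type are made up, and real code would wait for first boot to actually finish, not just for the instance to reach "running"):

    # One-time: launch, let it boot, then stop it so only the EBS volume is billed.
    import boto3

    ec2 = boto3.client("ec2")

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="m7a.large",
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Later, when work shows up: start the pre-warmed standby.
    ec2.start_instances(InstanceIds=[instance_id])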

orf
It seems like you'd want to make your root volume as small as possible, and then at launch time attach an EBS volume from a pre-warmed pool that contains the actual config/data you need?

You can launch a stripped-down distribution with, what, a 200 MB disk? Then attach the “useful” EBS volume and “do stuff” with that - launch a container, or whatever.
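Something like this for the attach step (a sketch; the volume/instance IDs and device name are made up, and it ignores the race where two instances grab the same volume):

    # Hypothetical attach-at-launch: claim an available pre-warmed volume and attach it.
    import boto3

    ec2 = boto3.client("ec2")

    def claim_warm_volume(instance_id, pool_volume_ids):
        for volume_id in pool_volume_ids:
            volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
            if volume["State"] == "available":       # must also be in the instance's AZ
                ec2.attach_volume(
                    VolumeId=volume_id,
                    InstanceId=instance_id,
                    Device="/dev/sdf",               # appears as /dev/xvdf or /dev/nvme1n1 in the guest
                )
                return volume_id
        raise RuntimeError("warm volume pool exhausted")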

nathants
in the us-west-2-lax-1a local zone, i just booted 100 r5.xlarge spot instances as fortnite-like game servers[1]. 1 to be a central server, 99 to be fake players. the server broadcasts 100x write-amplified data from every player to every player. the 101st server is my local pc.

the server broadcasts at 200 MB/s[2]. the whole setup costs me $3-4 usd/hour and by far the slowest part of boot is my game compiling on the central server, whether i store ccache data in s3 or not. i've booted this every day for the last 6 months, to test the game.

if your system can't handle 30s vm boots, your system should improve.

1. https://r2.nathants.workers.dev/ec2_snitch.png

2. https://r2.nathants.workers.dev/ec2_boot.mp4

albert_e
AWS will (should) make this an optional feature.

Often the technology is the easier part.

The difficult part is naming the feature intuitively (adding to an ocean of jargon and documentation) and making the configuration knobs intuitive in both the UI and the CLI/SDK.

Amazon Simple Compute Service :) ?

elchief
I've noticed that Amazon Linux 2023 boots faster than Ubuntu too.
mediumsmart
Very cool. How many seconds of the faster boot time fit into one regular second?
thefaux
Whenever I see the flaws in aws ux, I remember that they bill by the hour.
pid-1
Can you have warm pools of spot instances?
gtirloni
> tl;dr — boot the instance once, shut the instance down, then boot it again when needed

their own tl;dr should be at the top, not the middle of the article :)