The additional benefit is that devs can run all the same stuff on a Linux laptop (or a Linux VM on some other platform) - and everyone can have their own VM in the cloud if they like, to demo or test stuff using all the same setup. Bootstrapping a new system is a matter of checking in their ssh key and running a shell script.
Easy to debug, not complex or expensive, and we could vertically scale it all quite a ways before needing to scale horizontally. It's not for everyone, but seed stage and earlier - totally appropriate imo.
I've been running my SaaS first on a single server, then after getting product-market fit on several servers. These are bare-metal servers (Hetzner). I have no microservices, I don't deal with Kubernetes, but I do run a distributed database.
These bare-metal servers are incredibly powerful compared to virtual machines offered by cloud providers (I actually measured several years back: https://jan.rychter.com/enblog/cloud-server-cpu-performance-...).
All in all, this approach is ridiculously effective: I don't have to deal with the complexity of things like Kubernetes, or with the cascading system errors that inevitably happen in complex systems. I save on development time, maintenance, and my monthly server bills.
The usual mantra is "but how do we scale" — I submit that 1) you don't know yet if you will need to scale, and 2) with those ridiculously powerful computers and reasonable design choices you can get very, very far with just 3-5 servers.
To be clear, I am not advocating that you run your business in your home closet. You still need automation (I use ansible and terraform) to manage your servers.
It's when one starts getting sucked down the "cloud native" wormhole of all these niche open source systems and operators and ambassador and sidecar patterns, etc. that things go wrong. Those are for environments with many independent but interconnecting tech teams with diverse programming language use.
Seriously, I think a lot of people do things the hard way to learn large-scale infrastructure. Another common reason is 'things will be much easier when we scale to a massive number of clients', or 'we can dynamically scale up on demand'.
These are all valid to the people building this, just not as much to founders or professional CTOs.
Should you pick a complex framework from day one? Probably not, unless your team has extensive experience with it.
My objection is towards the idea that managing infrastructure with a bespoke process and custom tooling will always be less effort to maintain than established tooling. It's the idea of stubbornly rejecting the "complexity" bogeyman, even when the process you built yourself is far from simple, and takes a lot of your time from your core product anyway.
Everyone loves the simplicity of copying a binary over to a VPS and restarting a service. But then you want to solve configuration and secret management, and have multiple servers for availability/redundancy, so then you want gradual deployments, load balancing, rollbacks, etc. You probably also want some staging environment, so you need to easily replicate this workflow. Then your team eventually grows and they find that it's impossible to run a prod-like environment locally. And then, and then...
You're forced to solve each new requirement with your own special approach, instead of relying on standard solutions others have figured out for you. It eventually becomes a question of sunk cost: do you want to abandon all this custom tooling you know and understand, in favor of "complexity" you don't? The difficult thing is that the more you invest in it, the harder it will be to migrate away from it.
My suggestion is: start by following practices that will make your transition to standard tooling easier later. This means deploying with containers from day 1, adopting the 12-factor methodology, etc. And when you do start to struggle with some feature you need, switch to established tooling sooner rather than later. You're likely to find that your fear of the unknown was unwarranted, and you'll spend less time working on infra in the long run.
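To make the 12-factor suggestion concrete, here is a minimal sketch of factor III (config lives in the environment), so the same container image runs unchanged on a laptop, in staging, and in prod. It's in Python purely for illustration, and the variable names are placeholders, not anything the methodology prescribes:

```python
import os

def load_config() -> dict:
    """Read all runtime settings from the environment (12-factor, factor III)."""
    return {
        # Required: crash early if this is missing rather than failing later.
        "database_url": os.environ["DATABASE_URL"],
        # Optional settings with sane local defaults.
        "redis_url": os.environ.get("REDIS_URL", "redis://localhost:6379/0"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }

if __name__ == "__main__":
    os.environ.setdefault("DATABASE_URL", "postgres://localhost/dev")  # local demo only
    print(load_config())
```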
These are the only things I have ever been comfortable using in the cloud.
Once you get into FaaS and friends, things get really weird for me. I can't handle not having visibility into the machine running my production environment. Debugging through cloud dashboards is a shit experience. I think Microsoft's approach is closest to actually "working", but it's still really awful and I'd never touch it again.
The ideal architecture for me after 10 years is still a single VM with monolithic codebase talking to local instances of SQLite. The advent of NVMe storage has really put a kick into this one too. Backups handled by snapshotting the block storage device. Transactional durability handled by replicating WAL, if need be.
Dumbass simple. Lets me focus on the business and customer. Because they sure as hell don't care about any of this and wouldn't pay any money for it. All this code & infra is pure downside. You want as little of it as possible.
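For anyone curious what that single-VM-plus-local-SQLite shape looks like, here is a minimal sketch using Python's stdlib sqlite3 - not the commenter's actual setup, just an illustration. It enables WAL mode and takes an online backup; streaming the WAL off-box would be a separate tool (Litestream or similar):

```python
import sqlite3

def open_db(path: str = "app.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")    # readers no longer block the writer
    conn.execute("PRAGMA synchronous=NORMAL;")  # common durability tradeoff in WAL mode
    return conn

def backup(conn: sqlite3.Connection, dest_path: str = "app-backup.db") -> None:
    # SQLite's online backup API copies a consistent snapshot while the
    # database stays live - the lighter-weight cousin of a block snapshot.
    dest = sqlite3.connect(dest_path)
    conn.backup(dest)
    dest.close()

if __name__ == "__main__":
    db = open_db()
    db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    db.commit()
    backup(db)
```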
You may never need to split your monolith! Stripe eventually broke some stuff out of their Rails monolith but it gets you surprisingly far.
You are not going to get anything easier to debug than a Django/Rails/etc monolith.
A bit of foresight on where you want to go with your infra can help you though; I built the first versions of our company as a Django Docker container running on a single VM. Deploy was a manual “docker pull; docker stop; docker start”. This setup got us surprisingly far. Docker is nice here as a way of sidestepping dependency packaging issues, which can be annoying in the early stages (e.g. does my server have the right C header files installed for that new db driver I installed? Setup will be different than on your Mac!)
We eventually moved to k8s after our seed extension, in response to a business need for reliability and scalability; k8s served us well all the way through Series B. Having everything Dockerized made that transition really easy too - but we aggressively minimized complexity in the early stages.
VPS technology has come a very long way and is highly reliable. The disks on the node are set up in RAID 1 and the VM itself can be easily live migrated to another machine for node maintenance. You can take snapshots etc.
For me, the reason to turn to cloud infra would not be greater reliability but collaboration and operational housekeeping features like IAM, secrets management, infra-as-code, etc., or datacenter compliance reasons like HIPAA.
I run a small, bootstrapped startup. We don't have enough money to pay ourselves and I make a living doing consulting on the side. Being budget and time constrained like that I have to be highly selective in what I use.
So, I love things like Google cloud. Our GCP bills are very modest. A few hundred euros per month. I would move to a cheaper provider except I can't really justify the time investment. And I do like Google's UI and tools relative to AWS, which I've used in the past.
I have no use for Kubernetes. Running an empty cluster would be more expensive than our current monthly GCP bills. And since I avoided falling into the micro-services pitfall, I have no need for it either. But I do love Docker. That makes deploying software stupidly easy. Our website is a Google storage bucket that is served via our load balancer and the Google CDN. The same load balancer routes rest calls to two vms that run our monolith. Which talk to a managed DB and managed Elasticsearch and a managed Redis. The DB and Elasticsearch are expensive. But having those managed saves a lot of time and hassle. That just about sums up everything we have. Nice and simple. And not that expensive.
I could move the whole thing to something like Hetzner and cut our bills by 50% or so. Worth doing maybe but not super urgent for me. Losing those managed services would make my life harder. I might have to go back to AWS at some point because some of our customers seem to prefer that. So, there is that as well.
I have had great success with a very simple kube deployment:
- GKE (EKS works well but requires adding an autoscaler tool)
- Grafana + Loki + Prometheus for logs + metrics
- cert-manager for SSL
- nginx-ingress for routing
- external-dns for autosetup DNS
I manage these with helm. I might, one day, get around to using the Prometheus Operator thing, but it doesn't seem to do anything for me except add a layer of hassle.
New deployments of my software roll out nicely. If I need to scale, or cut a branch for testing, I roll into a new namespace easily, with TLS autosetup, DNS autosetup, logging to a GCP bucket... no problem.
I've done the "roll out an easy node and run" thing before, and I regret it, badly, because the back half of the project was wrangling all these stupid little operational things that are a helm install away on k8s.
So if you're doing a startup: roll out a nice simple k8s deployment, don't muck it up with controllers, operators, service meshes, auto cicds, gitops, etc. *KISS*.
If you're trying to spin a number of small products: just use the same cluster with different DNS.
(note: if this seems particularly appealing to you, reach out, I'm happy to talk. This is a very straightforward toolset that has lasted me years and years, and I don't anticipate having to change it much for a while)
But still, no matter what, the odd customer demands they need all these complexities turned on for no discernible reason.
IMO it’s a far better approach with any platform to deploy the minimum and turn things on if you need to as you develop.
Incidentally, I’ve been exposed to “traditional” cloud platforms (Azure, GCP, AWS) through work and tried a few times to use them for personal projects in recent years and get bewildered by the number of toggles in the interface and strange (to me) paradigms. I recently tried Cloudflare Workers as a test of an idea and was surprised how simple it was.
I thought the same thing until recently. Apparently there's a "Docker Swarm version 2" around, and it was the original (version 1) Docker Swarm that was deprecated:
https://docs.docker.com/engine/swarm/
> Do not confuse Docker Swarm mode with Docker Classic Swarm which is no longer actively developed.
I haven't personally tried out the version 2 Docker Swarm yet, but it might be worth a look. :)
I was brought in to help get a full system rewrite across the finish line. Of course the deployment story was pretty great! Lots of automated scripts to get systems running nicely, autoscaling, even a nice CI builder. The works.
After joining, I found out all of this was to the detriment of so much. Nobody was running the full frontend/backend on their machine. There was a team of 5 people but something like 10-15 services. CI was just busted when I joined, and people were constantly merging in things that broke the few tests that were present.
The killer was that because of this sort of division of labor, there'd be constant buck-passing because somebody wasn't "the person" who worked on the other service. But in an alternate universe all of that would be in the same repo. Instead, everything ended up coordinated across three engineers.
A shame, because the operational story that let me easily swap my own machine in for a pod in the test environment was cool! But the brittleness of the overall system was too much for me. Small teams really shouldn't have fiefdoms.
1. It took the end of ZIRP era for people to realize the undue complexity of many fancy tools/frameworks. The shitshow would have continued unabated as long as cheap money was in circulation.
2. Most seasoned engineers know for a fact that any abstractions around the basic blocks like compute, storage, memory and network come with their own leaky parts. That knowledge and wisdom helps them make suitable trade-offs. Those who don't grok them shoot themselves in the foot.
Anecdote on this. A small-sized startup doing B2B SaaS was initially running all their workloads on cheap VPSs, incurring a monthly bill of around $8K. The team of 4 engineers that managed the infrastructure cost about $10K per month. Total cost: $18K. They made a move to the 'cloud native' scene to minimize costs. While the infra costs did come down to about $6K per month, the team needed a new bunch of experts who added about another $5K to the team cost, making the total monthly cost $21K ($6K + $10K + $5K). That, plus a dent to developer velocity and release velocity, along with long windows of uncertainty while debugging complex issues. The original team quit after incurring extreme fatigue, and the team cost alone has now gone up to about $18K per month. All in all, a net loss plus undue burden.
Engineers must be tuned towards understanding the total cost of ownership over a longer period of time in relation to the real dollar value achieved. Unfortunately, that's not a quality quite commonly seen among tech-savvy engineers.
Being tech-savvy is good. Being value-savvy is way better.
You can even run the whole thing locally.
We actually just did a Show HN about it:
Do startups really need complex cloud architecture?
Inspired, I wrote a blog exploring simpler approaches and created a docker-compose template for deployment
Curious to know your thoughts on how you manage your infrastructure. How do you simplify it? How do you balance?
Focus on product market fit (PMF) and keep things as straightforward as possible.
Create a monolith, duplicate code, use a single RDBMS, adopt proven tech instead of the “hot new framework”, etc.
The simpler the code, the easier it is to migrate/scale later on.
Unnecessary complexity is the epitome of solving a problem that doesn’t exist.
Low operational costs are essential for a hardware business if you don't want to burden your customers with an ongoing subscription fee. Otherwise the business turns into some kind of pyramid scheme where you have to sell more and more units in order to keep serving your existing customers.
I have a moral obligation towards my customers to keep running even if the sales stop at some point.
So I always multiply my cost for anything by 10 years, and then decide if I am willing to bear it. If not, then I find another solution.
I sometimes wonder how many of these posts boil down to "I don't want to learn k8s, can I just use this thing I already know?".
My team of 6 engineers has a social app at around 1,000 DAU. The previous stack has several machines serving APIs and several machines handling different background tasks. Our tech lead is forcing everyone to move to separate Lambdas using CDK to handle each of these tasks. The debugging, deployment, and architecting of shared stacks for Lambdas is taking a toll on me -- all in the name of separation of concerns. How (or should) I push back on this?
But an application built in the high pressure environment of a startup also has the risk of becoming unmanageable, one or two years in. And to the extent you already have familiar tools to manage this complexity, I vote for using them. If you can divide and conquer your application complexity into a few different services, and you are already experienced in an appropriate application framework, that may not be such a bad choice. It helps focus on just one part of the application, and have multiple people work on the separate parts without stepping on each other.
I personally don't think that should include k8s. But ECS/Fargate with a simple build pipeline, all for that. "Complex" is the operative word in the article's title.
For new projects that, with luck, will have a couple hundred users at the beginning, it is just overkill (and also very expensive).
My approach is usually Vercel + some AWS/Hetzner instance running the services with docker-compose inside or sometimes even just a system service that starts with the instance. That's just enough. I like to use Vercel when deploying web apps because it is free for this scale and also saves me time with continuous deployment without having to ssh into the instances, fetch the new code and restart the service.
Like all things, there's a good middle ground here-- use managed services where you can but don't over-architect features like availability & scaling. For example, Kubernetes is a heavy abstraction; make sure it's worth it. A lot of these solutions also increase dev cycles, which is not great early on.
Yes. This is the basis of privilege separation and differential rollouts. If you collapse all this down into a single server or even lambda you lose that. Once your service sees load you will want this badly.
> SQS and various background jobs backed by Lambda
Yes. This is the basis of serverless. The failure of one server is no longer a material concern to your operation. Well done you.
> Logs scattered across CloudWatch
Okay. I can't lie. CloudWatch is dogturds. There is no part of the service that is redeemable. I created a DynamoDB table and a library which collects log lines into "task records" and puts them into the table, partitioned by lambda name and sorted by record time. Each lambda can configure the logging environment or use the default, which includes a log entry expiration time. Then I created a command line utility which can query and/or "tail" this table.
This work took me 3 days. It's paid off 1000-fold since I did it. You do sometimes have to roll your own out here. CloudWatch is strictly for logging cold start times now.
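Not the commenter's actual library, but a rough sketch of the same shape in Python with boto3, assuming a hypothetical table named task-logs with partition key lambda_name, numeric sort key record_time, and TTL enabled on expires_at:

```python
import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("task-logs")  # hypothetical table name

def put_record(lambda_name: str, lines: list[str], ttl_days: int = 14) -> None:
    """Write one buffered "task record" worth of log lines."""
    now = time.time()
    table.put_item(Item={
        "lambda_name": lambda_name,                 # partition key
        "record_time": int(now * 1000),             # sort key: ms since epoch
        "lines": lines,                             # the collected log lines
        "expires_at": int(now + ttl_days * 86400),  # DynamoDB TTL attribute
    })

def tail(lambda_name: str, since_ms: int = 0) -> int:
    """Print records newer than since_ms; return the newest timestamp seen."""
    resp = table.query(
        KeyConditionExpression=Key("lambda_name").eq(lambda_name)
        & Key("record_time").gt(since_ms)
    )
    for item in resp["Items"]:
        for line in item["lines"]:
            print(item["record_time"], line)
        since_ms = max(since_ms, int(item["record_time"]))
    return since_ms
```

A "tail -f" style CLI is then just a loop that calls tail() every few seconds with the last timestamp it saw.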
> Could this have been simplified to a single NodeJS container or Python Flask/FastAPI app with Redis for background tasks? Absolutely.
Could this have been simplified into something far more fragile than what is described? Absolutely. Why you'd want this is entirely beyond me.
Downside is it's a one-to-one system. But I just use downsized servers.
Bare minimum, script out the install of your product on a fresh EC2 instance from a stock (and up-to-date) base image, and use that for every new deploy.
Scaling (and relatedly, high availability) are premature optimizations[0] implemented (and authorized) by people hoping for that sweet hockey stick growth, cargo culting practices needed by companies several orders of magnitude larger.
[0] https://blog.senko.net/high-availability-is-premature-optimi...
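In that spirit, here is one hedged sketch of what "script out the install on a fresh EC2 instance from a stock image" can look like with boto3: launch from an up-to-date base AMI and hand the install steps to cloud-init as user data. The AMI id, key name, instance type, and install commands below are all placeholders:

```python
import boto3

# Placeholder install steps; cloud-init runs this once on first boot.
INSTALL_SCRIPT = """#!/bin/bash
set -euo pipefail
apt-get update && apt-get install -y nginx
# ... fetch and install your app's release artifact here ...
"""

def launch(ami_id: str, key_name: str) -> str:
    """Launch a fresh instance from a stock image and install the product on boot."""
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId=ami_id,            # stock, up-to-date base image
        InstanceType="t3.small",   # placeholder size
        KeyName=key_name,
        MinCount=1,
        MaxCount=1,
        UserData=INSTALL_SCRIPT,
    )
    return resp["Instances"][0]["InstanceId"]
```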
I recently attempted to move to a completely static site (just plain HTML/CSS/JS) on Cloudflare Pages, that was previously on a cheap shared webhost.
Getting security headers set up, forcing SSL and www, as well as HSTS, has been a nightmare (and it's still not working).
When on my shared host, this was like 10 lines of config in an .htaccess file before.
Let’s say I’ve got a golang binary locally on my machine, or as an output of github actions.
With Google Cloud Run/Fargate/DigitalOcean I can click about 5 buttons, push a docker image and I’m done, with auto updates, roll backs, logging access from my phone, all straight out of the box, for about $30/mo.
My understanding with Hetzner and co is that I need to SSH in (now I need to keep ssh keys secure and manage access to them) for updates, logs, etc. I need to handle draining connections from the old app to the new one. I need to either manage https in my app, or run behind a reverse proxy that does TLS termination, which I need to manage the SSL certs for myself. This is all stuff that gets in the way of the fact that I just want to write my services and be done with it. Azure will literally install a GitHub Actions workflow that will autodeploy to Azure Container Apps for you, with scoped credentials.
Everyone is building like they are the next Facebook or Google. To be honest, if you get to that point, you will have the money to rebuild the environment. But a startup should go with simple. I miss the days when Rails was king, for just this reason.
The added complexity is overkill. Just keep it simple. Simple to deploy, simple to maintain, simple to test, etc. Sounds silly, but in the long run, it works.
In my time at my current job we've scaled PHP, MySQL and Redis from a couple hundred active users to several hundred thousand concurrent users.
EC2+ELB, RDS (Aurora), ElastiCache. A shell script to build a release tarball. A shell script to deploy it. Everyone goes home on time. In my 12+ years I've only had to work off-hours twice.
People really love adding needless complexity in my experience.
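For a sense of scale, the whole build-and-deploy flow described above fits on a page. The commenter uses shell scripts; this is the same idea sketched in Python purely for illustration, with hostnames, paths, and the service name as placeholders:

```python
import subprocess
import tarfile

HOSTS = ["app1.example.com", "app2.example.com"]  # placeholder hosts

def build(version: str) -> str:
    """Pack the source tree into a release tarball."""
    artifact = f"release-{version}.tar.gz"
    with tarfile.open(artifact, "w:gz") as tar:
        tar.add("app/", arcname="app")  # placeholder source directory
    return artifact

def deploy(artifact: str) -> None:
    """Copy the tarball to each host, unpack it, and restart the service."""
    for host in HOSTS:
        subprocess.run(["scp", artifact, f"{host}:/srv/releases/"], check=True)
        subprocess.run(
            ["ssh", host,
             f"tar -xzf /srv/releases/{artifact} -C /srv/app "
             "&& sudo systemctl restart app"],
            check=True,
        )

if __name__ == "__main__":
    deploy(build("1.2.3"))
```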
- Postgres for everything, including queuing (a minimal worker sketch follows this list)
- Golang or nodejs/TypeScript for the web server
- Raw SQL to talk to Postgres
- Caddy as web server with automatic https certificates
- No docker.
- No k8s.
- No rabbitmq.
- No redis
- No cloud functions or lambdas.
- No ORM.
- No Rails slowing things down.
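As referenced at the top of the list, here is a minimal sketch of "Postgres for everything, including queuing": a jobs table drained with SELECT ... FOR UPDATE SKIP LOCKED so several workers can poll safely. The commenter's stack is Go or nodejs/TypeScript; this sketch uses Python with psycopg2 purely for illustration, and the table layout and DSN are assumptions - the raw SQL is the point:

```python
import psycopg2
from psycopg2.extras import Json

DDL = """
CREATE TABLE IF NOT EXISTS jobs (
    id         BIGSERIAL PRIMARY KEY,
    payload    JSONB NOT NULL,
    done       BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

def claim_one(conn):
    """Atomically claim and finish the oldest unfinished job; return its id or None."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, payload FROM jobs
            WHERE NOT done
            ORDER BY id
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        """)
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return None
        job_id, payload = row
        # ... do the actual work here, inside the same transaction ...
        cur.execute("UPDATE jobs SET done = TRUE WHERE id = %s", (job_id,))
    conn.commit()
    return job_id

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute("INSERT INTO jobs (payload) VALUES (%s)", (Json({"task": "email"}),))
    conn.commit()
    print("claimed job:", claim_one(conn))
```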
Your little startup will become large, and fast.
That hacked together single server is going to bite you way sooner than you think, and the next thing you know you’ll be wasting engineer hours migrating to something else.
Me personally, I’d rather just get it right the first time. And to be honest, all the cloud services out there have turned a complex cloud infrastructure into a quick and easy managed service or two.
E.g., why am I managing a single VPS server when I can manage zero servers with Fargate and spend a few extra bucks per month?
A single server with some basic stuff is great for micro-SaaS or small business type of stuff where frugality is very important. But if we shift the conversations to startups, things change fast.
Part of the reason they weren't successful was that my managers insisted on starting with microservices.
Starting with microservices prevents teams from finding product-market fit that would justify microservices.
What would be a simple yet robust infra for data eng? I haven't thought a lot about it yet, so I am curious whether some of you have any insights.
- A lot of companies and startups can get by with a few modest sized VPSs for their applications
- Cloud providers and other infrastructure managed services can provide a lot value that justifies paying for them.
Want to run bare metal? OK, guess you're running your databases on bare metal. Do you have the DBA skills to do so? I would wager that an astounding number of founders who find themselves intrigued by the low cost of bare metal do not, in fact, have the necessary DBA skills. They just roll the dice on yet another risk in an already highly risky venture.
We’ve tried deploying services on K8s, Lambda/Cloud Run, but in the end, the complexity just didn’t make sense.
I’m sure we could get better performance running our own Compute/EC2 instances, but then we need to manage that.
In reality, there is a strong bias in favor of complex cloud infrastructure:
"We are a modern, native cloud company"
"More people mean (startup/manager/...) is more important"
"Needing an architect for the cloud first CRUD app means higher bills for customers"
"Resume driven development"
"Hype driven development"
... in a real sense, nearly everyone involved benefits from complex cloud infrastructure, where from a technical POV MySQL and PHP/Python/Ruby/Java are the correct choice.
One of the many reasons more senior developers who care for their craft burn out in this field.
One domain, an idea, and an easy-to-use development stack is more than good enough for a bootstrapped as well as a funded startup to locate product-market fit.
Always remember this quote by Reid Hoffman: “If you are not embarrassed by the first version of your product, you’ve launched too late.”
It's simple and can scale to complex if you want. I've had very good experience with it in medium size TS monorepos.
[0]: https://sst.dev
From what I understand, he employs a dedicated system administrator to manage his fleet of VPSs (updates, security and other issues that arise) for 1000s of USD per month.
Cost in my case is not the highest priority: I can spend a month learning the ins and outs of a new tool, or I can spend a few days learning the basics and host a managed version on a cloud provider. The cloud costs for applications at my scale are basically nothing compared to developer costs and time. In combination with LLMs, which know a lot about the APIs of the large cloud providers, this allows me to focus on building a product instead of on maintenance.
Operating a bunch of simple low-level infrastructure yourself is not simpler than buying the capabilities off the shelf.
Man the infrastructure was absolutely massive and so much development effort went into it.
They should have had a single server, a backup server, Caddy, Postgres and nodejs/typescript, and used their development effort getting the application written instead of futzing with AWS ad infinitum and burning money.
But that's the way it is these days - startup founders raise money, find someone to build it and that someone always goes hard on the full AWS shebang and before you know it you spend most of your time programming the machine and not the application and the damn thing has become so complex it takes months to work out what the heck is going on inside the layers of accounts and IAM and policies and hundreds of lambda functions and weird crap.
It’s really brilliant. Sun would have been the one to buy Oracle if they’d figured out how to monetize FactorySingletonFactoryBean by charging by the compute hour and byte transferred for each module of that. That’s what cloud has figured out, and it’s easy to get developers to cargo cult complexity.
And S3. S3 is just a wonderful beast. It's just so hard to get something that is so cheap yet offers practically unlimited bandwidth and worry-free durability. I'd venture to say that it's so successful that we don't have an open-source alternative that matches S3. By that I specifically mean that no open-source solution can truly take advantage of scale: adding a machine will make the entire system more performant, more resilient, more reliable, and cheaper per unit cost. HDFS can't do that because of its limitation on name nodes. Ceph can't do that because of its bottleneck on managing OSD metadata and RGW indices. MinIO can't do that because their hash-based data placement simply can't scale indefinitely, let alone that ListObjects and GetObjects will have to poll all the servers. StorJ can't do that because their satellite and metadata servers are still the bottleneck, and the list goes on.
Of course it highly depends on the skills of the team. In a startup there may be no time to learn how to do infrastructure well. But having an infrastructure expert on the team can significantly improve the time to market and reduce developer burnout and the tech debt growth rate.
The result: it's still up after 5 years. I never looked back after I created the project. I do remember the endless other projects I did that have simply died now because I don't have time to maintain a server. And a server almost always ends up crashing somehow.
Another thing: Pieter Levels has successful small apps that rely more on centralized audiences than on infrastructure. He makes cool money, but it's nowhere near startup-expected levels of money/cash/valuations. He is successful in the indie game, but it would be a mistake to extrapolate that to the VC/Silicon Valley startup game.
1. New technology is bad and has no merit other than for resume.
2. Use old technology that I am comfortable with.
3. Insist that everyone should use old technology.
Really? EC2 instances are waaay overpriced. If you need a specific machine for a relatively short time, sure, you can pick one from the vast choice of available configurations, but if you need one for long-running workloads, you'll be much better off picking one up from Hetzner, by an order of magnitude.
For one of the many examples, see this 5-year old summary (even more true today) by a CEO of a hardware startup:
https://jan.rychter.com/enblog/cloud-server-cpu-performance-...
Most of the crap hitting servers is old exploits targeting popular CMSes.
WAF is useful if you have to filter out traffic and you don’t know what might be exposed on your infra. Like that Wordpress blog that marketing set up 3 years ago and stopped adding posts and no one ever updated it.
vulnerability scanning of your images.
Fargate
RDS
I need it on my resume for every 2 year stint and 2-3 people on the team to vouch for it
You’re saying “hey, let everyone know you worked on a tiny company’s low traffic product and how about you just don’t make half a million a year," all to save the company I work at a little money?
Until companies start interviewing for that, it's a dumb idea. I'm rarely making greenfield projects anywhere, and other devs are also looking for maintainers of complex infrastructure.
99.99% of the time. No.
For many of the "complex" things like lambdas, there are frameworks like Serverless that make managing and deploying them as easy (if not easier, frankly) than static code on a VM.
Also, not every workload scales at the same time; we have seen new things that got very successful and crashed right out of the gate because they could not properly scale up.
I agree that you don't need an over engineered "perfect" infrastructure, but just saying stick it on a VM also seems like it is too far of a swing in the other direction.
That ignores the cost side of running several VMs vs the cost of smaller containers or lambdas that only run when there is actual use.
Plus there is something to be said about easier local development which some things like Serverless and containers give you.
You may not need to set up a full k8s cluster, but if you are going with containers, why would you run your own servers vs sticking the container in something managed like ECS?
> But here's the truth: not every project needs Kubernetes, complex distributed systems, or auto-scaling from day one. Simple infrastructure can often suffice,
When the hell can we be done with these self-compromising losers? Holy shit! Enough! It doesn't save you anything to do less. People flock away from Kubernetes because they can't hack it, because they'd prefer to grow their own far worse, far more unruly monster - a monster no one will ever criticize in public, because it'll be some bespoke, frivolous, home-grown alt-stack no one will bother to write a single paragraph on, and which no one joining will grok, understand, or enjoy.
It's just so dumb. There are all these fools saying, oh my gosh, the emperor has no clothes! It might not be needed! But the alternative is running naked through the woods yourself, inventing entirely novel, unpracticed, and probably far worse means for yourself. I don't know why we keep entertaining and giving positions of privilege to these pointless "you might not need it" takes; they never come with positive plans, and they never acknowledge that what they're advocating is taking TNT to what everyone else is trying to practice and collaborate on. Going it alone and DIY'ing your own novel "you might not need to participate in society" stack is stupid, and these people don't have the self-respect to face up to the scale of the dissent they're calling for. You'd have to be a fool to think you're winning by DIY'ing "less". A travesty.
I currently have distilled, compact Puppet code to create a hardened VM of any size on any provider that can run one or more Docker services, run a Python backend directly, or serve static files. With this I can create a service on a Hetzner VM in 5 minutes, whether the VM has 2 cores or 48 cores, and control the configuration in source-controlled manifests while monitoring configuration compliance with a custom Naemon plugin. A perfectly reproducible process. The startup kids are meanwhile doing snowflakes in the cloud, spending many KEUR per month to have something that is worse than what devops pioneers were able to do in 2017. And the stakeholders are paying for this ship.
I wrote a more structured opinion piece about this, called The Emperor's New Clouds:
https://logical.li/blog/emperors-new-clouds/