Stop Guessing Start Testing

Performance Engineering

Performance engineering is the discipline of designing and building systems with observability and continuous validation so that they achieve high performance, while understanding the trade-offs behind each of those decisions.

Performance Testing

Once the whole system is built, it should be validated and tested in every iteration. This acts as an assurance policy: it gives visibility into what is going to happen in specific scenarios.
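
One way to make that assurance concrete is a gate that compares each iteration's measurements with the previous baseline. The following is a minimal sketch, not something shown in the talk; the `Budget` type, the metric names and the tolerance values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """A performance budget for one metric (hypothetical shape)."""
    metric: str
    baseline: float   # value measured in the previous iteration
    tolerance: float  # allowed relative worsening, e.g. 0.10 means 10 %


def regressions(budgets, measured):
    """Return the names of metrics that got worse than their budget allows.

    Assumes lower is better for every metric (latency, error rate).
    A metric missing from `measured` is also reported, so a test that
    silently stopped running does not pass the gate.
    """
    failed = []
    for budget in budgets:
        value = measured.get(budget.metric)
        limit = budget.baseline * (1 + budget.tolerance)
        if value is None or value > limit:
            failed.append(budget.metric)
    return failed


if __name__ == "__main__":
    budgets = [Budget("p95_ms", 200.0, 0.10), Budget("error_rate", 0.01, 0.0)]
    print(regressions(budgets, {"p95_ms": 230.0, "error_rate": 0.01}))
```

A CI job could fail the build whenever the returned list is not empty.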

Resiliency vs Reliability vs Efficiency

A reliable system is one that works and has very high uptime. Is it performing correctly? Are there no failures? Resiliency, on the other hand, is about recovery once the system breaks. And every system will.
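
The two ideas meet in the standard steady-state availability formula: reliability is about how rarely the system fails (mean time between failures, MTBF) and resiliency is about how quickly it recovers (mean time to repair, MTTR). A small sketch of that arithmetic, with made-up numbers in the usage example:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per 365-day year for a given availability."""
    return (1 - avail) * 365 * 24 * 60


if __name__ == "__main__":
    # Same failure rate, different recovery speed (illustrative values).
    slow_recovery = availability(mtbf_hours=720, mttr_hours=4)
    fast_recovery = availability(mtbf_hours=720, mttr_hours=0.5)
    print(downtime_minutes_per_year(slow_recovery))
    print(downtime_minutes_per_year(fast_recovery))
```

With the failure rate held constant, faster recovery alone raises availability, which is why resiliency matters even for a system that rarely breaks.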

Efficiency is the easy one: it is the cost of hitting the system's reliability metrics.

Performance Engineering Playbook

A guide with a set of rules for achieving a reliable system.

Performance testing types

load - testing the system under usual conditions
stress - finding the breaking point: how much the system can take
endurance - how long the system can sustain the load (targets memory leaks and other problems that build up over time)
scalability - the ability to scale up and down over a longer period as the customer base or the demand changes
spike - how well the system handles sudden surges in load
volume - testing the amount of data that passes through the system
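
Most of these types differ mainly in the shape of the load over time. The sketch below expresses that as a target number of virtual users at a given second of the run. The function name, the default numbers and the exact shapes are assumptions for illustration; volume testing is left out because it varies the amount of data rather than the number of users.

```python
def target_users(kind, t, duration, baseline=100, peak=1000):
    """Target number of virtual users at second `t` of a run of `duration` seconds."""
    if duration <= 0 or not 0 <= t <= duration:
        raise ValueError("t must lie within a run of positive duration")
    if kind in ("load", "endurance"):
        # Usual traffic; endurance is the same shape held for a much longer run.
        return baseline
    if kind == "stress":
        # Ramp linearly from baseline to peak to find where the system gives up.
        return round(baseline + (peak - baseline) * t / duration)
    if kind == "scalability":
        # Grow to the peak by mid-run, then shrink back to the baseline.
        fraction = 1 - abs(2 * t / duration - 1)
        return round(baseline + (peak - baseline) * fraction)
    if kind == "spike":
        # Baseline traffic with a sudden burst in the middle tenth of the run.
        in_burst = 0.45 * duration <= t < 0.55 * duration
        return peak if in_burst else baseline
    raise ValueError(f"unknown test kind: {kind}")


if __name__ == "__main__":
    for kind in ("load", "stress", "scalability", "spike"):
        print(kind, [target_users(kind, t, 100) for t in (0, 25, 50, 75, 100)])
```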

Testing as Discipline

Unit tests form the base, integration tests build on them, and E2E tests sit on top.

It is a discipline across all roles: devs, architects, SREs, QA and managers.

Metrics

Metrics set the baselines and help find the sweet spot of the workload.
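
As a rough illustration of what finding the sweet spot can mean, the sketch below takes latency samples from runs at different load levels and picks the highest load whose p95 latency still meets a limit. The data shape, the nearest-rank percentile and the choice of p95 are assumptions, not something prescribed in the talk.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sweet_spot(runs, p95_limit_ms):
    """Highest tested load whose p95 latency stays within the limit.

    `runs` maps requests per second to a list of latencies in ms.
    Returns None if no run meets the limit.
    """
    within_limit = [
        rps for rps, latencies in runs.items()
        if percentile(latencies, 95) <= p95_limit_ms
    ]
    return max(within_limit) if within_limit else None


if __name__ == "__main__":
    runs = {
        100: [10] * 19 + [50],
        200: [20] * 19 + [80],
        400: [50] * 18 + [500, 900],
    }
    print(sweet_spot(runs, p95_limit_ms=100))
```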

Frameworks

Framework	Language
JMeter	XML
k6	JavaScript
Locust	Python

Distributed Load Testing on AWS

single tenant
open source
fully supported by AWS
traffic from AWS
works as an orchestrator for global traffic testing
integrates with CI/CD
spins up 10k containers in a minute
200 million requests in 6 min
- all requests are split per endpoint in the results
- great for understanding bottlenecks
3.6 billion requests in 5 min
easily flagged as a DDoS attack, unblocking is covered in the docs
MCP server capability
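
To put the demo numbers in perspective, and to show why a per-endpoint split helps with bottlenecks, here is a small sketch. It reads the figures above as 200 million requests in 6 minutes and 3.6 billion requests in 5 minutes. The record shape `(endpoint, latency_ms, ok)` is a hypothetical one chosen for the example and is not the report format of the AWS solution.

```python
from collections import defaultdict


def average_rate(total_requests, minutes):
    """Average requests per second over a run of the given length."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return total_requests / (minutes * 60)


def per_endpoint(records):
    """Group (endpoint, latency_ms, ok) records into one summary per endpoint."""
    totals = defaultdict(lambda: {"count": 0, "errors": 0, "sum_ms": 0.0})
    for endpoint, latency_ms, ok in records:
        entry = totals[endpoint]
        entry["count"] += 1
        entry["sum_ms"] += latency_ms
        if not ok:
            entry["errors"] += 1
    return {
        endpoint: {
            "count": entry["count"],
            "errors": entry["errors"],
            "avg_ms": entry["sum_ms"] / entry["count"],
        }
        for endpoint, entry in totals.items()
    }


if __name__ == "__main__":
    print(average_rate(200_000_000, 6))
    print(average_rate(3_600_000_000, 5))
    sample = [("/cart", 100.0, True), ("/cart", 300.0, False), ("/search", 50.0, True)]
    print(per_endpoint(sample))
```

The larger run averages 12 million requests per second, and a per-endpoint summary like this makes it visible when one slow or failing endpoint hides behind a healthy overall average.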

Karpenter / KEDA - CNCF project

The Karpenter part of the talk split its focus into these 4 categories of problems.

Scaling but slow to respond

Instead of having Prometheus pull metrics by scraping every pod in the cluster, they use a sidecar container in each pod to push them. The metrics become visible much more quickly this way. It supports OTel. One of the keys to responsiveness is setting the limits correctly, based on the right metrics. A GPU node running flat out does not mean you should scale up straight away. Maybe vertically, but that would cause a disruption. A better indicator is the job queue.
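
Scaling on the job queue can be sketched as a simple proportional rule: one replica per so many waiting jobs, clamped to a minimum and a maximum. This is an illustration of the idea only. The function and parameter names are assumptions, and it is not the algorithm of any particular autoscaler.

```python
import math


def desired_replicas(queue_length, jobs_per_replica, min_replicas=0, max_replicas=10):
    """Replica count derived from the backlog instead of GPU utilization."""
    if queue_length < 0 or jobs_per_replica <= 0:
        raise ValueError("queue_length must be >= 0 and jobs_per_replica > 0")
    wanted = math.ceil(queue_length / jobs_per_replica)
    # Clamp so an empty queue can scale to the floor and a flood cannot run away.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for backlog in (0, 3, 23, 500):
        print(backlog, desired_replicas(backlog, jobs_per_replica=5))
```

A busy GPU with an empty queue needs no extra replicas, while a growing queue does, regardless of what the utilization graph says.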

Scaling but expensive

Fewer nodes are better. It is not only the count that matters but the node type as well, which comes down to the NodePool design. All of this can be set in Karpenter using node overlay specs. Reserved capacity is an extra option.

GPU time slicing can be used to run several GPU workloads in parallel on the same GPU.
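
Why the node type matters as much as the node count can be shown with a little arithmetic. The sketch below compares single-type fleets for a fixed number of identical pods. The instance names, sizes and hourly prices are invented, and system overhead and daemonsets are ignored, so treat it as an illustration rather than how Karpenter itself chooses nodes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    cpu: float
    memory_gib: float
    hourly_price: float  # made-up price for the example


def nodes_needed(node, pods, pod_cpu, pod_memory_gib):
    """Number of nodes of this type needed, or None if a pod does not fit."""
    per_node = min(int(node.cpu // pod_cpu), int(node.memory_gib // pod_memory_gib))
    if per_node == 0:
        return None
    return math.ceil(pods / per_node)


def cheapest(node_types, pods, pod_cpu, pod_memory_gib):
    """Return (name, node_count, hourly_cost) of the cheapest single-type fleet."""
    options = []
    for node in node_types:
        count = nodes_needed(node, pods, pod_cpu, pod_memory_gib)
        if count is not None:
            options.append((count * node.hourly_price, count, node.name))
    if not options:
        raise ValueError("no node type can fit a single pod")
    cost, count, name = min(options)
    return name, count, cost


if __name__ == "__main__":
    types = [NodeType("small", 2, 8, 0.10), NodeType("large", 16, 64, 0.60)]
    print(cheapest(types, pods=40, pod_cpu=1, pod_memory_gib=2))
```

In this invented example three large nodes come out cheaper than twenty small ones for the same 40 pods.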

Scaling but slow to start

Several kinds of parallelization are used to speed up the spin-up.

A quote worth mentioning: a container is for code, not models.

Model quantization reduces the model size, and keeping the model in S3 buckets speeds up the image pull even more.
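
The effect of quantization on start-up time is mostly arithmetic: fewer bits per weight means fewer bytes to move before the workload can start. A rough sketch that counts weights only and ignores file format overhead; the 7-billion-parameter model and the 10 Gbit/s link are hypothetical figures, not numbers from the talk.

```python
def model_size_gb(parameters, bits_per_weight):
    """Approximate size of the weights in decimal gigabytes."""
    return parameters * bits_per_weight / 8 / 1e9


def transfer_seconds(size_gb, gbit_per_second):
    """Time to move that many gigabytes over a link of the given speed."""
    if gbit_per_second <= 0:
        raise ValueError("link speed must be positive")
    return size_gb * 8 / gbit_per_second


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = model_size_gb(7e9, bits)
        print(bits, size, transfer_seconds(size, gbit_per_second=10))
```

Going from 16-bit to 4-bit weights cuts the download to a quarter, and keeping those weights out of the container image means the image pull no longer has to carry them at all.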

Scaling but breaks

Everything works, but it breaks at some stage. What now? To stay reliable while nodes are being rebuilt, we can control disruption. This is how node drift is managed. Drift will happen, for example when the control plane is updated.
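
The idea of controlling disruption can be sketched as a budget that caps how many nodes may be replaced at the same time. The budget strings below ("10%" or a plain count) are loosely modelled on how Karpenter NodePool disruption budgets are written. The rounding rule and everything else in this sketch are assumptions, not Karpenter's implementation.

```python
import math


def allowed_disruptions(total_nodes, budget, already_disrupting=0):
    """How many more nodes may be disrupted right now under the budget."""
    if budget.endswith("%"):
        # Assumption: percentages round up, so a small pool can still make progress.
        limit = math.ceil(total_nodes * float(budget[:-1]) / 100)
    else:
        limit = int(budget)
    return max(0, limit - already_disrupting)


def rollout_batches(nodes, budget):
    """Split drifted nodes into batches that are replaced one batch at a time."""
    size = allowed_disruptions(len(nodes), budget)
    if size == 0:
        raise ValueError("the budget blocks all disruption")
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


if __name__ == "__main__":
    drifted = [f"node-{i}" for i in range(25)]
    for batch in rollout_batches(drifted, "10%"):
        print(batch)
```

After a control plane update, all nodes may be drifted at once, and a budget like this turns one big disruptive rebuild into a series of small ones.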

LinkedIn post

As we move forward, global systems are becoming very dynamic and elastic. To help build highly reliable systems that can withstand big spikes on Black Friday or any other high-demand day, Luis Guirigay presented an AWS distributed system for testing at this scale, using global traffic coming from the AWS backbone. In the demo it spun up 10k containers in a minute and generated 200 million requests over a 5-minute run. Another run made 3.6 billion requests in 5 minutes. The result report shows every request per endpoint for better analysis of bottlenecks. A system like this has a significant impact on performance engineering: testing gives assurance and exposes throughput bottlenecks. Thanks to Christian Mendelez we got a little exposure to Karpenter and how it solves some issues in clusters. A cluster can be scaling and still be slow to respond, expensive (inefficient node pool scaling), slow to start (slow cold spin-ups), or break something during the scaling process.

Appreciation goes to Ronan Guilfoyle for organizing this AWS user group session. After the meetup I got a look inside the Amazon engineering building, thanks to my friend and classmate Matteo Mastore, who is an intern at AWS. Thanks for the absolutely stunning evening.
