📖 Tech challenges & software stories
📝 Blog: bytefusion.de | ✍️ Medium: medium.com/@msbreuer
📷 Pixelfed: @mbreuer@pixelfed | 🌍 Mastodon: @[email protected]
Incident timeline 🕒
1️⃣ No requests/limits set 🚫
2️⃣ Too many pods on one worker 🐘
3️⃣ Load ↑ → kubelet timeouts ⏳
4️⃣ Node drops ❌
5️⃣ Pods reschedule… next node dies 🔄
Repeat until chaos complete 💥
Fix: set sane limits & protect your cluster 🛠️
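A toy simulation of the timeline above, as a rough sketch. `Node`, `simulate_cascade` and `schedule_with_requests` are hypothetical names, the capacities are made-up core counts, and "a node dies once its load exceeds capacity" is a deliberate simplification of kubelet timeouts, not how Kubernetes works internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: float                      # hypothetical CPU capacity in cores
    pods: list = field(default_factory=list)
    alive: bool = True

    def load(self) -> float:
        return sum(self.pods)


def simulate_cascade(nodes, pod_loads):
    """No requests set: all pods pile onto one node, it fails, and its
    pods move to the next live node. Returns the names of failed nodes."""
    nodes[0].pods = list(pod_loads)
    failed = []
    while True:
        overloaded = next(
            (n for n in nodes if n.alive and n.load() > n.capacity), None
        )
        if overloaded is None:
            return failed
        overloaded.alive = False
        failed.append(overloaded.name)
        survivors = [n for n in nodes if n.alive]
        if not survivors:
            return failed
        # Nothing tells the scheduler the next node is too small either.
        survivors[0].pods.extend(overloaded.pods)
        overloaded.pods = []


def schedule_with_requests(nodes, pod_requests):
    """With requests: a pod only lands where its request still fits.
    Returns the requests that stay pending."""
    pending = []
    for request in pod_requests:
        target = next(
            (n for n in nodes if n.alive and n.load() + request <= n.capacity),
            None,
        )
        if target is None:
            pending.append(request)      # pending pod, but no dead node
        else:
            target.pods.append(request)
    return pending
```

With three 4-core workers and six 1-core pods, `simulate_cascade` takes down all three workers one after another, while `schedule_with_requests` spreads the same pods and leaves every node within capacity.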
1️⃣ No requests/limits 🚫
2️⃣ Too many pods on one worker 🐘
3️⃣ Load ↑ → kubelet timeouts ⏳
4️⃣ Node drops ❌
5️⃣ Pods reschedule… next node dies 🔄
Repeat until chaos is complete.
𝗙𝗶𝘅: set sane limits 🛠️ — protect your cluster before it plays domino.
You replace a parent class with a subclass ✅
…but the subclass changes expected behavior ❌
Example: Bird 🐦 → Penguin 🐧,
call fly() 🚀 → 💥 crash.
That’s a violation of the Liskov Substitution Principle 🧩 —
always keep the contract intact.
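The Bird/Penguin case as a minimal Python sketch. All class and function names here are illustrative assumptions, and the split into `FlyingBird` is only one possible way to keep the contract intact.

```python
class Bird:
    def fly(self) -> str:
        return "flap flap"


class Penguin(Bird):
    def fly(self) -> str:
        # Breaks the base-class contract: callers of Bird.fly() expect a result.
        raise NotImplementedError("penguins cannot fly")


def launch(bird: Bird) -> str:
    # Written against Bird, so any subclass should be safe to pass in.
    return bird.fly()


# One possible fix: only promise fly() on types that can keep the promise.
class BaseBird:
    def move(self) -> str:
        return "walks"


class FlyingBird(BaseBird):
    def move(self) -> str:
        return self.fly()

    def fly(self) -> str:
        return "flap flap"


class SafePenguin(BaseBird):
    def move(self) -> str:
        return "swims"


def travel(bird: BaseBird) -> str:
    # Every BaseBird subtype honors move(), so substitution is safe.
    return bird.move()
```

`launch(Penguin())` raises, although the type says it is a `Bird`. `travel()` works for every subtype because no subclass has to take back a promise made by its parent.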
Lab Kubernetes cluster collapses.
VMs? Unlimited ♾️ — or so we thought.
Physical hosts? Stuffed beyond capacity 🐷📦.
𝗥𝗲𝘀𝘂𝗹𝘁: CPU/RAM contention → kubelet timeouts ⏳ → workers drop ❌ → pods scramble → more drops 🔄.
𝗟𝗲𝘀𝘀𝗼𝗻: Overprovisioning in the lab can kill your clu
Complex business-logic SQL view slows to a crawl 🐢.
Daily reports die ❌. Legacy app, no dev team 🚫, no way to scale the DB vertically.
Execution plan fine ✅ — but session cache too small 📦.
𝗙𝗶𝘅: +few MB cache → instant magic ✨.
Legacy app, no dev team 🚫, DB can’t 𝘀𝗰𝗮𝗹𝗲 𝘃𝗲𝗿𝘁𝗶𝗰𝗮𝗹𝗹𝘆.
𝗘𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻 𝗽𝗹𝗮𝗻 looks fine ✅ — but the DB session cache is too small 📦.
𝗙𝗶𝘅: just a few MB more cache → massive speed-up ✨.
In SQL, small tweaks can make huge dif
Swap a parent class for a subclass ✅ — but it changes the expected behavior ❌.
Example: Bird 🐦 → Penguin 🐧, call fly() 🚀 → 💥 crash.
That violates the Liskov Substitution Principle 🧩: subtypes must honor the contract of their base type. Break it, and your code breaks too.
📒 Wiki → easy but a graveyard without rules
📂 SharePoint → versioning, but weak vs. SCM
📝 Git Markdown → great for devs, tough for PMs
📄 PDF/Word → shareable, but outdated fast
📊 Diagram tools → powerful, but niche
No pe
– 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: use less CPU so business logic isn’t slowed down
– 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆: logs can arrive later, e.g. after traffic peaks
– 𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆: encrypt transport, protect sensitive data
– 𝗠𝗮𝗶𝗻𝘁𝗮𝗶𝗻𝗮𝗯𝗶𝗹𝗶𝘁𝘆: easy config, painless upgrades
Quality isn’t just for product
𝗧𝗼𝗼 𝘀𝗺𝗮𝗹𝗹 → overbooked nodes.
𝗧𝗼𝗼 𝗯𝗶𝗴 → wasted resources.
𝗡𝗼 𝗿𝗲𝗾𝘂𝗲𝘀𝘁𝘀 = tiny defaults, risking instability.
Right-sizing ensures fair scheduling & efficient clusters.
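A back-of-the-envelope sketch of why request size matters, assuming the scheduler packs a node purely by requests. The function names and all numbers (a 4000m node, pods that really use 500m) are made up for illustration.

```python
def pods_that_fit(node_allocatable_m: int, request_m: int) -> int:
    """Pods the scheduler would place on one node, judging by requests alone."""
    return node_allocatable_m // request_m


def node_utilization(node_allocatable_m: int, request_m: int,
                     actual_usage_m: int) -> float:
    """Real CPU demand on the node divided by what the node can deliver.
    Above 1.0 the node is overbooked; far below 1.0 capacity is wasted."""
    pods = pods_that_fit(node_allocatable_m, request_m)
    return pods * actual_usage_m / node_allocatable_m


# Each pod really uses 500m on a 4000m node (hypothetical values).
too_small = node_utilization(4000, 100, 500)    # 40 pods scheduled
too_big = node_utilization(4000, 2000, 500)     # 2 pods scheduled
right_sized = node_utilization(4000, 500, 500)  # 8 pods scheduled
```

With these numbers a 100m request overbooks the node five times over, a 2000m request leaves three quarters of it idle, and a request that matches real usage fills it exactly. In practice you would leave some headroom.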
🗂 acts like a queue (resume later),
⚡ bulk writes > single updates,
🔄 data can be re-processed,
🛡️ resilient to outages.
Simple, robust, efficient logging.
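A minimal sketch of the queue idea, assuming the post is about writing logs to a local file and shipping them later. `ship_batches`, the offset file and the `send` callback are hypothetical, not a real log shipper's API.

```python
import os


def ship_batches(log_path, offset_path, send, batch_size=100):
    """Read lines from log_path starting at the saved offset, hand them to
    send() in batches, and persist the offset only after a batch succeeded."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)

    # Binary mode keeps seek()/tell() byte-exact.
    with open(log_path, "rb") as log:
        log.seek(offset)
        while True:
            batch = []
            for _ in range(batch_size):
                line = log.readline()
                if not line:
                    break
                batch.append(line.decode("utf-8").rstrip("\n"))
            if not batch:
                return
            send(batch)  # may raise; the offset then stays at the last good batch
            offset = log.tell()
            with open(offset_path, "w") as f:
                f.write(str(offset))
```

If `send` fails during an outage, the next run resumes at the last committed offset. Deleting the offset file re-processes everything from the start. The sketch assumes each log line is complete (newline-terminated) when it is read.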
Heap space → objects don’t fit.
Non-heap → stacks, threads, metaspace, direct buffers.
OS OOM → kernel kills JVM when RAM is gone.
👉 Not all OOMs are equal.
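A small triage helper, sketched in Python, that sorts a log line into the three buckets above. `classify_oom` is a hypothetical name, and the marker list covers only the common message texts, so treat it as a starting point rather than a complete mapping.

```python
HEAP_MARKERS = ("Java heap space", "GC overhead limit exceeded")
NON_HEAP_MARKERS = ("Metaspace", "Direct buffer memory", "native thread")


def classify_oom(line: str) -> str:
    """Return 'os', 'heap', 'non-heap' or 'unknown' for a single log line."""
    # Kernel OOM killer messages come from the kernel log, not from the JVM.
    if "out of memory: kill" in line.lower():
        return "os"
    if "java.lang.OutOfMemoryError" not in line:
        return "unknown"
    if any(marker in line for marker in HEAP_MARKERS):
        return "heap"
    if any(marker in line for marker in NON_HEAP_MARKERS):
        return "non-heap"
    return "unknown"
```

The bucket decides where to look next: heap dump and `-Xmx` for heap, metaspace, thread or direct-buffer limits for non-heap, and container or host memory for an OS kill.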
👉 Beginners need rules — everything feels equally important.
🎯 Experts act intuitively — they focus on what matters and ignore the rest.
From rules to pattern recognition: that’s the path to real expertise. ✨
Not always! Timeouts are often just a symptom:
- overprovisioned hosts 🖥️
- Kubernetes limits ⚙️
- Java garbage collection ♻️
…or all of them combined.
The root cause usually lies deeper — not just “the network.” 🚨
👉 Troubleshooting steps:
1️⃣ Check network (logs, policies, TCP)
2️⃣ Check platform (K8s limits, node metrics)
3️⃣ Check app (GC logs, thread dumps)
4️⃣ Correlate everything for the big picture
Only then will you uncover the real cause.
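Step 4 could look roughly like this: line up each timeout with events from the other layers that happened around the same time. `correlate`, the event names and the 5-second window are assumptions for illustration. Real input would come from your own network, platform and GC logs.

```python
from datetime import datetime, timedelta


def correlate(timeouts, events, window=timedelta(seconds=5)):
    """For each timeout timestamp, list the event sources seen within
    +/- window. events is an iterable of (source, timestamp) pairs."""
    result = {}
    for t in timeouts:
        nearby = {source for source, ts in events if abs(ts - t) <= window}
        result[t] = sorted(nearby)
    return result


# Hypothetical data: one timeout, three events from different layers.
t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    ("gc_pause", t0 - timedelta(seconds=2)),
    ("cpu_throttling", t0 + timedelta(seconds=1)),
    ("network_policy_drop", t0 + timedelta(seconds=60)),
]
suspects = correlate([t0], events)
```

Here the timeout lines up with a GC pause and CPU throttling, while the network event is a minute away. That is a hint, not proof, but it tells you which layer to dig into first.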
– Efficiency: low CPU → business logic stays fast
– Reliability: logs may arrive late, after peaks
– Security: encrypt sensitive data
– Maintainability: simple config & upgrades
With AI we can code texts the way we build programs: break complex documents into small, consistent units and assemble them into a whole. Like scenes in a novel → chapters → a book. Tools like Cursor.AI make this modular writing workflow smooth and powerful.
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery