Expensive Lessons: My Biggest Homelab Mistakes and How to Avoid Them

August 22, 2025 · Infrastructure · Security · Learning

Two years of self-hosted infrastructure taught me more through painful mistakes than any course ever could. From accidentally creating a public DNS resolver to hardware failures that corrupted weeks of work, here are the expensive lessons that shaped my approach to homelab security and operations.


What followed was a series of embarrassing, frustrating, and sometimes costly mistakes that taught me more about security, hardware planning, and systems administration than any book or course ever could. Here are some of the biggest blunders from my homelab journey (though there are probably a few I have repressed and forgotten), and hopefully you can learn from my pain without having to experience it yourself.

The Great DNS Disaster: When Pi-hole Became a Public Service

The Cause

In August 2023, excited about my new Pi-hole deployment, I made a configuration error that would haunt me for days. An old port forward from a retired game server was still open on my router, now exposing port 53 (Pi-hole’s primary DNS port), and I had misconfigured Pi-hole’s DHCP settings. Without realizing it, I had created an open DNS resolver accessible to the entire internet.

For 48 hours, my humble homelab became an unwitting participant in DNS amplification attacks. My network traffic spiked as bad actors around the world discovered the misconfigured server and began hammering it with queries and login attempts, trying to use it for their own malicious purposes.

The Wake-Up Call

The first sign something was wrong came when Pi-hole started randomly shutting down. At first I thought it was just a container issue, but when I checked the logs, I found something much worse: DNS requests from unknown IP addresses and a steady stream of login attempts.

Thousands of DNS queries per minute were arriving from European IP addresses I’d never seen before. Pi-hole was maxing out its RAM and CPU allocation trying to handle the flood; ironically, those resource caps are what saved me, because the container crashed instead of serving the traffic indefinitely. These weren’t legitimate queries from the usual telemetry-heavy services like Amazon or Google, but hundreds of thousands of requests from unnamed IP addresses and DNS servers with clearly nefarious purposes.

I had accidentally created a public DNS resolver, and cybercriminals were actively trying to exploit it and gain access.

That’s when I realized the scope of my mistake.

The Damage

The Recovery

Fixing this required immediate action:

  1. Emergency Shutdown: Powered off the Pi-hole container immediately
  2. Port Audit: Scanned all open ports on my router using nmap from an external connection
  3. Log File Dumps: Checked the logs, alongside the configuration files Pi-hole operates on, to confirm which ports had been exposed
  4. Configuration Rebuild: Completely rebuilt Pi-hole with proper internal-only settings, new port assignments, and no DHCP (a rough sketch follows below)
  5. Rate Limiting: Added FTLCONF_RATE_LIMIT=20000/60 to prevent future abuse
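For reference, a minimal docker-compose sketch of the rebuilt, internal-only setup might look something like the block below. The LAN IP, web UI port, and timezone are illustrative placeholders rather than my exact values, and the real fix also included removing the stale port forward on the router.

# Sketch: publish Pi-hole only on the host's LAN address (values are illustrative)
services:
  pihole:
    image: pihole/pihole:latest
    ports:
      # Never publish DNS on 0.0.0.0 when the router might be forwarding it
      - "192.168.1.10:53:53/tcp"
      - "192.168.1.10:53:53/udp"
      - "192.168.1.10:8080:80/tcp"   # web UI, internal only
    environment:
      TZ: "America/New_York"          # placeholder timezone
      FTLCONF_RATE_LIMIT: "20000/60"  # per-client query cap mentioned above
    restart: unless-stopped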

Lessons Learned

Hardware Planning Failures: The AMD 6900XT AI Disappointment

The Assumption

When I decided to add local AI capabilities to my homelab, I assumed my AMD Radeon RX 6900XT would handle machine learning workloads adequately. After all, it was a high-end graphics card with 16GB of VRAM, which surely was more than enough for running local LLMs.

Regrettably, my expensive investment wasn’t up to the task.

Consequential Reality

Relative to NVIDIA, AMD always fell short in one key area: dedicated AI (tensor) cores. The consequences cascaded from there:

Context Window Limitations: The Dolphin2 model I deployed could barely handle 2,000 tokens, far less than the 8,000+ I needed for useful applications.

Driver Nightmare: AMD’s ROCm support was virtually non-existent for consumer cards. I spent weeks trying to get proper drivers working, only to achieve mediocre performance.

Power Efficiency: The card consumed 300W to deliver performance worse than a $20/month cloud GPU instance, with far weaker token throughput than many NVIDIA options I found benchmarked online.

The Workaround

Rather than admit defeat, I developed a hybrid approach:

Lessons Learned

NAS Storage vs Local Storage: Learning the Hard Way

The Mistake

When I first started deploying Docker services, I wasn’t careful about where different types of data should live. I had some container configurations stored on my NAS storage while keeping active databases and frequently accessed files scattered between local SSD and network storage.

This mixed approach created performance and organizational issues I didn’t immediately understand, mostly volume-mapping problems and a bloated Linux directory structure, with some folders that looked like duplicates until I discovered they were empty.

The Performance Issues

Network Bottlenecks: Some services accessing config files over SMB were slower than expected during startup.

I/O Confusion: I didn’t understand which data belonged on fast local storage versus network storage, having only ever dealt with local disks in desktop machines.

Backup Complexity: Having data spread across different storage tiers made backup planning more complicated.

The Learning Process

Through trial and error, I developed a better storage strategy:

  1. Local SSD: Active databases, container configs, and frequently accessed data
  2. NAS Storage: Large media files, backups, and archive data that benefits from redundancy
  3. Clear Separation: Distinct mount points for different data types

Better Architecture

The improved setup made more sense:
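As a rough illustration of that split (the service name and paths below are placeholders rather than my literal layout), a typical container’s volume mapping ended up looking something like this:

# Illustrative tiering for a hypothetical Jellyfin container
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    volumes:
      # Local SSD: config, databases, frequently accessed state
      - /root/jellyfin/config:/config
      # NAS: large media files that benefit from redundancy
      - /mnt/center/media:/media
      # NAS: backup target on the redundant tier
      - /mnt/center/backups/jellyfin:/backups
    restart: unless-stopped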

Lessons Learned

GPU Passthrough: Simpler Than Expected

The Initial Confusion

Getting hardware acceleration working for Jellyfin seemed like it would be complicated. I spent way too much time reading complex tutorials about GPU passthrough and device mapping.

Turns out it was much simpler than I thought.

The Simple Solution

The actual process was straightforward once I understood the basics:

  1. Find the render group ID: Quick check showed it was 104 on my system
  2. Identify the device: Intel UHD 630 was at /dev/dri/renderD128
  3. Use Proxmox GUI: Just added the device mapping through the web interface
  4. Set group permissions: Added the container to render group 104

That was it. No complex configurations or deep system modifications needed, although I did spend too much time on the LXC configuration file.
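For anyone attempting the same thing, the relevant part of that LXC configuration file boils down to a few lines. This is a sketch rather than my exact file: the container ID is a placeholder, and newer Proxmox releases can add an equivalent device entry for you through the GUI.

# /etc/pve/lxc/101.conf  (101 is a placeholder container ID)
# Allow access to the DRM devices (major number 226)
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
# Bind-mount the Intel UHD 630 render node into the container
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file

Inside the container, the media server user then just needs membership in the render group (GID 104 on my system) to use the device.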

The Real Challenge: Storage Permissions

The actual permission nightmare wasn’t GPU related at all. It was getting proper access to my TrueNAS storage from LXC containers.

The Problem: Proxmox root permissions blocked /mnt/center access from my unprivileged container, even when I tried SMB directly; only privileged LXC containers can mount CIFS shares themselves, so I had to come up with a creative solution.

The Solution: After digging through forums, I learned to create matching /mnt/center directory structures on both the datacenter node and container, then configure mount points properly through the Proxmox GUI.

What Actually Worked
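Roughly: mount the TrueNAS share at /mnt/center on the Proxmox host itself, then hand it to the unprivileged container as a bind mount point. The container ID below is a placeholder; I set the mount point up through the GUI, but editing the config file or using pct set achieves the same result.

# /etc/pve/lxc/101.conf  (placeholder container ID)
# /mnt/center exists on both the host and inside the container; the host
# mounts the TrueNAS share first, then bind-mounts it into the container
mp0: /mnt/center,mp=/mnt/center
# Equivalent CLI: pct set 101 -mp0 /mnt/center,mp=/mnt/center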

Lessons Learned

Volume Pathing Chaos: The Docker Mount Point Maze

The Mistake

As my Docker infrastructure grew, I developed an inconsistent approach to volume mounting. Some containers used relative paths, others absolute paths. Some mounted individual directories, others entire filesystems. On top of that, services I had retired often left stray files and mount points behind.

This organic growth created a maintenance nightmare, as it does to pretty much any infrastructure that grows organically.

The Symptoms

Broken Deployments: New containers failed to start due to missing mount points.

Backup Inconsistencies: Some data was backed up multiple times, other critical data was missed entirely.

Permission Conflicts: The same directory mounted in multiple containers with different ownership.

Recovery Difficulties: Restoring from backups required remembering dozens of different path configurations.

The Standardization Solution

I developed a consistent volume mapping strategy:

# Standardized Volume Structure
volumes:
  # Configuration
  - /root/[service]/config:/config
  
  # Data Storage
  - /mnt/center/[service]:/data
  
  # Shared Resources
  - /mnt/center:/mnt/center

Benefits of Standardization

Lessons Learned

Family Network Compatibility: The Ad-Blocking Rebellion

The Mistake

Excited about Pi-hole’s ad-blocking capabilities, I configured aggressive blocklists without considering their impact on my family’s daily internet usage. More blocking equals better security, which surely also means a better experience, right?

Naturally, they thought otherwise.

The User Revolt

Within days, complaints started pouring in:

Smart TV Breakdown: Our LG WebOS TV couldn’t access streaming services due to blocked tracking domains.

Amazon Prime Issues: Prime Video required specific telemetry domains that my blocklists had eliminated, leaving my parents on an Amazon logo screen.

Mobile App Failures: Netflix mobile apps and Facebook would not work.

The Balancing Act

Finding the right balance required systematic testing:

  1. Baseline Lists: Started with conservative, well-maintained blocklists
  2. Whitelist Strategy: Added exceptions based on user reports (see the example after this list)
  3. Device-Specific Rules: Created different filtering levels for different devices
  4. Regular Reviews: Monthly audits of blocked domains and user feedback
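On Pi-hole v5, adding those exceptions is a one-liner per domain. The domains below are examples of the kind of entries that ended up whitelisted for streaming devices, not an exhaustive or exact list:

# Example whitelist entries (illustrative, not my complete list)
pihole -w appboot.netflix.com        # Netflix app bootstrapping
pihole -w api-global.netflix.com     # Netflix API
pihole -w atv-ps.amazon.com          # Prime Video playback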

The Privacy vs. Functionality Lesson

This experience highlighted an important truth about network security: Perfect security that prevents normal usage is effectively useless.

The goal shifted from maximum blocking to intelligent blocking that:

Lessons Learned

The Non-ECC RAM Reality Check

The Mistake

Building my homelab on consumer hardware seemed like a smart budget decision. Since the 9900K and 32GB of RAM were reused from a previous desktop build, it felt like a no-brainer.

However, as my homelab has grown, I’ve started to have reservations about that decision, because my RAM is not ECC (Error-Correcting Code) memory. Strangely, Intel does not support ECC on the 9900K or its consumer chipsets, unlike many of AMD’s consumer platforms.

Hidden Costs

Enhanced Monitoring Required: Without ECC error correction, I had to implement extensive monitoring to catch data corruption early.

Frequent Restarts: Consumer RAM requires more regular system restarts to prevent accumulated errors.

Backup Paranoia: The possibility of silent data corruption necessitated more frequent and comprehensive backups.

Future Upgrade Pressure: Every storage or compute expansion reminded me of the ECC limitation and made me question how I was allocating my upgrade budget.

The Mitigation Strategy

While I couldn’t add ECC support to existing hardware, I developed protective measures:

SMART Monitoring: Regular disk health checks to catch failing drives early.

Mirror Configuration: All critical data stored redundantly across multiple drives.

Automated Backups: Daily incremental backups with multiple retention periods.

Regular Verification: Periodic filesystem checks and data integrity validation (a simple sketch follows below).
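None of this requires heavy tooling. Assuming smartmontools on the host and ZFS on the storage pool (the device and pool names below are placeholders), the periodic checks can be as simple as:

# Run from cron or a systemd timer; device and pool names are placeholders
smartctl -H /dev/sda      # quick SMART health verdict for a drive
smartctl -A /dev/sda      # attribute table (reallocated/pending sectors, etc.)
zpool scrub tank          # ZFS: re-verify checksums across the mirror
zpool status -x           # reports "all pools are healthy" or error details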

The Future Solution

My roadmap now includes:

Lessons Learned

The Path Forward: Lessons Become Wisdom

Each of these mistakes taught me valuable lessons that improved my infrastructure:

Security Mindset

Hardware Planning

Operational Excellence

Conclusion: Embrace the Learning Journey

These mistakes were frustrating, time-consuming, and sometimes embarrassing. They also provided the most valuable learning experiences of my homelab journey. Each failure taught me more about proper system administration than success ever could.

If you’re beginning your own homelab journey, remember:

The goal isn’t to avoid all mistakes, but rather to make new ones, learn from them, and build better systems. My infrastructure today is more secure, more reliable, and more useful because of every error along the way. The process taught me to research holistically and to be patient when I face problems.

A lab is not a lab unless you experience both failure and success.