Expensive Lessons: My Biggest Homelab Mistakes and How to Avoid Them
Two years of self-hosted infrastructure taught me more through painful mistakes than any course ever could. From accidentally creating a public DNS resolver to hardware failures that corrupted weeks of work, here are the expensive lessons that shaped my approach to homelab security and operations.
What followed was a series of embarrassing, frustrating, and sometimes costly mistakes that taught me more about security, hardware planning, and systems administration than any book or course ever could. Here are some of the biggest blunders from my homelab journey (though there are probably a few I've repressed and forgotten), and hopefully you can learn from my pain without having to experience it yourself.
The Great DNS Disaster: When Pi-hole Became a Public Service
The Cause
In August 2023, excited about my new Pi-hole deployment, I made a configuration error that would haunt me for days. I still had an old game server port forward open on port 53 (the primary DNS port Pi-hole listens on) and had misconfigured Pi-hole's DHCP settings. Without realizing it, I had created an open DNS resolver accessible to the entire internet.
For 48 hours, my humble homelab became an unwitting participant in DNS amplification attacks. My network traffic spiked as bad actors worldwide discovered the misconfigured server and began flooding it with requests and login attempts, trying to use it for malicious purposes.
The Wake-Up Call
The first sign something was wrong came when Pi-hole started randomly shutting down. At first I thought it was just a container issue, but when I checked the logs, I found something much worse: requests from unknown IP addresses and repeated login attempts.
Thousands of DNS queries per minute were arriving from European IP addresses I'd never seen before. Pi-hole was maxing out its RAM and CPU allocation trying to handle the flood, to the point where the resource caps actually saved me: the container simply crashed instead of serving the traffic. These weren't legitimate queries from common telemetry-heavy services like Amazon or Google; they were hundreds of thousands of requests from unnamed IP addresses and DNS servers with nefarious purposes.
I had accidentally created a public DNS resolver that cybercriminals were actively trying to exploit and gain access to.
That’s when I realized the scope of my mistake.
The Damage
- No Internet: Pi-hole kept crashing, which killed DNS resolution for my entire network, although I had fallback DNS configured through my router
- Security Exposure: Bad actors were actively probing my DNS server for vulnerabilities; luckily I had a strong 22-character password with numbers and symbols
- Resource Exhaustion: Pi-hole was maxing out CPU and RAM trying to handle the malicious traffic, although the resource caps meant it simply crashed
- Frustrating Debugging: It took me far too long to figure out what was actually happening
The Recovery
Fixing this required immediate action:
- Emergency Shutdown: Powered off the Pi-hole container immediately
- Port Audit: Scanned all open ports using nmap from an external connection on my router
- Log Review: Checked the logs to see which ports were exposed, alongside the configuration files Pi-hole operates on
- Configuration Rebuild: Completely rebuilt Pi-hole with internal-only settings, new port mappings, and no DHCP
- Rate Limiting: Added FTLCONF_RATE_LIMIT=20000/60 to prevent future abuse (a sketch of the rebuilt, rate-limited container is below)
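For reference, here is a rough sketch of the external port check and the rebuilt container. The IP addresses, port bindings, and resource limits are illustrative placeholders rather than my exact values, and FTLCONF_RATE_LIMIT is the same setting mentioned above (newer Pi-hole releases may name it differently).

```bash
# From OUTSIDE the network: is port 53 reachable, and does it answer queries?
sudo nmap -Pn -sT -sU -p 53 your.public.ip.here
dig +short @your.public.ip.here example.com   # any answer means an open resolver

# Rebuilt Pi-hole: bound to the LAN interface only, resource-capped, rate-limited
docker run -d --name pihole \
  --memory=512m --cpus=1 \
  -p 192.168.1.10:53:53/tcp -p 192.168.1.10:53:53/udp \
  -p 192.168.1.10:8080:80/tcp \
  -e TZ=America/New_York \
  -e FTLCONF_RATE_LIMIT=20000/60 \
  pihole/pihole:latest
```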
Lessons Learned
- Security by Default: Always configure services with minimal permissions first, and set up fallback DNS servers; many routers have the feature
- Port Hygiene: Regularly audit open ports and close anything unnecessary
- External Testing: Test your configurations from outside your network
- Monitoring is Critical: Set up proper logging and alerting from day one, and always apply rate limits and resource limits to your containers
Hardware Planning Failures: The AMD 6900XT AI Disappointment
The Assumption
When I decided to add local AI capabilities to my homelab, I assumed my AMD Radeon RX 6900XT would handle machine learning workloads adequately. After all, it was a high-end graphics card with 16GB of VRAM, which surely was more than enough for running local LLMs.
Regrettably, my expensive investment wasn’t up to the task.
Consequential Reality
Relative to NVIDIA, AMD fell short in one key area for this workload: no dedicated AI tensor cores. That gap showed up everywhere:
Context Window Limitations: The Dolphin2 model I deployed could barely handle 2,000 tokens, far less than the 8,000+ I needed for useful applications.
Driver Nightmare: AMD’s ROCm support was virtually non-existent for consumer cards. I spent weeks trying to get proper drivers working, only to achieve mediocre performance.
Power Efficiency: The card consumed 300W to deliver performance worse than a $20/month cloud GPU instance, with far weaker token throughput than many NVIDIA options I found benchmarked online.
The Workaround
Rather than admit defeat, I developed a hybrid approach (a rough routing sketch follows this list):
- Local Processing: Basic calculations and simple queries on the AMD card
- Cloud Offloading: Complex reasoning and long-context work sent to OpenAI or SillyTavern APIs
- Mobile Extension: Ran DeepSeek's 8B model on my Samsung S24 Ultra via Termux for basic mobile AI
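Here is a minimal sketch of that routing logic, assuming Ollama serves the local model and an OpenAI-compatible endpoint handles the long-context work; the model names, token threshold, and characters-per-token estimate are illustrative assumptions, not my exact setup.

```bash
#!/usr/bin/env bash
# Route a prompt: short ones to the local GPU, long ones to a cloud API.
PROMPT="$1"
EST_TOKENS=$(( ${#PROMPT} / 4 ))   # rough estimate: ~4 characters per token

if [ "$EST_TOKENS" -lt 1500 ]; then
  # Fits within the small local context window
  ollama run dolphin-mistral "$PROMPT"
else
  # Long-context work goes to a cloud endpoint instead
  curl -s https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$PROMPT" \
          '{model: "gpt-4o-mini", messages: [{role: "user", content: $p}]}')" \
    | jq -r '.choices[0].message.content'
fi
```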
Lessons Learned
- Research Before Buying: Gaming hardware rarely translates to productivity workloads
- Plan for NVIDIA: For AI/ML work, NVIDIA’s CUDA ecosystem is essentially mandatory
- Cloud First: For hobbyists, renting GPU time from large data center providers often delivers better value than buying local hardware
- Hybrid Architectures: Combining local and cloud resources can be more effective than either alone
NAS Storage vs Local Storage: Learning the Hard Way
The Mistake
When I first started deploying Docker services, I wasn’t careful about where different types of data should live. I had some container configurations stored on my NAS storage while keeping active databases and frequently accessed files scattered between local SSD and network storage.
This mixed approach created performance issues I didn't immediately understand, mostly related to volume mapping problems and a bloated Linux directory structure, with some folders that looked like duplicates until I discovered they were empty.
The Performance Issues
Network Bottlenecks: Some services accessing config files over SMB were slower than expected during startup.
I/O Confusion: I didn't understand which data belonged on fast local storage versus network storage, having only ever dealt with local storage in desktop computers.
Backup Complexity: Having data spread across different storage tiers made backup planning more complicated.
The Learning Process
Through trial and error, I developed a better storage strategy (a minimal mount sketch follows this list):
- Local SSD: Active databases, container configs, and frequently accessed data
- NAS Storage: Large media files, backups, and archive data that benefits from redundancy
- Clear Separation: Distinct mount points for different data types
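As a rough illustration of that split (device names, share names, and mount points here are hypothetical, not my actual layout):

```bash
# Fast local SSD for hot data: databases and container configs
mount -o noatime /dev/nvme0n1p1 /opt/appdata

# NAS share over SMB/CIFS for media, backups, and archives
mount -t cifs //truenas.lan/archive /mnt/nas/archive \
  -o credentials=/root/.smbcredentials,iocharset=utf8
```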
Better Architecture
The improved setup made more sense:
- Performance: Hot, frequently accessed data sits on local SSD, right alongside the services that use it
- Redundancy: Important archives on mirrored NAS storage
- Clarity: Easy to understand what data lives where, with clearer assignment and centralization
- Backups: Simplified backup strategy with clear data tiers, documented systematically in my mind map system (a minimal example follows)
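As a minimal example of that tiering in practice (paths are hypothetical): configs on the local SSD get mirrored to the redundant NAS share on a schedule, for instance nightly from cron.

```bash
# Nightly mirror of hot config data from local SSD to the redundant NAS share
rsync -a --delete /opt/appdata/ /mnt/nas/backups/appdata/
```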
Lessons Learned
- Storage Planning: Think about data access patterns before deployment
- Network vs Local: Understand when you need network storage vs local performance
- Data Classification: Different data types have different storage requirements
- Start Simple: Build storage strategy incrementally based on actual needs
GPU Passthrough: Simpler Than Expected
The Initial Confusion
Getting hardware acceleration working for Jellyfin seemed like it would be complicated. I spent way too much time reading complex tutorials about GPU passthrough and device mapping.
Turns out it was much simpler than I thought.
The Simple Solution
The actual process was straightforward once I understood the basics:
- Find the render group ID: Quick check showed it was 104 on my system
- Identify the device: The Intel UHD 630 was at /dev/dri/renderD128
- Use Proxmox GUI: Just added the device mapping through the web interface
- Set group permissions: Added the container to render group 104
That was it. No complex configurations or deep system modifications needed, although I did spend too much time on the LXC configuration file.
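If you want to check the same things, the discovery step really is just a couple of commands; the group ID and device node will vary per host, but on mine they matched the 104 and renderD128 mentioned above.

```bash
# Find the render group ID on the Proxmox host (104 on my system)
getent group render

# Confirm the iGPU's render node exists (Intel UHD 630 showed up as renderD128)
ls -l /dev/dri/
```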
The Real Challenge: Storage Permissions
The actual permission nightmare wasn’t GPU related at all. It was getting proper access to my TrueNAS storage from LXC containers.
The Problem: Proxmox root permissions were blocking /mnt/center access from my unprivileged container, even over SMB directly, because of how container root permissions are mapped. Only privileged LXC containers can mount CIFS shares themselves, so I had to come up with a creative solution.
The Solution: After digging through forums, I learned to create matching /mnt/center directory structures on both the datacenter node and the container, then configure the mount points properly through the Proxmox GUI (sketched after the list below).
What Actually Worked
- Directory Mirroring: Identical paths on host and container
- Proper Mount Configuration: Ensured /mnt/center was mapped identically on the host and inside the container's /mnt/ directory tree
- Permission Mapping: Proper UID/GID mapping between host and container, which is 10000 on the datacenter node (and also in the LXC)
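A hedged sketch of what that looks like on the command line, assuming the share is mounted on the Proxmox host first and then bind-mounted into the container: the VMID, share name, and credentials file are examples, and the uid/gid values should match your container's mapping (10000 in my case).

```bash
# On the Proxmox host: mount the TrueNAS share where the container expects it
mount -t cifs //truenas.lan/center /mnt/center \
  -o credentials=/root/.smbcredentials,uid=10000,gid=10000

# Bind the same path into the unprivileged container as a mount point
pct set 101 -mp0 /mnt/center,mp=/mnt/center
```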
Lessons Learned
- Don’t Overcomplicate: Sometimes the solution is simpler than expected
- Separate Issues: GPU passthrough and storage permissions are different problems
- Use the GUI: Proxmox web interface handles a lot of complexity for you
- Community Forums: Real solutions often come from other users, not official docs, and researching is helpful above all else.
Volume Pathing Chaos: The Docker Mount Point Maze
The Mistake
As my Docker infrastructure grew, I developed an inconsistent approach to volume mounting. Some containers used relative paths, others absolute paths. Some mounted individual directories, others entire filesystems. At many points, services I had retired also left stray files behind.
This organic growth created a maintenance nightmare, and it happens to pretty much any infrastructure.
The Symptoms
Broken Deployments: New containers failed to start due to missing mount points.
Backup Inconsistencies: Some data was backed up multiple times, other critical data was missed entirely.
Permission Conflicts: The same directory mounted in multiple containers with different ownership.
Recovery Difficulties: Restoring from backups required remembering dozens of different path configurations.
The Standardization Solution
I developed a consistent volume mapping strategy, sketched below:
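This is one service as an example of the convention; the image, ports, and container paths here are Jellyfin's defaults and purely illustrative, but any service follows the same /root plus /mnt/center layout.

```bash
# Config lives under /root/<service>, bulk data under /mnt/center
docker run -d --name jellyfin \
  -v /root/jellyfin/config:/config \
  -v /mnt/center/media:/media:ro \
  -p 8096:8096 \
  jellyfin/jellyfin:latest
```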
Benefits of Standardization
- Predictable Paths: New deployments follow established patterns
- Simplified Backups: All configuration in /root, all data dumps in /mnt/center
- Clear Permissions: Consistent ownership patterns across services
- Easy Migration: Services can be moved between hosts with minimal changes
Lessons Learned
- Standards First: Establish conventions before deploying multiple services
- Document Decisions: Maintain clear documentation of path standards
- Regular Audits: Periodically review and standardize existing deployments
- Backup Testing: Verify backup/restore procedures work with your path structure
Family Network Compatibility: The Ad-Blocking Rebellion
The Mistake
Excited about Pi-hole's ad-blocking capabilities, I configured aggressive blocklists without considering their impact on my family's daily internet usage. More blocking equals better security, which surely equals a better experience, right?
To my family, naturally, the answer was no.
The User Revolt
Within days, complaints started pouring in:
Smart TV Breakdown: Our LG WebOS TV couldn’t access streaming services due to blocked tracking domains.
Amazon Prime Issues: Prime Video required specific telemetry domains that my blocklists had eliminated, leaving my parents on an Amazon logo screen.
Mobile App Failures: Netflix mobile apps and Facebook would not work.
The Balancing Act
Finding the right balance required systematic testing:
- Baseline Lists: Started with conservative, well-maintained blocklists
- Whitelist Strategy: Added exceptions based on user reports (examples follow this list)
- Device-Specific Rules: Created different filtering levels for different devices
- Regular Reviews: Monthly audits of blocked domains and user feedback
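A few example whitelist entries of the kind I ended up adding; these are commonly cited streaming-related domains, not a definitive list, so verify candidates against your own query log before copying them.

```bash
# Whitelist domains that legitimate apps were breaking on
pihole -w appboot.netflix.com
pihole -w atv-ps.amazon.com
pihole -w device-metrics-us.amazon.com
```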
The Privacy vs. Functionality Lesson
This experience highlighted an important truth about network security: Perfect security that prevents normal usage is effectively useless.
The goal shifted from maximum blocking to intelligent blocking that:
- Protects against malicious domains
- Blocks intrusive advertising
- Preserves essential functionality
- Maintains user satisfaction
Lessons Learned
- User Experience First: Security measures must consider real-world usage patterns
- Gradual Implementation: Start conservative and add restrictions incrementally
- Stakeholder Communication: Explain changes and gather feedback from all users
- Monitoring and Adjustment: Be prepared to modify configurations based on usage data
The Non-ECC RAM Reality Check
The Mistake
Building my homelab on consumer hardware seemed like a smart budget decision. Since the 9900K and 32GB of RAM were reused from a previous main desktop setup, it was a no-brainer.
However, as my homelab has grown, I've developed reservations about that choice, because my RAM is not ECC (Error-Correcting Code) memory. Strangely, Intel does not enable ECC support on consumer-grade processors like the 9900K and its chipset, unlike many AMD platforms.
Hidden Costs
Enhanced Monitoring Required: Without ECC error correction, I had to implement extensive monitoring to catch data corruption early.
Frequent Restarts: Consumer RAM requires more regular system restarts to prevent accumulated errors.
Backup Paranoia: The possibility of silent data corruption necessitated more frequent and comprehensive backups.
Future Upgrade Pressure: Every storage or compute expansion reminded me of the ECC limitation and made me question how I was allocating upgrades.
The Mitigation Strategy
While I couldn’t add ECC support to existing hardware, I developed protective measures:
SMART Monitoring: Regular disk health checks to catch failing drives early.
Mirror Configuration: All critical data stored redundantly across multiple drives.
Automated Backups: Daily incremental backups with multiple retention periods.
Regular Verification: Periodic filesystem checks and data integrity validation (a minimal example is sketched below).
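A minimal sketch of the monitoring side, assuming smartmontools is installed; device names and paths are placeholders, not my exact layout.

```bash
# Quick SMART health summary and a short self-test on a data disk
smartctl -H /dev/sda
smartctl -t short /dev/sda

# Periodic integrity check: snapshot checksums of critical files, re-verify later
find /mnt/center/archive -type f -exec sha256sum {} + > /root/archive.sha256
sha256sum -c /root/archive.sha256 --quiet
```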
The Future Solution
My roadmap now includes:
- Dedicated Storage Server: ECC-enabled system for critical data storage, separate from the current homelab
- Compute/Storage Separation: Non-ECC hardware for processing, ECC-capable hardware for storage (like my 7950X3D machine)
- Gradual Migration: Phased approach to avoid disrupting existing services, with a solid bridge between the NAS and the homelab
Lessons Learned
- Plan for Data Integrity: Consider error correction from the beginning, and consider separating the NAS from your homelab
- Budget Realistically: Allocate money to protect your data; if that means a separate NAS with a drive dock, it might have to be done
- Understand Trade-offs: Savings from reusing on-hand hardware come with hidden operational costs, like potential data loss
- Design for Upgrade: Build systems that can evolve as requirements grow, which my 7950X3D system could fulfill
The Path Forward: Lessons Become Wisdom
Each of these mistakes taught me valuable lessons that improved my infrastructure:
Security Mindset
- Default Deny: Configure services with minimal permissions initially
- Monitoring First: Implement logging and alerting before deploying services
- Incident Response: Have procedures or fallbacks ready for when things go wrong
Hardware Planning
- Research Compatibility: Verify hardware supports your intended workloads
- Plan for Growth: Design systems that can evolve with changing requirements
- Performance Tiers: Match storage and compute to application requirements
- Total Cost of Ownership: Consider operational costs, not just initial purchase price
Operational Excellence
- Documentation Culture: Record decisions, configurations, and procedures, and keep an audit log of your work
- Standards Consistency: Establish and follow deployment patterns
- Testing Procedures: Verify changes in isolated/locally hosted environments first
- User Feedback: Consider the needs of everyone using your infrastructure
Conclusion: Embrace the Learning Journey
These mistakes were frustrating, time-consuming, and sometimes embarrassing. They also provided the most valuable learning experiences of my homelab journey. Each failure taught me more about proper system administration than success ever could.
If you’re beginning your own homelab journey, remember:
- Mistakes are inevitable: Plan for them, learn from them, and share them on forums to help others improve their infrastructure.
- Security first: It’s always easier to relax restrictions than recover from data breaches. It even affects the largest Fortune 500 companies today.
- Users matter: Infrastructure that doesn't serve users is expensive hobby equipment, and users won't always put up with overbearing protections.
- Document everything: Your future self will thank you, especially if you rebuild your infrastructure from scratch.
The goal isn't to avoid all mistakes; it's to make new ones, learn from them, and build better systems. My infrastructure today is more secure, more reliable, and more useful because of every error along the way. The journey taught me to research holistically and to be patient when I face problems.
A lab is not a lab unless you experience both failure and success.