You’re paying €28.86/month for 10TB on a Hetzner Storage Box. It’s affordable, easy to access, and does a solid job keeping your backups safe.
But let’s talk about the part nobody likes to think about: what happens when things go sideways?
What if Hetzner has a prolonged outage? What if compliance rules suddenly require geographic redundancy? What if you actually need that 99.999999999% durability, the famous eleven nines enterprise systems are built around?
The move isn’t to ditch Hetzner’s great price-to-performance. It’s to add a second layer using Amazon S3 Glacier Deep Archive at roughly $1/TB/month, giving you ultra-durable, geo-replicated archival storage for the long haul.
The real challenge? Moving your data from point A (Hetzner) to point B (AWS) in a way that’s efficient, incremental, and fully automated… without turning the whole thing into a costly, complex mess.
That’s where automation starts paying for itself, in both euros and peace of mind.
The Motive: Why This Matters
Modern backup strategies run into a weird trade-off. Affordable storage, like ~€2.89/TB/month for 10TB on a Hetzner Storage Box, works great for active data, but it doesn’t give you the enterprise-grade durability or geographic redundancy you need for real disaster recovery.
On the flip side, that eleven-nines durability is available… but paying ~$23/TB/month for Amazon S3 Standard can quickly turn your backup bill into something bigger than your production costs.
The sweet spot is hybrid tiering: keep hot data on affordable storage, and push cold backups into ultra-cheap archival storage.
Why Not Manual Backups?
Manual backups tend to fall apart for three simple reasons:
- Human error - eventually, you will forget.
- Bandwidth waste - re-uploading unchanged files eats both time and money.
- No state tracking - without knowing what’s already uploaded, you can’t resume failed transfers or skip duplicates.
The Streaming Advantage
Most backup tools take the scenic route: download files locally first, then upload them to the destination. That creates three avoidable problems:
- Disk space - you need local storage as large as your biggest backup
- Time - you’re doing two transfers instead of one
- Cost - large-disk VMs aren’t cheap
Streaming directly from SFTP (in this case) to S3 removes all three: no local disk needed, a single transfer path, and the whole thing can run on even the smallest VM.
The Architecture
The Python script you will build sets up a direct pipeline from your Hetzner Storage Box straight into Amazon S3 Glacier Deep Archive, no staging, no local copies, no unnecessary moving parts.
System Flow

Requirements & Dependencies
The script runs on Python 3.7+ and relies on four core libraries:
- boto3 - the AWS SDK for Python, handling authentication, streaming uploads, and automatic multipart transfers for large files
- paramiko - a pure-Python SSH/SFTP implementation for secure access to your Storage Box
- tqdm - provides real-time progress bars with transfer speeds and ETA
- python-dotenv - loads environment variables from .env files to keep credentials out of your codebase
On top of that, it uses a few built-in modules:
- sqlite3 for lightweight state tracking (no external DB required)
- logging for console + file output
- os, stat, time for file handling and system utilities
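The four third-party libraries can be installed in one step (the package names below are the standard PyPI ones):

```
pip install boto3 paramiko tqdm python-dotenv
```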
Access Setup
The pipeline needs two things configured outside the script:
Hetzner Storage Box - SFTP access:
- The Storage Box must have SFTP access enabled and configured in the Hetzner Robot / Cloud Console.
- You need the SFTP host (e.g. u123456.your-storagebox.de), port (usually 23), and a user/password that can read the directories you want to back up.
- Without SFTP enabled, the script cannot connect to the Storage Box.
If you still need to get data onto the Storage Box in the first place (local → Hetzner), see Build Your Own Low-Cost Cloud Backup with Hetzner Storage Boxes.
AWS - access keys for S3:
- Create an IAM user (or use an existing one) with permission to write to your target S3 bucket (e.g. s3:PutObject on that bucket).
- Generate access keys for that user and put AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your .env.
- The script uses these to authenticate with S3; no keys means uploads will fail.
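Putting both halves together, a minimal .env for this pipeline might look like the sketch below. All values are placeholders; replace them with your own host, bucket, and keys:

```
# Hetzner Storage Box (SFTP)
HETZNER_HOST=u123456.your-storagebox.de
HETZNER_PORT=23
HETZNER_USER=u123456
HETZNER_PASSWORD=changeme
REMOTE_PATH=.

# AWS S3 target
S3_BUCKET=my-archive-bucket
S3_PREFIX=hetzner/
AWS_REGION=eu-central-1
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...

# Optional overrides
STATE_DB=archive_state.db
LOG_FILE=archive.log
EXCLUDE_PATHS=backup/old,cache
```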
The Code
You can find the complete implementation on GitHub, but here are the core building blocks that make this pipeline from Hetzner to Amazon S3 Glacier Deep Archive actually work in practice.
System Components Overview
The script is structured around five key layers that work together:
- Configuration Layer - pulls in credentials and runtime settings from a .env file
- State Persistence Layer - uses SQLite to track which files have already been uploaded
- SFTP Discovery Engine - recursively walks your Storage Box directory tree
- Streaming Transfer Engine - streams files directly from SFTP to S3
- Main Orchestration Loop - ties everything together and manages execution flow
Let’s break down each piece with code snippets and, more importantly, why it’s built this way.
Configuration Loading
```python
import os

from dotenv import load_dotenv

load_dotenv()

HETZNER_HOST = os.getenv("HETZNER_HOST")
HETZNER_PORT = int(os.getenv("HETZNER_PORT", "23"))  # Storage Box SFTP port
HETZNER_USER = os.getenv("HETZNER_USER")
HETZNER_PASSWORD = os.getenv("HETZNER_PASSWORD")
REMOTE_PATH = os.getenv("REMOTE_PATH")
S3_BUCKET = os.getenv("S3_BUCKET")
S3_PREFIX = os.getenv("S3_PREFIX")
AWS_REGION = os.getenv("AWS_REGION")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
STATE_DB = os.getenv("STATE_DB", "archive_state.db")
LOG_FILE = os.getenv("LOG_FILE", "archive.log")

# Comma-separated paths to exclude (that path and everything under it)
EXCLUDE_PATHS_RAW = os.getenv("EXCLUDE_PATHS", os.getenv("EXCLUDE_DIRS", ""))
EXCLUDE_PATHS = [p.strip().strip("/") for p in EXCLUDE_PATHS_RAW.split(",") if p.strip()]
```
All secrets and runtime settings are stored in a .env file and loaded at start-up using python-dotenv. This keeps credentials (for Hetzner and AWS), bucket details, remote paths, logging config, and optional exclusions outside of your actual codebase.
Keeping configuration separate from code is one of those boring best practices that quietly saves you from very real disasters. Loading environment variables at runtime lets you:
- Avoid committing secrets to version control
- Run the same code across dev, staging, and production with different configs
- Rotate credentials without redeploying code
- Open-source the project without leaking access keys
If you want to level up config and validation in Python (typed settings, env validation, APIs), check out Practical Pydantic.
State Database Initialization
```python
def init_db():
    conn = sqlite3.connect(STATE_DB)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path TEXT PRIMARY KEY,
            size INTEGER,
            mtime INTEGER,
            uploaded_at INTEGER
        )
    """)
    conn.commit()
    return conn
```
A local SQLite database keeps track of every file that’s already been archived using a simple fingerprint: (path, size, mtime). For each upload, it records the file path (as the primary key), its size in bytes, last modified timestamp, and when it was successfully transferred.
If the script stops midway, or you run it again later, it immediately knows what’s already been handled and what still needs attention.
SQLite gives you a zero-config database with no server to install or maintain. The schema is intentionally minimal:
- path as the PRIMARY KEY guarantees uniqueness and fast lookups
- size + mtime act as a reliable change detector
- uploaded_at gives you an audit trail for debugging or reporting
Using CREATE TABLE IF NOT EXISTS also makes the process idempotent, meaning you can run it repeatedly without errors, duplicate uploads, or wasted bandwidth. Just clean, incremental backups every time.
Smart Duplicate Detection
```python
def already_uploaded(conn, path, size, mtime):
    row = conn.execute(
        "SELECT size, mtime FROM files WHERE path=?",
        (path,)
    ).fetchone()
    return row == (size, mtime)
```
This function checks the state database to see whether a file with the same path, size, and last modification time has already been uploaded, allowing the script to intelligently skip anything that hasn’t changed.
That one-line tuple comparison, row == (size, mtime), quietly handles all the important cases:
- If the path isn’t in the database → None == (size, mtime) → False → upload it
- If the path exists but size or mtime changed → (old_size, old_mtime) == (new_size, new_mtime) → False → upload it
- If the path exists with identical size and mtime → True → skip it
This makes incremental backups dramatically more efficient. After the initial full run, future executions may end up skipping the vast majority of files. Since the database is updated after each successful transfer, the process is also interruption-safe: if something stops halfway through, it simply resumes where it left off.
State Persistence After Upload
```python
def mark_uploaded(conn, path, size, mtime):
    conn.execute(
        "REPLACE INTO files VALUES (?, ?, ?, ?)",
        (path, size, mtime, int(time.time()))
    )
    conn.commit()
```
After a file is successfully uploaded, this function stores its metadata in the state database, creating a persistent record that prevents the same unchanged file from being transferred again in future runs.
Using REPLACE instead of a plain INSERT neatly covers both scenarios:
- If the file path already exists → update the existing record
- If it doesn’t → insert a new one
The immediate commit() is what makes this reliable. If the script crashes right after this step, the upload is already recorded, so the next run won’t waste time or bandwidth reprocessing it.
This gives you interruption safety at the file level. Each commit is atomic: you either have a fully written record, or nothing at all. No half-finished states, no ambiguity.
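To see all three skip/upload cases in one place, here’s a self-contained sketch of the state-tracking flow using an in-memory database (the real script points STATE_DB at a file on disk):

```python
import sqlite3
import time

def init_db(db_path=":memory:"):
    # Same schema as the real script, but in-memory for the demo
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path TEXT PRIMARY KEY,
            size INTEGER,
            mtime INTEGER,
            uploaded_at INTEGER
        )
    """)
    conn.commit()
    return conn

def already_uploaded(conn, path, size, mtime):
    row = conn.execute(
        "SELECT size, mtime FROM files WHERE path=?", (path,)
    ).fetchone()
    return row == (size, mtime)

def mark_uploaded(conn, path, size, mtime):
    conn.execute(
        "REPLACE INTO files VALUES (?, ?, ?, ?)",
        (path, size, mtime, int(time.time())),
    )
    conn.commit()

conn = init_db()
# First run: nothing recorded yet, so the file would be uploaded
print(already_uploaded(conn, "backups/db.tar.gz", 1024, 1700000000))  # False
mark_uploaded(conn, "backups/db.tar.gz", 1024, 1700000000)
# Second run, identical size and mtime: skip
print(already_uploaded(conn, "backups/db.tar.gz", 1024, 1700000000))  # True
# File changed on the Storage Box (new mtime): upload again
print(already_uploaded(conn, "backups/db.tar.gz", 1024, 1700000500))  # False
```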
SFTP Directory Walker with Path Exclusions
```python
def _normalize_path(p):
    return p.replace("//", "/").strip("/") if p and p != "." else ""

def _is_path_excluded(full_path):
    norm = _normalize_path(full_path)
    if not norm:
        return False
    for exclude in EXCLUDE_PATHS:
        if norm == exclude or norm.startswith(exclude + "/"):
            return True
    return False

def walk_sftp(sftp, path):
    for entry in sftp.listdir_attr(path):
        # Skip hidden files and directories
        if entry.filename.startswith('.'):
            continue
        # Normalize path
        if path == ".":
            full = entry.filename
        else:
            full = f"{path}/{entry.filename}".replace("//", "/")
        if stat.S_ISDIR(entry.st_mode):
            if _is_path_excluded(full):
                continue
            yield from walk_sftp(sftp, full)
        else:
            yield full, entry.st_size, entry.st_mtime
```
This recursive generator walks the full directory tree on your Storage Box over SFTP. It lists directory contents, separates files from folders using mode attributes, skips hidden entries (anything starting with a dot), and ignores any directory that matches, or lives under, a path defined in EXCLUDE_PATHS. That makes it easy to leave out things like backup/old, log folders, or cache directories from your archive.
This behaves a lot like Python’s os.walk(), but for remote storage. A few design choices make it scale cleanly:
- Generator-based (yield): files are produced one at a time, so memory usage stays flat even with huge directory trees
- Hidden file filtering: avoids wasting time on system artifacts like .bash_history or .ssh/
- Path-prefix exclusions: _is_path_excluded() treats entries in EXCLUDE_PATHS as prefixes, so excluding backup/old skips that folder and everything inside it
- Normalization: _normalize_path() strips extra slashes so path comparisons are consistent
- Recursive delegation: yield from walk_sftp(...) passes control without building large intermediate lists
The result is a memory-efficient traversal that can handle anything from a few hundred files to millions, without needing to load the full tree into RAM first.
Streaming Upload with Progress
```python
def upload_stream(fileobj, s3_key, size):
    progress = tqdm(
        total=size,
        unit="B",
        unit_scale=True,
        desc=s3_key,
        leave=False
    )

    def callback(bytes_amount):
        progress.update(bytes_amount)
        metrics["bytes_uploaded"] += bytes_amount

    s3.upload_fileobj(
        fileobj,
        Bucket=S3_BUCKET,
        Key=s3_key,
        ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
        Callback=callback
    )
    progress.close()
```
This function streams files directly from Hetzner SFTP to Amazon S3 Glacier Deep Archive without ever writing them to local disk. The SFTP file object is passed straight to boto3.upload_fileobj(), which handles chunked uploads internally. Real-time progress bars are updated via a callback after each chunk, giving visibility into transfer speed and completion.
This is the core of the streaming architecture. Key points:
- upload_fileobj(): Streams any file-like object (including SFTP handles) to S3, no temporary disk storage required
- StorageClass='DEEP_ARCHIVE': Saves cost by writing directly to the cheapest archival tier, avoiding Standard tier fees and lifecycle transitions
- Callback=callback: Updates progress bar and metrics["bytes_uploaded"] after each chunk (~8MB)
- leave=False: Keeps the console clean by removing progress bars when done
This method handles gigabyte-scale files on machines with only a few megabytes of RAM. The callback pattern provides live feedback without blocking the upload, and the metrics dictionary allows cumulative tracking for reporting purposes.
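The callback pattern itself doesn’t depend on boto3. Here’s a minimal stdlib-only sketch in which a stub uploader stands in for upload_fileobj(), with the chunk size shrunk from ~8MB to 8 bytes for the demo:

```python
import io

metrics = {"bytes_uploaded": 0}

def fake_upload_fileobj(fileobj, callback, chunk_size=8):
    """Stand-in for boto3's upload_fileobj: read chunks, report each one."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        # boto3 invokes Callback with the number of bytes just transferred
        callback(len(chunk))

def callback(bytes_amount):
    metrics["bytes_uploaded"] += bytes_amount

fake_upload_fileobj(io.BytesIO(b"x" * 20), callback)
print(metrics["bytes_uploaded"])  # 20: chunks of 8 + 8 + 4 bytes
```

The same accumulation works for a progress bar: tqdm’s update() takes the identical per-chunk byte count.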
Main Orchestration
```python
def main():
    conn = init_db()
    transport = paramiko.Transport((HETZNER_HOST, HETZNER_PORT))
    transport.connect(username=HETZNER_USER, password=HETZNER_PASSWORD)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        log.info(f"Starting backup from: {REMOTE_PATH}")
        for path, size, mtime in walk_sftp(sftp, REMOTE_PATH):
            if REMOTE_PATH == ".":
                rel = path
            else:
                rel = path.replace(REMOTE_PATH, "").lstrip("/")
            s3_key = f"{S3_PREFIX}{rel}"
            if already_uploaded(conn, path, size, mtime):
                log.info(f"SKIP {path}")
                metrics["skipped"] += 1
                continue
            log.info(f"UPLOAD {path} -> s3://{S3_BUCKET}/{s3_key}")
            with sftp.open(path, "rb") as f:
                upload_stream(f, s3_key, size)
            mark_uploaded(conn, path, size, mtime)
            metrics["uploaded"] += 1
    finally:
        sftp.close()
        transport.close()
        conn.close()
```
The main() function orchestrates the entire pipeline: it connects to Hetzner via SFTP and AWS via boto3, initializes the SQLite state database, and iterates over every file discovered by the SFTP walker. The S3 key is built from S3_PREFIX plus the path relative to REMOTE_PATH (so backing up from a subdirectory doesn’t duplicate that prefix in the archive). For each file:
- Checks if it’s already uploaded using the state database
- Skips unchanged files (logging and incrementing the skip counter)
- Streams new or modified files directly to Amazon S3 Glacier Deep Archive
- Updates the database after successful upload
- Tracks metrics like files uploaded, skipped, and bytes transferred
Why it’s built this way:
- Connection management: Opens SFTP and DB connections at the start; try/finally ensures they’re closed even if something goes wrong
- Generator-driven: walk_sftp() feeds one file at a time, keeping memory usage constant
- Early state check: already_uploaded() prevents unnecessary file reads and saves bandwidth
- Context managers: with sftp.open(...) safely closes remote file handles
- Immediate commits: mark_uploaded() writes to the database after each file, making the system resilient to crashes or interruptions
The design favours simplicity over complexity: no threads, no async, no elaborate state machines.
This keeps debugging straightforward and failure modes predictable. At the end, a summary report logs total files uploaded, skipped, bytes transferred, and total runtime.
Progress Tracking with Metrics
```python
metrics = {
    "uploaded": 0,
    "skipped": 0,
    "bytes_uploaded": 0,
    "start_time": time.time(),
}

# ... at the end:
duration = time.time() - metrics["start_time"]
log.info("==== SUMMARY ====")
log.info(f"Uploaded files : {metrics['uploaded']}")
log.info(f"Skipped files : {metrics['skipped']}")
log.info(f"Bytes uploaded : {metrics['bytes_uploaded']:,}")
log.info(f"Duration (sec) : {int(duration)}")
```
A global metrics dictionary tracks operational statistics throughout execution: files uploaded, files skipped, total bytes transferred, and the start time for duration calculation. The callback in upload_stream() accumulates bytes, while the main loop increments the counters. At the end, you get a summary:
```
==== SUMMARY ====
Uploaded files : 42
Skipped files  : 1,583
Bytes uploaded : 15,234,567,890
Duration (sec) : 340
```
This makes it easy to track incremental efficiency over time. After the first full backup, subsequent runs should show high skip counts and low upload counts, exactly what you want. These metrics help you verify the script is working correctly and monitor bandwidth usage.
Cost Analysis
Using a hybrid approach, Hetzner for active backups and Amazon S3 Glacier Deep Archive for cold storage, gives enterprise-grade durability at a fraction of standard cloud costs.
Monthly Storage Costs (10TB Example)
| Service | Cost | Role |
|---|---|---|
| Hetzner Storage Box BX31 (10TB) | €28.86 | Active, frequently accessed backups |
| AWS S3 Glacier Deep Archive (10TB) | $10.10 | Cold, geo-redundant disaster recovery |
| Total | €39.96 (~$44) | Hybrid storage solution |
| Per-TB Cost | €4.00/TB (~$4.40/TB) | Unit economics |
Comparison:
- AWS S3 Standard: $230/month
- Backblaze B2: $50/month (+ egress fees)
- Google Cloud Archive: $12/month (higher retrieval costs)
Hybrid tiering delivers reliability at a fraction of S3 Standard pricing.
Transfer & API Costs
- Hetzner → AWS: Free egress & ingress
- S3 PUT Requests: ~$5 for 100,000 files initially; ~$0.25/month for incremental changes (~5%)
- Impact: Negligible compared to storage savings
Retrieval Costs (Deep Archive Caveat)
- Retrieval: $0.02/GB
- Restoration: 12–48 hours
- Temporary storage in S3 Standard: $0.023/GB/month
Example: Restoring 1TB costs ~$20 in retrieval fees plus ~$23/month if kept temporarily. Deep Archive is designed for “write-once, rarely-read” disaster recovery.
This hybrid setup gives the best of both worlds: affordable, accessible active storage and ultra-durable, geo-redundant archival backup, without breaking the bank.
Conclusion
Backing up from a Hetzner Storage Box to Amazon S3 Glacier Deep Archive gives you a powerful, cost-effective solution with:
- Affordability: ~$5/month per TB using hybrid tiered storage
- Durability: 11-nines (99.999999999%) with built-in geo-replication
- Efficiency: Incremental, streaming transfers with zero local disk usage
- Reliability: Interrupt-safe, state-persistent, fully logged operations
- Automation: Fully hands-off, schedule via cron, Task Scheduler, or any job runner
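For example, a weekly unattended run via cron might look like this (the install path and script name are placeholders):

```
# Run the archive script every Sunday at 03:00, appending output to a log
0 3 * * 0 cd /opt/hetzner-archive && /usr/bin/python3 backup.py >> cron.log 2>&1
```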
Whether for personal data, compliance, or enterprise disaster recovery, this pipeline delivers enterprise-grade reliability at a fraction of typical cloud costs.
The code is production-ready, tested, and designed to run unattended. By combining Hetzner’s affordability with AWS Glacier Deep Archive’s durability, you get the best of both worlds.
Full code and documentation: https://github.com/nunombispo/hetzner-to-aws-s3-glacier-backups
More on Python, Hetzner and automation: Hetzner Storage Box backup, Django on Hetzner + Dokku, and Python One-Liners.
Follow me on Twitter: https://twitter.com/DevAsService
Follow me on Instagram: https://www.instagram.com/devasservice/
Follow me on TikTok: https://www.tiktok.com/@devasservice
Follow me on YouTube: https://www.youtube.com/@DevAsService
