Automating Long-Term Backup - Hetzner Storage Box to AWS S3 Glacier Deep Archive

By Nuno Bispo

You’re paying €28.86/month for 10TB on a Hetzner Storage Box. It’s affordable, easy to access, and does a solid job of keeping your backups safe.

But let’s talk about the part nobody likes to think about: what happens when things go sideways?

What if Hetzner has a prolonged outage? What if compliance rules suddenly require geographic redundancy? What if you actually need that 99.999999999% durability, the famous “eleven nines” that enterprise systems are built around?

The move isn’t to ditch Hetzner’s great price-to-performance. It’s to add a second layer using Amazon S3 Glacier Deep Archive at roughly $1/TB/month, giving you ultra-durable, geo-replicated archival storage for the long haul.

The real challenge? Moving your data from point A (Hetzner) to point B (AWS) in a way that’s efficient, incremental, and fully automated… without turning the whole thing into a costly, complex mess.

That’s where automation starts paying for itself, in both euros and peace of mind.


The Motive: Why This Matters

Modern backup strategies run into a weird trade-off. Affordable storage, like ~€2.89/TB/month for 10TB on a Hetzner Storage Box, works great for active data, but it doesn’t give you the enterprise-grade durability or geographic redundancy you need for real disaster recovery.

On the flip side, that eleven-nines durability is available… but paying ~$23/TB/month for Amazon S3 Standard can quickly turn your backup bill into something bigger than your production costs.

The sweet spot is hybrid tiering: keep hot data on affordable storage, and push cold backups into ultra-cheap archival storage.

Why Not Manual Backups?

Manual backups tend to fall apart for three simple reasons:

  1. Human error - eventually, you will forget.
  2. Bandwidth waste - re-uploading unchanged files eats both time and money.
  3. No state tracking - without knowing what’s already uploaded, you can’t resume failed transfers or skip duplicates.

The Streaming Advantage

Most backup tools take the scenic route: download files locally first, then upload them to the destination. That creates three avoidable problems:

  • Disk space - you need local storage as large as your biggest backup
  • Time - you’re doing two transfers instead of one
  • Cost - large-disk VMs aren’t cheap

Streaming directly from SFTP (in this case) to S3 removes all three: no local disk needed, a single transfer path, and the whole thing can run on even the smallest VM.


The Architecture

The Python script you will build sets up a direct pipeline from your Hetzner Storage Box straight into Amazon S3 Glacier Deep Archive: no staging, no local copies, no unnecessary moving parts.

System Flow

[Figure: system architecture diagram]

Requirements & Dependencies

The script runs on Python 3.7+ and relies on four core libraries:

  • boto3 - the AWS SDK for Python, handling authentication, streaming uploads, and automatic multipart transfers for large files
  • paramiko - a pure-Python SSH/SFTP implementation for secure access to your Storage Box
  • tqdm - provides real-time progress bars with transfer speeds and ETA
  • python-dotenv - loads environment variables from .env files to keep credentials out of your codebase

On top of that, it uses a few built-in modules:

  • sqlite3 for lightweight state tracking (no external DB required)
  • logging for console + file output
  • os, stat, time for file handling and system utilities
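If you are setting this up from scratch, the third-party packages install with a single pip command (the names below match the dependency list above; the built-in modules ship with Python):

```shell
pip install boto3 paramiko tqdm python-dotenv
```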

Access Setup

The pipeline needs two things configured outside the script:

Hetzner Storage Box - SFTP access:

  • The Storage Box must have SFTP access enabled and configured in the Hetzner Robot / Cloud Console.
  • You need the SFTP host (e.g. u123456.your-storagebox.de), port (usually 23), and a user/password that can read the directories you want to back up.
  • Without SFTP enabled, the script cannot connect to the Storage Box.

If you still need to get data onto the Storage Box in the first place (local → Hetzner), see Build Your Own Low-Cost Cloud Backup with Hetzner Storage Boxes.

AWS - access keys for S3:

  • Create an IAM user (or use an existing one) with permission to write to your target S3 bucket (e.g. s3:PutObject on that bucket).
  • Generate access keys for that user and put AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your .env.
  • The script uses these to authenticate with S3; no keys means uploads will fail.
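Putting the two setups together, a .env file for this pipeline might look like the sketch below. The variable names match what the script reads; every value is a placeholder you must replace with your own.

```shell
# .env -- example only, with placeholder values; never commit this file
HETZNER_HOST=u123456.your-storagebox.de
HETZNER_PORT=23
HETZNER_USER=u123456
HETZNER_PASSWORD=change-me
REMOTE_PATH=backups

S3_BUCKET=my-deep-archive-bucket
S3_PREFIX=hetzner/
AWS_REGION=eu-central-1
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key

# Optional overrides
STATE_DB=archive_state.db
LOG_FILE=archive.log
EXCLUDE_PATHS=backup/old,cache
```

Note the trailing slash on S3_PREFIX: the script builds keys by simple concatenation, so the prefix needs to end where the relative path should begin.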

The Code

You can find the complete implementation on GitHub, but here are the core building blocks that make this pipeline from Hetzner to Amazon S3 Glacier Deep Archive actually work in practice.

System Components Overview

The script is structured around five key layers that work together:

  • Configuration Layer - pulls in credentials and runtime settings from a .env file
  • State Persistence Layer - uses SQLite to track which files have already been uploaded
  • SFTP Discovery Engine - recursively walks your Storage Box directory tree
  • Streaming Transfer Engine - streams files directly from SFTP to S3
  • Main Orchestration Loop - ties everything together and manages execution flow

Let’s break down each piece with code snippets and, more importantly, the reasoning behind why it’s built this way.

Configuration Loading

import os

from dotenv import load_dotenv

load_dotenv()

HETZNER_HOST = os.getenv("HETZNER_HOST")
HETZNER_PORT = int(os.getenv("HETZNER_PORT", "23"))  # Storage Box SFTP port
HETZNER_USER = os.getenv("HETZNER_USER")
HETZNER_PASSWORD = os.getenv("HETZNER_PASSWORD")
REMOTE_PATH = os.getenv("REMOTE_PATH")

S3_BUCKET = os.getenv("S3_BUCKET")
S3_PREFIX = os.getenv("S3_PREFIX")
AWS_REGION = os.getenv("AWS_REGION")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")

STATE_DB = os.getenv("STATE_DB", "archive_state.db")
LOG_FILE = os.getenv("LOG_FILE", "archive.log")
# Comma-separated paths to exclude (that path and everything under it)
EXCLUDE_PATHS_RAW = os.getenv("EXCLUDE_PATHS", os.getenv("EXCLUDE_DIRS", ""))
EXCLUDE_PATHS = [p.strip().strip("/") for p in EXCLUDE_PATHS_RAW.split(",") if p.strip()]

All secrets and runtime settings are stored in a .env file and loaded at start-up using python-dotenv. This keeps credentials (for Hetzner and AWS), bucket details, remote paths, logging config, and optional exclusions outside of your actual codebase.

Keeping configuration separate from code is one of those boring best practices that quietly saves you from very real disasters. Loading environment variables at runtime lets you:

  • Avoid committing secrets to version control
  • Run the same code across dev, staging, and production with different configs
  • Rotate credentials without redeploying code
  • Open-source the project without leaking access keys

If you want to level up config and validation in Python (typed settings, env validation, APIs), check out Practical Pydantic.
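One cheap upgrade worth considering here: fail fast when a required variable is missing, instead of letting paramiko or boto3 throw a confusing error mid-run. The require_env helper below is an illustrative sketch, not part of the published script:

```python
import os

def require_env(names):
    """Return the requested settings, exiting with a clear message if any are unset."""
    missing = [n for n in names if not os.getenv(n)]
    if missing:
        raise SystemExit("Missing required environment variables: " + ", ".join(missing))
    return {n: os.environ[n] for n in names}

# Called right after load_dotenv(), e.g.:
# cfg = require_env(["HETZNER_HOST", "HETZNER_USER", "HETZNER_PASSWORD", "S3_BUCKET"])
```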

State Database Initialization

def init_db():
    conn = sqlite3.connect(STATE_DB)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path TEXT PRIMARY KEY,
            size INTEGER,
            mtime INTEGER,
            uploaded_at INTEGER
        )
    """)
    conn.commit()
    return conn

A local SQLite database keeps track of every file that’s already been archived using a simple fingerprint: (path, size, mtime). For each upload, it records the file path (as the primary key), its size in bytes, last modified timestamp, and when it was successfully transferred.

If the script stops midway, or you run it again later, it immediately knows what’s already been handled and what still needs attention.

SQLite gives you a zero-config database with no server to install or maintain. The schema is intentionally minimal:

  • path as the PRIMARY KEY guarantees uniqueness and fast lookups
  • size + mtime act as a reliable change detector
  • uploaded_at gives you an audit trail for debugging or reporting

Using CREATE TABLE IF NOT EXISTS also makes initialization idempotent: you can run the script repeatedly without errors, and the state it records prevents duplicate uploads and wasted bandwidth. Just clean, incremental backups every time.

Smart Duplicate Detection

def already_uploaded(conn, path, size, mtime):
    row = conn.execute(
        "SELECT size, mtime FROM files WHERE path=?",
        (path,)
    ).fetchone()
    return row == (size, mtime)

This function checks the state database to see whether a file with the same path, size, and last modification time has already been uploaded, allowing the script to intelligently skip anything that hasn’t changed.

That one-line tuple comparison, row == (size, mtime), quietly handles all the important cases:

  • If the path isn’t in the database → None == (size, mtime) → False → upload it
  • If the path exists but size or mtime changed → (old_size, old_mtime) == (new_size, new_mtime) → False → upload it
  • If the path exists with identical size and mtime → True → skip it

This makes incremental backups dramatically more efficient. After the initial full run, future executions may end up skipping the vast majority of files. Since the database is updated after each successful transfer, the process is also interruption-safe: if something stops halfway through, it simply resumes where it left off.

State Persistence After Upload

def mark_uploaded(conn, path, size, mtime):
    conn.execute(
        "REPLACE INTO files VALUES (?, ?, ?, ?)",
        (path, size, mtime, int(time.time()))
    )
    conn.commit()

After a file is successfully uploaded, this function stores its metadata in the state database, creating a persistent record that prevents the same unchanged file from being transferred again in future runs.

Using REPLACE instead of a plain INSERT neatly covers both scenarios:

  • If the file path already exists → update the existing record
  • If it doesn’t → insert a new one

The immediate commit() is what makes this reliable. If the script crashes right after this step, the upload is already recorded, so the next run won’t waste time or bandwidth reprocessing it.

This gives you interruption safety at the file level. Each commit is atomic: you either have a fully written record, or nothing at all. No half-finished states, no ambiguity.
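To see the whole state-tracking lifecycle in action, the snippet below exercises the same schema and helpers against a throwaway in-memory database (the file path and fingerprint values are made up for the demo):

```python
import sqlite3
import time

# Same schema and helpers as above, run against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        size INTEGER,
        mtime INTEGER,
        uploaded_at INTEGER
    )
""")

def already_uploaded(conn, path, size, mtime):
    row = conn.execute("SELECT size, mtime FROM files WHERE path=?", (path,)).fetchone()
    return row == (size, mtime)

def mark_uploaded(conn, path, size, mtime):
    conn.execute("REPLACE INTO files VALUES (?, ?, ?, ?)",
                 (path, size, mtime, int(time.time())))
    conn.commit()

# Hypothetical file: first seen, then unchanged, then modified.
assert not already_uploaded(conn, "backups/db.tar.gz", 1024, 111)  # new -> upload
mark_uploaded(conn, "backups/db.tar.gz", 1024, 111)
assert already_uploaded(conn, "backups/db.tar.gz", 1024, 111)      # unchanged -> skip
assert not already_uploaded(conn, "backups/db.tar.gz", 2048, 222)  # changed -> re-upload
mark_uploaded(conn, "backups/db.tar.gz", 2048, 222)                # REPLACE updates in place
assert conn.execute("SELECT COUNT(*) FROM files").fetchone()[0] == 1
```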

SFTP Directory Walker with Path Exclusions

import stat

def _normalize_path(p):
    return p.replace("//", "/").strip("/") if p and p != "." else ""

def _is_path_excluded(full_path):
    norm = _normalize_path(full_path)
    if not norm:
        return False
    for exclude in EXCLUDE_PATHS:
        if norm == exclude or norm.startswith(exclude + "/"):
            return True
    return False

def walk_sftp(sftp, path):
    for entry in sftp.listdir_attr(path):
        # Skip hidden files and directories
        if entry.filename.startswith('.'):
            continue

        # Normalize path
        if path == ".":
            full = entry.filename
        else:
            full = f"{path}/{entry.filename}".replace("//", "/")

        if stat.S_ISDIR(entry.st_mode):
            if _is_path_excluded(full):
                continue
            yield from walk_sftp(sftp, full)
        else:
            yield full, entry.st_size, entry.st_mtime

This recursive generator walks the full directory tree on your Storage Box over SFTP. It lists directory contents, separates files from folders using mode attributes, skips hidden entries (anything starting with a dot), and ignores any directory that matches, or lives under, a path defined in EXCLUDE_PATHS. That makes it easy to leave out things like backup/old, log folders, or cache directories from your archive.

This behaves a lot like Python’s os.walk(), but for remote storage. A few design choices make it scale cleanly:

  • Generator-based (yield): files are produced one at a time, so memory usage stays flat even with huge directory trees
  • Hidden file filtering: avoids wasting time on system artifacts like .bash_history or .ssh/
  • Path-prefix exclusions: _is_path_excluded() treats entries in EXCLUDE_PATHS as prefixes, so excluding backup/old skips that folder and everything inside it
  • Normalization: _normalize_path() strips extra slashes so path comparisons are consistent
  • Recursive delegation: yield from walk_sftp(...) passes control without building large intermediate lists

The result is a memory-efficient traversal that can handle anything from a few hundred files to millions, without needing to load the full tree into RAM first.

Streaming Upload with Progress

from tqdm import tqdm

# Assumes a module-level boto3 client (s3 = boto3.client("s3", ...)) and the
# metrics dict defined later in the script.
def upload_stream(fileobj, s3_key, size):
    progress = tqdm(
        total=size,
        unit="B",
        unit_scale=True,
        desc=s3_key,
        leave=False
    )

    def callback(bytes_amount):
        progress.update(bytes_amount)
        metrics["bytes_uploaded"] += bytes_amount

    s3.upload_fileobj(
        fileobj,
        Bucket=S3_BUCKET,
        Key=s3_key,
        ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
        Callback=callback
    )
    progress.close()

This function streams files directly from Hetzner SFTP to Amazon S3 Glacier Deep Archive without ever writing them to local disk. The SFTP file object is passed straight to boto3.upload_fileobj(), which handles chunked uploads internally. Real-time progress bars are updated via a callback after each chunk, giving visibility into transfer speed and completion.

This is the core of the streaming architecture. Key points:

  • upload_fileobj(): Streams any file-like object (including SFTP handles) to S3, no temporary disk storage required
  • StorageClass='DEEP_ARCHIVE': Saves cost by writing directly to the cheapest archival tier, avoiding Standard tier fees and lifecycle transitions
  • Callback=callback: Updates progress bar and metrics["bytes_uploaded"] after each chunk (~8MB)
  • leave=False: Keeps the console clean by removing progress bars when done

This method handles gigabyte-scale files on machines with only a few megabytes of RAM. The callback pattern provides live feedback without blocking the upload, and the metrics dictionary allows cumulative tracking for reporting purposes.
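If the Callback mechanics seem abstract, the contract is simple: boto3 invokes your callable with the number of bytes just sent after each chunk. The loop below simulates that locally, with no boto3 or tqdm required (the 20 MB “file” is illustrative; 8 MB is boto3’s default multipart chunk size):

```python
# Simulating boto3's Callback contract: after each chunk, the SDK calls your
# callable with the number of bytes just transferred.
metrics = {"bytes_uploaded": 0}

def callback(bytes_amount):
    metrics["bytes_uploaded"] += bytes_amount

CHUNK = 8 * 1024 * 1024   # boto3's default multipart chunk size (8 MB)
total = 20 * 1024 * 1024  # a hypothetical 20 MB file

sent = 0
while sent < total:
    n = min(CHUNK, total - sent)
    callback(n)           # what upload_fileobj does internally after each chunk
    sent += n

assert metrics["bytes_uploaded"] == total  # 8 MB + 8 MB + 4 MB
```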

Main Orchestration

def main():
    conn = init_db()

    transport = paramiko.Transport((HETZNER_HOST, HETZNER_PORT))
    transport.connect(username=HETZNER_USER, password=HETZNER_PASSWORD)
    sftp = paramiko.SFTPClient.from_transport(transport)

    try:
        log.info(f"Starting backup from: {REMOTE_PATH}")
        for path, size, mtime in walk_sftp(sftp, REMOTE_PATH):
            if REMOTE_PATH == ".":
                rel = path
            else:
                rel = path.replace(REMOTE_PATH, "").lstrip("/")
            s3_key = f"{S3_PREFIX}{rel}"

            if already_uploaded(conn, path, size, mtime):
                log.info(f"SKIP  {path}")
                metrics["skipped"] += 1
                continue

            log.info(f"UPLOAD {path} -> s3://{S3_BUCKET}/{s3_key}")

            with sftp.open(path, "rb") as f:
                upload_stream(f, s3_key, size)

            mark_uploaded(conn, path, size, mtime)
            metrics["uploaded"] += 1
    finally:
        sftp.close()
        transport.close()
        conn.close()

The main() function orchestrates the entire pipeline: it connects to Hetzner via SFTP and AWS via boto3, initializes the SQLite state database, and iterates over every file discovered by the SFTP walker. The S3 key is built from S3_PREFIX plus the path relative to REMOTE_PATH (so backing up from a subdirectory doesn’t duplicate that prefix in the archive). For each file:

  • Checks if it’s already uploaded using the state database
  • Skips unchanged files (logging and incrementing the skip counter)
  • Streams new or modified files directly to Amazon S3 Glacier Deep Archive
  • Updates the database after successful upload
  • Tracks metrics like files uploaded, skipped, and bytes transferred

Why it’s built this way:

  • Connection management: Opens SFTP and DB connections at the start; try/finally ensures they’re closed even if something goes wrong
  • Generator-driven: walk_sftp() feeds one file at a time, keeping memory usage constant
  • Early state check: already_uploaded() prevents unnecessary file reads and saves bandwidth
  • Context managers: with sftp.open(...) safely closes remote file handles
  • Immediate commits: mark_uploaded() writes to the database after each file, making the system resilient to crashes or interruptions

The design favours simplicity over complexity: no threads, no async, no elaborate state machines.

This keeps debugging straightforward and failure modes predictable. At the end, a summary report logs total files uploaded, skipped, bytes transferred, and total runtime.

Progress Tracking with Metrics

metrics = {
    "uploaded": 0,
    "skipped": 0,
    "bytes_uploaded": 0,
    "start_time": time.time(),
}

# ... at the end:
duration = time.time() - metrics["start_time"]
log.info("==== SUMMARY ====")
log.info(f"Uploaded files : {metrics['uploaded']}")
log.info(f"Skipped files  : {metrics['skipped']}")
log.info(f"Bytes uploaded : {metrics['bytes_uploaded']:,}")
log.info(f"Duration (sec) : {int(duration)}")

A global metrics dictionary tracks operational statistics throughout execution: files uploaded, files skipped, total bytes transferred, and the start time for duration calculation. The callback in upload_stream() accumulates bytes, while the main loop increments the counters. At the end, a summary report is logged:

==== SUMMARY ====
Uploaded files : 42
Skipped files  : 1,583
Bytes uploaded : 15,234,567,890
Duration (sec) : 340

This makes it easy to track incremental efficiency over time. After the first full backup, subsequent runs should show high skip counts and low upload counts, exactly what you want. These metrics help you verify the script is working correctly and monitor bandwidth usage.


Cost Analysis

Using a hybrid approach, Hetzner for active backups and Amazon S3 Glacier Deep Archive for cold storage, gives enterprise-grade durability at a fraction of standard cloud costs.

Monthly Storage Costs (10TB Example)

| Service | Cost | Role |
| --- | --- | --- |
| Hetzner Storage Box BX31 (10TB) | €28.86 | Active, frequently accessed backups |
| AWS S3 Glacier Deep Archive (10TB) | $10.10 | Cold, geo-redundant disaster recovery |
| Total | €39.96 (~$44) | Hybrid storage solution |
| Per-TB cost | €4.00/TB (~$4.40/TB) | Unit economics |

Comparison:

  • AWS S3 Standard: $230/month
  • Backblaze B2: $50/month (+ egress fees)
  • Google Cloud Archive: $12/month (higher retrieval costs)

Hybrid tiering delivers reliability at a fraction of S3 Standard pricing.

Transfer & API Costs

  • Hetzner → AWS: Free egress & ingress
  • S3 PUT Requests: ~$5 for 100,000 files initially; ~$0.25/month for incremental changes (~5%)
  • Impact: Negligible compared to storage savings

Retrieval Costs (Deep Archive Caveat)

  • Retrieval: $0.02/GB
  • Restoration: 12–48 hours
  • Temporary storage in S3 Standard: $0.023/GB/month

Example: Restoring 1TB costs ~$20 in retrieval fees plus ~$23/month if kept temporarily. Deep Archive is designed for “write-once, rarely-read” disaster recovery.
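The retrieval figures above come from straightforward arithmetic, using the article’s prices and treating 1 TB as 1,000 GB:

```python
# Reproducing the 1 TB restore example with the article's prices.
RETRIEVAL_PER_GB = 0.02           # Deep Archive retrieval, $/GB
S3_STANDARD_PER_GB_MONTH = 0.023  # temporary S3 Standard storage, $/GB/month

gb = 1 * 1000                     # 1 TB as 1,000 GB
retrieval_cost = gb * RETRIEVAL_PER_GB                  # ~$20 one-off
temp_storage_per_month = gb * S3_STANDARD_PER_GB_MONTH  # ~$23/month while restored

print(f"Retrieval: ${retrieval_cost:.2f} one-off; "
      f"temporary storage: ${temp_storage_per_month:.2f}/month")
```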

This hybrid setup gives the best of both worlds: affordable, accessible active storage and ultra-durable, geo-redundant archival backup, without breaking the bank.


Conclusion

Backing up from a Hetzner Storage Box to Amazon S3 Glacier Deep Archive gives you a powerful, cost-effective solution with:

  • Affordability: ~€4/TB/month (~$4.40/TB) using hybrid tiered storage
  • Durability: 11-nines (99.999999999%) with built-in geo-replication
  • Efficiency: Incremental, streaming transfers with zero local disk usage
  • Reliability: Interrupt-safe, state-persistent, fully logged operations
  • Automation: Fully hands-off, schedule via cron, Task Scheduler, or any job runner
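For the “schedule via cron” option, a minimal crontab entry might look like this, assuming the script lives at /opt/hetzner-archive/backup.py (the path and timing are illustrative):

```shell
# Edit the current user's crontab with `crontab -e`, then add:
# run the archive job every night at 02:00, appending output to a log
0 2 * * * cd /opt/hetzner-archive && /usr/bin/python3 backup.py >> cron.log 2>&1
```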

Whether for personal data, compliance, or enterprise disaster recovery, this pipeline delivers enterprise-grade reliability at a fraction of typical cloud costs.

The code is production-ready, tested, and designed to run unattended. By combining Hetzner’s affordability with AWS Glacier Deep Archive’s durability, you get the best of both worlds.

Full code and documentation: https://github.com/nunombispo/hetzner-to-aws-s3-glacier-backups

More on Python, Hetzner and automation: Hetzner Storage Box backup, Django on Hetzner + Dokku, and Python One-Liners.



Tagged in:

Python, Hetzner, AWS S3, Glacier

Last Update: February 23, 2026
