Ever dreamt of creating your own programming language, but figured that was something only compiler geeks or professors could pull off?

Think again.

In this article, you’ll learn how to write your own toy programming language in a single weekend, using nothing but Python and a bit of brainpower. No compilers, no scary grammar tools, just regular Python code, a few re patterns, and a dose of curiosity.

You won’t be building the next JavaScript or Rust (yet), but you will build a working interpreter that can understand code like this:

let x = 10;
print(x * 2 + 1);

And the best part? You’ll understand how it works, from converting text into tokens, to building an Abstract Syntax Tree (AST), to walking that tree to evaluate results. It’s like writing a mini-brain for your language, and it’s deeply satisfying.

Let’s get started. Your language awaits.

The full source code is available at the end of the article.



Step 1: Design Your Language

Before we write a single line of Python code for our new language interpreter, we need to answer a simple question:

What kind of language are we building?

We’re not aiming to replace Python or create a full-fledged compiler. Our goal is to create a simple, interpreted, expression-based language that supports:

  • Variable declarations using let
  • Basic arithmetic (+, -, *, /)
  • A built-in print() function
  • Script-style execution (no functions or conditionals, at least not yet)

Let's review the steps necessary to create a language:

[Figure: Programming Language Steps]

In step 1, we look at the source code.

Syntax Design

Here's the minimal syntax we'll support:

let x = 5;
let y = x + 10;
print(y);

In English, this means:

  • Declare a variable x and set it to 5
  • Declare another variable y, set it to x + 10
  • Print the value of y

Each statement ends with a semicolon (;), similar to JavaScript or C.
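As a rough mental model, the three statements above behave like the following Python, assuming the interpreter stores variables in a dictionary. The name env is hypothetical and used only for illustration; the actual interpreter is built in later steps.

```python
# Hypothetical variable store; an assumption for illustration only.
env = {}

env["x"] = 5              # let x = 5;
env["y"] = env["x"] + 10  # let y = x + 10;
print(env["y"])           # print(y);  -> prints 15
```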

Grammar Overview

We’ll need a rough idea of the grammar to build a parser later. Here’s a simplified version:

program      ::= statement*
statement    ::= "let" IDENTIFIER "=" expression ";" 
               | "print" "(" expression ")" ";"
expression   ::= term (("+" | "-") term)*
term         ::= factor (("*" | "/") factor)*
factor       ::= NUMBER | IDENTIFIER | "(" expression ")"

This grammar:

  • Is written in EBNF-style notation (Extended Backus-Naur Form)
  • Defines how statements and expressions are structured
  • Handles operator precedence: * and / are grouped in term, which sits below expression, so they are evaluated before + and -
  • Supports grouping with parentheses

Don’t worry if this looks unfamiliar. We’ll break this down step-by-step as we build the tokenizer, parser, and interpreter.

Just keep in mind that this grammar covers only two statement forms, variable declaration and printing, plus the arithmetic expressions that appear inside them.
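To see how the expression, term, and factor rules produce precedence, here is a minimal sketch that mirrors each rule with one function. It works on a plain list of strings rather than real tokens and returns nested tuples instead of AST nodes; the function names are hypothetical and this is not the parser built later in the article.

```python
def parse_expression(tokens, pos=0):
    # expression ::= term (("+" | "-") term)*
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)
    return node, pos


def parse_term(tokens, pos):
    # term ::= factor (("*" | "/") factor)*
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        node = (op, node, right)
    return node, pos


def parse_factor(tokens, pos):
    # factor ::= NUMBER | IDENTIFIER | "(" expression ")"
    token = tokens[pos]
    if token == "(":
        node, pos = parse_expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    return token, pos + 1


tree, _ = parse_expression(["x", "*", "2", "+", "1"])
print(tree)  # ('+', ('*', 'x', '2'), '1')

grouped, _ = parse_expression(["x", "*", "(", "2", "+", "1", ")"])
print(grouped)  # ('*', 'x', ('+', '2', '1'))
```

Because parse_expression only ever combines whole terms, the multiplication is nested deeper in the result and is therefore evaluated first. Parentheses re-enter parse_expression from parse_factor, which is how grouping overrides the default precedence.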


Step 2: Tokenizer (Lexer)

Now that we’ve defined our language’s syntax, it’s time to build the first real component: a tokenizer, also known as a lexer.

Let's review the steps necessary to create a language:

[Figure: Programming Language Steps]

In step 2, we take a look at the tokenizer.

What Is a Tokenizer?

A tokenizer breaks your source code (plain text) into a sequence of meaningful tokens, small labelled pieces like keywords, identifiers, numbers, and symbols.

For example, given this line of code:

let x = 5 + 2;

The tokenizer should return something like:

[
  ('LET', 'let'),
  ('IDENT', 'x'),
  ('EQUALS', '='),
  ('NUMBER', '5'),
  ('PLUS', '+'),
  ('NUMBER', '2'),
  ('SEMICOLON', ';')
]

These tokens make it easier for the parser (in step 3) to understand what’s happening.

Building the Tokenizer in Python

We’ll use Python’s built-in re (regular expressions) module to match patterns for each token type.

Let’s define the token types and write a simple lexer:

import re

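# Define token types and regex patterns.
# Order matters: the first pattern that matches wins, so keywords come before IDENT.
TOKEN_TYPES = [
    ('LET',      r'let\b'),    # \b stops 'let' from matching inside names like 'letter'
    ('PRINT',    r'print\b'),  # likewise for names like 'printer'
    ('NUMBER',   r'\d+'),
    ('IDENT',    r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('EQUALS',   r'='),
    ('PLUS',     r'\+'),
    ('MINUS',    r'-'),
    ('TIMES',    r'\*'),
    ('DIVIDE',   r'/'),
    ('LPAREN',   r'\('),
    ('RPAREN',   r'\)'),
    ('SEMICOLON',r';'),
    ('SKIP',     r'[ \t]+'),   # ignore spaces and tabs
    ('NEWLINE',  r'\n'),
]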

Now let’s write the function to match and extract these tokens:

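def tokenize(code):
    """Convert source text into a list of (token_type, text) tuples."""
    tokens = []
    index = 0

    while index < len(code):
        match = None
        # Try each pattern in order at the current position; the first match wins.
        for token_type, pattern in TOKEN_TYPES:
            # re caches compiled patterns, so compiling here is cheap after the first pass.
            regex = re.compile(pattern)
            match = regex.match(code, index)
            if match:
                text = match.group(0)
                # Whitespace and newlines separate tokens but are not tokens themselves.
                if token_type not in ('SKIP', 'NEWLINE'):
                    tokens.append((token_type, text))
                index = match.end(0)
                break
        if not match:
            raise SyntaxError(f'Unexpected character: {code[index]}')
    return tokens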

Example

Let’s test it:

code = "let x = 5 + 2;"
print(tokenize(code))

Output:

[('LET', 'let'), ('IDENT', 'x'), ('EQUALS', '='), ('NUMBER', '5'), ('PLUS', '+'), ('NUMBER', '2'), ('SEMICOLON', ';')]

You’ve got a working tokenizer!
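As an alternative, the patterns can be combined into a single compiled regex with named groups, an approach similar to the tokenizer example in the Python re documentation. The sketch below is a variation, not the version used in the rest of the article; the names TOKEN_SPEC, MASTER_RE, KEYWORDS, and tokenize_once are hypothetical. It treats keywords as identifiers first and reclassifies them with a lookup, so a name such as letter stays a single identifier.

```python
import re

# Hypothetical token table; MISMATCH catches any character no other rule accepts.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("EQUALS", r"="),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("TIMES", r"\*"),
    ("DIVIDE", r"/"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP", r"[ \t\n]+"),
    ("MISMATCH", r"."),
]

# One alternation of named groups, compiled once.
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

# Words that look like identifiers but are reserved by the language.
KEYWORDS = {"let": "LET", "print": "PRINT"}


def tokenize_once(code):
    tokens = []
    for match in MASTER_RE.finditer(code):
        kind = match.lastgroup  # name of the group that matched
        text = match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"Unexpected character: {text!r}")
        if kind == "IDENT":
            kind = KEYWORDS.get(text, "IDENT")
        tokens.append((kind, text))
    return tokens


print(tokenize_once("let letter = 5 + 2;"))
```

The trade-off is mostly one of style: the loop-based version is easier to step through, while the single-regex version scans the input in one pass and keeps keyword handling in one place.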


Step 3: Building a Parser (AST Generator)

Now that we can tokenize our code, it’s time to make sense of those tokens. This is where the parser comes in.

Let's review the steps necessary to create a language:

[Figure: Programming Language Steps]

In step 3, we look at the parser and the AST.

What Is a Parser?

A parser reads the list of tokens and builds an Abstract Syntax Tree (AST), which is a structured, hierarchical representation of the code.

Take this input:

let x = 5 + 2;

The tokenizer gives us:

[('LET', 'let'), ('IDENT', 'x'), ('EQUALS', '='), ('NUMBER', '5'), ('PLUS', '+'), ('NUMBER', '2'), ('SEMICOLON', ';')]

The parser turns this into an AST like:

[
    LetStatement(
        name="x",
        value=BinaryOp(
            left=Number(value=5),
            op="+",
            right=Number(value=2)
        )
    )
]

Let’s build that.

Define AST Nodes

We’ll define a few Python classes to represent different AST node types:

class Number:
    def __init__(self, value):
        self.value = int(value)

    def __repr__(self):
        return f"Number(value={self.value})"

class Identifier:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Identifier(name={self.name})"

class BinaryOp:
    def __init__(self, left, op, right):
        self.left = left
        self.op = op
        self.right = right

    def __repr__(self):
        return f"BinaryOp(left={self.left}, op={self.op}, right={self.right})"

class LetStatement:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __repr__(self):
        return f"LetStatement(name={self.name}, value={self.value})"

class PrintStatement:
    def __init__(self, expr):
        self.expr = expr

    def __repr__(self):
        return f"PrintStatement(expr={self.expr})"
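If the hand-written __init__ and __repr__ methods feel repetitive, the same node shapes can be sketched with dataclasses, which generate both methods (plus equality comparison) automatically. This is an optional variation under the assumption that nodes are plain data holders; unlike the Number class above, it does not convert its argument with int(), so the caller passes an integer directly.

```python
from dataclasses import dataclass


@dataclass
class Number:
    value: int


@dataclass
class Identifier:
    name: str


@dataclass
class BinaryOp:
    left: object
    op: str
    right: object


@dataclass
class LetStatement:
    name: str
    value: object


# Build the AST for `let x = 5 + 2;` by hand.
tree = LetStatement(
    name="x",
    value=BinaryOp(left=Number(5), op="+", right=Number(2)),
)
print(tree)
# LetStatement(name='x', value=BinaryOp(left=Number(value=5), op='+', right=Number(value=2)))
```

The generated equality check is handy when testing a parser, since an expected tree can be compared to the actual one with a single ==.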

Create the Parser Class

We'll make a simple recursive descent parser that consumes tokens one by one and builds AST nodes.