
JAX, Static-Shape Programming and Polyhedron

2025-08-20T00:00:00+00:00

Rectangles are fine. Weird shapes are fun.

JAX wants static shapes.
Your loops, alas, are sometimes not rectangles.

This post tours the pain -> coping strategies -> a tiny helper I wrote called HedraX, which lets you index arbitrary polyhedral domains in JAX without summoning five GPTs.

This is Part 1: we’ll build intuition with hand-rolled code and end with HedraX’s table indexer.
In Part 2, I’ll show a more “closed-form” approach HedraX can auto-generate for suitable domains.

Rectangles are Boring

In JAX, we often translate a Python-like loop like

for i in range(N):
    for j in range(N):
        a[i, j] = f(i, j)

into

a = jax.vmap(                          # outer vmap: maps over i (rows)
      jax.vmap(f, in_axes=(None, 0)),  # inner vmap: maps over j (columns)
      in_axes=(0, None)
    )(jnp.arange(N), jnp.arange(N))    # a[i, j] == f(i, j)

This translation works fine for rectangular domains. But suppose we want the lower triangle:

for i in range(N):
    for j in range(i + 1):
        a[i, j] = f(i, j)

At first glance this looks “dynamic” because the inner bound on j depends on i. Can we do this in JAX with static shapes?

The answer is yes.

The Heroic (but Fragile) Closed-Form for Triangles

Although the domain isn’t rectangular, it is statically sized: it has N * (N + 1) // 2 points.

We can biject a linear index k to (i, j) and iterate over k:

We can picture the domain as a triangle, and we assign each point a linear index k in the order of enumerating the rows and columns.

And here is the JAX code that implements this idea:

# Lower triangle: j in [0, i] (including the diagonal)
# k ranges over 0..T_N - 1, where T_m = m(m+1)/2 (so there are T_N points in total)
def body(a, k):
    # Solve for row i from k using the quadratic formula
    i = jnp.floor((jnp.sqrt(8.0 * k + 1.0) - 1.0) / 2.0).astype(jnp.int32)
    Ti = (i * (i + 1)) // 2        # T_i
    j = (k - Ti).astype(jnp.int32) # j in [0, i]
    a = a.at[i, j].set(f(i, j))
    return a, None

K = N * (N + 1) // 2
a, _ = lax.scan(body, a0, jnp.arange(K))

This works and is reasonably fast, but the math is bespoke. You also won’t want to re-derive a closed-form quadratic formula for every odd-shaped loop you meet.
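One way to gain confidence in a bespoke mapping like this is to check it on the host with exact integer arithmetic before trusting the floating-point version inside the scan. The sketch below is plain Python rather than JAX, and `unravel_triangle` and `check_triangle_bijection` are hypothetical helper names, not part of the code above. It uses `math.isqrt` in place of `jnp.sqrt`, so it sidesteps the rounding a float32 square root may introduce once `8 * k + 1` is too large to be represented exactly.

```python
import math
from typing import Tuple

def unravel_triangle(k: int) -> Tuple[int, int]:
    """Map a linear index k to (i, j) in the lower triangle (diagonal included)."""
    # Largest i with T_i = i * (i + 1) / 2 <= k, using an exact integer square root.
    i = (math.isqrt(8 * k + 1) - 1) // 2
    j = k - i * (i + 1) // 2
    return i, j

def check_triangle_bijection(N: int) -> bool:
    """Compare the closed form against the nested-loop enumeration order."""
    expected = [(i, j) for i in range(N) for j in range(i + 1)]
    actual = [unravel_triangle(k) for k in range(N * (N + 1) // 2)]
    return actual == expected
```

If `check_triangle_bijection(N)` holds for the sizes you care about, the remaining risk is only in the floating-point evaluation of the same formula.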

The “fine, I’ll just precompute it” Route

Another approach: explicitly enumerate the valid lattice points into a table and scan over that table.

import jax
import jax.numpy as jnp
from jax import lax

def build_coords_triangle(N):
    # Lower triangle (including the diagonal)
    # Store linear addresses k = i * N + j
    pts = [i * N + j for i in range(N) for j in range(i + 1)]
    return jnp.asarray(pts, dtype=jnp.int32)

addresses = build_coords_triangle(N)  # shape: (K,)
def body(a, k):
    i, j = k // N, k % N
    a = a.at[i, j].set(f(i, j))
    return a, None

a, _ = lax.scan(body, a0, addresses)

This is conceptually simple but:

adds an address table (memory),
adds an extra read per iteration,
still asks you to hand-enumerate the domain.

What if your domain is… less cozy?

From Triangles to “Whatever”

Consider the polygonal domain $\mathcal{D} = \{ (i, j) \in \mathbb{Z}^2 \mid \; 5j - i - 8 \ge 0,\; -3i - 6j + 39 \ge 0,\; 4i + j - 10 \ge 0 \}$, which looks like this:

How would we implement a build_coords_triangle-style function for this domain? It’s not obvious.

A simple approach is to bound the domain by a rectangle and reject points outside the domain, as shown by the dashed rectangle in the figure above.

But, in higher dimensions:

bounding boxes get tedious,
rejection gets expensive.
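For the 2-D domain above, the bounding-box-and-reject idea can be sketched in plain Python on the host. The box bounds are an assumption derived by hand: intersecting the three boundary lines pairwise gives the corners (2, 2), (7, 3) and (1, 6), hence 1 <= i <= 7 and 2 <= j <= 6. The names `in_domain` and `build_coords_polygon` are hypothetical and simply mirror `build_coords_triangle`.

```python
from typing import List, Tuple

def in_domain(i: int, j: int) -> bool:
    """The three half-plane constraints that define the polygon."""
    return (5 * j - i - 8 >= 0) and (-3 * i - 6 * j + 39 >= 0) and (4 * i + j - 10 >= 0)

def build_coords_polygon(i_range=(1, 7), j_range=(2, 6)) -> List[Tuple[int, int]]:
    """Enumerate the bounding box (inclusive bounds) and keep points inside the domain."""
    pts = []
    for i in range(i_range[0], i_range[1] + 1):
        for j in range(j_range[0], j_range[1] + 1):
            if in_domain(i, j):
                pts.append((i, j))
    return pts

points = build_coords_polygon()
```

Here 14 of the 35 box points survive the rejection test. Wrapping `points` in `jnp.asarray(points, dtype=jnp.int32)` should give a table that `lax.scan` can iterate over, as in the triangle example.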

Introducing HedraX

Happily, the problem of parametric polyhedral enumeration has been studied to death (Verdoolaege et al., 2007; Klöckner, 2014; Verdoolaege, 2010).
It powers polyhedral compilation in systems like LLVM/MLIR.

I wrapped just enough of that machinery into a tiny helper: HedraX, specifically built for the use case of static-shape programming in JAX.¹

TL;DR: Tell HedraX your domain; it builds the address table for you and gives you an unravel to recover multi-indices.

The Triangle Example

import hedrax as hdx
from jax import lax

addresses, unravel = hdx.compile_table_indexer(
    "[N] -> { [i, j] : 0 <= j <= i < N }",
    N=10
)

def body(a, k):
    i, j = unravel(k)
    a = a.at[i, j].set(f(i, j))
    return a, None

a, _ = lax.scan(body, a0, addresses)

Crazy domain? Just change the set:

addresses, unravel = hdx.compile_table_indexer(
    "[N] -> { [i, j] : 5j - i - 8 >= 0 and -3i - 6j + 39 >= 0 and 4i + j - 10 >= 0 }",
    N=10
)

The GPT Unicorn

With the table indexer in HedraX, you can even do unions of polyhedra.

For example, here is a ChatGPT-generated unicorn built as a union of convex polyhedra:

Voilà!

What About the Quadratic-Solving Approach?

hdx.compile_table_indexer automates the “precompute the table” route from earlier in this post.
It doesn’t produce the neat closed-form mapping we derived by hand for the triangle, but in Part 2 I’ll show how HedraX can derive such closed forms automatically when the domain admits them.

¹ A lot of the credit under the hood of HedraX goes to islpy (Klöckner, 2014), a Python binding for the isl library for manipulating parametric polyhedra.

On The Computability of Parametric Inversion

2024-12-20T00:00:00+00:00

Parametric inversion, introduced in (Tavares & Solar-Lezama, 2016), generalizes the classical notion of function inversion to non-invertible functions by introducing a parameterized function that selects specific elements from the preimage of a given function. This approach enables inverting functions that are not bijective, opening up new possibilities for practical applications.

In (Tavares & Solar-Lezama, 2016), the authors mentioned that

In contrast to a conventional inverse, a parametric inverse always exists.

While this holds mathematically, the computability of parametric inverses presents additional challenges. Computability is essential for algorithmic applications, determining whether such inverses can be constructed and utilized effectively. In this document, I examine the computability of parametric inverses within the framework of Type-2 Computability (Weihrauch, 2000). I show that a computable function need not have a computable parametric inverse, illustrating a limitation of this concept.

Parametric Inversion

Definition. For a function $f: X \to Y$, a function $f^{-1} : Y \times \Theta \to X$ is a parametric inverse of $f$ if, for all $y \in Y$:

\[\{ f^{-1}(y, \theta) \mid \theta \in \Theta \} = \{ x \in X \mid f(x) = y \}.\]

Here, $\theta$ serves as a parameter to select specific elements from the preimage of $f$, ensuring full coverage of the preimage for each $y$.

Mathematically, a parametric inverse always exists for any function $f$. To construct one:

For each $y$, choose an arbitrary element $x^*_y \in \{ x \in X \mid f(x) = y \}$.
Let $\Theta = X$.
Define a trivial parametric inverse as:
\[f^{-1}(y, \theta) = \begin{cases} \theta & \text{if } f(\theta) = y, \\ x^*_y & \text{otherwise}. \end{cases}\]
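To make the construction concrete, here is a minimal Python sketch on a finite domain. It assumes $X$ is a finite collection we can loop over and that $y$ lies in the image of $f$; `trivial_parametric_inverse` is a hypothetical name, and the choice of $x^*_y$ as the first preimage element encountered is arbitrary.

```python
from typing import Callable, Dict, Hashable, Iterable

def trivial_parametric_inverse(f: Callable, domain: Iterable) -> Callable:
    """Build f^{-1}(y, theta) with Theta = X, following the trivial construction."""
    xs = list(domain)
    # x*_y: an arbitrary (here: first encountered) element of the preimage of y.
    default: Dict[Hashable, object] = {}
    for x in xs:
        default.setdefault(f(x), x)

    def inverse(y, theta):
        # Return theta itself when it maps to y; otherwise fall back to x*_y.
        # Raises KeyError if y is not in the image of f on this domain.
        return theta if f(theta) == y else default[y]

    return inverse

X = range(-3, 4)
relu = lambda x: max(x, 0)
relu_inv = trivial_parametric_inverse(relu, X)
preimage_of_zero = {relu_inv(0, theta) for theta in X}
```

Sweeping `theta` over all of `X` recovers the whole preimage, which is what the definition asks for.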

However, when considering continuous (or computable) functions, ensuring that the parametric inverse is also continuous (or computable) becomes more complex. This distinction leads to interesting limitations, as shown in the example below.

A Computable Function Without a Computable Parametric Inverse

Consider the ReLU function, $\mathrm{relu}: \mathbb{R} \to \mathbb{R}$, defined as:

\[\mathrm{relu}(x) = \max(x, 0).\]

The ReLU function is continuous and computable because $\max(\cdot, \cdot)$ is continuous and computable (Weihrauch, 2000, Theorem 4.3.2).

However, any parametric inverse $\mathrm{relu}^{-1}$ is discontinuous, as shown below.

Proof of Discontinuity

For $x = a$, where $a < 0$, we have $\mathrm{relu}(a) = 0$.
By the definition of parametric inverse, there exists $\theta^* \in \Theta$ such that:
- $\mathrm{relu}^{-1}(0, \theta^*) = a$, and
- $\mathrm{relu}^{-1}(y, \theta^*) = y$ for all $y > 0$.

The second point implies a jump discontinuity at $y = 0$ for $\mathrm{relu}^{-1}(y, \theta^*)$, as shown in the figure:

Since every computable function on $\mathbb{R}$ is continuous (Weihrauch, 2000, Theorem 4.3.1), the discontinuity implies $\mathrm{relu}^{-1}$ is not computable. Thus, the $\mathrm{relu}$ example demonstrates a computable function lacking a computable parametric inverse.

When Is Parametric Inversion Computable?

Functions defined on computably enumerable sets (e.g., integers $\mathbb{Z}$, rationals $\mathbb{Q}$, finite-length strings $\Sigma^*$) admit computable parametric inverses. A simple construction is as follows:

Let $f: X \to Y$ be a computable function mapping between computably enumerable domains $X$ and $Y$. Let $\Theta = \mathbb{N}$. Define $f^{-1} : Y \times \Theta \to X$ with the following pseudo-code:

def make_parametric_inverse(f: Callable[[X], Y]) -> Callable[[Y, Naturals], X]:
    def parametric_inverse(y: Y, theta: Naturals) -> X:
        for x in X:  # Assume X is computably enumerable
            if f(x) == y:  # Equality testing is computable for computably enumerable Y
                if theta == 0:
                    return x
                theta -= 1
    return parametric_inverse

This algorithm ensures computability but is inefficient, relying on exhaustive enumeration.
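A runnable version of the pseudo-code, under the assumption that $X = \mathbb{Z}$ and that the domain is handed in as a generator function, might look like the sketch below. The enumeration order `0, 1, -1, 2, -2, ...` and the names `integers` and `square_inverse` are illustrative choices. Note that on an infinite domain the search does not terminate when `theta` is at least the size of the preimage, so the sketch only behaves well for parameters that actually index a preimage element.

```python
import itertools
from typing import Callable, Iterator

def integers() -> Iterator[int]:
    """Enumerate Z as 0, 1, -1, 2, -2, ..."""
    yield 0
    for n in itertools.count(1):
        yield n
        yield -n

def make_parametric_inverse(f: Callable, enumerate_domain: Callable[[], Iterator]) -> Callable:
    def parametric_inverse(y, theta: int):
        remaining = theta
        for x in enumerate_domain():      # exhaustive search over the enumeration
            if f(x) == y:
                if remaining == 0:
                    return x              # the theta-th preimage element found
                remaining -= 1
        # Only reachable when the enumeration is finite and runs out.
        raise ValueError("fewer than theta + 1 preimage elements")
    return parametric_inverse

square_inverse = make_parametric_inverse(lambda x: x * x, integers)
```

With this enumeration order, `square_inverse(4, 0)` and `square_inverse(4, 1)` select the two square roots of 4.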

For general domains, computability appears to hinge on additional structure of the function, such as a computable enumeration of the function’s local optima, a topic worth further exploration.

Conclusion

Parametric inverses provide a flexible framework for “inverting” non-bijective functions, with guaranteed existence mathematically. However, their computability depends on the function’s domain and properties. As shown, computable functions on reals may lack computable parametric inverses due to discontinuities in the inverse. On the other hand, functions on discrete domains offer a more favorable computability landscape. This nuanced interplay highlights interesting directions for further study within computability theory.

Estimating Fluid Velocity and Diffusion from Temperature Measurements (in Theory)

2024-08-15T00:00:00+00:00

Background

My father, a sensor engineer, recently posed an intriguing question to me: How can we estimate the velocity and thermal diffusion coefficient of a running fluid using only temperature measurements?

While I have taken some computational physics classes, I am not an expert in fluid dynamics or sensor design. However, based on some fundamental physics principles, we can sketch out a theoretical approach that might just work.

Basic Setup

The temperature of a fluid in a long insulated pipe should basically become stationary after a while, assuming the fluid is flowing at a constant velocity. So, to be able to get a signal from the temperature, we need to introduce a heat source at a specific point, say $x = 0$ in the pipe.

A good idea is to drive the heat source with a periodic signal, say $f(t) = A(1 + \sin(\omega_d t))$, so that our temperature sensors can pick up the signal at the same frequency $\omega_d$ and analyze the temperature distribution at that frequency. Intuitively, this should reduce the chance of the temperature sensors picking up noise from the environment, as long as we pick a distinctive drive frequency.

The 1-D Heat Partial Differential Equation

To get started, ChatGPT told me to consider the classic one-dimensional heat equation, which describes how temperature evolves over time in a moving fluid in an infinitely long pipe.

\[\frac{\partial T}{\partial t} = \alpha \frac{\partial^2 T}{\partial x^2} - v \frac{\partial T}{\partial x} + \frac{f(t)}{\rho c} \delta(x)\]

Where:

$\alpha$ is the thermal diffusivity.
$v$ is the velocity of the fluid (what we’re trying to estimate).
$f(t)$ is the heat source, which we’ll assume is $f(t) = A(1 + \sin(\omega_d t))$.
$\rho$ and $c$ are the fluid density and specific heat capacity.
$\delta(x)$ is a Dirac delta function to model the fact that the heat source is located at a point $x = 0$.

I’ll admit, there are practical challenges to implementing this in the real world, especially when it comes to building a sensor. But for now, let’s stick with the math!

Solving the Equation in the Frequency Domain

To make things easier, we switch from the time domain to the frequency domain by applying the Fourier transform. This lets us look at the system’s response at the driving frequency $\omega_d$. The Fourier transform of the temperature $T(x, t)$ at this frequency is $\hat{T}(x)$:

\[\hat{T}(x, \omega) = \int_{-\infty}^{\infty} T(x, t) e^{-i \omega t} \, dt\]

Since we’re interested in the response to the heat source’s drive frequency $\omega_d$, we consider the Fourier transform at this specific frequency, $\hat{T}(x) = \hat{T}(x, \omega_d)$.

By applying the Fourier transform to the heat equation, we get the following equation at frequency $\omega_d$:

\[i \omega_d \hat{T}(x) = \alpha \frac{d^2 \hat{T}}{dx^2} - v \frac{d \hat{T}}{dx} - \frac{A}{\rho c} i \pi \delta(0) \delta(x)\]

For $x \neq 0$, the delta function vanishes, so we are left with the homogeneous part of the equation:

\[\alpha \frac{d^2 \hat{T}}{dx^2} - v \frac{d \hat{T}}{dx} - i \omega_d \hat{T}(x) = 0\]

This is a second-order ordinary differential equation that we can solve. The general solution is:

\[\hat{T}(x) = C_1 e^{\lambda_1 x} + C_2 e^{\lambda_2 x}\]

Where the constants $\lambda_1$ and $\lambda_2$ are:

\[\lambda_{1,2} = \frac{v \pm \sqrt{v^2 + 4 \alpha i \omega_d}}{2 \alpha}\]

Applying Boundary Conditions

Now we apply the boundary conditions. Since we want the solution to remain bounded as $x \to \infty$, we must set $C_1 = 0$ for $x > 0$. Similarly, for $x < 0$, we set $C_2 = 0$ to avoid a diverging solution as $x \to -\infty$.

At $x = 0$, the temperature distribution must be continuous, so we require that $\hat{T}(0^-)$ equals $\hat{T}(0^+)$. This gives us the final form of the solution:

\begin{equation} \label{eq:solution} \hat{T}(x) = C \exp{\left(\frac{v - \text{sign}(x) \sqrt{v^2 + 4 \alpha i \omega_d}}{2 \alpha} x\right)} \end{equation}

Where $C$ is a constant that depends on the heat source amplitude $A$ and the fluid properties $\rho$ and $c$. The sign function $\text{sign}(x)$ is $1$ for $x > 0$ and $-1$ for $x < 0$.

Measuring Temperature and Solving for $v$ and $\alpha$

A good idea here is to place two sensors at different locations and measure $\hat{T}(x_0)$ and $\hat{T}(x_1)$. This way, we can cancel out the constant $C$! This effectively means we use two sensors to “denoise” the signal, removing effects due to the initial conditions and the heat source’s waveform (in practice, the waveform cannot be perfectly sinusoidal).

So, let us place two temperature sensors at different positions along the flow, say at $x_0$ and $x_1$. Once we have temperature measurements at these two locations, we take their discrete Fourier transforms, giving us two complex values $M_0 = \hat{T}(x_0)$ and $M_1 = \hat{T}(x_1)$.

We take the log of the ratio between these measurements:

\[K = \log{\frac{M_1}{M_0}}\]

This $K$ value is a complex number, which consists of two real quantities, and we can use it to solve for two parameters of interest $v$ and $\alpha$ by setting up the equations from the solution \eqref{eq:solution}:

\[K = \frac{v - \text{sign}(x_1) \sqrt{v^2 + 4 \alpha i \omega_d}}{2 \alpha} x_1 - \frac{v - \text{sign}(x_0) \sqrt{v^2 + 4 \alpha i \omega_d}}{2 \alpha} x_0\]

Depending on whether $x_0$ and $x_1$ are both positive, or if one is negative, the solution process will vary slightly. For example, if both positions are positive, the equations reduce to a system that is linear in $v$ and $\alpha$, which can be solved directly:

\[\begin{aligned} v & = \frac{a^2 - b^2}{b^3 + a^2 b} (x_1 - x_0) \, \omega_d \\ \alpha & = \frac{a}{b^3 + a^2 b} (x_1 - x_0)^{2} \, \omega_d \end{aligned}\]

where $a = \Re(K)$ and $b = \Im(K)$. If the sensors are on opposite sides of the source, the solution is a bit more complex and becomes a quadratic system, but it’s still solvable by Mathematica:

\[\begin{aligned} \label{eq:solution-np} v & = \frac{(x_0 + x_1)^2 \left( b^2 (x_1 - x_0) - \lvert a \rvert \sqrt{\left( -4b^2 x_0 x_1 + a^2 (x_0 + x_1)^2 \right)} \right)}{b \left( b^2 (x_0 - x_1)^2 + a^2 (x_0 + x_1)^2 \right) } \omega_d \\ \alpha & = \frac{(x_0 + x_1)^2 \left( a^2 (x_0 + x_1)^2 + \lvert a \rvert (-x_0 + x_1) \sqrt{\left( -4b^2 x_0 x_1 + a^2 (x_0 + x_1)^2 \right)} \right)}{2ab \left( b^2 (x_0 - x_1)^2 + a^2 (x_0 + x_1)^2 \right)} \omega_d \end{aligned}\]
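As a sanity check of the both-sensors-downstream case, the sketch below runs the model forward and then inverts it. It assumes $0 < x_0 < x_1$, so that $K = \lambda (x_1 - x_0)$ with $\lambda = (v - \sqrt{v^2 + 4 \alpha i \omega_d}) / (2 \alpha)$. Substituting $\lambda = K / (x_1 - x_0)$ into $\alpha \lambda^2 - v \lambda - i \omega_d = 0$ and separating real and imaginary parts gives a linear system in $v$ and $\alpha$. The function names and the numerical values are illustrative assumptions only.

```python
import cmath
from typing import Tuple

def log_ratio(v: float, alpha: float, omega_d: float, x0: float, x1: float) -> complex:
    """Forward model: K = log(M1 / M0) for two sensors downstream of the source."""
    lam = (v - cmath.sqrt(v * v + 4j * alpha * omega_d)) / (2 * alpha)
    return lam * (x1 - x0)

def estimate_v_alpha(K: complex, omega_d: float, x0: float, x1: float) -> Tuple[float, float]:
    """Invert K = a + ib for (v, alpha), assuming 0 < x0 < x1."""
    a, b = K.real, K.imag
    d = x1 - x0
    denom = b**3 + a * a * b
    v = (a * a - b * b) / denom * d * omega_d
    alpha = a / denom * d * d * omega_d
    return v, alpha

# Round trip with made-up values (SI units): v = 0.1 m/s, alpha = 1e-3 m^2/s.
K = log_ratio(v=0.1, alpha=1e-3, omega_d=1.0, x0=0.05, x1=0.10)
v_est, alpha_est = estimate_v_alpha(K, omega_d=1.0, x0=0.05, x1=0.10)
```

In a real measurement $K$ would come from the logarithm of the ratio of two DFT bins, so its imaginary part is only known modulo $2\pi$; the round trip above avoids that issue by computing $K$ directly from the model.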

Practical Challenges (And Some Guesswork)

Sensor Placement

One obvious challenge is placing the sensors at the right positions. They need to be at known distances from the heat source, and they should be sensitive enough to pick up small temperature changes at the drive frequency $\omega_d$. Also, we need to make sure the sensors are sampling fast enough to avoid aliasing (more on that below).

Drive Frequency and Aliasing

From some simulations, I found a constraint on the drive frequency $\omega_d$ that needs to be satisfied. Specifically, the following condition must hold:

\[\frac{\omega_d}{\pi} = 2 f_d < \frac{v}{||x_1| - |x_0||}\]

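This essentially means that the drive frequency should be less than half the fluid velocity divided by the distance between the sensors. Though I haven’t fully worked out the details, I am speculating that this is due to Shannon’s sampling theorem. If the drive frequency is too high compared to the fluid velocity, we’ll run into aliasing issues, which could throw off the measurements. A possibly related observation: when diffusion is weak, the phase of the ratio is roughly $\Im(K) \approx -\omega_d (x_1 - x_0) / v$ for two downstream sensors, and the complex logarithm only returns phases within $(-\pi, \pi]$, so once $\omega_d \lvert x_1 - x_0 \rvert / v$ exceeds $\pi$ the phase wraps around and the recovered $v$ is off.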

Real-World Imperfections

The model I’ve used assumes everything is happening in one dimension, but in real life, the system could be more complex. There could be heat dissipating into the environment, non-uniform fluid flow, or other external factors affecting the results. These real-world complications would introduce some uncertainty into the measurements.

I’m not an expert in sensor design, but the theory behind it is sound, and with proper calibration, it seems possible to make this work. However, it’s worth noting that further work would be needed to handle the real-world deviations from the idealized model.

Conclusion

Using temperature measurements to estimate the velocity and thermal diffusivity of a flowing fluid is doable in theory, even though the practical aspects (like sensor design) might be trickier. By leveraging the 1-D heat equation with some Fourier analysis, we can get reasonable estimates for $v$ and $\alpha$. If nothing else, it’s a fun application of physics!

Estimating Fluid Velocity and Diffusion from Temperature Measurements (Part 2, Simulation)

2024-08-15T00:00:00+00:00

Introduction

Torque Analysis of a Motorized Filament Rewinder

2023-12-10T00:00:00+00:00

Introduction

It has long been a goal of mine to build a motorized filament spool holder for my multi-material 3D printer. The idea is to have a motorized spool holder that can automatically rewind the unused filament back to its spool after a filament swap, so that the idling filament doesn’t get tangled mid-print. There are various attempts in the open-source 3D printing community at building such a device, and I also built a simple prototype a while ago:

Renderings of my motorized filament rewinder design. The motor is hidden inside the front roller.

However, one of the key challenges in building such a device is to estimate the torque required to rewind the filament back to the spool. This estimate is essential for choosing the appropriate motor and drive mechanism (e.g., diameter of the drive roller) for the rewinder.

In this post, I’ll analyze the torque requirements for a motorized filament rewinder and discuss the key factors that affect the torque.

Disclaimer: The analysis is based on a spherical-cow model and may not capture all the complexities of the real-world rewinder. However, my hope is that it should provide a good starting point for my motorized filament rewinder.

Torque Analysis

To carry out the torque analysis, I need to model the rewinding process. My analysis here will ignore any friction that is outside the scope of the rewinder itself (e.g., friction in the filament path, air resistance, etc.). The analysis will focus on the mass of the filament spool and the force required to rewind the filament back to the spool at a certain linear acceleration.

Thus, I consider the following disassembly of a typical filament spool to model the mass distribution of the spool:

Disassembly of a typical filament spool.

The disassembly consists of four parts:

The spool core
Two spool disks
The filament

It happens that both the spool core and the filament are hollow cylinders. With a mild approximation by ignoring the patterns on the disks, we can view the disks as “hollow cylinders” as well — the approximation is quite valid given that the disks are thin and weigh little in the entire spool.

The key to the analysis is to compute the moment of inertia of each part of the spool. The moment of inertia of a hollow cylinder is given by:

\[I = \frac{1}{2} m (r_1^2 + r_2^2)\]

where $m$ is the mass of the cylinder, $r_1$ is the inner radius, and $r_2$ is the outer radius.

So I take out calipers and a kitchen scale to measure the dimensions and mass of these spool components:

spool_hole_radius = 54.7 / 2 * 1e-3 # m
spool_rim_radius = 100e-3           # m
spool_disk_weight = 53e-3           # kg (one disk)
spool_core_weight = 44e-3           # kg
spool_core_thickness = 3.5e-3       # m
full_spool_weight = 1               # kg

The outer radius of the filament component changes as filament is used up. Assuming the filament is wound uniformly, we can compute this radius as a function of the remaining filament weight:

min_filament_radius = spool_hole_radius + spool_core_thickness
max_filament_radius = spool_rim_radius  # a full spool is wound out to the rim
def filament_radius_from_weight(filament_weight: float):
  """Given the weight of the filament on the spool, returns the radius of the filament that is left on the spool."""
  return min_filament_radius + (max_filament_radius - min_filament_radius) * filament_weight / full_spool_weight

The overall moment of inertia of the spool is the sum of the moments of inertia of the core, the two disks, and the filament. It is a function of the weight of the filament on the spool:

def moments_of_inertia(filament_weight: float):
  """
  Returns the moments of inertia (kg m^2) of the spool and filament together.
  """
  # the MOI of one side disk
  disk_moi = 1/2 * spool_disk_weight * (spool_rim_radius**2 + spool_hole_radius**2)
  # the MOI of the center ring
  core_moi = 1/2 * spool_core_weight * ((spool_hole_radius + spool_core_thickness)**2 + spool_hole_radius**2)
  # filament MOI
  filament_moi = 1/2 * filament_weight * (filament_radius_from_weight(filament_weight)**2 + min_filament_radius**2)
  # total MOI
  return disk_moi * 2 + core_moi + filament_moi

Finally, we compute the torque required to rewind the filament, at a given filament weight and at a certain linear acceleration. The key equation is:

\[\tau = \frac{r_{\text{roller}}}{r_{\text{rim}} r_{\text{filament}} } I a\]

where $\tau$ is the torque, $I$ is the moment of inertia, $a$ is the linear acceleration, $r_{\text{roller}}$ is the radius of the roller, $r_{\text{rim}}$ is the radius of the spool rim, and $r_{\text{filament}}$ is the radius of the filament on the spool.

def torque_for_acceleration(filament_weight: float, acceleration: float, roller_radius: float):
  """
  Returns the torque (N . m) acting on the rim that is required to accelerate the filament at the given unload acceleration (m/s^2)
  """
  filament_radius = filament_radius_from_weight(filament_weight)
  angular_acceleration = acceleration / filament_radius # rad/s^2
  torque = moments_of_inertia(filament_weight) * angular_acceleration
  return torque * roller_radius / spool_rim_radius

Plotting

I can now plot the torque required to rewind the filament at a certain acceleration, given the weight of the filament on the spool. The ideal acceleration I want to achieve is around 300 mm/s^2, and the roller radius in my rewinder design is 26 mm. Plugging these values into the above function and plotting over the range of filament weights, I get the following torque curve:

So now I know that with my current design, I need a motor that can provide at least ~0.05 kg·cm of torque to rewind the filament!
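The full-spool end of that curve can be reproduced with a short standalone calculation. The sketch below restates the measurements from above and makes two assumptions of its own: a full spool is wound out to the rim radius, and "kg·cm" is read as kilogram-force centimetres, converted with standard gravity. The names `hollow_cylinder_moi` and `full_spool_torque` are hypothetical.

```python
STANDARD_GRAVITY = 9.80665          # m/s^2, used to convert N·m to kgf·cm

spool_hole_radius = 54.7 / 2 * 1e-3  # m
spool_rim_radius = 100e-3            # m
spool_disk_weight = 53e-3            # kg (one disk)
spool_core_weight = 44e-3            # kg
spool_core_thickness = 3.5e-3        # m

def hollow_cylinder_moi(mass: float, r_inner: float, r_outer: float) -> float:
    """Moment of inertia of a hollow cylinder about its axis (kg m^2)."""
    return 0.5 * mass * (r_inner**2 + r_outer**2)

def full_spool_torque(acceleration: float = 0.3, roller_radius: float = 26e-3,
                      filament_weight: float = 1.0) -> float:
    """Roller torque (N·m) for a full spool; assumes filament reaches the rim."""
    core_outer = spool_hole_radius + spool_core_thickness
    inertia = (2 * hollow_cylinder_moi(spool_disk_weight, spool_hole_radius, spool_rim_radius)
               + hollow_cylinder_moi(spool_core_weight, spool_hole_radius, core_outer)
               + hollow_cylinder_moi(filament_weight, core_outer, spool_rim_radius))
    angular_acceleration = acceleration / spool_rim_radius  # rad/s^2 at the filament surface
    return inertia * angular_acceleration * roller_radius / spool_rim_radius

torque_nm = full_spool_torque()
torque_kgf_cm = torque_nm / STANDARD_GRAVITY * 100  # N·m -> kgf·m -> kgf·cm
```

Under these assumptions the full-spool torque comes out just under 0.05 kgf·cm, in line with the figure read off the plot.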

I can now also find out the range of roller radii given a particular motor torque and speed. The speed of the motor is a function of the linear speed of the filament, the radius of the roller, and the radius of the filament on the spool:

def motor_speed(filament_weight: float, speed: float, roller_radius: float):
  filament_radius = filament_radius_from_weight(filament_weight)
  spool_rotational_speed = speed / filament_radius
  return spool_rotational_speed * spool_rim_radius / roller_radius

With the above, we can plot both the required motor torque and the motor speed for a given roller radius:

So for example, if the motor is rated at 0.035 kg·cm of torque at 500 RPM, I should be able to use a roller radius between ~17 mm and ~19 mm.

The code for this analysis is available in this Google Colab notebook.

Enumerating Context-Free Languages and Minimizing Regular Expressions

2021-12-01T00:00:00+00:00

As I work on machine learning algorithms for combinatorial-optimization problems like compiler optimization, one vastly simplified instance of this class of problems is learning to minimize regular expressions. The problem is to learn a function that takes a regular expression as input and outputs a minimal equivalent regular expression that describes the same language. Since this is machine learning, a good starting point is a dataset of regular expressions and their minimal equivalents that can be used directly for supervised learning. To that end, this post describes my approach to generate a dataset of regular expressions and their minimal equivalents.

The key steps are:

Enumerating all regular expressions up to a certain length $n$.
Finding the minimal equivalent for each regular expression by leveraging DFA minimization and hashing.

Background on Regular Expressions and Equivalence

A regular expression over an alphabet $\Sigma$ is a symbolic representation of a regular language, using operations like concatenation, union, and Kleene star. For example, the expression $(a|b)^*$ represents the language of all strings consisting of any number of $a$’s and $b$’s.

Mathematically, the set of all regular expressions over $\Sigma$ can be recursively defined as follows:

The empty set $\emptyset$, the empty string $\epsilon$, and any single character $a \in \Sigma$ are regular expressions.
If $r_1$ and $r_2$ are regular expressions, then $r_1r_2$ (concatenation), $r_1 | r_2$ (union), and $r_1^*$ (Kleene star) are also regular expressions.

The equivalence of two regular expressions $r_1$ and $r_2$ means that they describe the same language:

\[L(r_1) = L(r_2)\]

That is, the set of strings accepted by $r_1$ is identical to that accepted by $r_2$. Unlike general program equivalence (which is undecidable), regular expression equivalence is decidable, making it a good candidate for minimization tasks.

I encountered this problem while working on generating a dataset of regular expressions and their minimal versions. This post describes the methods I used to achieve that goal.

Step 1: Enumerating Regular Expressions

The first step is to systematically generate all regular expressions up to a certain length $n$. This is a challenging combinatorial problem because the space of regular expressions grows exponentially with length.

To efficiently enumerate these expressions, we can represent them using a context-free grammar (CFG). A CFG provides a formal mechanism to define the structure of regular expressions through production rules. For instance, a simplified CFG for regular expressions could look like this:

\[S \rightarrow S + S \, | \, SS \, | \, S^* \, | \, (S) \, | \, a \, | \, b\]

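where $S$ is a non-terminal symbol representing a regular expression, and $a, b$ are terminal symbols (characters from the alphabet). In this grammar, union is written as $+$ so that it does not clash with the $|$ that separates alternative productions.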

The key insight here is that we can enumerate all regular expressions up to a fixed length by expanding these CFG rules recursively. This technique is based on Berstel and Brzozowski (2012), which provides a framework for enumerating regular expressions from the context-free language definition of regular expressions.

Formally, let $G = (V, \Sigma, P, S)$ be a CFG, where:

$V$ is the set of non-terminal symbols,
$\Sigma$ is the set of terminal symbols (our alphabet),
$P$ is the set of production rules,
$S$ is the start symbol.

The goal is to generate all strings in $L(G)$ (the language of the grammar) that have a length $\leq n$. This is essentially done by recursively applying the production rules until we reach strings of terminal symbols.

Here’s the Python code implementing this recursive expansion:

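import dataclasses
import itertools
from typing import Iterator, List

# NonTerminal, Production, String, EnumerateCFGInfo, and partition are helpers not shown here.
@dataclasses.dataclass(frozen=True)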
class CFG:
    start: NonTerminal
    productions: List[Production]

def enumerate_cfg(cfg_info: EnumerateCFGInfo, size: int) -> Iterator[String]:
    """Enumerates all regular expressions up to a fixed length for a CFG."""
    def expand_rec(symb: NonTerminal, size: int) -> Iterator[String]:
        if size == 0:
            if symb in cfg_info.empty_non_terminals:
                yield tuple()
            return
        for rule, weight, num_non_terminals in cfg_info.productions[symb]:
            rem_size = size - weight
            if rem_size >= 0:
                for ns in partition(rem_size, num_non_terminals):
                    yield from itertools.product(*[expand_rec(s, n) for s, n in zip(rule, ns)])
    return expand_rec(cfg_info.grammar.start, size)

This code systematically enumerates all possible regular expressions up to length $n$, using the grammar $G$.
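For readers who want something they can run directly, here is a much smaller, self-contained sketch specialized to the example grammar $S \rightarrow S + S \mid SS \mid S^* \mid (S) \mid a \mid b$. It is not the implementation above: it hard-codes the productions, counts every symbol (including `+`, `*` and parentheses) as length 1, and memoizes by length. The function names are hypothetical. Because the grammar is ambiguous, results are collected in sets to drop duplicate derivations.

```python
from functools import lru_cache
from typing import FrozenSet, Iterator

ALPHABET = ("a", "b")

@lru_cache(maxsize=None)
def regexes_of_length(n: int) -> FrozenSet[str]:
    """All strings of exactly n symbols derivable from S."""
    if n <= 0:
        return frozenset()
    out = set()
    if n == 1:
        out.update(ALPHABET)                                        # S -> a | b
    out.update(r + "*" for r in regexes_of_length(n - 1))           # S -> S*
    out.update("(" + r + ")" for r in regexes_of_length(n - 2))     # S -> (S)
    for k in range(1, n):                                           # S -> S S
        for left in regexes_of_length(k):
            for right in regexes_of_length(n - k):
                out.add(left + right)
    for k in range(1, n - 1):                                       # S -> S + S
        for left in regexes_of_length(k):
            for right in regexes_of_length(n - 1 - k):
                out.add(left + "+" + right)
    return frozenset(out)

def regexes_up_to(n: int) -> Iterator[str]:
    """Yield expressions in order of non-decreasing length."""
    for m in range(1, n + 1):
        yield from sorted(regexes_of_length(m))
```

Yielding in order of non-decreasing length matters for the next step, where the first expression seen for each language is kept as its representative.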

Step 2: Minimizing Regular Expressions Using DFA and Hashing

Once we can enumerate regular expressions, the next step is to find the minimal equivalent for each expression. This is done by:

Converting the regular expression into a DFA (deterministic finite automaton), which is a formal model for recognizing regular languages.
Minimizing the DFA to obtain the smallest possible automaton that recognizes the same language as the original expression. DFA minimization is a process to reduce the number of states in a DFA, while ensuring that the automaton is minimal with respect to recognizing the same language (DFA Minimization).
Hashing the minimized DFA to uniquely identify the language of the expression.

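By hashing the minimal DFA (the last step in the list above), we ensure that equivalent regular expressions (which describe the same language) get the same hash. This relies on the hash being computed from a canonical numbering of the states, since a minimal DFA is unique only up to renaming its states. In practice, I used the PADS library for handling DFAs and regular languages. Here is my fork of the library that supports hashing the DFAs.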

Here is the Python-style pseudocode for this algorithm:

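def generate_dataset(max_length, alphabet):
    minimals = {}  # Maps hash of minimized DFA -> first expression seen for that language
    # Assumes expressions are enumerated in order of non-decreasing length,
    # so the first expression seen for a language is a shortest one.
    for expr in enumerate_regular_expressions(max_length, alphabet):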
        M = convert_to_dfa(expr)  # Convert the regular expression to DFA
        h = hash(minimize_dfa(M))  # Hash of the minimized DFA

        if h in minimals:
            m = minimals[h]  # Retrieve the already stored minimal expression
        else:
            minimals[h] = expr  # Store the current expression as minimal
            m = expr

        yield expr, m  # Return the pair of original and minimal expression
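The hashing step depends on turning a DFA into a representation that is identical for every DFA of the same language. The sketch below shows one way this could be done, and it is not the PADS-based code used for the dataset: it assumes a complete DFA given as a plain transition dictionary, minimizes it with Moore-style partition refinement, and renumbers the resulting blocks by breadth-first search in sorted alphabet order. The name `canonical_signature` and the example automata are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple

State = Hashable
Delta = Dict[Tuple[State, str], State]  # complete transition function

def canonical_signature(start: State, accepting: Set[State], delta: Delta,
                        alphabet: Iterable[str]) -> tuple:
    symbols = sorted(alphabet)

    # 1. Keep only the states reachable from the start state.
    reachable, seen, queue = [start], {start}, deque([start])
    while queue:
        q = queue.popleft()
        for s in symbols:
            t = delta[(q, s)]
            if t not in seen:
                seen.add(t)
                reachable.append(t)
                queue.append(t)

    # 2. Moore-style refinement: split blocks until the partition is stable.
    block = {q: int(q in accepting) for q in reachable}
    while True:
        ids: Dict[tuple, int] = {}
        refined = {}
        for q in reachable:
            key = (block[q],) + tuple(block[delta[(q, s)]] for s in symbols)
            refined[q] = ids.setdefault(key, len(ids))
        stable = len(ids) == len(set(block.values()))
        block = refined
        if stable:
            break

    # 3. Renumber blocks in BFS order from the start block (canonical labels).
    rep: Dict[int, State] = {}
    for q in reachable:
        rep.setdefault(block[q], q)
    order = {block[start]: 0}
    todo = deque([block[start]])
    rows = []
    while todo:
        q = rep[todo.popleft()]
        row = []
        for s in symbols:
            target = block[delta[(q, s)]]
            if target not in order:
                order[target] = len(order)
                todo.append(target)
            row.append(order[target])
        rows.append((q in accepting, tuple(row)))
    return tuple(rows)

# Two DFAs for "strings over {a, b} ending in a", one with a redundant state.
small = canonical_signature(0, {1}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}, "ab")
large = canonical_signature(
    "s", {"p", "q"},
    {("s", "a"): "p", ("s", "b"): "s", ("p", "a"): "q", ("p", "b"): "s",
     ("q", "a"): "p", ("q", "b"): "s"}, "ab")
ends_in_b = canonical_signature(0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, "ab")
```

The returned tuple is hashable, so `hash(canonical_signature(...))` or the tuple itself can serve as the dictionary key in a loop like `generate_dataset`.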

Conclusion

By combining CFG-based enumeration with DFA minimization, we can generate a dataset of regular expressions (up to a specified length limit) and their minimal equivalents. Of course, the size of the dataset grows exponentially with the maximum length, so a machine learning approach that relies on such a dataset is likely only applicable to the toy problems of minimizing short regular expressions. Nonetheless, I found this approach to be a fun exercise in formal language theory and automata theory, and it provides a good starting point for exploring more complex problems in the future.