Softmaxxing: Finding an exp-Free Softmax with Symbolic Regression
All the big LLMs use attention, which boils down at its core to this equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
I'm not going to reiterate how attention works; far more people have done far better jobs than I would.
This function is called for every single token the LLM processes. So what if we tried to make it as fast as we could? To do this we should break it down:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Of course there has been near-endless work on this problem, such as the FlashAttention paper, which implements a far faster version of attention that not only uses less memory but also does the computation iteratively. So we get rid of that bottom loop (the sum in the denominator) and do it per token instead.
This behavior (a result of Milakov & Gimelshein) keeps a running maximum $m$ and a running exponential sum $s$, and for each new logit $x$ it updates them as:

$$m' = \max(m, x)$$
$$s' = s \cdot e^{m - m'} + e^{x - m'}$$
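As a concrete reference, here is a minimal Python transcription of that exact update step (the actual benchmarks in this post use a C++ version; this is just the equations above written out):

```python
import math

def online_softmax_step(m: float, s: float, x: float) -> tuple[float, float]:
    """One exact online-softmax update (Milakov & Gimelshein)."""
    m_new = max(m, x)                                      # running maximum
    s_new = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum, add new term
    return m_new, s_new
```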
This call to $\exp$ is what this project focuses on. On standard consumer-grade CPUs, `std::exp()` costs ~20 ns per call, at least on a scalar (SISD) code path. These scalar CPUs / CPU paths are quite common, be it local inference of small models or face-detection algorithms running on Raspberry Pi smart cameras.
So how can we replace this with something cheaper?
The Approach
One of my big goals for 2026 is to get better at computerized search, as it seems to pop up everywhere I'm interested in. So instead of trying to be clever myself, I decided to use a neat symbolic regression library I ran across, SymbolicRegression.jl, and let it do the work instead.
Symbolic regression (at least with this library) is an evolutionary search over expression trees (like an AST). You give the symbolic regression engine some training data and an operator set and it finds compact closed-form expressions that fit the data.
The key constraint I applied to this search was restricting the operators to quickly computable ones. The entire allowed operator set was:
| Operator | Definition |
|---|---|
| `*` | Multiplication |
| `+` | Addition |
| `-` | Subtraction |
| `greater(a, b)` | Returns $1$ if $a > b$, else $0$ |
| `abs(x)` | The absolute value of $x$ |
Everything in that list maps to a single-cycle or near-single-cycle scalar instruction on all modern CPUs.
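For reference, here is a sketch of what the two non-arithmetic operators compute (Python used as pseudocode; on hardware the first lowers to a branchless compare-and-select and the second to a sign-bit clear):

```python
def greater(a: float, b: float) -> float:
    """1.0 if a > b, else 0.0; compiles to a branchless compare + select."""
    return 1.0 if a > b else 0.0

def abs_(x: float) -> float:
    """|x|; on IEEE-754 floats this is a single sign-bit clear."""
    return x if x >= 0.0 else -x
```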
Training Data
As I was modeling an equation, I was able to fuzz the inputs; I ended up using a fixed-size training set. Each sample was drawn from a Gaussian and was a tuple of the step inputs $(m, s, x)$ together with the exact output of one update step.
As a fixed set of random samples covers very little of float space, I used a circular buffer and overwrote the bottom 15% with new sample values. Since symbolic regression to a large extent has to learn boundary behaviour, continuously refreshing a portion of the dataset improves generalization.
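A hedged sketch of what that data generation might look like (numpy; the standard-normal prior, folding $s$ to keep it non-negative, the buffer size, and treating the first 15% of the buffer as the "bottom" are all assumptions made for illustration):

```python
import numpy as np

def make_samples(n: int, rng: np.random.Generator):
    """Generate n (m, s, x) inputs and the exact s' output of one step."""
    m, s, x = rng.normal(size=(3, n)).astype(np.float32)
    s = np.abs(s)                                      # running exp-sum is non-negative
    m_new = np.maximum(m, x)
    s_new = s * np.exp(m - m_new) + np.exp(x - m_new)  # exact online-softmax step
    return np.stack([m, s, x], axis=1), s_new

rng = np.random.default_rng(0)
X, y = make_samples(10_000, rng)  # buffer size is illustrative

# Circular-buffer refresh: overwrite 15% of the buffer with fresh samples.
k = int(0.15 * len(y))
X[:k], y[:k] = make_samples(k, rng)
```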
Search Configuration
At that point, all that remained was to define the search parameters:
```python
# PySR / SymbolicRegression.jl config
from pysr import PySRRegressor

model = PySRRegressor(
    binary_operators=["+", "-", "*", "greater"],
    unary_operators=["abs"],
    maxsize=40,
    elementwise_loss="L1DistLoss()",  # MAE (named `loss` in older PySR versions)
    populations=25,                   # islands
    population_size=200,
)
```
To help with generalization, the inner symbolic regression would run for a number of iterations, then be stopped and have the dataset changed underneath it. At each run of this "outer loop" the best equations were saved and deduplicated across all outer-loop runs. Each unique equation generated a C++ source file, was compiled in parallel with `g++ -O3 -march=native`, and was appended to the benchmark results. This let a multi-day SR run be benchmarked continuously without reprocessing anything.
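A hedged sketch of that outer loop, reusing `model`, `make_samples`, and `rng` from above; `N_OUTER_ROUNDS` and `emit_cpp_and_benchmark` are hypothetical stand-ins for the real C++ codegen-and-benchmark pipeline:

```python
seen = set()
for _ in range(N_OUTER_ROUNDS):
    # Repeated fit() calls resume the previous populations if the model
    # was constructed with warm_start=True (an assumption about the setup).
    model.fit(X, y)
    for eq in model.equations_.itertuples():     # this round's best equations
        if eq.equation not in seen:
            seen.add(eq.equation)
            emit_cpp_and_benchmark(eq.equation)  # codegen, g++ -O3 -march=native, time it
    # Swap part of the dataset underneath the search.
    k = int(0.15 * len(y))
    X[:k], y[:k] = make_samples(k, rng)
```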
Results
CPU: 3–5× Speedup
The eight best-accuracy expressions all beat the baseline, hard. The fastest:
| Expression (abbreviated) | ns/call | Speedup | MAE |
|---|---|---|---|
| `s + (abs((5.72-m)+x) * ... greater(x+1.00,m) ...)` | 4.59 | 5.3× | 1.04e-3 |
| `s + abs(((5.72-m)+x) * ... greater(x+0.88,m) ...)` | 4.86 | 5.0× | 1.04e-3 |
| `(... abs((5.5-m)+x) ... greater(x+1.0,m) ...) + s` | 5.45 | 4.5× | 9.49e-4 |
| Slowest in accuracy tier | 7.13 | 3.4× | 1.03e-3 |
| Baseline (2× `std::exp`) | 24.40 | 1× | — |
4.59 ns versus 24.4 ns. This is the mechanism we would expect: the compiler sees a flat scalar expression with no transcendental calls, which means it can very cleanly fold the constants, eliminate the common subexpressions, and emit a short sequence of instructions, at which point a modern out-of-order processor readily rips through the dependency chains. And even on simple in-order scalar cores, these very linear, low-register-pressure expressions should run efficiently. Basically, the approximations don't need to be clever; they just need to avoid $\exp$.
GPU: Basically Nothing
The same expressions benchmarked on GPU (8,256 unique equations total) yielded:
- Baseline: 24.34 G-updates/s
- Best SR expression: 24.53 G-updates/s
- That's +0.8%... aka within measurement noise
This confused me for an hour if I'm honest. But after some digging and critical thinking it makes sense.
On GPU, `__expf` is a (sorta) native instruction: it executes on the special function unit and can overlap with regular arithmetic instructions. Replace it with, say, three multiplications that can't be similarly overlapped, and the trade is neutral at best. But that's not really the reason why.
Fundamentally the bottleneck isn't the arithmetic at all. The online softmax scan is an inherently sequential computation. Each step depends on the previous step through both $m$ and $s$. There is no intra-sequence parallelism. Whether a step costs two `__expf` calls or three multiplications and a compare, the compiler still schedules them serially along the same critical path. Shorter arithmetic per step doesn't shorten the critical path when the bottleneck is data dependency. And at the end of the day, GPUs can already run this sequential operation across threads within the same warp, within single-instruction latency.
What Makes an Expression Fast? (It Differs by Hardware)
I extracted 124 structural features from every expression in the top accuracy tier and computed Pearson correlations with measured throughput. CPU and GPU tell completely different stories.
CPU: node count dominates
| Feature | Correlation (r) | Direction |
|---|---|---|
| `tree_n_nodes` | +0.78 | More nodes → slower |
| `complexity` | +0.78 | Higher complexity → slower |
| `n_greater` | +0.71 | More `greater()` → slower |
| `frac_var_leaves` | −0.71 | Variable-heavy → faster |
On CPU, the compiler aggressively applies CSE and constant folding. What it can't eliminate is the total number of distinct operations. Raw tree size predicts runtime almost perfectly. (Even `greater`, which is branchless and costs only about one cycle, adds to the instruction count; expressions that achieve the same accuracy with fewer `greater` calls are consistently faster.)
GPU: variable depth dominates
| Feature | Correlation | Direction |
|---|---|---|
| `x_min_depth` | −0.59 | `x` buried deep → slower |
| `all_var_depth_mean` | −0.53 | Deep variables → slower |
| `x_poly_degree` | +0.40 | Higher `x` degree → faster |
On GPU, the critical path through the dataflow graph matters much more than the total node count. Deeply buried variable occurrences create long serial dependency chains; the GPU can't pipeline those away the way it hides memory latency through warp switching. Expressions where variables appear shallow in the tree tend to have shorter critical paths.
These two profiles suggest a useful principle for future hardware-targeted SR: optimize tree size for CPU, variable depth for GPU.
The Speed-Aware Loss
After the initial search, I encoded the GPU structural predictors directly into the SR loss:
L = MAE + λ · φ(features)
φ = 0.28·mean_x_depth + 0.20·std_var_depths + 0.30·frac_const_leaves
  + 0.14·n_distinct_ops − 0.08·x_squared_present

With the chosen λ, the penalty contributes at most ~7% of a typical frontier MAE: small enough not to distort accuracy early in search, large enough to break ties between similarly accurate expressions toward structurally faster forms.
The idea: rather than measuring speed at every search step (expensive), encode empirical hardware feedback as a structural prior and let the optimizer find forms that satisfy it naturally.
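As a sketch, assuming the features have already been extracted into a dict keyed by the names in the formula above (the λ value isn't given in the post, so it is left as a free parameter here):

```python
def phi(f: dict) -> float:
    """Structural speed prior built from the empirically predictive GPU features."""
    return (0.28 * f["mean_x_depth"]
            + 0.20 * f["std_var_depths"]
            + 0.30 * f["frac_const_leaves"]
            + 0.14 * f["n_distinct_ops"]
            - 0.08 * f["x_squared_present"])

def speed_aware_loss(mae: float, features: dict, lam: float) -> float:
    """L = MAE + λ·φ(features), per the definition above."""
    return mae + lam * phi(features)
```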
The Discovered Structure
Every accurate expression the search found takes the form:

$$s' \approx s + \text{correction}(m, x)$$

The search discovered this decomposition on its own. This is the right shape: when $x \le m$, the correction should be small and approximate $e^{x-m}$; when $x > m$, it needs to restructure the sum. The operator set (`greater` and `abs` in particular) gave just enough piecewise behaviour to find this structure from scratch.
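To make the shape concrete, here is an illustrative Python rendering using the leading terms of the fastest expression from the results table; the `...` factors are abbreviated in the original and are deliberately not filled in here:

```python
def sr_step_shape(m: float, s: float, x: float) -> float:
    """Gated-correction shape only; NOT the full discovered expression."""
    gate = 1.0 if x + 1.00 > m else 0.0  # greater(x + 1.00, m)
    correction = abs((5.72 - m) + x)     # * ... further factors elided
    return s + correction * gate
```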
What's Left Open
The main open question is end-to-end accuracy. The MAE of ~$10^{-3}$ per step is measured in isolation. In a real length-2048 attention sequence, that error accumulates over 2048 steps. If errors were fully correlated (the worst case), the running sum could drift by roughly $2048 \times 10^{-3} \approx 2$ by the end of the sequence.
In practice, errors partially cancel due to sign structure, and the final attention output divides out the running sum entirely, so drift in $s$ enters only through the denominator. A systematic overestimate pushes attention weights toward uniform; a systematic underestimate concentrates them. Whether this matters for a trained model's output is an empirical question I haven't answered yet.
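A quick synthetic sanity check of the accumulation argument (random injected errors stand in for the real approximation's error, which has its own sign structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 2048, 1.04e-3  # sequence length, per-step MAE from the table
worst_case = n * eps    # fully correlated errors: ~2.1
random_walk = abs(rng.normal(0.0, eps, n).sum())  # zero-mean errors: ~eps * sqrt(n)
print(f"correlated drift ~ {worst_case:.2f}, random drift ~ {random_walk:.3f}")
```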
Other future directions:
- CPU-specific speed loss (weight tree size rather than variable depth)
- Training data from real model attention patterns rather than Gaussian prior
- More approximate-computing projects (hint hint)
Code
The paper, benchmark pipeline, and all discovered expressions are at github.com/dleiferives/softmaxxing.