Pattern Overload (And Why We Love It)
hprscript: Multi-Pattern Search Without Running grep N Times
Why I built it
Like many other programmers, I use AI agents for code development. A lot of the work is
supervising what the agent does - reading its plans, checking its diffs, and
sanity-checking the code it touches. That means a lot of grep commands, run one
after another, often against the same tree.
At some point I started thinking this was wasteful. Each grep invocation re-reads
the files, re-compiles its pattern, and pays its own startup cost. And when an AI agent is the
one driving, every separate search also re-sends the surrounding context to the model - the
same conversation, the same instructions, the same tool schemas - just to ask one more
question of the same codebase. Ten searches mean ten round trips, ten context replays, ten
walks over the data.
I knew there was a library that could match many patterns at once: Intel's Hyperscan. It compiles N regexes into a single DFA and matches them all in one pass over the input. The catch is that Hyperscan isn't a tool - it's a C library with a non-trivial API: compile databases, scratch buffers, match callbacks, flags, mode bits. Powerful, but not something you reach for to grep a directory.
So there was a gap: a fast multi-pattern engine on one side, and the everyday "search this tree for these things" workflow on the other, with no easy bridge between them. hprscript is my attempt to be that bridge.
What it is
hprscript is a command-line search tool built on Hyperscan. It scans files,
recursive globs, or arbitrary input on stdin and matches all of your patterns simultaneously
in a single pass. One invocation replaces N sequential grep/rg calls.
All patterns are compiled into a single DFA and matched in one walk over the data, so adding
patterns is virtually free - whether you're typing them by hand or having an AI agent build
them up.
| Need | grep / rg | hprscript |
|---|---|---|
| Search for one regex | ✅ | ✅ |
| Search for N regexes in one scan | ❌ (run N times) | ✅ (one DFA, one walk) |
| Pattern-per-file output (JSON Lines) | ❌ | ✅ default |
| Cross-line block extraction (function bodies, JSON objects) | ❌ | ✅ |
| Multi-pass workflows in one process (collect → resolve) | ❌ | ✅ via phases |
| Per-file aggregation (counts, ranking, grouping) | ❌ | ✅ via scripts |
| Files missing a pattern | grep -L | -absent (also works in scripts) |
| Pattern compile cost as N grows | linear | constant - one shared DFA |
PCRE patterns
hprscript patterns use PCRE syntax (the subset Hyperscan accepts), so most everyday
regex features work unchanged - character classes, alternation, quantifiers, anchors, and
word boundaries. For example, \b matches a word boundary, so you can find a whole
word and skip substring hits inside larger identifiers:
$ echo "pretokenization process for token" | hprscript -pi "\btoken\b"
{"file":"<stdin>","pat":"p0","line":1,"col":29,"from":28,"to":33,"match":"token","context":"pretokenization process for token"}
Only the standalone word token matches - the token inside
pretokenization is correctly skipped. The same goes for the rest of the usual PCRE
toolkit: \d, \s, \w, lazy quantifiers like .*?,
character classes like [A-Za-z_]+, alternation, and so on.
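The same word-boundary semantics hold in any PCRE-style engine, so you can sanity-check a pattern before handing it to hprscript. A quick Python sketch of the example above:

```python
import re

text = "pretokenization process for token"

# \btoken\b requires a word boundary on both sides, so the "token"
# buried inside "pretokenization" (preceded by the word char "e")
# does not match - only the standalone word does.
matches = [(m.start(), m.group()) for m in re.finditer(r"\btoken\b", text)]
print(matches)  # [(28, 'token')]
```

Note the start offset 28 agrees with the "from":28 field in the hprscript record above.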
Many patterns vs. one big alternation
At first glance, "match many patterns at once" sounds like what grep -E "TODO|DEPRECATED"
already does. It isn't quite the same. With alternation, grep only tells you that
something matched on a line - it can't tell you which branch of the alternation
fired. If you need to know whether you saw a TODO or a DEPRECATED, you
have to run two separate searches.
In hprscript, every -p / -pi pattern is a first-class
pattern with its own ID (p0, p1, ...). Each match record carries the
ID of the pattern that produced it, so you always know which one fired - even when several
patterns match on the same line:
$ echo "version v018 finished with status: FAIL" | hprscript -p "\bv\d+\b" -p "SUCCESS|FAIL"
{"file":"<stdin>","pat":"p0","line":1,"col":9,"from":8,"to":12,"match":"v018","context":"version v018 finished with status: FAIL"}
{"file":"<stdin>","pat":"p1","line":1,"col":36,"from":35,"to":39,"match":"FAIL","context":"version v018 finished with status: FAIL"}
Two distinct matches, two distinct pattern IDs, on the same line, in one pass. That makes the
output trivial to route downstream: filter by pat, count per pattern, group by
pattern, or hand the JSON Lines straight to a script or an AI agent that needs to reason about
which patterns hit where.
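Because each record names the pattern that produced it, a downstream consumer is a few lines in any scripting language. A hypothetical Python consumer that tallies hits per pattern from the JSON Lines stream (sample records hard-coded for illustration):

```python
import json
from collections import Counter

# Two records as hprscript emits them (copied from the example above).
lines = [
    '{"file":"<stdin>","pat":"p0","line":1,"col":9,"from":8,"to":12,"match":"v018","context":"version v018 finished with status: FAIL"}',
    '{"file":"<stdin>","pat":"p1","line":1,"col":36,"from":35,"to":39,"match":"FAIL","context":"version v018 finished with status: FAIL"}',
]

# Count how many times each pattern ID fired.
counts = Counter(json.loads(line)["pat"] for line in lines)
print(dict(counts))  # {'p0': 1, 'p1': 1}
```

In practice you would read the lines from hprscript's stdout rather than a list, but the routing logic is the same.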
Quick taste
# Single pattern (default JSON Lines output)
hprscript -p "TODO" -glob "**/*.go"
# Multi-pattern in one pass - adding patterns is virtually free
hprscript -p "TODO" -p "FIXME" -p "XXX" -glob "**/*.go"
# Mix case-sensitive and case-insensitive in the same scan
hprscript -p '\bError\b' -pi 'todo|fixme' -glob '**/*.go'
# Pipeline use - content from stdin, no glob/files needed
curl -s https://example.com | hprscript -p 'href="[^"]+"' -o
kubectl logs my-pod | hprscript -p 'ERROR|panic' -C 2
# Extract every Go function body (signature + braces, balanced)
hprscript -p 'func \w+\(' -block-open '{' -block-close '}' -o '**/*.go'
# Files missing a license header (one pass, no scripting)
hprscript -p 'Copyright|SPDX-License-Identifier' -absent -glob '**/*.go'
The default output is JSON Lines - one record per match, ready to pipe into jq, a script, or an AI agent:
{"file":"main.go","pat":"p0","line":42,"col":5,"from":1023,"to":1027,"match":"TODO","context":"// TODO: refactor"}
Block extraction: getting the whole function body
Because hprscript matches every pattern simultaneously - including the open and
close delimiters of a block - it can do something that's genuinely hard for almost any tool or
script: extract the entire body that follows a match by counting balanced brackets.
Think about how you'd solve this in general. You want to find a specific function and print its
full body. With grep, you can find the signature line, but the body might span
dozens of lines and contain its own nested braces - so you can't stop at the first }.
Even reaching for awk, sed, or a quick Python script, you end up
hand-rolling a small depth counter, dealing with strings and comments that contain stray braces,
and writing more code than the task deserves. The "right" answer is often to pull in a real
parser for the language. With hprscript, it's one command:
$ hprscript -p 'sub qualify_to_ref' -block-open '{' -block-close '}' -o perl/run/lib/5.36.1/Symbol.pm
sub qualify_to_ref ($;$) {
    no strict 'refs';
    return \*{ qualify $_[0], @_ > 1 ? $_[1] : caller };
}
hprscript finds the match, walks forward until it sees the first {, then
tracks depth as it scans - incrementing on {, decrementing on } -
until it returns to zero. The block returned is the matched balanced region, regardless of how
many nested braces sit inside it.
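The depth-counting walk itself is simple to sketch. Here is a naive Python version of the idea (it does not skip braces inside string literals or comments, which a robust implementation has to worry about):

```python
def extract_block(text: str, match_end: int,
                  open_tok: str = "{", close_tok: str = "}") -> str:
    """Return the balanced block starting at the first open_tok
    after match_end: depth +1 on open, -1 on close, stop at zero."""
    start = text.index(open_tok, match_end)
    depth = 0
    i = start
    while i < len(text):
        if text.startswith(open_tok, i):
            depth += 1
            i += len(open_tok)
        elif text.startswith(close_tok, i):
            depth -= 1
            i += len(close_tok)
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    raise ValueError("unbalanced block")

src = "func add(a, b int) int {\n\tif a > b { a, b = b, a }\n\treturn a + b\n}\n"
print(extract_block(src, src.index("(")))
```

Because open_tok and close_tok are strings rather than single characters, the same loop handles multi-character delimiters like <div> and </div>.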
The delimiters are configurable, so the same mechanism works far beyond curly-brace languages:
JSON objects ({/}), JSX or HTML subtrees
(<div>/</div>), SQL BEGIN/END blocks,
or any custom paired tokens you can express as strings.
Beyond plain matching
- Script mode (JSON DSL). Variables, lifecycle hooks, sub-pattern matching, conditionals, grouping, ranking, and multi-phase scans - all in a single invocation. Useful for "collect identifiers in pass one, then resolve their definitions in pass two" workflows.
- -absent mode. Find files where a pattern is not present. Like grep -L, but it also works inside scripts and composes with the rest of the DSL.
- Unicode by default. UTF-8 mode is on, and -pi case folding works across scripts (CAFÉ ↔ café, ПРИВЕТ ↔ привет).
- grep-compatible output modes: -f (file list), -c (per-file counts), -o (matched text only), -format (custom template), -A/-B/-C (context lines).
Phases: when one pass isn't enough
Some questions can't be answered in a single scan because you don't know what to look for until you've already looked at the data. A classic example: which exported functions in this package are no longer used anywhere? You can't search for "unused functions" directly - you first have to learn what the functions are, and only then can you check whether anything calls them.
hprscript handles this with phases: sequential scan rounds that
share a variable store. Pass 1 collects facts; pass 2 (or 3, or 4) consults them. After the
last phase, an on_complete hook can walk the collected state and emit a final
report.
Here's the sketch of an unused-exports detector for a Go package. Phase defs
records every exported function defined in the package. Phase uses scans the
whole repo for identifier usages and removes anything it finds from the map. What's left
after both phases is the set of exports nothing references:
hprscript -s '{
  "variables": {
    "defs": {"type": "map"},
    "_n": {"type": "string"}
  },
  "phases": [
    {
      "id": "defs",
      "scan": ["pkg/mylib/**/*.go"],
      "patterns": [{
        "id": "fn", "regexp": "func [A-Z]\\w+",
        "on_match": [
          {"action": "submatch", "patterns": [
            {"id": "name", "regexp": "[A-Z]\\w+", "on_match": [
              {"action": "set", "var": "_n", "value": "$MATCH"}
            ]}
          ]},
          {"action": "map_set", "target": "defs", "key": "$_n", "value": "$FILE:$LINE"}
        ]
      }]
    },
    {
      "id": "uses",
      "scan": ["**/*.go"],
      "exclude": ["pkg/mylib/"],
      "patterns": [{
        "id": "ident", "regexp": "\\b[A-Z]\\w+\\b",
        "on_match": [
          {"action": "map_delete", "target": "defs", "key": "$MATCH"}
        ]
      }]
    }
  ],
  "on_complete": [
    {"action": "for_each", "var": "defs", "key_as": "name", "as": "where", "do": [
      {"action": "emit", "data": {"unused": "$name", "defined_at": "$where"}}
    ]}
  ]
}'
Output is a list of dead exports, with the file and line where each was defined - in one process, with one DFA per phase, and no intermediate files on disk:
{"unused":"OldHelper","defined_at":"pkg/mylib/util.go:47"}
{"unused":"DeprecatedConfig","defined_at":"pkg/mylib/config.go:112"}
The same shape solves a lot of "collect, then resolve" problems: stale references after a
rename (define = current functions; use = call sites; report misses), undefined feature flags
(define = registry; use = call sites), schema drift (define = struct fields; use = JSON keys),
and so on. Anywhere you'd otherwise be tempted to grep into a temp file and then
grep against it from a script, phases collapse the workflow into one invocation.
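To make the collect-then-resolve shape concrete outside the DSL, here is the same two-phase logic as a plain Python sketch over in-memory strings (the file paths and contents are made up for illustration):

```python
import re

# Phase 1 input: the package whose exports we define (sample data).
pkg_files = {
    "pkg/mylib/util.go": "func OldHelper() {}\nfunc UsedHelper() {}\n",
}
# Phase 2 input: the rest of the repo, where usages may appear.
repo_files = {
    "cmd/main.go": "func main() { UsedHelper() }\n",
}

# Phase 1: collect exported function names -> definition site.
defs = {}
for path, text in pkg_files.items():
    for lineno, line in enumerate(text.splitlines(), 1):
        m = re.search(r"func ([A-Z]\w+)", line)
        if m:
            defs[m.group(1)] = f"{path}:{lineno}"

# Phase 2: delete every name referenced anywhere outside the package.
for text in repo_files.values():
    for ident in re.findall(r"\b[A-Z]\w+\b", text):
        defs.pop(ident, None)

print(defs)  # whatever survives both phases is unreferenced
```

The hprscript version does the same thing with one compiled pattern set per phase and no Python process, but the data flow - a shared map written in pass one and pruned in pass two - is identical.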
Built for AI coding agents too
The repository ships an MCP server that exposes
hprscript to AI coding tools like Claude Code and Cursor as a set of structured tools
(search, list_files, count_per_file, extract_blocks,
run_script). The JSON Lines output and multi-pattern semantics fit naturally into
agent workflows - a single tool call can answer "find every TODO, FIXME, and XXX across the
codebase, with surrounding context" instead of three.
You don't strictly need the MCP server, though. If you've already installed the
hprscript binary (a prebuilt Linux x86-64 executable is attached to every release
at github.com/pinkhasn/hprscript/releases/latest
- just download, chmod +x, and put it on your PATH),
an even simpler setup is to download
HPRSCRIPT.md -
the full reference for the CLI flags and the script DSL - drop it next to your project
(or somewhere the agent can read), and tell the agent something like
"read HPRSCRIPT.md and use hprscript for code search whenever it fits."
The reference is written to be self-contained, so the agent can learn the tool from it and
start composing multi-pattern searches and script-mode invocations on its own, without any
extra integration.
There's also a dedicated -llm output mode for when matches are going straight into
a model's context. Instead of JSON Lines (which costs tokens for keys, quotes, and braces on
every record), -llm emits a compact plain-text format grouped by file - one file
header printed once, then each match as <line>: <text>. Same
information, roughly 30-50% fewer tokens than -j:
hprscript -p 'TODO|FIXME' -llm -glob '**/*.go'
# pkg/foo.go
# 42: // TODO: refactor
# 78: // FIXME: race condition
# pkg/bar.go
# 13: // TODO: handle nil case
It also adapts automatically to other flags - if you add -block-open, the extracted
block is included; if you add -scope, the enclosing function name is shown inline.
For piping matches into an LLM this is usually the right default; switch back to -j
only when a downstream tool needs byte offsets, pattern IDs, or capture groups.
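The savings are easy to see if you rebuild the compact shape from JSON records yourself. A rough character-count comparison in Python (sample records; actual token counts depend on the model's tokenizer):

```python
import json
from itertools import groupby

records = [
    {"file": "pkg/foo.go", "line": 42, "context": "// TODO: refactor"},
    {"file": "pkg/foo.go", "line": 78, "context": "// FIXME: race condition"},
    {"file": "pkg/bar.go", "line": 13, "context": "// TODO: handle nil case"},
]

json_lines = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# The -llm shape: file header once, then "line: text" per match.
# groupby needs records for the same file to be consecutive, which
# matches how a per-file scan emits them.
out = []
for fname, group in groupby(records, key=lambda r: r["file"]):
    out.append(fname)
    out.extend(f"{r['line']}: {r['context']}" for r in group)
compact = "\n".join(out)

print(compact)
print(f"{len(json_lines)} chars as JSON vs {len(compact)} chars compact")
```

The keys, quotes, braces, and repeated file names are exactly what the compact form drops, which is where the 30-50% figure comes from.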
Install
A prebuilt Linux x86-64 binary is attached to every tagged release - grab it from github.com/pinkhasn/hprscript/releases/latest, or pull it directly from the command line:
curl -L -o hprscript https://github.com/pinkhasn/hprscript/releases/latest/download/hprscript
chmod +x hprscript
mv hprscript ~/.local/bin/
Or build from source - requires g++ (C++17), make, and Hyperscan (libhs-dev on Debian/Ubuntu):
sudo apt install libhs-dev g++ make
make # builds ./hprscript
make test # runs the test suite
make install # copies to ~/.local/bin/hprscript
The build statically links Hyperscan and libstdc++, so the resulting binary is a single
self-contained executable with no runtime dependencies beyond libc, libm,
and libpthread. Drop it on any modern Linux x86-64 box and it just runs.
Where to go next
- GitHub repository - source, releases, and issues.
- HPRSCRIPT.md - full reference: every CLI flag, the script-mode JSON DSL, Unicode handling, regex quirks, exit codes, and a cookbook of recipes.
- README - quick start and feature overview.
This post only scratches the surface. hprscript has many more options - submatching,
ranking, grouping, sampling, scope detection, conditional emits, output budgets, and a full
script DSL - that together make it a genuinely powerful tool for analyzing massive codebases
and huge text files in one pass. And because every search costs less time and fewer tokens,
it makes an LLM agent both faster and more effective: faster because one
invocation replaces many, more effective because the agent can ask broader, more ambitious
questions of the codebase without blowing its context budget.
If you spend a lot of time grepping the same tree for different things - or if you're building tools
that hand a regex search to an LLM - give hprscript a try. The "all patterns at once"
model changes how you think about searching code.