Pattern Overload (And Why We Love It)

hprscript: Multi-Pattern Search Without Running grep N Times

Why I built it

Like many other programmers, I use AI agents for code development. A lot of the work is supervising what the agent does - reading its plans, checking its diffs, and sanity-checking the code it touches. That means a lot of grep commands, run one after another, often against the same tree.

At some point I started thinking this was wasteful. Each grep invocation re-reads the files, re-compiles its pattern, and pays its own startup cost. And when an AI agent is the one driving, every separate search also re-sends the surrounding context to the model - the same conversation, the same instructions, the same tool schemas - just to ask one more question of the same codebase. Ten searches mean ten round trips, ten context replays, ten walks over the data.

I knew there was a library that could match many patterns at once: Intel's Hyperscan. It compiles N regexes into a single DFA and matches them all in one pass over the input. The catch is that Hyperscan isn't a tool - it's a C library with a non-trivial API: compile databases, scratch buffers, match callbacks, flags, mode bits. Powerful, but not something you reach for to grep a directory.

So there was a gap: a fast multi-pattern engine on one side, and the everyday "search this tree for these things" workflow on the other, with no easy bridge between them. hprscript is my attempt to be that bridge.

What it is

hprscript is a command-line search tool built on Hyperscan. It scans files, recursive globs, or arbitrary input on stdin and matches all of your patterns simultaneously in a single pass. One invocation replaces N sequential grep/rg calls. All patterns are compiled into a single DFA and matched in one walk over the data, so adding patterns is virtually free - whether you're typing them by hand or having an AI agent build them up.

Need                                                        | grep / rg        | hprscript
------------------------------------------------------------|------------------|--------------------------------
Search for one regex                                        | ✅               | ✅
Search for N regexes in one scan                            | ❌ (run N times) | ✅ (one DFA, one walk)
Pattern-per-file output (JSON Lines)                        | ❌               | ✅ default
Cross-line block extraction (function bodies, JSON objects) | ❌               | ✅
Multi-pass workflows in one process (collect → resolve)     | ❌               | ✅ via phases
Per-file aggregation (counts, ranking, grouping)            | ❌               | ✅ via scripts
Files missing a pattern                                     | grep -L          | -absent (also works in scripts)
Pattern compile cost as N grows                             | linear           | constant - one shared DFA

PCRE patterns

hprscript patterns use PCRE syntax (the subset Hyperscan accepts), so most everyday regex features work unchanged - character classes, alternation, quantifiers, anchors, and word boundaries. For example, \b matches a word boundary, so you can find a whole word and skip substring hits inside larger identifiers:

$ echo "pretokenization process for token" | hprscript -pi "\btoken\b"
{"file":"<stdin>","pat":"p0","line":1,"col":29,"from":28,"to":33,"match":"token","context":"pretokenization process for token"}

Only the standalone word token matches - the token inside pretokenization is correctly skipped. The same goes for the rest of the usual PCRE toolkit: \d, \s, \w, lazy quantifiers like .*?, character classes like [A-Za-z_]+, alternation, and so on.
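If you want to sanity-check a pattern before handing it to hprscript, the same \b semantics can be reproduced in any PCRE-style engine. Here's a quick sketch using Python's re module, which follows the same word-boundary rules for this case:

```python
import re

line = "pretokenization process for token"

# \btoken\b only matches "token" delimited by word boundaries, so the
# "token" buried inside "pretokenization" is skipped (it sits between
# two word characters, 'e' and 'i', so no boundary exists there).
matches = [(m.start(), m.group()) for m in re.finditer(r"\btoken\b", line)]
print(matches)  # [(28, 'token')]
```

The reported start offset, 28, matches the "from" field in the hprscript record above.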

Many patterns vs. one big alternation

At first glance, "match many patterns at once" sounds like what grep -E "TODO|DEPRECATED" already does. It isn't quite the same. With alternation, grep only tells you that something matched on a line - it can't tell you which branch of the alternation fired. If you need to know whether you saw a TODO or a DEPRECATED, you have to run two separate searches.

In hprscript, every -p / -pi pattern is a first-class pattern with its own ID (p0, p1, ...). Each match record carries the ID of the pattern that produced it, so you always know which one fired - even when several patterns match on the same line:

$ echo "version v018 finished with status: FAIL" | hprscript -p "\bv\d+\b" -p "SUCCESS|FAIL"
{"file":"<stdin>","pat":"p0","line":1,"col":9,"from":8,"to":12,"match":"v018","context":"version v018 finished with status: FAIL"}
{"file":"<stdin>","pat":"p1","line":1,"col":36,"from":35,"to":39,"match":"FAIL","context":"version v018 finished with status: FAIL"}

Two distinct matches, two distinct pattern IDs, on the same line, in one pass. That makes the output trivial to route downstream: filter by pat, count per pattern, group by pattern, or hand the JSON Lines straight to a script or an AI agent that needs to reason about which patterns hit where.
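As a sketch of that downstream routing - this is plain JSON Lines consumption, not anything hprscript-specific - a few lines of Python can group match records by their pattern ID. The records here are the two from the example above:

```python
import json
from collections import defaultdict

# Two records as emitted above: one JSON object per line, each
# carrying the ID of the pattern that produced it.
output = '''{"file":"<stdin>","pat":"p0","line":1,"col":9,"from":8,"to":12,"match":"v018","context":"version v018 finished with status: FAIL"}
{"file":"<stdin>","pat":"p1","line":1,"col":36,"from":35,"to":39,"match":"FAIL","context":"version v018 finished with status: FAIL"}'''

by_pattern = defaultdict(list)
for line in output.splitlines():
    rec = json.loads(line)
    by_pattern[rec["pat"]].append(rec["match"])

print(dict(by_pattern))  # {'p0': ['v018'], 'p1': ['FAIL']}
```

The same few lines work whether the consumer is a shell pipeline, a script, or an agent tool wrapper.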

Quick taste

# Single pattern (default JSON Lines output)
hprscript -p "TODO" -glob "**/*.go"

# Multi-pattern in one pass - adding patterns is virtually free
hprscript -p "TODO" -p "FIXME" -p "XXX" -glob "**/*.go"

# Mix case-sensitive and case-insensitive in the same scan
hprscript -p '\bError\b' -pi 'todo|fixme' -glob '**/*.go'

# Pipeline use - content from stdin, no glob/files needed
curl -s https://example.com | hprscript -p 'href="[^"]+"' -o
kubectl logs my-pod | hprscript -p 'ERROR|panic' -C 2

# Extract every Go function body (signature + braces, balanced)
hprscript -p 'func \w+\(' -block-open '{' -block-close '}' -o '**/*.go'

# Files missing a license header (one pass, no scripting)
hprscript -p 'Copyright|SPDX-License-Identifier' -absent -glob '**/*.go'

The default output is JSON Lines - one record per match, ready to pipe into jq, a script, or an AI agent:

{"file":"main.go","pat":"p0","line":42,"col":5,"from":1023,"to":1027,"match":"TODO","context":"// TODO: refactor"}

Block extraction: getting the whole function body

Because hprscript matches every pattern simultaneously - including the open and close delimiters of a block - it can do something that's genuinely hard for almost any tool or script: extract the entire body that follows a match by counting balanced brackets.

Think about how you'd solve this in general. You want to find a specific function and print its full body. With grep, you can find the signature line, but the body might span dozens of lines and contain its own nested braces - so you can't stop at the first }. Even reaching for awk, sed, or a quick Python script, you end up hand-rolling a small depth counter, dealing with strings and comments that contain stray braces, and writing more code than the task deserves. The "right" answer is often to pull in a real parser for the language. With hprscript, it's one command:

$ hprscript -p 'sub qualify_to_ref' -block-open '{' -block-close '}' -o perl/run/lib/5.36.1/Symbol.pm
sub qualify_to_ref ($;$) {
    no strict 'refs';
    return \*{ qualify $_[0], @_ > 1 ? $_[1] : caller };
}

hprscript finds the match, walks forward until it sees the first {, then tracks depth as it scans - incrementing on {, decrementing on } - until it returns to zero. The block returned is the matched balanced region, regardless of how many nested braces sit inside it.
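The depth-tracking walk is simple to state. Here's an illustrative Python sketch of the same idea - a simplification that ignores braces inside strings and comments, which a real implementation has to handle:

```python
def extract_block(text: str, start: int) -> str:
    """Return the balanced {...} region beginning at or after `start`."""
    open_at = text.index("{", start)      # walk forward to the first {
    depth = 0
    for i in range(open_at, len(text)):
        if text[i] == "{":
            depth += 1                    # entering a nested block
        elif text[i] == "}":
            depth -= 1                    # leaving one
            if depth == 0:                # back to zero: block is closed
                return text[open_at:i + 1]
    raise ValueError("unbalanced braces")

src = "func f() { if x { y() } }"
print(extract_block(src, 0))  # { if x { y() } }
```

The nested `{ y() }` pushes the depth to 2 and back, so the scan correctly runs past the first `}` to the one that actually closes the block.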

The delimiters are configurable, so the same mechanism works far beyond curly-brace languages: JSON objects ({/}), JSX or HTML subtrees (<div>/</div>), SQL BEGIN/END blocks, or any custom paired tokens you can express as strings.

Beyond plain matching

  • Script mode (JSON DSL). Variables, lifecycle hooks, sub-pattern matching, conditionals, grouping, ranking, and multi-phase scans - all in a single invocation. Useful for "collect identifiers in pass one, then resolve their definitions in pass two" workflows.
  • -absent mode. Find files where a pattern is not present. Like grep -L, but it also works inside scripts and composes with the rest of the DSL.
  • Unicode by default. UTF-8 mode is on, and -pi case folding works across scripts (CAFÉ ↔ café, ПРИВЕТ ↔ привет).
  • grep-compatible output modes: -f (file list), -c (per-file counts), -o (matched text only), -format (custom template), -A/-B/-C (context lines).
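The Unicode case-folding behavior is easy to picture with a quick Python sketch - Python's re module applies Unicode-aware case-insensitive matching to str patterns, which covers these examples the same way:

```python
import re

# Case-insensitive matching works across scripts, not just ASCII:
# É folds to é, and Cyrillic uppercase folds to lowercase.
assert re.search(r"café", "Menu: CAFÉ", re.IGNORECASE)
assert re.search(r"привет", "ПРИВЕТ, мир", re.IGNORECASE)
print("both matched")
```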

Phases: when one pass isn't enough

Some questions can't be answered in a single scan because you don't know what to look for until you've already looked at the data. A classic example: which exported functions in this package are no longer used anywhere? You can't search for "unused functions" directly - you first have to learn what the functions are, and only then can you check whether anything calls them.

hprscript handles this with phases: sequential scan rounds that share a variable store. Pass 1 collects facts; pass 2 (or 3, or 4) consults them. After the last phase, an on_complete hook can walk the collected state and emit a final report.

Here's the sketch of an unused-exports detector for a Go package. Phase defs records every exported function defined in the package. Phase uses scans the whole repo for identifier usages and removes anything it finds from the map. What's left after both phases is the set of exports nothing references:

hprscript -s '{
  "variables": {
    "defs": {"type": "map"},
    "_n":   {"type": "string"}
  },
  "phases": [
    {
      "id": "defs",
      "scan": ["pkg/mylib/**/*.go"],
      "patterns": [{
        "id": "fn", "regexp": "func [A-Z]\\w+",
        "on_match": [
          {"action": "submatch", "patterns": [
            {"id": "name", "regexp": "[A-Z]\\w+", "on_match": [
              {"action": "set", "var": "_n", "value": "$MATCH"}
            ]}
          ]},
          {"action": "map_set", "target": "defs", "key": "$_n", "value": "$FILE:$LINE"}
        ]
      }]
    },
    {
      "id": "uses",
      "scan": ["**/*.go"],
      "exclude": ["pkg/mylib/"],
      "patterns": [{
        "id": "ident", "regexp": "\\b[A-Z]\\w+\\b",
        "on_match": [
          {"action": "map_delete", "target": "defs", "key": "$MATCH"}
        ]
      }]
    }
  ],
  "on_complete": [
    {"action": "for_each", "var": "defs", "key_as": "name", "as": "where", "do": [
      {"action": "emit", "data": {"unused": "$name", "defined_at": "$where"}}
    ]}
  ]
}'

Output is a list of dead exports, with the file and line where each was defined - in one process, with one DFA per phase, and no intermediate files on disk:

{"unused":"OldHelper","defined_at":"pkg/mylib/util.go:47"}
{"unused":"DeprecatedConfig","defined_at":"pkg/mylib/config.go:112"}

The same shape solves a lot of "collect, then resolve" problems: stale references after a rename (define = current functions; use = call sites; report misses), undefined feature flags (define = registry; use = call sites), schema drift (define = struct fields; use = JSON keys), and so on. Anywhere you'd otherwise be tempted to grep into a temp file and then grep against it from a script, phases collapse the workflow into one invocation.
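The core of the phases mechanism - a shared store that pass one fills and pass two prunes - can be sketched in a few lines of plain Python. This shows only the shape of the workflow, on made-up in-memory files, not how hprscript implements it:

```python
import re

# Toy inputs: one package file defining exports, one file using them.
package_files = {
    "pkg/mylib/util.go": "func OldHelper() {}\nfunc Render() {}",
}
other_files = {
    "cmd/main.go": "func main() { Render() }",
}

# Phase 1 (collect): record every exported function the package defines.
defs = {}
for path, src in package_files.items():
    for i, line in enumerate(src.splitlines(), 1):
        m = re.search(r"func ([A-Z]\w+)", line)
        if m:
            defs[m.group(1)] = f"{path}:{i}"

# Phase 2 (resolve): any identifier seen elsewhere is not unused.
for src in other_files.values():
    for ident in re.findall(r"\b[A-Z]\w+\b", src):
        defs.pop(ident, None)

print(defs)  # {'OldHelper': 'pkg/mylib/util.go:1'}
```

`Render` is called from `cmd/main.go` and gets pruned; `OldHelper` survives both passes and is reported as unused - the same collect-then-resolve shape the script above runs inside a single hprscript process.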

Built for AI coding agents too

The repository ships an MCP server that exposes hprscript to AI coding tools like Claude Code and Cursor as a set of structured tools (search, list_files, count_per_file, extract_blocks, run_script). The JSON Lines output and multi-pattern semantics fit naturally into agent workflows - a single tool call can answer "find every TODO, FIXME, and XXX across the codebase, with surrounding context" instead of three.

You don't strictly need the MCP server, though. If you've already installed the hprscript binary (a prebuilt Linux x86-64 executable is attached to every release at github.com/pinkhasn/hprscript/releases/latest - just download, chmod +x, and put it on your PATH), there's an even simpler setup. Download HPRSCRIPT.md - the full reference for the CLI flags and the script DSL - drop it next to your project (or somewhere the agent can read), and tell the agent something like "read HPRSCRIPT.md and use hprscript for code search whenever it fits." The reference is written to be self-contained, so the agent can learn the tool from it and start composing multi-pattern searches and script-mode invocations on its own, with no extra integration.

There's also a dedicated -llm output mode for when matches are going straight into a model's context. Instead of JSON Lines (which costs tokens for keys, quotes, and braces on every record), -llm emits a compact plain-text format grouped by file - one file header printed once, then each match as   <line>: <text>. Same information, roughly 30-50% fewer tokens than -j:

hprscript -p 'TODO|FIXME' -llm -glob '**/*.go'
# pkg/foo.go
#   42: // TODO: refactor
#   78: // FIXME: race condition
# pkg/bar.go
#   13: // TODO: handle nil case

It also adapts automatically to other flags - if you add -block-open, the extracted block is included; if you add -scope, the enclosing function name is shown inline. For piping matches into an LLM this is usually the right default; switch back to -j only when a downstream tool needs byte offsets, pattern IDs, or capture groups.
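To see why the grouped format is cheaper, it helps to render the same matches both ways. Here's an illustrative Python comparison - byte counts stand in for tokens, and the exact savings depend on the tokenizer:

```python
import json
from itertools import groupby

matches = [
    {"file": "pkg/foo.go", "line": 42, "context": "// TODO: refactor"},
    {"file": "pkg/foo.go", "line": 78, "context": "// FIXME: race condition"},
    {"file": "pkg/bar.go", "line": 13, "context": "// TODO: handle nil case"},
]

# JSON Lines: keys, quotes, and braces repeat on every record.
jsonl = "\n".join(json.dumps(m) for m in matches)

# Compact grouped format: file header once, then "  <line>: <text>".
# (groupby groups consecutive records, so input must be ordered by file.)
lines = []
for file, group in groupby(matches, key=lambda m: m["file"]):
    lines.append(file)
    lines.extend(f"  {m['line']}: {m['context']}" for m in group)
compact = "\n".join(lines)

print(compact)
print(f"{len(jsonl)} bytes of JSON Lines vs {len(compact)} bytes compact")
```

The per-record overhead of JSON Lines grows with every match; the compact form pays the file path once per file and nothing else.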

Install

A prebuilt Linux x86-64 binary is attached to every tagged release - grab it from github.com/pinkhasn/hprscript/releases/latest, or pull it directly from the command line:

curl -L -o hprscript https://github.com/pinkhasn/hprscript/releases/latest/download/hprscript
chmod +x hprscript
mv hprscript ~/.local/bin/

Or build from source - requires g++ (C++17), make, and Hyperscan (libhs-dev on Debian/Ubuntu):

sudo apt install libhs-dev g++ make
make           # builds ./hprscript
make test      # runs the test suite
make install   # copies to ~/.local/bin/hprscript

The build statically links Hyperscan and libstdc++, so the resulting binary is a single self-contained executable with no runtime dependencies beyond libc, libm, and libpthread. Drop it on any modern Linux x86-64 box and it just runs.

Where to go next

  • GitHub repository - source, releases, and issues.
  • HPRSCRIPT.md - full reference: every CLI flag, the script-mode JSON DSL, Unicode handling, regex quirks, exit codes, and a cookbook of recipes.
  • README - quick start and feature overview.

This post only scratches the surface. hprscript has many more options - submatching, ranking, grouping, sampling, scope detection, conditional emits, output budgets, and a full script DSL - that together make it a genuinely powerful tool for analyzing massive codebases and huge text files in one pass. And because every search costs less time and fewer tokens, it makes an LLM agent both faster and more effective: faster because one invocation replaces many, more effective because the agent can ask broader, more ambitious questions of the codebase without blowing its context budget.

If you spend a lot of time grepping the same tree for different things - or if you're building tools that hand a regex search to an LLM - give hprscript a try. The "all patterns at once" model changes how you think about searching code.
