$ less 2026-07-03.md

Turning a raw log line into a working regex (and Grok, and a Wazuh decoder)

Every log parser starts the same way: you have a line, and you need to pull structured fields out of it. Here is one from an SSH brute-force attempt, the kind that fills /var/log/auth.log on any internet-facing host:

Jun 30 22:14:15 fw01 sshd[4721]: Failed password for invalid user admin from 203.0.113.45 port 51234 ssh2

You want five things out of that: the timestamp, the PID, the attempted username, the source IP, and the source port. The mechanical way to get them is a regex with named capture groups. The interesting part is not writing the regex — it is that the same extraction has to be written three completely different ways depending on where it runs, and two of those ways will silently do the wrong thing if you port the PCRE version literally.

Step one: find the fields

Before you can capture anything, you have to decide what is a field and what is a constant. The trick is to look at more than one line. Given a second, similar line — different user, different IP, different port — the parts that change are your fields, and the parts that stay identical (Failed password for invalid user, port, ssh2) are literal anchors. That single observation is the whole ballgame: a column that varies becomes a capture group; a column that never varies becomes text you match exactly, which is what keeps the pattern from matching lines you did not mean to match.
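The two-line comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not how LogForge itself works: the second sample line is invented, `split_fields` is a hypothetical helper, and splitting on whitespace is a simplifying assumption.

```python
line_a = ("Jun 30 22:14:15 fw01 sshd[4721]: Failed password for invalid user "
          "admin from 203.0.113.45 port 51234 ssh2")
# Hypothetical second sample, invented for illustration.
line_b = ("Jun 30 22:14:19 fw01 sshd[4722]: Failed password for invalid user "
          "oracle from 198.51.100.7 port 40022 ssh2")


def split_fields(a: str, b: str) -> list[tuple[str, str]]:
    """Label each whitespace-separated token of `a` as 'literal' or 'field'."""
    tokens_a, tokens_b = a.split(), b.split()
    if len(tokens_a) != len(tokens_b):
        raise ValueError("samples must have the same number of tokens")
    # A token that differs between the samples is a field; otherwise a literal.
    return [
        ("literal" if ta == tb else "field", ta)
        for ta, tb in zip(tokens_a, tokens_b)
    ]


labels = split_fields(line_a, line_b)
fields = [token for kind, token in labels if kind == "field"]
print(fields)
```

With these two samples the varying tokens are the time, the sshd[PID] tag, the username, the IP, and the port. The month and day come out as literals because both lines share a date, which is why a diff alone is not enough and each token still needs to be typed.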

Detection also means typing each field. 203.0.113.45 is an IPv4 address, so it earns a tight \d{1,3}(?:\.\d{1,3}){3} instead of a catch-all .*. 4721 is an integer. admin is a username: letters, digits, and a handful of punctuation. Getting the type right is what separates a pattern that documents your log from a pattern full of .* that matches everything and captures nothing useful.
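A minimal sketch of that typing step, assuming a small ordered table of candidate types. The type names, the table, and the `classify` helper are illustrative assumptions; the IPv4 and username patterns are the ones used in this article.

```python
import re

# Ordered from most to least specific; the first full match wins.
# An IP address would also satisfy the username pattern, so order matters.
FIELD_TYPES = [
    ("ipv4", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("int", r"\d+"),
    ("username", r"[A-Za-z0-9._@-]+"),
]


def classify(value: str) -> tuple[str, str]:
    """Return (type name, regex fragment) for a sample field value."""
    for name, pattern in FIELD_TYPES:
        if re.fullmatch(pattern, value):
            return name, pattern
    # Nothing specific fits: fall back to "any run of non-space characters".
    return "text", r"\S+"


for sample in ["203.0.113.45", "4721", "admin"]:
    print(sample, "->", classify(sample)[0])
```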

The regex: named groups, anchored

Run that through LogForge and you get a single anchored pattern in the PCRE/JavaScript common subset:

^(?<timestamp>[A-Za-z]+ \d+ \d+:\d+:\d+) fw01 sshd\[(?<pid>-?\d+(?:\.\d+)?)\]: Failed password for invalid user (?<user>[A-Za-z0-9._@-]+) from (?<srcip>\d{1,3}(?:\.\d{1,3}){3}) port (?<port1>\d{1,5}) ssh2$

Note three deliberate choices. The ^ and $ anchors make it strict: it matches the whole line or nothing, so a truncated line cannot produce a half-populated event. The literal dot in the IP is written \., because in PCRE a bare . means any character and you do not want 203x0x113x45 to match. And each field is a (?<name>…) named group, so downstream code reads the capture by name (match.groups.srcip in JavaScript) instead of counting parentheses. This string works unchanged in grep -P, PCRE2, Ruby, Java, and JavaScript. Python’s re module is the exception: it spells named groups (?P<name>…), so each (?<name> has to become (?P<name> before re.compile will accept the pattern.
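To see the pattern run, here is a short Python sketch. It first rewrites the group syntax into the (?P<name>…) spelling that Python's re module expects, then matches the sample line. The variable names are illustrative.

```python
import re

# The generated pattern, split across lines for readability only.
PCRE_PATTERN = (
    r"^(?<timestamp>[A-Za-z]+ \d+ \d+:\d+:\d+) fw01 "
    r"sshd\[(?<pid>-?\d+(?:\.\d+)?)\]: "
    r"Failed password for invalid user (?<user>[A-Za-z0-9._@-]+) "
    r"from (?<srcip>\d{1,3}(?:\.\d{1,3}){3}) port (?<port1>\d{1,5}) ssh2$"
)

# Rewrite (?<name> as (?P<name>. The lookahead requires a name character,
# so lookbehinds such as (?<= and (?<! would be left untouched.
PY_PATTERN = re.sub(r"\(\?<(?=[A-Za-z_])", "(?P<", PCRE_PATTERN)
LINE_RE = re.compile(PY_PATTERN)

line = ("Jun 30 22:14:15 fw01 sshd[4721]: Failed password for invalid user "
        "admin from 203.0.113.45 port 51234 ssh2")

match = LINE_RE.match(line)
event = match.groupdict() if match else None
print(event)

# A truncated line fails outright instead of yielding a partial event.
truncated = LINE_RE.match(line[:-5])
print(truncated)
```

Every captured value is a string; converting pid and port1 to integers is left to the caller.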

The same idea in Grok

Logstash does not want raw regex; it wants Grok, which is regex with a macro layer. Instead of respelling the IP pattern, you reference a named pattern from Logstash’s library:

%{SYSLOGTIMESTAMP:timestamp} fw01 sshd\[%{NUMBER:pid}\]: Failed password for invalid user %{USERNAME:user} from %{IPV4:srcip} port %{INT:port1} ssh2

%{IPV4:srcip} expands to essentially the same IP regex above, but reads as intent. The win is legibility and reuse; the catch is that the macro has to actually fit your value. If a field’s shape does not match any stock pattern, a good generator validates against your real sample and falls back to %{DATA} with a note, rather than emitting a pattern that quietly fails at runtime.
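The macro layer is easy to picture as plain text substitution. The sketch below expands %{NAME:field} references into named groups and checks a candidate macro against a sample value before using it. The definitions in `MACROS` are simplified stand-ins written for this example, not the patterns shipped with Logstash, and `expand` and `pick_macro` are hypothetical helpers.

```python
import re

# Simplified stand-ins for illustration only, NOT Logstash's real definitions.
MACROS = {
    "SYSLOGTIMESTAMP": r"[A-Za-z]{3} +\d{1,2} \d{2}:\d{2}:\d{2}",
    "NUMBER": r"-?\d+(?:\.\d+)?",
    "USERNAME": r"[A-Za-z0-9._-]+",
    "IPV4": r"\d{1,3}(?:\.\d{1,3}){3}",
    "INT": r"[+-]?\d+",
    "DATA": r".*?",
}

GROK_REF = re.compile(r"%\{(\w+):(\w+)\}")


def expand(grok: str) -> str:
    """Replace each %{MACRO:field} with a Python named group."""
    def repl(m: re.Match) -> str:
        macro, field = m.group(1), m.group(2)
        return f"(?P<{field}>{MACROS[macro]})"
    return GROK_REF.sub(repl, grok)


def pick_macro(value: str, candidates: list[str]) -> str:
    """Return the first candidate that fits the sample value, else DATA."""
    for name in candidates:
        if re.fullmatch(MACROS[name], value):
            return name
    return "DATA"


grok = (r"%{SYSLOGTIMESTAMP:timestamp} fw01 sshd\[%{NUMBER:pid}\]: "
        r"Failed password for invalid user %{USERNAME:user} "
        r"from %{IPV4:srcip} port %{INT:port1} ssh2")
line = ("Jun 30 22:14:15 fw01 sshd[4721]: Failed password for invalid user "
        "admin from 203.0.113.45 port 51234 ssh2")

parsed = re.search(expand(grok), line).groupdict()
print(parsed["srcip"], parsed["port1"])
print(pick_macro("fe80::1", ["IPV4"]))
```

The last line shows the fallback: an IPv6 value does not fit the IPV4 stand-in, so the helper returns DATA rather than a macro that would fail at runtime.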

The Wazuh decoder: same regex, inverted metacharacters

Now the one that bites people. Wazuh decoders are written in OS_Regex, Wazuh’s own dialect. It looks like regex. It is not regex. Here is the generated decoder:

<decoder name="logforge-syslog3164">
  <program_name>^sshd$</program_name>
</decoder>

<decoder name="logforge-syslog3164">
  <parent>logforge-syslog3164</parent>
  <regex offset="after_parent">^Failed password for invalid user (\w+) from (\d+.\d+.\d+.\d+) port (\d+) ssh2</regex>
  <order>user, srcip, port1</order>
</decoder>

Look at the IP: (\d+.\d+.\d+.\d+). In PCRE that would be wrong, because those bare dots would match any character. In OS_Regex it is correct, because the metacharacters are inverted: a bare . is a literal dot, and \. is the any-character wildcard. This is the single most common reason a decoder hand-ported from a working regex silently refuses to match. OS_Regex also has no named groups (captures are positional, mapped by name in the <order> line) and no {n,m} quantifiers or [...] classes. So the parent decoder gates on program_name, the child decoder captures, and <order> names the columns left to right. Only three of the five fields appear in the regex because Wazuh’s pre-decoding stage has already consumed the syslog header, including the timestamp and the sshd[4721] tag, before any decoder runs.
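The dot inversion on its own is mechanical enough to sketch. The hypothetical helper below swaps \. and bare . in a regex fragment and passes every other escape through unchanged. It is deliberately narrow: it does not translate character classes, {n,m} quantifiers, or named groups, all of which need separate handling when porting to OS_Regex.

```python
def swap_dots(fragment: str) -> str:
    """Swap the meaning-bearing spellings of '.' between PCRE and OS_Regex."""
    out = []
    i = 0
    while i < len(fragment):
        ch = fragment[i]
        if ch == "\\" and i + 1 < len(fragment):
            nxt = fragment[i + 1]
            # An escaped dot becomes a bare dot; other escapes (\d, \w, \[)
            # are copied through as a two-character unit.
            out.append("." if nxt == "." else ch + nxt)
            i += 2
        elif ch == ".":
            # A bare dot becomes an escaped dot.
            out.append("\\.")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


pcre_ip = r"(\d+\.\d+\.\d+\.\d+)"
os_regex_ip = swap_dots(pcre_ip)
print(os_regex_ip)
```

Because the swap is symmetric, applying it twice returns the original fragment, which makes it a handy sanity check when moving a pattern in either direction.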

Three formats, one extraction, three incompatible spellings — and the . versus \. inversion means the two that look most alike behave oppositely.

Try it

Paste your own line into LogForge and watch all four outputs (rsyslog is the fourth) regenerate as you rename fields. Nothing is uploaded; it all runs in your browser. To go format by format with more examples, see the output-format docs, or head straight to the parser.

FAQ

Why not just write the regex by hand? For a one-off, do. Generating from a real sample pays off because the pattern is verified against your exact lines, the group names match the fields you care about, and you get the finicky Grok and Wazuh equivalents for free.

Will the generated regex work in Go (RE2)? Mostly — but for logs whose fields reorder between lines, the generator emits lookaround captures, which RE2 rejects. Feed it a sample with stable field order if you target Go.

Why does the Wazuh decoder use a bare . where my regex used \.? Because OS_Regex inverts them: bare . is a literal period, \. is the wildcard. The generator handles the inversion so you do not port the bug.

Try it on your own logs

Paste a few lines, review the detected fields, copy whichever format your stack needs. Free, no account, nothing uploaded.

open LogForge →