$ logforge parse apache

Parse Apache access (combined) logs → regex, Grok, Wazuh & rsyslog

The Apache HTTP Server writes its access log through the mod_log_config module. A LogFormat directive sets the exact columns and a CustomLog directive binds that format to a file, typically /var/log/apache2/access.log on Debian/Ubuntu or /var/log/httpd/access_log on RHEL. The near-universal choice is the 'combined' nickname, whose format string is "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"". Decoded, that is: the remote host (%h), the identd value (%l, almost always a dash), the HTTP-authenticated username (%u), the request time in brackets (%t), the full request line in quotes (%r), the final status code (%>s), the response size in bytes excluding headers (%b), and the quoted Referer and User-Agent request headers. nginx's predefined combined format follows the same layout, which is why the two are so often parsed by one rule.
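
As a rough illustration of that column order, the sketch below maps each directive of the stock combined format to one named group. It is hand-written for this explanation, not LogForge output; the group names and the parse_combined helper are assumptions, and it only covers an unmodified combined LogFormat.

```python
import re
from typing import Optional

# Hand-written pattern for the stock "combined" LogFormat.
# Group names are illustrative, not the names LogForge generates.
COMBINED = re.compile(
    r'^(?P<host>\S+) '                    # %h  remote host
    r'(?P<ident>\S+) '                    # %l  identd value, usually "-"
    r'(?P<user>\S+) '                     # %u  authenticated user or "-"
    r'\[(?P<time>[^\]]+)\] '              # %t  bracketed request time
    r'"(?P<request>(?:[^"\\]|\\.)*)" '    # %r  request line; Apache escapes an embedded quote as \"
    r'(?P<status>\d{3}) '                 # %>s final status
    r'(?P<size>\d+|-) '                   # %b  bytes sent, "-" when none
    r'"(?P<referer>(?:[^"\\]|\\.)*)" '    # %{Referer}i
    r'"(?P<user_agent>(?:[^"\\]|\\.)*)"$' # %{User-Agent}i
)


def parse_combined(line: str) -> Optional[dict]:
    """Return the fields of one combined-format line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ('203.0.113.99 - - [03/Jul/2026:14:22:40 +0300] '
              '"GET /.env HTTP/1.1" 404 153 "-" "curl/8.6.0"')
    print(parse_combined(sample))
```

Every value comes back as a string, so converting the time, status, and size is left to the caller. A username containing spaces would not fit the \S+ used for %u here.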

What makes Apache logs their own parsing problem is the fields administrators add. Because the format is a directive, real-world Apache logs frequently carry a leading %v (the serving virtual host) on multi-tenant boxes, a response-time column (%D in microseconds or %T in seconds), %{X-Forwarded-For}i to recover the true client behind a proxy, or a %p port. None of these appear in a textbook combined line. The %u username field is the one that most often confuses generic HTTP parsers: on an authenticated path like /wp-admin/ you get a real username (jdoe), while on public paths the same column is a dash, so a field that looks constant in one sample varies in another. The bracketed %t timestamp uses the same dd/Mon/yyyy:HH:MM:SS ±zzzz form as nginx (the UTC offset can be negative) and is not ISO 8601. A quoted %r can be blank or garbage when a client sends a malformed request, and %b is a dash (not 0) when no body was sent.
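
The dash placeholders and the non-ISO timestamp are easiest to handle in a small normalization step after the raw fields are captured. The helpers below are a minimal sketch; their names are assumptions made for this example, and mapping a dash in %b to 0 is one possible choice rather than a rule.

```python
from datetime import datetime
from typing import Optional


def parse_time(raw: str) -> datetime:
    """Convert a %t value without its brackets, e.g. '03/Jul/2026:14:22:15 +0300'.

    %b in strptime is locale-dependent; this assumes English month abbreviations.
    """
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")


def parse_size(raw: str) -> int:
    """Treat the '-' that %b logs when no body was sent as zero bytes."""
    return 0 if raw == "-" else int(raw)


def parse_user(raw: str) -> Optional[str]:
    """Return None for the '-' that %u logs on unauthenticated requests."""
    return None if raw == "-" else raw


if __name__ == "__main__":
    print(parse_time("03/Jul/2026:14:22:15 +0300").isoformat())
    print(parse_size("-"), parse_user("jdoe"), parse_user("-"))
```

Returning a timezone-aware datetime keeps the original offset, so lines from servers in different zones can be compared directly.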

For security monitoring the high-value fields are the client host (or the forwarded-for header when Apache sits behind a load balancer), the request path and method, and the status. Apache access logs are a primary source for spotting web reconnaissance and exploitation: 404 storms hunting for /.env, wp-login.php, or phpMyAdmin; 302/401 patterns around admin endpoints; and automated clients like curl/8.6.0 that announce themselves in the User-Agent. Correlating a spike of 404s from a single %h with a later 200 on a sensitive path is a classic 'they found something' signal that a correct field-level parse makes trivial to alert on.
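
The 404-then-200 correlation can be sketched in a few lines once the host, path, and status are available as separate fields. Everything here is illustrative: the SENSITIVE prefixes, the threshold of 5, and the find_probe_then_hit name are assumptions, and a production rule would also bound the pattern to a time window.

```python
from collections import defaultdict

# Hypothetical list of path prefixes worth alerting on; tune for your site.
SENSITIVE = ("/.env", "/wp-login.php", "/wp-admin/", "/phpmyadmin")


def find_probe_then_hit(events, threshold=5):
    """Flag hosts that get a 200 on a sensitive path after many 404s.

    events: iterable of (host, path, status) tuples in log order.
    Returns a list of (host, path) pairs, one per suspicious 200.
    """
    misses = defaultdict(int)  # running count of 404s per client host
    alerts = []
    for host, path, status in events:
        if status == 404:
            misses[host] += 1
        elif (status == 200
              and misses[host] >= threshold
              and path.lower().startswith(SENSITIVE)):
            alerts.append((host, path))
    return alerts


if __name__ == "__main__":
    scan = [("203.0.113.99", "/probe%d" % i, 404) for i in range(5)]
    scan.append(("203.0.113.99", "/.env", 200))
    print(find_probe_then_hit(scan))
```

The counter never resets in this sketch, so a long-lived client with occasional 404s would eventually cross the threshold; a sliding window avoids that.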

What an Apache access (combined) line looks like

The Combined sample below is fed verbatim into the engine to produce every parser on this page.

192.0.2.10 - jdoe [03/Jul/2026:14:22:15 +0300] "GET /wp-admin/ HTTP/1.1" 302 512 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/126.0"
203.0.113.99 - - [03/Jul/2026:14:22:40 +0300] "GET /.env HTTP/1.1" 404 153 "-" "curl/8.6.0"

Detected fields

The engine classified this sample as freeform and consolidated 11 fields across 2 lines. Fields marked literal were identical on every sample line, so they are baked into the pattern as anchors rather than captured.

  • ip1 : ipv4
  • _lit1 : literal · literal
  • literal : literal
  • timestamp : timestamp
  • method : http_method · literal
  • quoted_string : quoted_string
  • quoted_string2 : quoted_string · literal
  • status : http_status
  • number : number
  • url : url
  • user_agent : user_agent

Regex (named capture groups)

# sample: 192.0.2.10 - jdoe [03/Jul/2026:14:22:15 +0300] "GET /wp-admin/ HTTP/1.1" 302 512 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/126.0"
# groups: ip1=192.0.2.10, literal=jdoe, timestamp=03/Jul/2026:14:22:15 +0300, quoted_string=/wp-admin/, status=302, number=512, url=https://example.com/, user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/126.0
^(?<ip1>\d{1,3}(?:\.\d{1,3}){3}) - (?<literal>(?:[A-Za-z]+|-)) \[(?<timestamp>\d+/[A-Za-z]+/\d+:\d+:\d+:\d+ \+\d+)\] "GET (?<quoted_string>(?:/[A-Za-z]+-[A-Za-z]+/|/\.[A-Za-z]+)) HTTP/1\.1" (?<status>\d{3}) (?<number>-?\d+(?:\.\d+)?) "(?<url>[^"]*)" "(?<user_agent>[^"]*)"$
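
To try this pattern outside LogForge, the sketch below loads it into Python. Two things are worth checking. Python's re module needs (?P<name>...) rather than (?<name>...), so the group syntax is rewritten first. And because the pattern was derived from two GET lines, it is tight: the OFF_SAMPLE line is a hypothetical POST request added for this example, not part of the sample above, and it is expected not to match.

```python
import re

# Generated pattern from above, split across lines for readability.
GENERATED = (
    r'^(?<ip1>\d{1,3}(?:\.\d{1,3}){3}) - (?<literal>(?:[A-Za-z]+|-)) '
    r'\[(?<timestamp>\d+/[A-Za-z]+/\d+:\d+:\d+:\d+ \+\d+)\] '
    r'"GET (?<quoted_string>(?:/[A-Za-z]+-[A-Za-z]+/|/\.[A-Za-z]+)) HTTP/1\.1" '
    r'(?<status>\d{3}) (?<number>-?\d+(?:\.\d+)?) '
    r'"(?<url>[^"]*)" "(?<user_agent>[^"]*)"$'
)

# Python's re only accepts the (?P<name>...) spelling of named groups.
# The plain replace is safe here because the pattern has no lookbehinds.
PATTERN = re.compile(GENERATED.replace("(?<", "(?P<"))

SAMPLES = [
    '192.0.2.10 - jdoe [03/Jul/2026:14:22:15 +0300] "GET /wp-admin/ HTTP/1.1" '
    '302 512 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 Chrome/126.0"',
    '203.0.113.99 - - [03/Jul/2026:14:22:40 +0300] "GET /.env HTTP/1.1" '
    '404 153 "-" "curl/8.6.0"',
]

# Hypothetical line, not from the sample: a POST to a path shape the engine never saw.
OFF_SAMPLE = ('198.51.100.7 - - [03/Jul/2026:14:23:01 +0300] '
              '"POST /wp-login.php HTTP/1.1" 200 1024 "-" "curl/8.6.0"')

if __name__ == "__main__":
    for line in SAMPLES:
        match = PATTERN.match(line)
        print(match.group("ip1"), match.group("status"))
    print(PATTERN.match(OFF_SAMPLE))  # no match: method and path fall outside the sample
```

Feeding the engine a more varied sample (other methods, a negative UTC offset, a dash in the size column) is the way to get a looser pattern; hand-editing the generated one is the fallback.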

Grok pattern (Logstash / Elastic)

# custom patterns
APACHE_NOTDQUOTE [^"]*

%{IPV4:ip1} - %{NOTSPACE:literal} \[%{HTTPDATE:timestamp}\] "GET %{NOTSPACE:quoted_string} HTTP/1\.1" %{INT:status} %{NUMBER:number} "%{APACHE_NOTDQUOTE:url}" "%{APACHE_NOTDQUOTE:user_agent}"
  • note constant field "method" embedded as literal anchor "GET" (varying=false)
  • note constant field "quoted_string2" embedded as literal anchor "HTTP/1.1" (varying=false)
  • note field "url" (url): samples do not all match %{URI}; using %{APACHE_NOTDQUOTE} instead
  • note custom patterns emitted — save the '# custom patterns' block to a file in your patterns_dir

Wazuh decoder (OS_Regex XML)

<!--
  Generated by LogForge - Wazuh decoder (OS_Regex dialect, not PCRE)
  sample: 192.0.2.10 - jdoe [03/Jul/2026:14:22:15 +0300] "GET /wp-admin/ HTTP/1.1" 302 512 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/5
  test with: /var/ossec/bin/wazuh-logtest
-->

<decoder name="apache-freeform">
  <prematch>^\d+.\d+.\d+.\d+ </prematch>
</decoder>

<decoder name="apache-freeform">
  <parent>apache-freeform</parent>
  <regex>^(\d+.\d+.\d+.\d+) - (\w+) [(\d+/\w+/\d+:\d+:\d+:\d+ \p\d+)] "GET (\S+) HTTP/1.1" (\d+) (\d+) "(\.+)" "(\.+)"</regex>
  <order>srcip, literal, timestamp, quoted_string, status, number, url, user_agent</order>
</decoder>
  • note no stable literal prefix found — <prematch> anchors on the leading field pattern; tighten it for your environment
  • note field "ip1" mapped to Wazuh conventional field "srcip"
  • note field "url": free-text capture (\.+) bounded by a quote anchor — OS_Regex greediness may over-consume if the anchor repeats
  • note field "user_agent": free-text capture (\.+) bounded by end of line — OS_Regex greediness may over-consume if the anchor repeats
  • note constant field "method" embedded as literal anchor "GET"
  • note constant field "quoted_string2" embedded as literal anchor "HTTP/1.1"
  • note decoder order and prematch specificity may need site-specific tuning (other decoders in your ruleset can shadow these) — validate with /var/ossec/bin/wazuh-logtest

rsyslog template / liblognorm rulebase

version=2
# apache — liblognorm v2 rulebase (generated by LogForge)
# Usage with rsyslog (mmnormalize runs liblognorm):
#   module(load="mmnormalize")
#   action(type="mmnormalize" rulebase="/etc/rsyslog.d/apache.rb" useRawMsg="on")
# Literal "%" is escaped as "%%"; raw tabs are written as \x09.
rule=apache:%ip1:ipv4% - %literal:word% [%timestamp:char-to{"extradata":"]"}%] "GET %quoted_string:word% HTTP/1.1" %status:number% %number:number% "%url:char-to{"extradata":"\""}%" "%user_agent:char-to{"extradata":"\""}%"
  • note trailing literal "\"" reconstructed from line 1
  • note field "timestamp": samples do not uniformly match engine type "timestamp"; using a generic parser
  • note chosen parser types: ip1=ipv4, literal=word, timestamp=char-to(]), quoted_string=word, status=number, number=number, url=char-to("), user_agent=char-to(")

FAQ

What is the difference between Apache combined and common log format?
The 'common' format (CLF) ends after the response size: host, ident, user, time, request, status, bytes. 'combined' appends two quoted header fields, Referer and User-Agent. Both are just nicknames defined by LogFormat directives, so a given server logs whatever its active LogFormat says; inspect the config, not the file name.

Why does my Apache line have an extra field my parser does not expect?
Almost certainly a customized LogFormat. Common additions are a leading %v (virtual host), a %D/%T response-time column, %p (port), or %{X-Forwarded-For}i to capture the real client behind a proxy. Grab a representative sample from the actual server and regenerate the parser against it rather than assuming stock combined.

How do I recover the real client IP when Apache is behind a load balancer?
The %h field will be the proxy or load-balancer address. To get the originating client you need the X-Forwarded-For header, which appears in the log only if the LogFormat includes %{X-Forwarded-For}i. If it does, capture that field and take the left-most address in the comma-separated list as the client. Anything a client sends in that header is passed along, so trust the value only as far as you trust your proxy chain.
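
Picking the left-most address out of a logged X-Forwarded-For value can be sketched as follows. The client_from_xff name is an assumption for this example, and skipping entries that are not valid IP addresses (such as a logged dash or the token "unknown") is a design choice, not something Apache does for you.

```python
import ipaddress
from typing import Optional


def client_from_xff(header: str) -> Optional[str]:
    """Return the left-most syntactically valid IP in an X-Forwarded-For value.

    Returns None when the field is '-' or holds no parseable address.
    Validity here means 'parses as IPv4 or IPv6', not 'is trustworthy'.
    """
    for part in header.split(","):
        candidate = part.strip()
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # skip placeholders and malformed entries
        return candidate
    return None


if __name__ == "__main__":
    print(client_from_xff("198.51.100.7, 10.0.0.5, 10.0.0.1"))
    print(client_from_xff("-"))
```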

Try it on your own Apache access (combined) lines

Paste a few real lines, review the detected fields, and copy whichever format your stack needs. Free, no account, nothing uploaded.
