Pattern Extraction
Matchy includes a high-performance pattern extractor for finding domains, IP addresses (IPv4 and IPv6), email addresses, and file hashes (MD5, SHA1, SHA256, SHA384) in unstructured text like log files.
Overview
The Extractor uses SIMD-accelerated algorithms to scan text and extract patterns at 200-500 MB/sec. This is useful for:
- Log scanning: Find domains/IPs in access logs, firewall logs, etc.
- Threat detection: Extract indicators from security logs
- Analytics: Count unique domains/IPs in large datasets
- Compliance: Find email addresses or PII in audit logs
- Forensics: Extract patterns from binary logs
Quick Start
#![allow(unused)]
fn main() {
use matchy::extractor::Extractor;
let extractor = Extractor::new()?;
let log_line = b"2024-01-15 GET /api evil.example.com 192.168.1.1";
for match_item in extractor.extract_from_line(log_line) {
println!("Found: {}", match_item.as_str(log_line));
}
// Output:
// Found: evil.example.com
// Found: 192.168.1.1
}
Supported Patterns
Domains
Extracts fully qualified domain names with TLD validation:
#![allow(unused)]
fn main() {
let line = b"Visit api.example.com or https://www.github.com/path";
for match_item in extractor.extract_from_line(line) {
if let ExtractedItem::Domain(domain) = match_item.item {
println!("Domain: {}", domain);
}
}
// Output:
// Domain: api.example.com
// Domain: www.github.com
}
Features:
- TLD validation: 10K+ real TLDs from Public Suffix List
- Unicode support: Handles münchen.de, café.fr (both UTF-8 and punycode)
- Subdomain extraction: Extracts full domain from URLs
- Word boundaries: Avoids false positives in non-domain text
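The TLD check described above can be illustrated with a minimal, standard-library-only sketch. It is not Matchy's implementation: `has_known_tld` and `sample_tlds` are hypothetical names, and the four-entry set merely stands in for the full Public Suffix List.

```rust
use std::collections::HashSet;

/// Tiny stand-in for the Public Suffix List (illustrative only).
fn sample_tlds() -> HashSet<&'static str> {
    ["com", "org", "de", "fr"].iter().copied().collect()
}

/// Returns true when `candidate` has at least two non-empty labels
/// and its final label (compared case-insensitively) is a known TLD.
fn has_known_tld(candidate: &str, known_tlds: &HashSet<&str>) -> bool {
    let labels: Vec<&str> = candidate.split('.').collect();
    if labels.len() < 2 || labels.iter().any(|label| label.is_empty()) {
        return false;
    }
    let tld = labels[labels.len() - 1].to_ascii_lowercase();
    known_tlds.contains(tld.as_str())
}
```

With this sample set, `api.example.com` passes while a file name such as `config.yaml` is rejected, which is the kind of false positive TLD validation is meant to filter out.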
IPv4 Addresses
Extracts all valid IPv4 addresses:
#![allow(unused)]
fn main() {
let line = b"Traffic from 10.0.0.5 to 172.16.0.10";
for match_item in extractor.extract_from_line(line) {
if let ExtractedItem::Ipv4(ip) = match_item.item {
println!("IP: {}", ip);
}
}
// Output:
// IP: 10.0.0.5
// IP: 172.16.0.10
}
Features:
- SIMD-accelerated: Uses memchr for fast dot detection
- Validation: Rejects invalid IPs (256.1.1.1, 999.0.0.1)
- Word boundaries: Avoids false matches in version numbers
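To make the validation and word-boundary rules concrete, here is a small sketch built only on the standard library's `Ipv4Addr` parser. It assumes whitespace-separated tokens and is not how Matchy scans internally; `find_ipv4` is a hypothetical helper name.

```rust
use std::net::Ipv4Addr;

/// Splits a line on whitespace, strips surrounding punctuation from each
/// token, and keeps only the tokens that parse as a complete IPv4 address.
fn find_ipv4(line: &str) -> Vec<Ipv4Addr> {
    line.split_whitespace()
        .map(|tok| tok.trim_matches(|c: char| !c.is_ascii_alphanumeric()))
        .filter_map(|tok| tok.parse::<Ipv4Addr>().ok())
        .collect()
}
```

Because the whole token must parse, out-of-range octets such as `256.1.1.1` and version-like strings such as `v1.2.3.4` are both rejected.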
IPv6 Addresses
Extracts all valid IPv6 addresses:
#![allow(unused)]
fn main() {
let line = b"Server at 2001:db8::1 responded from fe80::1";
for match_item in extractor.extract_from_line(line) {
if let ExtractedItem::Ipv6(ip) = match_item.item {
println!("IPv6: {}", ip);
}
}
// Output:
// IPv6: 2001:db8::1
// IPv6: fe80::1
}
Features:
- SIMD-accelerated: Uses memchr for fast colon detection
- Compressed notation: Handles :: and full addresses
- Validation: Full RFC 4291 compliance via Rust’s Ipv6Addr
- Mixed notation: Supports ::ffff:127.0.0.1 format
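Since validation is delegated to Rust's `Ipv6Addr`, the accepted notations can be checked directly against the standard library. The sketch below assumes whitespace-separated tokens; `find_ipv6` is a hypothetical helper, not part of the Matchy API.

```rust
use std::net::Ipv6Addr;

/// Keeps whitespace-separated tokens that contain a colon and parse as IPv6.
/// Common surrounding punctuation is trimmed before parsing.
fn find_ipv6(line: &str) -> Vec<Ipv6Addr> {
    line.split_whitespace()
        .map(|tok| tok.trim_matches(|c: char| matches!(c, ',' | ';' | '(' | ')' | '[' | ']')))
        .filter(|tok| tok.contains(':'))
        .filter_map(|tok| tok.parse::<Ipv6Addr>().ok())
        .collect()
}
```

Colon-separated text that is not an address, such as a `12:34:56` timestamp, fails the parse and is dropped.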
Email Addresses
Extracts RFC 5322-compliant email addresses:
#![allow(unused)]
fn main() {
let line = b"Contact alice@example.com or bob+tag@company.org";
for match_item in extractor.extract_from_line(line) {
if let ExtractedItem::Email(email) = match_item.item {
println!("Email: {}", email);
}
}
// Output:
// Email: alice@example.com
// Email: bob+tag@company.org
}
Features:
- Plus addressing: Supports user+tag@example.com
- Subdomain validation: Checks domain part for valid TLD
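As a rough illustration of how a candidate splits into a local part (including any plus tag) and a domain part, consider the sketch below. It is deliberately much simpler than RFC 5322 and does not check the TLD; `split_email` is a hypothetical helper, not a Matchy function.

```rust
/// Splits a candidate at the last '@' and applies simplified checks:
/// the local part may contain ASCII alphanumerics and `. _ + -`,
/// and the domain must have at least two non-empty labels.
fn split_email(candidate: &str) -> Option<(&str, &str)> {
    let (local, domain) = candidate.rsplit_once('@')?;
    let local_ok = !local.is_empty()
        && local
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || "._+-".contains(c));
    let domain_ok = domain.split('.').count() >= 2
        && domain.split('.').all(|label| !label.is_empty());
    if local_ok && domain_ok {
        Some((local, domain))
    } else {
        None
    }
}
```

A real extractor would then run the domain part through the same TLD validation used for standalone domains.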
File Hashes
Extracts MD5, SHA1, SHA256, and SHA384 file hashes:
#![allow(unused)]
fn main() {
use matchy::extractor::{ExtractedItem, HashType};
let line = b"malware.exe MD5=5d41402abc4b2a76b9719d911017c592 detected";
for match_item in extractor.extract_from_line(line) {
if let ExtractedItem::Hash(hash_type, hash) = match_item.item {
let type_str = match hash_type {
HashType::Md5 => "MD5",
HashType::Sha1 => "SHA1",
HashType::Sha256 => "SHA256",
// Catch-all for any other hash type (for example SHA384)
_ => "other",
};
println!("{}: {}", type_str, hash);
}
}
// Output:
// MD5: 5d41402abc4b2a76b9719d911017c592
}
Features:
- Boundary distance detection: Finds tokens of exact length (32/40/64/96 hex chars)
- SIMD hex validation: Auto-vectorized lookup table for blazing speed
- Case insensitive: Accepts both lowercase and uppercase hex
- Few false positives: Rejects UUIDs (with dashes) and non-hex strings
- High throughput: ~1-2 GB/sec processing speed
Supported hash types:
- MD5: 32 hex characters (e.g., 5d41402abc4b2a76b9719d911017c592)
- SHA1: 40 hex characters (e.g., 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12)
- SHA256: 64 hex characters (e.g., 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae)
- SHA384: 96 hex characters (e.g., cb00753f45a35e8bb5a03d699ac65007272c32ab0eded1631a8b605a43ff5bed8086072ba1e7cc2358baeca134c825a7)
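The length-based classification in the list above can be sketched in a few lines. This is an illustration under the stated lengths only: `HashKind` and `classify_hash` are hypothetical names, distinct from Matchy's own `HashType`, and the sketch does no SIMD work.

```rust
/// Hypothetical stand-in for a hash-type enum (not Matchy's `HashType`).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum HashKind {
    Md5,
    Sha1,
    Sha256,
    Sha384,
}

/// Classifies a token by length after confirming every byte is a hex digit.
/// Upper- and lowercase hex are both accepted.
fn classify_hash(token: &str) -> Option<HashKind> {
    if !token.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    match token.len() {
        32 => Some(HashKind::Md5),
        40 => Some(HashKind::Sha1),
        64 => Some(HashKind::Sha256),
        96 => Some(HashKind::Sha384),
        _ => None,
    }
}
```

A UUID written with dashes fails the hex check, and a hex run of any other length falls through to `None`.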
Configuration
Customize extraction behavior using the builder pattern:
#![allow(unused)]
fn main() {
use matchy::extractor::Extractor;
let extractor = Extractor::builder()
.extract_domains(true) // Enable domain extraction
.extract_ipv4(true) // Enable IPv4 extraction
.extract_ipv6(true) // Enable IPv6 extraction
.extract_emails(false) // Disable email extraction
.min_domain_labels(3) // Require 3+ labels (api.test.com)
.require_word_boundaries(true) // Enforce word boundaries
.build()?;
}
Configuration Options
| Option | Default | Description |
|---|---|---|
| extract_domains | true | Extract domain names |
| extract_ipv4 | true | Extract IPv4 addresses |
| extract_ipv6 | true | Extract IPv6 addresses |
| extract_emails | true | Extract email addresses |
| extract_hashes | true | Extract file hashes (MD5, SHA1, SHA256, SHA384) |
| min_domain_labels | 2 | Minimum labels (2 = example.com, 3 = api.example.com) |
| require_word_boundaries | true | Ensure patterns have word boundaries |
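The effect of min_domain_labels comes down to counting dot-separated labels. A minimal sketch of that check follows; `meets_min_labels` is a hypothetical helper used only to illustrate the option's meaning, not Matchy's internal logic.

```rust
/// Returns true when `domain` has at least `min_labels` non-empty,
/// dot-separated labels.
fn meets_min_labels(domain: &str, min_labels: usize) -> bool {
    domain.split('.').filter(|label| !label.is_empty()).count() >= min_labels
}
```

With the default of 2, `example.com` qualifies; raising the option to 3 keeps `api.example.com` but drops `example.com`.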
Unicode and IDN Support
The extractor handles Unicode domains automatically:
#![allow(unused)]
fn main() {
let line = "Visit münchen.de or café.fr".as_bytes();
for match_item in extractor.extract_from_line(line) {
if let ExtractedItem::Domain(domain) = match_item.item {
println!("Unicode domain: {}", domain);
}
}
// Output:
// Unicode domain: münchen.de
// Unicode domain: café.fr
}
How it works:
- Extracts Unicode text as-is
- Validates TLD using punycode conversion internally
- Returns original Unicode form (not punycode)
Binary Log Support
The extractor can find ASCII patterns in binary data:
#![allow(unused)]
fn main() {
let mut binary_log = Vec::new();
binary_log.extend_from_slice(b"Log: ");
binary_log.push(0xFF); // Invalid UTF-8
binary_log.extend_from_slice(b" evil.com ");
for match_item in extractor.extract_from_line(&binary_log) {
println!("Found in binary: {}", match_item.as_str(&binary_log));
}
// Output:
// Found in binary: evil.com
}
This is useful for scanning:
- Binary protocol logs
- Corrupted text files
- Mixed encoding logs
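The idea of scanning raw bytes and converting only the candidates to text can be sketched with the standard library alone. This is a simplified illustration, not the extractor's algorithm; `dotted_ascii_tokens` is a hypothetical helper.

```rust
/// Splits raw bytes on anything that is not an ASCII letter, digit, '.' or '-',
/// keeps tokens containing a dot, and converts only those tokens to `&str`.
/// Invalid bytes elsewhere in the input never reach UTF-8 validation.
fn dotted_ascii_tokens(data: &[u8]) -> Vec<&str> {
    data.split(|b| !(b.is_ascii_alphanumeric() || *b == b'.' || *b == b'-'))
        .filter(|tok| tok.contains(&b'.'))
        .filter_map(|tok| std::str::from_utf8(tok).ok())
        .collect()
}
```

Applied to the binary log built above, the 0xFF byte simply acts as a separator and the only dotted candidate returned is evil.com.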
Performance
The extractor is highly optimized:
- Throughput: 200-500 MB/sec on typical log files
- SIMD acceleration: Uses memchr for byte scanning
- Zero-copy: No string allocation until match
- Lazy UTF-8 validation: Only validates matched patterns
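One way to picture the zero-copy and lazy-validation points is a match record that stores only byte offsets and borrows from the input when text is requested. The sketch below is an assumption about the general technique, not Matchy's actual types: `Span` and `find_span` are hypothetical, and the naive window search stands in for the real SIMD scan.

```rust
/// Byte offsets of a match inside the input buffer (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

impl Span {
    /// Borrows the matched bytes from `input` and validates only that slice
    /// as UTF-8. Returns an empty string if the slice is not valid UTF-8.
    fn as_str<'a>(&self, input: &'a [u8]) -> &'a str {
        std::str::from_utf8(&input[self.start..self.end]).unwrap_or("")
    }
}

/// Naive search for the first occurrence of `needle`; no allocation is made.
fn find_span(haystack: &[u8], needle: &[u8]) -> Option<Span> {
    if needle.is_empty() {
        return None;
    }
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
        .map(|start| Span { start, end: start + needle.len() })
}
```

This design would also explain why a call such as as_str takes the original line as an argument: the match itself holds no copy of the text.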
Performance Tips
- Disable unused extractors to reduce overhead:
#![allow(unused)]
fn main() {
    // Extract only IP addresses; every other extractor is switched off
    let extractor = Extractor::builder()
        .extract_ipv4(true)
        .extract_ipv6(true)
        .extract_domains(false)
        .extract_emails(false)
        .extract_hashes(false)
        .build()?;
}
- Process line-by-line for better memory usage:
#![allow(unused)]
fn main() {
    use std::io::{BufRead, BufReader};
    for line in BufReader::new(file).lines() {
        for match_item in extractor.extract_from_line(line?.as_bytes()) {
            // Process match
        }
    }
}
- Use byte slices to avoid UTF-8 conversion:
#![allow(unused)]
fn main() {
    // Fast: no UTF-8 validation on the whole line
    extractor.extract_from_line(line_bytes);
    // Slower: building line_str already validated the entire line as UTF-8
    extractor.extract_from_line(line_str.as_bytes());
}
Combining with Database Lookups
After extracting patterns, you typically want to look them up in a database. Use lookup_extracted() for a clean, efficient API:
#![allow(unused)]
fn main() {
use matchy::{Database, extractor::Extractor};
let db = Database::from("threats.mxy").open()?;
let extractor = Extractor::new()?;
let log_line = b"Traffic from 192.168.1.100 to evil.com";
for item in extractor.extract_from_line(log_line) {
if let Some(result) = db.lookup_extracted(&item, log_line)? {
println!("⚠️ Match: {} ({})",
item.as_str(log_line),
item.item.type_name()
);
}
}
}
See the Querying guide for complete details on the extract-and-lookup pattern.
CLI Integration
The matchy match command uses the extractor internally:
# Scan logs for threats (outputs JSON to stdout)
matchy match threats.mxy access.log
# Each match is a JSON line:
# {"timestamp":"123.456","line_number":1,"matched_text":"evil.com","match_type":"pattern",...}
# {"timestamp":"123.789","line_number":2,"matched_text":"1.2.3.4","match_type":"ip",...}
# Show statistics (to stderr)
matchy match threats.mxy access.log --stats
# Statistics output (stderr):
# [INFO] Lines processed: 15,234
# [INFO] Lines with matches: 127 (0.8%)
# [INFO] Throughput: 450.23 MB/s
See matchy match for CLI details.
Examples
Complete working examples:
- examples/extractor_demo.rs: Demonstrates all extraction features
- src/bin/matchy.rs: See cmd_match() for CLI implementation
Run the demo:
cargo run --release --example extractor_demo
Summary
- High performance: 200-500 MB/sec throughput
- SIMD-accelerated: Fast pattern finding
- Unicode support: Handles international domains
- Binary logs: Extracts ASCII from non-UTF-8
- Zero-copy: Efficient memory usage
- Configurable: Customize extraction behavior
Pattern extraction makes it easy to scan large log files and find security indicators.