matchy extract
Extract patterns (domains, IPs, emails, hashes, cryptocurrency addresses) from log files or unstructured text.
Synopsis
matchy extract [OPTIONS] <INPUT>...
Description
The matchy extract command scans log files or streams to automatically extract IP addresses, domain names, email addresses, file hashes, and cryptocurrency addresses from unstructured text. This is useful for:
- Generating threat intelligence feeds from logs
- Building input lists for matchy build
- Analyzing log data for patterns
- Pre-filtering data before database matching
Key features:
- SIMD-accelerated extraction (200-500 MB/s typical throughput)
- Multiple output formats: JSON, CSV, plain text
- Configurable extraction types
- Unicode/IDN domain support with automatic punycode conversion
- Word boundary detection for accurate extraction
- Deduplication with the --unique flag
Arguments
<INPUT>...
One or more log files to process (one entry per line), or - for stdin.
$ matchy extract access.log
$ matchy extract log1.txt log2.txt log3.txt
$ cat access.log | matchy extract -
Options
--format <FORMAT>
Output format (default: json):
- json - NDJSON format (one JSON object per pattern)
- csv - CSV format with header (type, value columns)
- text - Plain text (one pattern per line, no metadata)
$ matchy extract access.log --format json
{"type":"domain","value":"example.com"}
{"type":"ipv4","value":"192.0.2.1"}
$ matchy extract access.log --format csv
type,value
domain,"example.com"
ipv4,"192.0.2.1"
$ matchy extract access.log --format text
example.com
192.0.2.1
--types <TYPES>
Comma-separated extraction types (default: all):
- ipv4 or ip4 - IPv4 addresses only
- ipv6 or ip6 - IPv6 addresses only
- ip - Both IPv4 and IPv6
- domain or domains - Domain names
- email or emails - Email addresses
- hash or hashes - File hashes (MD5, SHA1, SHA256, SHA384)
- bitcoin or btc - Bitcoin addresses (all formats)
- ethereum or eth - Ethereum addresses
- monero or xmr - Monero addresses
- crypto - All cryptocurrency addresses
- all - Extract everything (default)
$ matchy extract access.log --types ipv4,domain
$ matchy extract access.log --types ip # IPv4 + IPv6
$ matchy extract access.log --types all # Everything
--min-labels <NUMBER>
Minimum number of domain labels to extract (default: 2).
$ matchy extract access.log --min-labels 2 # example.com (default)
$ matchy extract access.log --min-labels 3 # sub.example.com
This is useful to filter out bare hostnames or require fully-qualified domain names.
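The label threshold can be pictured with a minimal Python sketch. It assumes a label is simply a dot-separated component of the name; `label_count` and `passes_min_labels` are hypothetical helper names, not part of matchy.

```python
def label_count(domain: str) -> int:
    """Count dot-separated labels, ignoring a trailing root dot."""
    return len(domain.rstrip(".").split("."))


def passes_min_labels(domain: str, min_labels: int = 2) -> bool:
    """Return True when the domain has at least min_labels labels."""
    return label_count(domain) >= min_labels


for name in ["localhost", "example.com", "sub.example.com"]:
    # Show which names would survive a threshold of 3 labels.
    print(name, label_count(name), passes_min_labels(name, 3))
```

Under this assumption, a threshold of 3 keeps sub.example.com and drops example.com.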
--no-boundaries
Disable word boundary requirements, allowing patterns to be extracted from the middle of text.
By default, extraction requires word boundaries (whitespace, punctuation) around patterns. Use this flag to extract patterns embedded in other text.
$ matchy extract access.log --no-boundaries
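To illustrate what a boundary requirement changes, here is a rough Python sketch using regular expressions. This is not how matchy implements extraction; the pattern, the choice of "letter or digit" as the boundary rule, and the `find_ipv4` helper are all assumptions made for illustration.

```python
import re

# Four dot-separated groups of 1-3 digits (no range validation in this sketch).
OCTETS = r"\d{1,3}(?:\.\d{1,3}){3}"

# Guarded: the match may not touch a letter or digit on either side.
WITH_BOUNDARIES = re.compile(rf"(?<![0-9A-Za-z]){OCTETS}(?![0-9A-Za-z])")
# Unguarded: the match may sit in the middle of other text.
WITHOUT_BOUNDARIES = re.compile(OCTETS)


def find_ipv4(text, boundaries=True):
    """Return IPv4-looking substrings, with or without boundary guards."""
    pattern = WITH_BOUNDARIES if boundaries else WITHOUT_BOUNDARIES
    return pattern.findall(text)


print(find_ipv4("src=192.0.2.1, ok"))                # ['192.0.2.1']
print(find_ipv4("id192.0.2.1x"))                     # []
print(find_ipv4("id192.0.2.1x", boundaries=False))   # ['192.0.2.1']
```

The second call finds nothing because the address is glued to surrounding letters; dropping the guards recovers it, at the cost of more false matches.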
-u, --unique
Output only unique patterns (deduplicate across all input).
$ matchy extract access.log --unique
This maintains a hash set of seen patterns and outputs each unique pattern only once.
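The hash-set approach described above can be sketched in a few lines of Python. The sketch assumes first-seen order is preserved, which this page does not state; `unique_patterns` is a hypothetical name.

```python
def unique_patterns(patterns):
    """Yield each pattern once, in first-seen order (an assumption)."""
    seen = set()
    for item in patterns:
        if item not in seen:
            seen.add(item)  # memory grows with the number of distinct patterns
            yield item


stream = ["192.0.2.1", "example.com", "192.0.2.1", "example.com", "2001:db8::1"]
print(list(unique_patterns(stream)))
```

Because the set holds every distinct pattern seen so far, memory use scales with the number of unique values rather than with input size.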
-s, --stats
Show extraction statistics to stderr.
$ matchy extract access.log --stats
[INFO] Extracting: IPv4, IPv6, domains, emails
[INFO] Min domain labels: 2
[INFO] Word boundaries: true
[INFO] Unique mode: false
[INFO] === Extraction Complete ===
[INFO] Lines processed: 15,234
[INFO] Patterns found: 3,456
[INFO] IPv4: 2,100
[INFO] IPv6: 23
[INFO] Domains: 1,200
[INFO] Emails: 133
[INFO] Throughput: 450.23 MB/s
[INFO] Total time: 0.15s
Statistics are always written to stderr, leaving stdout clean for piped output.
--show-candidates
Show candidate extraction details for debugging (output to stderr).
$ matchy extract access.log --show-candidates
[CANDIDATE] Domain at 45-61: example.com
[CANDIDATE] IPv4 at 0-10: 192.0.2.1
[CANDIDATE] Email at 23-42: user@example.com
Examples
Extract All Patterns (JSON)
$ matchy extract access.log
{"type":"ipv4","value":"192.0.2.1"}
{"type":"domain","value":"example.com"}
{"type":"email","value":"user@example.com"}
{"type":"ipv6","value":"2001:db8::1"}
Extract Only Domains
$ matchy extract access.log --types domain --format text
example.com
subdomain.example.org
malware.net
Build Threat Intel Database from Logs
Extract unique domains and build a database:
$ matchy extract suspicious.log \
--types domain \
--unique \
--format text \
> domains.txt
$ echo "key,threat_level" > threats.csv
$ sed 's/$/,high/' domains.txt >> threats.csv
$ matchy build threats.csv -o threats.mxy
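If shell quoting is a concern, the CSV step can also be done in Python. The following is a minimal sketch: the key and threat_level column names come from the example above, while `threats_csv` is a hypothetical helper that takes the lines of domains.txt.

```python
import csv
import io


def threats_csv(domains, threat_level="high"):
    """Build CSV text with a key,threat_level header from domain lines."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["key", "threat_level"])
    for domain in domains:
        domain = domain.strip()
        if domain:  # skip blank lines
            writer.writerow([domain, threat_level])
    return buf.getvalue()


print(threats_csv(["example.com\n", "malware.net\n"]), end="")
```

Writing the returned text to threats.csv gives the same input for matchy build as the shell pipeline.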
Extract IPs with Statistics
$ matchy extract access.log --types ip --stats --unique
{"type":"ipv4","value":"192.0.2.1"}
{"type":"ipv4","value":"198.51.100.42"}
{"type":"ipv6","value":"2001:db8::1"}
[INFO] Lines processed: 10,000
[INFO] Patterns found: 2,345
[INFO] IPv4: 2,320
[INFO] IPv6: 25
[INFO] Throughput: 380.15 MB/s
[INFO] Total time: 0.08s
CSV Output for Spreadsheet Import
$ matchy extract firewall.log --format csv > patterns.csv
$ open patterns.csv # Opens in Excel/Numbers/etc.
Extract from stdin Stream
$ tail -f /var/log/syslog | matchy extract - --types domain --stats
Process Multiple Files
$ matchy extract *.log --stats --unique > all_patterns.json
Output Formats
JSON (NDJSON)
One JSON object per line with type and value:
{"type":"domain","value":"example.com"}
{"type":"ipv4","value":"192.0.2.1"}
{"type":"ipv6","value":"2001:db8::1"}
{"type":"email","value":"user@example.com"}
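Since each line is an independent JSON object, downstream scripts can consume the output line by line. A minimal Python sketch, assuming only the type and value fields shown above (`group_by_type` is a hypothetical helper):

```python
import json
from collections import defaultdict


def group_by_type(lines):
    """Group NDJSON records by their "type" field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        groups[record["type"]].append(record["value"])
    return dict(groups)


sample = [
    '{"type":"domain","value":"example.com"}',
    '{"type":"ipv4","value":"192.0.2.1"}',
    '{"type":"ipv4","value":"198.51.100.42"}',
]
print(group_by_type(sample))
```

In practice the lines would come from a file or from sys.stdin when the command output is piped into the script.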
CSV
Header row followed by data rows:
type,value
domain,"example.com"
ipv4,"192.0.2.1"
ipv6,"2001:db8::1"
email,"user@example.com"
Values are properly escaped (quotes doubled for embedded quotes).
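Doubled quotes are the standard CSV escaping convention, so ordinary CSV parsers should read the output without special handling. A short Python sketch using the standard csv module; the sample row with an embedded quote is invented purely to show the escaping, and `read_patterns` is a hypothetical helper.

```python
import csv
import io


def read_patterns(csv_text):
    """Parse type,value CSV text into a list of (type, value) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["type"], row["value"]) for row in reader]


# Second data row is a made-up value containing an embedded double quote.
sample = 'type,value\ndomain,"example.com"\nemail,"odd""quote@example.com"\n'
for kind, value in read_patterns(sample):
    print(kind, value)
```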
Text
One pattern per line, no metadata:
example.com
192.0.2.1
2001:db8::1
user@example.com
Pattern Extraction Details
IPv4 Addresses
Extracts standard IPv4 addresses: 192.0.2.1, 10.0.0.1
Validates format and rejects invalid addresses (e.g., 999.999.999.999).
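The same kind of check can be reproduced in Python with the standard ipaddress module, for example when re-validating values in a downstream script. This is a sketch of the idea, not matchy's own validator; `is_valid_ipv4` is a hypothetical name.

```python
import ipaddress


def is_valid_ipv4(candidate: str) -> bool:
    """Return True if candidate parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(candidate)
    except ValueError:  # AddressValueError is a ValueError subclass
        return False
    return True


for candidate in ["192.0.2.1", "10.0.0.1", "999.999.999.999"]:
    print(candidate, is_valid_ipv4(candidate))
```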
IPv6 Addresses
Extracts IPv6 addresses in all standard formats:
- Full: 2001:0db8:0000:0000:0000:0000:0000:0001
- Compressed: 2001:db8::1
- IPv4-mapped: ::ffff:192.0.2.1
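The full and compressed forms are two spellings of the same address, which matters when comparing or deduplicating extracted values. A small Python sketch with the standard ipaddress module shows the relationship (it says nothing about how matchy itself normalizes output):

```python
import ipaddress

full = ipaddress.IPv6Address("2001:0db8:0000:0000:0000:0000:0000:0001")
compressed = ipaddress.IPv6Address("2001:db8::1")
mapped = ipaddress.IPv6Address("::ffff:192.0.2.1")

print(full == compressed)   # True: same address, different spelling
print(full.compressed)      # 2001:db8::1
print(compressed.exploded)  # 2001:0db8:0000:0000:0000:0000:0000:0001
print(mapped.ipv4_mapped)   # 192.0.2.1
```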
Domain Names
Extracts domain names with proper TLD validation:
- example.com
- subdomain.example.org
- multi.level.subdomain.co.uk
Unicode/IDN support: International domain names are automatically converted to punycode:
- Input: münchen.de
- Output: xn--mnchen-3ya.de
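The conversion can be reproduced in Python for comparison. This sketch uses Python's built-in idna codec, which implements the older IDNA 2003 rules; whether matchy follows the same rules for every edge case is an assumption. `to_punycode` is a hypothetical helper.

```python
def to_punycode(domain: str) -> str:
    """Convert a Unicode domain to its ASCII (punycode) form."""
    return domain.encode("idna").decode("ascii")


print(to_punycode("münchen.de"))   # xn--mnchen-3ya.de
print(to_punycode("example.com"))  # example.com (ASCII names pass through)
```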
TLD validation: Only domains with valid top-level domains are extracted (uses embedded TLD automaton with Public Suffix List data).
Email Addresses
Extracts email addresses with format validation:
- user@example.com
- first.last@subdomain.example.org
- admin+tag@example.net
File Hashes
Extracts common cryptographic hashes:
- MD5: 32 hex characters (e.g., 5d41402abc4b2a76b9719d911017c592)
- SHA1: 40 hex characters (e.g., 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12)
- SHA256: 64 hex characters
- SHA384: 96 hex characters
Useful for malware analysis and threat intelligence feeds.
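Because the hash types above are distinguished by length, a downstream script can label a hex string the same way. A minimal Python sketch; `classify_hash` is a hypothetical helper, and length alone cannot tell apart different algorithms with the same digest size.

```python
import hashlib
import string

# Digest lengths in hex characters, as listed above.
HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256", 96: "sha384"}
HEX_CHARS = set(string.hexdigits)


def classify_hash(token: str):
    """Return a hash type name based on hex length, or None."""
    if token and all(ch in HEX_CHARS for ch in token):
        return HASH_LENGTHS.get(len(token))
    return None


print(classify_hash(hashlib.md5(b"hello").hexdigest()))
print(classify_hash(hashlib.sha256(b"hello").hexdigest()))
print(classify_hash("not-a-hash"))
```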
Cryptocurrency Addresses
Extracts blockchain addresses with checksum validation:
Bitcoin (all formats):
- Legacy (P2PKH): 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa
- P2SH: 3Cbq7aT1tY8kMxWLbitaG7yT6bPbKChq64
- Bech32 (SegWit): bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq
Ethereum:
- Format: 0x5aeda56215b167893e80b4fe645ba6d5bab767de (42 chars)
- Validates EIP-55 checksum for mixed-case addresses
- Accepts all-lowercase addresses without checksum
Monero:
- Standard addresses starting with 4 or 8 (~95 characters)
- Integrated addresses (~106 characters)
Validation: All addresses are validated with cryptographic checksums:
- Bitcoin: Base58Check (double SHA256) or Bech32
- Ethereum: Keccak256-based EIP-55 checksum
- Monero: Keccak256 checksum
Useful for ransomware analysis, fraud investigation, and darknet marketplace intelligence.
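To make the Base58Check step concrete, here is a minimal Python sketch of the double-SHA256 check for legacy Bitcoin addresses. It is an illustration of the general scheme, not matchy's code; it does not cover Bech32, Ethereum, or Monero, and the function names are hypothetical.

```python
from hashlib import sha256

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_decode(text: str) -> bytes:
    """Decode Base58 text to bytes; raises ValueError on invalid characters."""
    number = 0
    for ch in text:
        number = number * 58 + BASE58_ALPHABET.index(ch)
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    # Each leading '1' encodes one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + raw


def is_valid_base58check(address: str) -> bool:
    """Check the 4-byte double-SHA256 checksum of a 25-byte address."""
    try:
        data = base58_decode(address)
    except ValueError:
        return False
    if len(data) != 25:  # 1 version byte + 20-byte hash + 4 checksum bytes
        return False
    payload, checksum = data[:-4], data[-4:]
    return sha256(sha256(payload).digest()).digest()[:4] == checksum


print(is_valid_base58check("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"))
print(is_valid_base58check("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNb"))
```

The second address differs from the first only in its last character, so its checksum no longer matches; this is how a checksum filters out random Base58-looking strings.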
Performance
Typical throughput: 200-500 MB/s on modern hardware.
Performance factors:
- Extraction types: Fewer types = faster (skip unnecessary checks)
- Word boundaries: Enabled (default) = faster (reduces false matches)
- Unique mode: Enabled = slower (hash set overhead for deduplication)
- Output format: Text = fastest, JSON = moderate, CSV = moderate
Exit Status
- 0 - Success (even if no patterns found)
- 1 - Error (file not found, invalid arguments, etc.)
See Also
- matchy match - Match extracted patterns against database
- matchy build - Build database from extracted patterns
- Pattern Extraction Guide - Detailed extraction documentation