Context
StringyMcStringFace is a binary string extraction and analysis tool that extracts meaningful strings from executable files (PE, ELF, Mach-O) with semantic classification, tagging, and scoring. The tool requires multiple output formats to serve different use cases: human-readable for interactive analysis, JSONL for automation and data pipelines, and YARA for security rule generation.
This issue focuses on implementing the JSONL (JSON Lines) output formatter, which provides machine-readable structured output where each line is a complete JSON object representing a FoundString. This format is ideal for:
- Pipeline integration and streaming processing
- Database ingestion and batch analysis
- Automated security tooling and SIEM integration
- Post-processing with jq, Python, or other tools
Problem Statement
The tool currently has a placeholder src/output/mod.rs but no concrete formatter implementations. This issue implements the JSONL formatter as part of the output formatting system defined in issue #25.
Data Structure
The JSONL formatter serializes FoundString instances with all fields:
pub struct FoundString {
    pub text: String,            // The extracted string
    pub encoding: Encoding,      // Ascii, Utf8, Utf16Le, Utf16Be
    pub offset: u64,             // File offset
    pub rva: Option<u64>,        // Relative Virtual Address (if available)
    pub section: Option<String>, // Section name (.text, .rdata, etc.)
    pub length: u32,             // Length in bytes
    pub tags: Vec<Tag>,          // Semantic tags (Url, FilePath, Guid, etc.)
    pub score: i32,              // Relevance score for ranking
    pub source: StringSource,    // SectionData, ImportName, ExportName, etc.
}
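Because length is documented as a byte count, a UTF-16 string occupies two bytes per code unit and its length differs from its character count. A minimal sketch of that arithmetic, assuming the stored length excludes any null terminator (the helper name utf16_byte_len is hypothetical and not part of the codebase):

```rust
/// Byte length of `text` when stored as UTF-16 (LE or BE).
/// Assumption: the terminator, if any, is not counted.
pub fn utf16_byte_len(text: &str) -> u32 {
    // encode_utf16 yields code units, so characters outside the BMP
    // (surrogate pairs) correctly count as two units, i.e. four bytes.
    (text.encode_utf16().count() * 2) as u32
}
```

For example, the 11-character ASCII text "wide string" would be recorded with a length of 22 when its encoding is Utf16Le.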
Requirements
- Requirement 6.1: Implement output formatting framework with trait-based architecture
- Requirement 6.4: Support machine-readable JSON Lines format for automation
Proposed Solution
1. File Creation
Create src/output/json.rs implementing the JSONL formatter.
2. Implementation Approach
use crate::types::FoundString;
/// Serializes FoundString records as JSON Lines.
#[derive(Debug, Default)]
pub struct JsonFormatter;
impl JsonFormatter {
    pub fn new() -> Self {
        Self
    }
    /// Format strings as JSON Lines (one JSON object per line)
    pub fn format(&self, strings: &[FoundString]) -> crate::Result<String> {
        let mut output = String::new();
        for found_string in strings {
            let json = serde_json::to_string(found_string)?;
            output.push_str(&json);
            output.push('\n');
        }
        Ok(output)
    }
    /// Format a single string for streaming output (includes the trailing newline)
    pub fn format_one(&self, found_string: &FoundString) -> crate::Result<String> {
        let json = serde_json::to_string(found_string)?;
        Ok(format!("{}\n", json))
    }
}
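The format method above builds the whole output in memory, which may be costly for large result sets. One possible streaming variant writes each record to a Write sink as soon as it is serialized. The sketch below uses only the standard library: the function name write_jsonl and the to_json parameter are hypothetical, and in the real formatter the serializer would presumably be serde_json::to_string (which is fallible, unlike the closure assumed here).

```rust
use std::io::{self, BufWriter, Write};

/// Writes one line per record to `sink` and returns the number of lines written.
/// `to_json` stands in for the real serializer so the sketch needs no external crates.
pub fn write_jsonl<T, W, F>(sink: W, records: &[T], to_json: F) -> io::Result<usize>
where
    W: Write,
    F: Fn(&T) -> String,
{
    // Buffer small writes so each record does not trigger its own syscall.
    let mut out = BufWriter::new(sink);
    let mut written = 0;
    for record in records {
        out.write_all(to_json(record).as_bytes())?;
        out.write_all(b"\n")?;
        written += 1;
    }
    // Flush explicitly so I/O errors surface here instead of being lost on drop.
    out.flush()?;
    Ok(written)
}
```

Passing std::io::stdout().lock() as the sink would stream records straight to a pipeline, while a Vec<u8> sink keeps the function easy to unit test.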
3. Key Design Decisions
JSON Lines Format
- One object per line: Each line is a complete, valid JSON object
- No pretty-printing: Compact format for efficient parsing and storage
- UTF-8 encoding: Standard for JSON
- No array wrapper: Unlike standard JSON arrays, JSONL has no [] wrapper
Field Serialization
- All fields included: Complete FoundString data in every record
- Null handling: Option<T> fields serialize as null when absent
- Enum serialization: Leverage existing serde derives on Encoding, Tag, StringSource
- String escaping: serde_json handles special characters automatically
Example Output
{"text":"kernel32.dll","encoding":"Ascii","offset":4096,"rva":8192,"section":".idata","length":12,"tags":["ImportName"],"score":95,"source":"ImportName"}
{"text":"https://api.example.com/v1","encoding":"Utf8","offset":16384,"rva":20480,"section":".rdata","length":26,"tags":["Url","Domain"],"score":88,"source":"SectionData"}
{"text":"C:\\Windows\\System32\\config","encoding":"Utf16Le","offset":32768,"rva":null,"section":null,"length":52,"tags":["FilePath"],"score":72,"source":"SectionData"}
4. Integration with Framework
Once issue #25 (Output Formatter Framework) is complete, this implementation should:
- Implement the Formatter trait defined in src/output/mod.rs
- Integrate with OutputConfig for filtering options
- Support streaming output for large result sets
- Be selectable via CLI --json or --format json flags
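How the two flag spellings map to one formatter choice is left to the CLI layer. A minimal sketch of one way to resolve them, using only the standard library: the OutputFormat enum, the resolve_format function, the "human" and "yara" values, the "jsonl" alias, the human-readable default, and the rule that --json wins over --format are all assumptions for illustration, not behavior defined by this issue or by issue #25.

```rust
use std::str::FromStr;

/// Hypothetical set of output formats; only the `json` value is named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Human,
    Json,
    Yara,
}

impl FromStr for OutputFormat {
    type Err = String;

    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value.to_ascii_lowercase().as_str() {
            "human" => Ok(Self::Human),
            "json" | "jsonl" => Ok(Self::Json),
            "yara" => Ok(Self::Yara),
            other => Err(format!("unknown output format: {other}")),
        }
    }
}

/// Resolves `--json` and `--format <value>` into a single format.
/// Assumption: `--json` takes precedence, and the default is human-readable.
pub fn resolve_format(
    json_flag: bool,
    format_value: Option<&str>,
) -> Result<OutputFormat, String> {
    if json_flag {
        return Ok(OutputFormat::Json);
    }
    match format_value {
        Some(value) => value.parse(),
        None => Ok(OutputFormat::Human),
    }
}
```

Keeping the parsing in a FromStr impl means an argument parser can reuse it directly, and unknown values fail early with a readable message.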
5. Error Handling
use thiserror::Error;
#[derive(Error, Debug)]
pub enum JsonFormatterError {
    #[error("Failed to serialize string: {0}")]
    SerializationError(#[from] serde_json::Error),
    #[error("I/O error: {0}")]
    IoError(#[from] std::io::Error),
}
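The #[error] and #[from] attributes above derive Display, Error, and From implementations. To make that concrete, here is a hand-written, standard-library-only equivalent for the I/O variant alone. The names SketchError and write_line are hypothetical, and the serde_json variant is omitted so the sketch compiles without external crates.

```rust
use std::fmt;
use std::io;

/// Hand-written stand-in for the I/O variant of the derived error type.
#[derive(Debug)]
pub enum SketchError {
    Io(io::Error),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Mirrors the "I/O error: {0}" message template.
            Self::Io(err) => write!(f, "I/O error: {err}"),
        }
    }
}

impl std::error::Error for SketchError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Self::Io(err) => Some(err),
        }
    }
}

impl From<io::Error> for SketchError {
    fn from(err: io::Error) -> Self {
        Self::Io(err)
    }
}

/// The `?` operator converts `io::Error` into `SketchError` through the `From` impl.
pub fn write_line<W: io::Write>(mut sink: W, line: &str) -> Result<(), SketchError> {
    sink.write_all(line.as_bytes())?;
    sink.write_all(b"\n")?;
    Ok(())
}
```

The From impl is what lets formatter code use ? on both serialization and write calls while returning a single error type.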
6. Testing Requirements
Unit Tests
#[cfg(test)]
mod tests {
    use super::*;
    use crate::types::{Encoding, Tag, StringSource};
    #[test]
    fn test_basic_jsonl_output() {
        let strings = vec![
            FoundString {
                text: "test".to_string(),
                encoding: Encoding::Ascii,
                offset: 0,
                rva: Some(4096),
                section: Some(".text".to_string()),
                length: 4,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        // Should have one line with newline
        assert_eq!(output.lines().count(), 1);
        // Should be valid JSON
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert_eq!(parsed["text"], "test");
        assert_eq!(parsed["offset"], 0);
    }
    #[test]
    fn test_special_characters_escaping() {
        // Test strings with quotes and backslashes
        let strings = vec![
            FoundString {
                text: "path\\to\\file\"quoted\"".to_string(),
                encoding: Encoding::Ascii,
                offset: 100,
                rva: None,
                section: None,
                length: 20,
                tags: vec![Tag::FilePath],
                score: 60,
                source: StringSource::SectionData,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        // Escaping must keep the record on a single line
        assert_eq!(output.lines().count(), 1);
        // Should be valid JSON that round-trips the original text exactly
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert_eq!(parsed["text"], "path\\to\\file\"quoted\"");
    }
    #[test]
    fn test_null_optional_fields() {
        let strings = vec![
            FoundString {
                text: "test".to_string(),
                encoding: Encoding::Utf8,
                offset: 0,
                rva: None,     // Optional field
                section: None, // Optional field
                length: 4,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert!(parsed["rva"].is_null());
        assert!(parsed["section"].is_null());
    }
    #[test]
    fn test_multiple_strings() {
        let strings = vec![
            FoundString {
                text: "first".to_string(),
                encoding: Encoding::Ascii,
                offset: 0,
                rva: Some(100),
                section: Some(".text".to_string()),
                length: 5,
                tags: vec![],
                score: 50,
                source: StringSource::SectionData,
            },
            FoundString {
                text: "second".to_string(),
                encoding: Encoding::Utf8,
                offset: 100,
                rva: Some(200),
                section: Some(".data".to_string()),
                length: 6,
                tags: vec![Tag::Url],
                score: 75,
                source: StringSource::ImportName,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        // Should have two lines
        assert_eq!(output.lines().count(), 2);
        // Each line should be valid JSON
        for line in output.lines() {
            serde_json::from_str::<serde_json::Value>(line).unwrap();
        }
    }
    #[test]
    fn test_empty_collection() {
        let strings: Vec<FoundString> = vec![];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        assert_eq!(output, "");
    }
    #[test]
    fn test_utf16_encoding() {
        let strings = vec![
            FoundString {
                text: "wide string".to_string(),
                encoding: Encoding::Utf16Le,
                offset: 1000,
                rva: Some(2000),
                section: Some(".rdata".to_string()),
                length: 22, // 11 chars at 2 bytes per char
                tags: vec![],
                score: 65,
                source: StringSource::ResourceString,
            },
        ];
        let formatter = JsonFormatter::new();
        let output = formatter.format(&strings).unwrap();
        let parsed: serde_json::Value = serde_json::from_str(output.trim()).unwrap();
        assert_eq!(parsed["encoding"], "Utf16Le");
        assert_eq!(parsed["length"], 22);
    }
}
Integration Tests
- Test with real FoundString collections from binary analysis
- Verify output can be parsed by jq and other JSON tools
- Test large collections (10k+ strings) for performance
- Verify streaming output for memory efficiency
7. Documentation Requirements
- Inline documentation for public functions
- Examples in doc comments showing usage
- Reference to JSON Lines specification: https://jsonlines.org/
- CLI usage examples in docs/src/output-formats.md
Edge Cases to Handle
- Very long strings: No truncation in JSONL (unlike human format)
- Binary/invalid UTF-8: Not a formatter concern, because the String type in FoundString guarantees valid UTF-8
- Empty string text: Should serialize as {"text":"", ...}
- Zero-length collections: Should produce empty output (no lines)
- Unicode characters: serde_json handles UTF-8 automatically
- Control characters: serde_json escapes appropriately (\n, \r, \t)
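Control-character escaping matters for JSONL in particular: a raw newline inside a text value would split one record across two lines. The function below is an illustration of the JSON string escaping rules (RFC 8259, section 7) written against the standard library only. It is not serde_json's implementation, the name escape_json_string is hypothetical, and the exact escape form chosen for some control characters may differ from what serde_json emits.

```rust
/// Returns `input` as a quoted JSON string literal.
pub fn escape_json_string(input: &str) -> String {
    let mut out = String::with_capacity(input.len() + 2);
    out.push('"');
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be escaped; use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            // Everything else, including non-ASCII, is valid unescaped in UTF-8 JSON.
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

After escaping, the literal never contains a raw line break, which is the property a JSONL consumer relies on when it splits input on newlines.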
Estimated Effort
Low-Medium complexity - Straightforward serialization using existing serde infrastructure. Primary work is comprehensive testing and edge case handling. Estimated 4-6 hours of development time.
Example CLI Usage (Post-Integration)
# Basic JSONL output
stringy --json malware.exe
# Save to file
stringy --json binary.elf > strings.jsonl
# Pipeline with jq (each input line is its own object, so no .[] is needed)
stringy --json app.exe | jq 'select(.score > 80)'
# Filter URLs
stringy --json binary | jq 'select(.tags | contains(["Url"]))'
# Count by section
stringy --json binary | jq -r '.section' | sort | uniq -c
Acceptance Criteria
- src/output/json.rs created with JsonFormatter struct
- format() method serializes Vec<FoundString> to JSONL format
- format_one() method for streaming single strings
- FoundString fields serialized correctly
- String escaping handled by serde_json
- Optional fields (rva, section) serialize as null when absent
- Implements the Formatter trait (once issue #25, "Implement Output Formatter Framework with Trait-Based Architecture", is complete)
- Output can be parsed by jq and standard JSON parsers
Dependencies
- serde_json (already in Cargo.toml)
- serde derives on FoundString, Encoding, Tag, StringSource (already implemented)
References
- docs/src/architecture.md (lines 289-294)
- docs/src/output-formats.md
- src/types.rs (line 144: FoundString definition)
- stringy-analyzer/jsonl-output-format