Non-blocking streaming XML parser and serializer for event-driven I/O
This is Gonzalez, a data-driven XML parser and serializer using non-blocking, event-driven I/O. Unlike traditional SAX parsers that pull data from an InputSource, Gonzalez is completely feedforward: you push data to it as it arrives, and it produces SAX events without ever blocking for I/O, as long as the documents are standalone. The XMLWriter provides the inverse: streaming XML serialization to NIO channels.
- non-blocking: in streaming mode, will only ever block for external entities
- data-driven: processes whatever is available, state machine driven
- SAX-compatible: generates standard ContentHandler events and has a SAX2 convenience front end (which is blocking, of course)
- JAXP integration: can be used as a drop-in SAX parser via SAXParserFactory
- JPMS module: proper Java module (org.bluezoo.gonzalez) for Java 9+
- lazy DTD parsing: DTD parser loaded only when a DOCTYPE is encountered
- memory efficient: streaming architecture handles documents of any size
- Java NIO throughput: uses ByteBuffer for content I/O operations
- SAX2 native: implements ContentHandler, LexicalHandler, DTDHandler, DeclHandler
- NIO-first: writes to WritableByteChannel with automatic buffering
- SAX pipeline ready: wire directly as parser event sink, no adapter needed
- namespace-aware: full support for prefixed and default namespaces via startPrefixMapping
- DOCTYPE output: supports inline DTD declarations with standalone conversion mode
- pretty-print: optional indentation via IndentConfig
- empty element optimization: automatically emits <foo/> instead of <foo></foo>
- proper escaping: handles special characters in content and attributes
- configurable encoding: UTF-8 (default), ISO-8859-1, US-ASCII, or any Java charset
Traditional SAX parsers use a pull model where the parser controls the flow of
data via InputSource. This requires blocking I/O and makes them incompatible
with asynchronous frameworks. Gonzalez inverts this by using a push model where
data is fed to the parser via receive(ByteBuffer).
The parser is a state machine that processes whatever data is available and buffers incomplete tokens. When the end of available data is reached, it returns control to the caller, who can then feed more data when it arrives. This makes Gonzalez ideal for integration with async I/O frameworks like the Gumdrop multiserver.
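As a rough illustration of the push model described above, the following sketch consumes whatever characters it is given, emits complete tokens, and keeps an incomplete token buffered until a later call completes it. It is not Gonzalez's tokenizer: the class PushScannerSketch and its methods are hypothetical, and it recognises only a toy grammar of <...> markup and plain text.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy scanner illustrating push-style, chunk-at-a-time input. */
final class PushScannerSketch {
    private final StringBuilder pending = new StringBuilder(); // incomplete token
    private final List<String> tokens = new ArrayList<>();

    /** Consumes everything in the chunk, then returns control to the caller. */
    void receive(CharBuffer chunk) {
        while (chunk.hasRemaining()) {
            char c = chunk.get();
            if (c == '<') {
                flushText();              // text before markup is now complete
                pending.append(c);
            } else if (c == '>' && inMarkup()) {
                pending.append(c);        // markup token is complete
                tokens.add(pending.toString());
                pending.setLength(0);
            } else {
                pending.append(c);
            }
        }
        // Whatever remains in 'pending' waits for the next receive() call.
    }

    /** Signals end of input: trailing text is emitted, dangling markup is an error. */
    void close() {
        if (inMarkup()) {
            throw new IllegalStateException("unterminated markup: " + pending);
        }
        flushText();
    }

    List<String> tokens() {
        return tokens;
    }

    private boolean inMarkup() {
        return pending.length() > 0 && pending.charAt(0) == '<';
    }

    private void flushText() {
        if (pending.length() > 0) {
            tokens.add(pending.toString());
            pending.setLength(0);
        }
    }
}
```

Feeding the chunks "<a>he", "llo</" and "a>" in turn yields the same three tokens as feeding the whole string at once, which is the property that lets the caller deliver data in whatever pieces the network provides.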
The following diagram illustrates how Gonzalez integrates into an NIO server pipeline like Gumdrop or Netty, receiving ByteBuffer chunks from network channels, SSL unwrap operations, or other sources, and transforming XML data in real time into SAX events without blocking or loading the entire document into memory:
Gonzalez treats the source document itself as an external entity, along with any external entities referenced inside it. The job of the external entity decoder (EED) is to convert byte data into character data: it uses any BOM in the byte data, together with the XML declaration and text declarations (if any), to determine which character set decoding to apply via NIO.
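The BOM part of that decision can be sketched with plain NIO. This is only an illustration under simplifying assumptions, not the decoder Gonzalez ships: BomSniffSketch is a hypothetical name, UTF-32 BOMs are ignored, and the encoding pseudo-attribute of the XML declaration is not examined.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: maps a leading byte order mark to a charset. */
final class BomSniffSketch {

    /**
     * Returns the charset implied by a BOM at the buffer's position and
     * advances past it, or returns null (position unchanged) if there is none.
     * A real streaming decoder would also have to wait for more input when
     * fewer bytes than a full BOM have arrived.
     */
    static Charset sniff(ByteBuffer buf) {
        int p = buf.position();
        int n = buf.remaining();
        if (n >= 3
                && (buf.get(p) & 0xFF) == 0xEF
                && (buf.get(p + 1) & 0xFF) == 0xBB
                && (buf.get(p + 2) & 0xFF) == 0xBF) {
            buf.position(p + 3);
            return StandardCharsets.UTF_8;
        }
        if (n >= 2) {
            int b0 = buf.get(p) & 0xFF;
            int b1 = buf.get(p + 1) & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                buf.position(p + 2);
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                buf.position(p + 2);
                return StandardCharsets.UTF_16LE;
            }
        }
        return null; // no BOM: fall back to the XML declaration or UTF-8
    }
}
```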
The tokenizer processes the character data to produce token events, using a state trie to decide predictably on token types and boundaries. When the tokenizer emits tokens, it passes them to a token consumer via the latter's receive(Token, CharBuffer) interface, so character data associated with the token is exposed via the NIO CharBuffer interface.
The content parser is a type of token consumer which handles the main XML content grammar: elements, attributes, PI, comments etc. It reports events to the configured SAX ContentHandler.
Most XML documents in practice are standalone documents without DTDs. To minimize memory overhead for the common case, the DTD parser is a separate component that is loaded only when a DOCTYPE declaration is encountered.
The DTD parser uses the same token consumer interface as the content parser.
It receives DTD content via receive(Token, CharBuffer) and reports
declarations through the standard SAX interfaces: DTDHandler for notations
and unparsed entities, DeclHandler for element and attribute declarations,
and LexicalHandler for comments and entity boundaries.
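For readers unfamiliar with how these handlers are registered, here is a minimal sketch using only standard SAX2 calls. It runs against whatever parser SAXParserFactory.newInstance() returns (the JDK default unless configured otherwise); it assumes, but does not verify, that the same wiring applies to a Gonzalez reader. The class DeclarationCollector is a hypothetical name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical handler that records DTD-related SAX events. */
final class DeclarationCollector extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override
    public void elementDecl(String name, String model) {          // DeclHandler
        events.add("element " + name);
    }

    @Override
    public void attributeDecl(String eName, String aName, String type,
                              String mode, String value) {        // DeclHandler
        events.add("attribute " + eName + "/" + aName);
    }

    @Override
    public void notationDecl(String name, String publicId, String systemId) { // DTDHandler
        events.add("notation " + name);
    }

    @Override
    public void comment(char[] ch, int start, int length) {       // LexicalHandler
        events.add("comment");
    }

    /** Parses the document and returns the recorded events. */
    static List<String> collect(String xml) {
        try {
            DeclarationCollector h = new DeclarationCollector();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(h);
            reader.setDTDHandler(h);
            // DeclHandler and LexicalHandler are SAX2 extensions set via properties.
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", h);
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", h);
            reader.parse(new InputSource(new StringReader(xml)));
            return h.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```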
Gonzalez uses the standard SAX2 EntityResolver and EntityResolver2 mechanisms to resolve external entities in a blocking fashion. Therefore, if you process XML streams with external entities (including the document type declaration), this will inevitably affect the performance and reliability of the parsing process. There is no clean solution to this: fetching the entity with asynchronous non-blocking I/O is technically feasible, but the parser still cannot proceed past the point where the external entity is needed until it has finished parsing it. All document input arriving in the meantime would therefore have to be buffered until the parser was ready to process it. Since that could easily lead to resource exhaustion, Gonzalez's design is to implement external entity resolution as a blocking process.
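One way to keep that blocking step short and predictable is to resolve entities from memory rather than from the network. The sketch below uses only the standard EntityResolver interface and is exercised against the default JAXP parser; InMemoryResolverSketch and its methods are hypothetical names, and whether this fits your deployment is an assumption to check.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical resolver that serves external entities from an in-memory map. */
final class InMemoryResolverSketch {

    static EntityResolver resolver(Map<String, String> bySystemId) {
        return (publicId, systemId) -> {
            // Unknown identifiers resolve to empty content instead of a fetch.
            String content = bySystemId.getOrDefault(systemId, "");
            InputSource source = new InputSource(new StringReader(content));
            source.setSystemId(systemId);
            return source;
        };
    }

    /** Parses the document and returns its concatenated character data. */
    static String parseText(String xml, Map<String, String> entities) {
        try {
            StringBuilder text = new StringBuilder();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(resolver(entities));
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```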
Gonzalez includes an XSLT 3.0 transformer that follows the same design
principles as the parser: event-driven processing with predictable resource
usage. The transformer integrates with standard JAXP APIs
(TransformerFactory, Templates, Transformer) and can be used as a
drop-in replacement for other XSLT processors.
Key features:
- JAXP compliant - Standard TransformerFactory interface
- Streaming support - SAX-based transformation pipeline
- Iterative XPath parser - Pratt algorithm with explicit stacks, no recursion
- Forward-compatible - Graceful handling of later XSLT version constructs
See XSLT.md for detailed documentation on the transformer design, features, test methodology, and performance characteristics.
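Because the transformer is reached through the standard JAXP types, calling code does not need to know which processor is behind the factory. The sketch below compiles a stylesheet once into a Templates object and runs it; it takes any TransformerFactory as a parameter, so it can be tried with the JDK default. Passing a GonzalezTransformerFactory instead is presumably all that is needed, but that is an assumption, and JaxpTransformSketch is a hypothetical name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical example of processor-independent JAXP usage. */
final class JaxpTransformSketch {

    // Counts <item> elements and writes the number as plain text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:value-of select='count(//item)'/></xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(TransformerFactory factory, String xml) {
        try {
            // Templates is the compiled, reusable form of the stylesheet.
            Templates templates =
                factory.newTemplates(new StreamSource(new StringReader(STYLESHEET)));
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```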
Parser parser = new Parser();
// Set handlers
parser.setContentHandler(myContentHandler);
// Set document identifiers (optional, for Locator reporting)
parser.setSystemId("http://example.com/document.xml");
parser.setPublicId("-//Example//DTD Example 1.0//EN");
// Feed data as it arrives. Do this instead of parse() for non-blocking
// behaviour
parser.receive(byteBuffer1);
parser.receive(byteBuffer2);
// ...
parser.close(); // Signal end of document

Gonzalez can also be used via the standard JAXP SAXParserFactory mechanism, making it a drop-in replacement for other SAX parsers:
// Explicit factory selection
SAXParserFactory factory = SAXParserFactory.newInstance(
"org.bluezoo.gonzalez.GonzalezSAXParserFactory", null);
factory.setNamespaceAware(true);
SAXParser parser = factory.newSAXParser();
parser.parse(inputStream, myHandler);

Or set Gonzalez as the default SAX parser via system property:
System.setProperty("javax.xml.parsers.SAXParserFactory",
"org.bluezoo.gonzalez.GonzalezSAXParserFactory");
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
parser.parse(inputStream, myHandler);

The parser implements org.xml.sax.ext.Locator2 to provide accurate position
information during parsing. The locator is automatically passed to the
ContentHandler via setDocumentLocator() before startDocument() is called.
The locator tracks:
- Line number (starting at 1)
- Column number (starting at 0)
- System identifier (URI)
- Public identifier
- Character encoding
- XML version
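A handler picks up the locator by overriding setDocumentLocator() and querying it inside later callbacks. The following sketch records the line number at each start tag; it uses only standard SAX types and is run here against the default JAXP parser, so only line numbers are recorded (column conventions differ between parsers). LineTrackingHandler is a hypothetical name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that notes the line on which each element starts. */
final class LineTrackingHandler extends DefaultHandler {
    private Locator locator;
    final List<String> seen = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before startDocument(); keep the reference.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int line = (locator != null) ? locator.getLineNumber() : -1;
        seen.add(qName + "@" + line);
    }

    /** Parses the document and returns entries of the form name@line. */
    static List<String> parse(String xml) {
        try {
            LineTrackingHandler handler = new LineTrackingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```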
Since Gonzalez is data-driven, you must explicitly set the system and public identifiers if you want them reported:
parser.setSystemId("http://example.com/document.xml");
parser.setPublicId("-//Example//DTD Example 1.0//EN");

Line endings are handled according to XML 1.0 specification section 2.11:
- LF (U+000A)
- CR (U+000D)
- CRLF (CR followed by LF, treated as single line ending)
The parser correctly handles CRLF sequences even when split across receive()
boundaries.
XML 1.1 line-end delimiters are also supported in XML 1.1 documents.
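The split-CRLF case comes down to remembering, between calls, whether the last character seen was a CR. The sketch below shows that idea for the three XML 1.0 line endings only; it is an illustration rather than Gonzalez's code, and LineCounterSketch is a hypothetical name.

```java
import java.nio.CharBuffer;

/** Hypothetical line counter that survives CRLF split across chunks. */
final class LineCounterSketch {
    private int line = 1;
    private boolean lastWasCr; // state carried across receive() calls

    void receive(CharBuffer chunk) {
        while (chunk.hasRemaining()) {
            char c = chunk.get();
            if (c == '\r') {
                line++;               // CR always ends a line
                lastWasCr = true;
            } else if (c == '\n') {
                if (!lastWasCr) {
                    line++;           // lone LF ends a line
                }                     // LF right after CR: same line ending
                lastWasCr = false;
            } else {
                lastWasCr = false;
            }
        }
    }

    int line() {
        return line;
    }
}
```

With the chunks "a\r", "\nb\n" and "c\rd", the CRLF pair straddles the first boundary yet is counted once, so the counter ends on line 4.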
The XMLWriter implements SAX2 interfaces directly, so it can be wired into a SAX pipeline as an event sink with no adapter:
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.WRITE;
import java.nio.channels.FileChannel;
import org.bluezoo.gonzalez.Parser;
import org.bluezoo.gonzalez.XMLWriter;

// Parse and serialize: wire parser directly to writer
Parser parser = new Parser();
FileChannel channel = FileChannel.open(path, WRITE, CREATE);
XMLWriter writer = new XMLWriter(channel);
writer.setIndentConfig(IndentConfig.spaces2());
// The writer is both the content handler and the lexical handler
parser.setContentHandler(writer);
parser.setProperty("http://xml.org/sax/properties/lexical-handler", writer);
parser.parse(inputSource);
writer.close();

For standalone use, call SAX methods directly:
import java.io.FileOutputStream;
import org.bluezoo.gonzalez.XMLWriter;
import org.xml.sax.helpers.AttributesImpl;

try (FileOutputStream fos = new FileOutputStream("output.xml")) {
    XMLWriter writer = new XMLWriter(fos);
    writer.setIndentConfig(IndentConfig.spaces2());

    // Attributes for the root element
    AttributesImpl atts = new AttributesImpl();
    atts.addAttribute("", "version", "version", "CDATA", "1.0");

    writer.startDocument();
    // Declare the default namespace before the element that uses it
    writer.startPrefixMapping("", "http://example.com/ns");
    writer.startElement("http://example.com/ns", "root", "root", atts);

    AttributesImpl itemAtts = new AttributesImpl();
    itemAtts.addAttribute("", "id", "id", "CDATA", "1");
    writer.startElement("http://example.com/ns", "item", "item", itemAtts);
    char[] text = "Hello, World!".toCharArray();
    writer.characters(text, 0, text.length);
    writer.endElement("http://example.com/ns", "item", "item");

    // An element with no content is written as <empty/>
    writer.startElement("http://example.com/ns", "empty", "empty", new AttributesImpl());
    writer.endElement("http://example.com/ns", "empty", "empty");

    writer.endElement("http://example.com/ns", "root", "root");
    writer.endPrefixMapping("");
    writer.endDocument();
    writer.close();
}

Output:
<root xmlns="http://example.com/ns" version="1.0">
  <item id="1">Hello, World!</item>
  <empty/>
</root>

Gonzalez is secure by default. External entity processing and external DTD loading are disabled out of the box, protecting against XXE (XML External Entity) attacks, SSRF, and entity expansion bombs.
| Setting | Default | Effect |
|---|---|---|
| external-general-entities | false | External general entities are not resolved |
| external-parameter-entities | false | External DTD subsets are not loaded |
| accessExternalDTD | "" (none) | No protocols allowed for external DTD access |
| Entity expansion limit | 64,000 | Protects against billion-laughs entity bombs |
If you need to process documents with external DTDs or entity references, you must explicitly opt in to both the feature flags and the allowed protocols:
Parser parser = new Parser();
// Enable external entity processing
parser.setFeature("http://xml.org/sax/features/external-general-entities", true);
parser.setFeature("http://xml.org/sax/features/external-parameter-entities", true);
// Allow specific protocols (required even when entities are enabled)
parser.setProperty("http://javax.xml.XMLConstants/property/accessExternalDTD", "file");

The accessExternalDTD property accepts:
- "" -- no protocols allowed (default)
- "all" -- all protocols allowed
- Comma-separated list -- e.g. "file", "file,http,https"
Gonzalez recognizes the standard JAXP FEATURE_SECURE_PROCESSING feature.
Setting it to false enables external entities and sets accessExternalDTD
to "all":
// Via SAXParserFactory
SAXParserFactory factory = new GonzalezSAXParserFactory();
factory.setFeature("http://javax.xml.XMLConstants/feature/secure-processing", false);
SAXParser parser = factory.newSAXParser();

The transformer factory also enforces secure defaults. To allow stylesheet and DTD loading from files:
GonzalezTransformerFactory factory = new GonzalezTransformerFactory();
factory.setAttribute("http://javax.xml.XMLConstants/property/accessExternalStylesheet", "file");
factory.setAttribute("http://javax.xml.XMLConstants/property/accessExternalDTD", "file");

The default limit of 64,000 entity expansions matches the JDK default and protects against entity expansion attacks (billion laughs). To adjust:
parser.setProperty("http://www.nongnu.org/gonzalez/properties/entity-expansion-limit", 100000);

Set to 0 to disable the limit (not recommended for untrusted input).
Use Ant to build the project:

ant dist

This produces three jar files in the dist directory:
| JAR | Contents | Dependencies |
|---|---|---|
| gonzalez-core-1.2.jar | Parser, XMLWriter, DTD (~230 KB) | None |
| gonzalez-xslt-1.2.jar | XSLT transformer, schema (~1.7 MB) | gonzalez-core, jsonparser |
| gonzalez-1.2.jar | Combined (fat jar) (~1.9 MB) | jsonparser |
When to use which:
- Core only: use gonzalez-core if you only need XML parsing/serialization (e.g. Gumdrop, Netty pipelines). Zero external dependencies.
- XSLT: use gonzalez-xslt (plus gonzalez-core and jsonparser) for XSLT transformation.
- Fat jar: use gonzalez-1.2.jar for backward compatibility or when you want everything in one artifact.
Download from GitHub Releases:
The build downloads the jsonparser library automatically (see Dependencies below).
The architecture has been designed to support the full XML specification, including:
- XML declaration with all charsets supported by Java
- DOCTYPE with internal and external subsets
- Elements with attributes
- Character data and CDATA sections
- Entity references (character, internal, external)
- Processing instructions
- Comments
- Namespaces
- Validation using DTD
- XML 1.1
Benchmark results comparing Gonzalez with the JDK's bundled Xerces SAX parser (Java 21, measured with JMH):
| Document Size | Gonzalez | Xerces | Comparison |
|---|---|---|---|
| Small plain (1.3KB) | 9.1 µs | 11.2 µs | Gonzalez 1.23x faster |
| Small NS-heavy (2.9KB) | 18.3 µs | 18.6 µs | Essentially tied |
| Large plain (1MB) | 4026 µs | 1755 µs | Xerces 2.3x faster |
| Large NS-heavy (1.4MB) | 9876 µs | 4408 µs | Xerces 2.2x faster |
For small documents, Gonzalez can slightly outperform Xerces due to its lightweight architecture and efficient NIO buffer handling.
For larger documents, Xerces is faster due to decades of micro-optimisation in its char[] processing and custom UTF-8 decoder. However, Gonzalez's streaming architecture provides benefits that don't show in synthetic benchmarks:
- Non-blocking I/O: Your thread is never waiting on data from a network connection, allowing integration with async frameworks like Netty or Gumdrop
- Memory efficiency: Documents are processed in chunks without loading the entire file into memory
Xerces and other blocking I/O based parsers require you either to wait for the entire document to arrive, or to allocate another thread specifically for that parse.
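In a selector-driven server, the usual pattern is a callback that drains whatever the channel has ready, hands each chunk to the consumer, and returns as soon as no more data is available. The sketch below shows that loop with a generic Consumer<ByteBuffer> standing in for a call such as parser::receive. It assumes the consumer fully consumes each buffer before returning, which is an assumption about the sink and not a documented guarantee; FeedLoopSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Hypothetical read callback feeding a push-style consumer. */
final class FeedLoopSketch {

    /**
     * Drains the data currently readable from the channel into the sink.
     * Returns true at end of stream; false means "nothing more for now"
     * (read() returned 0 on a non-blocking channel), so the caller simply
     * waits for the next readiness event.
     */
    static boolean drain(ReadableByteChannel in, ByteBuffer buf,
                         Consumer<ByteBuffer> sink) throws IOException {
        int n;
        while ((n = in.read(buf)) > 0) {
            buf.flip();        // switch from filling to draining
            sink.accept(buf);  // assumed to consume the whole chunk
            buf.clear();       // ready for the next read
        }
        return n == -1;
    }

    /** Demonstration: pushes text through a small buffer and reassembles it. */
    static String roundTrip(String text, int bufferSize) {
        ByteArrayOutputStream seen = new ByteArrayOutputStream();
        Consumer<ByteBuffer> sink = b -> {
            while (b.hasRemaining()) {
                seen.write(b.get());
            }
        };
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(bytes))) {
            drain(ch, ByteBuffer.allocate(bufferSize), sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(seen.toByteArray(), StandardCharsets.UTF_8);
    }
}
```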
Gonzalez has been tested with the W3C Conformance test suite xmlconf and
achieves 100% conformance with that suite. The test data in xmlconf/ is from
the 20130923 release, the latest and final
official release.
- Java 8 or later (including SAX API)
- gonzalez-core: zero external dependencies.
- gonzalez-xslt: requires gonzalez-core and jsonparser (org.bluezoo.json). The jsonparser library is used for XML-to-JSON and JSON-to-XML translation (xml-to-json, json-to-xml, parse-json). The build downloads jsonparser-1.2.jar from GitHub Releases via the resolve-deps target.
For Java 9+, Gonzalez provides JPMS modules:
// Parser/writer only (zero dependencies)
module myapp {
    requires org.bluezoo.gonzalez;
}

// XSLT support (requires core + jsonparser)
module myapp {
    requires org.bluezoo.gonzalez;
    requires org.bluezoo.gonzalez.xslt;
    requires org.bluezoo.json;
}

Gonzalez is licensed under the GNU Lesser General Public License version 3.
See COPYING for full terms.
-- Chris Burdess