Moxml: Modern XML processing for Ruby

Introduction and purpose

Moxml provides a unified, modern XML processing interface for Ruby applications. It offers a consistent API that abstracts away the underlying XML implementation details while maintaining high performance through efficient node mapping and native XPath querying.

Key features:

  • Intuitive, Ruby-idiomatic API for XML manipulation

  • Consistent interface across different XML libraries

  • Efficient node mapping for XPath queries

  • Support for all XML node types and features

  • Easy switching between XML processing engines

  • Clean separation between interface and implementation

Supported XML libraries

General

Moxml supports the following XML libraries:

REXML

REXML, a pure Ruby XML parser distributed with standard Ruby. Not the fastest, but always available.

Nokogiri

(default) Nokogiri, a widely used implementation which wraps around the performant libxml2 C library.

Oga

Oga, a pure Ruby XML parser. Recommended when you need a pure Ruby solution, for example for Opal.

Ox

Ox, a fast XML parser.

LibXML

libxml-ruby, Ruby bindings for the performant libxml2 C library. Alternative to Nokogiri with similar performance characteristics.

Feature table

Moxml makes a best effort to provide a consistent interface across basic XML features, but the underlying XML libraries differ in features and capabilities.

The following table summarizes the features supported by each library.

Note
The checkmarks indicate support for the feature, while the footnotes provide additional context for specific features.
| Feature | Nokogiri | Oga | REXML | LibXML | Ox | HeadedOx |
|---|---|---|---|---|---|---|
| Parsing, serializing | | | | | | |
| SAX parsing | ✅ Full (10/10 events) | ✅ Full (10/10 events) | ✅ Full (10/10 events) | ✅ Full (10/10 events) | ⚠️ Core (4/10 events). See NOTE 7. | ⚠️ Core (4/10 events). See NOTE 7. |
| Node manipulation | | | | | ✅ See NOTE 1. | ✅ See NOTE 1. |
| Basic XPath | | | | | Uses Ox-specific API locate. See NOTE 2. | ✅ Full XPath 1.0. See NOTE 3. |
| XPath with namespaces | | | | | Uses Ox-specific API locate. See NOTE 2. | ⚠️ Basic. See NOTE 3. |

Note
(NOTE 1) Ox/HeadedOx: Text node replacement may fail in some cases due to internal node structure.
Note
(NOTE 2) Ox: Limited XPath support via the locate() method. See the "Specific adapter limitations" section.
Note
(NOTE 3) HeadedOx provides full XPath 1.0 support via a pure Ruby XPath engine layered on top of Ox’s C parser. See HeadedOx documentation for details.
Note
(NOTE 7) Ox/HeadedOx SAX: Only core events are supported (start_element, end_element, characters, errors). There are no separate CDATA, comment, or processing instruction events.

Adapter comparison

Feature compatibility matrix

| Feature/Operation | Nokogiri | Oga | REXML | LibXML | Ox | HeadedOx |
|---|---|---|---|---|---|---|
| Core Operations | | | | | | |
| Parse XML string | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Parse XML file/IO | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Serialize to XML | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Element Operations | | | | | | |
| Create elements | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Get/set attributes | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Add/remove children | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Replace nodes | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ⚠️ Limited [1] | ⚠️ Limited [1] |
| Namespace Operations | | | | | | |
| Add namespaces | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Default namespaces | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ⚠️ Basic | ⚠️ Basic |
| Namespace inheritance | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ❌ None [5] |
| Namespaced attributes | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ⚠️ Limited | ⚠️ Limited [5] |
| XPath Queries | | | | | | |
| Basic paths (//element) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Attribute predicates ([@id]) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ⚠️ Existence only [2] | ✅ Full |
| Attribute values ([@id='123']) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None [3] | ✅ Full |
| Logical operators ([@a and @b]) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ✅ Full |
| Position predicates ([1], [last()]) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ✅ Full |
| Text predicates ([text()='x']) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ✅ Full |
| Namespace-aware queries | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ⚠️ Basic [5] |
| Parent axis (..) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ✅ Full |
| Sibling axes | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ❌ None [5] |
| XPath functions (count(), etc.) | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ❌ None | ✅ All 27 |
| Special Content | | | | | | |
| CDATA sections | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Comments | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Processing instructions | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| DOCTYPE declarations | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Performance | | | | | | |
| Parse speed | Fast | Fast | Medium | Fast | Very Fast | Very Fast |
| Serialize speed | Fast | Fast | Medium | Medium | Very Fast | Very Fast |
| Memory usage | Good | Medium | Medium | Good | Excellent | Excellent |
| Thread safety | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |

[1] Ox/HeadedOx: Text node replacement may fail in some cases due to internal node structure.
[2] Ox: //book[@id] works (returns all book elements), but doesn’t filter by attribute existence.
[3] HeadedOx: Full XPath 1.0 with all 27 functions and 6 axes. Pure Ruby XPath engine on Ox’s C parser. 99.20% pass rate. See docs/headed-ox.adoc.
[4] Ox: Use .find { |el| el["id"] == "123" } instead of XPath attribute value predicates.
[5] HeadedOx limitations: Namespace introspection and 7 axes not implemented. See docs/HEADED_OX_LIMITATIONS.md.

Adapter selection guide

Choose Nokogiri when:

  • You need industry-standard compatibility

  • Large community support is important

  • C extension performance is acceptable

  • Cross-platform deployment is required

Choose Oga when:

  • Pure Ruby environment is required (JRuby, TruffleRuby)

  • Best test coverage is needed (98%)

  • No C extensions are allowed

  • Memory usage is not the primary concern

Choose REXML when:

  • Standard library only (no external gems)

  • Maximum portability is required

  • Small to medium documents

  • Deployment simplicity is critical

Choose LibXML when:

  • Alternative to Nokogiri is desired

  • Full namespace support is required

  • Good performance with correctness

  • Native C extension is acceptable

Choose Ox when:

  • Maximum parsing speed is critical

  • Simple document structures (limited nesting)

  • XPath usage is minimal or absent

  • Memory efficiency is paramount

Choose HeadedOx when:

  • Need Ox’s fast parsing with full XPath support

  • Want comprehensive XPath 1.0 features (functions, predicates)

  • Prefer pure Ruby XPath implementation for debugging

  • Need more XPath capabilities than standard Ox provides

  • Memory efficiency is important but XPath features are required

Caution
Ox’s custom XPath engine supports common patterns but cannot handle complex XPath expressions. Test thoroughly if your use case requires advanced XPath.

TODO: We should throw errors when unsupported XPath features are used with Ox or HeadedOx to prevent silent failures.

Getting started

Installation

Install the gem and at least one supported XML library:

# In your Gemfile
gem 'moxml'
gem 'nokogiri'  # Or 'oga', 'rexml', 'ox', or 'libxml-ruby'
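
Because any one of the supported libraries is enough, an application may want to probe which one is installed before creating a context. The following is a minimal sketch in plain Ruby, not part of Moxml: the helper name `first_available_adapter`, the require names, and the `:rexml`, `:ox`, and `:libxml` adapter symbols are assumptions made for illustration.

```ruby
# Require names for each adapter's backing library. These are assumptions
# based on the gems' usual entry points; adjust them for your setup.
ADAPTER_LIBRARIES = {
  nokogiri: "nokogiri",
  oga: "oga",
  rexml: "rexml/document",
  ox: "ox",
  libxml: "libxml"
}.freeze

# Hypothetical helper: returns the first adapter whose library can be
# required, or nil when none of the candidates is installed.
def first_available_adapter(candidates = ADAPTER_LIBRARIES)
  candidates.each do |adapter, library|
    begin
      require library
      return adapter
    rescue LoadError
      next
    end
  end
  nil
end

adapter = first_available_adapter
puts "Selected adapter: #{adapter.inspect}"
```

The returned symbol could then be handed to `Moxml.new(adapter)`, in the same way `Moxml.new(:nokogiri)` is used elsewhere in this README.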

Basic document creation

doc = Moxml.new.create_document

# Add XML declaration
doc.add_child(doc.create_declaration("1.0", "UTF-8"))

# Create root element with namespace
root = doc.create_element('book')
root.add_namespace('dc', 'http://purl.org/dc/elements/1.1/')
doc.add_child(root)

# Add content
title = doc.create_element('dc:title')
title.text = 'XML Processing with Ruby'
root.add_child(title)

# Output formatted XML
puts doc.to_xml(indent: 2)

Real-world examples

Practical, runnable examples demonstrating Moxml usage in common scenarios are available in the examples directory.

These examples include:

RSS Parser

Parse RSS/Atom feeds with XPath queries and namespace handling

Web Scraper

Extract data from HTML/XML using DOM navigation and table parsing

API Client

Build and parse XML API requests/responses with SOAP

Each example is:

  • Fully documented with detailed README

  • Self-contained and runnable

  • Demonstrates best practices

  • Includes sample data files

  • Shows comprehensive error handling

Run any example directly:

ruby examples/rss_parser/rss_parser.rb
ruby examples/web_scraper/web_scraper.rb
ruby examples/api_client/api_client.rb

See the examples README for complete documentation and learning paths.

Working with documents

Using the builder pattern

The builder pattern provides a clean DSL for creating XML documents:

doc = Moxml::Builder.new(Moxml.new).build do
  declaration version: "1.0", encoding: "UTF-8"

  element 'library', xmlns: 'http://example.org/library' do
    element 'book' do
      element 'title' do
        text 'Ruby Programming'
      end

      element 'author' do
        text 'Jane Smith'
      end

      comment 'Publication details'
      element 'published', year: '2024'

      cdata '<custom>metadata</custom>'
    end
  end
end

Direct document manipulation

doc = Moxml.new.create_document

# Add declaration
doc.add_child(doc.create_declaration("1.0", "UTF-8"))

# Create root with namespace
root = doc.create_element('library')
root.add_namespace(nil, 'http://example.org/library')
root.add_namespace("dc", "http://purl.org/dc/elements/1.1/")
doc.add_child(root)

# Add elements with attributes
book = doc.create_element('book')
book['id'] = 'b1'
book['type'] = 'technical'
root.add_child(book)

# Add mixed content
book.add_child(doc.create_comment('Book details'))
title = doc.create_element('title')
title.text = 'Ruby Programming'
book.add_child(title)

# Add entity reference (for declared entities)
book.add_child(doc.create_entity_reference('mdash'))

Entity References

Moxml supports EntityReference nodes for preserving entity syntax in XML documents. This enables round-trip preservation of entity references like &nbsp;, &copy;, and custom entities defined in the DOCTYPE.

# Create entity reference programmatically
ref = doc.create_entity_reference('nbsp')
element.add_child(ref)

# Or using the builder pattern
doc = Moxml::Builder.new(Moxml.new).build do
  element 'text' do
    entity_reference 'ndash'
    entity_reference 'copy'
  end
end

Parsing and Round-Trip:

When parsing XML with declared entities, Moxml preserves entity references:

# Parse document with custom entity
xml = <<-XML
<!DOCTYPE root [<!ENTITY nbsp "&#160;"> ]>
<root>hello&nbsp;world</root>
XML

doc = Moxml.new(:nokogiri).parse(xml)
doc.to_xml  # => preserves &nbsp; entity reference

Adapter Notes:

  • Nokogiri: Preserves custom declared entities as EntityReference nodes

  • Ox, Oga: These adapters resolve entities during parsing and do not expose entity reference nodes. Use Nokogiri or LibXML for entity preservation.

Entity Loading Configuration:

Moxml provides configurable entity loading with four modes to balance between functionality, performance, and security:

# Required (default): Load all W3C entities (HTML + MathML + ISO entity sets)
# Raises error if entity data is unavailable
context = Moxml.new

# Optional: Load entities if available, silently skip if not
context = Moxml.new do |config|
  config.entity_load_mode = :optional
end

# Disabled: No entity loading (fastest, for controlled XML sources)
context = Moxml.new do |config|
  config.entity_load_mode = :disabled
end

# Custom: Load entities from your own source
context = Moxml.new do |config|
  config.entity_load_mode = :custom
  config.entity_provider = -> { MyEntitySource.all_entities }
end

The entity data comes from the W3C XML Core WG Character Entities specification (HTMLMathML set), bundled locally in data/w3c_entities.json for offline capability. Set the MOXML_ENTITY_DEFINITIONS_PATH environment variable to use a custom entity data source.

For backward compatibility, config.load_external_entities = false maps to :disabled mode, and config.load_external_entities = true maps to :required mode.
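
To make the four modes concrete, here is a rough sketch of how such a mode switch could behave. It is not Moxml's loader: `BUNDLED_ENTITIES`, `resolve_entities`, and `mode_for_legacy_flag` are hypothetical names, and the two-entry table merely stands in for the bundled W3C data.

```ruby
# Hypothetical stand-in for the bundled entity data (not Moxml's real table).
BUNDLED_ENTITIES = { "nbsp" => "\u00A0", "copy" => "\u00A9" }.freeze

# Resolves an entity table for the given mode.
# `bundled` is a callable returning the bundled table, or nil when unavailable.
# `provider` is a callable returning a custom table (used by :custom only).
def resolve_entities(mode, bundled: -> { BUNDLED_ENTITIES }, provider: nil)
  case mode
  when :required
    entities = bundled.call
    raise ArgumentError, "entity data is unavailable" if entities.nil?
    entities
  when :optional
    bundled.call || {}   # silently fall back to an empty table
  when :disabled
    {}                   # never load anything
  when :custom
    raise ArgumentError, "custom mode needs a provider" if provider.nil?
    provider.call
  else
    raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end

# Maps the legacy boolean flag to a mode, as described above.
def mode_for_legacy_flag(load_external_entities)
  load_external_entities ? :required : :disabled
end

puts resolve_entities(:optional, bundled: -> { nil }).inspect
puts mode_for_legacy_flag(false).inspect
```

The point of the sketch is the difference between :required and :optional: both use the bundled data when it exists, but only :required treats missing data as an error.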

Fluent interface API

Moxml provides a fluent, chainable API for improved developer experience:

element = doc.create_element('book')
  .set_attributes(id: "123", type: "technical")
  .with_namespace("dc", "http://purl.org/dc/elements/1.1/")
  .with_child(doc.create_element("title"))

For complete fluent API documentation including all chainable methods, convenience methods, and practical examples, see Working with Documents Guide.

SAX (Event-Driven) Parsing

SAX (Simple API for XML) provides memory-efficient, event-driven XML parsing for large documents.

When to use SAX:

  • Processing very large XML files (>100MB)

  • Memory-constrained environments

  • Streaming data extraction

  • Need to process data as it arrives

Quick example:

class BookExtractor < Moxml::SAX::ElementHandler
  attr_reader :books

  def initialize
    super
    @books = []
  end

  def on_start_element(name, attributes = {}, namespaces = {})
    super
    @books << { id: attributes["id"] } if name == "book"
  end
end

handler = BookExtractor.new
Moxml.new.sax_parse(xml_string, handler)
puts handler.books.inspect

For complete SAX documentation including all handler types, event methods, adapter support, and best practices, see SAX Parsing Guide.

XML objects and their methods

For complete node API reference including traversal methods, manipulation, queries, type checking, and node information, see Node API Reference.

Node identity

Moxml provides a consistent #identifier method across all node types to safely identify nodes:

element = doc.at_xpath("//book")
puts element.identifier  # => "book"

attr = element.attribute("id")
puts attr.identifier     # => "id"

The #identifier method returns the primary identifier for each node type (tag name for elements, attribute name for attributes, target for processing instructions, or nil for content nodes).

Important
Always use type-safe patterns when working with mixed node types. See the Node API Consistency Guide for complete documentation on safe coding patterns, API surface by node type, and migration guidelines.

Advanced features

XPath querying

Moxml provides efficient XPath querying with consistent node mapping:

# Find all book elements
books = doc.xpath('//book')

# Find with namespaces
titles = doc.xpath('//dc:title', 'dc' => 'http://purl.org/dc/elements/1.1/')

# Find first matching node
first_book = doc.at_xpath('//book')

Namespace handling

# Add namespace to element
element.add_namespace('dc', 'http://purl.org/dc/elements/1.1/')

# Create element in namespace
title = doc.create_element('dc:title')

For complete documentation on XPath querying, namespace handling, and accessing native implementations, see Advanced Features Guide.

Error handling

Moxml provides comprehensive error classes with enhanced context for debugging:

begin
  doc = Moxml.new.parse(xml_string, strict: true)
  results = doc.xpath("//book[@id='123']")
rescue Moxml::ParseError => e
  puts "Parse failed at line #{e.line}: #{e.message}"
rescue Moxml::XPathError => e
  puts "XPath error: #{e.expression}"
rescue Moxml::Error => e
  puts "XML processing error: #{e.message}"
end

For complete error class hierarchy, error types, best practices, and debugging techniques, see Error Handling Guide.

Configuration

Moxml can be configured globally or per instance:

# Global configuration
Moxml.configure do |config|
  config.default_adapter = :nokogiri
  config.strict = true
  config.encoding = 'UTF-8'
end

# Instance configuration
context = Moxml.new do |config|
  config.adapter = :oga
  config.strict = false
end

Namespace URI validation

Moxml validates namespace URIs against RFC 3986 by default, as required by the W3C Namespaces in XML specification.

For documents that use non-standard namespace identifiers, a lenient mode is available:

# Strict mode (default) — rejects invalid URIs per RFC 3986
context = Moxml.new do |config|
  config.namespace_uri_mode = :strict
end

# Lenient mode — accepts any string as a namespace URI
context = Moxml.new do |config|
  config.namespace_uri_mode = :lenient
end
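
For intuition about what the two modes accept, the check can be approximated with Ruby's standard `uri` library. This is a sketch under that assumption, not Moxml's actual validator, and `namespace_uri_acceptable?` is a hypothetical helper name.

```ruby
require "uri"

# Hypothetical helper: returns true when `uri` is acceptable as a namespace
# name under the given mode. Strict mode delegates to Ruby's RFC 3986 parser.
def namespace_uri_acceptable?(uri, mode: :strict)
  return true if mode == :lenient

  URI::RFC3986_PARSER.parse(uri)
  true
rescue URI::InvalidURIError
  false
end

puts namespace_uri_acceptable?("http://purl.org/dc/elements/1.1/")  # true
puts namespace_uri_acceptable?("not a uri")                         # false
puts namespace_uri_acceptable?("not a uri", mode: :lenient)         # true
```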

For all configuration options, adapter selection, serialization options, and environment-based configuration, see Configuration Guide.

Thread safety

For complete information on thread-safe patterns, context management, and concurrent processing, see the Thread Safety Guide.

Performance considerations

For detailed performance optimization strategies, memory management best practices, and efficient querying patterns, see the Performance Considerations Guide.

Best practices

For comprehensive best practices covering XPath queries, adapter selection, error handling, namespace handling, memory management, thread safety, performance optimization, and testing strategies, see Best Practices Guide.

Specific adapter limitations

Ox adapter

The Ox adapter provides maximum parsing speed but has XPath limitations.

XPath limitations:

  • No attribute value predicates: //book[@id='123'] ❌

  • No logical operators, position predicates, text predicates ❌

  • No namespace queries, parent axis, sibling axes ❌

  • No XPath functions ❌

Workaround: Use Ruby enumerable methods:

# Instead of: doc.xpath("//book[@id='123']")
doc.xpath("//book").find { |book| book["id"] == "123" }
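
The same idea extends to the other unsupported predicate kinds. The sketch below uses plain hashes as a stand-in for the node list returned by `doc.xpath("//book")`, so that it runs without any XML library; it assumes only that each node can be read with `node["attr"]` and that the result is enumerable, as the workaround above does.

```ruby
# Stand-in for the result of doc.xpath("//book"). Plain hashes are used here
# for illustration; element nodes are read the same way via node["attr"].
books = [
  { "id" => "1", "type" => "technical", "lang" => "en" },
  { "id" => "2", "type" => "fiction" },
  { "id" => "3", "type" => "technical" }
]

# Instead of: //book[@type='technical' and @lang]   (logical operator)
technical_with_lang = books.select { |b| b["type"] == "technical" && b["lang"] }

# Instead of: (//book)[1] and (//book)[last()]       (position predicates)
first_book = books.first
last_book  = books.to_a.last

# Instead of: count(//book[@type='technical'])       (XPath function)
technical_count = books.count { |b| b["type"] == "technical" }

puts technical_with_lang.map { |b| b["id"] }.inspect
puts first_book["id"], last_book["id"], technical_count
```

`to_a.last` is used rather than `last` because `Enumerable` itself does not define `last`.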

For complete Ox adapter documentation including all limitations and workarounds, see Ox Adapter Guide.

HeadedOx adapter

The HeadedOx adapter combines Ox’s fast C-based XML parsing with Moxml’s comprehensive pure Ruby XPath 1.0 engine.

Status: Production-ready v1.2 (99.20% pass rate, 1,992/2,008 tests)

Key features:

  • Fast XML parsing (Ox C extension)

  • All 27 XPath 1.0 functions

  • 6 XPath axes (child, descendant, parent, attribute, self, descendant-or-self)

  • Expression caching for performance

  • Pure Ruby XPath engine (debuggable)

When to use:

  • Need Ox’s fast parsing with comprehensive XPath

  • Want XPath functions (count, sum, contains, etc.)

  • Prefer pure Ruby XPath for debugging

  • Basic namespace queries are sufficient

# Use HeadedOx adapter
context = Moxml.new(:headed_ox)
doc = context.parse(xml_string)

# Full XPath 1.0 support
books = doc.xpath('//book[@price < 20]')
count = doc.xpath('count(//book)')
titles = doc.xpath('//book/title[contains(., "Ruby")]')

For complete HeadedOx documentation including architecture, XPath capabilities, known limitations, and usage examples, see HeadedOx Adapter Guide and Limitations Documentation.

LibXML adapter

Performance:

  • Serialization speed: ~120 ips (slower than target)

  • Parsing speed: Good

  • For high-throughput serialization, consider Ox or Nokogiri

Other adapters

Nokogiri, Oga, REXML:

All three adapters have near-complete feature support with only minor edge case limitations. Use these adapters when you need full XPath and namespace support.

Round-trip XML Testing

Moxml includes comprehensive round-trip testing to verify that XML documents remain semantically equivalent when parsed and serialized across different adapters.

Purpose

Round-trip testing ensures:

  • Cross-adapter compatibility - XML parsed with one adapter (e.g., Nokogiri) can be serialized and re-parsed with another adapter (e.g., Oga) while preserving content

  • Structural fidelity - Element names, attributes, and document structure are maintained

  • Content preservation - Text content and entity references survive multiple parse/serialize cycles

  • Double round-trip verification - Source → Target → Source sequences produce semantically equivalent output

Test Fixtures

Round-trip tests use real-world XML documents organized into collections:

rfcxml - IETF RFC documents in XML format. These provide complex, standards-compliant XML with mixed content, namespaces, and attributes. The collection includes:

  • Large documents (500KB-2.4MB) for stress testing

  • Rich metadata and cross-references

  • Various XML schema patterns

metanorma - Metanorma document processing XML. These test:

  • Document structure preservation

  • Nested elements and complex hierarchies

  • Standard XML vocabularies

niso-jats - NISO Journal Article Tag Suite XML. These provide:

  • Scholarly publishing XML schemas

  • Rich bibliographic metadata

  • Mixed content models

Running Round-trip Tests

# Run all round-trip tests
bundle exec rake spec:consistency

# Exclude REXML for larger fixtures (faster, REXML is pure Ruby)
MOXML_ROUNDTRIP_REXML_MAX_SIZE=0 bundle exec rake spec:consistency

# Adjust the per-example timeout (default: 120 seconds)
MOXML_ROUNDTRIP_TIMEOUT=300 bundle exec rake spec:consistency

REXML is a pure Ruby XML parser and becomes very slow on large documents (500KB+). By default, REXML adapter pairs are skipped for fixtures exceeding 500KB. All other adapters (Nokogiri, Oga, Ox) are tested against every fixture.

Test Mechanics

For each fixture, tests run across all adapter pairs (4 adapters = 12 combinations):

  1. Parse with source adapter

  2. Serialize to XML string

  3. Parse serialized output with target adapter

  4. Compare semantic equivalence (element names, attributes, text content)

A "double round-trip" test additionally verifies: Source → Target → Source → Target produces consistent results.

Note
REXML is excluded from adapter pairs for fixtures larger than 500KB (configurable via MOXML_ROUNDTRIP_REXML_MAX_SIZE). This is because REXML is pure Ruby and cannot parse large XML documents in a practical timeframe. A per-example timeout (MOXML_ROUNDTRIP_TIMEOUT, default 120s) prevents tests from hanging indefinitely.
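
The pair counting and the REXML size rule can be sketched in a few lines of plain Ruby. This is an illustration rather than the test suite's code: the adapter list (Nokogiri, Oga, REXML, Ox) is inferred from the text above, `adapter_pairs` is a hypothetical helper, and the threshold assumes "500KB" means 500 × 1024 bytes.

```ruby
ADAPTERS = %i[nokogiri oga rexml ox].freeze
REXML_MAX_SIZE = 500 * 1024 # assumed byte value for the 500KB default

# Hypothetical helper: ordered (source, target) pairs to test for a fixture
# of the given size in bytes. REXML pairs are dropped for large fixtures.
def adapter_pairs(fixture_size, adapters: ADAPTERS, rexml_max_size: REXML_MAX_SIZE)
  pairs = adapters.permutation(2).to_a
  return pairs if fixture_size <= rexml_max_size

  pairs.reject { |pair| pair.include?(:rexml) }
end

puts adapter_pairs(10_000).size     # 12 ordered pairs from 4 adapters
puts adapter_pairs(2_000_000).size  # 6 pairs once REXML is excluded
```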

Ox Adapter Element Ordering Caveat

The Ox adapter produces elements in a different order than other adapters for certain fixtures with complex nested structures (e.g., element_citation.xml, collection1nested.xml, pnas_sample.xml). This causes the elements_with_attributes comparison to fail with "Array length mismatch" even though the semantic equivalence check (double round-trip) passes.

Round-trip tests automatically skip the elements_with_attributes comparison for these known Ox ordering issues. The ruby-versions CI job tests only the Nokogiri and Oga adapters. The nokogiri-ox and nokogiri-rexml CI jobs test Ox and REXML respectively, but are marked as experimental because these adapters lack full XML feature support:

  • Ox: Lacks proper namespace support, XPath with predicates, and uses a custom locate() method instead of standard XPath

  • REXML: Pure Ruby, exponential time complexity with document size, impractical for documents over ~500KB

For production use, prefer Nokogiri or Oga, which provide complete XML conformance.

To run tests with a specific adapter set locally:

# Nokogiri + Oga only (fast, full test suite)
MOXML_ROUNDTRIP_ADAPTERS=nokogiri,oga bundle exec rspec spec/consistency/ --tag round_trip

# Nokogiri × Ox only (experimental)
MOXML_ROUNDTRIP_ADAPTERS=nokogiri,ox MOXML_ROUNDTRIP_TIMEOUT=300 bundle exec rspec spec/consistency/ --tag round_trip

# Nokogiri × REXML only (experimental, small fixtures due to exponential complexity)
MOXML_ROUNDTRIP_ADAPTERS=nokogiri,rexml MOXML_ROUNDTRIP_TIMEOUT=300 MOXML_ROUNDTRIP_REXML_MAX_SIZE=50000 bundle exec rspec spec/consistency/ --tag round_trip
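
As a rough picture of how these variables might be interpreted, the sketch below parses them into a settings hash. Only the variable names and the 120-second timeout default come from this README; the helper `round_trip_settings`, the default adapter list, and the byte value used for the 500KB default are assumptions.

```ruby
# Hypothetical helper: reads the round-trip environment variables from `env`
# (ENV by default; any object responding to #fetch works for testing).
def round_trip_settings(env = ENV)
  {
    adapters: env.fetch("MOXML_ROUNDTRIP_ADAPTERS", "nokogiri,oga,rexml,ox")
                 .split(",").map { |name| name.strip.to_sym },
    timeout: Integer(env.fetch("MOXML_ROUNDTRIP_TIMEOUT", "120")),
    rexml_max_size: Integer(env.fetch("MOXML_ROUNDTRIP_REXML_MAX_SIZE", (500 * 1024).to_s))
  }
end

settings = round_trip_settings(
  "MOXML_ROUNDTRIP_ADAPTERS" => "nokogiri,ox",
  "MOXML_ROUNDTRIP_TIMEOUT" => "300"
)
puts settings.inspect
```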

Why Semantic Equivalence?

While a pure round-trip test with raw XML comparison would be ideal, different XML adapters have fundamentally different philosophies for handling:

  • Element ordering - Some preserve document order, others sort alphabetically

  • Whitespace handling - Some normalize spaces, others preserve exactly

  • Attribute representation - Different data structures for the same attributes

  • Text extraction - Varying approaches to concatenating text content

Instead of raw comparison, Moxml implements semantic equivalence testing that focuses on meaningful XML structure and content:

# Element name must match
expect(target_element.name).to eq(source_element.name)

# Attributes must be semantically equivalent
expect(target_attributes).to eq(source_attributes)

# Text content must be preserved (whitespace-normalized)
expect(normalized_text(target)).to eq(normalized_text(source))

# Document structure (element count) must match
expect(doc.xpath("//*").size).to eq(original.xpath("//*").size)

This approach tolerates adapter-specific serialization differences while ensuring the actual XML content remains intact.
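
A self-contained way to see the idea is to reduce each document to a comparable "signature" and compare those instead of the raw strings. The sketch below uses REXML directly rather than Moxml, and `semantic_signature` is a hypothetical helper, not the one used in the test suite.

```ruby
require "rexml/document"

# Hypothetical helper: reduces a document to a list of
# [element name, sorted attributes, whitespace-normalized direct text].
def semantic_signature(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//*").map do |element|
    attributes = []
    element.attributes.each_attribute do |attribute|
      attributes << [attribute.expanded_name, attribute.value]
    end
    text = element.texts.map(&:value).join.gsub(/\s+/, " ").strip
    [element.expanded_name, attributes.sort, text]
  end
end

compact = '<book id="b1" type="technical"><title>Ruby   Programming</title></book>'
pretty  = "<book type='technical' id='b1'>\n  <title>Ruby Programming</title>\n</book>"

puts compact == pretty                                          # false
puts semantic_signature(compact) == semantic_signature(pretty)  # true
```

The two strings differ in quoting, attribute order, and whitespace, yet produce the same signature, which is the kind of difference the round-trip tests are meant to tolerate.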

Development and testing

For complete information on development setup, testing strategies, benchmarking, and coverage reporting, see the Development and Testing Guide.

Contributing

  1. Fork the repository

  2. Create your feature branch (git checkout -b feature/my-new-feature)

  3. Commit your changes (git commit -am 'Add some feature')

  4. Push to the branch (git push origin feature/my-new-feature)

  5. Create a new Pull Request

License

Copyright Ribose.

This project is licensed under the Ribose 3-Clause BSD License. See the LICENSE.md file for details.
