@bwalsh bwalsh commented Jul 1, 2025

Add Support for Nested Objects in PFB Schema [#133]


Description

This pull request adds support for nested objects within the PFB (Portable Format for Biomedical Data) schema, resolving [Issue #133](#133).

Summary of Changes

  • Allows JSON Schema fields of type "object" to be used within PFB entities.
  • Recursively handles nested object properties during PFB schema parsing and Avro schema generation.
  • Ensures nested objects are correctly represented in both the JSON schema interpretation and the Avro output.

This enhancement improves compatibility with complex GraphQL schemas that include embedded structures, enabling more flexible and expressive metadata models within the PFB format.


Related Issue

Closes [#133](#133)


Testing

  • Confirmed existing unit tests pass.

  • Added new tests to validate:

    • Parsing schemas containing nested objects.
    • Correct Avro schema generation with nested fields.
    • Round-trip compatibility for PFB export and import with nested structures.

Documentation Updates

  • README.md

    • Add a section under "Schema Support" to mention nested object compatibility.
    • Briefly describe how nested object properties are represented in PFB output.

Checklist

  • Code changes tested locally
  • Unit tests updated for new functionality
  • Documentation updated (to be completed)
  • All existing tests pass
  • Verified backward compatibility with existing PFBs

Additional Notes

This update maintains backward compatibility for existing PFB workflows. Consumers of the library are not required to change anything unless they wish to leverage nested object support explicitly.


Implementation details


Changes to Recursive Object Property Handling (_any_map)

  • Introduced the _any_map(max_depth) helper to generate Avro map types for fields with "additionalProperties": true.

  • This ensures object properties that allow arbitrary key-value pairs are accurately modeled in the PFB schema output.

  • Provides explicit, bounded recursion for object nesting, controlled by max_depth.

  • Ensures downstream consumers (e.g., PFB readers) have a complete, well-defined schema for open-ended object fields.

  • Supports:

    • Primitive types (null, boolean, int, long, float, double, bytes, string).
    • Arrays containing primitive types or further nested maps.
    • Recursively nested maps, bounded by the configurable max_depth to prevent infinite recursion.
  • Applied during schema generation when a property has:

    {
      "type": "object",
      "additionalProperties": true
    }

Why

  • Improves compatibility with Gen3 dictionaries that define open-ended object fields.
  • Ensures downstream consumers of PFB output receive valid, self-contained Avro schemas for arbitrary nested structures.
  • Avoids runtime guesswork or incomplete schema definitions for these complex fields.
  • The previous implementation may not have expressed recursive, arbitrary-depth object types fully or correctly in Avro.

When It's Used

The function is invoked during schema generation for Gen3 dictionary fields where:

{
  "type": "object",
  "additionalProperties": true
}

This allows open-ended object fields to be represented as valid, recursive Avro map types in the PFB output.

Inside schema generation logic:

if property_type["type"] == "object" or property_type["type"] == ["object", "null"]:
    if property_type.get('additionalProperties', False):
        return _any_map(max_depth)

  • When a schema property is of type "object" with "additionalProperties": true, _any_map is invoked.
  • This ensures the corresponding PFB/Avro schema can express arbitrary nested key-value pairs, respecting the maximum recursion depth.
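
The check above compares the type against the exact list ["object", "null"], so a dictionary that declares ["null", "object"] would not match. One possible order-insensitive variant is sketched here; the helper name is_open_object is hypothetical and is not part of this pull request.

```python
def is_open_object(prop: dict) -> bool:
    """Return True when a JSON Schema property is an open-ended object.

    Hypothetical helper, not taken from the library: it accepts "type" as
    either a string or a list, in any order, and mirrors the truthiness
    test on "additionalProperties" used in the snippet above.
    """
    declared = prop.get("type")
    # Normalize a bare string such as "object" into a one-element list.
    types = declared if isinstance(declared, list) else [declared]
    return "object" in types and bool(prop.get("additionalProperties", False))


# Usage: both nullable orderings are treated the same way.
print(is_open_object({"type": ["null", "object"], "additionalProperties": True}))
print(is_open_object({"type": "object"}))
```

Whether the library should accept both orderings is a design decision for the maintainers; the sketch only illustrates the difference.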

How It Works

The _any_map(max_depth) function generates an Avro-compatible map type to represent arbitrary key-value pairs, supporting recursive nesting up to a configurable depth.

Behavior

  • Returns an Avro map where:

    • Keys are arbitrary strings.

    • Values can be:

      • Primitive types:
        "null", "boolean", "int", "long", "float", "double", "bytes", "string"

      • Arrays containing:

        • Primitive types
        • Nested maps (_any_map) for further recursion
      • Nested maps (_any_map) for recursive object structures
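
The behavior listed above can be approximated by a short recursive builder. This is a minimal sketch reconstructed from the description, not the code in this pull request: the name any_map_sketch and the constant PRIMITIVES are assumptions, and the real _any_map may differ in detail.

```python
PRIMITIVES = ["null", "boolean", "int", "long", "float", "double", "bytes", "string"]


def any_map_sketch(max_depth: int) -> dict:
    """Build an Avro map schema whose values nest up to max_depth levels.

    Assumption: at depth 0 only primitive values are allowed; each extra
    level adds one array branch and one map branch to the value union.
    """
    values = list(PRIMITIVES)
    if max_depth > 0:
        nested = any_map_sketch(max_depth - 1)
        # Arrays may hold primitives or maps that are one level shallower.
        values.append({"type": "array", "items": list(PRIMITIVES) + [nested]})
        # Maps may nest directly, again one level shallower.
        values.append(nested)
    return {"type": "map", "values": values}
```

For max_depth = 1 the sketch yields a union of the eight primitives plus one array branch and one map branch. Each union contains at most one array and one map, which keeps it within Avro's rule that a union may not hold two unnamed schemas of the same type.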


Example Output (max_depth = 1)

{
  "type": "map",
  "values": [
    "null",
    "boolean",
    "int",
    "long",
    "float",
    "double",
    "bytes",
    "string",
    {
      "type": "array",
      "items": [
        "null",
        "boolean",
        "int",
        "long",
        "float",
        "double",
        "bytes",
        "string",
        {
          "type": "map",
          "values": [
            "null",
            "boolean",
            "int",
            "long",
            "float",
            "double",
            "bytes",
            "string"
          ]
        }
      ]
    },
    {
      "type": "map",
      "values": [
        "null",
        "boolean",
        "int",
        "long",
        "float",
        "double",
        "bytes",
        "string"
      ]
    }
  ]
}
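
To make the depth bound concrete, the following sketch checks whether a plain Python value would fit a schema of the shape shown above. The functions fits_any_map, _fits_value, and _is_primitive are hypothetical illustrations; they ignore Avro's numeric ranges (for example, the 64-bit limit on long) and are not a substitute for validating against the real schema.

```python
def _is_primitive(value) -> bool:
    # Python counterparts of null, boolean, int/long, float/double, bytes, string.
    return value is None or isinstance(value, (bool, int, float, bytes, str))


def _fits_value(value, depth: int) -> bool:
    if _is_primitive(value):
        return True
    if depth <= 0:
        # At the deepest level only primitives are allowed.
        return False
    if isinstance(value, list):
        # Array items are primitives or maps one level shallower.
        return all(_is_primitive(i) or fits_any_map(i, depth - 1) for i in value)
    return fits_any_map(value, depth - 1)


def fits_any_map(value, max_depth: int) -> bool:
    """Return True if value is a string-keyed dict within the depth bound."""
    if not isinstance(value, dict):
        return False
    if not all(isinstance(key, str) for key in value):
        return False
    return all(_fits_value(v, max_depth) for v in value.values())


# Usage with max_depth = 1: one level of nesting fits, two levels do not.
print(fits_any_map({"a": {"b": 1}, "tags": ["x", {"k": 2}]}, 1))
print(fits_any_map({"a": {"b": {"c": 1}}}, 1))
```

Data nested more deeply than max_depth would not match any branch of the union, so a caller would need to choose max_depth to cover the deepest structure it expects.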

adds doc for nested maps

typo
@bwalsh bwalsh force-pushed the feature/nested-objects branch from 5cb143a to 92583ea Compare July 2, 2025 16:55