Design Proposal

Build the Platform. Ship the Future.

An AI-first application framework designed for rapid module development against a standardized platform. Platform-level data, AI agent architecture, execution engine, tool standards, and schedulers are solved once and shared by every module. Developers and AI agents build only business logic — the framework handles security, tenancy, orchestration, and infrastructure. This is how we deliver module packages at the speed of conversation.

Tenant Isolation · Module SDK · Platform Agents · Tenant Data Connector · On-Prem Bridge · Tool Security Policies
  • InspectionBuilder.ai: multi-tenant isolation, RBAC, module-scoped security
  • DataVisions.ai: 69+ tools, agent orchestration, SSE streaming, memory modes
  • Chitty.ai: skills-driven agents, dispatch, Bridge architecture, tool marketplace
  • Vanguard Workplace: plug-and-play modules, auto-discovery, CoreDataService, AI SDK

Paulson Palmer, Platform Architect — Data Visions Consultant
Every pattern in this document has been built, tested, and refined across production platforms over two years of AI-first development. Tenant isolation proven under real multi-tenant load. Agent execution engine ran thousands of sessions. Tool security framework processed production calls with audit logging. Module SDK shipped independent features without merge conflicts. This proposal distills the best of what worked — and the hard lessons from what didn't.

1. The Thesis: Making the Complex Simple

  • $1T+ in SaaS market cap wiped in February 2026
  • $285B erased in a single 48-hour window
  • 35% Atlassian decline — AI agents replace per-seat workflows
The SaaS model is collapsing. AI agents are replacing the per-seat, per-month software that enterprises have been paying for. Investors have ended the "SaaS premium" era and now evaluate companies on one metric: AI defensibility. The question is no longer who has the best model or the most sophisticated agent — models are becoming a commodity. The winner is the company that owns, governs, and operationalizes the golden data layer.

For Dematic, that golden data layer is proprietary material handling intelligence — conveyor configurations, PLC programs, fault code libraries, equipment performance baselines, and operational patterns learned across hundreds of distribution center deployments. No model can replicate this. No competitor can download it. It's the compounding advantage of being the company that designs, builds, installs, and supports the equipment — with field technicians on-site, customer support resolving issues, and parts supply chains keeping systems running. Dematic is present at every stage of the asset lifecycle, and every interaction generates data that makes the platform smarter.

But the value goes beyond monitoring software. This platform sells reliability, quality, and predictability for the full lifecycle of a customer's assets — from engineering design through installation, commissioning, daily operations, maintenance, and end-of-life planning. It provides probability-based insights into equipment health: when will this motor fail, what's the remaining useful life of this belt, which shuttles need preventive maintenance this quarter. Cradle-to-grave asset intelligence, powered by data that only the equipment manufacturer has.

This Is a Company-Wide Initiative, Not a Technology Project

The golden data layer doesn't build itself. It requires every department to adopt AI-first thinking — engineering, resident maintenance, customer support, commissioning, parts & supply chain, sales, and marketing. This starts at the top and is driven as a company-wide initiative:

  • Engineering includes AI-ready documentation in every system design — structured fault codes, labeled I/O maps, documented control logic, commissioning baselines
  • Resident Maintenance logs corrective actions with chain-of-thought reasoning — what they checked, what they found, what they replaced, and why — building predictive maintenance models
  • Customer Support captures diagnostic workflows as structured training pairs — so the next call with the same symptom pattern is resolved by an agent, not a 2-hour phone call
  • Commissioning records equipment baselines as birth certificates — so every future reading is compared against that specific unit's healthy state, not generic specs
  • Parts & Supply Chain tracks component lifecycle from installation through replacement — feeding predictive ordering models that eliminate emergency shipments
  • Sales & Marketing uses platform intelligence to demonstrate ROI with real data — "customers on the platform see 34% reduction in unplanned downtime" backed by fleet-wide metrics

This is not something individual departments can opt out of. AI-first data practices become part of department reviews, employee performance metrics, and standard operating procedures. The data is the gold — and mining it is everyone's job. The platform provides the infrastructure to capture, structure, and activate the data. The company provides the culture and commitment to fill it with quality.

It starts with engineering. When engineers design and configure a system, they create the fault code libraries, define the sensor thresholds, and program the PLC logic. Today that knowledge lives in project files and spreadsheets. The platform captures it as structured, AI-ready training data from day one — so every deployment builds the foundation for smarter diagnostics, better predictions, and faster resolution across the entire installed base. But engineering is just the first department. See Dematic Models (Reach Goal) for the full department-by-department strategy with concrete examples.

The Hidden Gem: Proprietary Design & Manufacturing Data

Beyond operational data, Dematic holds something no competitor and no foundation model has: decades of proprietary CAD designs, STEP files, 3D mesh models, CNC manufacturing data, and engineering specifications for material handling equipment. This data exists nowhere else on earth.

The industry is moving fast in this direction. Autodesk announced neural CAD foundation models that reason about geometry and manufacturing processes. Siemens built the first industrial foundation model, stating: "We have domain know-how, we understand our industries, and we have the data — together with AI, this is a winning combination." MIT researchers demonstrated AI agents that learn to use CAD tools to generate 3D objects from sketches. The infrastructure for training domain-specific models on engineering data is arriving now.

A Dematic-specific foundation model — trained on proprietary STEP files, assembly configurations, component specifications, and manufacturing procedures — could transform internal operations:

  • Design acceleration — AI that understands Dematic's design standards generates compliant conveyor layouts and component selections
  • Manufacturing validation — automated verification of CNC programs against design specifications before cutting
  • Quote accuracy, drift detection & cost contingency — AI cross-references design files with parts databases and manufacturing procedures to validate quotes, identify missing components, and detect deviations between the original design scope and what's actually being built or installed. Catches scope drift before it becomes a cost overrun — "this site was quoted for 6 aisles but the current design has 8" or "the motor spec changed from standard to heavy-duty but the quote still reflects standard pricing." With enough project history, the model predicts cost contingency requirements — "projects of this complexity and equipment mix historically run 8-12% over initial quote due to field modifications, recommend contingency budget of $X based on 47 similar past projects"
  • Predictive component lifecycle — models trained on design specs + operational data predict component failure timelines with engineering-grade accuracy
  • Knowledge preservation & design acceleration — engineering expertise embedded in training data instead of individual engineers' memories. Example: a junior engineer designing a Dematic Multishuttle system for a grocery distribution center. Today they spend weeks consulting senior engineers and searching past project files. With a Dematic model trained on hundreds of completed projects:

    The engineer describes the project: "Grocery DC, 40-foot clear height, mixed tote sizes, heavy product weights, seismic zone, 4 aisles, high throughput."

    The model responds with design guidance learned from similar past deployments — not from spec sheets, but from what actually worked and what caused problems:
    • Equipment selection: Recommends the Multishuttle 2 Flex carrier for mixed tote sizes — and warns that the real throughput constraint at this volume isn't the shuttle speed but the lift system at the aisle end, recommending dual lifts based on 8 similar grocery projects
    • Structural design: Flags that California seismic requirements need the reinforced rack upright profile at this height and tote weight — the standard profile passed on paper but 2 previous projects required costly retrofits during commissioning
    • Level optimization: Shows that mixed-tote systems can gain 2 extra storage levels by zoning smaller totes to upper levels — but notes that 5 of 11 similar projects chose uniform spacing for operational simplicity, and the choice depends on whether the WCS can enforce tote-size routing
    • Known pitfalls: Warns that the standard conveyor infeed speed causes tote jams with heavy grocery products — a problem resolved at 14 sites after installation that should be designed out from the start. Flags that California floor slabs frequently come in below the flatness spec the shuttle rails require — recommend tighter specs in the civil contract to avoid grinding delays
    • Downstream integration: Validates that the Multishuttle outfeed rate matches the downstream sorter capacity — and flags bottleneck risks from 3 past projects where the sorter couldn't keep up during peak
    The point isn't the specific numbers — it's that a senior engineer's 20 years of "I've seen this go wrong before" is preserved in training data from hundreds of deployments. A junior engineer gets that experience on their first project. The knowledge survives retirements, job changes, and organizational turnover

This data never leaves Dematic. Model training runs on internal infrastructure or private cloud. No third-party access. No external API exposure. The models, the training data, and the resulting intelligence are proprietary Dematic assets.

This is a strategic roadmap item — it starts with engineering adopting the data curation process through the Engineering Configuration module, building organized, AI-ready training data as part of every deployment. The foundation model opportunity compounds with every system Dematic designs and installs.

But proprietary data locked in SCADA historians, PLC registers, and engineering spreadsheets isn't a competitive advantage — it's a liability. The platform makes this data agent-accessible, tenant-governed, and operationally live. It connects AI to the enterprise truth that matters: what's running, what's failing, what's trending, and what to do about it.

The Market Shift

Traditional SaaS sells a one-size-fits-all product. Every customer gets the same software, pays for features they don't use, fights configuration systems that don't fit their workflow, and waits months for customizations that never come. AI agents are dismantling this model — why pay per seat for task tracking when an agent does it? Why pay for a dashboard tool when an agent builds the dashboard from live data?

But not all SaaS is equal. Generic horizontal software (task management, CRM data entry, reporting dashboards) is being automated away. Vertical platforms connected to proprietary operational data are not — because the data is the product, not the software.

Dying: Generic SaaS

  • One codebase, every customer gets the same product
  • Features hidden behind toggles and tiers
  • Per-seat pricing — AI agents reduce seats needed
  • Customer data locked into the vendor's database
  • AI bolted on later as a chatbot sidebar
  • No connection to operational reality

Winning: AI-Native Modular Platform

  • Custom module packages deployed at SaaS speed
  • AI code assistants build modules — faster user adoption by removing complexity
  • Proprietary data is the moat — agents make it accessible, not replaceable
  • Tenant data lives in their own isolated dataset
  • AI agents are infrastructure — every module gets agent capabilities
  • Bridge connects to live equipment, SCADA, and production systems

Our Position

This platform is not another SaaS product. It is an AI-native application framework that makes the complex simple:

Complex Reality · How the Platform Simplifies
Every customer's equipment is unique — different PLCs, different configurations, different fault codes Module SDK: custom modules built rapidly by AI code assistants against standard patterns. Application solutions delivered at SaaS implementation speed.
SCADA data is locked on-premises in proprietary formats Bridge: processes data at the edge, streams contracted metrics to the cloud. Agent-accessible without moving raw data.
Operators need answers, not dashboards they have to interpret Agent System: operators ask questions in natural language. Agents triage alarms, diagnose equipment, create work orders. Faster user adoption by removing complexity.
Engineering knowledge lives in people's heads and spreadsheets Engineering Configuration module: an agent reads PLC configuration files (Allen-Bradley L5X, Siemens TIA Portal exports) and automatically derives monitoring definitions, fault code libraries, and diagnostic procedures. Upload the config → agent parses it → generates the bridge monitoring setup → engineer reviews and approves. What used to take weeks of manual SCADA configuration becomes hours. The captured configurations also build the training data that makes agents smarter over time.
Every new customer deployment is a custom project Tenant model: each deployment is a tenant with their own data, modules, and bridge. Platform handles isolation, security, and scaling. Deploy in days, not months.
Integration with Control Tower, WMS, ERP is always custom Integrations as Modules: same SDK, same patterns. Build a connector module once, subscribe tenants. No custom integration code per customer.
💡
Making the Complex Simple: Dematic builds and operates the most complex material handling systems in the world. The platform doesn't replace that complexity — it makes it accessible. AI agents that understand conveyor systems. Modules built in days, not months. Operators who get answers instead of dashboards. Engineering knowledge that compounds across every deployment. Custom application solutions delivered at the speed of a SaaS implementation, with faster user adoption by removing the complexity that slows everything down.

2. Architecture Overview

The platform has two distinct data layers: the Platform Database and Tenant Data Stores. Above everything sits the Agent System. Below everything sits the Bridge.

The Platform Database is a shared PostgreSQL instance protected by Row-Level Security (RLS) — a PostgreSQL feature that automatically filters every query so that a tenant can only see its own rows, even if the application code doesn't include a WHERE tenant_id = ... clause. When a user logs in, the platform sets the tenant context on the database session, and PostgreSQL enforces the boundary at the engine level. This means a bug in module code, a misconfigured query, or a missing filter cannot leak data across tenants — the database itself prevents it. The Platform Database holds configuration, identity, and orchestration data only — not business data.

Tenant Databases are physically isolated per-tenant PostgreSQL databases for module business data, connected via the Tenant Data Connector. A tenant's business data lives in their own database — completely separate from other tenants and from the platform database.
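
As a sketch of how that RLS boundary might be enforced, assuming an asyncpg connection pool and an illustrative policy on the audit_log table — the real middleware and policy definitions belong to the platform, never to module code:

# Hypothetical sketch: the platform scopes a DB session to a tenant before any query runs.
# Policy and table names are illustrative; module code never touches this layer directly.
import asyncpg

RLS_POLICY_SQL = """
ALTER TABLE audit_log ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON audit_log
    USING (tenant_id::text = current_setting('app.current_tenant_id'));
"""

async def fetch_in_tenant_context(pool: asyncpg.Pool, tenant_id: str, query: str, *args):
    async with pool.acquire() as conn:
        # Set the session variable the RLS policies read; every query on this
        # connection is now filtered to the current tenant's rows by the engine.
        await conn.execute(
            "SELECT set_config('app.current_tenant_id', $1, false)", tenant_id
        )
        return await conn.fetch(query, *args)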

Platform-Level

Agent System: available to all modules. Every module can define agents, skills, and tools.
  • Agent Executor · Skills Registry · Memory Manager · Guardrails · Prompt Assembly · Tool Dispatcher · SSE Streaming · Failover · Model Registry · Embedding Generation

Platform Database (PostgreSQL · Shared · RLS): config, identity, and orchestration — not business data. Row-Level Security enforced on tenant_id.
  • Tenants · Users · Roles · Permissions · Module Registry · Module Subscriptions · Agent Registry · Bridge Registry · Scheduling Engine · Form Engine · Audit Log · Settings

Module SDK (SDK Interfaces · @platform_tool · CoreDataService)

Application Modules & Integration Modules

Independent module packages built against the SDK. Application modules deliver business features. Integration modules connect external systems and expose services. Both follow the same SDK contract — cross-module access goes through the Service Registry.
  • SCADA Monitor (Application Module): Agent · Skills · Bridge Tools; PLC read/write · Alarms
  • Maintenance (Application Module): Agent · Skills · Cloud Tools; Work orders · Dispatch
  • ETL Pipeline (Application Module): Agent · Skills · Bridge Tools; Data processing · Sync
  • Control Tower (Integration Module): WebSocket connector; exposes services via Registry
  • Cross-module call example: SCADA Monitor → Service Registry → Control Tower (the SCADA module calls Control Tower services for live equipment data)
  • SDK building blocks: module.json · @platform_tool · TenantDataConnector · Service Registry · Skill Packages · Bridge Tools · Scheduler Handlers · Vector Search

Streaming Service (Dedicated Service)

Dedicated, independently scaled service for real-time SCADA data streams. Bridges publish to tenant-scoped topics. Frontend modules subscribe for live dashboards, engineering views, and operational analytics.
  • WebSocket Hub · Topic Routing · Live State Cache (Redis) · Tenant-Scoped Streams · Frontend Fan-Out · Agent Event Subscriptions

Tenant Data Stores (Tenant Data Connector · contracted data to storage)
  • Tenant A: PostgreSQL, database tenant_a_modules (scada_readings, work_orders, alarms, etl_runs) · subscribed modules only · WebSocket
  • Tenant B: PostgreSQL, database tenant_b_modules (scada_readings, avg_fleet, alarms, etl_runs) · subscribed modules only · WebSocket
  • Tenant C: PostgreSQL, database tenant_c_modules (work_orders, inspections) · subscribed modules only
  • Connections: Tenant Data Connector direct connection · Bridge uplink over WebSocket, contracted data only

On-Premises per Tenant

Bridge: processes data at the edge. Only contracted data is transmitted upstream.
  • WebSocket Server · Tool Executor · SCADA Ingestion · AI / Deterministic Processing · Credential Vault · Health Monitor

ETL Engine (AI-Assisted · Deterministic Scripting)
  • AI-Assisted Transform: LLM-guided field mapping & schema inference
  • Deterministic Scripts: rule-based transforms with version-controlled logic
  • Pipeline Scheduler: cron, event-triggered & on-demand execution
  • Source Connectors: SCADA, REST, SQL, flat-file & streaming ingest
  • Validation & QA: schema checks, anomaly flags & reject queues
  • Run Audit Log: full lineage, row counts & error reporting per run

Local sources:
  • PLC / SCADA: industrial control
  • Sensors: IoT / telemetry
  • ETL Jobs: scheduled pipelines
  • Control Tower API: external integrations

The Key Separation

Layer · Data Type · Storage · Isolation
Platform DB Config, identity, orchestration — tenants, users, roles, permissions, module subscriptions, agent configs, bridge registrations, schedules, audit log Shared PostgreSQL Row-Level Security on tenant_id
Tenant Data Business data — sensor readings, work orders, alarms, ETL results, inspections — everything modules create and query Dedicated PostgreSQL database per tenant Physical isolation — dedicated PostgreSQL database per tenant
Bridge On-premises processing — SCADA ingestion, ETL pipelines, AI/deterministic processing, PLC interaction, raw sensor streams Bridge-local storage and processing. Only contracted data transmitted to tenant data store. Physical isolation — bridge deployed per tenant/site
Why this works: The platform database stays small and fast — it only holds configuration. Tenant business data lives in dedicated PostgreSQL databases — one per tenant, completely isolated. If a tenant wants to leave, export their database. Modules that need access to external data sources (BigQuery analytics, data lakes, third-party APIs) use module-level data connectors — that's a module integration concern, not a tenant data store choice. Module code uses the TenantDataConnector for the tenant's PostgreSQL database and module-specific connectors for everything else.

3. Tenant Model

A tenant is the top-level isolation boundary. Each tenant is assigned a unique slug (e.g., acme-corp) that drives their platform URL (acme-corp.app.dematic.com). The slug can optionally be mapped to the tenant's email domain (@acmecorp.com) for automatic routing — but email domain mapping is not required. Every tenant has their own users, roles, module subscriptions, data store, and optionally their own Bridge instances. Tenants never see each other's data, users, or configurations.

What a Tenant Has

ID

Identity

Name, slug, status (active/suspended). Billing contacts and technical contacts stored for coordination. Stored in Platform DB.

Users & Roles

Users belong to a tenant. Roles are defined per tenant. Permissions are scoped per module. A user in Tenant A cannot see Tenant B.

Module Subscriptions

Tenants subscribe to modules. When subscribed, the platform provisions tables in the tenant's data store from the module's schema definition.

Data Store

Each tenant gets their own dedicated PostgreSQL database, completely isolated from other tenants. The Tenant Data Connector provides the standard data access API. Module business data lives here.

Data Access

All access runs in tenant context — agents, users, and API calls can only reach their own tenant's data, tools, and bridges. Enforced by the platform through RLS and the Tenant Data Connector, not by the module.

Bridge Instances

Tenants register on-premises Bridges for SCADA ingestion, ETL processing, AI/deterministic workloads, and local tool execution. Each bridge is assigned to a tenant and optionally to a site/location. Only contracted data is transmitted to the platform.

Tenant Access Routing

Every tenant gets a slug-based URL ({slug}.app.dematic.com). The tenant is resolved from the URL before the user sees a login screen. Users are stored in the Platform DB with a tenant_id — when they hit their tenant's URL, the platform already knows the tenant context and presents the correct login experience (standard, SSO/SAML, or branded).

Scenario · URL · What Happens
Direct tenant URL tenant-a.app.dematic.com Tenant resolved from slug. Login screen loads in tenant context. User authenticates. Platform sets RLS + data connector. Primary method.
Email-based redirect app.dematic.com/login User enters email. If their email domain is mapped to a tenant (e.g., @acmecorp.com → acme-corp), they are redirected to the tenant's slug URL. Optional convenience — not required.
Multi-tenant user app.dematic.com For Dematic employees or consultants assigned to multiple tenants. Authenticate at the platform level, then select which tenant to enter. Switches tenant context on selection.
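
A sketch of what the slug resolution step could look like, assuming FastAPI middleware and a hypothetical get_tenant_by_slug lookup against the tenants table; the production middleware also covers the email-domain and multi-tenant paths above:

# Hypothetical sketch: resolve the tenant from the subdomain before authentication.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

async def get_tenant_by_slug(slug: str) -> dict | None:
    """Assumed Platform DB lookup against the tenants table; stubbed for illustration."""
    ...

@app.middleware("http")
async def resolve_tenant(request: Request, call_next):
    host = request.headers.get("host", "")              # e.g. "tenant-a.app.dematic.com"
    if host.endswith(".app.dematic.com"):
        tenant = await get_tenant_by_slug(host.split(".")[0])
        if tenant is None or tenant["status"] != "active":
            return JSONResponse({"detail": "Unknown tenant"}, status_code=404)
        request.state.tenant_id = tenant["id"]           # downstream auth + RLS use this
    return await call_next(request)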
💡
Why users live in the Platform DB: Users need a platform-level identity so they can be assigned to one or more tenants, and so the slug URL can validate them against the correct tenant on login. Business data lives in the tenant data store — user accounts and roles are platform infrastructure.

Tenant Lifecycle

1

Provision Tenant

Create tenant record in Platform DB with slug (e.g., tenant-a). Provision subdomain (tenant-a.app.dematic.com). Provision dedicated PostgreSQL database for the tenant.

2

Create Admin User & Roles

Create the tenant admin account in the Platform DB. Seed default roles (Admin, Operator, Viewer) with default permissions. Configure SSO/SAML if required.

3

Subscribe to Module Packages

Tenant selects module packages. For each subscription, the Tenant Data Connector creates the module's tables in the tenant's data store.

4

Configure & Connect

Tenant admin configures per-module settings, registers Bridge instances, sets up Platform Connectors, configures email domain mapping, invites users.

5

Operate

Users log in via tenant subdomain. Agents assist. Bridge processes SCADA data, runs ETL pipelines, and executes local tools. Only contracted data flows upstream through the Tenant Data Connector.


4. Platform Services

The Platform Database (shared PostgreSQL with RLS) holds only configuration, identity, and orchestration data. No business data. Module developers never touch this layer — they consume it through the SDK.

Platform Service · Key Tables · What It Provides
Tenant Registry tenants Tenant identity, slug, status, billing. The root isolation boundary.
Users & Auth users All user accounts scoped to tenants. Auth providers: email/password, OAuth, SSO/SAML, API keys.
Roles & Permissions roles, role_permissions, user_role_assignments RBAC per tenant per module. Permission codes follow MODULE.ACTION pattern.
Module Registry module_registry Catalog of all available modules. Synced from module.json manifests at startup.
Module Subscriptions tenant_module_subscriptions Which modules each tenant has activated. Triggers table provisioning in tenant data store.
Agent Registry agent_registry, agent_skills, agent_tool_permissions All agents, their skills, tool bindings, and execution configs. Platform-level — available to all tenants.
Bridge Registry bridge_instances, bridge_tool_sync, bridge_health_log Registered on-prem bridges per tenant. Health monitoring, tool definition sync, credential management.
Scheduling Engine schedules, schedule_assignments, schedule_occurrences Universal scheduling. Modules register schedule types. Platform generates occurrences and triggers handlers or agent runs.
Form Engine form_definitions Reusable JSON form schemas. Any module can create dynamic forms without custom UI code.
Integrations No dedicated tables Integrations are modules (Section 19). Connection config lives in the integration module's tenant_settings. Audit trail uses the platform audit_log. No separate integration framework needed.
Module Settings module_settings Per-tenant configuration for each module. JSONB, schema defined in manifest. Customize behavior without code.
Tenant Data Connector Config tenant_data_stores Connection details for each tenant's dedicated PostgreSQL database.
Platform Storage Service storage_buckets, storage_files Platform-level file storage using Google Cloud Storage. Buckets created and managed by the platform. Modules declare storage needs in their manifest. Files organized per tenant, per module. See below for detail.
Audit Log audit_log Immutable, append-only. Who changed what, when, through what interface (UI, API, agent, bridge). All tenants, all modules.
Platform Notifications notifications, notification_preferences Centralized communication to tenants: model changes, module updates, maintenance, bridge alerts, security events, and custom announcements. Modules can send tenant-scoped and user-specific notifications via the SDK. Email + in-app delivery. See below for detail.

Platform Storage Service

File storage is a platform-level service, not a module responsibility. The platform creates and manages Google Cloud Storage buckets, enforces tenant and module isolation at the path level, handles signed URL generation, and provides a standard API that modules consume. Modules never create buckets, manage credentials, or interact with GCS directly.

How It Works

1

Module Declares Storage Needs

In module.json, the module declares what storage it needs: purpose, max file size, and allowed content types. This is a request — the platform decides how to fulfill it.

"storage": [{
  "purpose": "equipment_photos",
  "max_file_size_mb": 50,
  "allowed_content_types": "image/jpeg,image/png,application/pdf"
}, {
  "purpose": "sensor_exports",
  "max_file_size_mb": 500,
  "allowed_content_types": "text/csv,application/json,application/parquet"
}]
2

Platform Provisions Bucket

At module load, the platform creates the GCS bucket (if it doesn't exist) and registers it in the storage_buckets table. One bucket per environment — tenant and module isolation is at the path level, not the bucket level.

3

Files Stored with Tenant + Module Path Prefix

Every file is stored at a structured path that enforces isolation:

gs://platform-storage-{env}/
  └── {tenant_slug}/
      └── {module_code}/
          └── {purpose}/
              └── {uuid}.{ext}

# Example:
gs://platform-storage-prod/tenant-a/SCADA_MONITOR/equipment_photos/a1b2c3d4.jpg
gs://platform-storage-prod/tenant-a/MAINTENANCE/work_order_attachments/e5f6g7h8.pdf
gs://platform-storage-prod/tenant-b/SCADA_MONITOR/sensor_exports/i9j0k1l2.parquet
4

File Metadata in Platform DB

File metadata (original filename, content type, size, uploader, tenant, module, purpose, GCS path) stored in the storage_files table with RLS on tenant_id. The actual file content lives in GCS — the Platform DB only tracks metadata.

5

Signed URLs for Access

Files are never served directly. The platform generates time-limited signed URLs (default 15 minutes) that grant temporary read access. URLs are scoped to the requesting tenant — a Tenant A signed URL cannot access Tenant B files.

Platform Storage API (Available to All Modules)

Method · Purpose
storage.upload(purpose, file, filename) · Upload a file. Platform validates content type and size against the module's manifest declaration. Stores at the correct tenant/module path. Returns file metadata.
storage.get_url(file_id) · Generate a signed URL for a file. Validates the requesting user's tenant matches the file's tenant. Returns time-limited URL.
storage.list(purpose, filters) · List files for the current tenant and module, filtered by purpose. RLS ensures tenant isolation.
storage.delete(file_id) · Soft-delete a file. Marks as deleted in storage_files. GCS object retained per retention policy. Audit logged.
storage.get_metadata(file_id) · Return file metadata (name, type, size, uploader, upload date) without generating a download URL.

Module Code Example

@platform_tool(
    module_code="MAINTENANCE",
    permission_code="MAINTENANCE.CREATE",
)
async def attach_photo_to_work_order(
    context: ToolContext,
    storage: PlatformStorage,    # Injected, pre-scoped to tenant + module
    data: TenantDataConnector,
    work_order_id: str,
    file: UploadFile,
) -> dict:
    # Platform handles: content type validation, size check, path generation,
    # GCS upload, metadata recording, tenant isolation — all automatic
    file_meta = await storage.upload(
        purpose="work_order_attachments",
        file=file,
        filename=file.filename,
    )

    # Link the file to the work order in tenant data store
    await data.update("work_orders",
        filters={"id": work_order_id},
        values={"photo_file_id": file_meta["file_id"]},
    )
    return {"file_id": file_meta["file_id"], "status": "attached"}
Module developers write zero storage code. No GCS client setup, no bucket creation, no credential management, no path construction, no signed URL generation. Declare what you need in the manifest, call storage.upload() and storage.get_url(). The platform handles provisioning, isolation, security, and serving.
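
As a companion to the upload example above, a sketch of serving a file back through a signed URL, using the same injected storage service; the MAINTENANCE.READ permission code here is an assumption for illustration:

# Hypothetical sketch: a read tool that returns a time-limited signed URL.
@platform_tool(
    module_code="MAINTENANCE",
    permission_code="MAINTENANCE.READ",
)
async def get_work_order_photo(
    context: ToolContext,
    storage: PlatformStorage,        # Injected, pre-scoped to tenant + module
    data: TenantDataConnector,
    work_order_id: str,
) -> dict:
    rows = await data.query(
        table="work_orders",
        filters={"id": work_order_id},
        limit=1,
    )
    if not rows or not rows[0].get("photo_file_id"):
        return {"work_order": work_order_id, "photo_url": None}

    # Platform validates tenant ownership and returns a URL that expires (default 15 minutes)
    url = await storage.get_url(rows[0]["photo_file_id"])
    return {"work_order": work_order_id, "photo_url": url}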

Platform Notifications

The Platform Notification Service is the centralized communication channel between the platform and its tenants. Every operational change — model swaps, module updates, maintenance windows, security patches, bridge health alerts, and custom announcements — flows through this service. Notifications are sent via email and in-app (visible in the tenant's dashboard). Accessible via CLI, Admin Console, and triggered automatically by platform events.

Automatic Notifications (Platform-Triggered)

Event · Trigger · Recipients
Model Deprecated Admin marks a model as deprecated with sunset date All tenant admins with agents using that model. Includes countdown and recommended action.
Model Sunset Reminders Countdown reaches 30, 14, and 7 days Tenant admins with agents still on the deprecated model. Urgency escalates with each reminder.
Model Swap Complete After platform models swap executes All affected tenant admins. Critical agents flagged with clone & test link.
Critical Agent Validation Critical agent flagged during model swap Tenant admin for the specific tenant. Includes direct link to clone & test in Admin Console.
Module Update Available New module version promoted to production Tenant admins subscribed to that module. Includes changelog and migration notes.
Module Data Migration Module update includes data_migrations Tenant admins. Includes which tables are affected and what changed.
Bridge Disconnected Bridge health monitor detects disconnect > 5 minutes Tenant admin + configured operators for that bridge.
Bridge Reconnected Bridge reconnects after a disconnect event Same recipients as the disconnect notification. Includes downtime duration.
Security Alert Failed login attempts, API key rotation, permission changes Tenant admin. Security-scoped — only the affected tenant is notified.
Tenant Subscription Change Module enabled or disabled for a tenant Tenant admin. Confirms what was activated/deactivated and any data implications.

Manual Notifications (Admin-Triggered)

# Send a maintenance notice to all tenants
$ platform notify --type maintenance \
    --subject "Scheduled maintenance: April 15, 02:00-04:00 UTC" \
    --message "Platform database maintenance. Expect brief API latency." \
    --tenants all

Notification sent to 12 tenant admins via email + in-app.

# Send to specific tenants
$ platform notify --type update \
    --subject "SCADA module v1.3.0 available" \
    --message "New features: motor temperature trending, predictive alerts." \
    --tenants tenant-a,tenant-b

Notification sent to 2 tenant admins.

# Custom communication
$ platform notify --type custom \
    --subject "Q2 platform roadmap published" \
    --message "View the updated roadmap at..." \
    --tenants all

Notification sent to 12 tenant admins.

Notification Channels

Channel · Detail
In-App Notification badge in the tenant's dashboard. Visible to tenant admins on login. Persistent until dismissed. Includes action links (clone & test, view changelog, acknowledge maintenance).
Email Sent to tenant admin email addresses. Platform-branded templates. Includes the same action links as in-app. Configurable per tenant — tenant admin can opt out of non-critical emails.
SMS / Text Message Optional channel for critical and urgent notifications (equipment faults, bridge disconnects, security alerts). Requires explicit opt-in per user — users provide their mobile number and consent through their profile settings. Tenant admin can configure which notification types are eligible for SMS. Users can opt out at any time. Compliant with TCPA/GDPR messaging requirements. Delivered via cloud SMS provider (Twilio, AWS SNS, or Google Cloud Messaging).
CLI Output Platform admins see notification confirmations in CLI output. platform notify list --recent shows recent notifications sent.

SMS Compliance Controls

  • Opt-in only — SMS is never enabled by default. Users must explicitly provide their mobile number and consent. No auto-enrollment.
  • Tenant-controlled eligibility — Tenant admin configures which notification types can trigger SMS (e.g., only critical and bridge_disconnect, never marketing or informational).
  • User opt-out — Users can disable SMS at any time from their profile. Replying STOP to any SMS also opts out automatically.
  • Rate limiting — Platform enforces max SMS per user per hour/day to prevent notification fatigue and cost overruns.
  • Audit trail — Every SMS sent is logged: recipient, message type, timestamp, delivery status. Stored in notification_preferences for compliance review.
💡
Notifications are operational infrastructure, not a feature. Every change that affects a tenant — model swaps, module updates, bridge health, maintenance windows, security events — is communicated through the same notification service. Tenant admins have a single place to see what's changing, what requires action, and what's coming. No surprises.

Module SDK: Custom Notifications

Modules can send notifications through the SDK using the send_notification platform tool. Notifications are scoped to the current tenant and can target the tenant admin, specific users, or roles. This is the same notification infrastructure the platform uses internally — modules don't build their own email or messaging systems.

# Module tool sending a user-specific notification
@platform_tool(
    module_code="MAINTENANCE",
    permission_code="MAINTENANCE.CREATE",
)
async def create_urgent_work_order(
    context: ToolContext,
    data: TenantDataConnector,
    notify: PlatformNotifications,  # Injected — scoped to current tenant
    equipment_id: str,
    description: str,
) -> dict:
    # Create the work order
    wo = await data.insert("work_orders", {
        "equipment_id": equipment_id,
        "description": description,
        "priority": "urgent",
    })

    # Notify the assigned technician (user-specific)
    await notify.send(
        to_user=wo["assigned_to"],
        subject="Urgent work order assigned: " + wo["id"],
        message=description,
        action_url="/maintenance/work-orders/" + wo["id"],
    )

    # Notify the tenant admin (role-based)
    await notify.send(
        to_role="admin",
        subject="Urgent WO created for " + equipment_id,
        message=description,
    )

    return {"work_order": wo["id"], "notifications_sent": 2}

Module Notification API

Method · Purpose
notify.send(to_user, subject, message, action_url) · Send to a specific user in the current tenant. Email + in-app.
notify.send(to_role, subject, message) · Send to all users with a specific role in the current tenant (e.g., all admins, all operators).
notify.send(to_tenant_admin, subject, message) · Send to the tenant admin. Shorthand for to_role="admin".
notify.broadcast(subject, message) · Send to all users in the current tenant. Use sparingly.
💡
Tenant-scoped by design. A module's notify instance is pre-scoped to the current tenant. A module in Tenant A cannot send notifications to Tenant B users. The platform enforces this the same way it enforces data isolation — the module never sees other tenants. Platform-wide notifications (maintenance, model changes) are admin-only via CLI or Admin Console.

5. Security Model

Security operates on three axes. All three must pass for every operation — API requests, AI agent tool calls, inter-module service calls, and bridge tool executions.

Axis 1: Tenant Isolation

Two-Layer Isolation

  • Platform DB: PostgreSQL Row-Level Security. Every table has tenant_id. On every DB session, the platform sets SET app.current_tenant_id = '{tenant_id}'. RLS policy: WHERE tenant_id = current_setting('app.current_tenant_id'). Queries physically cannot return another tenant's config data.
  • Tenant Data: Physical isolation. Each tenant has their own dedicated PostgreSQL database. A misconfigured query doesn't leak data — it hits the wrong database and finds no tables. Module code uses TenantDataConnector which is pre-routed to the current tenant.

Axis 2: RBAC — What Can You Do?

Roles are defined per tenant. Each role grants specific permission codes per module:

SCADA.READ_SENSORS       — View sensor data and PLC registers
SCADA.WRITE_PLC          — Write values to PLC registers
SCADA.ACK_ALARMS         — Acknowledge and dismiss alarms
MAINTENANCE.CREATE       — Create maintenance work orders
MAINTENANCE.ASSIGN       — Assign work orders to technicians
ETL.RUN_PIPELINE         — Trigger ETL pipeline execution
REPORTING.EXPORT         — Export data to external systems

Axis 3: Tool Security Policies — What Can Tools Do?

Every tool is a security principal with its own policy:

Policy Field · Purpose · Example
risk_level · Classification for audit and approval · safe, low, medium, high, critical
required_permission · RBAC permission the caller must have · SCADA.WRITE_PLC
requires_approval · Human must approve before execution · true for PLC writes
allowed_agents · Which module agents can invoke this tool · ["SCADA_MONITOR"]
denied_agents · Agents explicitly blocked · ["REPORTING"]
max_execution_time_ms · Hard timeout · 5000
audit · Logging level · none, standard, full
error_policy · Failure behavior · retry_once, halt_and_notify, escalate_to_human
execution · Where the tool runs · cloud or bridge
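
A sketch of what a full policy declaration might look like on a tool, assuming the @platform_tool decorator accepts these policy fields as keyword arguments — the existing examples in this document only show module_code and permission_code, so whether the fields live on the decorator or in module.json is a platform detail:

# Hypothetical sketch: a high-risk bridge tool carrying its full security policy.
@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.WRITE_PLC",
    risk_level="critical",
    requires_approval=True,            # a human approves before execution
    allowed_agents=["SCADA_MONITOR"],
    denied_agents=["REPORTING"],
    max_execution_time_ms=5000,
    audit="full",
    error_policy="halt_and_notify",
    execution="bridge",                # runs on the on-prem bridge, not in the cloud
)
async def write_plc_register(
    context: ToolContext,
    register: str,
    value: float,
) -> dict:
    # Execution is dispatched to the tenant's bridge; the bridge re-validates the policy locally
    result = await context.bridge.execute(
        "write_plc_register", {"register": register, "value": value}
    )
    return {"register": register, "written": value, "bridge_result": result}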

Authorization Flow

1

Authenticate & Resolve Tenant

Tenant already resolved from slug URL ({slug}.app.dematic.com) at login. JWT contains user_id and tenant_id. On every request, middleware sets RLS context: SET app.current_tenant_id.

2

Check Module Subscription

Verify this tenant has an active subscription to the requested module.

3

Check Permission

Verify user has the required permission code for this module in this tenant.

4

Check Tool Policy

If tool call: verify agent is allowed, check approval gate, evaluate risk level.

5

Route to Tenant Data

Resolve the tenant's data store via TenantDataConnector. All module queries are pre-routed to the correct dataset/database.

6

Execute & Audit

Execute the operation. Log per the tool's audit level. Track actor type (human, agent, bridge, service).

Module developers write zero security code. Tenant isolation (RLS + physical data separation), permission checking, tool policy enforcement, and audit logging are all handled by the Platform. Application-level hardening — API rate limiting, injection prevention, XSS/CSRF protection, file upload validation — is covered in Section 6. The module uses the @platform_tool decorator and TenantDataConnector.

6. Application Security

Application security is platform-enforced. Module developers write zero security code — the platform's middleware, SDK patterns, and module pipeline handle protection against external attacks, injection, cross-site scripting, and abuse. This section covers the security hardening that sits on top of the tenant isolation and RBAC model defined in Section 5.

API Security (Every Request)

The platform API middleware stack runs on every inbound request before it reaches any module code:

1

TLS Enforcement

All traffic over HTTPS. HTTP requests are redirected. HSTS headers enforce TLS for all subsequent connections. Certificate managed by Cloud Run.

2

Rate Limiting

Per-tenant, per-endpoint rate limits. Configurable in platform settings (e.g., 100 req/min per tenant for standard endpoints, 10 req/min for write operations). Returns 429 Too Many Requests with retry-after header. Prevents abuse and noisy-neighbor impact.

3

CORS Policy

Strict origin whitelist per tenant. Only {slug}.app.dematic.com is allowed as an origin for that tenant's API. No wildcard origins. Preflight requests validated. Blocks cross-origin attacks from other domains.

4

JWT Validation

Every request validates the JWT: signature verification, expiry check, tenant_id claim matches the slug URL, token not revoked. Refresh tokens are rotated on use (one-time use). Expired tokens return 401 — never silently passed through.

5

Request Validation

All request bodies validated against Pydantic schemas (schemas.py in every module). Type checking, field constraints, length limits. Malformed requests rejected with 422 before reaching any business logic. This is the primary defense against injection.

6

Response Sanitization

Outbound responses stripped of internal details: no stack traces, no database error messages, no tenant metadata leakage. Errors return structured error objects with safe messages and correlation IDs for support.
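
For the request validation step (5), a sketch of the kind of Pydantic model a module might declare in its schemas.py; field names and limits are illustrative:

# Hypothetical sketch: a request schema validated before any business logic runs.
# Malformed or out-of-range payloads are rejected with 422 at the middleware layer.
from typing import Literal
from pydantic import BaseModel, Field

class CreateWorkOrderRequest(BaseModel):
    equipment_id: str = Field(min_length=1, max_length=64)
    description: str = Field(min_length=1, max_length=2000)
    priority: Literal["low", "normal", "urgent"] = "normal"  # constrained to known values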

Injection Prevention

Attack Vector · Platform Defense · Module Developer Impact
SQL Injection TenantDataConnector uses parameterized queries exclusively. No raw SQL passes through the SDK's query(), insert(), update(), delete() methods. The raw_sql() escape hatch requires params argument — string concatenation is rejected. Zero. Use the connector API. Injection is not possible through the SDK.
XSS (Cross-Site Scripting) React's default JSX escaping prevents reflected XSS. Content Security Policy headers block inline scripts (script-src 'self'). No dangerouslySetInnerHTML allowed in module UI components — caught by module pipeline static analysis. Zero. Use React components. The pipeline rejects unsafe patterns.
CSRF (Cross-Site Request Forgery) CSRF tokens required on all state-changing requests (POST, PUT, DELETE). Tokens are tied to the user session and validated server-side. SameSite cookie flag set to Strict. Zero. Platform middleware handles token generation, injection, and validation.
Command Injection Bridge tool execution is sandboxed. Tool handlers receive typed parameters, never shell strings. No os.system() or subprocess.shell=True allowed — caught by module pipeline static analysis. Zero. Use context.bridge.execute() with typed params.
Path Traversal File upload paths are generated by the platform (UUID-based). User-supplied filenames are sanitized and never used as filesystem paths. Storage buckets enforce per-module path prefixes. Zero. Use the storage API from the module manifest.
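
For the rare case that needs the raw_sql() escape hatch, a sketch of the accepted form, assuming asyncpg-style $1 placeholders and an illustrative value column on scada_readings:

# Hypothetical sketch: parameterized raw SQL is accepted; string concatenation is rejected.
@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
)
async def zone_averages(
    context: ToolContext,
    data: TenantDataConnector,
    sensor_type: str,
) -> dict:
    # Accepted: values travel as bound parameters ($1), never interpolated into the SQL text
    rows = await data.raw_sql(
        "SELECT zone_id, AVG(value) AS avg_value "
        "FROM scada_readings WHERE sensor_type = $1 GROUP BY zone_id",
        params=[sensor_type],
    )
    # Rejected by the SDK: string concatenation such as
    #   data.raw_sql("... WHERE sensor_type = '" + sensor_type + "'")
    return {"zones": rows}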

Frontend Security

Protection · Implementation
Content Security Policy default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; connect-src 'self' *.app.dematic.com; frame-ancestors 'none' — no inline scripts, no external script sources, no iframing.
Auth Cookies HttpOnly (no JavaScript access), Secure (HTTPS only), SameSite=Strict (no cross-origin sends), short expiry with refresh rotation.
Security Headers X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: strict-origin-when-cross-origin, Permissions-Policy: camera=(), microphone=(), geolocation=()
Subresource Integrity All third-party scripts (if any) loaded with SRI hashes. Tampered scripts are blocked by the browser.
Module UI Isolation Module frontend components render within the platform shell. They cannot modify the shell, access other modules' DOM, or override security headers. The shell owns the auth state.

File Upload Security

Control · Detail
Type Validation File content type validated against the module manifest's allowed_content_types. Both MIME type header and magic bytes are checked — renaming a .exe to .pdf is caught.
Size Limits Enforced from the module manifest's max_file_size_mb. Oversized uploads rejected at the middleware level before reaching module code.
Storage Isolation Files stored in per-tenant, per-module GCS paths. A tenant cannot access another tenant's files. Signed URLs expire after configured duration.
Filename Sanitization User-supplied filenames are sanitized (stripped of path separators, special characters). Stored with UUID-based names. Original filename preserved as metadata only.

Bridge Security

Control · Detail
Mutual TLS Bridge WebSocket connection uses TLS with certificate pinning. Bridge authenticates to the platform with its API key. Platform authenticates to the bridge with its certificate.
Defense in Depth Bridge validates tool calls locally even though the platform already validated them. If the platform is compromised, the bridge still enforces tool policies independently.
Credential Vault Local PLC passwords, database connections, and API keys stored in encrypted local vault. Never transmitted to the cloud platform. Decrypted only at execution time.
Network Isolation Bridge only opens one outbound WebSocket connection to the platform. No inbound ports required. PLC/SCADA connections are on the local OT network — isolated from the IT network.

Module Pipeline Security Checks

The module validation pipeline (Section 23) includes security-specific checks that block unsafe modules before they reach production:

Check · What It Catches
No raw SQL Module code must use TenantDataConnector API. Direct database connections, raw SQL strings, or ORM session creation are rejected.
No unsafe patterns Static analysis flags: eval(), exec(), os.system(), subprocess.shell=True, dangerouslySetInnerHTML, innerHTML.
Dependency CVE scan All Python and npm dependencies checked against known vulnerability databases. Critical CVEs block promotion. High CVEs generate warnings.
Secrets detection Scan for hardcoded API keys, passwords, tokens, or connection strings in module source. Any match blocks upload.
Tool policy completeness Every @platform_tool must have a risk_level. Every bridge tool must have a complete security policy. Missing fields block validation.
The security promise: Module developers write zero security code. The platform middleware handles API hardening, the SDK prevents injection by design, the frontend shell enforces CSP and CSRF, and the module pipeline catches unsafe patterns before deployment. Security is infrastructure, not a developer responsibility.

7. Core Data Service

The CoreDataService provides access to platform data (users, roles, permissions, settings) with tenant isolation via RLS. It does not handle module business data — that goes through the TenantDataConnector (Section 8).

Method · Purpose
get_current_tenant() · Returns the current tenant's profile and deployment config
get_tenant_users() · Returns users in the current tenant (RLS-filtered)
get_user_permissions(module_code) · Returns permission codes the user has for a module
get_tenant_subscriptions() · Returns active module subscriptions for the tenant
get_module_settings(module_code) · Returns per-tenant settings for a module
get_active_bridges() · Returns bridge instances registered to the tenant
require_permission(permission_code) · FastAPI dependency — raises 403 if user lacks permission

RLS Enforcement

Every request goes through middleware that sets SET app.current_tenant_id = '{tenant_id}' on the database session. The RLS policy on every platform table ensures queries only return rows where tenant_id matches. Module developers never set this — it's automatic. Platform admin tools use a separate connection that can bypass RLS for cross-tenant operations.
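
A sketch of how a module route might consume the CoreDataService, assuming the FastAPI dependency style implied by require_permission above; how CoreDataService itself is injected, and the exact fields it returns, are platform details:

# Hypothetical sketch: a module route guarded by the platform's permission dependency.
from fastapi import APIRouter, Depends

router = APIRouter(prefix="/maintenance")

@router.get("/settings")
async def read_maintenance_settings(
    core: CoreDataService = Depends(),                                 # platform-provided service
    _auth: None = Depends(require_permission("MAINTENANCE.CREATE")),   # raises 403 if missing
):
    tenant = await core.get_current_tenant()
    settings = await core.get_module_settings("MAINTENANCE")
    return {"tenant": tenant["slug"], "settings": settings}            # field access illustrative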


8. Tenant Data Connector

The TenantDataConnector is the data access layer between module code and the tenant's dedicated PostgreSQL database. It provisions tables when modules are subscribed, routes queries to the correct tenant database, and provides a clean query API. Module developers call data.query() and data.insert() — the connector handles connection management, tenant isolation, and query execution.

What It Does

Responsibility · Detail
Provision When a tenant subscribes to a module, the connector reads the module's data_tables schema and creates those tables in the tenant's PostgreSQL database.
Route At runtime, the connector is injected into module tools and services pre-routed to the current tenant's data store. Module code calls data.query() without knowing which engine is underneath.
Abstract Translates the query API into optimized PostgreSQL queries. Handles connection pooling, prepared statements, and transaction management. Modules get ACID transactions, relational joins, and low-latency CRUD out of the box.
Deprovision When a tenant unsubscribes from a module, the connector can archive or drop the module's tables in the tenant's data store.
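
For the Provision step, a sketch of what the parsed data_tables definition might look like once the platform reads it from module.json; column types, index names, and the overall shape are illustrative, since the manifest schema is owned by the SDK:

# Hypothetical sketch: the parsed data_tables definition the connector provisions from.
MAINTENANCE_DATA_TABLES = [
    {
        "table": "work_orders",
        "columns": {
            "id": "uuid primary key",
            "equipment_id": "text not null",
            "description": "text",
            "priority": "text default 'normal'",
            "photo_file_id": "uuid",
            "created_at": "timestamptz default now()",
        },
        "indexes": ["equipment_id", "created_at"],
    },
]

# On subscription, the platform would call something like:
#   await data.provision_tables(MAINTENANCE_DATA_TABLES)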

Module Code Example

@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
)
async def query_sensor_history(
    context: ToolContext,
    data: TenantDataConnector,   # Injected, pre-routed to this tenant's data store
    zone_id: str,
    sensor_type: str,
    hours: int = 24,
) -> dict:
    # This query runs against the tenant's dedicated PostgreSQL database
    readings = await data.query(
        table="scada_readings",
        filters={"zone_id": zone_id, "sensor_type": sensor_type},
        time_range={"column": "timestamp", "last_hours": hours},
        order_by="timestamp DESC",
    )
    return {"readings": readings, "count": len(readings)}

TenantDataConnector API

Method · Purpose
query(table, filters, time_range, order_by, limit) · Read rows from a module table in the tenant's data store
insert(table, rows) · Insert one or more rows
update(table, filters, values) · Update matching rows
delete(table, filters) · Soft-delete matching rows (is_deleted = true)
aggregate(table, group_by, metrics, filters) · Run aggregation queries (count, sum, avg, min, max)
vector_search(table, query_vector, top_k, filters) · Similarity search against a vector column. Uses pgvector for cosine similarity search. Returns top_k nearest rows ranked by distance.
raw_sql(query, params) · Execute engine-specific SQL (escape hatch). params argument required — string concatenation rejected. See Section 6 injection prevention.
get_schema(table) · Return column names, types, and constraints for a table
provision_tables(module_schema) · Create module tables in the tenant's data store (called on subscription)
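
A sketch of vector_search in use, assuming a module table with a pgvector embedding column and a platform-provided embedding call; the fault_history table and context.ai.embed helper are assumptions for illustration:

# Hypothetical sketch: similarity search over past fault descriptions.
@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
)
async def find_similar_faults(
    context: ToolContext,
    data: TenantDataConnector,
    description: str,
    top_k: int = 5,
) -> dict:
    query_vector = await context.ai.embed(description)   # assumed embedding helper
    matches = await data.vector_search(
        table="fault_history",
        query_vector=query_vector,
        top_k=top_k,
        filters={"resolved": True},
    )
    return {"matches": matches, "count": len(matches)}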

Database Adapter

TenantDataConnector
└── PostgreSQLAdapter → Dedicated PostgreSQL database per tenant
    ├── asyncpg connection pool (async, high-performance)
    ├── pgvector extension for vector similarity search
    ├── Full ACID transactions, relational joins
    └── Connection pooling, prepared statements, tenant isolation
💡
One database, one adapter, zero ambiguity. Every tenant runs on PostgreSQL. Module developers get ACID transactions, relational joins, low-latency CRUD, and pgvector similarity search. No engine-specific quirks to work around, no eventual consistency surprises, no adapter mismatches.

Module Data Connectors (External Data Sources)

The TenantDataConnector handles the tenant's core PostgreSQL database. But modules often need to reach external data sources — a tenant's BigQuery warehouse for analytics, a Redshift cluster for historical reporting, a data lake for raw files, or a third-party API for enrichment. The platform provides a library of data connector adapters that modules can use for these integrations.

Core (every tenant, every module):

TenantDataConnector
└── PostgreSQLAdapter → Tenant's dedicated PostgreSQL database
    └── Always available. Default for module data_tables.

Module-level (opt-in per module, configured per tenant):

ModuleDataConnector
├── BigQueryConnector → Google BigQuery datasets
│   └── Analytics, reporting, large-scale data warehousing
├── RedshiftConnector → AWS Redshift clusters
│   └── Tenants on AWS with existing Redshift infrastructure
├── SynapseConnector → Azure Synapse Analytics
│   └── Tenants on Azure with existing Synapse pools
├── S3Connector / GCSConnector → Cloud storage (data lakes)
│   └── Raw file access, CSV/Parquet ingestion
└── CustomConnector → Any data source via module's connectors/ code
    └── REST APIs, MQTT, databases, SaaS platforms
Aspect · TenantDataConnector · Module Data Connectors
Scope Every module, every tenant. Automatic. Per module. Declared in module.json connectors section. Configured per tenant in the Admin Console.
Engine PostgreSQL only. One database per tenant. BigQuery, Redshift, Synapse, S3, GCS, REST APIs, custom. Depends on what the module and tenant need.
Configuration Automatic at tenant provisioning. No tenant-side config needed. Per-tenant connection details stored in the Integration Registry. Tenant admin provides credentials, project IDs, connection strings.
Use case Module's core business data — alarms, work orders, sensor readings, user preferences. The tables declared in data_tables. External data access — analytics warehouses, data lakes, third-party systems, legacy databases. Data the module reads from or writes to beyond its own tables.
Tenant isolation Physical isolation — dedicated database per tenant. Configured per tenant — each tenant's connector points to their own BigQuery project, Redshift cluster, etc. Credentials stored per tenant.
Clean separation: TenantDataConnector is the tenant's database — always PostgreSQL, always available, zero config. Module data connectors are external integrations — opt-in, per-module, per-tenant config. A module that only needs its own tables uses data.query() and never thinks about external sources. A module that needs BigQuery analytics declares a BigQueryConnector in its manifest and accesses it through the connector framework. Both patterns are platform-provided — module developers never build connection management, credential handling, or tenant isolation from scratch.
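
A sketch of what the opt-in pattern could look like for a module that needs BigQuery, assuming an injected ModuleDataConnector handle named from the module's manifest; the "analytics_warehouse" connector name, the connectors.get and warehouse.query calls, and the table queried are all illustrative, not a defined SDK surface:

# Hypothetical sketch: a reporting tool reading from the tenant's own BigQuery project.
# Per-tenant project IDs and credentials come from the Integration Registry, not module code.
@platform_tool(
    module_code="REPORTING",
    permission_code="REPORTING.EXPORT",
)
async def fleet_downtime_summary(
    context: ToolContext,
    data: TenantDataConnector,                 # tenant's PostgreSQL (core tables)
    connectors: ModuleDataConnector,           # external sources declared in module.json
    month: str,
) -> dict:
    warehouse = connectors.get("analytics_warehouse")   # BigQueryConnector for this tenant
    rows = await warehouse.query(
        "SELECT site, SUM(downtime_minutes) AS downtime "
        "FROM equipment_daily WHERE month = @month GROUP BY site",
        params={"month": month},
    )
    return {"month": month, "sites": rows}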

9. Platform Scheduler

The Platform Scheduler is a core platform service — not a module feature. It provides centralized scheduling for the entire platform: cron-based recurring tasks, event-driven triggers, agent session launches, and module handler invocations. Modules declare their scheduling needs in the manifest. The platform runs them. No module builds its own cron system, timer loop, or scheduling table.

Why This Is a Platform Service

Without the Platform Scheduler, every module that needs recurring tasks would build its own: a cron table, a polling loop, a timer process. That's duplicated infrastructure across every module, duplicated tenant isolation logic, duplicated error handling, and duplicated monitoring. The Platform Scheduler solves scheduling once: modules declare what should happen and when. The platform handles execution, tenant context, error recovery, and observability.

What the Scheduler Does

Capability | Detail
Cron schedules Recurring tasks at configurable frequencies (1m, 5m, 15m, 1h, daily, weekly, custom cron). Defined per entity type in the module manifest. The platform generates occurrences and invokes the module's handler function for each subscribed tenant at the scheduled time.
Agent session triggers Automatically start an agent session in response to events. A critical alarm fires → the scheduler spawns the SCADA agent with the alarm-triage skill and the alarm context injected. A PM comes due → the scheduler spawns the Maintenance agent to generate the work order. The module defines the trigger in the manifest — the scheduler handles the rest.
Event-driven triggers Respond to platform events: bridge connects, module subscription changes, tenant setting updated, webhook received, data threshold crossed. The module declares which events trigger which handlers or agents.
Per-tenant execution Every scheduled task runs in tenant context. A Maintenance module PM schedule runs once per subscribed tenant at the configured frequency for that tenant. Tenant A can have daily PMs, Tenant B can have weekly — same module, different tenant settings.
Tenant-configurable frequency Modules declare available frequencies in the manifest. Tenant admins choose which frequency applies to their deployment through the Module Settings screen. The scheduler adapts per tenant.

How Modules Use the Scheduler

Modules declare scheduling needs in module.json. The platform reads the manifest, registers the schedules, and runs them. The module provides a handler function in scheduler.py — the platform calls it at the right time with the right tenant context.

// In module.json — scheduler section
"scheduler": {
  "entity_types": [
    {
      "entity_type": "preventive_maintenance",
      "handler": "generate_pm_work_orders",      // Function in scheduler.py
      "frequencies": ["daily", "weekly", "monthly"], // Tenant admin picks one
      "description": "Generate preventive maintenance work orders based on maintenance plans"
    },
    {
      "entity_type": "sensor_poll",
      "handler": "poll_sensor_health",
      "frequencies": ["1m", "5m", "15m"],
      "description": "Poll equipment sensors for health status and trend detection"
    }
  ],
  "agent_triggers": [
    {
      "event": "critical_alarm",                  // Platform event type
      "action": "run_agent",                     // Launch an agent session
      "skill": "alarm-triage",                   // Activate this skill
      "inject_context": true,                   // Pass the event data as agent context
      "description": "Auto-triage critical alarms via SCADA agent"
    },
    {
      "event": "pm_due",
      "action": "run_agent",
      "skill": "maintenance-planning",
      "inject_context": true,
      "description": "Generate PM work order when a maintenance plan comes due"
    }
  ]
}
# scheduler.py — handler functions called by the Platform Scheduler

async def generate_pm_work_orders(
    context: SchedulerContext,
    data: TenantDataConnector,
) -> dict:
    # This runs in tenant context — data is pre-routed to the tenant's DB
    plans = await data.query("maintenance_plans", filters={"active": True})
    created = []
    for plan in plans:
        if plan_is_due(plan):
            wo = await service.create_work_order_from_plan(data, plan)
            created.append(wo)
    return {"work_orders_created": len(created)}

Scheduler Execution Flow

Module Manifest declares:
  entity_type: "preventive_maintenance"
  handler: "generate_pm_work_orders"
  frequencies: ["daily", "weekly", "monthly"]
        │
        ▼
Tenant Admin configures:
  Tenant A: weekly  (every Monday 6:00 AM)
  Tenant B: daily   (every day 5:00 AM)
  Tenant C: monthly (1st of month 4:00 AM)
        │
        ▼
Platform Scheduler runs:
  Monday 5:00 AM → Tenant B context → generate_pm_work_orders()
                   → Creates 12 PMs for Tenant B's equipment
  Monday 6:00 AM → Tenant A context → generate_pm_work_orders()
                   → Creates 8 PMs for Tenant A's equipment
  (Tenant C skipped — not 1st of month)
        │
        ▼
Each execution:
  ├── Tenant context set (RLS, data connector pre-routed)
  ├── Handler called with TenantDataConnector
  ├── Results logged in schedule_occurrences table
  ├── Errors retried per scheduler retry policy
  └── Audit trail: who, what, when, result

Agent Triggers

Agent triggers are the most powerful scheduler feature. They connect platform events to agent sessions — automatically spawning an agent with the right skill and context when something happens. No human in the loop. The event fires, the agent runs, the result is logged.

Event | Source | Agent Action | Example
critical_alarm | SCADA module detects alarm severity ≥ critical | Spawn SCADA agent with alarm-triage skill, inject alarm data as context | Fault 5012 fires on Shuttle 7 → agent auto-triages, reads history, recommends maintenance dispatch
pm_due | Scheduler detects a maintenance plan's next occurrence is today | Spawn Maintenance agent with maintenance-planning skill, inject plan details | Weekly belt inspection plan due → agent generates work order with parts list and estimated time
bridge_disconnect | Bridge heartbeat missed for > 5 minutes | Spawn monitoring agent, inject bridge status and last known state | Cincinnati DC bridge offline → agent checks recent data, notifies site supervisor, logs incident
etl_pipeline_failed | ETL module reports pipeline failure | Spawn ETL agent with diagnostics skill, inject error details and pipeline config | Nightly data sync failed → agent reads error log, identifies root cause, retries or escalates
webhook_received | External system sends webhook to a connector | Spawn module agent with the appropriate skill, inject webhook payload | WMS sends order update → agent processes change, updates inventory, dispatches pick tasks
Modules don't build schedulers. The module declares what should happen and when in the manifest. The platform handles: cron generation, per-tenant frequency configuration, tenant context injection, handler invocation, agent session spawning, retry on failure, occurrence tracking, and audit logging. A Preventive Maintenance module doesn't need its own cron table, its own timer loop, or its own scheduling UI — it declares entity_types and agent_triggers in the manifest and writes handler functions. The Platform Scheduler does the rest.
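
As an illustration of the trigger path, here is a simplified sketch of the matching step — how a platform event could be matched against a module's declared agent_triggers and turned into a tenant-scoped agent session. The spawn_agent_session callable and the event shape are assumptions for the sketch, not the platform's internals:

# Illustrative sketch — spawn_agent_session and the dict shapes are assumed.
from typing import Any, Awaitable, Callable

async def handle_platform_event(
    event: dict,                      # e.g. {"type": "critical_alarm", "tenant_id": "...", "payload": {...}}
    agent_triggers: list[dict],       # the module's "agent_triggers" section from module.json
    spawn_agent_session: Callable[..., Awaitable[Any]],
) -> None:
    for trigger in agent_triggers:
        if trigger["event"] != event["type"] or trigger.get("action") != "run_agent":
            continue
        # The session is scoped to the tenant that raised the event — never cross-tenant.
        await spawn_agent_session(
            tenant_id=event["tenant_id"],
            skill=trigger["skill"],                                   # e.g. "alarm-triage"
            context=event["payload"] if trigger.get("inject_context") else None,
        )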

Module SDK — The contract for building application modules

10. The Module SDK

A module is a self-contained folder that follows a standard structure. Drop it into app/modules/, and the Platform discovers it, mounts its routes, registers its services, wires up AI components, and provisions its tables in each subscribed tenant's data store.

📋
The SDK promise: The SDK enforces platform standards while giving independent developers full flexibility to build customer-focused solutions. Developers focus on business logic and domain expertise — not framework integrations, security plumbing, or tenant isolation. Whether building a custom module for a single tenant or a pre-built module approved for the marketplace, the same SDK contract applies: standard structure, standard tools, standard security — and complete freedom in what the module actually does.

What the SDK Provides

📜

Module Manifest Schema

module.json defines everything: identity, permissions, data tables, AI config, bridge tools, connectors, scheduler, settings.

🛡

@platform_tool Decorator

Wraps any function with auth, permissions, tool policy, tenant routing, audit. Module developer writes only business logic.

💾

TenantDataConnector

Injected into every tool and service. Pre-routed to the current tenant's data store. Same API regardless of engine.

🔍

CoreDataService

Access platform data (users, roles, settings) with RLS. One call gets permissions, another gets tenant config.

🔗

Service Registry

Register services for other modules. Invoke other modules' services with automatic permission checking.

📦

Connector Framework

Standard patterns for webhooks, REST APIs, WebSockets, MQTT. Per-tenant config via Integration Registry.

Auto-Discovery Process

1

Scan for Modules

At startup, scan app/modules/ for folders containing module.json.

2

Parse Manifest

Read identity, permissions, data_tables schema, AI config, bridge tools, connectors, scheduler, settings.

3

Mount API Routes & Register Extensions

Mount FastAPI router. Register services, connectors, scheduler handlers, AI components (agent, skills, tools), bridge tool definitions.

4

Sync Module Registry

Upsert to module_registry in Platform DB. The manifest is the single source of truth.

5

Provision Tenant Data (per subscription)

For each tenant subscribed to this module, verify their data store has the module's tables. Create any missing tables via TenantDataConnector.provision_tables().
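
The scan-and-parse steps above amount to a directory walk plus a manifest read. A hedged sketch, with registration and provisioning left to the real Module Loader:

# Illustrative sketch of discovery — route mounting, registry sync, and tenant
# provisioning are handled by the real Module Loader and are not shown here.
import json
from pathlib import Path

def discover_modules(modules_root: str = "app/modules") -> list[dict]:
    """Scan app/modules/ for directories containing module.json and parse each manifest."""
    manifests = []
    for manifest_path in sorted(Path(modules_root).glob("*/module.json")):
        manifest = json.loads(manifest_path.read_text())
        manifest["_module_dir"] = str(manifest_path.parent)   # where routes, tools, and UI live
        manifests.append(manifest)
    return manifests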

Module Loading: New Module vs. Module Update

The Module Loader handles two scenarios at startup — a module appearing for the first time and an existing module with a version change. Both are standardized and automatic.

Scenario | What the Module Loader Does
New module (first load) 1. Parse module.json — extract identity, permissions, data tables, AI config
2. Mount API routes at api_base_path
3. Register in module_registry with version and status
4. Register tools in Tool Registry, skills in Skills Registry, services in Service Registry
5. Queue bridge tool definitions for sync
6. No tenant data provisioned yet — tables are created when a tenant subscribes
Tenant subscribes to module 1. Read data_tables from manifest
2. Call TenantDataConnector.provision_tables(module_schema) for the tenant's data store
3. Creates all declared tables with columns, indexes, and partitioning
4. Records the module version that provisioned the tables in tenant_module_subscriptions
Module update (version change) 1. Module Loader detects version change in module.json vs. module_registry
2. Re-registers routes, tools, skills, services (hot-reload safe — new definitions replace old)
3. Reads data_migrations from manifest (see below)
4. For each subscribed tenant: compares their provisioned version against current version
5. Runs pending migrations against each tenant's data store via TenantDataConnector
6. Updates tenant_module_subscriptions with new version

Module Data Migrations

When a module update changes its data_tables schema (new columns, new tables, index changes), the manifest declares migrations. The Module Loader runs them automatically per tenant.

// In module.json — migrations section
"data_migrations": [
  {
    "version": "1.1.0",
    "description": "Add motor_temperature column to scada_readings",
    "operations": [
      {"action": "add_column", "table": "scada_readings",
       "column": "motor_temp_celsius", "type": "float", "nullable": true}
    ]
  },
  {
    "version": "1.2.0",
    "description": "Add alarm_embeddings table for vector search",
    "operations": [
      {"action": "create_table", "table": "alarm_embeddings",
       "columns": {
         "id": {"type": "uuid", "primary_key": true},
         "description": {"type": "string"},
         "embedding": {"type": "vector", "dimensions": 1536},
         "fault_code": {"type": "integer"},
         "created_at": {"type": "timestamp", "auto": true}
       }}
    ]
  }
]

Migration Rules

Rule | Detail
Migrations are declarative Operations are declared in the manifest as structured JSON — not raw SQL scripts. The TenantDataConnector translates operations to PostgreSQL DDL statements.
Additive only by default Supported operations: add_column, create_table, add_index, rename_column. Destructive operations (drop_column, drop_table) require a "destructive": true flag and are blocked in the validation pipeline unless explicitly approved.
Versioned and ordered Each migration has a version that matches the module version. Migrations run in version order. A tenant on v1.0.0 upgrading to v1.2.0 runs both the v1.1.0 and v1.2.0 migrations.
Per-tenant execution Migrations run against each subscribed tenant's data store independently. Tenant A can be on v1.2.0 while Tenant B is still on v1.1.0 (if Tenant B hasn't restarted yet). The Module Loader tracks the provisioned version per tenant.
Idempotent Migrations check before executing. add_column skips if the column already exists. create_table skips if the table exists. Safe to re-run.
Pipeline validated The Module Pipeline (Section 24) validates migrations before promotion: operations are syntactically valid, referenced tables exist in data_tables, no destructive operations without approval.

Migration Execution at Startup

# What happens inside load_modules() for an updated module:

Module: SCADA_MONITOR
  Registry version: 1.0.0
  Manifest version: 1.2.0
  Pending migrations: [1.1.0, 1.2.0]

  Tenant A (v1.0.0 → 1.2.0):
    ├── v1.1.0: ALTER TABLE scada_readings ADD COLUMN motor_temp_celsius FLOAT  
    └── v1.2.0: CREATE TABLE alarm_embeddings (...)                             
    Updated: tenant_module_subscriptions.provisioned_version = 1.2.0

  Tenant B (v1.1.0 → 1.2.0):
    └── v1.2.0: CREATE TABLE alarm_embeddings (...)                             
    (v1.1.0 migration skipped — already applied)
    Updated: tenant_module_subscriptions.provisioned_version = 1.2.0

  Tenant C (not subscribed): skipped
The module developer's experience: Add a column to data_tables, add a migration entry to data_migrations, bump the version. The Module Loader handles the rest — detects the version change, runs migrations per tenant, tracks the provisioned version. No manual database scripts. No per-tenant coordination. The TenantDataConnector translates the declarative migration to PostgreSQL DDL and applies it to each tenant's database.
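
To make the translation step concrete, here is a sketch of how one declarative operation from data_migrations could become idempotent PostgreSQL DDL — the type mapping and the execute callable are assumptions, not the TenantDataConnector implementation:

# Illustrative only — PG_TYPES and the execute() callable are assumptions.
from typing import Awaitable, Callable

PG_TYPES = {"float": "DOUBLE PRECISION", "integer": "INTEGER", "string": "TEXT",
            "boolean": "BOOLEAN", "uuid": "UUID", "timestamp": "TIMESTAMPTZ"}

async def apply_operation(op: dict, execute: Callable[[str], Awaitable[None]]) -> None:
    if op["action"] == "add_column":
        # IF NOT EXISTS keeps the operation idempotent — safe to re-run per tenant.
        nullable = "" if op.get("nullable", True) else " NOT NULL"
        await execute(
            f'ALTER TABLE {op["table"]} '
            f'ADD COLUMN IF NOT EXISTS {op["column"]} {PG_TYPES[op["type"]]}{nullable}'
        )
    elif op["action"] == "add_index":
        await execute(
            f'CREATE INDEX IF NOT EXISTS idx_{op["table"]}_{op["column"]} '
            f'ON {op["table"]} ({op["column"]})'
        )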

11. Module Anatomy

Every module on the platform follows the same directory structure. This is not a suggestion — it's the contract. The Module Loader expects specific files in specific locations. The Module Pipeline validates the structure before a module can be published. And critically, this structure is what makes coding agents effective: when an agent opens a module directory, it sees a predictable layout and knows immediately where business logic lives, where tools are defined, where skills go, and where the UI renders.

Why a Fixed Structure Matters

A coding agent (Claude Code, Cursor, or any compatible agent) that loads the Platform Module Development skills package knows this structure by heart. When you say "add an alarm severity filter to the SCADA module," the agent knows:

  • schemas.py — where to add the Pydantic model for the filter parameters
  • service.py — where to add the query logic
  • api.py — where to wire up the endpoint
  • ai/tools.py — where to expose it as an agent tool
  • ui/ — where to add the filter component to the alarm dashboard
  • tests/ — where to add the test

No searching, no guessing, no project-specific conventions to learn. Every module looks the same. The agent's skill package maps the structure to the right patterns. This is why experienced developers building with coding agents are more productive with a fixed structure, not less — the agent can execute confidently because there's no ambiguity.

Module Scaffolding

The Platform CLI generates a new module skeleton with one command. The skeleton contains every file the Module Loader expects, pre-configured with the correct imports, base classes, and SDK patterns. The developer fills in the business logic — the plumbing is already done.

# Generate a new module skeleton
$ platform modules create my_module --code MY_MODULE --name "My Module"

Created module skeleton at app/modules/my_module/

  app/modules/my_module/
  ├── __init__.py              # Empty (marks as Python package)
  ├── module.json              # Pre-filled: code, name, version, api_base_path
  ├── api.py                   # FastAPI router with example endpoint
  ├── service.py               # Service layer with TenantDataConnector pattern
  ├── schemas.py               # Example Pydantic schema
  ├── requirements.txt         # Backend deps (empty, add as needed)
  ├── ui/                      # Frontend skeleton
  │   ├── index.tsx            # Entry point with SDK hooks wired up
  │   ├── package.json         # @platform/ui-sdk dependency pre-configured
  │   └── tsconfig.json        # Extends platform base config
  └── tests/
      ├── test_api.py          # Example API test with TestTenantDataConnector
      └── test_service.py      # Example service test

# Add optional components
$ platform modules add-ai my_module         # Adds ai/ directory with agent.json + skill template
$ platform modules add-bridge my_module     # Adds bridge/ directory with tools.py template
$ platform modules add-connector my_module  # Adds connectors/ directory with base class
$ platform modules add-scheduler my_module  # Adds scheduler.py with handler template
💡
Scaffolding + Skills = instant productivity. The CLI generates the skeleton. The coding agent loads the Platform Module Development skills package. Between them, the developer has a running module in minutes — the skeleton compiles, the tests pass, the API mounts, and the UI renders in the platform shell. From there, every change is additive: add a table to data_tables, add a service function, wire up a tool, build a UI view. The agent knows the patterns because the skills package teaches them. The pipeline validates the result because the structure is predictable.

What the Skeleton Includes (Ready to Run)

File | Generated Content
module.json Pre-filled with module_code, module_name, module_version: "0.1.0", api_base_path. Empty sections for data_tables, permissions, ai, tenant_settings. Ready for the developer to populate.
api.py FastAPI APIRouter with a health-check endpoint. Imports from service.py and schemas.py pre-wired. The pattern is set — add endpoints by following the example.
service.py Example service function with TenantDataConnector parameter. Shows the data.query() pattern. The coding agent sees this and knows exactly how to add business logic.
schemas.py Example Pydantic model for a request and response. Strict typing, field validation. The pattern for all future schemas in the module.
ui/index.tsx React component with useApi(), usePermissions(), and useTenant() hooks already imported. Renders a placeholder panel. Platform CSS variables referenced. The developer replaces the placeholder with real content — the SDK integration is already done.
ui/package.json @platform/ui-sdk as a dependency. TypeScript and React pre-configured. Ready for npm install.
ui/tsconfig.json Extends the platform's tsconfig.base.json with strict: true. Module gets type safety for all SDK interfaces immediately.
tests/ Example tests using TestTenantDataConnector and TestToolContext fixtures. Shows how to test service functions and API endpoints in isolation. The coding agent follows this pattern for all new tests.
The skeleton is the skills package in code form. Every pattern the coding agent needs to know is demonstrated in the generated files: how to use the TenantDataConnector, how to structure a service function, how to wire up an API endpoint, how to use SDK hooks in the UI, how to write tests. The agent reads the skeleton and learns the module's conventions. New files follow the same patterns. The developer focuses on what the module does — the skeleton and the skills package handle how it integrates with the platform.

Full Module Directory Structure

app/modules/scada_monitor/
├── __init__.py
├── module.json              # Identity, permissions, data tables, AI, bridge tools, connectors
├── api.py                   # FastAPI router (auto-mounted)
├── service.py               # Business logic (uses TenantDataConnector)
├── schemas.py               # Pydantic request/response validation
├── services.py              # Inter-module service registration (optional)
├── scheduler.py             # Schedule handlers (optional)
├── connectors/              # Platform Connectors (optional)
│   └── control_tower.py
├── ai/                      # AI components (optional)
│   ├── agent.json          # Agent configuration
│   ├── skills/              # Skill packages (Agent Skills spec)
│   │   ├── alarm-triage/    # Each skill is a directory
│   │   │   ├── SKILL.md    # Required: frontmatter + instructions
│   │   │   ├── scripts/     # Optional: executable code
│   │   │   ├── references/  # Optional: docs loaded on demand
│   │   │   └── assets/      # Optional: templates, data files
│   │   └── conveyor-diagnostics/
│   │       ├── SKILL.md
│   │       ├── references/
│   │       └── assets/
│   └── tools.py            # @platform_tool implementations
├── ui/                        # Module frontend (optional)
│   ├── index.tsx             # Entry point — mounted by platform shell
│   ├── package.json          # Frontend dependencies (own installs)
│   ├── tsconfig.json         # Extends platform base (strict: true)
│   ├── pages/                 # Module's internal views
│   ├── components/            # Module's own React components
│   └── styles/                # Scoped CSS modules
├── bridge/                  # Bridge tool definitions (optional)
│   └── tools.py
├── requirements.txt         # Backend Python dependencies (own installs)
└── tests/
    ├── test_api.py
    └── test_service.py
🔌
No models.py or repository.py needed. Unlike traditional frameworks, module data tables are declared in module.json under data_tables and provisioned by the Tenant Data Connector. Modules use data.query() and data.insert() — no ORM models, no SQLAlchemy, no raw SQL, and no boilerplate.

What's Required vs. What's Convention

The platform requires exactly one file to recognize a module: module.json. Everything else is either required conditionally (only if the manifest references it) or is a convention that the skeleton and skills package establish. Modules are free to create any additional files, folders, and structure they need.

File | Status | When It's Required
module.json Always Required The Module Loader discovers modules by scanning for this file. Without it, the directory is invisible to the platform. This is the only file the platform looks for unconditionally.
api.py Conditional Required if the module has HTTP endpoints (most will). Must export a router variable. If the manifest declares api_base_path, the loader expects this file.
ui/index.tsx Conditional Required if the manifest declares a ui.entry. If the module has no UI (e.g., a background ETL module), this file and the entire ui/ directory can be omitted.
ai/tools.py Conditional Required if the manifest declares AI tools or skills with tools_required. If the module has no agent, omit the entire ai/ directory.
bridge/tools.py Conditional Required if the manifest declares bridge_tools. If the module doesn't need on-prem execution, omit.
service.py Convention Not enforced by the loader. The skeleton creates it and the skills package teaches the pattern because it's the right architecture (shared business logic). But the platform won't reject a module without it.
schemas.py Convention Not enforced by the loader. The pipeline does check that API endpoints use Pydantic validation — but the schemas can live anywhere the module wants.
services.py Convention Only relevant if the module registers services in the Service Registry. Not enforced by the loader.
scheduler.py Convention Only relevant if the manifest declares scheduler jobs. The handler functions can technically live anywhere importable.
tests/ Convention The Module Pipeline runs tests if they exist. No tests means nothing to run — but the pipeline logs a warning.

Modules Can Create Anything They Need

Beyond the required and conventional files, a module can create any folders and files it wants. The platform doesn't police your internal structure — it only reads what the manifest points to. If your module needs custom directories, add them.

app/modules/warehouse_management/
├── module.json                     # Required: the platform reads this
├── api.py                          # Conditional: manifest has api_base_path
├── service.py                      # Convention: shared business logic
├── schemas.py                      # Convention: validation models
│
├── integrations/                  # Custom: external system integrations
│   ├── salesforce/
│   │   ├── client.py              # Salesforce API client
│   │   ├── sync.py                # Order sync logic
│   │   └── mappings.py            # Field mapping configuration
│   └── sap/
│       ├── client.py
│       └── inventory_sync.py
│
├── wms/                            # Custom: WMS-specific business logic
│   ├── pick_optimization.py
│   ├── slot_allocation.py
│   └── wave_planning.py
│
├── reports/                        # Custom: report generators
│   ├── daily_throughput.py
│   └── templates/
│       └── throughput_report.html
│
├── docs/                           # Custom: module documentation
│   ├── architecture.md
│   ├── api-reference.md
│   └── integration-guide.md
│
├── scripts/                        # Custom: utility scripts
│   ├── seed_test_data.py
│   └── migrate_legacy_orders.py
│
├── data/                           # Custom: static data files
│   ├── warehouse_layouts.json
│   └── default_slot_config.csv
│
├── ai/                             # Conditional: manifest declares AI
│   ├── skills/
│   └── tools.py
├── ui/                             # Conditional: manifest declares UI
│   ├── index.tsx
│   └── package.json
├── requirements.txt
└── tests/
Your module, your structure. The platform needs module.json. It conditionally needs api.py, ui/index.tsx, ai/tools.py, and bridge/tools.py based on what the manifest declares. Everything else — service.py, schemas.py, integrations/, wms/, docs/, scripts/, data/ — is yours to organize however makes sense for your module. The skeleton gives you a starting point with proven conventions. The coding agent's skills package teaches those conventions. But they're conventions, not constraints. Build the internal structure that fits your domain.

File-by-File Breakdown

File | Required | Purpose | Key Rules
module.json Required The module manifest. Declares everything the platform needs to know: identity, version, permissions, data tables, AI config, bridge tools, connectors, scheduler, and settings. This is the single source of truth — the Module Loader reads this file to discover and wire up the entire module. Must be valid JSON. Must include module_code, module_name, module_version, and api_base_path. See Section 12 for the full schema.
api.py Required FastAPI router that defines the module's HTTP endpoints. Auto-mounted by the Module Loader at the api_base_path declared in the manifest. Handles request validation, calls into the service layer, and returns responses. Must export a router variable (APIRouter). Every endpoint must use Pydantic schemas from schemas.py for request/response validation. Never access the database directly — delegate to service.py.
service.py Required Business logic layer. Contains all domain logic, data access via TenantDataConnector, cross-module calls via Service Registry, and any computation. Both api.py and ai/tools.py call into this layer — it's the shared brain of the module. Receives TenantDataConnector and CoreDataService as injected dependencies. Never imports from other modules directly — use the Service Registry for cross-module access. All data queries go through data.query() / data.insert() / data.update().
schemas.py Required Pydantic models for request/response validation. Defines the shapes of data flowing in and out of the module's API. Shared across api.py, service.py, and ai/tools.py. Every API endpoint must use schemas for both input and output. Use strict types (no Any). Enums for constrained values. All schemas importable by other files in the module.
services.py Optional Inter-module service registration. Defines functions that other modules can invoke through the Service Registry. This is how module A exposes capabilities to module B without direct imports. Each registered service must declare its permission requirement. Functions receive TenantDataConnector automatically. Other modules call these via service_registry.invoke("MODULE_CODE", "service_name", ...).
scheduler.py Optional Handler functions for scheduled tasks declared in the manifest. The Platform Scheduler (Section 9) invokes these at the configured frequency per tenant. Each handler receives TenantDataConnector for the specific tenant being processed. Handlers must be idempotent — the scheduler may retry on failure.
connectors/ Optional Platform Connector implementations for external system integrations. Each connector file handles a specific integration type (REST API, WebSocket, MQTT, webhook). See Section 20. Connector config (endpoints, credentials, polling frequency) stored per-tenant in the Integration Registry — not hardcoded. Use the Connector Framework base classes for standard patterns.
ai/agent.json Optional Agent configuration that extends the manifest's ai.agent section with additional runtime details: context profiles, compaction preferences, recall event categories. Can also be configured entirely in the manifest — this file is for when the config is large enough to warrant separation. Values here merge with (and override) the manifest's ai.agent block. Same schema.
ai/skills/ Optional Skill packages following the Agent Skills spec. Each subdirectory is a self-contained skill with SKILL.md + optional scripts, references, and assets. See Section 14. Each skill directory must contain a SKILL.md with valid YAML frontmatter. Skills must be declared in the manifest's ai.skills array with their tools_required.
ai/tools.py Optional Cloud-side tool implementations using @platform_tool. These run on the platform server and access tenant data via TenantDataConnector. See Section 13. Every function must be decorated with @platform_tool. Business logic should call into service.py rather than being duplicated. Tools must have summary_instructions.
bridge/tools.py Optional Bridge tool definitions for on-premises execution. These are routed to the tenant's Bridge agent via WebSocket. The tool definition lives here; the actual execution happens on the tenant's infrastructure. See Section 18. Bridge tools must declare execution="bridge" in their @platform_tool decorator. Must have complete security policies — permission, audit, error_policy are all required for bridge tools.
tests/ Expected Module test suite. Unit tests for service logic, integration tests for API endpoints, and tool tests with mocked TenantDataConnector. Tests run against a test tenant context. The SDK provides TestTenantDataConnector and TestToolContext fixtures for isolated testing.

Module Layering Rules

Every module follows the same layered architecture. The layers enforce separation of concerns — each file has a clear responsibility and a clear set of dependencies. This consistency across all modules means developers can navigate any module without learning a new structure.

                      Incoming Requests

   HTTP (users, frontend)          Agent tool calls (LLM)
            │                               │
            ▼                               ▼
     ┌──────────────┐              ┌────────────────┐
     │   api.py     │              │  ai/tools.py   │
     │              │              │                │
     │  Routes      │              │ @platform_tool │
     │  Schemas     │              │  Security      │
     │  Validation  │              │  Guardrails    │
     └──────┬───────┘              └───────┬────────┘
            │                              │
            └────────►┌─────────────────────┐◄───────┘
                      │      service.py     │
                      │                     │
                      │  Business logic     │
                      │  Data access        │
                      │  Cross-module calls │
                      └──────────┬──────────┘
                ┌────────────────┼─────────────────┐
                ▼                ▼                 ▼
     ┌──────────────┐    ┌─────────────┐    ┌───────────────┐
     │ TenantData   │    │ CoreData    │    │ Service       │
     │ Connector    │    │ Service     │    │ Registry      │
     │              │    │             │    │               │
     │ data.query() │    │ get_user()  │    │ invoke()      │
     │ data.insert()│    │ get_perms() │    │ (cross-module)│
     └──────┬───────┘    └─────────────┘    └───────────────┘
            │
            ▼
     Tenant's Dedicated PostgreSQL Database

Dependency Rules

File | Can Import From | Cannot Import From
api.py | service.py, schemas.py, SDK base classes | ai/tools.py, bridge/tools.py, other modules
ai/tools.py | service.py, schemas.py, SDK base classes | api.py, bridge/tools.py, other modules
service.py | schemas.py, SDK services (TenantDataConnector, CoreDataService, ServiceRegistry) | api.py, ai/tools.py, other modules
schemas.py | Pydantic, Python stdlib | Everything else — schemas are leaf nodes with no internal dependencies
services.py | service.py, schemas.py | api.py, ai/tools.py, other modules
Modules never import from each other. If module A needs to call module B, it goes through the Service Registry (service_registry.invoke("MODULE_B", "service_name", ...)) or dispatches to module B's agent via dispatch_agents. Direct imports between modules are rejected by the Module Pipeline (Section 24). This isolation is what makes modules independently deployable and independently testable.
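
A minimal sketch of what a cross-module call looks like from the caller's side, matching the invoke() shape above; the MAINTENANCE service name and its parameters are assumed for the example:

# Illustrative only — the target service and its parameters are assumptions.
from typing import Any

async def escalate_alarm(service_registry: Any, alarm: dict) -> dict:
    # No import of the Maintenance module: the registry resolves the target service,
    # checks the caller's permission, and injects the target module's dependencies.
    return await service_registry.invoke(
        "MAINTENANCE", "create_work_order",
        equipment_id=alarm["zone_id"],
        description=f"Auto-escalated fault {alarm['fault_code']}",
    )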

The Service Layer Is the Center

The most important architectural decision in the module structure: service.py is the single source of business logic. Both the API endpoints and the AI tools call into the same service functions. This guarantees that the same business rules apply whether a human triggers an action via the UI or an agent triggers it via a tool call.

Anti-Pattern: Logic in Tools

  • Tool queries the database directly
  • API endpoint has its own query for the same data
  • Business rules duplicated in two places
  • Bug fix in one place missed in the other
  • Different validation logic for API vs. tool

Correct: Shared Service Layer

  • Tool calls service.get_active_alarms()
  • API endpoint calls service.get_active_alarms()
  • Business rules in one place
  • One fix, both paths updated
  • Same validation, same logic, same result
# service.py — the shared business logic

async def get_active_alarms(
    data: TenantDataConnector,
    zone_id: str | None = None,
    severity: str | None = None,
) -> list[dict]:
    filters = {"acknowledged": False}
    if zone_id:
        filters["zone_id"] = zone_id
    if severity:
        filters["severity"] = severity
    return await data.query("alarms", filters=filters, order_by="created_at DESC")


# api.py — HTTP endpoint calls the service

@router.get("/alarms")
async def list_alarms(zone_id: str | None = None, severity: str | None = None):
    data = get_tenant_data_connector()  # Injected by platform middleware
    return await service.get_active_alarms(data, zone_id, severity)


# ai/tools.py — agent tool calls the same service

@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
    risk_level="safe",
    summary_instructions="Get active (unacknowledged) alarms. Optionally filter by zone or severity.",
)
async def get_active_alarms(
    context: ToolContext,
    data: TenantDataConnector,
    zone_id: str | None = None,
    severity: str | None = None,
) -> list[dict]:
    return await service.get_active_alarms(data, zone_id, severity)

Minimal Module vs. Full Module

Not every module needs AI, bridge tools, or connectors. The SDK requires only the core files — everything else is opt-in. Here's the simplest possible module compared to a fully-featured one:

Minimal Module (API only)

  • module.json — identity + data tables
  • api.py — HTTP endpoints
  • service.py — business logic
  • schemas.py — validation
  • 4 files, no AI, no bridge, no connectors

Full Module (all features)

  • Everything in minimal, plus:
  • ai/ — agent, skills, cloud tools
  • bridge/ — on-prem tool definitions
  • connectors/ — external integrations
  • services.py — inter-module services
  • scheduler.py — periodic tasks
  • tests/ — test suite
💡
Start minimal, grow as needed. A module can ship with just module.json, api.py, service.py, and schemas.py. Add ai/ when you need an agent. Add bridge/ when you need on-prem access. Add connectors/ when you need external integrations. The Module Loader only wires up what exists — missing optional directories are silently skipped.

12. The Module Manifest

The module manifest (module.json) is the single file the platform reads to understand everything about a module. It declares identity, version, API path, permissions, data tables, AI agent configuration, bridge tools, connectors, scheduler jobs, and settings. The Module Loader parses this file at startup and uses it to wire up the entire module — mount routes, register tools, provision tenant tables, and configure agents. Nothing about the module is implicit — if it's not in the manifest, the platform doesn't know about it.

The Manifest Is the Contract

Think of module.json as the API between the module developer and the platform. The developer declares what the module needs (permissions, data tables, tools). The platform provides it (tenant isolation, security enforcement, data provisioning). Neither side needs to know the other's implementation details. A module developer never writes middleware, database migrations, or security plumbing — they declare what they need in the manifest and the platform delivers it.

  • Identity & version — how the platform tracks and loads the module
  • Permissions — what actions the module can control access to
  • Data tables — what data the module stores, provisioned per tenant automatically
  • AI config — agent persona, skills, tool bindings, model preferences, context profile
  • Bridge tools — on-prem tool definitions with full security policies
  • Connectors — external system integrations
  • Scheduler — periodic tasks and agent triggers
  • Settings — platform-level defaults and per-tenant overrides

Full Manifest Example (SCADA Monitor)

{
  "module_code": "SCADA_MONITOR",
  "module_name": "SCADA Monitoring",
  "module_description": "Real-time SCADA monitoring, alarm management, and PLC interaction",
  "module_version": "1.0.0",
  "api_base_path": "/api/v1/scada",

  // ── Data Tables (provisioned per tenant) ─────────
  "data_tables": {
    "scada_readings": {
      "columns": {
        "id": {"type": "uuid", "primary_key": true},
        "zone_id": {"type": "string", "indexed": true},
        "sensor_type": {"type": "string"},
        "value": {"type": "float"},
        "unit": {"type": "string"},
        "timestamp": {"type": "timestamp", "indexed": true},
        "created_at": {"type": "timestamp", "auto": true}
      },
      "partition_by": "timestamp",
      "retention_days": 365
    },
    "alarms": {
      "columns": {
        "id": {"type": "uuid", "primary_key": true},
        "zone_id": {"type": "string", "indexed": true},
        "fault_code": {"type": "integer"},
        "severity": {"type": "string", "enum": ["info", "warning", "critical"]},
        "message": {"type": "string"},
        "acknowledged": {"type": "boolean", "default": false},
        "acknowledged_by": {"type": "uuid", "nullable": true},
        "created_at": {"type": "timestamp", "auto": true}
      }
    }
  },

  // ── Permissions ─────────────────────────────────
  "permissions": [
    {"code": "SCADA.READ_SENSORS",  "description": "View sensor data"},
    {"code": "SCADA.WRITE_PLC",     "description": "Write to PLC registers"},
    {"code": "SCADA.ACK_ALARMS",    "description": "Acknowledge alarms"},
    {"code": "SCADA.CONFIGURE",     "description": "Configure thresholds"}
  ],

  // ── AI Agent ────────────────────────────────────
  "ai": {
    "agent_enabled": true,
    "agent": {
      "persona": "SCADA monitoring specialist for distribution center operations.",
      "model": "claude-sonnet-4-6",
      "max_iterations": 15,
      "memory_mode": "housekeeping",
      "dispatch_permissions": ["MAINTENANCE", "ETL_PIPELINE"],
      "failover_chain": ["claude-sonnet-4-6", "gpt-4o", "gemini-2.5-flash"]
    },
    "skills": [
      {"name": "alarm-triage", "path": "ai/skills/alarm-triage/",
       "tools_required": ["get_active_alarms", "acknowledge_alarm"]},
      {"name": "conveyor-diagnostics", "path": "ai/skills/conveyor-diagnostics/",
       "tools_required": ["read_plc_register", "query_sensor_history"]}
    ]
  },

  // ── Bridge Tools ────────────────────────────────
  "bridge_tools": [
    {"name": "read_plc_register", "security": {
      "risk_level": "medium", "required_permission": "SCADA.READ_SENSORS",
      "audit": "standard", "error_policy": "retry_once"
    }},
    {"name": "write_plc_register", "security": {
      "risk_level": "critical", "required_permission": "SCADA.WRITE_PLC",
      "requires_approval": true, "allowed_agents": ["SCADA_MONITOR"],
      "audit": "full", "error_policy": "halt_and_notify"
    }}
  ],

  // ── Connectors, Scheduler, Settings ─────────────
  "connectors": [{"name": "control_tower", "type": "websocket", "direction": "bidirectional"}],
  "scheduler": {
    "entity_types": [{"entity_type": "sensor_poll", "frequencies": ["1m", "5m", "15m"]}],
    "agent_triggers": [{"event": "critical_alarm", "action": "run_agent", "skill": "alarm_triage"}]
  },
  // ── Platform Settings (module-level defaults & constraints) ─
  "platform_settings": {
    "max_concurrent_pipelines": {"type": "integer", "default": 5, "locked": true},
    "data_retention_days": {"type": "integer", "default": 365, "min": 90, "max": 730},
    "enable_ai_agent": {"type": "boolean", "default": true},
    "bridge_heartbeat_interval_s": {"type": "integer", "default": 30, "locked": true}
  },

  // ── Tenant Settings (per-tenant overrides within constraints) ─
  "tenant_settings": {
    "alarm_severity_thresholds": {"type": "object", "default": {"critical": 90, "warning": 70}},
    "auto_acknowledge_info_alarms": {"type": "boolean", "default": false},
    "default_etl_frequency": {"type": "string", "default": "15m", "options": ["1m", "5m", "15m", "1h"]}
  }
}

Manifest Field Reference

Field | Required | Purpose | Validated At
module_code Required Unique identifier for the module (e.g., SCADA_MONITOR). Used in permissions, Tool Registry, Service Registry, and cross-module dispatch. Must be uppercase with underscores. Module load — duplicates rejected
module_name Required Human-readable display name shown in Admin Console and module marketplace. Module load
module_description Required Short description of the module's purpose. Displayed in the Admin Console module catalog and used by platform search. Module load
module_version Required Semantic version (e.g., 1.2.0). The Module Loader compares this against the module_registry to detect updates and run data_migrations. Module load — compared to registry
api_base_path Required URL prefix where the module's FastAPI router is mounted (e.g., /api/v1/scada). Must be unique across all modules. Module load — path conflicts rejected
data_tables Optional Declarative schema for the module's data tables. Each table defines columns with types, indexes, partitioning, and retention. Provisioned per tenant by TenantDataConnector. Module load + tenant subscription
data_migrations Optional Ordered list of schema migrations tied to version bumps. Run per tenant when a module version changes. Declarative operations translated to engine-specific DDL. Module Pipeline (Section 24) + Module load
permissions Optional Permission codes the module introduces (e.g., SCADA.READ_SENSORS). Registered in the platform permission system. Tools and API endpoints reference these codes for access control. Module load — registered in permission table
ai Optional AI configuration block: agent settings (persona, model, iterations, memory, dispatch permissions, failover chain) and skills (name, path, tools_required). See Section 14 and Section 16. Module load — skills validated against Tool Registry
bridge_tools Optional Bridge tool declarations with full security policies. Synced to the tenant's Bridge agent. Each entry declares name, risk level, permission, audit level, and error policy. Module load — security policy completeness enforced
connectors Optional External system integrations. Each connector declares a name, type (REST, WebSocket, MQTT, webhook), and direction (inbound, outbound, bidirectional). Per-tenant config stored in Integration Registry. Module load
scheduler Optional Scheduled task definitions: entity types with available frequencies, and agent triggers (events that auto-invoke the agent with a specific skill). See Section 9. Module load — skill references validated
platform_settings Optional Module-wide defaults and constraints. locked: true fields cannot be overridden by tenants. Defines the boundaries within which tenants can customize. Module load
tenant_settings Optional Per-tenant customization fields with defaults and valid options. Tenant admins configure these in the Admin Console. Resolved at runtime by merging with platform settings. Module load + tenant admin validation

Platform Settings vs. Tenant Settings

Level | Set By | Purpose | Examples
Platform Settings Module developer / Platform admin Module-wide defaults and hard constraints. locked: true settings cannot be overridden by tenants. Unlocked settings define the allowable range tenants can customize within. Max concurrent pipelines, data retention bounds, heartbeat interval, feature flags
Tenant Settings Tenant admin Per-tenant customization within the constraints defined by platform settings. Each tenant can tune the module to their operational needs without code changes. Alarm thresholds, auto-acknowledge rules, ETL frequency, notification preferences
💡
How it resolves at runtime: The SDK's CoreDataService.get_module_settings() merges both levels. Platform settings provide defaults and constraints. Tenant settings override where allowed. Locked platform settings are enforced — tenant overrides are silently ignored for locked fields. Module code calls one method and gets the resolved value.
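
A sketch of that resolution rule, written against the manifest fields above; resolve_module_settings here is illustrative, not the SDK function:

# Illustrative merge of platform_settings defaults/constraints with tenant overrides.

def resolve_module_settings(platform_settings: dict, tenant_overrides: dict) -> dict:
    resolved = {}
    for key, spec in platform_settings.items():
        value = spec.get("default")
        if not spec.get("locked", False) and key in tenant_overrides:
            value = tenant_overrides[key]            # tenant may override unlocked settings
            if "min" in spec:
                value = max(value, spec["min"])       # clamp into the allowed range
            if "max" in spec:
                value = min(value, spec["max"])
        resolved[key] = value                          # locked settings keep the platform value
    return resolved

# Example with the SCADA manifest values above:
# resolve_module_settings(
#     {"data_retention_days": {"type": "integer", "default": 365, "min": 90, "max": 730},
#      "max_concurrent_pipelines": {"type": "integer", "default": 5, "locked": True}},
#     {"data_retention_days": 180, "max_concurrent_pipelines": 50},
# )  → {"data_retention_days": 180, "max_concurrent_pipelines": 5}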
💡
Key change from v1: The data_tables section replaces schema_name and models.py. Tables are declared in the manifest and provisioned per tenant by the Tenant Data Connector. Module data lives in the tenant's dedicated PostgreSQL database, separate from the platform PostgreSQL that handles configuration and identity.

13. Module Tools & @platform_tool

Tools are the actions an agent can take. Every tool on the platform — whether it reads sensor data, writes to a PLC, creates a work order, or dispatches to another agent — is a Python function wrapped with the @platform_tool decorator. The decorator is the contract between the module developer and the platform: the developer writes the business logic, the decorator handles everything else — authentication, tenant isolation, permission checks, input validation, guardrails, execution routing, audit logging, and instruction injection into the agent's prompt.

What the Developer Writes vs. What the Platform Provides

A module developer writing a tool focuses on one thing: the function body — the actual business logic of what the tool does. Everything surrounding that function is handled by the @platform_tool decorator and the platform runtime.

📝

Developer Writes

The function body (business logic), type-hinted parameters (auto-generates JSON Schema), summary_instructions (concise agent guidance), and optional full_instructions + examples (detailed docs, retrieved on demand).

🛡

Platform Provides

Auth & tenant resolution, permission enforcement, agent policy (allow/deny), approval gates, pre/post-condition guardrails, execution routing (cloud or bridge), timeout & error handling, response sanitization, audit logging, and instruction injection into the agent prompt.

Tools exist in two execution modes: cloud tools run on the platform server and access tenant data via TenantDataConnector. Bridge tools are routed to the tenant's on-premises Bridge agent via WebSocket for actions that require local network access (PLC reads/writes, local databases, equipment APIs). Both modes use the same @platform_tool decorator and the same security pipeline — the only difference is where the code executes.

Cloud Tool (tenant data via TenantDataConnector)

@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
    risk_level="safe",
    summary_instructions="Query recent sensor readings. Returns time-series data.",
    # Guardrails — enforced by the executor at runtime (Section 16)
    pre_conditions=[
        "hours >= 1 and hours <= 720",          # Max 30 days
        "sensor_type in VALID_SENSOR_TYPES",    # Enum check
    ],
    post_conditions=[
        "result.count <= 50000",                # Cap row count
    ],
    on_failure="warn",                            # Log + continue
)
async def query_sensor_history(
    context: ToolContext,
    data: TenantDataConnector,   # Pre-routed to this tenant
    zone_id: str,
    sensor_type: str,
    hours: int = 24,
) -> dict:
    # No tenant logic, no security code, no engine-specific SQL
    readings = await data.query(
        table="scada_readings",
        filters={"zone_id": zone_id, "sensor_type": sensor_type},
        time_range={"column": "timestamp", "last_hours": hours},
        order_by="timestamp DESC",
    )
    return {"readings": readings, "count": len(readings)}

Bridge Tool (on-premises via Bridge WebSocket)

@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.WRITE_PLC",
    execution="bridge",
    risk_level="critical",
    requires_approval=True,
    allowed_agents=["SCADA_MONITOR"],
    audit="full",
    error_policy="halt_and_notify",
    summary_instructions="Write a value to a PLC register. CRITICAL: Controls physical equipment.",
    # Guardrails — enforced by the executor at runtime (Section 16)
    pre_conditions=[
        "register_address matches r'^PLC-\\d{3}-R\\d{4}$'",  # Valid address format
        "value >= 0 and value <= 65535",                      # 16-bit register range
    ],
    post_conditions=[
        "result.status in ['success', 'acknowledged']",       # Bridge confirmed write
    ],
    success_criteria="read_after_write",                        # Verify by reading back
    on_failure="halt",                                           # Stop agent immediately
)
async def write_plc_register(
    context: ToolContext,
    register_address: str,
    value: int,
    plc_id: str,
) -> dict:
    # Routed to tenant's bridge via WebSocket automatically
    return await context.bridge.execute("write_plc", {
        "address": register_address, "value": value, "plc_id": plc_id,
    })

What the Decorator Handles

  1. Tenant scoping (hard boundary): Resolves tenant from JWT. Injects a TenantDataConnector that is pre-routed to the current tenant's data store only. Sets RLS context on Platform DB. Routes bridge calls to the current tenant's bridge only. The tool function receives a connector that is physically incapable of reaching another tenant's data. This is not a convention — it's an architectural constraint enforced by the platform before the tool code executes.
  2. Request validation: Input validated against Pydantic schemas before execution (Section 6 — injection prevention)
  3. Pre-condition guardrails: Business-rule validation on inputs — range checks, format checks, cross-field logic. Runs before execution. See Section 16 for the full guardrails lifecycle.
  4. Module subscription check: Verifies tenant subscribes to this module
  5. Permission check: Verifies user has the required permission in this tenant
  6. Agent policy: Checks allowed/denied agent lists
  7. Approval gate: If required, emits SSE approval_request and waits
  8. Execution routing: Cloud handler or Bridge WebSocket (routed to the current tenant's bridge)
  9. Post-condition guardrails: Output validation — expected fields, value ranges, status codes. Runs after execution.
  10. Success criteria: Optional verification step (e.g., read-after-write for critical operations)
  11. Failure action: If any guardrail fails: halt / pause / warn / retry / collect / skip
  12. Timeout + error policy: Enforces limits, applies retry/escalation/halt
  13. Response sanitization: Strips internal errors and tenant metadata from responses (Section 6)
  14. Audit: Logs per the tool's audit level (includes guardrail pass/fail results)
  15. Instruction injection: summary_instructions injected into agent prompt
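
To show the ordering in code form, here is a heavily simplified sketch of how such a decorator could compose those steps — every helper on the context object (check_permission, run_guardrails, audit_log) is a placeholder, not the platform's implementation:

# Simplified, illustrative decorator — not the real @platform_tool.
import functools
from typing import Any, Callable

def platform_tool_sketch(*, permission_code: str, pre_conditions=None, post_conditions=None):
    def decorator(fn: Callable[..., Any]):
        @functools.wraps(fn)
        async def wrapper(context, data, **params):
            # 1. Tenant scoping happened before this point: context and data are pre-scoped.
            # 2–6. Validation, subscription, permission, and agent-policy checks.
            await context.check_permission(permission_code)            # assumed helper
            await context.run_guardrails(pre_conditions, params)        # pre-conditions
            # 7–8. Approval gate and cloud/bridge routing would sit here.
            result = await fn(context, data, **params)
            # 9–11. Post-conditions, success criteria, and the configured failure action.
            await context.run_guardrails(post_conditions, result)
            # 12–15. Timeout/error policy, response sanitization, audit, instruction injection.
            await context.audit_log(fn.__name__, params, result)        # assumed helper
            return result
        return wrapper
    return decorator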
🛡
Tenant isolation is not optional — it's structural. Every @platform_tool function receives a TenantDataConnector and ToolContext that are pre-scoped to the current tenant by the platform. There is no API to "switch tenants," no parameter to pass a different tenant_id, and no way for module code to construct a connector for another tenant. The agent, the tool, and the data layer all operate within the same tenant boundary — set once at session start from the JWT and immutable for the session's lifetime.

This means:
  • data.query() can only reach the current tenant's tables — other tenants' data stores are not addressable
  • context.bridge.execute() routes to the current tenant's bridge only — other tenants' bridges are not reachable
  • context.agent_system.embed() tracks usage against the current tenant — billing is per-tenant
  • dispatch_agents dispatches within the current tenant only — an agent in Tenant A cannot dispatch to a Tenant B agent
  • Platform DB queries run with RLS set to the current tenant_id — even misconfigured queries cannot return another tenant's config
Module developers cannot accidentally violate tenant isolation because the SDK doesn't expose the mechanisms to do so. This is a platform-level guarantee, not a coding convention.
💡
From decorator to agent: When a module loads, the Module Loader discovers every @platform_tool-decorated function, validates its security policy and parameter schema, and registers it in the platform Tool Registry. Skills declare which tools they need via tools_required, and agents receive tools through their skills at runtime. See Section 15 — Tool Registry & Agent Binding for the full lifecycle, registry structure, and how tools connect to agents.

14. Module Development & SDK Tooling

Module development happens on the developer's local machine. Testing and validation happen on the Development Platform — a full platform instance that mirrors production. The Platform CLI bridges the two: push code from local to dev, run tests, stream logs, and iterate. When the module is ready, it goes through the Module Pipeline to production.

Two Environments

Development Platform

  • Full platform instance — same code, same services, same security as production
  • Dev tenants with test data
  • All production modules loaded (synced from prod)
  • Developer's module mounted on top via CLI push
  • Full logging, tracing, agent execution
  • Fast reload on code push (< 5 seconds)
  • Shared by the development team — each developer gets an isolated dev tenant

Production Platform

  • Live tenant data and real users
  • Only modules that passed the full pipeline
  • No direct code pushes — pipeline only
  • Production logging and monitoring
  • Changes synced to development when needed
Developer's Local Machine           Development Platform            Production Platform
┌────────────────────────┐         ┌──────────────────────┐        ┌──────────────────────┐
│                        │         │                      │        │                      │
│ C:\github\my_module\   │  push   │  Platform Services   │pipeline│  Platform Services   │
│ ├── module.json        │────────►│  Module Loader       │───────►│  Module Loader       │
│ ├── api.py             │         │  Agent Executor      │        │  Agent Executor      │
│ ├── service.py         │  logs   │  Tool Registry       │        │  Tool Registry       │
│ ├── ui/                │◄────────│  Bridge Manager      │        │  Bridge Manager      │
│ ├── ai/                │         │                      │        │                      │
│ └── tests/             │         │  Dev Tenants (test)  │        │  Live Tenants        │
│                        │         │  Dev Database        │        │  Tenant Databases    │
│ Coding Agent:          │         │  All Prod Modules    │        │                      │
│ Claude Code / Cursor   │         │  Your Module (live)  │        │  Published Modules   │
│ reads dev.log          │         │                      │        │                      │
└────────────────────────┘         └──────────────────────┘        └──────────────────────┘
           ▲                                   ▲
           │ Agent writes code,                │ Prod → Dev sync
           │ reads logs, iterates              │ (on demand)
           │ autonomously                      │

The Developer CLI

The Platform CLI has a dedicated dev command group for module development. These commands are what the coding agent uses to push, test, and debug modules against the Development Platform.

Connection & Authentication

# Authenticate with the Development Platform (one-time setup)
$ platform dev login
  → Opens browser for OAuth
  → Stores dev platform credentials in OS keyring
  → Authenticated as paulson@datavisions.ai on dev.platform.dematic.com

# Verify connection
$ platform dev status
  Platform:    dev.platform.dematic.com
  User:        paulson@datavisions.ai
  Dev Tenant:  dev-paulson (auto-provisioned)
  Modules:     23 loaded (from production sync)
  Status:      Connected

Module Push & Reload

# Push module code from local to Development Platform
$ platform dev push ./my_module/

  Syncing my_module → dev.platform.dematic.com...
  ├── Uploading files (14 changed)            0.8s
  ├── Installing requirements.txt             2.1s
  ├── Installing ui/package.json              3.4s
  ├── Building frontend (TypeScript + CSS)    1.2s
  ├── Validating module.json                  0.3s
  ├── Reloading module in platform            0.9s
  ├── Running module tests                    4.1s (12 passed, 0 failed)
  └── Module mounted and ready              

  WMS_MODULE v0.1.0 running on dev.platform.dematic.com
  UI: https://dev-paulson.dev.platform.dematic.com → WMS Module
  API: https://dev.platform.dematic.com/api/v1/wms/
  Total push time: 8.8s

# Fast push (skip tests, skip frontend rebuild — for rapid iteration)
$ platform dev push ./my_module/ --fast
  Fast push: backend reload only — 1.2s

# Watch mode — auto-push on file changes
$ platform dev watch ./my_module/
  Watching ./my_module/ for changes...
  [14:23:01] service.py changed → pushing... ✓ reloaded in 1.1s
  [14:23:45] ui/pages/OrderDashboard.tsx changed → pushing... ✓ rebuilt in 2.3s
  [14:24:12] module.json changed → pushing... ✓ full reload in 4.2s

Log Streaming

# Stream all logs from the Development Platform (real-time)
$ platform dev logs

  [platform]  Module WMS_MODULE loaded, 3 tools registered, 1 skill activated
  [wms:api]   GET /api/v1/wms/orders → 200 (12ms)
  [wms:api]   POST /api/v1/wms/orders → 201 (34ms)
  [wms:tool]  create_pick_order called by agent, pre-conditions ✓
  [agent]    WMS agent iteration 2: tool call create_pick_order
  [wms:ui]   TypeScript compiled, 0 errors

# Filter to just your module
$ platform dev logs --module WMS_MODULE

# Filter by layer
$ platform dev logs --layer api         # API requests only
$ platform dev logs --layer agent       # Agent execution traces
$ platform dev logs --layer tools       # Tool calls and guardrails
$ platform dev logs --layer ui          # Frontend build and render errors
$ platform dev logs --layer platform    # Module loader, tool registry, services

# Write to file — coding agent reads this for autonomous debugging
$ platform dev logs --output ./dev.log
  Streaming all logs to ./dev.log (Ctrl+C to stop)

# Errors only — the agent checks this after every push
$ platform dev logs --errors --output ./errors.log
🚨
This is what makes coding agents autonomous. The agent pushes code, then reads dev.log or errors.log. If there are errors, it reads the traceback, understands the issue, fixes the code, and pushes again. No human in the loop for bug fixing. The agent can run this cycle hundreds of times — push, check logs, fix, push — until the module works. The Development Platform gives it the same feedback a developer would get from looking at a terminal.
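
A minimal sketch of that loop as a driver script. The CLI commands are the ones shown above; fix_errors() stands in for the coding agent's edit step and is hypothetical:

import pathlib
import subprocess

ERRORS = pathlib.Path("errors.log")   # produced by: platform dev logs --errors --output ./errors.log

def push() -> bool:
    """Fast push to the Development Platform; returns True on a clean reload."""
    result = subprocess.run(
        ["platform", "dev", "push", "./my_module/", "--fast"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def fix_errors(log_text: str) -> None:
    """Placeholder: the coding agent reads the traceback and edits the module files."""

for attempt in range(1, 101):          # the agent can run this cycle many times
    ok = push()
    errors = ERRORS.read_text() if ERRORS.exists() else ""
    if ok and not errors.strip():
        print(f"Module healthy after {attempt} push(es)")
        break
    fix_errors(errors)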

Discovery — What's Available

# See all platform tools the module can bind
$ platform dev tools list
  dispatch_agents, storage_upload, storage_get_url, generate_embedding,
  schedule_task, send_notification, request_user_input, get_tool_instructions

# See available data connectors the module can use
$ platform dev connectors list
  CONNECTOR             TYPE        STATUS
  ──────────────────    ────────    ──────
  BigQueryConnector     analytics   available
  RedshiftConnector     analytics   available
  S3Connector           storage     available
  GCSConnector          storage     available
  SynapseConnector      analytics   available

# See other modules and their published services
$ platform dev modules list
  MODULE              VERSION   SERVICES                          STATUS
  ──────────────────  ────────  ──────────────────────────────    ──────
  SCADA_MONITOR       1.2.0     get_active_alarms, get_history   active
  MAINTENANCE         1.1.0     create_work_order, get_techs     active
  ETL_PIPELINE        1.0.0     run_pipeline, get_status         active
  ENGINEERING_CONFIG  1.0.0     get_equipment_config              active

# Inspect a specific module's services (for cross-module integration)
$ platform dev modules services MAINTENANCE
  SERVICE               PERMISSION              DESCRIPTION
  ────────────────────  ──────────────────────  ──────────────────────
  create_work_order     MAINTENANCE.CREATE      Create a new work order
  get_technicians       MAINTENANCE.READ        List available technicians
  assign_technician     MAINTENANCE.ASSIGN      Assign tech to work order

# Inspect a specific connector (see configuration requirements)
$ platform dev connectors inspect BigQueryConnector
  Type:         analytics
  Requires:     project_id, dataset_id, service_account_key
  Capabilities: query, insert, aggregate, vector_search (VECTOR_SEARCH)
  Config via:   module.json connectors section + tenant settings in Admin Console

Testing

# Run module tests against the Development Platform
$ platform dev test ./my_module/
  Running 18 tests against dev.platform.dematic.com...
  ├── test_api.py::test_create_order           PASSED  (23ms)
  ├── test_api.py::test_list_orders            PASSED  (15ms)
  ├── test_service.py::test_pick_optimization  PASSED  (45ms)
  ├── test_service.py::test_slot_allocation    PASSED  (31ms)
  ├── test_tools.py::test_create_pick_order    PASSED  (67ms)
  └── ... 13 more passed
  18 passed, 0 failed, 0 skipped (1.8s)

# Run a specific test
$ platform dev test ./my_module/ -k test_create_order

# Test with verbose output (useful for agent debugging)
$ platform dev test ./my_module/ --verbose --output ./test-results.log

The Autonomous Development Loop

This is the workflow that makes 50 developers building modules in parallel actually work. Each developer (or their coding agent) operates independently against their own dev tenant on the shared Development Platform.

1

Scaffold or Clone

platform modules create wms_module generates a skeleton. Or platform modules clone SCADA_MONITOR ./my_wms copies an existing module as a starting point. The coding agent now has a working module with all SDK patterns demonstrated.

2

Agent Discovers the Environment

The agent reads ModuleSDKInstruct.md (the skills package). It runs platform dev tools list, platform dev connectors list, platform dev modules list to see what's available. It now knows the full landscape: platform tools, data connectors, other modules' services.

3

Build

The agent writes code: manifest, service logic, API endpoints, tools, UI, tests. It follows the patterns from the skeleton and the skills package. Each file goes in the right place because the structure is predictable.

4

Push & Validate

platform dev push ./my_module/ syncs code to the Development Platform. The platform installs dependencies, validates the manifest, rebuilds the frontend, reloads the module, and runs tests. Total time: under 10 seconds for a typical push.

5

Check Logs & Fix

The agent reads dev.log or errors.log. If there are errors — import failures, test failures, API errors, TypeScript compile errors, agent tool guardrail failures — the agent reads the traceback, fixes the code, and pushes again. This cycle repeats autonomously until the module works.

6

Iterate

The developer or agent adds features, tests each change against the dev platform, and builds up the module incrementally. platform dev watch mode auto-pushes on every file save for the fastest possible iteration cycle.

7

Package & Submit

When the module is ready: platform modules validate ./my_module/ runs the full pipeline validation locally. platform modules package ./my_module/ creates the .mpkg. platform modules upload my_module-1.0.0.mpkg submits to the Module Pipeline for promotion to production.

ModuleSDKInstruct.md — The Agent Skills Package

ModuleSDKInstruct.md is a SKILL.md file that coding agents (Claude Code, Cursor, or any compatible agent) load when building platform modules. It teaches the agent everything it needs to know about the SDK, the CLI, and the development workflow. This is the single file that makes the difference between an agent that guesses and an agent that builds correctly on the first try.

Section | What It Teaches the Agent
Module Structure Required vs. conditional vs. convention files. Where each type of code goes. What the Module Loader expects.
module.json Schema Every field: required vs. optional, valid values, examples. How to declare data tables, permissions, AI config, UI entry, bridge tools, connectors, settings.
Backend Patterns TenantDataConnector usage (query, insert, update, vector_search). Service layer patterns. @platform_tool decorator with all security fields. Pydantic schema conventions.
Frontend Patterns SDK hooks (usePermission, useApi, useTenant). CSS scoping rules (CSS Modules, platform variables). TypeScript strict mode. PermissionGate component.
AI Patterns Skill package structure (SKILL.md format). Tool instructions (summary vs. full). Agent config in manifest. Context providers.
CLI Commands Every platform dev command: push, logs, test, watch, tools list, connectors list, modules list. Authentication. Connection status.
Development Workflow The autonomous loop: push → validate → check logs → fix → push. How to read dev.log. How to interpret error traces. When to use --fast vs. full push.
Discovery How to find platform tools, data connectors, other modules' services. How to inspect tool parameters and connector requirements. How to use the Service Registry for cross-module integration.
Testing How to write tests with TestTenantDataConnector and TestToolContext. How to run tests against the dev platform. How to read test output.
Common Tasks Step-by-step procedures: add a data table, add a permission, add an API endpoint, add a tool, add a UI view, add a data connector, wire up a cross-module service call.
One file, complete agent knowledge. The coding agent loads ModuleSDKInstruct.md at the start of every module development session. It now knows: the structure, the patterns, the CLI, the workflow, the available tools and connectors, how to test, and how to debug. It doesn't need to read the full architecture document. It doesn't need to ask the developer "where should this go?" The skill file is the compressed, actionable version of the SDK — designed for agents, not for humans reading documentation.

Module Cloning

Starting from a skeleton is good. Starting from a working module that does something similar is better. The CLI supports cloning any existing module as a starting point — with a new module code, name, and version.

# Clone an existing module as a starting point
$ platform modules clone SCADA_MONITOR ./my_wms_module --code WMS_MODULE --name "WMS Order Management"

  Cloned SCADA_MONITOR → ./my_wms_module/
  ├── module.json updated: code=WMS_MODULE, name="WMS Order Management", version=0.1.0
  ├── api_base_path updated: /api/v1/wms
  ├── Permission codes updated: WMS.* (was SCADA.*)
  └── Ready to customize

# Clone from the module catalog (production modules)
$ platform modules clone MAINTENANCE ./my_cmms --code CMMS --name "CMMS Integration"

# Clone with only specific components
$ platform modules clone SCADA_MONITOR ./my_module --code MY_MODULE --include api,service,ui
  # Clones only the backend and UI, skips ai/ and bridge/
💡
Clone, then customize. The coding agent clones a module that's architecturally similar to what it's building, then adapts it. The cloned module has real patterns — real service functions, real API endpoints, real tools, real UI components. The agent studies these patterns and applies them to the new domain. This is faster and more reliable than building from a blank skeleton, especially for agents that learn by example.

Development Platform Sync

Direction | Command | When
Production → Development platform dev sync-from-prod Pull latest production modules, platform services, and configuration into the dev environment. Run after a production deployment to keep dev in sync. Does NOT overwrite dev tenant data or developer's in-progress modules.
Development → Production Module Pipeline only No direct push from dev to production. The module goes through the full pipeline: validate → test → stage → acceptance → promote. This is the only path to production.
Developer isolation Automatic Each developer gets an isolated dev tenant (e.g., dev-paulson) on the shared Development Platform. Your module changes don't affect other developers' tenants. Multiple developers can push different modules simultaneously.

What the Coding Agent Sees (End-to-End Example)

Scenario: "Build me a WMS module that sends orders to our G2P pick system"

# Agent scaffolds the module
$ platform modules create wms_module --code WMS --name "WMS Order Management"

# Agent discovers what's available
$ platform dev tools list
  → dispatch_agents, storage_upload, generate_embedding, ...

$ platform dev connectors list
  → BigQueryConnector, S3Connector, ...

$ platform dev modules services SCADA_MONITOR
  → get_active_alarms, read_multishuttle_status, ...

# Agent finds the G2P bridge tool
$ platform dev tools list --module SCADA_MONITOR --execution bridge
  → read_plc_register, write_plc_register, read_multishuttle_status

# Agent builds the module (writes files locally)
# ... module.json, service.py, api.py, ai/tools.py, ui/ ...

# Agent pushes to dev and starts log streaming
$ platform dev push ./wms_module/
$ platform dev logs --errors --output ./errors.log &

# Push succeeds — agent tests the API
$ curl -s https://dev.platform.dematic.com/api/v1/wms/orders | jq
  → {"orders": [], "count": 0}   ✓ API works

# Agent runs tests
$ platform dev test ./wms_module/
  → 12 passed, 0 failed

# Agent checks for any background errors
$ cat ./errors.log
  → (empty — no errors)

# Agent packages for pipeline submission
$ platform modules validate ./wms_module/
  → All checks passed
$ platform modules package ./wms_module/
  → wms_module-0.1.0.mpkg created
This is the whole point. Fifty developers, each with a coding agent, each building custom modules for different customers. They clone templates, discover available tools and connectors, push to the shared Development Platform, iterate against real platform services with real feedback, and submit through the pipeline when done. The platform is stable. The modules are independent. The agents are autonomous. Development is fast because the feedback loop is fast — push, check logs, fix, push. The ModuleSDKInstruct.md skill file ensures every agent knows the rules. The CLI ensures every agent has the tools. The Development Platform ensures every agent gets real feedback.

Platform UI — Workspace shell, module rendering, design standards, and UI security

15. Platform UI & Module Rendering

The platform has two distinct UI applications served from the same codebase but separated by URL, authentication scope, and purpose. Both share the same workspace shell, design system, and component patterns — but they serve different audiences and enforce different security boundaries.

UI Applications Map

Platform Application

admin.platform.dematic.com

  • Users: Dematic platform team only
  • Auth: Platform-level credentials, bypasses tenant RLS
  • Purpose: Manage all tenants, modules, agents, and platform health
  • Agent Designer — create, test, and publish agents to tenants
  • Module Pipeline — validate, stage, promote modules
  • Model Registry — manage LLM providers and models
  • Full observability across all tenants

Tenant Application

{slug}.app.dematic.com

  • Users: Customer operators, managers, engineers, admins
  • Auth: Tenant-scoped JWT, full RLS enforced
  • Purpose: Use modules, interact with agents, manage tenant
  • Module panels — each subscribed module renders in a panel
  • Agent chat — interact with module agents
  • Tenant admin — users, roles, settings (Admin role only)
  • Screens gated by role-based permissions

Platform Application — Screen Inventory

The Platform Application is Dematic's internal command center. This is where the platform team manages everything — tenants, modules, agents, models, and system health. Agents designed here are published to tenants with tenant isolation enforced.

Screen | Purpose | Key Actions
Dashboard Platform health overview: active tenants, module subscriptions, agent runs, bridge connections, system alerts View metrics, drill into alerts
Tenant Management Create, configure, suspend tenants. View all tenant details, subscriptions, users, bridge status Create tenant, assign slug, configure SSO/SAML, manage subscriptions, override tenant settings, view tenant database status
Module Catalog All modules on the platform: status, version, subscribers, health metrics View module details, inspect tools/skills/permissions, manage versions, deprecate
Module Pipeline Upload, validate, stage, and promote modules. Validation reports and test results Upload .mpkg, view validation steps, promote to production, reject with notes
Module Configuration Platform-level settings for each module: defaults, constraints, locked fields Edit platform settings, set allowed ranges for tenant-configurable fields, lock settings
Agent Designer Create, configure, test, and publish agents. Full agent composition: persona, skills, tools, context, memory, model selection Design agents, bind skills and tools, configure guardrails, test in sandbox, preview assembled prompt, publish to tenants. Published agents enforce tenant isolation — an agent designed in the Platform Application runs within the tenant's security boundary when deployed.
Agent Observability Cross-tenant agent traces: execution logs, tool calls, guardrail results, token usage, failover events Search traces by tenant/module/agent, view full execution history, analyze cost patterns, identify failing guardrails
Model Registry Manage LLM providers, models, and lifecycle. Commission, sunset, configure failover chains Add/remove providers, commission models, set defaults, configure per-model pricing, track token usage by provider
Bridge Dashboard All bridges across all tenants: connection status, health, tool sync, resource usage View bridge health, inspect tool definitions, trigger resync, view pipeline status
Audit Log Cross-tenant immutable audit trail Search by actor, module, action, tenant, date range. Export for compliance.
Cost Dashboard Token usage and LLM cost per tenant, per module, per agent, per model View cost breakdowns, identify optimization opportunities, set cost alerts
Platform Settings Global platform configuration: default models, embedding model, notification settings, security policies Configure platform-wide defaults
🛡
Agent Designer publishes to tenants with isolation enforced. An agent designed in the Platform Application is a template. When a tenant subscribes to the module that contains the agent, the platform creates a tenant-scoped instance. The agent runs with the tenant's data, the tenant's bridges, the tenant's permissions. The Platform team designs the agent's capabilities. The platform enforces the boundaries. An agent published from the Platform Application can never access data outside the tenant it's deployed to — the same tenant isolation guarantees from Section 12 apply.

Tenant Application — Screen Inventory

The Tenant Application is what customers use daily. It renders subscribed modules in panels, provides agent chat, and gives tenant admins tools to manage their users, roles, and module settings. Every screen is gated by the user's role and permissions — users only see what they're authorized to access.

Screen | Visible To | Purpose

Module Workspace (All Authenticated Users)

Screen | Permission Gate | Purpose
Module Panels Module's min_permission from manifest Each subscribed module renders in a panel. Sidebar shows only modules the user has at least one permission for. Module content gated internally by PermissionGate components.
Agent Chat Any authenticated user Right panel / overlay. Chat with module agents. Agent operates within the user's permission ceiling. Available from any module panel.
Notifications Any authenticated user Platform and module notifications: model changes, maintenance windows, module updates, bridge alerts. Actionable items link to relevant screens.
User Profile Any authenticated user Personal settings: name, email, notification preferences, SMS opt-in, password change, active sessions.

Tenant Administration (Admin Role Only)

Screen | Purpose | Key Actions
Module Catalog Browse and subscribe to available modules View module descriptions, permissions required, subscribe/unsubscribe. On subscribe: tables provisioned, permissions auto-mapped to roles.
Module Settings Per-module configuration forms (auto-generated from manifest tenant_settings) Adjust thresholds, toggle features, set frequencies. Constrained by platform-level limits.
User Management Invite, manage, and deactivate users within the tenant Invite users (email or bulk CSV), assign roles, view activity, reset passwords, deactivate accounts, configure SSO mappings.
Role Management Configure roles and permission grants View/edit standard templates (Admin, Manager, Operator, Viewer). Create custom roles. Add/remove permission grants per module. Delete custom roles.
Permission Browser Audit "who can do what" across all subscribed modules View all permissions grouped by module. See which roles grant each permission. Identify gaps or over-grants.
Bridge Management Register and monitor on-premises bridges Register new bridges, view health/heartbeat, download installer, configure site assignments, view tool sync status.
Audit Log Tenant-scoped audit trail View who changed what, when, through which interface. Filter by module, user, action, date. Export for compliance.
💡
The Tenant Application is module-driven. The workspace is mostly empty until modules are subscribed. Each module adds its nav item, its panel, its settings page, and its permissions to the tenant's experience. The platform shell provides the frame — modules fill it. A tenant with only Maintenance subscribed sees only the Maintenance panel and its admin settings. A tenant with SCADA, Maintenance, ETL, and Reporting sees all four. The tenant admin controls what's visible by managing subscriptions and role assignments.

Workspace Architecture

Both applications share the same workspace shell. The shell detects the context (platform vs. tenant) from the URL and authentication, then renders the appropriate sidebar navigation and security context.

The Developer Experience

A developer starts a new module. They load the Platform Module Development skills package into their coding agent (Claude Code, Cursor, or any compatible agent). The agent knows the SDK standards — directory structure, manifest schema, CSS scoping rules, TypeScript interfaces, permission hooks, and the component patterns that render correctly inside the platform shell. The developer describes what they want to build. The agent builds it correctly because the skill tells it how.

The module goes through the Module Pipeline — tested, validated against SDK standards (CSS scoping, TypeScript strict, security hooks, responsive breakpoints), then published to the module catalog. Tenant admins subscribe. The module appears in their workspace.

Workspace Layout

The workspace has three zones: Sidebar (navigation), Panel Carousel (multiple module panels open simultaneously), and Action Panel (shared right sidebar for documents, viewers, and artifacts). This model is proven — it's the same architecture used in Chitty Workspace, adapted for multi-tenant module rendering.

Workspace layout mock (three zones):
  • Sidebar: Dashboard, SCADA, Maint., ETL, Reports · ADMIN: Users, Roles, Settings · AGENT: Chat
  • Panel Carousel (multiple open): SCADA Monitor — Alarm Dashboard (focused), Work Orders — 3 open WOs, Agent Chat — SCADA agent. The module's React code renders in the focused panel with full control over content, its own dependencies, SDK hooks, and platform CSS variables (e.g., Active Alarms, Sensor History, Equipment Status views; "View 3D Model" opens in the Action Panel).
  • Action Panel: 3D Equipment Viewer (Three.js STEP viewer, Multishuttle Aisle A3), opened by SCADA Monitor. Also available: document viewer, report preview, agent artifacts.
Responsive: desktop (sidebar + carousel + action panel) · tablet (sidebar collapses, carousel + action overlay) · mobile (bottom nav, full-screen panels).

Panel Carousel

Multiple panels can be open simultaneously. The user clicks modules in the sidebar to open them — each opens as a new panel in the carousel. Panels slide horizontally with smooth transitions. The focused panel is full-size; background panels are previewed at reduced scale (visible but not interactive). Users navigate between open panels via tabs at the top or left/right arrows.

Behavior | Detail
Open a panel Click a module in the sidebar. If already open, focus it. If not, mount the module's entry component in a new panel and animate it into focus.
Multiple panels Users can have SCADA Monitor, Work Orders, and Agent Chat all open at once. Each panel maintains its own state independently. Switching between them is instant — no remounting.
Focus & preview The active panel is full-size. Adjacent panels are scaled down (opacity 0.7, scale 0.96) and visible as previews. Click a preview to bring it into focus.
Close a panel Close button on the panel tab. Module unmounts, state cleaned up. Adjacent panels animate to fill the space.
Agent Chat as a panel Agent Chat is a panel like any other — it can be open alongside module panels. The chat panel is context-aware: it knows which module panel is focused and scopes the agent conversation accordingly.

Action Panel

The Action Panel is a shared right sidebar that any module can push content to. It's designed for viewing documents, artifacts, 3D models, reports, images, and any content that complements the main module panel. Only one module can use the Action Panel at a time — opening new content replaces the current content.

Behavior | Detail
Module opens content A module calls the SDK hook useActionPanel().open({ type: 'viewer', component: StepFileViewer, props: { fileId: '...' } }). The Action Panel slides open with the module's content rendered inside.
One owner at a time If Module A has content in the Action Panel and Module B requests it, Module A's content closes and Module B's content opens. The Action Panel shows which module currently owns it.
Content types Any React component: document viewer, Three.js 3D model viewer (STEP files), image gallery, report preview, data inspector, agent artifacts, form builder preview. The module ships the viewer component — the Action Panel is just the container.
Resizable Drag handle on the left edge. Users can resize the Action Panel from 30% to 55% of the viewport. The carousel panels resize to accommodate.
Closeable Close button or Escape key. The Action Panel collapses and the carousel expands to fill the space. Module content is unmounted.
Responsive Desktop: fixed right sidebar. Tablet: overlay that slides in from the right. Mobile: full-screen overlay.

Action Panel SDK Hook

import { useActionPanel } from '@platform/ui-sdk';
import { StepFileViewer } from './components/StepFileViewer';

function EquipmentCard({ equipment }) {
  const actionPanel = useActionPanel();

  return (
    <Card>
      <h4>{equipment.name}</h4>
      <Button onClick={() => actionPanel.open({
        title: `3D Model — ${equipment.name}`,
        component: StepFileViewer,
        props: { fileUrl: equipment.stepFileUrl },
      })}>
        View 3D Model
      </Button>
    </Card>
  );
}
💡
The Action Panel is the extension point. Modules ship their own viewer components (3D model viewers, document renderers, chart inspectors) and open them in the Action Panel when needed. The platform provides the container, resize behavior, and lifecycle management. The module provides the content. A SCADA module might open a Three.js equipment model. A maintenance module might open a work order PDF. A reporting module might open an interactive chart. Same panel, different content, driven by whatever module the user is working in.

What the Shell Provides

Shell Responsibility | Detail
Authentication JWT management, session state, token refresh. The shell authenticates the user before any module code runs. Modules never handle auth.
Navigation Sidebar assembled from module manifests. Each module declares nav_label, nav_icon, and nav_group in its manifest. The shell builds the sidebar, respects permissions (hides modules the user can't access), and handles routing.
Panel Lifecycle Opens, closes, and manages module panels. When a user clicks a module in the sidebar, the shell mounts the module's entry component inside a panel container. The shell handles panel transitions, focus state, and cleanup on close.
Security Context Injects tenant context, user identity, and permission set into every module panel via React context. Modules read permissions through SDK hooks — they never fetch auth state directly.
Responsive Layout The shell manages breakpoints and layout shifts (desktop/tablet/mobile). Module panels receive the available width — the module's own components handle their internal responsive behavior using the platform's CSS custom properties and breakpoint hooks.
Agent Chat The agent chat panel is always available — as a right sidebar on desktop or a full-screen overlay on mobile. The chat panel is a platform feature, not a module. Module agents are accessible through it based on the current module context.
Theming CSS custom properties for the Dematic design language (colors, typography, spacing, border-radius). Modules that use these variables automatically match the platform theme. Dark mode support built in.
CSP & Isolation Content Security Policy headers prevent module code from accessing other modules' DOM, modifying the shell, or loading external scripts. The shell owns the security boundary.

Module Panel Rendering

When a user opens a module, the shell mounts the module's entry component inside a panel container. The module receives platform context (tenant, user, permissions) and renders its own UI. The module controls everything inside the panel — routing, state management, data fetching, component structure.

1

User Clicks Module in Sidebar

Shell checks: Does this user have at least one permission for this module? If not, the nav item is hidden (they never see it). If yes, the shell opens a panel.

2

Shell Mounts Module Entry Component

The module's ui/index.tsx is mounted inside the panel container. The shell wraps it with platform context providers (tenant, user, permissions, theme, API client).

3

Module Renders

The module's React code takes over. It reads permissions via SDK hooks, fetches data via the platform API client (pre-scoped to the tenant), and renders its views. Internal navigation (tabs, sub-pages, forms) is the module's responsibility.

4

User Closes Module or Navigates Away

Shell unmounts the module component, cleans up. Module state is preserved per session if the module uses the platform's panel state API.

Module UI in the Manifest

The manifest declares a single UI entry point. The platform opens it in a panel. Internal navigation is the module's responsibility.

// In module.json — ui section
"ui": {
  "entry": "ui/index.tsx",              // Module's React entry point
  "nav_label": "SCADA Monitor",         // Sidebar label
  "nav_icon": "activity",              // Icon from platform icon set
  "nav_group": "Operations",           // Sidebar group
  "min_permission": "SCADA.READ_SENSORS" // Minimum permission to see nav item
}
💡
Keep it simple. One entry point. One panel. The module handles everything inside. This means a simple module can render a single table, and a complex module can render a full multi-tab application with its own router — same manifest, same shell integration, same security context.

16. Module UI SDK

The Module UI SDK is what a module developer imports to integrate with the platform. It provides hooks for reading platform context (permissions, tenant, user), an API client pre-scoped to the tenant, and CSS custom properties for theming. It does not restrict what the module builds — it provides the integration points.

Module UI Directory Structure

app/modules/scada_monitor/
├── module.json
├── api.py                     # Backend (existing)
├── service.py                 # Backend (existing)
├── schemas.py                 # Backend (existing)
├── ai/                        # Agent system (existing)
│   ├── skills/
│   └── tools.py
├── ui/                        # NEW: Module frontend
│   ├── index.tsx             # Entry point — mounted by the shell
│   ├── package.json          # Module's own frontend dependencies
│   ├── tsconfig.json         # TypeScript config (extends platform base)
│   ├── pages/                 # Module's internal views
│   │   ├── AlarmDashboard.tsx
│   │   ├── SensorHistory.tsx
│   │   └── Settings.tsx
│   ├── components/            # Module's own components
│   │   ├── AlarmTable.tsx
│   │   ├── SeverityBadge.tsx
│   │   └── PlcWriteForm.tsx
│   └── styles/                # Scoped CSS modules
│       ├── alarm-dashboard.module.css
│       └── sensor-history.module.css
├── requirements.txt           # Backend Python deps
└── tests/

Module's Own Dependencies

Each module has its own package.json for frontend dependencies and requirements.txt for backend dependencies. The module pipeline installs, builds, and validates both before publishing.

// ui/package.json — module's own frontend deps
{
  "name": "@modules/scada-monitor-ui",
  "dependencies": {
    "@platform/ui-sdk": "workspace:*",   // Platform SDK (required)
    "recharts": "^2.12.0",             // Module's choice: charting
    "@tanstack/react-table": "^8.0",   // Module's choice: data tables
    "date-fns": "^3.6.0"               // Module's choice: date handling
  }
}
💡
Use any library you want. The platform doesn't prescribe which charting library, table library, or form library you use. The only required dependency is @platform/ui-sdk which provides the hooks and CSS variables for platform integration. Everything else is the module developer's choice. The Module Pipeline validates that your dependencies install cleanly and your build succeeds before publishing.

SDK Hooks

Hook | Returns | Purpose
usePermission(code) boolean Check if the current user has a specific permission. Use to conditionally render buttons, forms, tabs.
usePermissions() Set<string> Get all permission codes for the current user in this tenant. Use for complex permission logic.
useTenant() {id, slug, name} Current tenant context. Use for display, scoping, or conditional behavior per tenant.
useUser() {id, name, email, role} Current user identity. Use for display, audit attribution, or user-specific views.
useApi() ApiClient Pre-authenticated, tenant-scoped HTTP client. Calls go to /api/v1/{module}/... with JWT automatically attached. No manual auth handling.
useBreakpoint() 'mobile' | 'tablet' | 'desktop' Current responsive breakpoint based on panel width (not viewport). Use for layout decisions within the module panel.
useModuleSettings() object Resolved module settings for this tenant (platform defaults merged with tenant overrides). Read-only.
useAgent(skillName?) {sendMessage, messages, isLoading} Open a conversation with the module's agent. Optionally activate a specific skill. Use for inline agent assistance within the module UI.
useActionPanel() {open, close, isOpen, owner} Open content in the shared Action Panel (right sidebar). Pass a React component + props. Only one module can own the Action Panel at a time — opening replaces current content. Use for document viewers, 3D models, report previews, artifacts.

Permission-Gated UI Components

import { usePermission, PermissionGate, useApi } from '@platform/ui-sdk';

function AlarmDashboard() {
  const api = useApi();
  const canAcknowledge = usePermission('SCADA.ACK_ALARMS');
  const [alarms, setAlarms] = useState([]);

  useEffect(() => {
    api.get('/scada/alarms').then(setAlarms);
  }, []);

  return (
    <div>
      <h2>Active Alarms</h2>
      <AlarmTable data={alarms}>
        {/* This column only renders if user has ACK permission */}
        <PermissionGate permission="SCADA.ACK_ALARMS">
          <ActionColumn
            label="Acknowledge"
            onClick={(alarm) => api.post(`/scada/alarms/${alarm.id}/ack`)}
          />
        </PermissionGate>
      </AlarmTable>

      {/* Settings tab only visible to users with CONFIGURE permission */}
      <PermissionGate permission="SCADA.CONFIGURE">
        <SettingsTab />
      </PermissionGate>
    </div>
  );
}
UI permissions are a UX convenience, not the security boundary. PermissionGate hides a button the user can't use — but even if someone hacks the frontend and sends the API request directly, the platform's @platform_tool decorator and API middleware enforce the same permission server-side. The UI hooks prevent confusion ("why doesn't this button work?"). The backend enforces security.

CSS Scoping & Design Standards

Modules write their own CSS but follow platform scoping rules to prevent leaking styles into the shell or other modules. The platform provides CSS custom properties for the Dematic design language — modules that use these variables automatically match the platform theme.

Standard | Rule | Why
CSS Modules All module styles must use CSS Modules (.module.css) or CSS-in-JS scoping. No global CSS allowed. Prevents modules from leaking styles into the shell or other modules. The Module Pipeline rejects global style declarations.
Platform CSS Variables Use var(--platform-*) for colors, typography, spacing, and border-radius. The shell sets these on the panel container. Modules automatically match the platform theme (including dark mode) without hardcoding values. Consistent visual language across all modules.
TypeScript Strict Modules extend the platform's tsconfig.base.json with strict: true. No any types on SDK interfaces. Type safety at the integration boundary. The SDK's TypeScript interfaces catch permission typos, invalid API calls, and context misuse at compile time.
Responsive Modules must render correctly at all three breakpoints: desktop (>1024px), tablet (768-1024px), mobile (<768px). Use useBreakpoint() or the platform's responsive CSS variables. Operators use tablets on the warehouse floor. Engineers use desktops. The platform shell handles layout shifts — the module handles its content reflow.
No Shell Modification Module code cannot modify the shell DOM, override CSP headers, or access other modules' state. Enforced by CSP and the build pipeline. Security isolation. The shell is the trust boundary.

Platform CSS Custom Properties

/* Available to all module CSS — set by the platform shell */
/* Colors (Dematic design language) */
--platform-color-primary:     #ffb517;   /* Primary accent (Dematic yellow) */
--platform-color-background:  #000;      /* Dark background */
--platform-color-surface:     #191919;   /* Card/panel background */
--platform-color-border:      #303030;   /* Borders */
--platform-color-text:        #d9d9d9;   /* Primary text */
--platform-color-text-muted:  #999;      /* Secondary text */
--platform-color-success:     #27AE60;
--platform-color-warning:     #ffb517;
--platform-color-danger:      #dc2626;
--platform-color-info:        #004665;   /* Dematic teal */

/* Typography */
--platform-font-family:       'Helvetica Neue', system-ui, sans-serif;
--platform-font-mono:         'JetBrains Mono', 'Consolas', monospace;
--platform-font-size-sm:      12px;
--platform-font-size-base:    14px;
--platform-font-size-lg:      16px;
--platform-font-size-xl:      20px;
--platform-font-size-2xl:     28px;

/* Spacing */
--platform-space-xs:  4px;
--platform-space-sm:  8px;
--platform-space-md:  16px;
--platform-space-lg:  24px;
--platform-space-xl:  32px;

/* Layout */
--platform-radius:    0px;       /* Angular/industrial (Dematic brand) */
--platform-panel-width: ...;     /* Current panel width (set by shell) */
--platform-breakpoint: ...;      /* 'mobile' | 'tablet' | 'desktop' */

Module Build & Publish Pipeline

1

Develop

Developer builds the module with their coding agent. The agent loads the Platform Module Development skills package — knows the SDK standards, CSS scoping rules, TypeScript interfaces, permission patterns, and manifest schema. The module runs locally against a dev platform instance.

2

Install Dependencies

Module Pipeline reads requirements.txt (backend) and ui/package.json (frontend). Installs and validates all dependencies. Rejects packages with known vulnerabilities. Locks versions for reproducible builds.

3

Build & Validate

TypeScript compiled with strict: true. CSS analyzed for global leaks. Static analysis checks: no dangerouslySetInnerHTML, no direct DOM manipulation outside module scope, no hardcoded colors (must use CSS variables), responsive breakpoints tested at all three widths. Backend validated: manifest schema, tool security policies, permission completeness.

4

Test

Module's own test suite runs (frontend + backend). Platform integration tests verify: module mounts in shell correctly, permission hooks return correct values, API client routes to correct tenant, CSS doesn't leak, responsive layout renders at all breakpoints.

5

Publish to Module Catalog

Module registered in module_registry. Available for tenant subscription in the Admin Console. Tenant admins browse the catalog, subscribe, and the module appears in their workspace sidebar.

Standards without restrictions. The platform doesn't tell you which charting library to use or how to structure your components. It tells you: scope your CSS with CSS Modules, use platform CSS variables for theming, use TypeScript strict mode, use SDK hooks for permissions and data access, and make it responsive. Inside those boundaries, build whatever you want. The pipeline validates the standards — not the implementation.

17. Roles, Permissions & UI Security

The permission system operates at two layers: server-side enforcement and UI gating. The platform defines standard role templates and enforces permissions on every API call. Modules define their own permission codes and map them to UI actions. Tenant admins customize roles by mixing permissions across modules. Security is always enforced server-side by the platform — UI gating is a usability layer on top.

Standard Role Templates

The platform ships four standard role templates. These are starting points — tenant admins can modify them or create entirely custom roles. When a tenant subscribes to a new module, the platform auto-maps the module's permissions to these templates using the module's default_role_mapping.

Role | Scope | Description
Admin All module permissions + tenant management Full access. Can manage users, roles, module subscriptions, tenant settings, and bridge configuration. Receives all module permissions by default.
Manager Read + write + approve per module Operational lead. Can view data, create/edit records, and approve actions (acknowledge alarms, approve work orders). Cannot manage users or tenant settings.
Operator Read + write per module Day-to-day user. Can view data and perform standard operations (create work orders, run ETL). Cannot approve critical actions or configure modules.
Viewer Read only per module View dashboards, reports, and data. Cannot create, edit, or perform any actions. Ideal for executives, auditors, and external stakeholders.

Module-Defined Permissions

Each module declares its permissions in the manifest with three pieces of information: the permission code (used in API enforcement and SDK hooks), a description (shown in the Admin Console when configuring roles), and the default role mapping (which standard roles get this permission by default when the tenant subscribes).

// In module.json — permissions section
"permissions": [
  {
    "code": "SCADA.READ_SENSORS",
    "description": "View sensor data, alarm list, and equipment status",
    "default_roles": ["admin", "manager", "operator", "viewer"]
  },
  {
    "code": "SCADA.ACK_ALARMS",
    "description": "Acknowledge and dismiss active alarms",
    "default_roles": ["admin", "manager"]
  },
  {
    "code": "SCADA.WRITE_PLC",
    "description": "Write values to PLC registers (controls physical equipment)",
    "default_roles": ["admin"]
  },
  {
    "code": "SCADA.CONFIGURE",
    "description": "Configure alarm thresholds, severity rules, and module settings",
    "default_roles": ["admin"]
  }
]

How Roles Get Assigned

1

Tenant Subscribes to Module

Platform reads the module's default_role_mapping. Automatically grants permissions to the tenant's standard roles. Admin gets all module permissions. Manager, Operator, Viewer get what the module recommends.

2

Tenant Admin Customizes (Optional)

In the Admin Console, the tenant admin can: add/remove permissions from existing roles, create entirely new custom roles (e.g., "Shift Supervisor" with specific permissions across SCADA + Maintenance), and assign custom roles to users. The standard templates are starting points — not locked.

3

Users Assigned to Roles

Users are assigned one or more roles within their tenant. A user can be "Operator" for SCADA and "Manager" for Maintenance — their effective permissions are the union of all assigned roles. Stored in user_role_assignments in the Platform DB.

4

Permissions Enforced at Every Layer

The resolved permission set is included in the JWT and injected into every request context. API middleware checks permissions on every endpoint. @platform_tool checks permissions on every tool call. UI SDK hooks read permissions for frontend gating. All three layers use the same permission codes — defined once in the module manifest.

Enforcement: Two Layers, One Permission Code

Layer | How It's Enforced | What Happens on Failure
API / Backend
Security boundary
Platform middleware checks the permission code on every API request. @platform_tool decorator checks on every tool call. This is the real security boundary — it cannot be bypassed from the frontend. 403 Forbidden — request rejected, audit logged
UI / Frontend
UX layer
Module code uses usePermission() and PermissionGate to hide UI elements the user can't use. Sidebar navigation hides modules the user has no permissions for. Buttons, tabs, and form actions gated by permission code. Element not rendered — user never sees what they can't do
Permission: SCADA.ACK_ALARMS
UI Layer (UX convenience):
  Sidebar → SCADA module visible (user has SCADA.READ_SENSORS)
  Alarm table → "Acknowledge" button hidden (user lacks SCADA.ACK_ALARMS)
  Settings tab → hidden (user lacks SCADA.CONFIGURE)
API Layer (security boundary):
  POST /api/v1/scada/alarms/{id}/ack → 403 Forbidden
  @platform_tool acknowledge_alarm → permission denied, error to agent
Both layers use the same permission code: SCADA.ACK_ALARMS. The module defines it once in module.json. The platform enforces it everywhere.
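
A minimal sketch of the backend half of that check, assuming a FastAPI-style dependency and a permission set decoded from the JWT into request state (the platform's actual middleware may differ):

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()

def require_permission(code: str):
    """Route dependency: reject the request with 403 if the permission is missing."""
    def checker(request: Request) -> None:
        permissions: set[str] = getattr(request.state, "permissions", set())
        if code not in permissions:
            raise HTTPException(status_code=403, detail=f"Missing permission: {code}")
    return checker

@app.post(
    "/api/v1/scada/alarms/{alarm_id}/ack",
    dependencies=[Depends(require_permission("SCADA.ACK_ALARMS"))],
)
async def acknowledge_alarm(alarm_id: str) -> dict:
    # Reached only if the same permission code the UI gates on passed server-side.
    return {"alarm_id": alarm_id, "status": "acknowledged"}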

Permission Hierarchy

The platform uses a flat additive permission model. Roles grant specific permission codes. A user's effective permissions are the union of all their assigned roles. There is no implicit inheritance — granting SCADA.WRITE_PLC does not automatically grant SCADA.READ_SENSORS. This is explicit and auditable.
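A minimal sketch of the union, assuming a role-to-grants mapping like the one stored in role_permissions:

# Illustrative role → permission grants (the real mapping lives in role_permissions).
ROLE_PERMISSIONS = {
    "viewer":   {"SCADA.READ_SENSORS"},
    "operator": {"SCADA.READ_SENSORS", "MAINTENANCE.CREATE"},
    "manager":  {"SCADA.READ_SENSORS", "SCADA.ACK_ALARMS", "MAINTENANCE.CREATE"},
}

def effective_permissions(user_roles: list[str]) -> set[str]:
    """Flat additive model: the union of every assigned role's explicit grants."""
    grants: set[str] = set()
    for role in user_roles:
        grants |= ROLE_PERMISSIONS.get(role, set())
    return grants

# A user who is Operator for one module and Manager for another gets the plain union.
assert "SCADA.ACK_ALARMS" in effective_permissions(["operator", "manager"])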

💡
Why flat over hierarchical: Hierarchical permissions ("Admin inherits everything from Manager") create hidden grants that are hard to audit. Flat permissions are explicit — every grant is visible in the role_permissions table. The default_roles mapping in the manifest handles the convenience ("Admin should get all module permissions") without hiding the individual grants. When a tenant admin reviews a role, they see exactly which permissions it has — no inheritance chains to trace.

Agent Permission Inheritance

When a user interacts with an agent, the agent operates with the user's permissions in the current tenant. The agent cannot do anything the user can't do. If the user has SCADA.READ_SENSORS but not SCADA.WRITE_PLC, the agent can read sensor data but cannot write to PLCs — even if the agent's skills and tools support PLC writing. Tool-level security policies (allowed_agents, requires_approval) add additional constraints on top of the user's permissions.

Constraint | Set By | Scope
User permissions Tenant admin via role assignments The ceiling — the agent cannot exceed the user's permissions
Tool security policy Module developer via @platform_tool Additional constraints: requires_approval, allowed_agents, risk_level
Effective permission Platform (intersection) User must have the permission AND the tool policy must allow the agent. Both must pass.
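A minimal sketch of that intersection (the policy fields mirror the @platform_tool declaration; the real check also evaluates requires_approval and risk_level):

def agent_may_call_tool(user_perms: set[str], tool_policy: dict, agent_id: str) -> bool:
    """Both gates must pass: the user's permission ceiling AND the tool's own policy."""
    has_permission = tool_policy["permission"] in user_perms
    allowed_agents = tool_policy.get("allowed_agents") or []
    agent_allowed = not allowed_agents or agent_id in allowed_agents
    return has_permission and agent_allowed

policy = {"permission": "SCADA.WRITE_PLC", "allowed_agents": ["scada_agent"]}
# Blocked by the user ceiling even though the agent itself is on the allow-list.
assert agent_may_call_tool({"SCADA.READ_SENSORS"}, policy, "scada_agent") is False
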
Defined by modules, enforced by platform. Module developers declare permissions and map them to default roles in the manifest. The platform enforces those permissions on every API call, every tool invocation, and every UI rendering decision. Tenant admins customize roles in the Admin Console. Users get exactly the access their roles grant — no more. Agents inherit the user's ceiling. One permission code, defined once, enforced everywhere.

Agent System — Platform-level AI infrastructure, model registry, and embedding

18. Agent Architecture

The Agent System is platform-level infrastructure. Every module can define agents with skills and tools. The platform provides the executor, model registry, embedding generation, memory, guardrails, streaming, failover, and dispatch. Agents always run in tenant context — they can only access the current tenant's data and bridges.

Agent Tenant Isolation — Enforced at Every Layer

When an agent session starts, the platform resolves the tenant from the user's JWT and locks the entire session to that tenant. Every component the agent interacts with is pre-scoped:

Component | How Tenant Is Enforced
ToolContext | Injected per tool call. Contains tenant_id, tenant-scoped bridge connection, and agent system interface. Immutable for the session.
TenantDataConnector | Pre-routed to the current tenant's dedicated PostgreSQL database. Other tenants' data stores are not addressable through this connector.
Bridge | WebSocket connection resolved to the current tenant's registered bridge. Bridge routing table maps tenant_id → bridge connection. No mechanism to address another tenant's bridge.
Platform DB (RLS) | SET app.current_tenant_id applied to every DB session. Row-Level Security policies physically prevent cross-tenant reads on config tables.
dispatch_agents | Dispatched child agents inherit the parent's tenant context. An agent in Tenant A dispatching to a MAINTENANCE agent creates a MAINTENANCE session in Tenant A — never Tenant B.
Memory & Recall | Recall events stored and retrieved per tenant_id + user_id. An agent cannot recall memories from another tenant's sessions.
Embeddings | context.agent_system.embed() tracks usage per tenant for billing. Vector searches via data.vector_search() run against the current tenant's tables only.
SSE Streaming | Agent events stream to the current tenant's user session. SSE channels are scoped by tenant_id + session_id.

There is no "admin mode" or "cross-tenant" flag available to agents or module code. Cross-tenant operations (analytics, billing rollups, platform maintenance) are restricted to platform admin CLI tools that use a separate, non-agent database connection. Module developers and agents operate entirely within the boundaries of a single tenant.
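
A minimal sketch of the RLS half of this, assuming an asyncpg-style connection pool (the policy SQL and helper are illustrative, not the platform's actual migration or session code):

import asyncpg   # assumed driver

# Applied once per config table at migration time (illustrative).
RLS_POLICY_SQL = """
ALTER TABLE module_settings ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON module_settings
    USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
"""

async def scoped_connection(pool: asyncpg.Pool, tenant_id: str) -> asyncpg.Connection:
    """Every query on the returned connection sees only the current tenant's rows."""
    conn = await pool.acquire()
    # set_config scopes the value to this session; RLS policies read it on every query.
    await conn.execute("SELECT set_config('app.current_tenant_id', $1, false)", tenant_id)
    return conn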

Model Registry & Provider Management

The Model Registry is the central catalog of all LLM providers and models available on the platform. Models are manually managed — commissioned when ready, sunset when replaced. The registry tracks capabilities, pricing, and lifecycle status for every model.

Registry Field | Purpose | Example
provider LLM provider identity anthropic, openai, google, xai
model_id Provider's model identifier claude-sonnet-4-6, gpt-4o, gemini-2.5-flash
model_type What the model does chat, embedding, image_gen, video_gen, audio_gen, stt
capabilities Feature flags tool_calling, vision, streaming, thinking, json_mode
context_window Max input tokens 200000, 128000, 1000000
max_output_tokens Max response tokens 8192, 16384
status Lifecycle state active, deprecated, sunset
replacement_model When sunset, points to the successor claude-sonnet-4-6 replaces claude-3-5-sonnet
pricing Cost per 1M input/output tokens {"input": 3.00, "output": 15.00}
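
An illustrative in-code shape for a registry record, mirroring the fields above (a sketch only; the real registry lives in the Platform DB):

from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    provider: str                     # "anthropic", "openai", "google", "xai"
    model_id: str                     # "claude-sonnet-4-6"
    model_type: str                   # "chat", "embedding", "image_gen", ...
    capabilities: set[str]            # {"tool_calling", "vision", "streaming"}
    context_window: int               # 200000
    max_output_tokens: int            # 8192
    status: str = "active"            # "active" | "deprecated" | "sunset"
    replacement_model: str | None = None
    pricing: dict = field(default_factory=dict)   # {"input": 3.00, "output": 15.00} per 1M tokens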

Provider Interface & Multi-Modal Adapters

Every LLM provider implements a base adapter with standard methods for each capability type. Module developers never call providers directly — the Agent Executor resolves the model from the registry and calls the provider through this interface. Each provider maps its API-specific features to these platform-level standards.

Base Adapter (Required for All Providers)

Method | Purpose
generate(messages, tools, config) → Response | Chat completion with optional tool calls. Handles streaming internally via chunked response. Returns: content, tool_calls, token_usage (from provider), model_id used.
embed(texts) → vectors[] | Generate vector embeddings. Returns float arrays. Model resolved from default_embedding_model in registry.
validate_api_key(key) → bool | Test if an API key is valid for this provider.
list_models() → Model[] | Discover available models from the provider API. Used to seed/update the Model Registry.
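
A sketch of that contract as an abstract base class (method names follow the table; the response and model shapes are assumptions for illustration):

from abc import ABC, abstractmethod

class BaseProviderAdapter(ABC):
    """Contract every provider adapter implements; the Agent Executor only calls these."""

    @abstractmethod
    async def generate(self, messages: list[dict], tools: list[dict] | None, config: dict) -> dict:
        """Chat completion with optional tool calls; returns content, tool_calls,
        token_usage (from the provider), and the model_id actually used."""

    @abstractmethod
    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Vector embeddings using the registry's default embedding model."""

    @abstractmethod
    async def validate_api_key(self, key: str) -> bool:
        """Check whether an API key is valid for this provider."""

    @abstractmethod
    async def list_models(self) -> list[dict]:
        """Discover models from the provider API to seed/update the Model Registry."""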

Media Adapters (Optional per Provider)

Providers expose media capabilities through standardized adapter interfaces. The Model Registry tracks which providers support which capabilities — if a provider doesn't support image generation, it doesn't implement the image adapter.

Adapter | Methods | Providers
Image Generation generate_image(prompt, size, quality, style) → image_url OpenAI (DALL-E), Google (Imagen), xAI (Grok Imagine)
Video Generation submit_video(prompt, config) → job_id
poll_video(job_id) → status, video_url
OpenAI (Sora), Google (Veo), xAI (Grok Video)
Text-to-Speech generate_speech(text, voice, model) → audio_bytes OpenAI (TTS-1/TTS-1-HD), Google (Cloud TTS)
Speech-to-Text transcribe(audio_bytes, language) → transcript OpenAI (Whisper), Google (Cloud STT)
Vision / Multimodal Input Handled via generate() — images, audio, video, and PDF passed as message content parts All chat models with vision capability flag in registry
💡
Provider-specific features map to platform standards. Each provider has its own API format (Anthropic uses content blocks, OpenAI uses messages, Google uses generateContent). The adapter translates these to the platform's standard response format. Module tools and agents never see provider-specific structures — they call generate() and get a consistent response regardless of which provider/model is underneath.

Token Usage Tracking

Token usage is tracked from the provider's response as the source of truth. The platform also runs its own token calculation alongside every call — not as a fallback, but as a validation signal. Both numbers are stored. When they diverge, it means either the platform's calculator needs calibration for that provider or the provider changed their tokenization.

Field | Source | Purpose
input_tokens Provider response Tokens consumed by the prompt. Source of truth for billing.
output_tokens Provider response Tokens generated in the response. Source of truth for billing.
cache_read_tokens Provider response (if supported) Tokens served from provider cache (Anthropic, Google). Reduced cost.
cache_write_tokens Provider response (if supported) Tokens written to provider cache for future requests.
calc_input_tokens Platform calculator Platform's own token estimate for the same input. Stored alongside provider value for validation.
calc_output_tokens Platform calculator Platform's own token estimate for the same output. Stored alongside provider value for validation.
drift_pct Computed Percentage difference between provider and platform calculation. Used to calibrate per-provider offsets and detect tokenization changes.
💡
Why both: Provider tokens are always the billing truth — you never override them. But running a parallel calculation gives you a calibration signal per provider. If Anthropic consistently reports 2% higher than your calculator, you store that offset. If the drift suddenly jumps to 15%, something changed — the provider updated their tokenizer, or your prompt structure shifted. The Admin Console surfaces drift alerts so the team can investigate and recalibrate.
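
A minimal sketch of the drift calculation (provider counts stay the billing truth; the platform value only feeds calibration):

def drift_pct(provider_tokens: int, calc_tokens: int) -> float:
    """Percentage gap between the provider-reported count and the platform's own estimate."""
    if provider_tokens == 0:
        return 0.0
    return round(abs(provider_tokens - calc_tokens) / provider_tokens * 100, 2)

usage_record = {
    "input_tokens": 10_240,         # from the provider response (billing truth)
    "calc_input_tokens": 10_035,    # platform's parallel estimate (validation signal)
}
usage_record["drift_pct"] = drift_pct(
    usage_record["input_tokens"], usage_record["calc_input_tokens"]
)   # → 2.0 (within a normal per-provider offset)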

Every token usage record is stored with: tenant_id, user_id, agent_id, module_code, provider, model_id, session_id, timestamp. This enables cost breakdowns by any dimension:

# CLI: Token usage by tenant
$ platform usage --by tenant --period 30d

TENANT      PROVIDER    MODEL                INPUT        OUTPUT       COST
──────────  ──────────  ───────────────────  ───────────  ───────────  ────────
Tenant A    anthropic   claude-sonnet-4-6    12,400,000   3,200,000   $52.80
Tenant A    openai      text-embedding-3     8,100,000    —            $0.81
Tenant B    anthropic   claude-sonnet-4-6    8,900,000    2,100,000   $37.65
Tenant C    anthropic   claude-haiku-4-5     2,300,000    890,000     $2.07

# CLI: Token usage by user within a tenant
$ platform usage --tenant tenant-a --by user --period 7d

USER              AGENT RUNS   INPUT        OUTPUT       COST
────────────────  ──────────   ───────────  ───────────  ────────
michael.guo       312          48,700,000   12,400,000   $198.42
tucker.watson     47           2,100,000    540,000      $8.73
matt.gann         31           1,400,000    380,000      $5.82
matt.haley        28           1,200,000    310,000      $4.91
sean.wang         44           1,900,000    490,000      $7.64
brett.webster     39           1,700,000    420,000      $6.88
sched:sensor_poll_5m   84      3,200,000    800,000      $13.10
sched:alarm_triage     36      1,600,000    400,000      $6.70

Agent Session Tracing

Every agent run produces a full execution trace — a complete record of what happened, what the LLM saw, what it decided, what tools it called, and what it returned. Traces are stored in the Platform DB and accessible via the Admin Console and CLI.

💡
Traces are training data. Every trace captures the full input/output pair: what the LLM received (assembled prompt, tools, context) and what it produced (reasoning, tool calls, final response). This is the raw material for future model evaluation, fine-tuning, and agent performance optimization. By capturing complete traces now — not just outcomes but the full decision chain — the platform builds a dataset that enables AI-driven analysis of agent behavior: which skills produce the best outcomes, which tool sequences are most efficient, where agents get stuck, and how different models perform on the same tasks. The trace data is structured specifically to support future training pair extraction and evaluation pipelines.
Trace Record · What It Captures
Session metadata Session ID, tenant, user, agent, module, model used (including failover), start/end time, total duration, iteration count, final status (completed/max_iterations/error)
Prompt snapshot The full assembled system prompt at session start: persona, skills loaded, tool instructions injected, security context, recall events. Exactly what the LLM saw.
Per-iteration log For each iteration of the agent loop:
LLM input: messages sent (with token count)
LLM output: raw response (text content, tool calls with arguments)
Model used: which provider/model (tracks failover swaps)
Token usage: input, output, cache (from provider)
Latency: time-to-first-token, total response time
Tool call log For each tool call within each iteration:
Tool name: which tool, scope (platform/module)
Input parameters: full arguments sent to the tool
Execution route: cloud or bridge
Output result: full response from the tool
Duration: execution time
Security checks: permission verified, approval gate triggered (yes/no), audit level applied
Dispatch log If dispatch_agents was called: which agents were dispatched, tasks sent, results received, sub-agent traces linked
Memory operations Compaction events (when triggered, tokens before/after), recall events loaded, todo updates
Guardrail events Pre/post condition checks, pass/fail results, failure actions taken (stop/pause/retry/skip)
# CLI: View a session trace
$ platform agents trace sess_8f3a2b1c

SESSION: sess_8f3a2b1c
  Tenant:     Tenant A
  User:       mike.rodriguez
  Agent:      SCADA_MONITOR
  Model:      anthropic/claude-sonnet-4-6
  Duration:   12.4s (4 iterations)
  Tokens:     input 14,200 | output 3,800 | cache_read 8,400
  Status:     completed

  Iteration 1:
    LLM → text: "Let me check the current status of Sorter 7..."
    LLM → tool_call: read_multishuttle_status(aisle_id="A3")
      Route:    bridge (Tenant A DC-1)
      Duration: 1.2s
      Result:   {shuttles: 12, utilization: 91%, faults: [{shuttle: 7, code: 5012}]}
      Audit:    standard

  Iteration 2:
    LLM → tool_call: query_sensor_history(zone_id="A3", hours=168)
      Route:    cloud (PostgreSQL: tenant_a_modules)
      Duration: 0.8s
      Result:   {readings: 2847, avg_motor_temp: 63.2, trend: "increasing"}

  Iteration 3:
    LLM → tool_call: dispatch_agents(agent="MAINTENANCE", task="Create WO...")
      Sub-trace: sess_8f3a2b1c_sub1 (MAINTENANCE agent, 2 iterations)
      Result:   {work_order: "WO-2026-0847", assigned: "next available"}

  Iteration 4:
    LLM → text: "Shuttle 7 in Aisle A3 shows elevated motor temperature..."
    Status: completed (response delivered to user)
Full observability: Every agent run is fully traceable — from the assembled prompt to every LLM call, every tool execution (with raw inputs and outputs), every dispatch chain, every token consumed. Token usage comes from the provider, not estimated. Traces are scoped by tenant, user, agent, and model — enabling debugging, cost analysis, and performance optimization at any level.
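
Because a trace stores the assembled prompt and every per-iteration input/output pair, turning traces into evaluation or training examples can be a simple walk over the stored structure. A sketch, assuming hypothetical trace field names:

# Hypothetical extraction of input/output pairs from a stored session trace
def extract_training_pairs(trace: dict) -> list[dict]:
    """Emit one example per agent-loop iteration: what the LLM saw and what it produced."""
    pairs = []
    for iteration in trace.get("iterations", []):
        pairs.append({
            "session_id": trace["session_id"],
            "model": iteration["model"],                   # tracks failover swaps
            "input_messages": iteration["llm_input"],      # exactly what the LLM received
            "output": iteration["llm_output"],             # text content + tool calls with arguments
            "tool_results": [call["output"] for call in iteration.get("tool_calls", [])],
        })
    return pairs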

Model Lifecycle Management

1

Commission

Platform admin adds a new model to the registry: provider, model_id, capabilities, context window, pricing. Status set to active. Available for agent configuration.

2

Assign

Module agents reference models by ID in their manifest (ai.agent.model) or through platform defaults. Multiple agents can use the same model.

3

Deprecate

When a model is being phased out, admin sets status to deprecated, assigns replacement_model, and sets a sunset_date (aligned with the provider's announced end-of-life). The platform starts a deprecation countdown — Admin Console, CLI, and tenant notifications all show the days remaining until the model is no longer available. Existing agents continue to work but the countdown creates urgency. No new agents can select deprecated models.

# Deprecate a model with a sunset date
$ platform models deprecate claude-3-5-sonnet \
    --replacement claude-sonnet-4-6 \
    --sunset-date 2026-06-15

claude-3-5-sonnet: active → deprecated
  Replacement:  claude-sonnet-4-6
  Sunset date:  June 15, 2026 (65 days remaining)
  Agents using: 6 (across 3 tenants)
  Critical:     2 (require clone & test)

Notifications sent to 3 tenant admins.
Countdown active in Admin Console + CLI.

# CLI always shows the countdown
$ platform models list

PROVIDER    MODEL                STATUS       SUNSET          AGENTS
──────────  ───────────────────  ───────────  ──────────────  ──────
anthropic   claude-sonnet-4-6    active       —               6
anthropic   claude-3-5-sonnet    deprecated   47 days left    6
openai      gpt-4o               active       —               2
google      gemini-2.5-flash     active       —               3
4

Migrate (before countdown reaches zero)

Admin triggers platform models swap before the sunset date. Non-critical agents are swapped in bulk across all tenants. Critical agents (critical_validation: true) are flagged — tenant admins clone, test with the new model, and approve individually. The countdown drives urgency — at 30 days, 14 days, and 7 days, reminder notifications are automatically sent to any tenant admin with agents still on the deprecated model.

5

Sunset (countdown reaches zero)

On the sunset date, the model status changes to sunset. Any agents still using the model are automatically swapped to the replacement — the platform will not allow agents to call a sunset model. The model is removed from all selection lists. Provider API credentials are never touched during model sunset — keys are per provider, not per model. Decommissioning or rotating a provider API key would crash every active model on that provider across all tenants. Model lifecycle management operates at the model level only, never at the credential level. Historical records retain the model ID and sunset date for audit. If a critical agent was not validated before the deadline, the auto-swap still happens but the agent is flagged for immediate review.
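
A minimal sketch of how the lifecycle status might be enforced when an agent's model reference is resolved at run time. Registry field names are illustrative:

# Hypothetical run-time model resolution against the registry (field names are illustrative)
from datetime import date


def resolve_model(agent_model_id: str, registry: dict, today: date) -> str:
    entry = registry[agent_model_id]
    if entry["status"] == "active":
        return agent_model_id
    if entry["status"] == "deprecated":
        days_left = (entry["sunset_date"] - today).days
        print(f"WARNING: {agent_model_id} sunsets in {days_left} days; "
              f"replacement is {entry['replacement_model']}")
        return agent_model_id                    # still callable until the sunset date
    # status == "sunset": the platform never calls a sunset model
    return entry["replacement_model"]            # auto-swap; agent flagged for review elsewhere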

Model Swap (Migration)

Over the past year, hundreds of model sunsets have occurred across all major providers — OpenAI, Anthropic, Google, xAI. Models are deprecated, replaced, and retired on a cadence that shows no sign of slowing down. New models ship monthly. Older models lose API support with 60-90 day notice windows. A platform running agents across 100 tenants cannot manually update agent configurations one by one — model swap is an operational necessity, not a convenience feature.

The model swap command migrates all agents across all tenants from a sunset model to its replacement in a single operation:

# Preview what would change (always dry-run first)
$ platform models swap claude-3-5-sonnet claude-sonnet-4-6 --dry-run

Model Swap Preview: claude-3-5-sonnet → claude-sonnet-4-6

TENANT      AGENT            USAGE           ACTION
──────────  ───────────────  ──────────────  ──────────────────
Tenant A    SCADA_MONITOR    primary model   → will update
Tenant A    MAINTENANCE      primary model   → will update
Tenant A    ETL_PIPELINE     failover [2]    → will update chain
Tenant B    SCADA_MONITOR    primary model   → will update
Tenant B    REPORTING        primary model   → will update
Tenant C    MAINTENANCE      primary model   → will update

Agents affected: 6 (across 3 tenants)
Failover chains updated: 2
Critical agents requiring validation: 2 (see below)

# Execute the swap
$ platform models swap claude-3-5-sonnet claude-sonnet-4-6

Swapped 6 agents across 3 tenants.
claude-3-5-sonnet status: deprecated → sunset

Critical agents flagged for validation:
  Tenant A / SCADA_MONITOR — flagged: critical_validation=true
  Tenant B / SCADA_MONITOR — flagged: critical_validation=true
  → Tenant admins notified. Clone & test available in Admin Console.

Critical Agent Validation (Clone & Test)

Not all agents are equal. A reporting agent that formats dashboards can tolerate a model swap with minimal risk. A SCADA agent that triages alarms and creates work orders for live production equipment must be validated before the new model goes live. The platform supports this through agent cloning and testing:

Step · Detail
1. Flag critical agents In the agent's manifest config, set "critical_validation": true. These agents are never auto-swapped — the swap command flags them and notifies the tenant admin instead of updating in place.
2. Clone for testing The Admin Console (or CLI) creates a clone of the flagged agent with the new model. The clone runs in a sandbox environment with the same skills, tools, context providers, and configuration — but against test data, not production.
3. Run validation tests The tenant admin (or automated test suite) runs the cloned agent through representative scenarios: alarm triage, equipment diagnostics, work order creation. Compare outputs against the original model's behavior.
4. Approve or adjust If the new model performs well, the admin approves the swap for the production agent. If not, they can adjust skill instructions, tool configurations, or context settings to tune for the new model before approving.
5. Promote Approved clone replaces the production agent's model. The original model reference is archived for audit. If the new model causes issues in production, rollback to the previous model is one CLI command.
# Clone a critical agent for model testing
$ platform agents clone SCADA_MONITOR --tenant tenant-a --model claude-sonnet-4-6

Cloned: SCADA_MONITOR_test_clone (tenant-a)
  Model: claude-sonnet-4-6 (new)
  Skills: alarm-triage, conveyor-diagnostics (same)
  Tools: 5 (same)
  Environment: sandbox (test data only)

# Run the cloned agent through test scenarios
$ platform agents test SCADA_MONITOR_test_clone --scenarios alarm_triage,diagnostics

Running 2 scenarios against SCADA_MONITOR_test_clone...

  alarm_triage:    PASSED — correctly classified 12/12 fault codes
  diagnostics:     PASSED — motor temp analysis matches baseline

# Approve the swap for production
$ platform agents promote-model SCADA_MONITOR --tenant tenant-a --model claude-sonnet-4-6

SCADA_MONITOR (tenant-a): model updated to claude-sonnet-4-6
Clone removed. Previous model archived for rollback.

# Rollback if needed
$ platform agents rollback-model SCADA_MONITOR --tenant tenant-a

SCADA_MONITOR (tenant-a): rolled back to claude-3-5-sonnet
The model lifecycle is operationalized. New model arrives → commission in registry. Old model deprecated → tenants auto-notified with countdown. Swap command migrates all agents across all tenants. Critical agents are flagged, cloned, and tested. Rollback is one command. All communication handled through Platform Notifications (Section 5).

Platform Defaults

Default · Purpose · Example
default_chat_model Primary model for agent execution anthropic/claude-sonnet-4-6
default_fast_model Cheap/fast model for utility tasks (todo failover, recall events, routing) anthropic/claude-haiku-4-5
default_embedding_model Default model for generating vector embeddings openai/text-embedding-3-small

Module agents can override the default by specifying a model in their manifest. If the specified model is sunset, the platform falls back to the default. Failover chains provide ordered alternatives when the primary model fails (rate limit, timeout, API error).
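
A minimal sketch of the failover behavior described above. The exception types and call signature are placeholders, not the platform's actual API:

# Hypothetical failover-chain walk (exception types and call signature are placeholders)
class RateLimitError(Exception): ...
class ProviderAPIError(Exception): ...


async def generate_with_failover(models: list[str], messages: list, llm_call) -> dict:
    """Try the primary model, then each ordered alternative on rate limit, timeout, or API error."""
    last_error = None
    for model_id in models:       # e.g. ["claude-sonnet-4-6", "gpt-4o", "gemini-2.5-flash"]
        try:
            return await llm_call(model_id, messages)
        except (RateLimitError, TimeoutError, ProviderAPIError) as exc:
            last_error = exc      # note the failure and fall through to the next model
    raise last_error              # chain exhausted: surface the last provider error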

Embedding Generation

What Are Embeddings?

An embedding is a way to turn text into numbers that capture its meaning. When you describe a fault as "motor overheating on conveyor belt 7," an embedding model converts that sentence into a list of numbers (a vector) — typically 1,536 numbers for modern models. These numbers encode the meaning of the text, not just the words. Two descriptions that mean similar things — "motor overheating on conveyor belt 7" and "belt drive thermal fault, line 7" — will produce vectors that are mathematically close together, even though they use completely different words.
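
"Mathematically close" is measured with cosine similarity. A toy illustration with made-up 4-dimensional vectors (real embeddings have roughly 1,536 dimensions):

# Toy cosine-similarity comparison between embedding vectors (values are made up)
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


v1 = [0.21, -0.08, 0.33, 0.12]    # "motor overheating on conveyor belt 7"
v2 = [0.19, -0.10, 0.30, 0.15]    # "belt drive thermal fault, line 7"
v3 = [-0.40, 0.51, -0.02, 0.07]   # "warehouse holiday schedule"

print(cosine_similarity(v1, v2))  # ~0.99: similar meaning, different words
print(cosine_similarity(v1, v3))  # much lower: unrelated meaning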

Why This Matters for Operations

Traditional search requires exact keyword matches. If your database contains "motor overcurrent fault" and an operator searches for "drive drawing too much power," traditional search finds nothing. Embedding-based search finds it instantly — because the meanings are similar. This is what powers:

  • Similar fault lookup: When a new alarm fires, the agent finds similar past alarms and their resolutions — even if the descriptions use different terminology
  • Knowledge search: Operators ask questions in plain language ("what usually causes the sorter to jam?") and the system finds relevant procedures, past incidents, and resolution history
  • Pattern detection: The platform can identify that faults across different zones share similar root causes by comparing their embedding vectors
  • Recommendation: "Tenants with similar equipment configurations typically use these alarm thresholds"

How It Works (Technical)

The Agent System provides embedding generation as a platform service. Modules never manage embedding models — the platform owns the model (configured in the Model Registry), handles API calls to the provider (OpenAI, Google, etc.), and returns the vector. The module's only job is to decide what text to embed and where to store the result.

1

Module Declares a Vector Column

In module.json, the module's data_tables section includes a column with "type": "vector" and a dimensions count matching the embedding model (e.g., 1536 for OpenAI text-embedding-3-small). The TenantDataConnector provisions this as a native vector column in the tenant's data store.

2

Module Tool Calls the Platform Embed API

When the module needs to create an embedding — indexing a new alarm, storing a procedure, or preparing a search query — it calls context.agent_system.embed(text). The platform resolves the configured embedding model from the Model Registry, calls the provider API, and returns a float array.

3

Module Stores the Vector

The module stores the vector alongside its data using data.insert() — same as any other column. The TenantDataConnector stores the vector in the tenant's PostgreSQL database using pgvector.

4

Module Searches by Similarity

To find similar records, the module embeds the search query, then calls data.vector_search() with the query vector. The TenantDataConnector executes the similarity search using pgvector's cosine distance operators in the tenant's PostgreSQL database and returns results ranked by similarity.
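
For illustration, the query data.vector_search() generates might look like the following. The SQL and connection handling are a sketch; the actual statement is built by the TenantDataConnector:

# Illustrative pgvector query behind data.vector_search() (SQL and wiring are a sketch)
async def similar_alarms(conn, query_vector: list[float], zone_id: str, limit: int = 5):
    vec_literal = "[" + ",".join(str(x) for x in query_vector) + "]"   # pgvector text format
    return await conn.fetch(
        """
        SELECT description, fault_code, resolution,
               1 - (embedding <=> $1::vector) AS similarity_score      -- <=> is cosine distance
        FROM alarm_embeddings
        WHERE zone_id = $2
        ORDER BY embedding <=> $1::vector
        LIMIT $3
        """,
        vec_literal, zone_id, limit,
    )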

Defining Embeddings in the Module Manifest

A module declares vector columns in its data_tables schema, just like any other column type. The platform handles provisioning, storage, and search capabilities per engine.

// In module.json — data_tables section
"data_tables": {
  "alarm_embeddings": {
    "columns": {
      "id":          {"type": "uuid", "primary_key": true},
      "description": {"type": "string"},                     // Original text
      "fault_code":  {"type": "integer"},
      "zone_id":     {"type": "string", "indexed": true},
      "resolution":  {"type": "string"},                     // How it was resolved
      "embedding":   {"type": "vector", "dimensions": 1536}, // The vector column
      "created_at":  {"type": "timestamp", "auto": true}
    }
  }
}
💡
The dimensions field must match the embedding model. The platform's default embedding model (openai/text-embedding-3-small) produces 1536-dimensional vectors. If the platform admin changes the default model to one with different dimensions, existing vector columns continue to work — the platform tracks which model was used to generate each vector and handles re-indexing when models change. New vectors use the current model; existing vectors are re-generated in the background during scheduled maintenance.

Indexing: Storing Embeddings

# Module tool that indexes an alarm for future similarity search
@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
    summary_instructions="Index an alarm pattern for future similarity search. Call after resolving an alarm to build the knowledge base.",
)
async def index_alarm_pattern(
    context: ToolContext,
    data: TenantDataConnector,
    alarm_description: str,
    fault_code: int,
    zone_id: str,
    resolution: str,
) -> dict:
    # Step 1: Platform generates the embedding (module never touches the model)
    vector = await context.agent_system.embed(alarm_description)

    # Step 2: Store in tenant's data store with the vector
    await data.insert("alarm_embeddings", {
        "description": alarm_description,
        "fault_code": fault_code,
        "zone_id": zone_id,
        "resolution": resolution,
        "embedding": vector,
    })
    return {"status": "indexed", "dimensions": len(vector)}

Searching: Finding Similar Records

# Module tool that searches for similar past alarms
@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
    summary_instructions="Search for similar past alarms and their resolutions. Use when diagnosing a fault to find relevant history.",
)
async def search_similar_alarms(
    context: ToolContext,
    data: TenantDataConnector,
    query_text: str,
    zone_id: str | None = None,
    limit: int = 5,
) -> dict:
    # Step 1: Embed the search query (same model as indexing)
    query_vector = await context.agent_system.embed(query_text)

    # Step 2: Vector similarity search in tenant's data store
    results = await data.vector_search(
        table="alarm_embeddings",
        vector_column="embedding",
        query_vector=query_vector,
        limit=limit,
        filters={"zone_id": zone_id} if zone_id else None,
        # Returns: description, fault_code, resolution, similarity_score
    )
    return {"similar_alarms": results, "count": len(results)}

What the Agent Sees

From the agent's perspective, embedding-powered search is just another tool call. The agent doesn't know or care about vectors, dimensions, or similarity algorithms — it calls search_similar_alarms with a text description and gets back relevant results. The complexity is entirely in the platform layer.

Agent investigating fault code 5012 on Shuttle 7:

Agent: "Let me check if we've seen this before."

Tool call: search_similar_alarms(
    query_text="motor overcurrent fault 5012 on multishuttle",
    zone_id="A3",
    limit=5
)

Platform (behind the scenes):
  1. embed("motor overcurrent fault 5012 on multishuttle")
     → [0.0234, -0.0891, 0.1456, ... 1,536 floats]
  2. vector_search(alarm_embeddings, query_vector, limit=5)
     → Finds 3 similar past alarms ranked by cosine similarity

Tool result → Agent:
  {similar_alarms: [
    {description: "Drive overcurrent on shuttle 7, aisle A3", fault_code: 5012,
     resolution: "Replaced drive belt — worn bearing caused excess load", similarity: 0.94},
    {description: "Motor stall on multishuttle bay A", fault_code: 5011,
     resolution: "Cleared jam at transfer point, recalibrated position sensor", similarity: 0.87},
    {description: "Overcurrent trip on conveyor drive motor", fault_code: 5015,
     resolution: "Reduced speed profile during peak load hours", similarity: 0.82}
  ]}

Agent: "I found 3 similar past incidents. The closest match (94% similarity) was the same fault on the same shuttle — it was resolved by replacing the drive belt due to a worn bearing..."

Platform Responsibilities vs. Module Responsibilities

Responsibility · Platform Owns This · Module Owns This
Which embedding model to use Model Registry default_embedding_model config. Platform admin can swap models without module changes.
Calling the embedding API context.agent_system.embed() handles provider API calls, retries, rate limiting, token tracking, and failover.
Vector storage engine TenantDataConnector provisions vector columns using pgvector in the tenant's PostgreSQL database.
Vector search implementation data.vector_search() translates to engine-native similarity search (cosine similarity, approximate nearest neighbor, etc.).
Model migration When the embedding model changes, the platform re-generates vectors in the background during scheduled maintenance. Modules don't need to do anything.
What text to embed Module decides which data to index — alarm descriptions, procedure text, equipment specs, user queries.
When to index Module decides the indexing trigger — on alarm resolution, on new procedure upload, on data import, via scheduled task.
Table schema Module declares vector columns in data_tables with the appropriate dimensions. Module decides what metadata columns to store alongside the vector.
Search logic Module builds the search tools: what to embed as the query, what filters to apply, how many results to return, how to present results to the agent.
Separation of concerns: The Agent System owns the embedding model (via the Model Registry). Module tools call context.agent_system.embed() to get vectors. The TenantDataConnector stores and searches vectors in the tenant's data store. Module developers never configure embedding models, manage API keys, handle rate limits, or write engine-specific vector queries — the platform handles all of it. The module focuses entirely on what to embed and when — the domain decisions that require business expertise.
💡
Embeddings are also available as a platform tool. Agents can call generate_embedding directly as a tool call without any module code — useful for ad-hoc similarity comparisons during agent reasoning. But for structured, persistent vector search (like the alarm knowledge base above), module tools provide the right abstraction: they combine embedding generation with data storage and search into domain-specific operations the agent can call by name.

Agent Types

Type · Scope · Example
Module Agent · One module's skills/tools, current tenant's data · SCADA Agent — alarm triage, diagnostics
Orchestrator Agent · Dispatches to module agents via dispatch_agents · Operations Agent — coordinates SCADA + Maintenance + ETL
Platform Agent · Cross-module, platform services · Dashboard Agent, Onboarding Agent

Agent Dispatch

User: "Sorter 7 fault code 4012 — what's happening?" Operations Agent (Orchestrator, tenant: Tenant A) │ ├─ dispatch → SCADA_MONITOR agent │ Tools: read_plc_register (bridge), query_sensor_history (tenant data) │ Result: "Belt tension at 42% (threshold 60%)" │ ├─ dispatch → MAINTENANCE agent │ Tools: create_work_order (tenant data), assign_technician (tenant data) │ Result: "WO-2026-4521 created, assigned Mike Rodriguez, ETA 30m" │ └─ Response: "Belt tension failure. Work order created and assigned." All queries hit Tenant A's PostgreSQL database. Never Tenant B's. Never Tenant C's.

Dispatch Security

  • Dispatch permissions declared in manifest: "dispatch_permissions": ["MAINTENANCE", "ETL_PIPELINE"]
  • Enforced by Service Registry, not by the LLM
  • All dispatched agents inherit the same tenant context — they cannot access other tenants

19. Skills & Prompt Assembly

Skills are packages, not single files. Each skill is a directory containing a SKILL.md file (metadata + instructions), plus optional scripts, reference documents, and assets. The platform follows the open Agent Skills specification — skills built for this platform are portable to any compatible agent product.

Skill Package Structure

ai/skills/alarm-triage/
├── SKILL.md              # Required: YAML frontmatter + instructions
├── scripts/              # Optional: executable code the agent can run
│   └── classify_alarm.py # Python script for alarm severity classification
├── references/           # Optional: additional docs loaded on demand
│   ├── fault-codes.md    # Complete fault code reference table
│   └── escalation-matrix.md # Who to escalate to by severity/zone
├── assets/               # Optional: templates, schemas, data files
│   └── alarm-report-template.json
└── LICENSE.txt           # Optional: license for the skill package

SKILL.md Format (Agent Skills Spec)

---
name: alarm-triage
description: Classify, prioritize, and respond to SCADA alarms based on fault codes,
  severity thresholds, and escalation rules. Use when alarms are reported, faults
  are detected, or equipment status changes unexpectedly.
license: Proprietary
compatibility: Requires SCADA_MONITOR module tools
metadata:
  author: dematic-engineering
  version: "1.2"
  tools_required: get_active_alarms get_alarm_history acknowledge_alarm
---

# Alarm Triage Procedure

When alarms are reported, follow this workflow:

## 1. Classification
- **Critical** (90+): Equipment failure imminent. Dispatch maintenance immediately.
- **Warning** (70-89): Degraded performance. Schedule within 4 hours.
- **Info** (below 70): Normal variance. Log and monitor.

## 2. Fault Code Reference
See [fault-codes.md](references/fault-codes.md) for the complete reference.

| Code Range | System          | Common Cause              |
|------------|-----------------|---------------------------|
| 4000-4099  | Belt/Conveyor   | Tension, alignment, speed |
| 5000-5099  | Motor/Drive     | Overtemp, overcurrent     |
| 6000-6099  | Sensor/IO       | Communication, calibration|

## 3. Response Actions
- Always READ current PLC values before recommending any write
- For Critical: dispatch to MAINTENANCE module via agent dispatch
- For repeat alarms (same code, same zone, 24h): escalate per
  [escalation-matrix.md](references/escalation-matrix.md)
- Use [classify_alarm.py](scripts/classify_alarm.py) for complex multi-fault scenarios

Progressive Disclosure (Agent Skills Standard)

The platform follows the agentskills.io progressive disclosure model exactly as specified. This is an open standard adopted by Claude, Gemini CLI, Cursor, VS Code, OpenAI Codex, and 30+ agent products. Models are being natively trained to understand this pattern — deviating from it means our agents miss out on that native understanding. As the standard evolves, our skills remain compatible.

Level · What Loads · When · Token Cost
1. Discovery name and description from YAML frontmatter At agent startup — all available skills scanned. The agent sees what skills exist and what they do. ~100 tokens per skill
2. Activation Full SKILL.md body (instructions, procedures, decision trees) When the agent decides a skill is relevant to the current task. Activated skills become part of the working context. < 5,000 tokens recommended per skill
3. Resources Files from references/, scripts/, assets/ On demand — agent reads a reference file or executes a script only when the skill instructions direct it to. Loaded via file read or script execution tools. Varies by file — loaded individually, not all at once
💡
Why we follow the standard exactly: The agentskills.io spec is becoming the industry-standard format for agent capabilities. Models like Claude and Gemini are trained to recognize SKILL.md frontmatter, activate skills based on description matching, and follow the progressive disclosure pattern natively. By staying aligned, our skills work not only with our platform agents but are also compatible with external agent products and future coding agents we may add. Skills built for this platform are portable — they can be used in Claude Code, Cursor, or any compatible agent without modification.

References:
agentskills.io — Open specification for Agent Skills format
github.com/anthropics/skills — Example skills from Anthropic

Agent Designer Workflow

The Agent Designer is the UI for configuring agents. It walks through each layer of the agent composition model. Every text input in the designer — persona, skill content, tool instructions — includes an Agent Assist button that invokes a built-in AI assistant to review, format, and improve your content based on the agent's current configuration (bound tools, skills, context budget, security policies).

Agent Assist (Built into the Designer)

Every text editor in the Agent Designer has an Agent Assist button in the corner. When clicked, the assist agent:

  • Reviews your text against the current agent configuration — are you referencing tools that aren't bound? Are your instructions consistent with the agent's permissions?
  • Formats and structures content to follow the Agent Skills spec (YAML frontmatter, markdown sections, reference links)
  • Elaborates on sparse instructions based on the bound tools, available skills, and security context — fills in gaps where the agent would need more guidance
  • Validates alignment between the persona, skills, and tools — flags mismatches (e.g., skill references a tool the agent doesn't have, persona describes capabilities not backed by tools)
  • Estimates token cost of the current content using the platform's tuned token calculator

The Agent Assist is itself an agent — it has skills for understanding the Agent Skills spec format and tools to read the current agent configuration (bound tools, skill packages, context budget, dispatch permissions). It's trained on the platform's own SDK patterns. The assist agent helps build better agents.

1

Agent Configuration

Set the agent's persona, model preference, max iterations, temperature, failover chain, dispatch permissions, and memory mode. This defines WHO the agent is and HOW it operates. Tools are assigned to the agent in two ways: at configuration time (bound through skills' tools_required in the designer) and at runtime (passed programmatically when the module creates or invokes the agent). Agent Assist reviews the persona text against the full assigned tool set — both statically bound and available platform tools — and suggests improvements based on the selected model's capabilities.

2

Skills Builder

Create and manage skill packages for the agent. Each skill is a directory with SKILL.md + optional scripts, references, and assets. The builder provides: a markdown editor for SKILL.md, file management for scripts/, references/, assets/, YAML frontmatter validation, and tools_required binding (select from Tool Registry). Agent Assist reviews skill content against the bound tools and suggests procedure steps, decision trees, and reference file structures.

3

Tool Binding

Review and confirm which tools each skill requires. The Tool Registry shows all available tools for the module (plus platform tools). Skills declare tools_required — the designer validates that all referenced tools exist and have complete security policies. Agent Assist can suggest tools that match the skill's described capabilities but aren't yet bound.

4

Memory & Context

Configure the agent's memory mode (none, automatic, housekeeping), compaction threshold, context budget, and recall event categories. Agent Assist analyzes the total token cost of the current configuration (persona + skills + tools + context providers) and recommends context profile adjustments if the budget is tight.

5

Preview Generated Prompt

The designer assembles and previews the full system prompt: persona + activated skill content + tool instructions + security context. The developer sees exactly what the LLM will receive — no hidden prompt engineering. Token breakdown shown per component. Agent Assist can review the assembled prompt and flag redundancies, gaps, or inconsistencies across skills and tool instructions.

6

Test & Iterate

Run the agent in a sandbox with test data. Review tool calls, skill activations, and prompt token usage. Adjust skills, tool instructions, or agent config. Iterate until the agent performs reliably. Test traces feed back into Agent Assist — it can analyze test results and suggest skill or configuration adjustments based on observed behavior.

Prompt Assembly Order

When an agent runs, the platform assembles the prompt from all layers:

1

Agent Persona

From ai.agent.persona in the manifest. Defines who the agent IS.

2

Skill Discovery Metadata

All skill name + description fields loaded (~100 tokens each). Agent can see what skills are available.

3

Activated Skill Content

Full SKILL.md body loaded for skills relevant to the current task. References loaded on demand.

4

Tool Instructions

Each activated skill's tools_required resolved from Tool Registry. summary_instructions injected. Tool schemas sent for function calling.

5

Security Context

Tenant, permissions, dispatch scope. "You are operating for Tenant A. You have SCADA.READ_SENSORS. You can dispatch to MAINTENANCE."

6

Memory & Recall

Recall events from previous sessions. Shift handoff context. Compacted conversation history.

7

Conversation History

Recent messages in full. Older compacted. Cache-optimized structure (static prefix → memory → dynamic).

Assembled Prompt Example

This is what the LLM actually receives — every layer assembled in order. This example shows the SCADA Monitor agent on iteration 2 of a diagnostic session.

System Prompt — SCADA_MONITOR Agent
Iteration 2 of 15 · 14,600 tokens used · 185,400 remaining
LAYER 1 — Agent Persona ~180 tokens
You are a SCADA monitoring specialist for distribution center operations. You monitor conveyor systems, multi-sorters, and AGVs. You triage alarms by severity and recommend corrective actions. You always read current PLC values before recommending any write operations. You communicate clearly with operators who may not be engineers.
LAYER 2 — Available Skills (discovery metadata) ~200 tokens
Available skills:
- alarm-triage: Classify, prioritize, and respond to SCADA alarms based on fault codes, severity thresholds, and escalation rules.
- conveyor-diagnostics: Diagnose conveyor and multishuttle equipment issues using PLC data, sensor history, and fault patterns.
LAYER 3 — Activated Skill Content (conveyor-diagnostics) ~2,400 tokens
# Conveyor Diagnostics Procedure

When investigating equipment issues, follow this workflow:

## 1. Read Current State
- Use read_multishuttle_status to get live PLC data for the aisle
- Check: shuttle positions, motor temps, cycle counts, fault codes

## 2. Compare Against History
- Use query_sensor_history with 7-day window
- Look for: trending temperatures, increasing cycle times, repeat faults

## 3. Diagnose
- Motor temp > 15% above 7-day avg = degradation warning
- Same fault code 3x in 24h = escalate to maintenance
- See references/fault-codes.md for complete code reference

## 4. Act
- For warnings: log and monitor
- For degradation: dispatch to MAINTENANCE for preventive WO
- For critical: dispatch URGENT to MAINTENANCE immediately
LAYER 4 — Context Provider (inject_current_alarm_state) ~320 tokens
CURRENT ACTIVE ALARMS (2):
1. CRITICAL — Fault 5012 (motor overcurrent) on Sorter Bay A, Shuttle 7. Active 34 minutes. Unacknowledged.
2. WARNING — Fault 4031 (belt tension low) on Conveyor Line 4, Zone B. Active 2 hours. Unacknowledged.
LAYER 5 — Tool Summary Instructions (injected into prompt) ~600 tokens
Tool instructions:

read_multishuttle_status: Read real-time status of all shuttles in a multishuttle aisle. Returns position, load status, speed, fault codes, and cycle counts per shuttle. For detailed usage, edge cases, and examples call get_tool_instructions("read_multishuttle_status").

query_sensor_history: Query historical performance data for an aisle. Use to compare current state against trends and detect degradation. For detailed usage, edge cases, and examples call get_tool_instructions("query_sensor_history").

acknowledge_alarm: Acknowledge an active alarm by ID. Only acknowledge after confirming root cause. For detailed usage, edge cases, and examples call get_tool_instructions("acknowledge_alarm").

dispatch_agents: Dispatch tasks to other module agents in parallel. You can dispatch to: MAINTENANCE, ETL_PIPELINE. For detailed usage, edge cases, and examples call get_tool_instructions("dispatch_agents").

storage_upload: Upload a file to the tenant's platform storage. For detailed usage, edge cases, and examples call get_tool_instructions("storage_upload").

get_tool_instructions: Retrieve detailed usage instructions, edge cases, error handling, and examples for any tool by name. Call this before using an unfamiliar tool or when you need guidance beyond the summary above.
LAYER 6 — Security Context ~150 tokens
Security context:
Tenant: Tenant A (tenant-a.app.dematic.com)
User: mike.rodriguez (Operator)
Permissions: SCADA.READ_SENSORS, SCADA.ACK_ALARMS
Dispatch permissions: MAINTENANCE, ETL_PIPELINE
Data scope: All zones at Facility DC-1
LAYER 7 — Recall Events (from previous sessions) ~280 tokens
Previous session context:
- [discovery] Shuttle 7 in Aisle A3 has shown elevated motor temp trend over the past week (63°C avg → 71°C yesterday)
- [action] WO-2026-0801 was created for Shuttle 7 belt inspection but has not been completed
- [preference] Operator mike.rodriguez prefers data presented as trends with percentages, not raw values
LAYER 8 — Conversation History + Tool Results ~4,200 tokens
[Compacted summary of earlier turns]
Operator asked about Aisle A3 performance. Agent read multishuttle status via bridge tool.

[User — iteration 1]
"Sorter Bay A shuttle 7 is throwing fault 5012 again. What's going on? Is this related to last week?"

[Assistant — iteration 1]
"Let me pull the current status and compare against the historical trend."

[Tool call — iteration 1] read_multishuttle_status(aisle_id="A3")
[Tool result] {shuttles: 12, shuttle_7: {motor_temp: 78°C, fault_code: 5012, fault_count_24h: 2, cycle_time: 4.2s (vs 3.1s avg), status: "FAULTED"}}

[Tool call — iteration 1] query_sensor_history(zone_id="A3", sensor_type="motor_temp", days=7)
[Tool result] {shuttle_7_avg: [63.2, 64.1, 65.8, 67.4, 69.1, 71.3, 78.0], trend: "accelerating increase", baseline_avg: 63.2}

[Current turn — iteration 2]
(Agent is now processing these results to formulate a response and decide next actions)
TOOL DEFINITIONS (sent separately as function calling schema — not in system prompt)
6 tools · ~3,400 tokens · MCP-compatible schemas
{"name": "read_multishuttle_status", "description": "Read real-time status...", "inputSchema": {"type": "object", "properties": {"aisle_id": {"type": "string"}}, "required": ["aisle_id"]}}
{"name": "query_sensor_history", "description": "Query historical...", "inputSchema": {"type": "object", "properties": {"zone_id": {"type": "string"}, "sensor_type": {"type": "string"}, "days": {"type": "integer"}}}}
{"name": "acknowledge_alarm", ...}
{"name": "dispatch_agents", ...}
{"name": "storage_upload", ...}
{"name": "get_tool_instructions", "description": "Retrieve detailed usage...", "inputSchema": {"type": "object", "properties": {"tool_name": {"type": "string"}}, "required": ["tool_name"]}}
System prompt: ~4,130 tokens · Context providers: ~320 · History + results: ~4,200 · Tool schemas: ~3,400 Total: ~14,600 / 200,000 tokens (7.3%)
Every layer is visible. The Agent Designer's prompt preview shows this exact breakdown. Developers see exactly what the LLM receives — no hidden prompt engineering. Token costs per layer are tracked by the tuned calculator. When the context budget gets tight, lower-priority layers are compacted first (oldest history, then older tool results, then context providers by priority). The system prompt layers (persona, skills, tool instructions, security) are never compacted until the final iteration.

20. Tool Registry & Agent Binding

The Tool Registry is the platform-level catalog of every tool across all modules. It connects the Module SDK (where tools are developed using @platform_tool — see Section 12) to the Agent Executor (where tools are called at runtime — see Section 16). Skills reference tools by name. Agents receive tools through their skills. The registry is the bridge.

Tool Development & Registration Lifecycle

1

Developer Writes a Tool

A function decorated with @platform_tool in the module's ai/tools.py (cloud tools) or bridge/tools.py (bridge tools). The decorator defines: module code, permission, risk level, execution mode, security policy, and summary_instructions.

2

Module Loader Discovers Tools

At startup, the Module Loader imports each module's ai/tools.py and bridge/tools.py. Every @platform_tool-decorated function is collected with its metadata: name, parameters schema (auto-generated from type hints), security policy, and instructions.

3

Tools Registered in Tool Registry

Each discovered tool is registered in the platform Tool Registry — an in-memory catalog backed by the agent_tool_permissions table. The registry stores: tool name, module code, parameter schema, security policy, execution mode (cloud/bridge), and summary instructions.

4

Skills Declare Required Tools

In the module manifest, each skill lists its tools_required — the tool names the skill needs to function. This is a declarative binding: the skill says "I need these tools," the registry verifies they exist.

5

Agent Receives Tools via Skills

When an agent runs, the platform resolves its skills → collects all tools_required across those skills → looks up each tool in the Tool Registry → assembles the tool definitions for the LLM. The agent only sees tools that its skills declare.

6

Agent Calls a Tool

When the LLM outputs a tool call, the Tool Execution Engine looks up the tool in the registry, runs the full security stack (@platform_tool decorator), routes to cloud or bridge, and returns the result to the agent.
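
A minimal sketch of steps 2-3, assuming hypothetical registry internals. The real decorator also generates the JSON Schema from type hints and attaches the full security policy:

# Hypothetical registration path: the decorator captures metadata, the loader collects it
TOOL_REGISTRY: dict[str, dict] = {}


def platform_tool(module_code: str | None = None, permission_code: str | None = None,
                  summary_instructions: str = "", execution: str = "cloud", **policy):
    def register(func):
        TOOL_REGISTRY[func.__name__] = {
            "tool_name": func.__name__,
            "module_code": module_code,               # None for platform-scoped tools
            "summary_instructions": summary_instructions,
            "execution_mode": execution,              # "cloud" or "bridge"
            "security_policy": {"permission": permission_code, **policy},
            "handler": func,                          # callable invoked at execution time
        }
        return func
    return register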

Tool Registry Structure

Field · Source · Purpose
tool_name Function name from @platform_tool Unique identifier (e.g., read_multishuttle_status)
scope @platform_tool(scope=...) platform (available to any agent) or module (owned by a specific module). Platform tools are built into the core — module tools are loaded from module packages.
module_code @platform_tool(module_code=...) Which module owns this tool. null for platform-scoped tools.
parameters_schema Auto-generated from function type hints JSON Schema describing the tool's parameters — sent to the LLM for function calling
summary_instructions @platform_tool(summary_instructions=...) Always in context. Concise guidance injected into the agent's system prompt — when to use this tool, what it returns, critical constraints. Must be self-sufficient — if a tool is bound without a skill, these instructions are all the agent has.
full_instructions @platform_tool(full_instructions=...) On demand. Detailed usage guide with parameter explanations, edge cases, error handling, and examples. Not in the agent's context by default — loaded via skill references or Agent Designer inspection. Available to Agent Assist for validation.
examples @platform_tool(examples=[...]) On demand. Input/output examples showing the tool in use. Used by Agent Assist, test scenarios, and the CLI platform tools inspect. Can be referenced by skills that need to show the agent how to call the tool in specific scenarios.
security_policy @platform_tool decorator fields Permission, risk level, allowed/denied agents, approval gate, execution timeout, audit level, error policy, guardrails (pre_conditions, post_conditions, success_criteria, on_failure). Guardrails are enforced by the Agent Executor at runtime — see Section 16, Step 4.
tenant_context Injected by platform at runtime Not a registry field — injected at execution time. Every tool invocation receives a ToolContext and TenantDataConnector that are pre-scoped to the current tenant. The registry does not store tenant information because tools are tenant-agnostic definitions — tenant scoping happens at the execution layer, not the registration layer. This is by design: the same tool definition serves all tenants, but every execution is tenant-isolated.
execution_mode @platform_tool(execution=...) cloud (runs on platform server) or bridge (routed to the current tenant's on-prem Bridge — tenant scoping enforced at execution)
handler The decorated function itself The actual callable that executes when the tool is invoked. Receives tenant-scoped ToolContext and TenantDataConnector as injected arguments — the handler never resolves tenant context itself.
token_cost Measured at registration + calibrated over time Estimated token footprint of this tool — the schema definition tokens (sent to LLM every call) plus average result tokens (based on historical returns). Used by the context manager to forecast whether a tool call will fit in the remaining context budget. Calibrated using the platform's tuned token calculator.
💡
Why tools have a token cost: Every tool the agent has access to consumes context tokens — the tool's JSON Schema definition is sent to the LLM on every call. A module with 10 tools might consume 3,000+ tokens just for tool definitions before the conversation even starts. The token_cost field lets the context manager forecast the total overhead and make informed decisions about compaction, tool stripping, and iteration limits.
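
A minimal sketch of that forecast, assuming hypothetical token_cost sub-fields:

# Hypothetical context-budget forecast from per-tool token costs (field names are illustrative)
def forecast_tool_overhead(bound_tools: list[dict], remaining_budget: int) -> dict:
    schema_tokens = sum(t["token_cost"]["schema_tokens"] for t in bound_tools)          # sent on every LLM call
    expected_results = sum(t["token_cost"]["avg_result_tokens"] for t in bound_tools)   # historical average returns
    return {
        "schema_tokens": schema_tokens,
        "expected_result_tokens": expected_results,
        "fits_in_budget": schema_tokens + expected_results <= remaining_budget,
    }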

Tool Documentation: Three Levels

Each tool carries three levels of documentation. This mirrors the progressive disclosure pattern from the Agent Skills spec — the agent gets concise guidance in its prompt, detailed instructions are available on demand, and examples support testing and validation.

Level · What It Contains · Where It's Used · Token Impact
1. summary_instructions
Always in context
Concise: what the tool does, when to use it, what it returns, critical constraints. 1-3 sentences. Injected into the agent's system prompt on every LLM call. This is all the agent sees during execution. Must be self-sufficient — if the tool is used without a skill, these instructions alone must be enough. ~50-150 tokens per tool. Always consumed.
2. full_instructions
Agent-retrievable
Detailed usage guide: parameter explanations, edge cases, error handling, when NOT to use the tool, interaction patterns with other tools. Agent retrieves on demand via the platform tool get_tool_instructions when it needs detailed guidance before calling a tool. Also used by Agent Designer (Agent Assist reads this to validate skills), CLI platform tools inspect, and skill authors when writing procedures. Not in the agent's prompt by default — loaded only when the agent decides it needs deeper understanding. Zero by default. Loaded into context only when the agent calls get_tool_instructions — typically ~200-500 tokens per retrieval. The agent pays the cost only for tools it actually needs detailed guidance on.
3. examples
Agent-retrievable
Input/output examples: sample calls with parameters and expected responses. Covers common scenarios, edge cases, and error conditions. Agent retrieves on demand via get_tool_instructions (returned alongside full_instructions). Also used by Agent Designer (test scenario generation), CLI inspection, and skill reference files for specific workflows. Zero by default. Included in the get_tool_instructions response when retrieved.

Automatic Retrieval Directive Injection

Developers write summary_instructions as concise guidance — what the tool does, when to use it, critical constraints. The platform automatically appends the retrieval directive to every tool's summary at prompt assembly time. Developers never write it themselves.

How it works

  1. Developer writes: summary_instructions="Query recent sensor readings. Returns time-series data for the specified zone and sensor type."
  2. Platform injects at prompt assembly: "Query recent sensor readings. Returns time-series data for the specified zone and sensor type. For detailed usage, edge cases, and examples call get_tool_instructions("query_sensor_history")."
  3. Agent sees the combined text in its system prompt. If the summary is sufficient, it calls the tool directly. If it needs more guidance, it calls get_tool_instructions first — one extra iteration, but detailed docs loaded into context only when needed.

This is handled by the prompt assembler during Step 1 of the Agent Executor loop. The directive is appended to every tool that has full_instructions defined. Tools without full_instructions get no directive — there's nothing to retrieve.
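
A minimal sketch of that assembly step, assuming the registry entry shape is a plain dict:

# Hypothetical directive injection at prompt assembly (registry entry shape is illustrative)
RETRIEVAL_DIRECTIVE = ' For detailed usage, edge cases, and examples call get_tool_instructions("{tool_name}").'


def assemble_tool_summary(tool: dict) -> str:
    summary = tool["summary_instructions"]
    if tool.get("full_instructions"):        # tools without full_instructions get no directive
        summary += RETRIEVAL_DIRECTIVE.format(tool_name=tool["tool_name"])
    return summary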

Progressive disclosure for tools — matching the skill pattern. Skills use three-tier progressive disclosure: discovery (name + description) → activation (full SKILL.md) → resources (references/ on demand). Tools now follow the same pattern: summary_instructions (always in prompt) → full_instructions + examples (agent retrieves via get_tool_instructions). The agent pays context tokens only for the tools it actually needs deep guidance on. A skill binding 8 tools where the agent only needs detailed docs for 2 saves ~1,500-3,000 tokens of unnecessary context.

How Tools and Skills Share the Documentation Responsibility

Responsibility · Tool Owns This · Skill Owns This
What does the tool do? summary_instructions — always in prompt
When should the agent use it? summary_instructions (general guidance) SKILL.md procedures (specific workflows: "after checking the alarm, call this tool to...")
How to use it in this scenario? full_instructions (agent retrieves via get_tool_instructions when it needs detailed guidance) SKILL.md procedures + references/ files (in context when skill is activated)
What parameters, what format? parameters_schema (JSON Schema, always sent to LLM)
Show me an example examples (agent retrieves via get_tool_instructions) references/ files can include scenario-specific examples (loaded on demand)
What could go wrong? full_instructions (agent retrieves via get_tool_instructions — includes error handling docs) SKILL.md edge cases section ("if the tool returns error X, do Y")
The tool works without a skill. The skill makes it work better. A tool's summary_instructions must be good enough that an agent can use the tool correctly without any skill context — this was proven in DataVisions where tools carried all their own instructions. When the agent encounters an unfamiliar tool or a complex scenario, the auto-injected retrieval directive tells it to call get_tool_instructions for the full guide — no skill needed, no upfront context cost. Skills add domain-specific procedures, decision trees, and scenario guidance on top. The tool is the capability. The skill is the expertise. get_tool_instructions is the bridge between summary and mastery. When a skill references a tool by name in tools_required, it doesn't need to describe the tool — the tool's schema and summary_instructions (with retrieval directive) are automatically resolved from the Tool Registry and injected into the prompt.

MCP Compatibility

The platform's tool system is built on the Model Context Protocol (MCP) standard for tool definitions. MCP is an open protocol supported by Claude, ChatGPT, VS Code, Cursor, and the broader AI ecosystem. Models are natively trained to understand MCP tool schemas — by following this standard, our tools get the best possible tool-calling behavior from every model.

MCP Standard · Our Implementation · Notes
name tool_name Identical. Function name from @platform_tool.
description summary_instructions MCP's description field maps to our summary_instructions. Same purpose — tells the model what the tool does and when to use it.
inputSchema (JSON Schema) parameters_schema Identical. Auto-generated from Python type hints. Standard JSON Schema format that every model understands.
not in MCP security_policy Platform extension. MCP has no security model. Our @platform_tool adds permissions, risk levels, agent allow/deny, approval gates, and audit. This wraps MCP tools with enterprise security.
not in MCP summary_instructions injection Platform extension. MCP tools send their description as part of the tool definition. We additionally inject summary_instructions directly into the system prompt as instructional text — the agent sees guidance about the tool in its instructions, not just in the tool schema. This was developed and proven in DataVisions and produces better tool-calling behavior than description-only approaches.
not in MCP full_instructions, examples Platform extension. Progressive disclosure layers that MCP doesn't define. Available for Agent Assist, CLI inspection, and skill authors.
💡
MCP is the baseline. We extend it. Every @platform_tool produces an MCP-compatible tool definition (name + description + inputSchema). External MCP clients can connect to the platform and discover tools through the standard protocol. On top of MCP, we add the security layer, instruction injection, progressive documentation, and token cost tracking. MCP gives us model compatibility. Our extensions give us enterprise-grade agent infrastructure.

Reference: modelcontextprotocol.io — Model Context Protocol specification
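
Since a registry entry already carries the name, summary, and parameter schema, producing the MCP-compatible definition is a direct projection. A sketch, assuming a plain-dict registry entry:

# Hypothetical projection of a Tool Registry entry to an MCP-compatible definition
def to_mcp_definition(tool: dict) -> dict:
    # security_policy, full_instructions, examples, and token_cost stay platform-side:
    # they are extensions the MCP spec does not define.
    return {
        "name": tool["tool_name"],
        "description": tool["summary_instructions"],   # maps to MCP's description field
        "inputSchema": tool["parameters_schema"],       # JSON Schema built from type hints
    }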

Platform Tools vs. Module Tools

Tools exist at two scopes. Platform tools are built into the core and available to any agent on the platform. Module tools are loaded from module packages and scoped to that module's agents (unless accessed via dispatch or Service Registry).

Scope · Owned By · Available To · Examples
Platform Platform core (app/core/platform_tools/) Any agent on the platform — module agents can bind platform tools in their skills dispatch_agents, storage_upload, storage_get_url, generate_embedding, schedule_task, send_notification, request_user_input, get_tool_instructions
Module A specific module (app/modules/{name}/ai/tools.py) That module's agents. Other modules access via dispatch or Service Registry. read_plc_register, create_work_order, query_sensor_history, run_pipeline
💡
Why this matters for module developers: When building a module, your agent can use platform tools without writing them. Need file uploads? Bind storage_upload in your skill's tools_required. Need to dispatch to another agent? Bind dispatch_agents. Need embeddings? Bind generate_embedding. Platform tools are discoverable via the CLI and ready to use.

Platform Tool Detail: Agent Orchestration

Three platform tools give agents the ability to pause for human input, dispatch work in parallel, and send notifications. Modules can also call these same tools deterministically (not just through agents) — a module's service layer can call dispatch_agents or request_user_input programmatically with the same interface and security.

Agent Orchestration Flow

User: "Analyze equipment health for Aisle A3 and send the report to the engineering lead for review" SCADA Agent (Module Agent) │ │ ──── PHASE 1: Parallel Research (dispatch_agents) ──────── │ ├─ dispatch_agents(parallel=true) │ │ │ ├─ → SCADA_MONITOR agent │ │ Task: "Pull live PLC data and 30-day history for Aisle A3" │ │ Tools: read_multishuttle_status (bridge), query_sensor_history (cloud) │ │ Result: {shuttles: 12, fault_trends: [...], motor_temps: [...]} │ │ │ ├─ → ENGINEERING_CONFIG agent │ │ Task: "Get equipment specs and known issues for Aisle A3" │ │ Tools: get_equipment_config, search_fault_history (vector search) │ │ Result: {specs: {...}, similar_faults: [...], resolutions: [...]} │ │ │ └─ → MAINTENANCE agent │ Task: "Get open and recent work orders for Aisle A3" │ Tools: get_open_work_orders, get_work_order_history │ Result: {open: 1, recent: [...], avg_resolution_time: "4.2h"} │ │ ←── All 3 agents complete. Results merged into context. │ │ ──── PHASE 2: Generate Report ──────────────────────────── │ │ Agent synthesizes all results into an equipment health report. │ Stores report: storage_upload(purpose="reports", file=report.pdf) │ │ ──── PHASE 3: Human Review (request_user_input) ────────── │ ├─ send_notification( │ to="sarah.chen@dematic.com", │ message="Equipment health report for Aisle A3 ready for review" │ ) │ ├─ request_user_input( │ form={ │ fields: [ │ {name: "approved", type: "select", options: ["Approved", "Needs Changes"]}, │ {name: "comments", type: "textarea", label: "Engineering comments"}, │ {name: "priority", type: "select", options: ["Normal", "Urgent", "Critical"]} │ ] │ }, │ message: "Please review the Aisle A3 health report and approve or request changes." │ ) │ │ ═══ AGENT PAUSES ═══════════════════════════════════════════ │ State checkpointed to DB: context, tool results, conversation │ Agent process ends. No timers. No sleeping. │ SSE → frontend shows form to user │ Email notification sent to sarah.chen │ │ ... hours or days later ... │ │ Sarah submits form: {approved: "Approved", comments: "Schedule │ preventive maintenance for Shuttle 7", priority: "Urgent"} │ │ ═══ AGENT RESUMES ══════════════════════════════════════════ │ Platform restores checkpointed state │ Agent receives form response as tool result │ │ ──── PHASE 4: Act on Review ────────────────────────────── │ ├─ dispatch_agents → MAINTENANCE agent │ Task: "Create urgent preventive WO for Shuttle 7 per engineering review" │ Result: {work_order: "WO-2026-0912", priority: "Urgent"} │ └─ Response to user: "Report approved by Sarah Chen. Urgent preventive work order WO-2026-0912 created for Shuttle 7 per her comments."

request_user_input (Pause & Resume)

AspectDetail
What it does Pauses the agent, checkpoints full state to the Platform DB, presents a dynamic form to the user via SSE, and resumes the agent when the user submits. No timers, no sleeping — event-driven resume.
Form definition Dynamic JSON form: text, textarea, number, select, multi_select, checkbox, radio, date fields. The agent builds the form at runtime based on what it needs to ask.
State checkpoint Full agent state saved: conversation history, tool results, context providers, activated skills, iteration count, todo state. Resume skips straight to processing the form response.
Resume trigger User submits the form in the frontend. Platform loads checkpointed state, injects form response as the tool result, agent continues from where it paused.
Timeout Configurable expiry (default 7 days). If the user doesn't respond, the agent receives a timeout result and can handle it (send a reminder, escalate, or close gracefully).
Notification Pair with send_notification to alert the user via email, Slack, or in-app notification that their input is needed.

dispatch_agents (Parallel Execution)

AspectDetail
What it does Dispatches tasks to one or more module agents in parallel. Each agent runs with its own tools, skills, and tenant context. Results collected and returned as an array.
Parallel mode parallel=true — all dispatched agents run concurrently. parallel=false — run sequentially (output of one feeds into the next).
Result strategies all_required (default) — wait for all agents. first_success — return as soon as one completes. timeout — return whatever is ready after N seconds.
Dispatch permissions The calling agent can only dispatch to agents listed in its dispatch_permissions manifest field. Enforced by the Service Registry, not the LLM.
SSE events agent_start, agent_text, agent_tool_call, agent_tool_result, agent_complete — frontend can show parallel agent activity in real time.
Programmatic access Modules can also call dispatch_agents from their service layer (not just from an agent). A scheduled task or API endpoint can programmatically fan out work to multiple agents.
Same tools, two access patterns. An agent calls request_user_input or dispatch_agents as tool calls during its execution loop. A module's service layer calls the same tools programmatically for deterministic workflows. Same security, same checkpoint/resume, same parallel execution — whether the caller is an AI agent or a Python function. Modules that need workflow-like behavior (pause for approval, fan-out/fan-in, send notification and wait) get it through platform tools, not a separate workflow engine.
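To make the programmatic path concrete, here is a minimal sketch of a module service calling the same orchestration tools deterministically. The platform_tools client, its import path, and the parameter names are illustrative assumptions; the real interfaces live in the Module SDK.

# Hypothetical sketch: a module service fanning out work and pausing for
# human review through the same platform tools an agent would call.
from platform_sdk import platform_tools  # assumed import path

async def nightly_health_review(tenant_ctx):
    # Fan out to module agents in parallel (same dispatch_agents tool)
    results = await platform_tools.dispatch_agents(
        context=tenant_ctx,
        parallel=True,
        tasks=[
            {"agent": "SCADA_MONITOR", "task": "Summarize overnight faults for Aisle A3"},
            {"agent": "MAINTENANCE", "task": "List open work orders for Aisle A3"},
        ],
    )

    # Pause for human review; checkpoint and resume are handled by the platform
    review = await platform_tools.request_user_input(
        context=tenant_ctx,
        message="Review the overnight health summary for Aisle A3.",
        form={"fields": [{"name": "approved", "type": "select",
                          "options": ["Approved", "Needs Changes"]}]},
    )
    return {"results": results, "review": review}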

How Skills Bind to Tools

module.json manifest:

"ai": {
  "skills": [
    {
      "name": "alarm-triage",                 ← Skill package name
      "path": "ai/skills/alarm-triage/",      ← Directory (contains SKILL.md)
      "tools_required": [                     ← Tools this skill needs
        "get_active_alarms",                  ← Must exist in Tool Registry
        "get_alarm_history",                  ← Must exist in Tool Registry
        "acknowledge_alarm"                   ← Must exist in Tool Registry
      ]
    }
  ]
}

ai/skills/alarm-triage/                       ← Skill PACKAGE
├── SKILL.md                                  ← Instructions + frontmatter
├── references/fault-codes.md                 ← Loaded on demand
├── references/escalation-matrix.md           ← Loaded on demand
└── scripts/classify_alarm.py                 ← Agent can execute

ai/tools.py:
@platform_tool(module_code="SCADA_MONITOR", ...)
async def get_active_alarms(...)              ← Registered in Tool Registry
@platform_tool(module_code="SCADA_MONITOR", ...)
async def get_alarm_history(...)              ← Registered in Tool Registry
@platform_tool(module_code="SCADA_MONITOR", ...)
async def acknowledge_alarm(...)              ← Registered in Tool Registry

At agent runtime:
Agent activates skill "alarm-triage"
→ SKILL.md body loaded into system prompt (instructions)
→ tools_required resolved: get_active_alarms, get_alarm_history, acknowledge_alarm
→ Tool definitions (schema + instructions) sent to LLM
→ references/ and scripts/ available on demand (progressive disclosure)
→ Agent KNOWS the procedures (skill) and CAN DO the actions (tools)

Tool Access Patterns

PatternHow It WorksExample
Platform tools Any agent can bind platform tools in its skills' tools_required. Platform tools are always available — no module dependency needed. SCADA agent uses dispatch_agents, storage_upload, generate_embedding
Same-module tools Agent's skills declare tools_required → resolved from the module's own Tool Registry entries. Direct access. SCADA agent uses read_plc_register (a SCADA module tool)
Cross-module via dispatch Agent dispatches to another module's agent via the platform tool dispatch_agents. The dispatched agent runs with its own tools. Results returned. SCADA agent dispatches to MAINTENANCE agent to create a work order.
Cross-module via Service Registry A tool in module A can call a registered service in module B through the Service Registry. Permission-checked at invocation. SCADA tool calls MAINTENANCE.get_technician_availability as part of its logic.

Tool Registry in the Admin Console

The Admin Console (Section 23) and Platform CLI (Section 25) provide visibility into the Tool Registry:

# List all tools — platform and module
$ platform tools list

SCOPE     OWNER            TOOL NAME                   EXECUTION  RISK
────────  ───────────────  ──────────────────────────  ─────────  ────────
platform  PLATFORM         dispatch_agents             cloud      safe
platform  PLATFORM         storage_upload              cloud      safe
platform  PLATFORM         storage_get_url             cloud      safe
platform  PLATFORM         generate_embedding          cloud      safe
platform  PLATFORM         schedule_task               cloud      low
platform  PLATFORM         send_notification           cloud      low
platform  PLATFORM         request_user_input          cloud      safe
platform  PLATFORM         get_tool_instructions       cloud      safe
module    SCADA_MONITOR    read_multishuttle_status    bridge     safe
module    SCADA_MONITOR    write_plc_register          bridge     critical
module    SCADA_MONITOR    query_sensor_history        cloud      safe
module    SCADA_MONITOR    get_active_alarms           cloud      safe
module    SCADA_MONITOR    acknowledge_alarm           cloud      medium
module    MAINTENANCE      create_work_order           cloud      medium
module    MAINTENANCE      assign_technician           cloud      medium
module    ETL_PIPELINE     run_pipeline                bridge     medium
module    ETL_PIPELINE     get_pipeline_status         cloud      safe

# List only platform tools (useful for module developers discovering what's available)
$ platform tools list --scope platform

SCOPE     TOOL NAME                   DESCRIPTION
────────  ──────────────────────────  ──────────────────────────────────────
platform  dispatch_agents             Dispatch tasks to other module agents in parallel
platform  storage_upload              Upload a file to the tenant's platform storage
platform  storage_get_url             Generate a signed URL for a stored file
platform  generate_embedding          Generate vector embeddings for text via Model Registry
platform  schedule_task               Create or update a scheduled task
platform  send_notification           Send a notification to a user or channel
platform  request_user_input          Pause agent and request input from the user via form
platform  get_tool_instructions       Retrieve full_instructions and examples for a tool on demand

# Show details for a specific tool
$ platform tools inspect write_plc_register

TOOL: write_plc_register
  Scope:           module
  Module:          SCADA_MONITOR
  Execution:       bridge
  Risk Level:      critical
  Permission:      SCADA.WRITE_PLC
  Requires Approval: yes
  Allowed Agents:  [SCADA_MONITOR]
  Denied Agents:   [REPORTING, ETL_PIPELINE]
  Timeout:         3000ms
  Error Policy:    halt_and_notify
  Audit:           full
  Used by Skills:  plc_operations
  Parameters:
    register_address  (str, required)
    value             (int, required)
    plc_id            (str, required)
  Instructions:
    "Write a value to a PLC register. CRITICAL: Controls physical
     equipment. Only use when you have confirmed the correct
     register address and value. Always read the current value first."

Validation at Module Load

What the Module Loader validates at registration

  • Tool names are unique within the module — duplicate names rejected
  • Skills reference valid tools — every tools_required entry must match a registered @platform_tool in the module. Missing tools block module loading.
  • Security policies are complete — every tool must have risk_level. Bridge tools must have full security policy (permission, audit, error_policy). Incomplete policies block registration.
  • Parameter schemas are valid — type hints must produce a valid JSON Schema. Untyped parameters are rejected.
  • Dispatch permissions are valid — if the agent config declares dispatch_permissions, the target module codes must exist in the module registry.
The binding model: Tools are developed with @platform_tool (Section 12). The Module Loader registers them in the Tool Registry. Skills declare which tools they need (Section 14). Agents receive tools through their skills. The Agent Executor runs them at runtime (Section 16). The registry validates all bindings at load time — if a skill references a tool that doesn't exist, the module won't load. No runtime surprises.
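As a rough illustration of the load-time check, here is a sketch of the skill-to-tool validation described above. The function and field names are assumptions, not the actual Module Loader code.

# Illustrative sketch of the load-time binding check. Assumes the loader has
# the parsed manifest and the set of @platform_tool names the module registered.
def validate_skill_bindings(manifest: dict, registered_tools: set[str]) -> list[str]:
    errors = []
    for skill in manifest.get("ai", {}).get("skills", []):
        for tool_name in skill.get("tools_required", []):
            if tool_name not in registered_tools:
                errors.append(
                    f"Skill '{skill['name']}' requires tool '{tool_name}', "
                    "which is not registered in this module"
                )
    return errors  # a non-empty list blocks the module from loading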

21. Agent Executor

The Agent Executor is the runtime engine that powers every agent on the platform. It does one thing: take a user request, reason about it, call tools as needed, and produce a response. The executor handles context management, tool routing, safety limits, and token tracking — module developers configure it per agent in the manifest and never write execution logic.

How It Works (Overview)

At its core, the executor is a loop. The agent receives a message, thinks about what to do, optionally calls tools, and either responds or loops again with the tool results. Every agent on the platform runs through this same loop — what makes each agent unique is its skills, tools, context configuration, and model.

User sends a message
  ↓
Agent Executor Loop
  1. Assemble Prompt
     Persona · Skills · Context Providers · Tool Instructions · Security Context · Recall Events · History
     Check context budget → compact if threshold exceeded
  2. Call the LLM
     Send prompt + tool schemas · Stream via SSE · Track tokens (provider + calc) · Failover chain on error
  3. What did the LLM return?
     Text only → DONE → Return response to user
     Tool calls → Continue to step 4 ↓
  4. Execute Tools
     Security check (@platform_tool) · Pre-condition guardrails · Route: cloud or bridge · Execute & collect result · Post-condition guardrails · Add to history
  5. Safety Checks
     Max iterations reached? → Force stop · Same tool failing 3x? → Break loop · Context exceeded? → Compact
     Loop back to step 1
  ↓
Response delivered to user via SSE
💡
That's it. Assemble → Call LLM → Text or Tools → Execute Tools → Safety Check → Loop. Every agent on the platform runs through this same loop. The complexity isn't in the loop — it's in what goes INTO the loop (prompt assembly, context management) and what happens DURING tool execution (security, routing, bridge). Those details follow below.
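For orientation, here is a stripped-down sketch of that loop in Python. The helper names (assemble_prompt, call_llm, execute_tool) are placeholders; the real executor wraps each step with streaming, guardrails, and checkpointing.

# Simplified sketch of the executor loop, not the platform implementation.
async def run_agent(session, max_iterations: int):
    for iteration in range(1, max_iterations + 1):
        final = iteration == max_iterations
        prompt = await assemble_prompt(session, include_tools=not final)  # step 1 (+ compaction)
        response = await call_llm(prompt, stream=True)                    # step 2

        if not response.tool_calls:                                       # step 3: text only, done
            return response.text

        for call in response.tool_calls:                                  # step 4: execute tools
            result = await execute_tool(session, call)                    # security + guardrails inside
            session.history.append(result)

        if session.same_tool_failed_three_times():                        # step 5: loop detection
            return "Stopped: repeated tool failure."
    return "Stopped: max iterations reached."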

Step 1: Prompt Assembly (Detail)

Before each LLM call, the executor assembles the full prompt from all layers. Each component is measured by the platform's tuned token calculator. If the total exceeds the agent's context budget, compaction runs automatically before calling the LLM.

ComponentWhat It ContainsBudget Impact
PersonaAgent identity from the manifestFixed ~100-200 tokens. Never compacted.
Activated SkillsSKILL.md body for relevant skillsFixed per session ~2,000-5,000 tokens. Rarely compacted.
Context ProvidersInjected data from @context_provider functions (live alarm state, equipment config)Capped per provider. Skipped by priority if budget is tight.
Tool Instructionssummary_instructions from each bound toolFixed ~50-150 tokens per tool. Never compacted until final iteration.
Security ContextTenant, permissions, dispatch scopeFixed ~150 tokens. Never compacted.
Recall EventsPrevious session knowledge (shift handoffs, preferences)Small ~200-400 tokens. Rarely compacted.
Conversation HistoryRecent messages in full, older messages compacted or summarizedGrows with each turn. First to compact.
Tool ResultsResults from previous tool calls in this sessionGrows with each iteration. Older results truncated per tool_result_retention.

Steps 2-3: LLM Call & Response

AspectDetail
StreamingResponse streamed via SSE to the frontend in real time. Text, tool calls, and thinking events delivered as they arrive.
FailoverIf the primary model fails (rate limit, timeout, API error), the executor walks the failover chain. Response metadata tracks which model was actually used.
Token trackingProvider-reported tokens recorded as billing truth. Platform calculator runs in parallel for validation. Drift tracked per provider.
Response typesText only → the agent is done, deliver response to user. Tool calls → continue to tool execution. Both can appear in the same response.

Step 4: Tool Execution & Guardrails

Every tool call — platform tools and module tools alike — goes through the same @platform_tool security and guardrails stack. Both use the same decorator, same guardrail parameters, same executor pipeline. The only difference is scope: platform tools are built into the core and available to any agent; module tools are loaded from a module package and scoped to that module. The executor doesn't distinguish between them — a pre_condition on dispatch_agents (platform) runs through the same pipeline as a pre_condition on write_plc_register (module).

Tool Execution Pipeline

StepWhat HappensOn Failure
1. Tenant scoping Hard isolation boundary. Resolve tenant from the agent session's JWT. Inject a TenantDataConnector pre-routed to the current tenant's data store. Set RLS context on Platform DB (SET app.current_tenant_id). Resolve the current tenant's bridge connection for bridge tools. All downstream operations — data queries, bridge calls, embeddings, dispatch — are locked to this tenant. This step runs before any tool logic and cannot be overridden by the tool, the agent, or the LLM. Reject — if tenant cannot be resolved (invalid JWT, expired session), the tool call is denied before any other step runs
2. Security checks Permission check (user has the required permission in this tenant), module subscription check (tenant subscribes to this module), agent allow/deny list Reject — tool call denied, error returned to agent
3. Pre-conditions Business-rule validation on inputs before execution. Defined on the tool via pre_conditions=[...]. Checks: value ranges, format patterns, cross-field logic, enum membership. Apply on_failure action (see below)
4. Approval gate If requires_approval=True: emit SSE approval_request, pause, wait for human response Denied → tool call rejected, reason returned to agent
5. Execute Route to cloud handler (with tenant-scoped TenantDataConnector) or Bridge WebSocket (routed to the current tenant's bridge). Apply timeout. Collect result. Apply error_policy (retry/halt/escalate)
6. Post-conditions Validation on outputs after execution. Defined via post_conditions=[...]. Checks: expected fields present, value ranges, status codes, response structure. Apply on_failure action
7. Success criteria Optional verification step. read_after_write: re-read the value to confirm the write took effect. dual_confirm: require a second confirmation for critical operations. Apply on_failure action
8. Audit + return Log the full execution (inputs, outputs, guardrail results, timing). Sanitize response. Add result to conversation history.

Guardrail Failure Actions

When a pre-condition, post-condition, or success criteria check fails, the executor applies the tool's on_failure action:

ActionWhat the Executor DoesAgent SeesUse When
halt Stop the agent immediately. No more iterations. Log full context. Error: "Tool halted by guardrail: {reason}" Critical operations — writing to PLCs, financial transactions, irreversible actions
pause Pause the agent, checkpoint state, emit SSE to request human review of the failed condition. Resume when human approves or rejects. Paused — waiting for human review High-risk operations where a human should decide whether to proceed
warn Log the failure, emit SSE guardrails_check event, but let execution continue. Warning included in the tool result. Result + warning: "Post-condition warning: {detail}" Soft constraints — unexpected but not dangerous output
retry Re-execute the tool (up to max_retries, default 2). If still failing after retries, escalate to halt. Transparent — agent sees final result (success or halt) Transient failures — bridge timeouts, network glitches
collect Log the failure but return the result anyway. Guardrail failures are collected and surfaced in the session trace for post-hoc review. Normal result — no indication of failure Monitoring — tracking edge cases without disrupting the agent
skip Don't execute the tool at all. Return a synthetic "skipped" result to the agent with the reason. "Tool skipped: pre-condition failed: {reason}" Defensive — prevent execution when inputs are clearly invalid

Guardrails in Practice: Bridge Tool Example

Agent calls: write_plc_register(register_address="INVALID", value=99999, plc_id="PLC-001")

Executor pipeline:
  ✓ Security checks — SCADA.WRITE_PLC permission confirmed
  ✗ PRE-CONDITION FAIL — register_address must match r'^PLC-\d{3}-R\d{4}$'
      "INVALID" does not match → on_failure = "halt"
  → Agent receives: "Tool halted by guardrail: register_address format invalid"
  → Agent trace logs: pre_condition_failed, inputs, reason, action=halt
  → SSE event: guardrails_check { tool: "write_plc_register", result: "fail", action: "halt" }
  → Tool NEVER EXECUTED — the PLC was never touched
Guardrails are defined on tools, enforced by the executor. Module developers declare pre_conditions, post_conditions, success_criteria, and on_failure on each @platform_tool (Section 12). The Tool Registry stores them as part of the tool's security_policy (Section 15). The executor runs them on every tool call in this pipeline. The agent never writes guardrail logic — it's infrastructure. The Admin Console (Section 23) surfaces guardrail pass/fail rates across all tools and tenants.
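A hedged sketch of what those declarations might look like on a tool, using the write_plc_register example from this section. The guardrail dictionary format and the value range shown here are illustrative; the authoritative decorator signature is defined in Section 12.

# Illustrative guardrail declarations on a bridge tool (sketch, not the SDK).
@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.WRITE_PLC",
    execution="bridge",
    risk_level="critical",
    requires_approval=True,
    audit="full",
    pre_conditions=[
        {"check": "regex", "field": "register_address", "pattern": r"^PLC-\d{3}-R\d{4}$"},
        {"check": "range", "field": "value", "min": 0, "max": 4095},  # range is illustrative
    ],
    post_conditions=[
        {"check": "field_present", "field": "written_value"},
    ],
    success_criteria={"type": "read_after_write", "field": "written_value"},
    on_failure="halt",
)
async def write_plc_register(context, register_address: str, value: int, plc_id: str) -> dict:
    # Executes on the Bridge against the physical PLC
    return await context.bridge.execute("write_register", {
        "register_address": register_address, "value": value, "plc_id": plc_id,
    })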

Other Tool Execution Details

AspectDetail
Routingexecution: "cloud" → runs on platform server. execution: "bridge" → routed to tenant's on-prem Bridge via WebSocket.
ResultsTool results added to conversation history. Result token size measured for context forecasting.
Parallel callsIf the LLM outputs multiple tool calls in one response, they execute concurrently (where safe). Each goes through the full guardrails pipeline independently. Results collected and added together.

Step 5: Safety & Iteration Control

ControlWhat Happens
Max iterationsConfigurable per agent (5-30). On final iteration, tools are stripped from the prompt and the agent must produce a text summary of what it accomplished and what remains.
Loop detectionIf the same failing tool call repeats 3 times, the loop breaks with an error. Prevents runaway cost from stuck agents.
Context forecastingBefore each iteration, the executor forecasts whether the next LLM call + expected tool results will fit in the remaining budget. If not, compact before calling — not after it fails.
Todo failoverAfter each iteration, if the agent didn't call todo_update, a fast/cheap background model infers the correct todo state. Non-blocking. Agent never knows.
Mid-session messagesUser messages sent during tool execution are queued and injected at the next iteration boundary. Never mid-tool-call.

Context Budget Breakdown

The context window is a fixed-size container. The executor calculates how every component fills it, using the platform's tuned token calculator (calibrated against provider-reported actuals per the drift tracking in the Model Registry).

Model: claude-sonnet-4-6 (200K context)
Agent: SCADA_MONITOR (configured: compaction_threshold_pct = 70%)

┌─ 200,000 tokens (model max) ──────────────────────────────────
│
│  ┌─ FIXED OVERHEAD (calculated at session start) ────────────
│  │  System prompt + persona                    ~800 tokens
│  │  Security context (tenant, permissions)     ~200 tokens
│  │  Tool definitions (5 tools × ~600 avg)    ~3,000 tokens
│  │  ────────────────────────────────────────────────────────
│  │  Total fixed:                              ~4,000 tokens
│  └────────────────────────────────────────────────────────────
│
│  Usable budget: 200,000 - 4,000 = 196,000 tokens
│  Compaction threshold (70%): 137,200 tokens
│
│  ┌─ DYNAMIC CONTENT (grows during session) ──────────────────
│  │  Activated skills (alarm-triage SKILL.md)      ~2,500
│  │  Context providers (live alarm state)            ~800
│  │  Recall events (3 from last shift)               ~400
│  │  Conversation history (user + assistant)       ~varies
│  │  Tool results (accumulated)                    ~varies
│  │
│  │  When dynamic content hits 137,200 → COMPACT
│  └────────────────────────────────────────────────────────────
└────────────────────────────────────────────────────────────────

Context Providers (Custom Context Injection)

Sometimes an agent needs data in the prompt before it starts reasoning, not from a tool call after it starts thinking. Context providers are module-defined functions that run during prompt assembly and inject content directly into the system prompt — saving an iteration and giving the agent immediate situational awareness.

ApproachWhen Data ArrivesLLM IterationsToken Cost
Tool call Agent decides it needs data → calls tool → gets result → reasons 2+ iterations minimum Tool schema on every call + result tokens
Context provider Data injected into prompt before agent starts. Agent reasons immediately. 1 iteration — data is already there Injected tokens on every call (capped by max_tokens)

The @context_provider Decorator

@context_provider(
    module_code="SCADA_MONITOR",
    trigger="always",           # Run on every prompt assembly
    max_tokens=2000,            # Budget cap — injection truncated if exceeded
    priority=80,                # Higher = injected earlier, less likely to be compacted
)
async def inject_current_alarm_state(
    context: ToolContext,
    data: TenantDataConnector,
) -> str:
    alarms = await data.query("alarms", filters={"acknowledged": False})
    if not alarms:
        return ""  # Empty = don't inject anything, save tokens
    return f"CURRENT ACTIVE ALARMS ({len(alarms)}):\n" + format_alarms(alarms)


@context_provider(
    module_code="SCADA_MONITOR",
    trigger="keyword:conveyor,sorter,shuttle,aisle",  # Only when relevant
    max_tokens=3000,
    priority=70,
)
async def inject_equipment_config(
    context: ToolContext,
    data: TenantDataConnector,
) -> str:
    # Pull current equipment config from Engineering Configuration module
    # via Service Registry — agent gets equipment context without a tool call
    config = await service_registry.invoke(
        "ENGINEERING_CONFIG", "get_equipment_config",
        tenant_id=context.tenant_id, zone_id=context.zone_id,
    )
    return format_equipment_summary(config)

Trigger Types

TriggerWhen It RunsUse Case
always Every prompt assembly for this agent Live alarm state, current shift info, system health summary — always-relevant context
keyword:word1,word2 Only when the user's message contains one of the keywords Equipment config (only when asking about specific equipment), order details (only when asking about orders)
on_demand Only when explicitly requested by a skill or tool via context.request_injection("provider_name") Heavy data sets that should only load when a specific workflow needs them

How Context Providers Fit in Prompt Assembly

Prompt Assembly Order (updated):

1. Agent Persona                      ~800 tokens   (fixed)
2. Activated Skills (SKILL.md)        ~2,500 tokens (fixed per session)
3. Context Providers ← NEW            ~800-3,000    (dynamic, capped per provider)
   ├── inject_current_alarm_state     ~800 tokens   (trigger: always)
   └── inject_equipment_config        ~2,200 tokens (trigger: keyword match)
4. Tool Instructions                  ~1,500 tokens (from summary_instructions)
5. Security Context                   ~200 tokens   (fixed)
6. Recall Events                      ~400 tokens   (loaded from DB)
7. Conversation History               ~varies       (compacted as needed)
8. Tool Results                       ~varies       (retained per config)

Context Budget Controls

ControlDetail
max_tokens Hard cap per context provider function (not the LLM model's max tokens). If the function returns more than max_tokens, the output is truncated. Prevents a runaway query from blowing the context budget.
priority 0-100. Higher priority providers are injected first and compacted last. If context budget is tight, lower-priority providers are skipped entirely.
Empty return If the function returns "", nothing is injected and zero tokens are consumed. Providers should return empty when there's nothing relevant — no active alarms means no alarm context.
Budget forecasting The context forecaster includes provider max_tokens in the overhead calculation. If enabling a provider would push the budget past the compaction threshold, it's skipped with a log entry.

Declaring Context Providers in the Manifest

// In module.json — context_providers section
"context_providers": [
  {
    "name": "current_alarm_state",
    "trigger": "always",
    "max_tokens": 2000,
    "priority": 80
  },
  {
    "name": "equipment_config",
    "trigger": "keyword:conveyor,sorter,shuttle,aisle",
    "max_tokens": 3000,
    "priority": 70
  }
]
💡
The tradeoff is explicit: Context providers consume tokens on every call they trigger. An always provider with max_tokens: 3000 costs 3,000 tokens on every single LLM call for that agent. Use always for small, high-value context (alarm state, shift info). Use keyword for larger context that's only relevant sometimes (equipment configs, order details). Use on_demand for heavy data that should only load when explicitly needed. The max_tokens cap and budget forecasting ensure a provider can never blow the context budget.

Invocation-Time Context (Passed by Caller)

In addition to registered context providers, context can be passed in when the agent is invoked — by a user in chat, by a module's service layer, by the scheduler, or by another agent via dispatch. This covers three scenarios:

ScenarioHow Context ArrivesWhere It's Injected
User sends initial context User types a message with pasted data: error logs, equipment IDs, screenshots. This is the user message — it goes into conversation history as the first turn. Conversation history (Layer 8). The agent sees it immediately on the first LLM call.
System/module invokes agent with context A scheduler, webhook, or module service calls the agent programmatically with structured context: invoke_agent("SCADA_MONITOR", context={work_order_id: "WO-2026-0847", equipment: "Shuttle 7", trigger: "fault_5012"}) Injected as a system context block between context providers and conversation history. The agent starts with this data — no tool call needed to retrieve it.
Agent dispatches to agent with context A parent agent dispatches with a task description and structured data: dispatch_agents(agent="MAINTENANCE", task="Create WO", context={equipment: "Shuttle 7", fault: "5012", severity: "critical"}) Injected as the dispatched agent's initial context. The child agent starts with full situational awareness from the parent.
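A short sketch of the second and third rows in the table above, using the call shapes shown there; the exact signatures are illustrative.

# A scheduler or webhook handler invoking an agent with structured context:
session = await invoke_agent(
    "SCADA_MONITOR",
    context={
        "work_order_id": "WO-2026-0847",
        "equipment": "Shuttle 7",
        "trigger": "fault_5012",
    },
    message="Investigate the overcurrent fault and recommend next steps.",
)

# A parent agent dispatching to a child agent with structured context:
result = await dispatch_agents(
    agent="MAINTENANCE",
    task="Create WO",
    context={"equipment": "Shuttle 7", "fault": "5012", "severity": "critical"},
)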

Mid-Session User Messages

If a user sends additional messages while the agent is mid-iteration (processing a tool call), the message is queued and injected at the next iteration boundary — between the tool result and the next LLM call. The agent finishes its current tool execution, sees the new user message in the updated conversation history, and reasons about it on the next iteration. Messages are never injected mid-tool-call.

Iteration 2 in progress:
  Agent called read_multishuttle_status() → waiting for Bridge response...

  User sends: "Also check Line 4, it's been making noise"
    → Message queued

  Bridge returns tool result for Shuttle 7
  → Queued user message injected into conversation history
  → Iteration 3 LLM call sees BOTH the tool result AND the new user message
  → Agent can address both in its next response

Per-Agent Context Configuration

Different agents have different context needs. A simple alarm acknowledger needs tight context and fast compaction. A diagnostic agent investigating a complex equipment failure might need the full context window with everything preserved. All configurable in the manifest:

Config FieldPurposeExample Values
max_context_tokens Hard cap on context usage — can be less than the model's max. Controls cost and latency. 32000 (tight, fast agent) · 180000 (standard) · 1000000 (full 1M for deep analysis)
compaction_threshold_pct When dynamic content hits this % of usable budget, trigger compaction. 50 (aggressive, compact early) · 70 (standard) · 90 (preserve as much as possible)
compaction_strategy How to compact when threshold is hit. truncate (fast, drop oldest) · summarize (LLM summarizes older content) · preserve_tools (compact history but keep all tool results)
max_iterations Maximum tool-calling loop iterations before forced stop. 5 (simple tasks) · 15 (standard) · 30 (complex diagnostics)
tool_result_retention How many recent tool results to keep in full. Older results are truncated to a summary. 2 (aggressive) · 5 (standard) · all (never truncate tool results)
memory_mode Cross-session memory behavior. none · automatic · housekeeping

Context Profiles (Pre-Built Configurations)

ProfileContextCompactionIterationsTool ResultsBest For
Tight 32K 50% / truncate 5 Keep last 2 Simple alerts, acknowledgments, quick lookups. Fast and cheap.
Standard 180K 70% / summarize 15 Keep last 5 Most module agents. Alarm triage, work order management, reporting queries.
Deep Analysis 500K+ 90% / preserve_tools 30 Keep all Engineering diagnostics, multi-equipment investigations. Agent needs full history of every tool result to reason across them.
Maximum 1M 95% / preserve_tools 30 Keep all Complex cross-module analysis. Full conversation + all tool results + all skill content preserved. Cost-intensive — use selectively.

Context Forecasting

Before each iteration, the executor forecasts whether the next LLM call + expected tool results will fit in the remaining budget. This prevents mid-iteration failures where the context exceeds the model's limit.

Forecast InputSource
Current context sizePlatform tuned token calculator (calibrated per provider via drift tracking)
Expected tool result sizetoken_cost from Tool Registry (schema tokens + historical average result tokens)
Expected LLM response sizeBased on max_output_tokens from model config or historical average for this agent
Remaining iterationsmax_iterations minus current iteration count

If the forecast shows the next iteration won't fit, the executor compacts before calling the LLM — not after it fails. If compaction still isn't enough, tools are stripped on the next call (final-iteration behavior) to free context for the response.
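A simplified sketch of that forecast check follows; the helper names (measure_dynamic_context, expected_result_tokens) are assumptions standing in for the tuned calculator and Tool Registry lookups.

# Illustrative pre-iteration forecast check, not the platform implementation.
def should_compact_before_call(session, agent_cfg) -> bool:
    # Measured with the platform's tuned token calculator
    dynamic_now = measure_dynamic_context(session)             # skills, providers, history, tool results
    expected_results = sum(expected_result_tokens(t) for t in session.likely_tools)
    expected_response = agent_cfg.max_output_tokens

    usable = agent_cfg.max_context_tokens - session.fixed_overhead_tokens
    threshold = usable * agent_cfg.compaction_threshold_pct / 100

    forecast = dynamic_now + expected_results + expected_response
    return forecast > threshold                                # compact before calling the LLM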

💡
The tuned calculator is critical here. Context forecasting uses the same platform token calculator that's calibrated against provider-reported actuals (Section 13, Token Usage Tracking). As drift tracking tunes the calculator per provider, context forecasting becomes more accurate. The calculator, the forecaster, and the provider token reports form a feedback loop that improves over time.

22. Memory & Context

Memory Modes

ModeCompactionTodosRecallBest For
none No compaction No todos No recall One-shot tools, scheduled polls, stateless operations
automatic At threshold % Agent-managed No recall Standard module agents, conversational assistants
housekeeping At threshold % Agent + system failover Per-user/tenant DB events Shift handoffs, multi-session workflows, operator training

Compaction Strategies

StrategyHow It WorksToken CostBest For
truncate Drop oldest messages beyond the retention window. Fast, mechanical. No additional LLM call. Zero — just removes content Agents where old conversation turns have no value (alert handlers, simple lookups)
summarize Use the tenant's fast/cheap model to summarize older messages into a condensed paragraph. Preserves semantic meaning at reduced token cost. One fast-model LLM call per compaction Conversational agents where older context matters but doesn't need to be verbatim
preserve_tools Compact conversation history (summarize or truncate) but keep all tool results in full. The agent may need to reference earlier diagnostic data, sensor readings, or query results. Minimal — only conversation is compacted Engineering diagnostics, complex investigations where tool results are the primary reasoning context

What Gets Compacted vs. What's Protected

Context ComponentCompaction Behavior
System prompt + personaNever compacted — always present in full
Security contextNever compacted — tenant, permissions always present
Tool definitionsNever compacted until final iteration (then stripped to force summary response)
Active skill contentRarely compacted — skills are the agent's knowledge. Only deactivated if context is critically low.
Context providersSkipped by priority — if budget is tight, lower-priority providers are skipped entirely. Higher-priority providers kept. Controlled by priority field (0-100).
Recall eventsRarely compacted — small token footprint, high value
Recent messages (last 3-5)Protected — always kept in full for conversational continuity
Older messagesFirst to compact — truncated or summarized per strategy
Tool results (recent)Protected — kept per tool_result_retention setting
Tool results (older)Compacted — truncated to a summary (name, status, key values) unless preserve_tools strategy

Recall Events

Recall events are stored per user per tenant. Night shift discoveries are injected into day shift agent context. Categories: action, preference, correction, discovery, error, tool_pattern. Generated at session end by the tenant's fast model — no added latency during execution.

Context management is per-agent, not one-size-fits-all. A simple alarm acknowledger runs with 32K context and aggressive compaction — fast and cheap. An engineering diagnostic agent gets 1M context with preserve_tools strategy — every tool result preserved for cross-referencing. The platform's tuned token calculator forecasts usage, the compaction strategy manages overflow, and the context profiles give module developers sensible defaults to start from.

Connectivity — Bridge, Connectors, Service Registry, Real-Time

23. The Bridge

The Bridge is a native Rust application deployed on-premises per tenant/site. It installs as a headless service (CLI install for servers) or as a system tray application with a monitoring and configuration interface. The Rust core provides high-performance WebSocket communication, native hardware access (OPC-UA, Modbus, serial), and low-level system integration. An embedded Python runtime executes module tool handlers and scripts — giving module developers the flexibility of Python while the Rust shell handles networking, security, and hardware-level I/O.

The Bridge provides three core capabilities: SCADA data ingestion from PLCs and sensors, ETL pipeline execution for local data processing, and AI or deterministic processing at the edge. It executes module tools locally and transmits only contracted data upstream to the tenant's data store. The native Rust architecture also positions the Bridge for local model inference in the future — running lightweight AI models at the edge without cloud round-trips.

PhaseDetail
RegistrationBridge registers with platform via API key. Assigned to a tenant + optional site label.
Tool SyncDownloads bridge tool definitions for modules the tenant subscribes to. Installs tool handlers locally.
SCADA IngestionConnects to PLCs (Modbus/OPC-UA) and sensor networks. Collects, buffers, and preprocesses operational data locally. Publishes enriched events to the Streaming Service (Section 19) via persistent WebSocket topics.
ETL ProcessingRuns ETL pipelines locally — extract from SCADA/historian, transform and validate, load contracted data to tenant data store.
AI / DeterministicExecutes AI inference or deterministic logic at the edge — anomaly detection, threshold alerts, predictive models — without round-tripping to cloud.
Tool ExecutionReceives tool calls from platform agents via WebSocket. Validates locally (defense in depth — see Section 6, Bridge Security). Executes against local systems. Returns results.
Data ContractOnly data specified in the tenant's contract is transmitted upstream. Raw SCADA streams, unprocessed sensor data, and intermediate ETL state remain on-site.
HeartbeatEvery 30s: uptime, connected PLCs, active tools, pipeline status, resource usage.
Credential VaultLocal credentials (PLC passwords, DB connections, API keys). Never transmitted to cloud.
Offline ModeContinues SCADA ingestion, ETL, and local processing if disconnected. Queues contracted data. Reconnects and syncs automatically.

Module Bridge Deployment

When a module declares bridge_tools in its manifest, it's stating: "this module needs on-premises execution capabilities." The platform handles deployment to the tenant's bridges automatically.

1

Module Declares Bridge Requirements

The manifest's bridge_tools section lists the tools that execute on-premises. The module's bridge/ directory contains the actual handlers and scripts that run on the bridge. This is the deployable package.

2

Tenant Subscribes to Module

If the tenant has registered bridges: the platform queues a tool sync for each bridge. Bridge tool definitions, handlers, and scripts are packaged and staged for deployment. If the tenant has no bridges: the module loads in cloud-only mode. Bridge tools are marked as unavailable. The Tenant Admin sees a notice: "This module has on-premises capabilities that require a Bridge connection."

3

Bridge Syncs on Connection

When a bridge connects (or reconnects), it pulls the latest tool definitions and handlers for all modules the tenant subscribes to. The bridge installs the module's bridge-side packages locally — Python tool handlers, configuration files, and dependencies into the embedded Python runtime. The bridge_tool_sync table tracks what's deployed to each bridge and which version.

4

Module Updates Propagate Automatically

When a module is updated through the pipeline, the platform stages new bridge tool packages. Connected bridges receive a sync notification and pull the update. Disconnected bridges sync when they reconnect. The bridge never runs stale tool versions after reconnection.

Multi-Bridge Tenants

A tenant can have multiple bridges — one per distribution center, one per zone, or any combination. Each bridge registers with a site identifier that the platform uses to route tool calls to the correct on-premises location.

Tenant A: Kroger
├── Bridge: Cincinnati DC (site_id: "cincinnati-dc")
│   ├── PLCs: Aisles A1-A6, Conveyor Lines 1-8
│   ├── Synced modules: SCADA, Maintenance, ETL
│   └── Local equipment: 6 multishuttle aisles, 8 conveyor lines
│
└── Bridge: Atlanta DC (site_id: "atlanta-dc")
    ├── PLCs: Aisles A1-A4, Sorter Bay A-C
    ├── Synced modules: SCADA, Maintenance, ETL
    └── Local equipment: 4 multishuttle aisles, 3 sorter bays

Both bridges sync the same SCADA module tools. Each bridge only executes
against its own local equipment.

Tool Call Routing

When an agent calls a bridge tool, the platform routes to the correct bridge using the site context in the tool parameters. Module tools that need bridge execution should include a site or equipment identifier in their parameters so the platform can route correctly.

ScenarioHow It Routes
Tool call includes site read_multishuttle_status(aisle="A3", site_id="cincinnati-dc") → routed to Cincinnati bridge. The site_id parameter maps to a registered bridge. Unambiguous.
Tool call includes equipment ID read_plc_register(plc_id="PLC-CIN-001", register="R4000") → platform looks up which bridge owns PLC-CIN-001 from the equipment registry. Routed to Cincinnati bridge.
Tenant has only one bridge All bridge tool calls route to the single bridge. No site_id needed — the platform resolves automatically.
Tenant has multiple bridges, no site specified The platform returns an error to the agent: "Tenant has 2 bridges. Specify site_id: cincinnati-dc or atlanta-dc." The agent asks the user or infers from context (e.g., the user said "check Aisle A3 in Cincinnati").
Bridge is disconnected Tool call fails with a clear error: "Bridge cincinnati-dc is offline (last heartbeat: 14 minutes ago)." The agent can inform the user and suggest alternatives or retry.
💡
Equipment-to-bridge mapping. When a bridge registers, it reports its local equipment inventory — which PLCs, which aisles, which zones it manages. The platform stores this mapping in bridge_instances. Tool calls that reference specific equipment (PLC IDs, aisle IDs, zone IDs) can be routed by equipment lookup, not just by explicit site_id. The agent doesn't need to know the bridge topology — it says "read PLC-CIN-001" and the platform routes to the right bridge.
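A sketch of that routing decision follows, with illustrative registry calls standing in for the bridge_instances and equipment-registry lookups described above.

# Illustrative routing sketch; registry methods and error type are assumptions.
class ToolRoutingError(Exception):
    pass

async def resolve_bridge(tenant_id: str, params: dict, registry) -> str:
    bridges = await registry.list_bridges(tenant_id)

    if "site_id" in params:                          # explicit site wins
        return params["site_id"]
    if "plc_id" in params:                           # equipment-to-bridge mapping
        return await registry.bridge_for_equipment(tenant_id, params["plc_id"])
    if len(bridges) == 1:                            # single-bridge tenant
        return bridges[0].site_id

    sites = ", ".join(b.site_id for b in bridges)    # ambiguous: ask the agent to specify
    raise ToolRoutingError(f"Tenant has {len(bridges)} bridges. Specify site_id: {sites}")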

Tool Execution Engine

The Tool Execution Engine is the runtime that processes every tool call — whether initiated by a platform agent, a scheduled job, or a connector event. It handles the full lifecycle: receive the call, resolve execution context (cloud or bridge), enforce security, execute, handle errors, and return results. The same engine runs in both the cloud platform and the Bridge.

Agent calls tool: read_multishuttle_status(aisle_id="A3")
        │
        ▼
┌─ TOOL EXECUTION ENGINE ──────────────────────────────────────
│
│ 1. RESOLVE    │ Look up tool definition from module registry
│               │ Determine execution mode: cloud or bridge
│
│ 2. AUTHORIZE  │ @platform_tool security stack:
│               │ ├── Tenant context (RLS)
│               │ ├── Permission check (SCADA.READ_SENSORS)
│               │ ├── Agent allow/deny list
│               │ ├── Risk level evaluation
│               │ └── Approval gate (if requires_approval)
│
│ 3. ROUTE      │ Cloud tool → execute on platform server
│               │ Bridge tool → route to tenant's Bridge via WS
│
│ 4. EXECUTE    │ Run the tool handler
│               │ ├── Cloud: TenantDataConnector query/insert
│               │ └── Bridge: local PLC/SCADA/ETL execution
│
│ 5. VALIDATE   │ Guardrails post-condition check
│               │ ├── Required fields present?
│               │ ├── Values in expected range?
│               │ └── Success criteria met?
│
│ 6. RESPOND    │ Return result to agent
│               │ Apply error_policy on failure:
│               │ ├── retry_once → re-execute
│               │ ├── escalate_to_human → SSE approval_request
│               │ ├── halt_and_notify → stop agent, alert
│               │ └── fallback_tool → try alternate tool
│
│ 7. AUDIT      │ Log per tool's audit level
│               │ (none / standard / full)
└──────────────────────────────────────────────────────────────

Bridge Execution Detail

When a tool's execution is set to "bridge", the Tool Execution Engine routes the call to the tenant's on-premises Bridge via WebSocket. The Bridge runs its own local instance of the execution engine with the same security validation (defense in depth).

Platform (Cloud)
────────────────
Agent calls read_multishuttle_status(aisle="A3")
  │
  ▼
Tool Execution Engine
  ├── Authorize (permission, agent policy)
  ├── Route → bridge (execution="bridge")
  │
  │   ── WebSocket (TLS) ──→  tool_call:
  │                             name: read_multishuttle
  │                             params: {aisle: "A3"}
  │                             tenant_id: tenant-a
  │                             request_id: req_8f3a...
  │
  │        Bridge (On-Premises)
  │        ────────────────────
  │        Bridge Execution Engine
  │          ├── Validate locally
  │          ├── Connect to PLC (OPC-UA / Modbus TCP)
  │          ├── Read registers:
  │          │     Aisle A3, Shuttle 1-12
  │          │     Position, load, speed, fault codes, cycle count
  │          ├── Process locally:
  │          │     Calculate utilization %
  │          │     Check fault thresholds
  │          │     Flag anomalies
  │          │
  │   ←── WebSocket (TLS) ──  tool_result:
  │                             request_id: req_8f3a...
  │                             status: success
  │                             data: {contracted results}
  │
  ▼
Agent receives result
  └── Continues iteration

Example: Dematic Multishuttle Aisle Health Check

A maintenance operator asks the platform agent: "How is Aisle A3 performing? Any shuttles showing problems?"

The agent orchestrates multiple tool calls through the execution engine:

Step 1 — Read Shuttle Status (Bridge Tool)

@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
    execution="bridge",
    risk_level="safe",
    audit="standard",
    summary_instructions="Read real-time status of all shuttles in a multishuttle aisle. Returns position, load status, speed, fault codes, and cycle counts per shuttle.",
)
async def read_multishuttle_status(
    context: ToolContext,
    aisle_id: str,
) -> dict:
    # Routed to Bridge → PLC via OPC-UA
    return await context.bridge.execute("read_multishuttle", {
        "aisle_id": aisle_id,
        "read_type": "full_status",
    })

Bridge executes locally: Connects to the aisle PLC via OPC-UA, reads registers for all 12 shuttles (position, load, speed, fault code, cycle count, motor temp), calculates utilization percentages, flags any shuttles with active faults or degraded performance. Returns contracted summary.

Step 2 — Query Historical Performance (Cloud Tool via TenantDataConnector)

@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
    execution="cloud",
    risk_level="safe",
    summary_instructions="Query historical performance data for an aisle. Use to compare current state against trends and detect degradation.",
)
async def query_aisle_history(
    context: ToolContext,
    data: TenantDataConnector,
    aisle_id: str,
    days: int = 7,
) -> dict:
    # Runs against tenant's PostgreSQL database
    return await data.aggregate(
        table="scada_readings",
        filters={"aisle_id": aisle_id, "equipment_type": "multishuttle"},
        group_by=["shuttle_id"],
        metrics=["avg(cycle_time)", "count(fault_events)", "avg(motor_temp)"],
        time_range={"column": "timestamp", "last_days": days},
    )

Cloud executes against tenant data store: Queries 7-day historical averages per shuttle — cycle time trends, fault frequency, motor temperature averages. The agent compares real-time Bridge data against historical baselines to identify degradation.

Step 3 — ETL: Log Current Reading (Bridge Tool, Deterministic)

@platform_tool(
    module_code="SCADA_MONITOR",
    permission_code="SCADA.READ_SENSORS",
    execution="bridge",
    risk_level="safe",
    audit="none",
    summary_instructions="Write the current aisle snapshot to the local Bridge historian and queue contracted metrics for upstream sync.",
)
async def log_aisle_snapshot(
    context: ToolContext,
    aisle_id: str,
    snapshot: dict,
) -> dict:
    # Bridge stores full snapshot locally (historian)
    # Queues only contracted metrics for upstream ETL
    return await context.bridge.execute("log_snapshot", {
        "aisle_id": aisle_id,
        "snapshot": snapshot,
        "upstream_fields": ["utilization_pct", "active_faults", "avg_cycle_time"],
    })

Bridge processes locally: Full 12-shuttle snapshot stored in the Bridge's local historian database. Only contracted metrics (utilization %, active fault count, average cycle time) are queued for the next ETL sync to the tenant's data store. Raw PLC register values, individual motor temps, and position data stay on-site.

Step 4 — Agent Dispatches to Maintenance (if faults detected)

# If the SCADA agent detects a degrading shuttle, it dispatches to Maintenance
dispatch_agents → MAINTENANCE agent
  Task: "Shuttle 7 in Aisle A3 shows motor temp trending 12% above baseline
         over 7 days. Current temp 78°C vs 63°C avg. Fault code 5012
         (motor overcurrent) triggered twice in last 24h.
         Recommend preventive work order before failure."

MAINTENANCE agent creates:
  WO-2026-0847
  Priority: High
  Type: Preventive
  Equipment: Multishuttle Aisle A3, Shuttle 7
  Assigned to: Next available technician

Agent response to operator: "Aisle A3 is running at 91% utilization. 11 of 12 shuttles are healthy. Shuttle 7 is showing elevated motor temperature (78°C vs 63°C baseline) with two overcurrent faults in the last 24 hours. I've created preventive work order WO-2026-0847 to address it before it fails. All other shuttles are within normal parameters."

What Stays On-Site vs. What Goes Upstream

DataLocationReason
Raw PLC register values (40001-49999) Bridge only High-frequency, high-volume, only meaningful to local diagnostics
Individual motor temperatures per shuttle Bridge only Stored in local historian for on-site troubleshooting
Shuttle position and speed data Bridge only Real-time operational data, no analytical value upstream
Intermediate ETL state and buffers Bridge only Processing artifacts, not business data
Aisle utilization percentage Upstream Contracted KPI — needed for reporting and dashboards
Active fault count and codes Upstream Contracted — triggers alerts and maintenance workflows
Average cycle time per shuttle Upstream Contracted KPI — performance trending
Work orders created by agent Upstream Business data — stored in tenant data store via Maintenance module

24. Streaming Service

The Streaming Service is a dedicated, independently scaled service that handles real-time data flowing between Bridges and frontend clients. It is not part of the Backend (FastAPI) — it runs as a separate Cloud Run or GKE service optimized for persistent connections and high-throughput data streams. The Backend handles CRUD, auth, and agent orchestration. The Streaming Service handles live operational data.

💡
Why a separate service: A Dematic distribution center can generate thousands of SCADA events per minute — barcode scans, conveyor speeds, motor currents, divert confirmations, fault codes. This traffic would overwhelm an HTTP request/response API. The Streaming Service is purpose-built for persistent WebSocket connections, topic-based routing, and live state caching.

Three Services, Three Responsibilities

ServiceTechnologyHandlesScale Pattern
Backend Python / FastAPI, Cloud Run Module APIs, auth, admin, CRUD, agent orchestration Auto-scale on HTTP request count. Stateless.
Streaming Service Python or Go, Cloud Run or GKE Live SCADA data from Bridges, real-time state cache, WebSocket fan-out to frontends, topic routing Scale on connection count + message throughput. Stateful (holds live state cache).
Agent System Part of Backend Agent execution, tool dispatch. Can subscribe to Streaming Service topics for real-time agent reactions (e.g., auto-triage on fault events). Scale with Backend. Async processing.

Inbound: Bridge → Streaming Service

Each Bridge maintains a persistent WebSocket connection to the Streaming Service. Data flows through tenant-scoped topics:

# Topic format: {tenant_slug}/{facility_id}/{category}/{subcategory}

# SCADA events — high frequency, enriched at the Bridge
tenant-a/facility-1/scada/barcode_events      # 2,400+ events/min per facility
tenant-a/facility-1/scada/equipment_status     # Updated every 1-5s per equipment
tenant-a/facility-1/scada/conveyor_metrics     # Speed, motor amps, throughput
tenant-a/facility-1/scada/faults               # Fault codes, severity, timestamps

# ETL metrics — contracted KPIs from Bridge processing
tenant-a/facility-1/etl/metrics                # Throughput, utilization, fault counts

# Equipment telemetry — sensor data streams
tenant-a/facility-1/telemetry/sorter-bay-a     # Shuttle positions, cycle times, temps
tenant-a/facility-1/telemetry/conveyor-main    # Belt speed, motor current, photoeyes

Live State Cache

The Streaming Service maintains an in-memory live state cache (backed by Redis or Memorystore) per tenant. This is the current truth of what's happening on the facility floor — not historical data, not analytics, just right now.

State TypeWhat It HoldsUpdate Frequency
Equipment State Current status, speed, motor amps, utilization %, active faults, cycles today — per conveyor line, sorter bay, shuttle, AGV Every 1-5 seconds per equipment
Zone Overview Aggregated KPIs per zone: throughput/min, active equipment count, fault count, items in transit Computed from equipment state, updated continuously
Facility Summary Roll-up of all zones: total throughput, overall utilization, critical alerts, system health score Computed from zone data, updated continuously
Active Alerts Current unacknowledged faults with equipment ID, fault code, severity, duration Added on fault event, removed on acknowledge
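As one possible shape for the tenant-keyed cache described above, here is a sketch using redis.asyncio; the key layout mirrors the topic format and is an assumption, not the service's actual schema.

# Illustrative tenant-keyed live state cache (Redis / Memorystore).
import json
import redis.asyncio as redis

r = redis.Redis()  # Memorystore / Redis endpoint in production

async def update_equipment_state(tenant: str, facility: str, equipment_id: str, state: dict):
    key = f"live:{tenant}:{facility}:equipment:{equipment_id}"
    await r.set(key, json.dumps(state))              # current value, overwritten on each update

async def get_zone_overview(tenant: str, facility: str, zone_equipment: list[str]) -> list[dict]:
    keys = [f"live:{tenant}:{facility}:equipment:{e}" for e in zone_equipment]
    values = await r.mget(keys)                      # sub-millisecond reads
    return [json.loads(v) for v in values if v]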

Outbound: Streaming Service → Frontend

Frontend modules connect to the Streaming Service via WebSocket and subscribe to tenant-scoped topics. The Streaming Service enforces tenant isolation at the topic level — a frontend authenticated as Tenant A cannot subscribe to Tenant B topics.

Module Use CaseSubscribes ToWhat It Renders
Live Operations Dashboard {tenant}/*/scada/equipment_status Real-time facility overview — conveyor speeds, sorter utilization, active faults, throughput KPIs. The "command center" view.
Engineering Support Tool {tenant}/{facility}/telemetry/* Live PLC data streams for remote troubleshooting — motor currents, sensor states, cycle times. Engineers diagnose issues without being on-site.
Customer Analytics {tenant}/*/etl/metrics Live KPI dashboard for the tenant — throughput trends, SLA tracking, equipment health scores. Customer-facing operational insights.
Maintenance Module {tenant}/*/scada/faults Real-time fault feed. Auto-populates work order queue. Shows fault duration and escalation status.
Agent System {tenant}/*/scada/faults (filtered by severity) Critical faults trigger agent runs automatically — alarm triage skill activates, diagnostics run, work orders created.

Topic Routing & Tenant Isolation

Security at the Stream Level

  • Topic prefix = tenant isolation. Every topic starts with the tenant slug. The Streaming Service validates the frontend's JWT tenant claim against the topic prefix. Mismatch = connection rejected (see the sketch after this list).
  • Bridge authentication. Each Bridge authenticates with its API key. The Streaming Service maps the Bridge to its registered tenant. A Bridge can only publish to its own tenant's topics.
  • Subscribe permissions. Module frontends can only subscribe to topics for modules the tenant subscribes to. No SCADA subscription if the tenant doesn't have the SCADA module.
  • No cross-tenant fan-out. A message published to tenant-a/... is never delivered to a tenant-b subscriber. Topic isolation is enforced at the routing layer.
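A minimal sketch of the first rule above (topic prefix = tenant isolation); the claim and parameter names are assumptions.

# Illustrative subscription check: the JWT tenant claim must match the topic prefix.
def authorize_subscription(jwt_claims: dict, topic: str) -> bool:
    # Topic format: {tenant_slug}/{facility_id}/{category}/{subcategory}
    tenant_slug = jwt_claims["tenant_slug"]
    return topic.startswith(f"{tenant_slug}/")       # mismatch → subscription rejected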

Data Flow: What Goes Where

DataStreaming ServiceTenant Data StoreBridge Local
Live equipment state Live cache — current values, overwritten on each update Full history in local historian
Contracted KPI metrics Pass-through — routes to Tenant Data Connector Stored — time-series for analytics Full resolution in local historian
Fault events Live + fan-out — to dashboards + Agent System Stored — fault history for trending Full detail in local log
Raw PLC register values Not streamed Not stored Local only — high frequency, no upstream value
Barcode scan events Live state — current item locations Stored — if contracted for tracking Full scan log in local historian

Scaling

DimensionCurrent Design TargetScale Mechanism
Concurrent Bridge connections 100-500 (one per tenant/facility) Each Streaming Service instance handles ~200 persistent WebSocket connections. Auto-scale on connection count.
Inbound event throughput 50K-200K events/minute across all tenants Topic-based sharding. High-volume tenants can be pinned to dedicated instances.
Frontend WebSocket connections 2,000+ concurrent dashboards Fan-out: one Bridge event → N frontend subscribers. Connection load is the bottleneck, not message size.
Live state cache size ~10KB per equipment × 1000s of equipment Redis/Memorystore. Sub-millisecond reads. Tenant-keyed for isolation.
The power of this design: Any module can build a live dashboard by subscribing to streaming topics. A SCADA module shows live conveyor speeds. A maintenance module shows live fault feeds. An engineering support module shows live PLC telemetry for remote troubleshooting. A customer-facing analytics module shows live KPIs. The Streaming Service is the data bus — modules decide what to show.

25. Integrations as Modules

Integrations are not a separate platform feature — they are modules. An integration follows the exact same Module SDK: module.json manifest, @platform_tool decorators, tenant settings, Service Registry exposure. The only difference is that an integration module's primary purpose is connecting to an external system and exposing that connection as services other modules can consume.

Why this matters: No separate integration framework needed in the platform. No special integration_configurations table. No custom connector code. An integration module uses tenant_settings for per-tenant connection config (API keys, endpoints, credentials), exposes services via the Service Registry, and other modules call those services. The platform doesn't need to know it's an "integration" — it's just another module.

Integration Module Example: Control Tower

app/modules/control_tower/
├── module.json              # Standard manifest
├── api.py                   # Health check, connection status endpoints
├── service.py               # WebSocket connection manager
├── schemas.py               # Request/response validation
├── services.py              # Registers services other modules can call
├── ai/
│   ├── agent.json          # Agent for troubleshooting connection issues
│   └── tools.py            # @platform_tool: query_control_tower, push_metrics
└── tests/
// module.json for an integration module
{
  "module_code": "CONTROL_TOWER",
  "module_name": "Control Tower Integration",
  "module_description": "Connects to Dematic Control Tower via WebSocket for real-time operational data exchange",
  "api_base_path": "/api/v1/control-tower",

  // Per-tenant connection settings — each tenant configures their own endpoint
  "tenant_settings": {
    "websocket_endpoint": {"type": "string", "required": true},
    "api_key": {"type": "secret", "required": true},
    "poll_interval_seconds": {"type": "integer", "default": 30}
  },

  // No data_tables needed — this module doesn't store business data
  // It connects and exposes services for other modules to use

  "permissions": [
    {"code": "CONTROL_TOWER.READ",  "description": "Read data from Control Tower"},
    {"code": "CONTROL_TOWER.PUSH",  "description": "Push metrics to Control Tower"}
  ]
}
# The integration module exposes services via Service Registry
# app/modules/control_tower/services.py

def register_services():
    service_registry.register(
        module_code="CONTROL_TOWER",
        service_name="query_equipment_status",
        func=query_equipment_status,
        permission_code="CONTROL_TOWER.READ",
    )
    service_registry.register(
        module_code="CONTROL_TOWER",
        service_name="push_kpi_metrics",
        func=push_kpi_metrics,
        permission_code="CONTROL_TOWER.PUSH",
    )

# Any other module can now call:
# service_registry.invoke("CONTROL_TOWER", "query_equipment_status", ...)

How Other Modules Consume Integrations

Pattern | How It Works | Example
Service Registry call Module tool calls the integration module's registered service. Permission-checked automatically. SCADA module calls CONTROL_TOWER.query_equipment_status to get real-time conveyor data.
Agent dispatch Module agent dispatches to the integration module's agent for complex queries or troubleshooting. ETL agent dispatches to CONTROL_TOWER agent: "Pull last 24h of throughput data for all sorters."
Scheduler trigger Integration module registers a schedule handler that polls the external system on an interval. CONTROL_TOWER module polls every 30s, pushes updates via Service Registry to subscribed modules.
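
For the scheduler-trigger row above, a polling handler inside the integration module might look roughly like this. It is a sketch under assumptions: register_schedule, context.core, and cache_latest_status are illustrative names, not the SDK's confirmed API.

# Sketch only — register_schedule, context, and cache_latest_status are assumed names.
async def poll_control_tower(context):
    settings = await context.core.get_module_settings()          # websocket_endpoint, api_key, ...
    status = await query_equipment_status(context, scope="all")  # this module's own connection logic
    await cache_latest_status(context.tenant_id, status)         # consumers read it via the Service Registry

register_schedule(
    module_code="CONTROL_TOWER",
    handler=poll_control_tower,
    interval_seconds=30,          # mirrors the poll_interval_seconds tenant setting
)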

Integration Module Types

Type | Connection Style | Examples
WebSocket | Persistent bidirectional connection, managed by the module's service layer | Control Tower, real-time sensor feeds
REST API | On-demand HTTP calls via module tools | ServiceNow, Salesforce, ERP systems
Webhook | Module exposes an API endpoint that receives inbound events | Alert receivers, CI/CD callbacks
MQTT | Pub/sub managed by the module's service layer | IoT sensor networks, equipment telemetry
Database | Module tools query external databases via connection in tenant settings | BigQuery analytics, historian databases
💡
The pattern: Build an integration the same way you build any module. module.json for identity and tenant settings (connection config). services.py to register services. ai/tools.py for agent-callable tools. Other modules consume via Service Registry or agent dispatch. The platform validates, loads, and secures it like every other module. No special integration framework required.

26. Service Registry

Modules never import each other's code or query each other's tables directly. The Service Registry brokers all cross-module calls with permission checking. Agent dispatch uses the same registry.

# SCADA registers a service
service_registry.register(
    module_code="SCADA_MONITOR",
    service_name="get_zone_alarm_status",
    func=get_zone_alarm_status,
    permission_code="SCADA.READ_SENSORS",
)

# Maintenance module calls it (permission auto-checked)
result = await service_registry.invoke(
    "SCADA_MONITOR", "get_zone_alarm_status",
    tenant_id=context.tenant_id, zone_id=zone_id,
)
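
Conceptually the registry is small: a lookup keyed by module and service name, with a permission check before the call. A minimal sketch, assuming an illustrative check_permission helper; this is not the platform's actual implementation:

# Minimal conceptual sketch of a permission-checked registry — not the platform code.
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class _ServiceEntry:
    func: Callable[..., Awaitable[Any]]
    permission_code: str

class ServiceRegistry:
    def __init__(self):
        self._services = {}                                  # (module_code, service_name) -> entry

    def register(self, module_code, service_name, func, permission_code):
        self._services[(module_code, service_name)] = _ServiceEntry(func, permission_code)

    async def invoke(self, module_code, service_name, *, tenant_id, **kwargs):
        entry = self._services[(module_code, service_name)]
        # Caller must hold the target module's permission within this tenant
        if not await check_permission(tenant_id, entry.permission_code):   # assumed helper
            raise PermissionError(entry.permission_code)
        return await entry.func(tenant_id=tenant_id, **kwargs)
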
📏
The rule: Platform data (users, roles, settings) → query via CoreDataService. Your own module data → query via TenantDataConnector. Another module's data → go through the Service Registry. Agent dispatch → same registry.

27. SSE & Real-Time

Category | Events
Agent Core | text, thinking, tool_call_start/delta/end, tool_result, token_usage, context_info, error, done
Agent Dispatch | agent_start, agent_text, agent_tool_call, agent_tool_result, agent_complete
Approval | approval_request, approval_response
Bridge | bridge_tool_start, bridge_tool_result, bridge_status, bridge_heartbeat
Platform | todo_update, guardrails_check/failed, media_progress, schedule_triggered
Connector | connector_event, sensor_alert, etl_progress
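
On the wire these are standard SSE events. A rough sketch of how a module endpoint could emit them using FastAPI's StreamingResponse (the agent_executor handle and route path are illustrative assumptions):

# Sketch of an SSE endpoint emitting the event types above; names are illustrative.
import json
from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()

@router.get("/agent/runs/{run_id}/events")
async def stream_run_events(run_id: str):
    async def event_source():
        async for event in agent_executor.stream(run_id):      # assumed executor interface
            yield f"event: {event['type']}\ndata: {json.dumps(event['data'])}\n\n"
        yield "event: done\ndata: {}\n\n"
    return StreamingResponse(event_source(), media_type="text/event-stream")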

Platform Administration — Internal tooling for tenant management, module lifecycle, and observability

28. Admin Console

The platform has two admin interfaces: the Platform Admin Console (for Dematic platform administrators managing the entire system) and the Tenant Admin (for customer administrators managing their own tenant). Both are built on the same platform shell — the difference is scope and permissions.

Platform Admin (Dematic)

  • Cross-tenant — sees all tenants, all modules, all agents
  • Bypasses RLS where necessary
  • Manages the module catalog, pipeline, and deployment
  • Model Registry, provider config, platform health
  • Restricted to Dematic platform administrators

Tenant Admin (Customer)

  • Single-tenant — sees only their own tenant
  • Full RLS enforced
  • Subscribes to modules, configures settings, manages users & roles
  • Registers bridges, configures connectors
  • Assigned by the platform admin during tenant provisioning

Module Management (Platform Admin)

The Module Management section of the Platform Admin Console is the central hub for the module lifecycle — from upload through validation to production deployment. Every module that loads on the platform is visible here.

Capability | Detail
Module Catalog View all modules on the platform: name, version, status (ACTIVE, STAGING, DEPRECATED), subscriber count, last updated. Filter by status, category, or search by name. This is the master list of everything available to tenants.
Module Detail View Deep inspection of a single module: manifest contents, registered tools (with security policies), skills (with SKILL.md preview), permissions declared, data tables schema, UI entry point, frontend dependencies, backend dependencies, connectors, scheduler jobs, and current subscribers.
Module Pipeline Upload new modules or updates via drag-and-drop (.mpkg) or CLI. Track validation progress in real time. View validation reports with pass/fail per step. Promote from staging to production or reject with notes. See Section 29 for pipeline detail.
Module Configuration (Platform-Level) View and edit platform_settings for each module — the defaults and constraints that apply to all tenants. Lock settings that tenants cannot override. Set allowed ranges for tenant-configurable fields.
Module Permissions Inspector View all permission codes a module declares. See which standard role templates each permission maps to by default. View across all tenants: which tenants have customized the default role mappings and how.
Module Dependencies View a module's requirements.txt (backend) and ui/package.json (frontend) dependencies. Flag known vulnerabilities. Track dependency versions across all active modules for compatibility.
Version Management View version history for each module. Compare manifests between versions (diff view). Track which tenants are on which version. Manage deprecation: mark old versions as deprecated, set migration timelines, notify affected tenants.
Module Health Per-module metrics across all tenants: API error rates, tool call success rates, agent execution stats, guardrail failure rates, average response times. Identify modules that are degrading or misbehaving.

Tenant Management (Platform Admin)

Capability | Detail
Tenant Registry Create, view, edit, suspend tenants. Assign slug, billing contacts, technical contacts. Configure email domain mapping and SSO/SAML per tenant.
Tenant Subscriptions Enable or disable module packages per tenant. When enabled, the Tenant Data Connector provisions module tables in the tenant's database. When disabled, tables are archived (never deleted). View subscription history.
Tenant Settings Override View and override per-tenant module settings. See which settings the tenant admin has customized vs. module defaults. Platform admin can force-override tenant settings when necessary (e.g., compliance requirements).
User Overview View users across all tenants. See role assignments, last login, activity. Support operations (reset passwords, unlock accounts) without logging into the tenant.
Bridge Status View all registered bridges across tenants. Health dashboard: connected/disconnected, last heartbeat, active tools, pipeline status, resource usage.
Tenant Database View the tenant's PostgreSQL database status: provisioned tables (by module), migration status, storage usage, connection pool health. Emergency operations: run pending migrations, archive module tables.

Tenant Admin Console (Customer-Facing)

The Tenant Admin is a section of the platform workspace available to users with the Admin role within their tenant. It runs inside the same platform shell as modules — with full RLS enforcement. Tenant admins manage their own users, roles, module settings, and integrations.

Capability | Detail
Module Catalog Browse available modules the tenant can subscribe to. View descriptions, screenshots, permission requirements, and pricing. Subscribe or unsubscribe. When subscribing, the platform provisions tables and maps permissions to the tenant's existing roles using the module's default_roles.
Module Configuration Per-module settings page. Each subscribed module surfaces its tenant_settings as a configuration form — dynamically generated from the manifest schema. Tenant admin adjusts thresholds, toggles features, sets frequencies. Changes take effect immediately. Settings constrained by the module's platform_settings ranges (e.g., data retention can be 90-730 days, not less).
User Management Invite users (email invite or bulk CSV import). Assign roles. View user activity (last login, recent actions). Deactivate users. Reset passwords. Configure SSO/SAML mappings if available.
Role Management View standard role templates (Admin, Manager, Operator, Viewer) with their current permission grants per module. Customize: add/remove permissions from existing roles. Create custom roles (e.g., "Shift Supervisor" with specific permissions across SCADA + Maintenance). Delete custom roles (standard templates cannot be deleted, only modified).
Permission Browser View all permissions across all subscribed modules in one place. Grouped by module. Each permission shows: code, description, and which roles currently grant it. Makes it easy to audit "who can do what" across the tenant.
Bridge Management Register new bridge instances. View bridge health (connected/disconnected, heartbeat, active tools). Download bridge installer for new sites. Configure bridge-to-site assignments. View bridge sync status (which tool definitions are current).
Connector Configuration For modules that use external data connectors (BigQuery, Redshift, APIs): configure connection details per module. Provide credentials, project IDs, connection strings. Test connectivity. View sync status.
Agent Settings View module agents and their configurations. Tenant admin can adjust: model preferences (within the platform's available models), context profiles, temperature, and agent-level settings defined by the module. Cannot modify skills, tools, or security policies — those are module-defined.
Audit Log View the tenant's audit trail: who changed what, when, through what interface (UI, API, agent, bridge). Filter by module, user, action type, date range. Export for compliance.
💡
Module configuration is dynamic. When a module declares tenant_settings in its manifest, the Tenant Admin automatically generates a settings form for that module — no custom admin UI code needed. The form schema comes from the manifest: field types, defaults, valid ranges, and descriptions. When the tenant admin saves, the platform validates against the module's constraints and stores the values in module_settings. Module code reads these at runtime via CoreDataService.get_module_settings() (backend) or useModuleSettings() (frontend). Add a setting to the manifest, and it appears in the Tenant Admin automatically.
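
A sketch of the validation step described above, assuming an illustrative helper name and constraint shape; the real SDK surface may differ:

# Illustrative only: validate a tenant admin's submitted settings against the
# manifest schema and platform-level constraints before writing module_settings.
def validate_module_settings(manifest_settings: dict, platform_constraints: dict, submitted: dict) -> dict:
    validated = {}
    for key, spec in manifest_settings.items():
        value = submitted.get(key, spec.get("default"))
        if spec.get("required") and value is None:
            raise ValueError(f"{key}: required setting is missing")
        if "options" in spec and value not in spec["options"]:
            raise ValueError(f"{key}: must be one of {spec['options']}")
        constraint = platform_constraints.get(key, {})
        if constraint.get("locked"):
            value = constraint.get("value", value)              # locked settings keep the platform value
        if "min" in constraint and value < constraint["min"]:
            raise ValueError(f"{key}: below platform minimum {constraint['min']}")
        if "max" in constraint and value > constraint["max"]:
            raise ValueError(f"{key}: above platform maximum {constraint['max']}")
        validated[key] = value
    return validated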

Module Configuration Flow

Module Manifest (module.json):
  "tenant_settings": {
    "alarm_severity_thresholds": {"type": "object", "default": {"critical": 90, "warning": 70}},
    "auto_acknowledge_info_alarms": {"type": "boolean", "default": false},
    "default_etl_frequency": {"type": "string", "default": "15m", "options": ["1m","5m","15m","1h"]}
  }
        │
        ▼
Platform Admin Console (Dematic): Module Management → SCADA Monitor → Platform Settings
  max_concurrent_pipelines: 5     (locked — tenants cannot change)
  data_retention_days: 365        (range: 90-730 — tenants can adjust within)
        │
        ▼
Tenant Admin Console (Customer): Module Configuration → SCADA Monitor → Settings
  ┌────────────────────────────────────────────┐
  │ SCADA Monitor Settings                     │
  │                                            │
  │ Alarm Thresholds                           │
  │   Critical: [90]   Warning: [70]           │
  │                                            │
  │ Auto-acknowledge info alarms               │
  │   [ ] Enabled                              │
  │                                            │
  │ Default ETL frequency                      │
  │   [▼ 15 minutes ]                          │
  │                                            │
  │ Data retention (days)                      │
  │   [365]  (min: 90, max: 730)               │
  │                                            │
  │              [Save Settings]               │
  └────────────────────────────────────────────┘
        │
        ▼
Runtime:
  service.py:  settings = await core.get_module_settings()
  ui/App.tsx:  const settings = useModuleSettings()
  Both return: {alarm_severity_thresholds: {critical: 90, warning: 70}, ...}

Agent Observability

Capability | Detail
Agent Traces Full execution trace for every agent run: prompt assembly, each LLM call, tool calls (with params and results), dispatch chains, memory operations, token usage. Filterable by tenant, module, agent, time range.
Tool Call Log Every tool execution across all tenants: tool name, actor (user/agent/bridge), execution mode (cloud/bridge), duration, result status, error details. Audit-level detail for full tools.
Failover Tracking When an agent's primary model fails and the failover chain activates, trace which model was used, why failover triggered, and response quality comparison.
Cost Dashboard Token usage and LLM cost per tenant, per module, per agent. Identify high-cost patterns and optimization opportunities.
Guardrails Report Pre/post condition failures, approval gate triggers, error policy activations. Identify modules or tools that frequently fail validation.

Platform Health

📊

System Dashboard

Active tenants, total users, module subscriptions, bridge connections, agent runs (last 24h), scheduled jobs, SSE connections.

🚨

Alerts

Bridge disconnects, agent failure spikes, tool timeout patterns, data store connectivity issues, module validation failures.

📝

Audit Log Viewer

Cross-tenant audit log search. Filter by actor type (human, agent, bridge, service), module, action, time range. Immutable records.

🔍

Module Inspector

View loaded modules, their validation status, registered tools, skills, connectors, and which tenants subscribe to each.


29. Module Pipeline

Modules are packaged and uploaded to the platform through a standardized pipeline — like a Docker image for application logic. Anyone with the SDK can develop a module. The pipeline validates it meets platform standards, stages it for testing, and promotes it to production. Two upload paths: CLI (developer workflow) and Admin Console UI (drag-and-drop for non-CLI users).

Module Package Format

The Module Package (.mpkg)

A module package is a compressed archive containing the standard module directory structure. Think of it like a Docker image for platform modules — self-contained, versioned, and validated before deployment.

scada_monitor-1.2.0.mpkg
  ├── module.json              # Manifest (identity, data tables, AI, migrations)
  ├── api.py                   # API routes
  ├── service.py               # Business logic
  ├── schemas.py               # Pydantic validation
  ├── services.py              # Service Registry exports
  ├── ai/                      # Agent config, skill packages, tools
  ├── bridge/                  # Bridge tool definitions
  ├── frontend/                # Module UI components
  └── tests/                   # Test suite
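
Under the hood a .mpkg can be nothing more exotic than a versioned archive of that directory. A sketch of what the packaging step might do, assuming the archive is a zip and the manifest carries a version field (both assumptions for illustration):

# Sketch: package a module directory into a versioned .mpkg archive.
# Treating .mpkg as a zip and reading manifest["version"] are illustrative assumptions.
import json, zipfile
from pathlib import Path

def package_module(module_dir: str) -> Path:
    root = Path(module_dir)
    manifest = json.loads((root / "module.json").read_text())
    archive = root.parent / f"{manifest['module_code'].lower()}-{manifest['version']}.mpkg"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in root.rglob("*"):
            if path.is_file() and "__pycache__" not in path.parts:
                zf.write(path, path.relative_to(root))
    return archive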

Two Upload Paths

Method | Who Uses It | How It Works
Platform CLI Developers, AI code assistants, CI/CD pipelines platform modules validate ./my_module/ — validate locally before upload
platform modules package ./my_module/ — package into .mpkg archive
platform modules upload scada_monitor-1.2.0.mpkg — upload to platform
The CLI is the primary developer workflow. AI code assistants use the same commands — "package and upload the SCADA module" is one instruction.
Admin Console UI Platform admins, non-CLI users Drag-and-drop .mpkg file to the Module Pipeline page in the Admin Console. The UI shows real-time validation progress, step-by-step results, and the validation report. Promote or reject from the same screen.

Pipeline: Upload to Deployment

1

Upload & Unpack

Module package received via CLI or Admin Console UI. Unpacked to a staging area. Never touches the live app/modules/ directory until fully validated and accepted.

2

Manifest Validation

module.json parsed and validated: required fields present, module_code unique, permission codes don't conflict with existing modules, data_tables schema is valid, AI agent config is well-formed, bridge tool security policies are complete.

3

Structure Validation

Required files present (api.py with router export, schemas.py). Python syntax and import checks pass. No direct imports of other module tables (isolation check). No references to platform internals outside of SDK interfaces.

4

SDK Compliance & Security Scan

All tools use @platform_tool decorator. Data access uses TenantDataConnector (no raw SQL, no direct DB connections). Security policies defined for every bridge tool. Skills files are valid markdown with YAML frontmatter. Static analysis for unsafe patterns: no eval(), no os.system(), no dangerouslySetInnerHTML, no hardcoded secrets (Section 6).
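
A simplified sketch of the static scan in this step. The pattern list mirrors the examples above and is illustrative, not the pipeline's full rule set:

# Simplified static scan sketch: flag unsafe patterns before a module reaches staging.
import re
from pathlib import Path

UNSAFE_PATTERNS = {
    r"\beval\(": "eval() is not allowed",
    r"\bos\.system\(": "os.system() is not allowed",
    r"dangerouslySetInnerHTML": "dangerouslySetInnerHTML is not allowed",
    r"(?i)(api[_-]?key|secret)\s*=\s*['\"][A-Za-z0-9]{16,}['\"]": "possible hardcoded secret",
}

def scan_module(module_dir: str) -> list:
    findings = []
    for path in Path(module_dir).rglob("*"):
        if not path.is_file() or path.suffix not in {".py", ".ts", ".tsx", ".js"}:
            continue
        text = path.read_text(errors="ignore")
        for pattern, message in UNSAFE_PATTERNS.items():
            if re.search(pattern, text):
                findings.append(f"{path}: {message}")
    return findings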

5

Connector Validation

If the module declares connectors to other modules via the Service Registry, verify those target modules exist and the required services are registered. If the module exposes services, validate the function signatures and permission codes.

6

Automated Test Suite

Module's tests/ directory executed in an isolated environment with mock TenantDataConnector and mock Bridge. All tests must pass. Coverage report generated.

7

Staging Deployment

Module deployed to a staging environment. Platform admin can test it against a sandbox tenant with sample data before promoting to production.

8

User Acceptance

Platform admin or designated reviewer performs manual acceptance testing. Signs off on functionality, agent behavior, bridge tool execution, and UI integration.

9

Promote to Production

Accepted module moved to app/modules/. Module Loader picks it up on next deployment or hot-reload. Available for tenant subscriptions.

Module Status Lifecycle

Status | Meaning | Visible to Tenants
UPLOADED Package received, unpacked, queued for validation No
VALIDATING Automated validation pipeline running (manifest, structure, SDK compliance, connectors, tests) No
VALIDATION_FAILED One or more validation steps failed. Detailed report available in Admin Console. No
STAGING All automated checks passed. Deployed to staging for manual testing. No
ACCEPTANCE In user acceptance testing. Awaiting sign-off from reviewer. No
ACTIVE Promoted to production. Available for tenant subscriptions. Yes
DEPRECATED Superseded by a newer version. Existing subscriptions continue. No new subscriptions. Existing only

Validation Report (Admin Console View)

Module: SCADA_MONITOR v1.2.0

Manifest Validation         PASSED
  ├── module_code unique     
  ├── permissions valid      
  ├── data_tables schema       (2 tables: scada_readings, alarms)
  ├── AI agent config          (2 skills, failover chain valid)
  └── bridge_tools policies    (2 tools, all security fields present)

Structure Validation        PASSED
  ├── api.py router export   
  ├── schemas.py present     
  ├── no cross-module imports
  └── no platform internals  

SDK Compliance              PASSED
  ├── all tools use @platform_tool    (5 tools)
  ├── data access via connector       (no raw SQL)
  ├── bridge tools have policies      (2/2)
  └── skills valid markdown           (2 files)

Connector Validation        PASSED
  ├── exposes: get_zone_alarm_status       registered
  ├── exposes: get_aisle_health            registered
  ├── consumes: MAINTENANCE.create_work_order    exists
  └── dispatch_permissions: [MAINTENANCE]        target exists

Automated Tests             PASSED
  ├── 24 tests run           
  ├── 24 passed, 0 failed    
  └── coverage: 87%          

Staging Deployment          PASSED
User Acceptance             AWAITING SIGN-OFF

30. Platform CLI

The Platform CLI is a command-line tool for developers building modules. It provides access to platform-level information needed during development: what modules are loaded, what services are available for inter-module communication, and tools for packaging and uploading modules.

Module Discovery

# List all active modules on the platform
$ platform modules list

MODULE CODE        VERSION   STATUS    TOOLS   SERVICES EXPOSED
─────────────────  ────────  ────────  ──────  ─────────────────
SCADA_MONITOR      1.2.0     ACTIVE    5       2
MAINTENANCE        1.0.0     ACTIVE    4       3
ETL_PIPELINE       1.1.0     ACTIVE    3       1
REPORTING          1.0.0     ACTIVE    2       0
ASSET_TRACKING     0.9.0     STAGING   3       1
AGV_OPS            0.8.0     VALIDATING 4      2

Module Connector Discovery

# See what services a module exposes (what you can call)
$ platform modules services SCADA_MONITOR

MODULE: SCADA_MONITOR (v1.2.0)
EXPOSED SERVICES:
  get_zone_alarm_status
    Permission: SCADA.READ_SENSORS
    Params:     zone_id (str), severity_filter (str, optional)
    Returns:    {zone_id, alarm_count, alarms: [...]}

  get_aisle_health
    Permission: SCADA.READ_SENSORS
    Params:     aisle_id (str)
    Returns:    {aisle_id, utilization_pct, active_faults, shuttles: [...]}

# See what services are available across all modules
$ platform modules services --all

MODULE              SERVICE                        PERMISSION
──────────────────  ─────────────────────────────  ────────────────────
SCADA_MONITOR       get_zone_alarm_status           SCADA.READ_SENSORS
SCADA_MONITOR       get_aisle_health                SCADA.READ_SENSORS
MAINTENANCE         create_work_order               MAINTENANCE.CREATE
MAINTENANCE         get_open_work_orders             MAINTENANCE.READ
MAINTENANCE         get_technician_availability      MAINTENANCE.READ
ETL_PIPELINE        get_pipeline_status              ETL.READ

# See what a module consumes from other modules
$ platform modules deps SCADA_MONITOR

MODULE: SCADA_MONITOR (v1.2.0)
CONSUMES:
  MAINTENANCE.create_work_order    (via dispatch_agents)
  MAINTENANCE.get_technician_availability
DISPATCH PERMISSIONS:
  Can dispatch to: MAINTENANCE, ETL_PIPELINE

Module Packaging & Upload

# Validate a module locally before uploading
$ platform modules validate ./my_module/
  Manifest:     PASSED
  Structure:    PASSED
  SDK Compliance: PASSED
  Connectors:   PASSED (requires MAINTENANCE module on platform)
  Tests:        PASSED (18/18, coverage 82%)

# Package the module into a .mpkg archive
$ platform modules package ./my_module/
  Created: scada_monitor-1.2.0.mpkg (124 KB)

# Upload to platform — kicks off the validation pipeline
$ platform modules upload scada_monitor-1.2.0.mpkg
  Uploaded. Status: VALIDATING
  Track progress: platform modules status SCADA_MONITOR

# Or do it in one command — validate, package, and upload
$ platform modules deploy ./my_module/
  Validating...    PASSED
  Packaging...     scada_monitor-1.2.0.mpkg (124 KB)
  Uploading...     UPLOADED
  Pipeline:        VALIDATING → track with: platform modules status SCADA_MONITOR

# Check validation pipeline status
$ platform modules status SCADA_MONITOR
  Module:  SCADA_MONITOR v1.2.0
  Status:  STAGING
  Passed:  Manifest, Structure, SDK, Connectors, Tests (24/24)
  Next:    Awaiting acceptance in Admin Console

# Promote from CLI (admin only) — or promote from Admin Console UI
$ platform modules promote SCADA_MONITOR
  Promoted to production. Module Loader will pick up on next deploy.
  Tenants subscribed: 3
  Data migrations pending: v1.1.0, v1.2.0 (will run at startup)
  Status:  ACTIVE

Platform Info

# Platform overview
$ platform info
  Platform Version:  2.1.0
  Active Tenants:    12
  Loaded Modules:    6 (4 active, 1 staging, 1 validating)
  Registered Bridges: 8
  Agent Runs (24h):  1,247

# List tenant subscriptions (admin only)
$ platform tenants list
  TENANT          SLUG          STATUS   MODULES              BRIDGES
  ──────────────  ────────────  ───────  ───────────────────  ───────
  Tenant A        tenant-a      ACTIVE   SCADA, MAINT, ETL    2
  Tenant B        tenant-b      ACTIVE   SCADA, MAINT, AGV    3
  Tenant C        tenant-c      ACTIVE   MAINT, REPORTING     1

# Show data store config for a tenant (admin only)
$ platform tenants datastore tenant-a
  Engine:   PostgreSQL
  Database: tenant_a_modules
  Tables:   scada_readings, alarms, work_orders, pipeline_runs
  Status:   Connected

Complete CLI Command Reference

platform info

Command | Purpose
platform info | Platform overview: version, active tenants, loaded modules, registered bridges, agent runs (24h)
platform health | Health check across all services: Backend, Streaming Service, Platform DB, Bridge connections

platform modules

Command | Purpose
platform modules list | List all modules with status, version, tool count, service count
platform modules services [MODULE] | Show services exposed by a module (or --all for all modules)
platform modules deps MODULE | Show what a module consumes from other modules and its dispatch permissions
platform modules validate ./path/ | Run local validation against SDK standards (manifest, structure, SDK compliance, security, tests)
platform modules package ./path/ | Package module into .mpkg archive
platform modules upload FILE.mpkg | Upload a packaged module to the platform validation pipeline
platform modules deploy ./path/ | Validate + package + upload in one command
platform modules status MODULE | Check validation pipeline status and results
platform modules promote MODULE | Promote a staged module to production (admin only)
platform modules inspect MODULE | Show full module detail: manifest, data tables, skills, tools, services, subscribed tenants

platform tools

Command | Purpose
platform tools list | List all tools (platform + module). Filter: --scope platform or --scope module
platform tools list --scope platform | List platform-level tools available to any agent — key for module developers discovering what's built in
platform tools inspect TOOL | Show tool details: scope, security policy, parameters, instructions, which skills use it

platform agents

Command | Purpose
platform agents list | List all registered agents with module, model, skill count, dispatch permissions
platform agents inspect AGENT | Show agent detail: persona, model, skills, bound tools, dispatch permissions, memory mode, failover chain
platform agents skills AGENT | List skill packages for an agent with tools_required and file paths
platform agents trace RUN_ID | Show full execution trace for an agent run: prompt, LLM calls, tool calls, results, token usage

platform models

Command | Purpose
platform models list | List all registered LLM models with provider, type, status (active/deprecated/sunset), capabilities
platform models inspect MODEL | Show model detail: context window, pricing, capabilities, agents using it
platform models swap OLD NEW | Migrate all agents from one model to its replacement. --dry-run to preview.

platform tenants

Command | Purpose
platform tenants list | List all tenants with slug, status, subscribed modules, bridge count (admin only)
platform tenants inspect SLUG | Show tenant detail: data store config, module subscriptions, bridge instances, user count
platform tenants datastore SLUG | Show tenant data store: engine type, connection, provisioned tables, migration status

platform bridges

Command | Purpose
platform bridges list | List all bridges with tenant, status (connected/disconnected), last heartbeat, tool count
platform bridges inspect BRIDGE_ID | Show bridge detail: tenant, site, connected PLCs, synced tools, health history
platform bridges health | Show health dashboard: all bridges with uptime, resource usage, connection quality

platform storage

Command | Purpose
platform storage list | List storage buckets with module, purpose, file count, total size
platform storage usage | Storage usage breakdown by tenant and module

platform streaming

Command | Purpose
platform streaming status | Streaming Service health: active connections, topics, message throughput, cache size
platform streaming topics | List active topics with tenant, subscriber count, messages/min
The CLI is the developer's window into the platform. Building a module? Use platform tools list --scope platform to discover built-in tools. Use platform modules services --all to find inter-module services. Use platform agents inspect to understand how agents are configured. Use platform modules deploy to ship. The CLI works the same way for human developers and AI code assistants — one command set, full platform visibility.

Reference — Dematic implementation, deployment model, tech stack, schemas

Dematic Models: The Reach Goal

This section describes the long-term vision for Dematic-specific AI models. It is not part of the initial platform build, but the platform architecture is designed to support it from day one. The data structures, the embedding infrastructure, the tenant data isolation, and the agent skill system all lay the groundwork. What follows is an enterprise-wide initiative that starts with a single focus group and expands across all departments.

The Progression: Tools → Fine-Tuning → Foundation Model

Not every piece of domain knowledge requires a custom model. The approach is incremental — use the simplest solution that works, escalate only when needed:

  1. Tools first. Can a general-purpose model (Claude, GPT) solve this with the right tools and data access? If yes, build a tool. The platform's @platform_tool and skill system handle most domain tasks today — the agent reads equipment data, follows procedures, and uses tools to act. No training needed.
  2. Fine-tuning second. If the general-purpose model struggles with Dematic-specific terminology, fault patterns, or engineering conventions even with tools, fine-tune an existing model on curated Dematic data. This is cheaper and faster than training from scratch — and the platform's data curation pipeline produces the training pairs.
  3. Dematic foundation model last. If fine-tuned models still can't handle the depth of Dematic's engineering domain — PLC programming patterns, mechanical design reasoning, cross-system integration logic — then the answer is a purpose-built foundation model trained on Dematic's proprietary data. This is the reach goal: it demands the most data, the most investment, and the most time.

The clear question for every department is: "Can this data improve our tools and skills, improve a fine-tuned model, or does it require a foundation model?" Most data starts at level 1 (tools). The best data eventually feeds level 3. The platform captures it all from day one so nothing is lost.

Enterprise Initiative, Not a Separate Department

This is not a project that lives inside an AI team. It's an enterprise-wide initiative that starts with one focus group — likely controls engineering, where the richest structured data already exists — and expands into every department that touches equipment. Engineering, resident maintenance, customer support, commissioning, parts and supply chain, sales. Each department's contribution follows the same question: is this data for tools, for fine-tuning, or for training? The platform captures all of it. The initiative drives the culture that produces it.

The Vision: Data Is the Gold

The platform name isn't accidental. The gold is the data that flows through every tenant deployment — SCADA readings, fault histories, PLC configurations, resolution procedures, engineering designs, operator interactions. Today this data sits in disconnected historians, spreadsheets, and tribal knowledge. The platform captures it, structures it, and makes it agent-accessible. The reach goal is to use that structured data to build models that understand Dematic equipment — not just monitor it.

What Dematic Models Would Know

Knowledge Domain | What the Model Learns | Data Source
Equipment Diagnostics How to diagnose faults on Dematic multishuttles, conveyors, sorters, and AGVs. Not generic troubleshooting — specific to Dematic control logic, fault code structures, and resolution procedures. Fault histories, resolution records, engineering notes, operator interactions — captured through the platform's SCADA and Maintenance modules across all tenant deployments.
PLC Programming How to program and configure PLCs for Dematic equipment. Siemens TIA Portal patterns for multishuttles, Allen-Bradley patterns for conveyors, standard I/O configurations, safety circuit logic. PLC configuration exports uploaded through the Engineering Configuration module. Ladder logic, function blocks, tag databases, I/O maps — the complete programming context.
WMS Integration How Dematic equipment integrates with warehouse management systems. Message formats, handshake protocols, order flow patterns, exception handling between WMS and material handling controls. Integration module configurations, WMS connector definitions, message logs, error patterns captured across tenant deployments.
Customer Configurations Tenant-specific equipment configurations, custom alarm thresholds, site-specific operational patterns. The model understands that Tenant A's multishuttle runs at different parameters than Tenant B's because of their specific product mix and throughput requirements. Per-tenant configuration data from the Engineering Configuration module, alarm threshold settings, operational parameters — all tenant-isolated but available for cross-tenant pattern learning with explicit consent.
Design & Engineering Files How Dematic systems are designed: mechanical layouts, control architectures, safety systems, commissioning procedures. The model can assist in new system design by referencing patterns from previous deployments. Engineering design documents, commissioning checklists, as-built drawings, test procedures — uploaded and indexed through the platform's storage and embedding infrastructure.

AI-First by Department: How Training Data Is Created

This is the sell. Every department at Dematic touches the equipment lifecycle — design, build, install, commission, operate, maintain, replace. Every touchpoint generates data that makes the platform smarter. No competitor has this. No foundation model can replicate it. This is the compounding advantage that grows with every deployment, every service call, every part replacement, every engineering engagement.

Training data doesn't appear automatically. It requires every department to adopt one question: "How does what I'm doing right now teach an AI to do it better next time?"

Engineering (Design & Controls)

Dematic equipment is like highly customizable building blocks — standard products (conveyors, multishuttles, sorters, AGVs) assembled into configurations that vary enormously by customer. The products are the same, but a grocery DC running 4,000 totes per hour with 2-35kg weight variance looks nothing like a pharma DC running 800 totes per hour with climate-controlled aisles. Every deployment is a unique combination of standard components configured for specific facility layouts, product mixes, throughput requirements, and operational constraints.

Today these design patterns live in project files that leave with the engineer. But the patterns between different customers are where the gold is. AI trained on hundreds of deployment configurations can identify what works: which multishuttle motor configurations handle which product mixes, which conveyor speeds reduce jam rates for which carton sizes, which sorter divert geometries optimize throughput for which parcel distributions. The platform captures every design as structured data — so the patterns emerge across customers, not just within one project.

Example: Custom Multishuttle System Design

What happens today:
  Engineer designs a 12-aisle multishuttle for a grocery DC.
  PLC program saved to shared drive. I/O map in a spreadsheet.
  Fault codes documented in a Word doc. Knowledge in the engineer's head.
  Next project starts from scratch with a different engineer.

What happens with AI-first thinking:
  Engineer uploads design to the platform through Engineering Configuration module.

Training pair generated:

  Input:  Siemens TIA Portal export — 12-aisle multishuttle
          8 shuttles/aisle, dual-motor, grocery DC, 4000 totes/hour

  Output: Structured knowledge base:
          - Equipment topology: aisle layout, shuttle assignments, zone map
          - I/O map: 192 inputs, 96 outputs per aisle → named and documented
          - Fault code table: 47 codes → root causes → resolution procedures
          - Control sequences: initialization → homing → storage/retrieval → recovery
          - Safety circuits: E-stop zones, light curtain interlocks, interlock logic
          - Commissioning parameters: speeds, acceleration profiles, temp thresholds
          - Design decisions: "dual-motor chosen because grocery tote weights
            vary 2-35kg — single motor overheats above 28kg sustained"

  The value:
  Next time an engineer designs a grocery DC multishuttle, the AI says:
  "Based on 14 previous grocery DC deployments, dual-motor config
  is recommended for tote weight variance above 20kg. Here's the
  standard I/O map, fault code library, and commissioning checklist
  from the most similar deployment (Site: Kroger Cincinnati)."

  The engineer doesn't start from scratch. They start from the
  collective knowledge of every previous deployment.

Example: Automatic Monitoring from PLC Config

The most powerful application: PLC config → automatic SCADA monitoring

What happens today:
  Engineer programs the PLC (Allen-Bradley Studio 5000 or Siemens TIA Portal).
  A separate SCADA engineer manually configures monitoring: which tags to poll,
  what thresholds trigger alarms, what fault codes mean. This takes weeks.
  Errors in the manual mapping cause missed alarms or false positives.

What happens with the platform:
  Engineer exports the PLC config file (Allen-Bradley .L5X or Siemens .XML).
  Uploads it to the Engineering Configuration module.

Agent reads the config and automatically derives:

  1. Tag database extraction
     - Every named tag with address, data type, and description
     - Motor temps, belt speeds, position sensors, fault registers
     - The agent knows Allen-Bradley tag naming conventions and
       Siemens data block structures from previous configs

  2. Fault code library
     - Identifies all fault/alarm tags (FLT_, ALM_, Fault_ prefixes)
     - Maps fault code numbers to descriptions from PLC comments
     - Assigns severity levels based on the control logic context:
       "this fault triggers an E-stop → Critical"
       "this fault logs a warning but doesn't stop → Info"

  3. Monitoring configuration for the Bridge
     - Generates the OPC-UA tag list: what to poll, how often
     - Critical tags (faults, safety): 100ms poll rate
     - Performance tags (temps, speeds): 1-second poll rate
     - Status tags (positions, counts): 5-second poll rate

  4. SCADA dashboard definition
     - Equipment layout with live tag bindings
     - Alarm rules generated from fault code library
     - Threshold-based alerts from commissioning parameters

Engineer reviews, adjusts, approves.
  The agent did 90% of the work. The engineer adds context the
  config file doesn't contain: "this motor runs hot by design,
  raise the temp threshold 10°C" or "ignore fault 4012 during
  the first 30 seconds after startup — it's a known transient."

Result:
  Upload a PLC config → monitoring is live in hours, not weeks.
  Every future config upload is faster because the agent learned
  from the engineer's corrections on previous configs.
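
A toy sketch of one slice of this derivation: classifying tags by naming convention and assigning poll rates. The prefixes and rates mirror the example above; the tag record shape is an assumption:

# Toy sketch: derive a Bridge polling plan from an exported tag list.
# The tag dict shape ({"name", "address"}) is assumed for illustration.
def derive_polling_plan(tags: list) -> list:
    plan = []
    for tag in tags:
        name = tag["name"].upper()
        if name.startswith(("FLT_", "ALM_", "FAULT_", "ESTOP")):
            rate_ms, category = 100, "fault/safety"       # critical tags: 100ms
        elif any(k in name for k in ("TEMP", "SPEED", "CURRENT", "VIB")):
            rate_ms, category = 1000, "performance"       # performance tags: 1s
        else:
            rate_ms, category = 5000, "status"            # positions, counts: 5s
        plan.append({"tag": tag["name"], "address": tag["address"],
                     "poll_ms": rate_ms, "category": category})
    return plan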

Resident Maintenance

Dematic places resident maintenance technicians at customer sites. These technicians perform daily rounds, preventive maintenance, corrective repairs, and emergency response. They are the richest source of equipment knowledge in the company — they see how equipment actually behaves over years, not just how it was designed to behave. Today most of this knowledge is in their heads or in paper logs. The platform changes that.

Example: AI-First Maintenance Plans

What happens today:
  Maintenance plans are static documents created during commissioning.
  "Inspect belt tension monthly. Replace motor bearings annually."
  Same plan regardless of actual equipment condition or usage patterns.
  Technician knows the real schedule from experience — not the plan.

What happens with AI-first thinking:
  Every maintenance action feeds the learning loop.

Corrective maintenance training pair:

  Input:  Shuttle 7 motor failure — fault 5012 (overcurrent)
          Equipment: Dematic multishuttle MS-1200, age: 3.2 years
          Last PM: 6 weeks ago (bearings inspected, within spec)
          Sensor history: motor temp trend +12% over 60 days
          Vibration: 0.3mm/s → 1.1mm/s over 90 days
          Parts on hand: none (4-day lead time for motor assembly)

  Output: Resolution + lessons learned:
          1. Motor bearing failure — inner race spalling
          2. Root cause: contamination from nearby conveyor belt dust
          3. Bearing was "within spec" at last PM but vibration trend
             showed degradation starting 90 days before failure
          4. Corrective: replaced motor assembly (part: MS-1200-MOT-ASM)
          5. Preventive update: add vibration threshold (0.8mm/s) as
             PM trigger for this shuttle position (dust exposure zone)
          6. Parts lesson: stock 1x MS-1200-MOT-ASM per site with
             dust-exposed shuttles — 4-day lead time is unacceptable
             for a position that fails every ~3 years

Maintenance plan evolution:

  Static plan:  "Replace motor bearings every 12 months"

  AI-first plan: "Monitor vibration trend. If shuttle is in dust
  exposure zone AND vibration exceeds 0.8mm/s AND trend is rising,
  schedule bearing replacement within 2 weeks. Pre-position parts
  at sites with 3+ dust-exposed shuttle positions."

  The plan becomes predictive — driven by actual equipment behavior,
  not calendar intervals. And it improves with every repair.

Example: Parts Lifecycle & Predictive Ordering

Data accumulated across 50 sites over 2 years:

Component: BC-2400 conveyor belt
  Replacements tracked: 847 across all tenants
  Average lifespan: 18 months (light product mix)
  Average lifespan: 11 months (heavy product mix)
  Average lifespan: 7 months (abrasive products — gravel, aggregate)

  Early failure signals (identified from data, not from manuals):
  ├── Tension fault frequency > 2x/week (normal: < 0.5x/week)
  ├── Motor temp trending +5% above 30-day baseline
  ├── Throughput increase > 20% from commissioning baseline
  └── Belt elongation > 1.8% at last inspection

Predictive ordering pair:

  Input:  Site: Amazon SEA2, Conveyor Line 4
          Belt age: 9 months, heavy product mix
          Tension faults: 2.3x/week (up from 0.4x at install)
          Motor temp: trending +7% over 30 days
          Next planned maintenance window: 3 weeks

  Output: Belt replacement confidence: 91% within 6 weeks
          Recommended action: order BC-2400-BELT-STD now (2-week lead)
          Schedule replacement during the window in 3 weeks
          Estimated downtime: 45 minutes (during planned PM)
          Cost avoided: $12,400 unplanned downtime if belt fails
          during peak shift

The compounding value:
  Year 1: 847 replacements tracked, patterns emerging
  Year 2: model predicts 73% of belt failures 4+ weeks in advance
  Year 3: model predicts 89%, parts pre-positioned, zero unplanned
           belt-related downtime across the fleet

  This data exists nowhere else. No competitor has 847 labeled
  belt replacement records with sensor correlations across 50 sites.
  Every replacement makes the prediction better.

Customer Support

When a customer calls with a problem, the support engineer's diagnostic process is training data. The chain of thought — what they checked, what they ruled out, what they found — teaches the model how to diagnose faults on Dematic equipment.

Example: Remote Diagnostic Support

Support call: "Sorter is rejecting 30% of parcels to the reject lane"

Support engineer's chain of thought (captured as training data):

  1. Check reject rate history → normal is 2-5%, spiked to 30% at 14:00
  2. Check what changed at 14:00 → new product line introduced (oversized)
  3. Check scanner read rates → 98% (scanners working fine)
  4. Check divert confirmation sensors → 3 of 12 divert points
     showing "no confirm" on oversized parcels
  5. Root cause: divert shoes not fully extending for oversized items
     — mechanical adjustment needed on diverts 4, 7, 11
  6. Temporary fix: reduce belt speed 15% at divert points (allows
     more extension time)
  7. Permanent fix: adjust divert shoe stroke length per Dematic
     spec DS-4400-ADJ-023 for parcel width > 600mm

Training value:
  This teaches the model a diagnostic pattern:
  "High reject rate + new product introduction + divert no-confirm
   → check divert mechanical adjustment for new product dimensions"

  Next time ANY customer introduces oversized products and sees
  reject rate spikes, the agent already knows the diagnostic path.
  It doesn't wait for a support call — it proactively alerts:
  "Reject rate increased 6x since the product line change at 14:00.
  Based on similar incidents at 3 other sites, this is likely a
  divert shoe adjustment issue. Recommended: check diverts 4, 7, 11."

Commissioning & Installation

Commissioning teams take equipment from installed to fully operational. Every commissioning generates a baseline — the "healthy" state of the equipment when it's new. This baseline is the reference point for every future diagnostic, every predictive model, and every maintenance plan.

Example: Commissioning Baseline as Training Data

Commissioning data captured for a new multishuttle installation:

Baseline record:
  Site: Target Atlanta DC
  Equipment: MS-1200 Multishuttle, 8 aisles, 6 shuttles per aisle
  Date: 2026-03-15

  Per-shuttle baselines:
  ├── Motor temp at rated load: 52-58°C (ambient: 24°C)
  ├── Cycle time (storage): 3.1-3.4 seconds
  ├── Cycle time (retrieval): 2.8-3.1 seconds
  ├── Vibration: 0.15-0.22 mm/s
  ├── Current draw at rated load: 4.2-4.8A
  └── Position accuracy: ±0.5mm

  Commissioning notes (from the installing engineer):
  "Aisle 3, Shuttle 4 runs 3°C warmer than siblings — bearing
  pre-load slightly high but within spec. Monitor at first PM."

Training value:
  Every future reading is compared against THIS baseline.
  When Shuttle 4 in Aisle 3 hits 65°C in month 8, the model knows:
  "This shuttle ran warm from commissioning (pre-load note).
  Current temp is 7°C above its specific baseline, not the fleet
  average. Flag for bearing inspection at next PM."

  Commissioning data is the birth certificate of every piece of
  equipment. Without it, diagnostics compare against generic specs.
  With it, diagnostics are personalized to each unit's actual
  starting condition.
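
As a small illustration, comparing a live reading against the unit's own commissioning baseline rather than a fleet average is a few lines once the baseline is stored. The record shape here is an assumption:

# Illustrative: flag a reading against this unit's commissioning baseline.
def deviation_from_baseline(reading: float, baseline_low: float, baseline_high: float) -> dict:
    midpoint = (baseline_low + baseline_high) / 2
    return {
        "within_baseline": baseline_low <= reading <= baseline_high,
        "delta_from_midpoint": round(reading - midpoint, 2),
        "exceeds_high_by": round(max(0.0, reading - baseline_high), 2),  # e.g. 65°C vs 52-58°C baseline → 7.0
    }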

Parts & Supply Chain

Every part ordered, stocked, consumed, and returned tells a story about equipment reliability. Parts data closes the loop between design, maintenance, and forecasting.

Example: Cross-Department Intelligence

The complete lifecycle for one component — motor assembly MS-1200-MOT-ASM:

Engineering:   Designed for 5-year MTBF in clean environments
Commissioning: 48 units installed across 8 sites
Maintenance:   17 replacements in 3 years (expected: ~10)
Root causes:   12 bearing failures (contamination), 3 overcurrent
               (overloading), 2 electrical (moisture ingress)
Parts data:    Average 4.1-day lead time, 3 emergency orders ($2,800
               expedite premium each)
Support data:  23 related support calls, average 2.1 hours to diagnose

AI-first intelligence (connects all departments):

  → To Engineering: "Motor MTBF in dusty environments is 2.1 years,
    not 5. Consider sealed bearing option for DC sites with
    conveyor belt dust. 12 of 17 failures were contamination."

  → To Maintenance: "Pre-position 1 spare motor per site with
    3+ dust-exposed shuttles. Vibration monitoring at 0.8mm/s
    threshold catches 89% of failures 4+ weeks in advance."

  → To Parts: "Stock 6 units regionally (currently 2). Eliminate
    emergency orders by predictive positioning. Saves $8,400/year
    in expedite premiums alone."

  → To Support: "When a customer reports motor overcurrent on
    MS-1200 in a dusty environment, skip the standard diagnostic.
    Go directly to vibration check + bearing inspection. Saves
    1.5 hours per call based on 23 previous cases."

  → To Sales: "Offer sealed bearing upgrade as a value-add for
    sites with conveyor dust exposure. ROI: $14,200/year per site
    in avoided downtime + expedite costs. Payback: 6 months."

  No single department has this picture. The platform connects them.

The Chain-of-Thought Evaluation Process

Raw data is not training data. Every department's contributions go through a chain-of-thought evaluation process that transforms operational records into labeled training pairs:

1

Capture

The platform captures the raw event: fault record, resolution notes, PLC config upload, parts replacement, agent conversation. This happens automatically through normal module usage.

2

Agent Drafts Chain of Thought

An AI agent (using general-purpose models) analyzes the raw data and generates a draft chain of thought: "Given these inputs, the reasoning should be X → Y → Z, and the conclusion is W." This is the agent's best guess at the diagnostic or engineering logic.

3

Human Expert Reviews

The domain expert (engineer, technician, support agent) reviews the agent's draft. They correct errors, add missing context, and validate the reasoning chain. The corrections are the highest-value training data: "The agent got step 2 wrong — here's why, and here's the correct reasoning."

4

Pair Stored

The reviewed input/output pair is stored as a labeled training example: input (raw data + context), output (validated chain of thought + conclusion), corrections (what the agent got wrong and why). Tagged by equipment type, fault category, department, and confidence level.

5

AI Evaluation

Periodically, a separate evaluation agent reviews accumulated pairs for consistency: do similar inputs produce similar outputs? Are there contradictions between pairs from different sites? Are the reasoning chains logically sound? Flagged inconsistencies go back to human experts for resolution. The result is a clean, validated training corpus.
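
The stored artifact from this loop is just a structured record. A sketch of what a labeled pair could carry, with field names as assumptions rather than a fixed schema:

# Sketch of a labeled training pair produced by the capture → draft → review loop.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrainingPair:
    input_context: str            # raw event + surrounding data (fault record, config excerpt, ...)
    draft_chain_of_thought: str   # the agent's initial reasoning
    expert_corrections: List[str] # what the reviewer changed, and why
    validated_output: str         # approved chain of thought + conclusion
    equipment_type: str           # e.g. "MS-1200 multishuttle"
    fault_category: str = ""
    department: str = ""          # engineering, maintenance, support, ...
    confidence: str = "reviewed"  # reviewed | flagged | gold
    tags: List[str] = field(default_factory=list)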

How the Platform Enables This

The platform doesn't build these models — it generates and structures the data that makes them possible. Every module, every agent interaction, every Bridge data stream contributes to the training corpus. This is AI-first thinking: the architecture captures data in a form that's useful for model training, not just for operational dashboards.

Platform Feature | How It Feeds Model Training
Engineering Configuration Module PLC configs parsed and structured by agents. I/O maps, fault codes, control logic documented per tenant. Engineering reviews and corrects agent output — creating curated input/output training pairs. Over time, the platform accumulates a library of "config in, structured documentation out" pairs that train models to parse new configs accurately.
Vector Embeddings Fault descriptions, resolution procedures, and equipment documentation embedded for similarity search. The embedding index across all tenants (with consent) becomes a training dataset for domain-specific embedding models that understand material handling terminology better than general-purpose models.
Agent Interaction Logs Every agent conversation — operator questions, agent reasoning, tool calls, results — is a training example. "Operator asked X, agent diagnosed Y, resolution was Z" pairs teach models the diagnostic workflow for Dematic equipment.
SCADA Streaming Data Time-series sensor data tagged with engineering context (sensor names, equipment types, zone IDs). When correlated with fault events and resolution records, creates labeled training data: "these sensor patterns preceded this fault on this equipment type."
Bridge Local Processing Bridges process data at the edge and transmit contracted data. With tenant consent, aggregated patterns (not raw data) can feed cross-tenant model training: "multishuttles across 50 sites show this temperature degradation pattern before motor failure."

Roadmap

1

Policy First: AI-First Thinking Across Departments (Now)

Training data starts with a forward policy: every department — controls engineering, field service, commissioning, design, customer support — identifies, curates, and includes AI learning content in their standard workflows. When an engineer resolves a fault, they document it in a format that trains models. When a designer creates a PLC program, they include the context that teaches an agent to understand it. When a technician completes a work order, the resolution description is written for both humans and AI. This is a cultural shift, not a technical one — and it starts before a single model is trained.

2

Platform Launch: Capture & Structure (Months 1-6)

The platform captures operational data, engineering configurations, and agent interactions. The Engineering Configuration module builds per-tenant technical libraries. Agents use general-purpose models (Claude, GPT) with domain knowledge provided through skills and tool instructions. Data accumulates in structured, tenant-isolated databases. The AI-first policy means this data is already being shaped for training — not retrofitted later.

3

Data Curation: Label & Pair (6-12 months)

Engineering defines input/output pairs: "this config file → this structured documentation," "this sensor pattern → this fault → this resolution." Agents generate draft pairs, engineering corrects them. The review/correction cycle produces high-quality labeled data at scale. This requires a dedicated effort from the controls engineering team.

4

Model Training: Fine-Tune & Evaluate (12-24 months)

With sufficient curated data, fine-tune domain-specific models: a Dematic diagnostics model, a PLC configuration model, a WMS integration model. Evaluate against held-out test sets. Compare against general-purpose models on Dematic-specific tasks. This requires a dedicated AI/ML engineering team and a model training pipeline.

5

Deployment: Register & Serve (24+ months)

Trained models registered in the platform's Model Registry alongside general-purpose providers. Module agents can be configured to use Dematic-specific models for equipment tasks while falling back to general-purpose models for other tasks. The failover chain and model selection infrastructure already exists in the platform — Dematic models slot in as another provider.

💡
The compounding advantage: Every customer deployment generates training data. Every engineering engagement builds the technical library. Every agent interaction produces labeled examples. The platform doesn't wait for models to be useful — general-purpose models with domain skills work from day one. But the data pipeline is running from day one too. When Dematic models are ready, the data is already there — structured, labeled, and tenant-consented. This is what AI-first architecture means: the data is the gold, and the platform is designed to mine it from the start.

Deployment Model

The platform uses a single, consistent deployment model for all tenants. No engine choices, no deployment profiles, no configuration matrix.

Layer | Technology | Isolation
Platform DB | Shared PostgreSQL (Cloud SQL) | Row-Level Security on tenant_id. Config, identity, orchestration.
Tenant DB | Dedicated PostgreSQL per tenant | Physical isolation — separate database per tenant. Module business data.

Every tenant gets the same stack: dedicated PostgreSQL with ACID transactions, relational joins, pgvector for similarity search, and full SQL capabilities. Modules that need access to external data sources (BigQuery warehouses, data lakes, third-party APIs) use module-level data connectors — configured per tenant in the Admin Console, not a deployment choice.
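
For example, a module service might run a similarity search through the connector. The vector_search method is part of the connector's documented surface; the table name and keyword arguments below are illustrative assumptions:

# Sketch of pgvector similarity search through TenantDataConnector.
# The table name and keyword arguments are assumptions for illustration.
async def find_similar_faults(data: "TenantDataConnector", fault_embedding: list[float]) -> list[dict]:
    # Cosine-similarity lookup over the tenant's own fault history (pgvector)
    return await data.vector_search(
        "fault_descriptions",        # table in the tenant's dedicated database
        embedding=fault_embedding,
        top_k=5,
    )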

One deployment model means zero ambiguity. No "which engine does this tenant use?" questions during development. No adapter-specific bugs. No deployment profile matrix to test against. Every module developer writes for PostgreSQL. Every tenant gets PostgreSQL. Module data connectors handle the rest.

Tech Stack

Layer | Technology | Purpose
Backend | Python 3.12+ / FastAPI | Platform server, Module Loader, Agent Executor
Platform DB | PostgreSQL 17 (Cloud SQL) | Tenant config, identity, orchestration (shared, RLS)
Tenant Data | PostgreSQL (dedicated per tenant) | Module business data (per-tenant, via TenantDataConnector)
Auth | JWT + Firebase Auth | Email/password, OAuth, SSO/SAML
Frontend | React + TypeScript (Vite, Ant Design) | Dashboard, module views, agent chat
AI | Claude (Anthropic) + multi-provider failover | Agent executor, skills, tool calling
Streaming | SSE (FastAPI StreamingResponse) | Real-time agent + bridge events (sketch below)
Streaming Service | Python or Go, Cloud Run / GKE | Dedicated service: live SCADA streams, topic routing, WebSocket fan-out, live state cache
Live State Cache | Redis / Memorystore | In-memory live equipment state, zone overviews, active alerts. Sub-millisecond reads.
Bridge | Rust native application + embedded Python runtime | Installable service (headless CLI or system tray). Native hardware access (OPC-UA, Modbus, serial), WebSocket client, SCADA ingestion, ETL processing, Python tool execution, future local model inference
Hosting | GCP (Cloud Run, auto-scaling) | Serverless containers
CI/CD | Cloud Build | Build, test, deploy pipeline
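
The SSE row above follows the standard FastAPI streaming pattern. A minimal sketch, with a stub event generator standing in for the Agent Executor:

# Minimal SSE sketch using FastAPI StreamingResponse. The event generator is a
# stub; in the platform, events come from the Agent Executor and Bridge Manager.
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def agent_events() -> AsyncIterator[str]:
    yield 'event: token\ndata: {"text": "Checking active alarms..."}\n\n'
    yield 'event: done\ndata: {}\n\n'


@app.get("/agents/stream")
async def stream_agent_session() -> StreamingResponse:
    return StreamingResponse(agent_events(), media_type="text/event-stream")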

Database Schemas

PLATFORM DATABASE (PostgreSQL, shared, RLS on tenant_id)
│
├── tenants                        ← Tenant registry (slug, name, status)
├── users                          ← All accounts (tenant_id FK, RLS)
├── roles                          ← Per-tenant roles
├── role_permissions               ← RBAC grants per role per module
├── user_role_assignments          ← Links users to roles
├── module_registry                ← All available modules (synced from manifests)
├── tenant_module_subscriptions    ← Which modules each tenant subscribes to
├── tenant_data_stores             ← Connection config per tenant (database, credentials)
├── agent_registry                 ← All agents, skills, tool bindings
├── agent_tool_permissions         ← Tool security policies
├── bridge_instances               ← Registered bridges per tenant
├── bridge_tool_sync               ← Tool definition sync state
├── bridge_health_log              ← Bridge heartbeat history
├── form_definitions               ← Reusable JSON form schemas
├── schedules                      ← Schedule definitions
├── schedule_assignments           ← Module entity ↔ schedule links
├── schedule_occurrences           ← Auto-generated due dates
├── module_settings                ← Per-tenant module configuration (JSONB)
├── storage_buckets                ← GCS bucket registry (provisioned from manifests)
├── storage_files                  ← File metadata (tenant_id, module, purpose, GCS path)
├── (integrations are modules — no dedicated integration tables)
├── user_recall_events             ← Memory recall (housekeeping mode)
└── audit_log                      ← Immutable change tracking
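
As a sketch of how the RLS boundary on these tables could be expressed in an Alembic migration (table and policy names are illustrative, and the text cast keeps the example independent of the tenant_id column type):

# Illustrative Alembic migration enabling RLS on a platform table.
# Table and policy names are examples; the pattern (tenant_id checked against
# the session's app.current_tenant_id) is what the platform enforces.
# Revision identifiers omitted.
from alembic import op


def upgrade() -> None:
    op.execute("ALTER TABLE module_settings ENABLE ROW LEVEL SECURITY")
    op.execute(
        """
        CREATE POLICY tenant_isolation ON module_settings
        USING (tenant_id::text = current_setting('app.current_tenant_id'))
        """
    )


def downgrade() -> None:
    op.execute("DROP POLICY IF EXISTS tenant_isolation ON module_settings")
    op.execute("ALTER TABLE module_settings DISABLE ROW LEVEL SECURITY")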

TENANT DATA STORES (dedicated PostgreSQL per tenant)

Tenant: Tenant A (PostgreSQL: tenant_a_modules)
├── scada_readings        ← Time-series sensor data (partitioned by timestamp)
├── alarms                ← Alarm events and acknowledgments
├── alarm_rules           ← Threshold configurations
├── equipment_status      ← Current equipment state
├── work_orders           ← Maintenance work orders
├── work_order_tasks      ← Task breakdown per WO
├── pipeline_definitions  ← ETL pipeline configs
├── pipeline_runs         ← ETL execution history
├── assets                ← Equipment asset registry
└── report_definitions    ← Report templates

Tenant: Tenant B (PostgreSQL: tenant_b_modules)
├── scada_readings        ← (same schema, different database)
├── alarms
├── work_orders
├── vehicles              ← AGV fleet (Tenant B subscribes to the AGV module)
├── routes
└── charging_stations

Tenant: Tenant C (PostgreSQL: tenant_c_modules)
├── work_orders           ← (same schema, dedicated PostgreSQL)
└── report_definitions
Implementation — Team structure, directory ownership, build order, and Git strategy

Implementation Strategy

Foundation-First: Why This Order Matters

The platform is built foundation-first. The foundation — core services, agent system, CLI, platform shell, and the first module — must be established before the team scales into independent module development. How the foundation gets built, who contributes what, and how the team divides the work are decisions the team makes together at kickoff. What matters is the outcome: a working platform with proven patterns that every coding agent can follow.

Team Discussion Point: How Coding Agents Learn Project Conventions

Coding agents strongly infer project conventions from both instruction files like CLAUDE.md and the existing codebase. In practice, repeated code patterns often become the strongest signal for how new code should be written.

Because of that, early implementations have outsized influence. The first examples of service structure, decorators, testing style, error handling, and data access often become the default pattern the agent extends.

These patterns are not impossible to change, but they are sticky. If the team wants to adopt a new convention, it usually needs to update the written guidance, refactor or replace old reference examples, and review agent-generated code until the new pattern is consistently represented.

That creates several team decisions:

  • Who owns the initial canonical patterns — the first examples of each convention need to be intentional, not accidental
  • When a convention is considered stable — at what point does the team agree "this is how we do it" vs. "this is still evolving"
  • How convention changes are rolled out — updating CLAUDE.md, refactoring old reference examples, and reviewing agent-generated code until the new pattern is consistently represented
  • What level of review is required for precedent-setting code — early code needs more scrutiny because it sets the template for everything that follows

The architecture defines the boundaries. The team must intentionally define and maintain the patterns within those boundaries, especially early in the project and during any convention shift.

Build Order

The platform is built in a specific order because frontend and backend must stay aligned. Experience shows that building a large backend first and connecting the frontend later creates orphan endpoints, duplicated logic, and confused coding agents. The approach: build each capability as a vertical slice — backend API + frontend screen together — so the system is always testable and the agent always sees connected patterns.

Why frontend and backend build together: When a coding agent builds backend endpoints without a connected frontend, it has no feedback loop. Later, when the frontend is added, the agent often creates new endpoints instead of connecting to existing ones — because it can't see the relationship. The result: orphan endpoints, conflicting API patterns, and a framework the agent struggles to navigate. Building each feature as a vertical slice (API + UI together) prevents this. The agent sees the full pattern: "this endpoint serves this screen" — and follows that pattern for every future feature.

Phase 0 — UI Framework + Platform Foundation (Parallel, 2 Weeks)

Two tracks run in parallel. Both are essential and both inform each other.

Track A: Platform Foundation

Build the fully working platform: core services, agent system, CLI, bridge, security, and the first module. Each backend capability gets a minimal frontend connection point so the pattern is established.

Track B: UI Framework & Standards

Design the product UI: wireframes, navigation maps, CSS standards, TypeScript conventions. Build the navigable shell with blank screens. Every screen from both applications mapped and agreed on before implementation begins.

Track B Deliverables: UI Framework

  • Platform Application wireframes: Every screen from the Screen Inventory — Dashboard, Tenant Management, Module Catalog, Module Pipeline, Agent Designer, Agent Observability, Model Registry, Bridge Dashboard, Audit Log, Cost Dashboard, Platform Settings
  • Tenant Application wireframes: Module workspace (panel carousel, action panel), Agent Chat, Notifications, User Profile. Tenant Admin: Module Catalog, Module Settings, User Management, Role Management, Permission Browser, Bridge Management, Audit Log
  • CSS design standards: Platform CSS custom properties finalized (colors, typography, spacing, breakpoints). Component patterns (cards, tables, forms, buttons, badges, alerts). Dark mode. Responsive behavior at desktop/tablet/mobile
  • TypeScript standards: SDK hook interfaces (usePermission, useApi, useTenant, useActionPanel). Component prop patterns. State management approach. Module entry point contract
  • Navigable shell: Both applications as working React apps with real navigation — sidebar items link to blank placeholder screens. Click through the full product flow. Structure is real, content is placeholder

Track A Build Sequence: Platform Foundation

1

Platform Core + Auth

Module Loader, @platform_tool, TenantDataConnector, CoreDataService, Service Registry, Platform DB (Alembic migrations, RLS), tenant provisioning. Auth middleware (JWT, slug-based routing), security middleware (CORS, CSP, CSRF). Frontend: Login flow, auth context, tenant routing — connected to real backend from day one.
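
A rough sketch of the slug-routing piece of this step, reduced to the subdomain extraction; JWT validation and the RLS context setting are noted in comments rather than implemented, and the middleware name is an assumption:

# Sketch of slug-based tenant routing. JWT validation and the downstream
# SET app.current_tenant_id call are handled by the auth/RLS middleware layers.
from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def tenant_context(request: Request, call_next):
    # acme.app.dematic.com -> "acme"; the slug keys the tenant lookup and the
    # RLS context for every query on this request.
    request.state.tenant_slug = request.headers.get("host", "").split(".")[0]
    return await call_next(request)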

2

Agent System

Agent Executor, Tool Registry, Skills Registry, prompt assembler, memory manager, guardrails, tool dispatcher, LLM providers, failover, SSE streaming, Model Registry, embeddings. Frontend: Agent chat panel connected via SSE. CLI: platform agents test for headless testing.

3

Platform CLI + Dev Workflow

platform dev push, platform dev logs, platform dev watch, platform dev test, platform dev tools list, platform modules create, platform modules clone, platform modules validate, platform modules package, platform modules upload, platform tenants.

4

Bridge Infrastructure

Native Rust application: WebSocket client, hardware drivers (OPC-UA, Modbus), embedded Python runtime, SCADA ingestion, ETL engine, credential vault, health monitoring. Headless CLI and system tray modes.

5

First Module: SCADA Monitor (Full Vertical Slice)

Complete module proving every platform feature: module.json, api.py, service.py, schemas.py, ai/ (agent, skills, tools, bridge tools), ui/ (alarm dashboard, sensor history, settings), tests/. Backend and frontend built together — every API endpoint has a connected UI component. This module is the pattern every future module follows.

6

CLAUDE.md + ModuleSDKInstruct.md

Written from the codebase, not before it. Every convention, every pattern, every "never do this" — documented based on working code. These files are the foundation every coding agent reads.

Phase 1 — Team Joins (After Foundation)

With the foundation established, the team reviews the codebase and aligns on assignments together. Work splits into two tracks that run in parallel: module development and platform UI completion. Assignments are based on team strengths, reviewed and agreed as a group — not pre-assigned.

Work Track | What It Covers | How It's Assigned
Module Development | New modules: Maintenance, ETL Pipeline, Reporting, Asset Tracking, Engineering Configuration. Each module is a self-contained directory — zero conflicts between developers. | Team reviews the module list, discusses complexity and dependencies, and self-selects based on strengths and interest. Each developer owns one or more modules.
Platform UI Completion | Connect the Phase 0 wireframe screens to real backend APIs: Platform Admin (tenant management, module pipeline, agent designer, observability), Tenant Admin (user/role management, settings, bridge management), Bridge Dashboard. | Team reviews the Screen Inventory and divides screens based on frontend experience and backend familiarity. Each screen is a vertical slice — API + UI together.
💡
Module work and platform UI work happen in parallel without conflicts. Developers building modules work in app/modules/{name}/. Developers finishing platform UI work in frontend/platform/ and frontend/tenant/. Different directories, different concerns. The Consultant maintains the platform core, CLI, and CLAUDE.md — and transitions to project owner as the team takes ownership of their areas.

Ongoing

The team builds over time. New modules, new customer requirements, new platform capabilities. The process stays the same: clone a module, load the skills package, push to dev, iterate, submit through the pipeline. The foundation is set. The patterns are proven. New team members start productive on day one because the codebase teaches them how the project works.


CLAUDE.md — Starting Point for Team Review

This is a draft CLAUDE.md built from the architecture defined in this proposal. It is a starting point for team review — the team should discuss, adjust, and finalize before development begins. Once finalized, this file lives in the project root and is read by every coding agent on every session.

💡
This is not documentation. This file is instructions for a coding agent. It must be short, opinionated, and unambiguous. One way to do each thing. No optional styles. Every rule here is enforced — if the agent violates it, the code should be rejected in review.
# CLAUDE.md — Modular Application Platform
# Read by every coding agent on every session. Follow strictly.
# Sections 2-30 of the architecture proposal compressed into rules.

## Project Identity

Multi-tenant platform with pluggable modules.
Backend: Python 3.12+ / FastAPI
Frontend: React + TypeScript (Vite)
Database: PostgreSQL 17 — shared Platform DB (RLS) + dedicated DB per tenant
Bridge: Rust native application with embedded Python runtime
AI: Claude (Anthropic) primary + multi-provider failover

Two UI applications:
- Platform Application (admin.platform.dematic.com) — internal management
- Tenant Application ({slug}.app.dematic.com) — customer workspace

## Architecture Boundaries (NEVER violate)

### Tenant Isolation (Sections 3, 5)
- Tenant isolation is STRUCTURAL, not conventional
- TenantDataConnector is pre-scoped — no mechanism to reach another tenant
- Platform DB uses RLS: SET app.current_tenant_id on every session
- Tenant DBs are physically separate PostgreSQL databases
- NEVER pass tenant_id as a parameter. NEVER construct cross-tenant queries
- Bridge connections route to current tenant only
- Agents inherit the user's tenant context — immutable for the session

### Module Isolation (Sections 10, 12)
- Modules NEVER import from other modules
- Cross-module: Service Registry → service_registry.invoke("MODULE_CODE", "service_name")
- Cross-module agents: dispatch_agents platform tool
- Each module is a self-contained directory under app/modules/

### Security (Sections 5, 6, 17)
- Security is platform-enforced. Module code writes ZERO security code
- Backend: @platform_tool(permission_code="...") checks permissions automatically
- Frontend: usePermission() and PermissionGate for UI gating
- API middleware: JWT validation, tenant routing, RLS context — automatic
- NEVER handle JWT tokens, login flows, or session state in module code
- UI permissions are UX convenience — the real boundary is the API layer
- Agents operate at the user's permission ceiling — cannot exceed it

## Platform Layer (Sections 4, 7, 8, 9)

### Platform DB (app/data/platform_db/)
- Alembic migrations for ALL platform tables
- RLS policies on every table with tenant_id
- Migration numbers assigned centrally — request before creating
- Tables: tenants, users, roles, role_permissions, user_role_assignments,
  module_registry, tenant_module_subscriptions, tenant_data_stores,
  agent_registry, agent_tool_permissions, bridge_instances, bridge_tool_sync,
  bridge_health_log, schedules, schedule_assignments, schedule_occurrences,
  module_settings, storage_buckets, storage_files, form_definitions,
  notifications, notification_preferences, user_recall_events, audit_log

### CoreDataService (app/core/core_data_service.py)
- Access platform data (users, roles, permissions, settings) with RLS
- get_current_tenant(), get_user_permissions(), get_module_settings()
- NEVER query platform tables directly — use CoreDataService

### TenantDataConnector (app/core/tenant_data_connector.py)
- PostgreSQL only. One adapter: PostgreSQLAdapter (asyncpg)
- Methods: query, insert, update, delete, aggregate, vector_search, get_schema
- vector_search uses pgvector (cosine similarity)
- provision_tables called on tenant module subscription
- NEVER import database drivers directly. NEVER construct SQL strings

### Platform Scheduler (app/core/scheduler/)
- Modules declare scheduling in module.json: entity_types + agent_triggers
- Scheduler generates occurrences per tenant at configured frequency
- Agent triggers spawn agent sessions on platform events
- Modules provide handler functions in scheduler.py
- NEVER build custom cron/timer loops in modules

### Service Registry (app/core/service_registry.py)
- Modules register services for cross-module access
- service_registry.invoke("MODULE_CODE", "service_name", ...)
- Permission-checked at invocation
- NEVER import from another module — use the registry

## Module Layer (Sections 10-14)

### Module Structure
app/modules/{module_name}/
├── module.json           # REQUIRED — only file the loader needs
├── api.py                # FastAPI router — must export `router`
├── service.py            # Business logic — shared by api.py AND ai/tools.py
├── schemas.py            # Pydantic models — strict types, no Any
├── requirements.txt      # Backend Python deps (own installs)
├── ui/                   # Frontend (optional)
│   ├── index.tsx          # Entry point — mounted by platform shell
│   ├── package.json       # @platform/ui-sdk required
│   └── tsconfig.json      # Extends platform base (strict: true)
├── ai/                   # Agent system (optional)
│   ├── skills/{name}/SKILL.md
│   └── tools.py           # @platform_tool implementations
├── bridge/               # Bridge tools (optional)
│   └── tools.py           # execution="bridge"
└── tests/

Modules can create ANY additional folders (integrations/, wms/, docs/, etc.)
Platform only reads what the manifest references.

### Service Layer Pattern (THE MOST IMPORTANT PATTERN)

service.py is the CENTER. Both api.py and ai/tools.py call into service.py.
Business logic lives in service.py ONLY. Never in api.py. Never in tools.py.

# service.py
async def get_active_alarms(
    data: TenantDataConnector,
    zone_id: str | None = None,
) -> list[dict]:
    filters = {"acknowledged": False}
    if zone_id: filters["zone_id"] = zone_id
    return await data.query("alarms", filters=filters, order_by="created_at DESC")

# api.py — calls service
@router.get("/alarms")
async def list_alarms(zone_id: str | None = None):
    return await service.get_active_alarms(get_data(), zone_id)

# ai/tools.py — calls the SAME service function
@platform_tool(module_code="SCADA_MONITOR", permission_code="SCADA.READ_SENSORS",
    summary_instructions="Get active unacknowledged alarms.")
async def get_active_alarms(context: ToolContext, data: TenantDataConnector,
    zone_id: str | None = None) -> list[dict]:
    return await service.get_active_alarms(data, zone_id)

### @platform_tool Rules
Every tool MUST have: module_code, permission_code, summary_instructions
Every tool MUST delegate to service.py
Platform auto-appends retrieval directive to summary_instructions for tools
  with full_instructions — agents call get_tool_instructions() for detail
Bridge tools MUST also have: execution="bridge", risk_level, audit, error_policy

### Module Manifest (module.json)
Required: module_code, module_name, module_version, api_base_path
Permissions include default_roles: ["admin","manager","operator","viewer"]
UI declares: entry, nav_label, nav_icon, nav_group, min_permission
Scheduler declares: entity_types (with handler + frequencies) + agent_triggers

## Agent System (Sections 18-22)

### Agent Executor (app/agents/executor.py)
- Loop: Assemble prompt → Call LLM → Text or Tools → Execute → Safety → Loop
- Every agent runs through the same executor — skills/tools make each unique
- Context budget managed: compaction at threshold, forecasting before each call
- Max iterations configurable per agent (5-30)
- Failover chain: primary model fails → next in chain

### Skills (app/agents/skills_registry.py)
- Follow agentskills.io open standard (SKILL.md with YAML frontmatter)
- Progressive disclosure: discovery (name+desc) → activation (full SKILL.md)
  → resources (references/ loaded on demand)
- Skills declare tools_required — resolved from Tool Registry

### Tool Registry (app/agents/tool_dispatcher.py)
- All tools registered at startup from @platform_tool decorators
- Tools scoped: platform (any agent) or module (owning module's agents)
- Tool execution pipeline: tenant scoping → security → pre-conditions →
  approval gate → execute → post-conditions → success criteria → audit
- Tenant scoping is STEP 1 — before any security check or tool logic

### Prompt Assembly Order
1. Agent Persona (from manifest)
2. Skill Discovery (name + description, ~100 tokens each)
3. Activated Skill Content (full SKILL.md body)
4. Context Providers (@context_provider injections)
5. Tool Summary Instructions (injected into prompt + retrieval directive)
6. Security Context (tenant, permissions, dispatch scope)
7. Recall Events (previous session knowledge)
8. Conversation History (recent full, older compacted)

### Memory Modes
- none: stateless
- automatic: compaction at threshold, agent-managed todos
- housekeeping: compaction + recall events + system failover

## UI Layer (Sections 15-17)

### Platform Shell
- Workspace: Sidebar (nav) + Panel Carousel (multiple open) + Action Panel (right)
- Shell owns: auth, navigation, panel lifecycle, responsive layout, theming
- Modules render inside panels — full control over panel content
- Agent chat is a panel like any other — context-aware to focused module

### Action Panel
- Shared right sidebar — any module can push content (viewers, 3D, docs)
- One owner at a time — new content replaces current
- SDK hook: useActionPanel().open({ component, props })

### Frontend Rules
- CSS Modules (.module.css) ONLY. No global CSS.
- Use platform CSS variables: var(--platform-color-primary), etc.
  NEVER hardcode colors, fonts, spacing
- TypeScript strict: true. No `any` on SDK interfaces.
- SDK hooks: usePermission, useApi, useTenant, useUser, useBreakpoint,
  useModuleSettings, useAgent, useActionPanel
- PermissionGate hides UI elements — API enforces security
- Responsive: desktop (>1024), tablet (768-1024), mobile (<768)
- NEVER modify shell DOM. NEVER access other modules' state.

### Permissions (Section 17)
- Standard roles: Admin, Manager, Operator, Viewer
- Modules define permissions with default_roles in manifest
- Flat additive model — no inheritance, every grant explicit
- Same permission code used in @platform_tool AND usePermission()
- Agents inherit user's permissions — cannot exceed

## Connectivity (Sections 23-27)

### Bridge (app/bridge/)
- Native Rust application + embedded Python runtime
- Install: headless CLI (servers) or system tray (monitoring UI)
- Hardware: OPC-UA, Modbus, serial — native from Rust core
- Tool execution: Python handlers run in embedded runtime
- Multi-bridge: tenants can have multiple bridges (per site)
- Routing: tool calls routed by site_id or equipment_id
- Tool sync: bridge pulls module handlers on connect/reconnect
- Auth: API key + mutual TLS on WebSocket

### Streaming Service
- Separate Cloud Run/GKE service — NOT part of FastAPI backend
- WebSocket hub with topic routing and tenant-scoped streams
- Live state cache (Redis) for equipment status
- Bridges publish to tenant topics, frontend subscribes

### Module Data Connectors
- TenantDataConnector = PostgreSQL (always available, default)
- Module data connectors = opt-in external sources (BigQuery, Redshift, S3)
- Configured per tenant in Admin Console, declared in module manifest
- Module decides what to connect to — platform provides the adapter

## Naming Conventions

Python: snake_case files/functions, PascalCase classes, UPPER_SNAKE constants
TypeScript: PascalCase components/files, camelCase hooks/utils
Module codes: UPPER_SNAKE (SCADA_MONITOR)
Permissions: MODULE.ACTION (SCADA.READ_SENSORS)
CSS modules: kebab-case (alarm-dashboard.module.css)

## NEVER List (applies to ALL platform and module code)

- Create new architectural patterns without team discussion
- Import from another module's directory
- Write raw SQL or create direct database connections
- Handle auth/JWT in module code
- Skip TypeScript strict mode
- Create global CSS
- Pass tenant_id as a parameter
- Hardcode IDs, credentials, or connection strings
- Use eval(), os.system(), or subprocess without approval
- Commit code that isn't a good example (it becomes the pattern)
- Build custom cron/timer loops (use Platform Scheduler)
- Build custom notification systems (use send_notification platform tool)
- Build custom file storage (use storage_upload platform tool)
- Accept agent-generated structure without reviewing pattern decisions

## Golden Path Files (copy patterns from these)

Platform:
- Core service: app/core/tenant_data_connector.py
- Decorator: app/core/platform_tool.py
- Middleware: app/middleware/auth.py
- Agent executor: app/agents/executor.py
- Platform tool: app/core/platform_tools/dispatch_agents.py

Module (SCADA Monitor = reference implementation):
- Manifest: app/modules/scada_monitor/module.json
- Service: app/modules/scada_monitor/service.py
- API: app/modules/scada_monitor/api.py
- Tool: app/modules/scada_monitor/ai/tools.py
- Skill: app/modules/scada_monitor/ai/skills/alarm-triage/SKILL.md
- UI entry: app/modules/scada_monitor/ui/index.tsx
- Tests: app/modules/scada_monitor/tests/test_service.py

Frontend:
- Shell: frontend/shell/App.tsx
- Platform screen: frontend/platform/agents/ (pattern for platform screens)
- Tenant screen: frontend/tenant/workspace/ (pattern for tenant screens)
- Shared component: frontend/shared/components/PermissionGate.tsx

## CLI Commands

# Development
platform dev push ./my_module/         # Push to dev platform
platform dev push ./my_module/ --fast  # Backend only
platform dev logs                      # Stream all logs
platform dev logs --errors             # Errors only
platform dev logs --module MY_MODULE   # Filter to module
platform dev test ./my_module/         # Run tests
platform dev watch ./my_module/        # Auto-push on changes

# Discovery
platform dev tools list                # Platform tools available
platform dev connectors list           # Data connectors available
platform dev modules list              # Published modules + services

# Module lifecycle
platform modules create my_module      # Generate skeleton
platform modules clone SCADA ./my_mod  # Clone as starting point
platform modules validate ./my_module/ # Full validation
platform modules package ./my_module/  # Create .mpkg
platform modules upload my_mod-1.0.mpkg # Submit to pipeline

# Agent testing
platform agents test SCADA_MONITOR --prompt "What alarms are active?"
platform agents list

# Admin
platform tenants list
platform tenants create --slug acme --name "Acme Corp"
platform tools list
platform tools inspect write_plc_register
Team review required. This draft is built from the architecture proposal. Before development starts, the team should review every rule, discuss any disagreements, and finalize. Once committed to the repo, this file becomes the law. Changes follow the Pattern Change Protocol — update the file, refactor examples, enforce until stable.

Directory Ownership

Each directory is owned by one person or team. The ownership is assigned during Phase 1 when the team reviews the codebase and agrees on responsibilities. The structure below shows suggested ownership areas — final assignments are a team decision.

app/
│
├── core/                        ← [Owner: ________] contracts, interfaces, SDK surface
│   ├── platform_tool.py            @platform_tool decorator
│   ├── tenant_data_connector.py    TenantDataConnector base class
│   ├── core_data_service.py        CoreDataService
│   ├── service_registry.py         Service Registry
│   ├── module_loader.py            Module auto-discovery + mounting
│   ├── settings.py                 Platform/tenant settings resolution
│   └── interfaces/                 Shared type definitions (frozen after foundation)
│
├── middleware/                   ← [Owner: ________] auth, security, tenant routing
│   ├── auth.py                     JWT validation, tenant resolution from slug
│   ├── security.py                 CORS, rate limiting, CSP, CSRF (Section 6)
│   ├── rls.py                      RLS context management
│   └── request_validation.py       Pydantic validation middleware
│
├── agents/                      ← [Owner: ________] agent system
│   ├── executor.py                 Agent execution loop
│   ├── skills_registry.py          Skills loading and prompt injection
│   ├── memory_manager.py           Memory modes, compaction, recall events
│   ├── prompt_assembler.py         Prompt assembly pipeline
│   ├── tool_dispatcher.py          Tool execution engine
│   ├── guardrails.py               Pre/post validation
│   └── providers/                  LLM provider adapters, failover chain
│       ├── anthropic.py
│       ├── openai_adapter.py
│       └── google_adapter.py
│
├── data/                         ← [Owner: ________] data layer + bridge server
│   ├── platform_db/                 Alembic migrations for platform tables
│   │   └── migrations/
│   ├── connectors/                  TenantDataConnector adapters
│   │   └── postgresql_adapter.py
│   ├── bridge_manager.py            Bridge WebSocket server, tool sync
│   ├── tenant_provisioner.py        Create tenant databases on subscription
│   └── etl/                         Bridge ETL framework
│
├── admin/                        ← [Owner: ________] admin console APIs
│   ├── api.py                       Admin Console API routes
│   ├── tenant_management.py         CRUD tenants, subscriptions, settings
│   ├── module_pipeline.py           Upload, validation, staging, acceptance
│   ├── observability.py             Agent traces, tool logs, cost dashboard
│   └── bridge_dashboard.py          Bridge health monitoring
│
├── modules/                      ← One folder per module, one owner per module
│   ├── scada_monitor/               [Owner: ________]
│   ├── maintenance/                 [Owner: ________]
│   ├── etl_pipeline/                [Owner: ________]
│   ├── reporting/                   [Owner: ________]
│   └── asset_tracking/              [Owner: ________]
│
├── frontend/
│   ├── platform/                 ← Platform Application (admin.platform.dematic.com)
│   │   ├── dashboard/               maps to: platform health APIs
│   │   ├── tenants/                 maps to: app/admin/tenant_management.py
│   │   ├── modules/                 maps to: app/admin/module_pipeline.py
│   │   ├── agents/                  maps to: app/agents/ (designer, observability)
│   │   ├── models/                  maps to: Model Registry APIs
│   │   ├── bridges/                 maps to: app/data/bridge_manager.py
│   │   ├── audit/                   maps to: audit_log APIs
│   │   ├── costs/                   maps to: token usage APIs
│   │   └── settings/                maps to: platform settings APIs
│   ├── tenant/                   ← Tenant Application ({slug}.app.dematic.com)
│   │   ├── workspace/               Module panel rendering engine
│   │   ├── chat/                    Agent chat panel (SSE consumer)
│   │   ├── notifications/           Notification center
│   │   ├── profile/                 User profile/settings
│   │   └── admin/                   Tenant admin screens (Admin role only)
│   │       ├── modules/             Subscribe, configure
│   │       ├── users/               User management
│   │       ├── roles/               Role + permission management
│   │       ├── bridges/             Bridge registration/monitoring
│   │       ├── connectors/          External data connector config
│   │       └── audit/               Tenant audit log
│   ├── shell/                    ← Shared workspace shell (both apps)
│   │   ├── App.tsx
│   │   ├── AuthProvider.tsx
│   │   ├── Sidebar.tsx
│   │   └── PanelManager.tsx
│   └── shared/                   ← Shared components, hooks, SDK, CSS
│       ├── components/              DataTable, FormEngine, PermissionGate
│       ├── hooks/                   usePermission, useApi, useTenant
│       └── styles/                  Platform CSS variables, base styles
│
├── bridge/                       ← DATA PLATFORM ENGINEER (infra) + SR AI ENGINEER (tool executor)
│   ├── src/                         Rust source (native application core)
│   │   ├── main.rs                  CLI entry point + system tray service
│   │   ├── websocket.rs             WebSocket connection to platform (tokio)
│   │   ├── hardware/                Native OPC-UA, Modbus, serial drivers
│   │   ├── tool_executor.rs         Python runtime host + tool dispatch
│   │   ├── scada_ingestion.rs       SCADA data collection + buffering
│   │   ├── credential_vault.rs      Encrypted local credential storage (OS keyring)
│   │   └── health.rs                Heartbeat, health monitoring, resource usage
│   ├── python/                      Embedded Python runtime for module tools
│   │   ├── tool_runner.py           Tool execution harness
│   │   └── etl_engine.py            Local ETL processing
│   └── Cargo.toml                   Rust dependencies
│
└── cli/                          ← CONSULTANT — Platform CLI
    ├── main.py
    ├── modules_cmd.py               platform modules list/validate/upload
    └── tenants_cmd.py               platform tenants list/datastore

Shared Touchpoints (Conflict Risk)

These are the only places where two people might touch the same code. The team agrees on resolution patterns before development starts:

File / Area | Risk | Suggested Resolution
app/main.py | Multiple developers add startup hooks | Single owner. Others submit PRs for their startup registrations.
core/interfaces/ | Everyone reads, changes affect everyone | Frozen after foundation phase. Changes require team discussion and approval.
Alembic migrations | Could conflict if two people add migrations simultaneously | Migration numbers assigned centrally. Developer requests a number before creating a migration. Prevents merge conflicts on migration ordering.


Phase 0 Delivery Checklist

Phase 0 is complete when all of the following work end-to-end — not as isolated components, but as a connected system tested through the CLI:

Milestone | CLI Verification
Platform starts gunicorn app.main:app — all services initialize, modules load, health check returns 200
Create a tenant platform tenants create --slug acme --name "Acme Corp" — tenant record in Platform DB, PostgreSQL database provisioned, admin user seeded
Load a module via CLI platform modules validate ./modules/scada_monitor/ → all checks pass
platform modules upload scada_monitor-1.0.0.mpkg → module available in catalog
Subscribe tenant to module platform tenants subscribe acme SCADA_MONITOR — module tables provisioned in tenant's database, permissions mapped to default roles
Open the UI Navigate to acme.dev.platform.dematic.com — login screen, workspace shell loads, SCADA module appears in sidebar, alarm dashboard renders in panel
Chat with an agent Open agent chat — SCADA agent responds, calls tools (read sensors, query history), returns results with citations, streams via SSE
Test agent via CLI platform agents test SCADA_MONITOR --prompt "What alarms are active?" — agent runs in CLI, shows tool calls, returns response, traces logged
Bridge connects Bridge instance registers, WebSocket connected, health heartbeat flowing, tool definitions synced, SCADA ingestion running
Dev workflow works platform dev push ./modules/scada_monitor/ — code synced, module reloaded, tests run < 10 seconds
platform dev logs --errors — log streaming works, errors visible in real time
Admin Console works Platform Admin: tenant management, module catalog, pipeline, agent traces, bridge status
Tenant Admin: module settings, user/role management, permission browser
CLAUDE.md complete Root CLAUDE.md written from the codebase: every convention, every pattern, every SDK interface, every "never do this." ModuleSDKInstruct.md complete and tested — a coding agent loads it and can build a module correctly on the first try.
Phase 0 is not done until a developer's coding agent can: open the project, load ModuleSDKInstruct.md, run platform modules create test_module, write a basic module, push it to the dev platform, see it load and work, and iterate using platform dev logs. If this workflow isn't smooth, the team will struggle. Fix it in Phase 0 — not in Phase 1 with four developers waiting.

Phase 1 — Team Joins, Assignments Begin (After Foundation)

With the platform foundation established, the team joins with clear assignments. Each developer owns a module directory — completely independent, zero conflicts. Their coding agents open the project, read CLAUDE.md, study the SCADA module, and know how to build.

Team Assignments

Role | Module Assignment | Platform Responsibility | What This Proves
Sr Full Stack Engineer | modules/maintenance/ | Platform UI enhancements, shared component library improvements as needed | CRUD workflow: work orders, technician dispatch, approval flows. Cross-module integration via Service Registry (calls SCADA services). Permission-gated UI with PermissionGate.
Data Platform Engineer | modules/etl_pipeline/ | TenantDataConnector improvements, Bridge enhancements as needed | Bridge-heavy module: local ETL processing, contracted data sync, pipeline scheduling. Tests the Bridge ETL framework and data connector under real load.
Jr Full Stack Engineer | modules/reporting/ | Admin Console enhancements, observability dashboards | Cloud-only module (no bridge tools). Cross-module data access via Service Registry. Good ramp-up: follows established patterns without bridge complexity.
Sr AI Engineer | modules/engineering_config/ | Agent system improvements, prompt optimization, new provider adapters | AI-heavy module: config parsing, document extraction, vector embeddings, similarity search. Proves the agent system handles complex multi-step workflows.

Consultant's Role During Phase 1

Platform owner + module developer

  • Platform core: Owns core/, cli/, app/main.py, CLAUDE.md. Handles bug fixes, SDK enhancements, and CLI improvements as developers encounter issues.
  • Module development: Builds modules/asset_tracking/ like any other developer — same SDK, same CLI, same pipeline. Validates the developer experience firsthand.
  • Code review: Reviews PRs that touch platform core. Module PRs are peer-reviewed by the team. The Consultant reviews module PRs when patterns drift from the established conventions.
  • CLAUDE.md maintenance: Updates CLAUDE.md and ModuleSDKInstruct.md as new patterns emerge or conventions need clarification. These files are living documents during Phase 1.

Shared Touchpoints (Conflict Management)

File / Area | Who Needs It | Resolution
app/main.py | Only if a developer needs a new startup hook | Consultant owns this file. Others submit PRs.
core/interfaces/ | Everyone reads, only Consultant writes | Frozen after Phase 0. Changes require team discussion and Consultant approval.
Alembic migrations | Could conflict if two people add migrations simultaneously | Migration numbers assigned by Consultant. Developer requests a number before creating a migration.
Service Registry contracts | Modules that expose services to other modules | Service interface defined by the owning module. Consuming module submits a PR if the interface needs changes. Both developers agree on the contract. (Sketch below.)
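
To illustrate the contract boundary, a cross-module call goes through the registry rather than an import. The function and argument names below are illustrative; the invoke form follows the convention documented in CLAUDE.md:

# Sketch of a cross-module call via the Service Registry (no direct import of
# the SCADA module). Argument names are illustrative; the call follows the
# service_registry.invoke("MODULE_CODE", "service_name", ...) convention.
async def alarms_for_work_order(service_registry, zone_id: str) -> list[dict]:
    return await service_registry.invoke(
        "SCADA_MONITOR",        # owning module's code, permission-checked at invocation
        "get_active_alarms",    # service name the SCADA module registered
        zone_id=zone_id,
    )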

Phase 2 — Scale and Specialize

New modules, new teams, new customer requirements. The process is the same regardless of how the team grows: clone an existing module, load the skills package, push to the dev platform, iterate, submit through the pipeline. The foundation doesn't change. The patterns are set. The CLAUDE.md and ModuleSDKInstruct.md are proven. Every new developer's coding agent starts productive on day one.


Agent-Driven Development Standards

These are the operational rules for building with coding agents on this project. They apply to every developer, every session, every commit.

Core Principle

Agents don't decide — they replicate the strongest signal in the repo.

That signal is, in order of weight:

  1. Existing code (highest weight — what the agent sees most)
  2. Local instruction files (CLAUDE.md, ModuleSDKInstruct.md)
  3. Your prompt (lowest, most temporary)

If those conflict, code wins. The agent will follow what it sees in the codebase over what you tell it in a prompt. This is why the codebase must be clean.

Non-Negotiable Rules

# | Rule | Detail
1 Establish canonical patterns early Define one clear reference implementation per pattern type: service function, API route, schema/model, tool/agent integration, error handling, test structure. These are reference implementations, not just working code. If it's not a good example, don't commit it.
2 Single source of truth Use CLAUDE.md for naming conventions, file structure, and architectural rules. Keep it short and opinionated, not descriptive. No optional styles. One way to do each thing.
3 Reinforce patterns in code, not just docs Duplicate the canonical pattern across 2-3 clean examples. Use consistent structure, naming, and flow. Agents learn from repetition, not explanation.
4 Kill ambiguity immediately If two patterns exist for the same thing — fix it immediately. Do not "let it ride." Mixed patterns guarantee future inconsistency.
5 Refactor before expanding Before adding new features: clean up existing patterns and align everything with current standards. Scale clean patterns, not legacy ones.
6 Mark deprecated patterns explicitly Don't just stop using old patterns — label them: # DEPRECATED: Do not use. Use NewServicePattern instead. Agents will copy old code unless you block it.
7 Use prompts to override when needed When changing patterns, explicitly tell the agent: "Do NOT follow existing X pattern. Use this new structure instead." Provide a fresh example inline. Otherwise it defaults back to the repo.
8 Review early code aggressively First implementations of anything get high scrutiny. Later code is pattern enforcement, not invention. Early code is architecture, whether you admit it or not.
9 Maintain golden paths Keep a small set of files that represent correct architecture, correct patterns, and correct integrations. These are what agents should copy from. The SCADA module is the first golden path.

What To Avoid

Anti-Pattern | Why It Fails
"We'll clean it up later" You won't. Agents multiply bad patterns faster than you can fix them. By the time you get to cleanup, the bad pattern is in 15 files and the agent treats it as the standard.
Multiple valid ways to do the same thing "Either service pattern is fine" or "Both styles work" guarantees fragmentation. The agent picks whichever one it sees first, and different developers' agents pick different ones.
Over-reliance on documentation Long docs explaining patterns with few actual examples. Agents don't read like humans. They copy code. Three clean examples beat three pages of explanation.
Leaving legacy code untouched Old patterns sitting in the repo without deprecation markers. Agents will keep using them forever. If you don't mark it deprecated, it's still the standard.
Mixing refactor + feature work "Let's improve it while adding this feature." You'll end up with hybrid patterns that are neither the old way nor the new way. Refactor first, then add features on the clean foundation.
Vague prompts "Follow best practices" or "Use existing patterns" tells the agent nothing useful. Be specific: "Follow the pattern in modules/scada_monitor/service.py for the data access layer."
Letting agents define architecture Accepting generated structure blindly without reviewing pattern decisions. The agent reflects your repo — it doesn't design it. Architecture is a human decision.

Pattern Change Protocol

When the team needs to evolve a convention, follow this sequence in order. Skipping steps means the old pattern survives.

1

Define the New Pattern

Write one clean reference implementation — not theory, working code. This becomes the new canonical example.

2

Update CLAUDE.md

Replace the old convention, don't append to it. The instruction file should describe only the current way. No "Option A vs. Option B."

3

Refactor 2-5 Existing Files

Make the new pattern visible in the codebase. The agent needs to see the new pattern repeated — not just defined once. Update the golden path files first.

4

Mark Old Patterns as Deprecated

Add inline comments to any remaining old-pattern code: # DEPRECATED: Do not follow this pattern. See modules/scada_monitor/service.py for current standard.

5

Force Agent Compliance

For the next few coding sessions, explicitly prompt: "Use the new pattern from X, not the old pattern." Review all generated code until the new pattern is consistently produced without prompting.

6

Review Strictly Until Stable

Reject any PR that uses the old pattern. No exceptions. Once the new pattern is the majority in the codebase, the agent will follow it naturally.

Reality check: Patterns are not locked. But they are sticky. And agents amplify whatever you tolerate. If your repo is clean, agents accelerate you. If your repo is messy, agents scale the mess. The choice is made in the first two weeks of the project and reinforced every day after.

Deployment

The platform runs on Google Cloud Platform with auto-scaling serverless containers. Three services, one shared database, dedicated tenant databases, and on-premises bridges per tenant.

Cloud Run · Backend
Platform Server
FastAPI + Module Loader + Agent System + Scheduler + Bridge Manager + SSE
Cloud Run · Frontend
Platform Shell + Module UI
React SPA + Nginx. Workspace shell, panel carousel, action panel, module rendering.
Cloud Run / GKE · Streaming
Streaming Service
WebSocket hub, topic routing, live state cache (Redis), tenant-scoped streams.
Platform DB (PostgreSQL, shared, RLS) · Tenant DBs (PostgreSQL, dedicated per tenant) · Firebase Auth · Secret Manager · Cloud Build · Redis

Startup Sequence

1. alembic upgrade head           # Platform DB migrations only
2. gunicorn app.main:app
   └─ startup event
      ├─ Production safety checks  # Reject if secrets are defaults
      ├─ init_security_middleware() # CORS, rate limiting, CSP, CSRF, HSTS
      ├─ load_modules(app)          # Discover, mount, register all modules
      ├─ init_agent_system()        # Agent Executor, Memory Manager
      ├─ init_scheduler()           # Start Platform Scheduler
      ├─ init_bridge_manager()      # Bridge WebSocket listener (mutual TLS)
      ├─ verify_tenant_data()       # Verify tenant databases, provision missing tables
      └─ sync_bridge_tools()        # Queue tool sync for connected bridges
3. Health check 200               # Traffic routed
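
A sketch of how app/main.py might wire this sequence using a FastAPI lifespan handler. The import paths and helper signatures are assumptions; only the call order mirrors the documented sequence:

# Sketch only: wiring the documented startup sequence. Import paths and helper
# signatures are assumptions; the call order mirrors the sequence above.
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.agents import init_agent_system                       # assumed paths
from app.core.module_loader import load_modules
from app.core.scheduler import init_scheduler
from app.data.bridge_manager import init_bridge_manager, sync_bridge_tools
from app.data.tenant_provisioner import verify_tenant_data
from app.middleware.security import init_security_middleware


@asynccontextmanager
async def lifespan(app: FastAPI):
    # production safety checks (reject default secrets) would run first
    init_security_middleware(app)     # CORS, rate limiting, CSP, CSRF, HSTS
    load_modules(app)                 # discover, mount, and register all modules
    init_agent_system()               # Agent Executor, Memory Manager
    init_scheduler()                  # Platform Scheduler
    init_bridge_manager()             # Bridge WebSocket listener (mutual TLS)
    await verify_tenant_data()        # verify tenant databases, provision missing tables
    await sync_bridge_tools()         # queue tool sync for connected bridges
    yield


app = FastAPI(lifespan=lifespan)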

Git Strategy

Branch Model

main (protected — PR only, requires Consultant review)
│
├── dev (integration branch — daily merges from feature branches)
│
│   Phase 1:
├── feature/core-contracts       ← Consultant
├── feature/middleware-auth      ← Sr Full Stack
├── feature/agent-executor       ← Sr AI Engineer
├── feature/data-connectors      ← Data Platform Engineer
├── feature/admin-console        ← Jr Full Stack
│
│   Phase 2:
├── feature/module-scada         ← Sr AI Engineer
├── feature/module-maintenance   ← Sr Full Stack
├── feature/module-etl           ← Data Platform Engineer
├── feature/module-reporting     ← Jr Full Stack
└── feature/module-assets        ← Consultant

Rules

Rule | Detail
Branch per developer per feature Each developer works on their own feature branch. Never commit directly to dev or main.
PRs merge into dev Review required from the Consultant or the directory owner. PRs touching core/ require Consultant approval.
dev → main weekly After integration testing. Consultant drives the merge. All tests must pass.
Contract changes require discussion Any change to core/interfaces/ is a team conversation. These are shared dependencies. Unilateral changes break other developers.
Migration numbers are centrally assigned Developer requests a number from the Consultant before creating a migration. Prevents merge conflicts on migration ordering.
Modules never import from each other Cross-module access goes through the Service Registry. If a module imports another module's code directly, the PR is rejected.

AI Code Assistant Practices

See Agent-Driven Development Standards for the complete set of rules and patterns for working with coding agents on this project. Key operational practices:

Practice | Detail
Root CLAUDE.md Single source of truth for platform conventions, SDK patterns, and "never do this" rules. Every developer's coding agent reads this on session start. Short, opinionated, one way to do each thing.
Directory-level CLAUDE.md Each developer can add a .claude/ config in their owned directory with context specific to their domain.
Worktrees for isolation Each developer uses git worktree to work in isolated copies. Prevents accidental commits to someone else's branch.
One session per directory Scope each coding agent session to your owned directory. Never run two sessions that touch the same files.