UnblockDevs

Physical AI Systems — Architecture, Design, and Real-World Deployment

Building physical AI systems requires integrating perception, cognition, and action in tight real-time loops. Unlike software-only AI, physical AI systems interact with the physical world — where mistakes have real consequences. This guide covers the engineering architecture of physical AI systems: sensor integration, real-time processing pipelines, the ROS 2 middleware stack, safety frameworks, and the key differences between physical AI and traditional industrial automation.

ROS 2

Robot Operating System — standard middleware for robotics

Perception

seeing and understanding the physical environment in real time

<10ms

control loop latency requirement for real-time robot control

Safety first

fail-safe design is non-negotiable in physical AI systems

1

Physical AI System Architecture Overview

Physical AI systems are fundamentally different from web applications or data science models. They operate under hard real-time constraints, must handle sensor failures gracefully, and interact with a physical world that cannot be "undone." The architecture reflects these constraints.

The four-layer architecture

A physical AI system has four interconnected layers: Perception (sensor data → world model), Planning (world model → decisions), Control (decisions → actuator commands), and Safety (continuous monitoring with the authority to override the other three). Each layer has its own timing budget, and each must fail safely when its inputs are uncertain or hardware malfunctions.
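
As a rough illustration of how data moves through the four layers, here is a toy one-dimensional sketch in plain Python. The function names (perceive, plan, control, safety_gate), the gain of 0.5, the 1 m/s speed cap, and the 0.5 m clearance are all hypothetical. The point is the ordering: the safety check runs last and can override whatever the other layers decided. In a real system that check runs on separate hardware rather than as a function in the same loop.

```python
from dataclasses import dataclass


@dataclass
class WorldModel:
    obstacle_distance_m: float  # nearest obstacle ahead, from perception
    target_x_m: float           # goal position along a 1-D track


@dataclass
class Command:
    velocity_mps: float


def perceive(raw_range_m: float, target_x_m: float) -> WorldModel:
    """Perception: raw sensor data -> world model."""
    return WorldModel(obstacle_distance_m=raw_range_m, target_x_m=target_x_m)


def plan(world: WorldModel, position_x_m: float) -> float:
    """Planning: world model -> desired velocity toward the target."""
    error = world.target_x_m - position_x_m
    return max(-1.0, min(1.0, 0.5 * error))  # proportional, capped at +/-1 m/s


def control(desired_velocity_mps: float) -> Command:
    """Control: decision -> actuator command (a pass-through in this toy)."""
    return Command(velocity_mps=desired_velocity_mps)


def safety_gate(world: WorldModel, cmd: Command, min_clearance_m: float = 0.5) -> Command:
    """Safety: independent check that can replace any command with a stop."""
    if world.obstacle_distance_m < min_clearance_m:
        return Command(velocity_mps=0.0)
    return cmd


def step(raw_range_m: float, target_x_m: float, position_x_m: float) -> Command:
    """One pass through all four layers."""
    world = perceive(raw_range_m, target_x_m)
    cmd = control(plan(world, position_x_m))
    return safety_gate(world, cmd)
```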

2

Core System Components

Perception layer

Sensor drivers, data preprocessing, object detection/classification, SLAM (simultaneous localization and mapping), world model fusion and update. Must run at sensor frame rate (30–100Hz for cameras, 10–20Hz for LiDAR). Low latency is critical — stale perception data causes planning errors.
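
One simple defense against stale perception is to attach the capture time to every observation and have downstream code refuse anything older than a budget. The sketch below assumes a hypothetical Observation type stamped on a monotonic clock; the age budget (for example, two or three sensor frame periods) is a tuning choice, not a standard value.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Observation:
    stamp_s: float  # capture time on a monotonic clock, in seconds
    value: Any


def fresh_or_none(obs: Optional[Observation], max_age_s: float,
                  now_s: Optional[float] = None) -> Optional[Observation]:
    """Return obs if it is no older than max_age_s, otherwise None.

    Callers treat None as "no perception" and fall back to a safe behavior
    instead of planning on an outdated picture of the world.
    """
    if obs is None:
        return None
    now = time.monotonic() if now_s is None else now_s
    age = now - obs.stamp_s
    if age < 0 or age > max_age_s:
        return None  # too old, or stamped in the future: trust neither
    return obs
```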

Planning layer

Task planning (what to do), path planning (how to move through space), and motion planning (specific joint trajectories). Translates high-level goals ("pick object from shelf") into sequences of lower-level actions. Can tolerate considerably higher latency (100–500ms) than the perception and control layers.
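
Path planning is often done on an occupancy grid, and A* is a classic algorithm for it. The sketch below is a self-contained 4-connected version with a Manhattan-distance heuristic (admissible here because every move costs 1). The function name and the grid encoding (0 free, 1 occupied) are assumptions for illustration; production stacks such as Nav2 provide their own planner plugins that add costmaps, robot footprints, and path smoothing.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """A* on a 4-connected occupancy grid (0 = free, 1 = occupied).

    Returns the cells from start to goal inclusive, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def h(c: Cell) -> int:
        # Manhattan distance never overestimates with unit-cost 4-connected moves
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from: Dict[Cell, Cell] = {}
    best_g: Dict[Cell, int] = {start: 0}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk parent links back to the start
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g.get(cell, float('inf')):
            continue  # outdated heap entry; a cheaper route was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float('inf')):
                    best_g[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None
```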

Control layer

PID controllers, model predictive control (MPC), joint torque/position controllers, motor drivers. Must run at actuator update rate — 1kHz for precise robotic arms, 100Hz for mobile robots. A real-time OS or dedicated hardware controller is often required at this layer.
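
A discrete PID controller is a few lines of arithmetic executed once per control tick. The sketch below assumes a fixed timestep dt, clamps the output to actuator limits, and uses simple conditional integration as anti-windup (the integral is frozen while the output is saturated). The class name and the gains in the demo are arbitrary, and a Python loop like this provides no real-time timing guarantees; it only shows the math.

```python
from dataclasses import dataclass, field


@dataclass
class PID:
    """Discrete PID controller for a loop running every dt seconds."""
    kp: float
    ki: float
    kd: float
    dt: float
    out_min: float = -1.0
    out_max: float = 1.0
    _integral: float = field(default=0.0, init=False)
    _prev_error: float = field(default=0.0, init=False)
    _first: bool = field(default=True, init=False)

    def update(self, setpoint: float, measurement: float) -> float:
        error = setpoint - measurement
        # Skip the derivative on the first call to avoid a spike from _prev_error = 0
        derivative = 0.0 if self._first else (error - self._prev_error) / self.dt
        self._first = False
        self._prev_error = error

        candidate_integral = self._integral + error * self.dt
        output = self.kp * error + self.ki * candidate_integral + self.kd * derivative

        # Clamp to actuator limits; keep the new integral only when not saturated
        if output > self.out_max:
            return self.out_max
        if output < self.out_min:
            return self.out_min
        self._integral = candidate_integral
        return output


if __name__ == '__main__':
    # Toy plant: position integrates the velocity command, simulated at 1 kHz
    dt = 0.001
    pid = PID(kp=4.0, ki=4.0, kd=0.0, dt=dt)
    x = 0.0
    for _ in range(5000):  # 5 simulated seconds
        x += pid.update(setpoint=1.0, measurement=x) * dt
    print(f'position after 5 s: {x:.4f}')
```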

Safety monitoring

An independent safety monitor watches all system layers simultaneously. It triggers an emergency stop if unexpected obstacles are detected, joint limits are approached, commands fall outside the safe operating envelope, or communication between layers times out. It must run on hardware independent of the main AI system.
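
The communication-timeout part of that monitor can be sketched as a heartbeat table: each layer reports in, and the monitor latches a stop if any layer stays silent past its budget. This is a minimal sketch with a hypothetical SafetyMonitor class and made-up layer names and timeouts. It only returns a decision; on real hardware the equivalent logic lives on a separate controller and opens the stop circuit directly. Once tripped, it stays tripped until a deliberate reset.

```python
import time
from typing import Dict, Optional


class SafetyMonitor:
    """Latches an emergency stop if any layer misses its heartbeat deadline."""

    def __init__(self, timeouts_s: Dict[str, float], now_s: Optional[float] = None):
        self.timeouts_s = dict(timeouts_s)
        start = time.monotonic() if now_s is None else now_s
        self.last_seen = {layer: start for layer in self.timeouts_s}
        self.estop_reason: Optional[str] = None

    def heartbeat(self, layer: str, now_s: Optional[float] = None) -> None:
        if layer not in self.timeouts_s:
            raise KeyError(f'unknown layer: {layer}')
        self.last_seen[layer] = time.monotonic() if now_s is None else now_s

    def check(self, now_s: Optional[float] = None) -> bool:
        """Return True if motion is allowed. A tripped monitor stays tripped."""
        if self.estop_reason is not None:
            return False
        now = time.monotonic() if now_s is None else now_s
        for layer, timeout in self.timeouts_s.items():
            silent_for = now - self.last_seen[layer]
            if silent_for > timeout:
                self.estop_reason = f'{layer} silent for {silent_for:.3f} s'
                return False
        return True

    def reset(self, now_s: Optional[float] = None) -> None:
        """Operator-initiated reset: clear the latch and restart all timers."""
        start = time.monotonic() if now_s is None else now_s
        self.last_seen = {layer: start for layer in self.timeouts_s}
        self.estop_reason = None
```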

3

ROS 2 — The Standard Physical AI Middleware

ROS 2 (Robot Operating System 2) is the de facto standard middleware for physical AI systems, used in industrial robots, autonomous vehicles, medical devices, and research platforms. It provides publish/subscribe messaging, hardware abstraction, and a rich ecosystem of pre-built packages for common robotic tasks.

What ROS 2 provides

Publish/subscribe messaging between nodes (which can run as separate processes or be composed into one), service calls for request-response patterns, per-node parameters for configuration (ROS 2 has no central parameter server, unlike ROS 1), tf2 for coordinate frame transforms, standardized sensor message formats, visualization with RViz2, and rosbag2 (the ros2 bag command) for data recording and replay.
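
tf2 keeps a tree of coordinate frames (conventionally names like map, base_link, and a sensor frame) and answers "where is this point in that frame?" by chaining transforms along the tree. The arithmetic underneath is rigid-transform composition. The sketch below shows it in 2D with a hypothetical Transform2D class and a made-up robot setup; it is not the tf2 API, which works in 3D with quaternions and buffers transforms over time.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transform2D:
    """Pose of a child frame in its parent frame: translation (x, y) and yaw."""
    x: float
    y: float
    yaw: float  # radians, not normalized here

    def apply(self, px: float, py: float) -> Tuple[float, float]:
        """Map a point from the child frame into the parent frame."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        return self.x + c * px - s * py, self.y + s * px + c * py

    def compose(self, child: 'Transform2D') -> 'Transform2D':
        """If self maps frame B into A and child maps C into B, return C into A."""
        x, y = self.apply(child.x, child.y)
        return Transform2D(x, y, self.yaw + child.yaw)


if __name__ == '__main__':
    # Robot at (2, 0) in the map, facing +90 degrees; LiDAR 0.5 m ahead of base_link
    map_T_base = Transform2D(2.0, 0.0, math.pi / 2)
    base_T_lidar = Transform2D(0.5, 0.0, 0.0)
    map_T_lidar = map_T_base.compose(base_T_lidar)
    # A point 1 m straight ahead of the LiDAR, expressed in the map frame
    print(map_T_lidar.apply(1.0, 0.0))
```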

Key ROS 2 packages

Nav2 (autonomous mobile robot navigation), MoveIt 2 (robotic arm motion planning), sensor_msgs (standard message types for cameras, LiDAR, and IMUs), ros2_control (hardware abstraction for controllers), BehaviorTree.CPP (robot behavior trees), and cv_bridge (conversion between ROS image messages and OpenCV).

DDS real-time communication

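ROS 2 uses DDS (Data Distribution Service) as its default communication layer; eProsima Fast DDS and Eclipse Cyclone DDS are common implementations. Configurable QoS (quality of service) policies such as reliability, history depth, deadline, and lifespan let each topic trade delivery guarantees against latency. This matters in real-time systems, where a late message can be worse than no message: a best-effort sensor stream that drops a stale frame is often preferable to a reliable one that delivers it late.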
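
Two of those QoS ideas are easy to model: a KEEP_LAST history keeps only the newest depth samples, and a lifespan discards samples that are too old to be useful rather than delivering them late. The toy KeepLastQueue class below is a plain-Python illustration of those semantics under hypothetical names, not DDS or rclpy code. In rclpy the corresponding settings are fields of rclpy.qos.QoSProfile (such as history, depth, reliability, and lifespan); check the documentation for your ROS 2 distribution.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List


@dataclass
class Sample:
    stamp_s: float  # publish time, seconds
    data: Any


class KeepLastQueue:
    """Toy model of KEEP_LAST history plus a lifespan. Not a DDS or rclpy API."""

    def __init__(self, depth: int, lifespan_s: float):
        # A bounded deque drops the oldest sample when full, mirroring KEEP_LAST
        self.buffer: Deque[Sample] = deque(maxlen=depth)
        self.lifespan_s = lifespan_s

    def write(self, sample: Sample) -> None:
        self.buffer.append(sample)

    def take(self, now_s: float) -> List[Sample]:
        """Return samples still within their lifespan, oldest first, and empty the queue."""
        fresh = [s for s in self.buffer if now_s - s.stamp_s <= self.lifespan_s]
        self.buffer.clear()
        return fresh
```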

Real-time OS integration

For safety-critical control, ROS 2 nodes run on real-time Linux (the PREEMPT_RT patch), QNX, or VxWorks to get bounded, predictable scheduling latency. The planning and AI layers can run on standard Linux; the low-level control nodes need real-time scheduling to keep jitter below 1ms.
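
Before deciding whether a node needs PREEMPT_RT, it helps to measure wakeup jitter on the target machine. The standard Linux tool for this is cyclictest from the rt-tests suite; the Python sketch below only approximates the idea by sleeping to absolute deadlines and recording how late each wakeup is. The interpreter and its garbage collector add their own latency, so treat the numbers as a rough picture, not a certification measurement. The function name and defaults are hypothetical.

```python
import statistics
import time


def measure_loop_jitter(period_s: float = 0.001, iterations: int = 1000) -> dict:
    """Run an empty fixed-rate loop and report wakeup lateness in microseconds.

    Deadlines are absolute (next_deadline += period_s), so one late wakeup
    does not shift every later deadline.
    """
    lateness_us = []
    next_deadline = time.perf_counter() + period_s
    for _ in range(iterations):
        remaining = next_deadline - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        lateness_us.append((time.perf_counter() - next_deadline) * 1e6)
        next_deadline += period_s
    ordered = sorted(lateness_us)
    return {
        'mean_us': statistics.mean(lateness_us),
        'p99_us': ordered[int(0.99 * (len(ordered) - 1))],
        'max_us': ordered[-1],
    }


if __name__ == '__main__':
    print(measure_loop_jitter())
```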

4

Example: Basic Perception Node in ROS 2

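ROS 2 perception node (Python):

```python
import numpy as np
import rclpy
from cv_bridge import CvBridge
from rclpy.node import Node
from rclpy.qos import qos_profile_sensor_data
from sensor_msgs.msg import Image, PointCloud2
from sensor_msgs_py import point_cloud2
from vision_msgs.msg import Detection3DArray

# Largest camera/LiDAR timestamp gap accepted for fusion.
# Assumed value; tune it for your sensor rates and latencies.
MAX_SYNC_GAP_NS = 50_000_000  # 50 ms


def stamp_to_ns(stamp) -> int:
    """Convert a builtin_interfaces/Time stamp to integer nanoseconds."""
    return stamp.sec * 1_000_000_000 + stamp.nanosec


class PerceptionNode(Node):
    """Fuses camera and LiDAR data into 3D object detections."""

    def __init__(self):
        super().__init__('perception_node')
        self.bridge = CvBridge()
        self.latest_camera_detections = None  # (stamp, detections_2d)

        # Sensor streams use the best-effort sensor-data QoS profile:
        # a dropped frame is better than a late one.
        self.camera_sub = self.create_subscription(
            Image, '/camera/color/image_raw',
            self.camera_callback, qos_profile_sensor_data)

        self.lidar_sub = self.create_subscription(
            PointCloud2, '/lidar/points',
            self.lidar_callback, qos_profile_sensor_data)

        # Publish fused 3D detections (default reliable QoS, queue depth 10)
        self.detection_pub = self.create_publisher(
            Detection3DArray, '/perception/detections', 10)

        # Placeholder: load the 2D detector here, e.g. an ONNX export such as
        # 'yolov8_object_detection.onnx' run through onnxruntime.
        self.detector = None
        self.get_logger().info('Perception node initialized')

    def camera_callback(self, msg):
        """Runs once per camera frame (typically 30 Hz)."""
        image = self.bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
        # Slow inference here blocks other callbacks on a single-threaded executor
        detections_2d = self.detect_2d(image)
        # Keep the capture timestamp so the LiDAR callback can check freshness
        self.latest_camera_detections = (msg.header.stamp, detections_2d)

    def lidar_callback(self, msg):
        """Runs once per LiDAR sweep (typically 10 Hz) and fuses with camera."""
        if self.latest_camera_detections is None:
            return  # no camera data yet
        cam_stamp, detections_2d = self.latest_camera_detections

        # Don't fuse measurements taken too far apart in time
        gap_ns = abs(stamp_to_ns(msg.header.stamp) - stamp_to_ns(cam_stamp))
        if gap_ns > MAX_SYNC_GAP_NS:
            self.get_logger().warning(
                f'camera/LiDAR gap {gap_ns / 1e6:.1f} ms, skipping fusion')
            return

        # N x 3 float array of x, y, z with NaN points removed
        points = point_cloud2.read_points_numpy(
            msg, field_names=('x', 'y', 'z'), skip_nans=True)

        # Project 2D image detections into 3D using the point cloud
        detections_3d = self.fuse_camera_lidar(detections_2d, points)

        detection_msg = Detection3DArray()
        detection_msg.header = msg.header  # results are in the LiDAR frame
        detection_msg.detections = detections_3d
        self.detection_pub.publish(detection_msg)

    def detect_2d(self, image: np.ndarray) -> list:
        """Placeholder: run self.detector on the image and return 2D boxes."""
        return []

    def fuse_camera_lidar(self, detections_2d: list, points: np.ndarray) -> list:
        """Placeholder: project points into the image with the camera
        intrinsics/extrinsics and build vision_msgs/Detection3D messages."""
        return []


def main(args=None):
    rclpy.init(args=args)
    node = PerceptionNode()
    try:
        rclpy.spin(node)  # keep the node running, processing callbacks
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        if rclpy.ok():  # the Ctrl+C handler may already have shut down the context
            rclpy.shutdown()


if __name__ == '__main__':
    main()
```

5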

Safety Design Principles

Safety in physical AI is not optional or a late-stage concern — it must be designed in from the beginning. A bug in a web app shows a wrong UI. A bug in a physical AI system can injure people or destroy expensive equipment.

Fail-safe defaults

The system stops when it is uncertain; it does not carry on. The emergency stop is a hardware circuit independent of software, so a software bug cannot block it. On power-up, motors are disabled. Any communication timeout triggers an immediate controlled stop.
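
In software, those defaults can be mirrored by a small state machine: motors start disabled, any fault latches a stop, and a reset returns to disabled rather than straight back to moving. This is a minimal sketch with hypothetical state and class names; it complements, and never replaces, the hardware emergency-stop circuit.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    DISABLED = auto()  # power-on default: motors off
    ENABLED = auto()   # motion permitted
    STOPPED = auto()   # latched after a fault; needs an explicit reset


class MotionStateMachine:
    """Start disabled, stop on any fault, never resume without a deliberate reset."""

    def __init__(self):
        self.state = State.DISABLED
        self.fault_reason: Optional[str] = None

    def enable(self) -> None:
        # Only a disabled system can be enabled; a latched stop ignores this
        if self.state is State.DISABLED:
            self.state = State.ENABLED

    def fault(self, reason: str) -> None:
        # Any fault (timeout, low confidence, limit reached) forces a stop
        self.fault_reason = reason
        self.state = State.STOPPED

    def reset(self) -> None:
        # Reset returns to DISABLED, not ENABLED: re-enabling is a separate step
        if self.state is State.STOPPED:
            self.state = State.DISABLED
            self.fault_reason = None

    def motors_allowed(self) -> bool:
        return self.state is State.ENABLED
```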

Redundancy for critical components

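Duplicate critical sensors (for example, two cameras covering the emergency-stop zone, not one). Run the independent safety monitor on separate hardware from the AI system. For the highest safety levels, use redundant independent computers for safety-critical decisions: majority voting needs at least three channels (two-out-of-three), while a two-channel design typically lets either channel trigger a stop.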
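
The two voting schemes can be sketched as plain functions, with each channel voting True for "stop". The function names are hypothetical.

```python
from typing import Sequence


def two_out_of_three_stop(votes: Sequence[bool]) -> bool:
    """2oo3: stop when at least two of three channels vote to stop."""
    if len(votes) != 3:
        raise ValueError('2oo3 voting needs exactly three channels')
    return sum(bool(v) for v in votes) >= 2


def one_out_of_two_stop(votes: Sequence[bool]) -> bool:
    """1oo2: stop when either of two channels votes to stop."""
    if len(votes) != 2:
        raise ValueError('1oo2 voting needs exactly two channels')
    return any(votes)
```

The trade-off: 1oo2 favors safety at the cost of more nuisance stops when one channel fails toward "stop", while 2oo3 tolerates one faulty channel in either direction, which is why it appears where both safety and availability matter.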

Operational Design Domain (ODD)

Define precisely where and when the system can operate safely: speed limits, environmental conditions, workspace boundaries, and the object types it can handle. Build hard limits that no command can override, and define a clear transition to a safe state as the system approaches the ODD boundaries.
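
A hard ODD limit is easiest to enforce as a final filter on every motion command, downstream of planning, so no planner output can exceed it. A minimal sketch with a hypothetical ODD record covering a rectangular workspace, a speed cap, and a minimum light level; a real ODD has many more dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ODD:
    max_speed_mps: float
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    min_lux: float  # below this light level the cameras are assumed unreliable


def inside_odd(odd: ODD, x: float, y: float, lux: float) -> bool:
    return (odd.x_min <= x <= odd.x_max
            and odd.y_min <= y <= odd.y_max
            and lux >= odd.min_lux)


def limit_command(odd: ODD, x: float, y: float, lux: float,
                  requested_speed_mps: float) -> float:
    """Outside the ODD the answer is 0; inside it, speed is clamped
    regardless of what the planner requested."""
    if not inside_odd(odd, x, y, lux):
        return 0.0
    return max(-odd.max_speed_mps, min(odd.max_speed_mps, requested_speed_mps))
```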

Validation and certification

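Run simulation tests over very large numbers of edge cases (often millions of scenarios) before hardware testing, then hardware-in-the-loop (HIL) testing. Use fault injection: disconnect sensors mid-operation and verify the safe response. For regulated industries: IEC 61508 (general functional safety, widely applied to industrial systems), ISO 26262 (automotive), and ISO 10218 (industrial robots, with ISO/TS 15066 covering collaborative operation).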
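
Fault injection can start as a simple test harness: wrap a sensor so it drops out on chosen ticks, run the controller every tick, and fail the test if any dropout produced a non-zero motion command. A minimal sketch, assuming a hypothetical range sensor and a trivial controller under test; the last demo line shows the harness catching a naive controller that ignores missing data.

```python
import random
from typing import Callable, Optional, Set


class FlakyRangeSensor:
    """Returns a fixed range, except on injected dropout ticks where it returns None."""

    def __init__(self, range_m: float, dropout_ticks: Set[int]):
        self.range_m = range_m
        self.dropout_ticks = dropout_ticks
        self.tick = 0

    def read(self) -> Optional[float]:
        self.tick += 1
        return None if self.tick in self.dropout_ticks else self.range_m


def safe_controller(range_m: Optional[float]) -> float:
    """Controller under test: stop when there is no reading or an obstacle is close."""
    if range_m is None or range_m < 0.5:
        return 0.0
    return 0.5  # cruise speed, m/s


def survives_dropouts(controller: Callable[[Optional[float]], float],
                      ticks: int = 100, n_dropouts: int = 10, seed: int = 0) -> bool:
    """True if every injected dropout produced a zero-velocity command."""
    rng = random.Random(seed)  # seeded so any failure is reproducible
    dropouts = set(rng.sample(range(1, ticks + 1), k=n_dropouts))
    sensor = FlakyRangeSensor(range_m=3.0, dropout_ticks=dropouts)
    for _ in range(ticks):
        reading = sensor.read()
        cmd = controller(reading)
        if reading is None and cmd != 0.0:
            return False
    return True


if __name__ == '__main__':
    print(survives_dropouts(safe_controller))  # True
    print(survives_dropouts(lambda r: 0.5))    # False: ignores the dropout
```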

6

Deployment Checklist for Physical AI Systems

1

Define the operational domain

Document exactly what conditions the system can handle: speed limits, object types, lighting conditions, workspace boundaries. The system must refuse to operate outside this domain.

2

Run safety hazard analysis

Identify every failure mode and its consequences using FMEA (Failure Mode and Effects Analysis). For each hazard, ask: how likely is it, how severe is it, how likely is it to go undetected, and what is the mitigation?

3

Implement independent safety monitor

Build a watchdog system on separate hardware that monitors all system layers and can trigger an emergency stop regardless of what the main AI is doing. Test it by killing the main AI process and verifying the stop response.

4

Test edge cases and failure modes

Deliberately test: sensor disconnection, corrupted data, power interruptions, communication timeouts, objects outside training distribution. Verify graceful degradation and safe stops in all cases.

5

Run controlled pilot deployment

Deploy to one location with extensive human monitoring before scaling. Log everything. Have an easy remote emergency stop. Collect real-world data to improve the system.

6

Establish OTA update process

Never update software on physical AI systems without validation. Roll new versions out to a small subset of the fleet first (a staged or canary rollout) before expanding. Always maintain rollback capability. Safety-critical updates require extensive testing before deployment.
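
For the hazard analysis in step 2, a common way to rank FMEA entries is the Risk Priority Number (RPN = severity × occurrence × detection, each scored 1 to 10). The sketch below computes it with a hypothetical FailureMode record; the failure modes, scores, and the severity threshold of 9 are invented for illustration. It also places any high-severity mode first regardless of RPN, because a rare failure that can injure someone should not be ranked below a frequent nuisance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (can injure people)
    occurrence: int  # 1 (very rare) to 10 (frequent)
    detection: int   # 1 (almost always caught in time) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection (1 to 1000)."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: List[FailureMode], severity_floor: int = 9) -> List[FailureMode]:
    """Highest priority first: modes with severity >= severity_floor lead,
    then everything is ordered by descending RPN."""
    return sorted(modes, key=lambda m: (m.severity < severity_floor, -m.rpn))


if __name__ == '__main__':
    modes = [
        FailureMode('LiDAR dropout', severity=8, occurrence=4, detection=3),
        FailureMode('e-stop relay welded shut', severity=10, occurrence=1, detection=6),
        FailureMode('camera glare', severity=5, occurrence=6, detection=4),
    ]
    for m in prioritize(modes):
        print(f'RPN {m.rpn:4d}  S={m.severity:2d}  {m.name}')
```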

Physical AI safety is non-negotiable

Software bugs in a web app show a wrong UI. Software bugs in a physical AI system can injure people. IEC 61508 (general functional safety) defines safety integrity levels (SIL), ISO 26262 (automotive) defines automotive safety integrity levels (ASIL), and ISO 10218 sets safety requirements for industrial robots. Production physical AI systems require formal safety analysis and certification before deployment in any environment with humans present.
