TL;DR:

  • Model optimisation (quantisation, pruning) typically cuts inference latency by 2–4x, with under 2% accuracy loss on models that quantise well
  • Hardware selection matters more than framework: Coral TPU dominates sub-2W inference; NVIDIA Jetson is the right choice when you need CUDA flexibility
  • TFLite and ONNX Runtime are both production-ready; ONNX wins on portability, TFLite wins on mobile/MCU breadth

Edge AI inference — running trained neural networks locally on edge devices rather than sending data to the cloud — moved from experimental to standard practice in industrial deployments over the past two years. The shift is driven by three converging factors: purpose-built inference hardware that costs under £100, mature optimisation toolchains, and latency requirements that cloud round-trips simply can’t meet. Here’s what you need to know to put models into production at the edge in 2026.

Model Optimisation: Quantisation and Pruning

A model trained in full float32 precision is almost never the right artefact to deploy at the edge. Quantisation and pruning reduce size, memory bandwidth, and compute requirements — usually with negligible accuracy impact.

Quantisation maps float32 weights and activations to int8 or int4, reducing model size by 4–8x and enabling integer math units that run significantly faster than floating-point. There are two main approaches. Post-training quantisation (PTQ) converts a trained float32 model to int8 using a calibration dataset — 15–30 minutes of work, 2–4x speedup, 0.5–2% accuracy drop typical. Quantisation-aware training (QAT) simulates quantisation during training, which is more work but yields better accuracy on models where PTQ degrades too much. INT4 and FP8 are increasingly supported by hardware — INT4 roughly doubles throughput versus INT8 on compatible accelerators.
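The int8 mapping described above can be sketched in plain Python. This is a minimal illustration of affine quantisation driven by calibration data, not the implementation used by any particular toolchain; the function names and the sample values are made up for the example.

```python
def calibrate(values, num_bits=8):
    """Derive an affine (scale, zero_point) pair from calibration data."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Include 0.0 in the range so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero data
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point, qmin, qmax


def quantise(values, scale, zero_point, qmin, qmax):
    """Map floats to integers in [qmin, qmax], clamping anything out of range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantise(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical calibration sample standing in for real activations or weights.
calibration = [-1.0, -0.25, 0.0, 0.5, 1.5]
scale, zp, qmin, qmax = calibrate(calibration)
q = quantise(calibration, scale, zp, qmin, qmax)
restored = dequantise(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(calibration, restored))
print(q, f"max round-trip error = {max_error:.5f}")
```

Each value now needs one byte instead of four, which is where the 4x size reduction for int8 comes from. The round-trip error is bounded by half the scale, so a calibration set that captures the real value range keeps the scale small.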

Pruning removes weights below a threshold, creating sparse models that require fewer computations. Structured pruning (removing entire channels or layers) translates directly to speedup on standard hardware. Unstructured pruning requires sparse execution engines to realise the gains.
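The difference between the two pruning styles is easiest to see on a toy weight matrix. The sketch below assumes each row is one output channel and ranks channels by L1 norm; both the helper names and the ranking criterion are illustrative choices, not something the text above prescribes.

```python
def prune_unstructured(weights, threshold):
    """Zero individual weights with |w| below threshold; the shape is unchanged."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_structured(weights, keep):
    """Keep the `keep` rows (output channels) with the largest L1 norm."""
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(w) for w in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original channel order
    return [weights[i] for i in kept]


# Toy 3x3 layer: the middle channel carries almost no signal.
layer = [
    [0.90, -0.02, 0.40],
    [0.01, 0.03, -0.02],
    [-0.70, 0.05, 0.60],
]
sparse = prune_unstructured(layer, threshold=0.1)
smaller = prune_structured(layer, keep=2)
print(sparse)
print(smaller)
```

`sparse` is still a 3x3 matrix, so a dense kernel does the same amount of work unless the runtime can skip zeros. `smaller` is a 2x3 matrix, so any standard dense kernel does fewer multiply-accumulates.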

TensorRT (NVIDIA) combines quantisation, layer fusion, and kernel selection into a single compilation pass. A ResNet-50 that takes 45ms on a Jetson Nano in PyTorch drops to 6ms after TensorRT INT8 compilation, a 7.5x speedup.

Hardware Options

Hardware               | Compute    | Power  | Memory    | Price (approx.) | Best For
NVIDIA Jetson Orin NX  | 100 TOPS   | 10–25W | 8–16GB    | ~£400           | Complex vision models, LLMs
NVIDIA Jetson Nano     | 472 GFLOPS | 5–10W  | 4GB       | ~£80            | Mid-range inference, prototyping
Google Coral TPU (USB) | 4 TOPS     | <2W    | Shared    | ~£50            | Dedicated inference, low power
Hailo-8                | 26 TOPS    | 2.5W   | Dedicated | ~£160           | Production vision pipelines
Raspberry Pi 5         | CPU only   | 5W     | 4–8GB     | ~£65            | Light inference, pre/post processing

NVIDIA Jetson Orin is the performance leader in this table: 100 TOPS is enough headroom for 4K object detection at real-time frame rates. Because the GPU runs CUDA, most PyTorch models run without modification, and TensorRT provides the optimisation layer.

Google Coral TPU is the low-power leader. At under 2W, it delivers 4 TOPS of INT8 inference, comparable to a Jetson Nano at about one-fifth of the Nano’s peak power draw. The constraint is on-chip memory: the Edge TPU runs at full speed only when the model’s parameters fit in its roughly 8MB of on-chip memory. Larger models still run, but parameters that do not fit must be fetched from off-chip memory during inference, which is slower. For small models like MobileNetV2 or EfficientDet-Lite, it’s hard to beat.
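A quick first-pass check on the 8MB constraint is to estimate the weight footprint from the parameter count. The sketch below assumes int8 weights and uses approximate, commonly quoted parameter counts; the helper names are hypothetical, and the Edge TPU compiler's own report remains the authoritative answer because on-chip memory is not used for weights alone.

```python
EDGE_TPU_SRAM_BYTES = 8 * 1024 * 1024  # the ~8MB on-chip figure quoted above


def weight_bytes(num_params, bits_per_weight=8):
    """Approximate storage needed for the weights alone."""
    return num_params * bits_per_weight // 8


def fits_on_chip(num_params, bits_per_weight=8, budget=EDGE_TPU_SRAM_BYTES):
    """True if the weights alone fit in the on-chip budget (an optimistic check)."""
    return weight_bytes(num_params, bits_per_weight) <= budget


# Approximate parameter counts, rounded; treat them as assumptions.
APPROX_PARAMS = {"MobileNetV2": 3_500_000, "ResNet-50": 25_600_000}

for name, params in APPROX_PARAMS.items():
    size_mb = weight_bytes(params) / (1024 * 1024)
    print(f"{name}: ~{size_mb:.1f}MB int8 weights, fits on chip: {fits_on_chip(params)}")
```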

Hailo-8 occupies the middle ground: 26 TOPS at 2.5W, with a compiler that handles larger models than Coral and better memory bandwidth. It’s gaining traction in automotive and industrial vision systems where Coral is too constrained and Jetson too power-hungry.
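One way to compare the accelerators on a single axis is peak TOPS per watt, using only the figures from the hardware table. This is a rough sketch: the Jetson Nano is left out because its compute is quoted in GFLOPS rather than TOPS, the Coral figure is a lower bound because the table only says "<2W", and peak TOPS says nothing about how well a given model uses the hardware.

```python
# (peak TOPS, rated power in watts) copied from the hardware table above.
ACCELERATORS = {
    "Jetson Orin NX": (100, 25),  # upper end of the 10-25W range
    "Coral USB": (4, 2),          # table says "<2W", so the result is a lower bound
    "Hailo-8": (26, 2.5),
}


def tops_per_watt(tops, watts):
    """Peak throughput per watt of rated power."""
    return tops / watts


efficiency = {name: tops_per_watt(*spec) for name, spec in ACCELERATORS.items()}
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.1f} TOPS/W")
```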

TFLite vs ONNX Runtime

                      | TFLite                        | ONNX Runtime
Source frameworks     | TensorFlow, Keras             | PyTorch, TF, scikit-learn, XGBoost, 20+ others
Hardware acceleration | Edge TPU, GPU delegate, NNAPI | CUDA, TensorRT, CoreML, DirectML, ROCm
MCU support           | TFLite Micro (Cortex-M)       | No
Model format          | .tflite                       | .onnx
Python API            | tflite-runtime                | onnxruntime

TFLite is the right choice when deploying to microcontrollers (Cortex-M via TFLite Micro), Android/iOS, or Coral TPU. The Edge TPU delegate integrates directly into the TFLite runtime.
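Loading the Edge TPU delegate from Python might look like the sketch below. It assumes the `tflite-runtime` package and the Edge TPU runtime library are installed on the device, and that the model has already been compiled for the Edge TPU; the helper names are illustrative.

```python
import platform

# Shared-library names for the Edge TPU runtime on each host OS.
EDGETPU_LIBS = {
    "Linux": "libedgetpu.so.1",
    "Darwin": "libedgetpu.1.dylib",
    "Windows": "edgetpu.dll",
}


def edgetpu_library(system=None):
    """Return the Edge TPU runtime library name for the given (or current) OS."""
    return EDGETPU_LIBS[system or platform.system()]


def make_interpreter(model_path, use_edgetpu=True):
    """Build a TFLite interpreter, optionally routing supported ops to the Edge TPU."""
    # Imported lazily so the helpers above work without tflite-runtime installed.
    from tflite_runtime.interpreter import Interpreter, load_delegate

    delegates = [load_delegate(edgetpu_library())] if use_edgetpu else []
    interpreter = Interpreter(model_path=model_path, experimental_delegates=delegates)
    interpreter.allocate_tensors()
    return interpreter
```

Passing `use_edgetpu=False` runs the same `.tflite` file on the CPU, which is a convenient way to compare outputs while debugging.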

ONNX Runtime wins on portability — the same .onnx file runs on CPU, CUDA, TensorRT, CoreML, and ROCm with an execution provider swap. For teams training in PyTorch and deploying across heterogeneous hardware, ONNX’s single model format is a significant operational advantage. Most UK teams building cross-platform inference pipelines have landed on ONNX for exactly this reason.
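The execution provider swap can be sketched as a preference list filtered by what the installed ONNX Runtime build actually offers. The preference order and the helper names are assumptions for illustration; the provider strings are the identifiers ONNX Runtime uses.

```python
# Preferred order: fastest accelerator first, CPU as the universal fallback.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that this build supports, in priority order."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


def load_session(model_path):
    """Open the same .onnx file with whatever acceleration the host provides."""
    import onnxruntime as ort  # imported lazily; requires the onnxruntime package

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


# Example with a stubbed list of available providers.
print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

The model file and the calling code stay the same across targets; only the list returned by `pick_providers` changes.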

Latency Benchmarks

MobileNetV2 image classification, 224×224, batch 1:

Platform             | Framework         | Latency
Jetson Orin NX (GPU) | TensorRT INT8     | 1.2ms
Jetson Nano (GPU)    | TensorRT INT8     | 8ms
Coral USB TPU        | TFLite INT8       | 6ms
Hailo-8              | Hailo SDK INT8    | 4ms
Raspberry Pi 5 (CPU) | ONNX Runtime FP32 | 95ms
Raspberry Pi 5 (CPU) | TFLite INT8       | 38ms

The Raspberry Pi numbers illustrate why dedicated inference hardware matters for latency-sensitive applications: even heavily quantised INT8 inference on a fast ARM CPU is roughly 5–30x slower than the accelerators in the table above (38ms versus 1.2–8ms).
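To reproduce batch-1 latency figures like these on your own hardware, a small timing harness is usually enough. The sketch below is one reasonable way to measure, not necessarily how the numbers above were obtained: it discards warm-up runs, then reports the median and 95th-percentile latency. `fake_inference` is a stand-in for a real call such as an interpreter or session invocation.

```python
import statistics
import time


def benchmark(run_once, warmup=10, iterations=100):
    """Time run_once() and return median and p95 latency in milliseconds."""
    for _ in range(warmup):
        run_once()  # let caches, clocks, and lazy initialisation settle
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "runs": len(samples_ms),
    }


def fake_inference():
    """Placeholder workload; replace with a real single-image inference call."""
    sum(i * i for i in range(10_000))


stats = benchmark(fake_inference, warmup=3, iterations=20)
print(stats)
```

Reporting a percentile alongside the median matters at the edge, where thermal throttling and background tasks can make tail latency much worse than the typical case.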

The Bottom Line

Start model optimisation with PTQ — it’s fast, the tooling is mature, and the accuracy tradeoff is usually acceptable. Choose hardware based on power budget and model complexity: Coral for sub-2W applications with supported models, Hailo-8 for production vision at scale, Jetson when you need CUDA flexibility. ONNX Runtime is the safe default for most teams; reach for TFLite when you need MCU support or Coral integration.