Understanding Artificial Intelligence Part 3

Model Zoos and Ready-to-Run Models on Vitis AI

Introduction to Part 3

Okay, so you’ve got the basics down in parts 1 and 2. Now let’s talk about model zoos. A model zoo is basically a big collection of ready-to-go AI models that people have already trained and optimized. Instead of building everything from scratch, you grab one that fits your job, tweak it if needed, and run it.

Plenty of other model zoos are out there, or were big in the past.

Intel’s Open Model Zoo packed models optimized for their CPUs and VPUs: detection, pose estimation, action recognition, and so on.

Hugging Face is the main spot now, loaded with transformers for language, vision, audio, you name it.

TensorFlow Hub and PyTorch Hub make it dead simple to pull models right inside those frameworks with one line of code.

Back when Caffe ruled, its Model Zoo had AlexNet, VGG, all the early hits.

ONNX Model Zoo helped move things between frameworks, and STM32 still has one for microcontrollers with tiny vision and audio models. They all save you serious time, especially if they’re tuned to your hardware.

But if you’re targeting AMD® FPGAs or Versal adaptive SoCs, the Vitis AI Model Zoo is built exactly for you. It’s a curated set of about 87 models (in the 3.5 release) that AMD has already quantized, pruned where appropriate, and benchmarked on real DPU hardware. Whether you’re on a ZCU102, ZCU104, KV260, VCK190, a Versal VEK280, or an Alveo card like the U50 or U250, they have versions ready for your board.

The models cover the usual tasks: image classification, object detection, semantic segmentation, depth estimation, 3D medical imaging, and even some 3D point-cloud detection for automotive. Everything comes pre-compiled as .xmodel files for the DPU, with INT8 quantization, so you get the speed without losing much accuracy. And they ship performance numbers: frames per second, latency, GOPs, so you can see right away if it’ll hit your real-time needs on your board.
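The INT8 part is easy to picture in code. Here’s a minimal numpy sketch of symmetric per-tensor quantization, the basic idea behind what the Vitis AI quantizer does (the real tool is far more sophisticated, with per-layer calibration and so on); it shows why rounding to 8 bits costs so little accuracy:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric INT8 quantization: map floats to [-128, 127] with one scale."""
    scale = np.max(np.abs(x)) / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
max_err = np.max(np.abs(x - x_hat))
# worst-case round-trip error is about half a quantization step
assert max_err <= s * 0.5 + 1e-6
```

Each value is off by at most half a quantization step, which for well-behaved activation ranges is noise compared to what a trained network tolerates. That’s the half-a-percent accuracy drop in practice.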

Using them is straightforward. Head over to the Xilinx/Vitis-AI repo on GitHub and clone it. Navigate to the model_zoo folder. There’s a downloader.py script; run it with Python 3. It asks for the framework (pt for PyTorch, tf2 for TensorFlow 2, tf for older TensorFlow), then lists the models. Pick one, choose your target board, and it downloads the right tar.gz file with checksum verification so you know it’s clean.
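That checksum step matters more than it sounds for multi-hundred-megabyte tarballs. Here’s the general idea in plain Python with hashlib, using a throwaway stand-in file (the file contents here are made up for the demo; downloader.py handles all of this for you):

```python
import hashlib
import os
import tempfile

def sha256sum(path, chunk=1 << 20):
    """Hash a file in chunks so large model tarballs don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Throwaway file standing in for a downloaded model tarball.
payload = b"pretend this is a model archive"
with tempfile.NamedTemporaryFile(delete=False, suffix=".tar.gz") as f:
    f.write(payload)
    path = f.name

expected = hashlib.sha256(payload).hexdigest()
assert sha256sum(path) == expected  # a mismatch would mean a corrupted download
os.remove(path)
```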

Unpack the tar.gz, and you’ll find a clean folder: readme.md with exact steps, code/ with test scripts and evaluation tools, data/ for sample datasets, and the model files themselves. For PyTorch models, you get the quantized .pth, quant_info.json, and the ready-to-deploy .xmodel. For TensorFlow, it’s frozen.pb and quantized versions. Some include quantization-aware training results if you want that extra accuracy bump.

From there, plug it into the Vitis AI Library for quick wins. That library gives you C++ and Python APIs so you can stand up a face detector or object tracker in basically no time, no need to write DPU code from scratch. Or if you want full control, use the Vitis AI compiler tools to rebuild it for your exact DPU config. The readme walks you through benchmarking on the board, so you fire it up, feed it a USB camera or test images, and see the fps numbers yourself.
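The benchmarking itself boils down to a timed loop. Here’s a minimal sketch with a `fake_infer` stand-in where a real deployment would call into the Vitis AI Library or the VART runner (the stand-in function and its 1 ms sleep are made up for the demo):

```python
import time

def benchmark(infer, frames, warmup=5):
    """Run `infer` over a list of frames; report average latency and fps."""
    for f in frames[:warmup]:      # warm-up runs, excluded from timing
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return {"latency_ms": 1000 * elapsed / len(frames),
            "fps": len(frames) / elapsed}

# Stand-in for a DPU call; a real deployment would run the model here.
def fake_infer(frame):
    time.sleep(0.001)  # pretend inference takes ~1 ms

stats = benchmark(fake_infer, frames=[None] * 50)
```

The warm-up runs matter on real hardware: the first few inferences pay one-time costs (buffer allocation, caches) that would skew the average.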

Let me walk through a few models so you can picture what this looks like in practice.

Start with ResNet50. It’s the workhorse for image classification. The PyTorch version (pt_resnet50_imagenet_224_224) takes 224 by 224 images and costs around 5-8 GOPs per inference depending on the prune level (they have 30% and 45% pruned variants). Float accuracy sits at 76-79% top-1 on ImageNet; after INT8 quantization, it drops maybe half a percent. On a ZCU102 with a solid DPU, you see single-thread throughput over 1,000 fps, and multi-thread pushes 2,500-3,000. Perfect for factory inspection: labeling parts on a conveyor belt without breaking a sweat.
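Before a classifier like this sees a frame, you normalize it. Here’s a sketch of typical ImageNet-style preprocessing for a 224 by 224 PyTorch model; the exact mean/std values and channel layout for a specific zoo model are spelled out in its readme, so treat these constants as the usual defaults, not gospel:

```python
import numpy as np

# Standard ImageNet normalization constants (assumed typical, check the
# model's readme for the exact values it was trained with).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_u8):
    """HWC uint8 RGB image (already resized to 224x224) -> CHW float32."""
    x = img_u8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    return x.transpose(2, 0, 1)  # HWC -> CHW, as PyTorch-style models expect

frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
inp = preprocess(frame)
assert inp.shape == (3, 224, 224)
```

Getting this step wrong is the classic reason a zoo model "loses accuracy" on your board when the model itself is fine.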

Next, YOLOv5-nano. Lightweight object detection on the COCO dataset. 640 by 640 input, only 4.6 GOPs. Mean average precision lands around 27% after quantization, good enough for most edge jobs. Runs at 650 fps single-thread on the right board. If you need something beefier, there’s YOLOv4 at 416 by 416 with higher accuracy but lower speed.
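Those mean-average-precision numbers rest on IoU matching between predicted and ground-truth boxes, and the metric at the core is tiny. A minimal version:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping by half: intersection 2, union 6 -> 1/3
assert abs(iou((0, 0, 2, 2), (1, 0, 3, 2)) - 1 / 3) < 1e-9
```

The evaluation scripts in each model’s code/ folder wrap this in the full COCO matching protocol, but the overlap test is what decides whether a detection counts.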

HRNet for semantic segmentation shines on Cityscapes road scenes. The big version takes 1024 by 2048 images; yeah, it’s heavy at 378 GOPs, but the accuracy is excellent: 81% mIoU after quantization. On a high-end Versal with multiple DPU cores, you still get a handful of frames per second. Use it for pixel-perfect defect detection on manufactured parts or lane marking in automotive prototypes.
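Segmentation accuracy on Cityscapes is conventionally reported as mean IoU over classes, and the metric itself is simple. A minimal numpy version:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes, the standard segmentation metric."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:  # skip classes absent from both prediction and label
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0, 1, 1]])
pred = np.array([[0, 1, 1, 1]])
# class 0: inter 1, union 2 -> 0.5; class 1: inter 2, union 3 -> 2/3
assert abs(mean_iou(pred, gt, 2) - (0.5 + 2 / 3) / 2) < 1e-9
```

Because it averages over classes rather than pixels, rare classes (traffic lights, riders) weigh as much as road and sky, which is why 81% is a strong number.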

For medical or 3D work, 3D-UNET (pt_3D-UNET_kits19) handles volumetric data like kidney CT scans from the KiTS19 dataset. Input is 128 cubed voxels, a whopping 1,065 GOPs. Dice score stays above 87% quantized. Versal boards with AIE engines eat this up because they handle the 3D convolutions better than older FPGAs. Great if you’re building something for healthcare imaging on an edge device.
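The Dice score quoted there is just the overlap between predicted and ground-truth masks, counted twice in the numerator. A minimal version for binary volumes:

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    """Dice coefficient for binary volumetric masks (e.g. kidney vs background)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum() + eps)

vol_gt = np.zeros((8, 8, 8), dtype=bool)
vol_gt[2:6, 2:6, 2:6] = True          # 64 labeled voxels
vol_pred = np.zeros((8, 8, 8), dtype=bool)
vol_pred[3:6, 2:6, 2:6] = True        # prediction covers 48 of them
# dice = 2*48 / (48 + 64) ~= 0.857
assert abs(dice(vol_pred, vol_gt) - 96 / 112) < 1e-6
```

Dice rewards overlap symmetrically, so an 87%+ score on KiTS19 means the predicted kidney region is nearly congruent with the radiologist’s annotation.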

Then there’s PointPillars (pt_pointpillars) for 3D object detection from lidar point clouds, KITTI dataset, car/pedestrian/cyclist classes. It turns raw points into pillars, runs inference fast enough for ADAS. Quantized and ready for VCK190 or Versal, with benchmarks showing solid latency for real-time driving assistance.
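The "turns raw points into pillars" step is essentially binning: drop each lidar point into a vertical column on an x/y grid. A toy numpy sketch of the idea (the cell size and per-pillar cap here are illustration values, and the real network then runs a small feature net over each pillar before 2D detection):

```python
import numpy as np

def pillarize(points, cell=0.16, max_per_pillar=32):
    """Bin (N, 3) lidar points into vertical pillars on an x/y grid.

    Returns {(ix, iy): array of up to max_per_pillar points}, the sparse
    structure PointPillars feeds to its per-pillar feature network.
    """
    pillars = {}
    for p in points:
        key = (int(p[0] // cell), int(p[1] // cell))
        bucket = pillars.setdefault(key, [])
        if len(bucket) < max_per_pillar:  # real impl samples/pads per pillar
            bucket.append(p)
    return {k: np.stack(v) for k, v in pillars.items()}

pts = np.array([[0.0, 0.0, 1.0], [0.05, 0.05, 1.2], [1.0, 1.0, 0.5]])
grid = pillarize(pts)
assert len(grid) == 2               # first two points share a pillar
assert grid[(0, 0)].shape == (2, 3)
```

Collapsing 3D points into a sparse 2D grid like this is what lets the rest of the network run as ordinary 2D convolutions, which is exactly what a DPU is good at.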

You also get depth estimation models like FADNet (576 by 960 input, 154 GOPs, good for stereo vision), super-resolution ones for cleaning up low-res camera feeds, and even a few NLP models like BERT variants if your project mixes vision with text.

The real win is that you don’t waste months training and optimizing. Grab the model, run the test script on your board, and see the numbers. If the accuracy isn’t quite there for your dataset, the repo includes retraining code, so you fine-tune on your own data, re-quantize with vai_q_pytorch or vai_q_tensorflow, and recompile. Takes hours instead of weeks.

Bottom line, the Vitis AI Model Zoo turns your FPGA or Versal into an AI workhorse without the usual headache. Start there for any vision or 3D project on AMD hardware. It just works, it’s documented, and the performance is real, not marketing fluff. Once you try one, you’ll wonder why you ever started from scratch.