Performance Benchmarking on Qualcomm Platforms

This guide covers techniques for measuring, analyzing, and optimizing the performance of vision applications on Qualcomm platforms.

Key Performance Metrics

When benchmarking vision applications, consider these critical metrics:

Metric	Description	Importance
Inference Time	Time to process one frame or input	Sets per-frame latency and caps achievable throughput
Initialization Time	Time to load and prepare models/resources	Impacts startup experience
Memory Usage	Peak and average memory consumption	Affects system stability and multitasking
Power Consumption	Average power draw (W) and energy used per inference (J)	Critical for battery-powered devices
Thermal Impact	Temperature increase during sustained operation	Affects long-term performance stability

Benchmarking Tools

1. Snapdragon Profiler

The primary tool for detailed performance analysis on Qualcomm hardware:

# Launch Snapdragon Profiler
snapdragon-profiler

Key capabilities:

CPU/GPU/DSP utilization monitoring
Memory usage tracking
Power consumption measurement
Kernel trace collection
System metrics visualization

2. SNPE Benchmarking

For neural network performance specifically:

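# Run the model with profiling enabled. The CPU runtime is the default;
# add --use_gpu or --use_dsp to target another runtime, and repeat the run
# for each runtime you want to compare. Flag names can differ between SDK
# releases, so confirm them with snpe-net-run --help.
snpe-net-run --container model.dlc \
             --input_list input_list.txt \
             --output_dir results \
             --perf_profile high_performance \
             --profiling_level detailed

# Summarize the diagnostic log written to the output directory
# and save the timing data as CSV
snpe-diagview --input_log results/SNPEDiag*.log \
              --output performance_report.csv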

3. adb-based Benchmarking

For quick measurements on Android-based platforms:

# Monitor system-wide CPU usage
adb shell top -m 10 -n 1

# Track memory usage of a specific process
adb shell dumpsys meminfo com.example.visionapp

# Measure battery consumption
adb shell dumpsys batterystats

4. Custom Instrumentation

Add timing points in your code for precise measurements:

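#include <chrono>
#include <cstdio>

void runVisionAlgorithm();  // Defined elsewhere: the operation under test

void measurePerformance() {
    // steady_clock is monotonic, so wall-clock adjustments cannot skew the interval
    auto start = std::chrono::steady_clock::now();

    // Operation to measure
    runVisionAlgorithm();

    auto end = std::chrono::steady_clock::now();
    // Keep fractional milliseconds instead of truncating to an integer
    double ms = std::chrono::duration<double, std::milli>(end - start).count();

    std::printf("Execution time: %.3f ms\n", ms);
}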
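When many regions need timing, repeating the start/end boilerplate becomes error-prone. One possible approach is a small RAII timer that reports when it goes out of scope. The sketch below is illustrative only: `ScopedTimer`, `busyWork`, and `timeBusyWork` are hypothetical names, not part of any Qualcomm SDK, and `busyWork` merely stands in for a real vision kernel.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical helper: measures the lifetime of the enclosing scope.
class ScopedTimer {
public:
    // `label` names the region; `outMs`, if non-null, receives the elapsed milliseconds.
    explicit ScopedTimer(std::string label, double* outMs = nullptr)
        : label_(std::move(label)), outMs_(outMs),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto end = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(end - start_).count();
        assert(ms >= 0.0);  // steady_clock never runs backwards
        if (outMs_ != nullptr) *outMs_ = ms;
        std::printf("%s: %.3f ms\n", label_.c_str(), ms);
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::string label_;
    double* outMs_;
    std::chrono::steady_clock::time_point start_;
};

// Placeholder workload: sum of squares of 0..n-1.
inline long long busyWork(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += static_cast<long long>(i) * i;
    return sum;
}

// Times one call to busyWork() and returns the elapsed milliseconds.
inline double timeBusyWork(int n) {
    double elapsedMs = -1.0;
    {
        ScopedTimer timer("busyWork", &elapsedMs);
        volatile long long sink = busyWork(n);  // volatile store keeps the result observable
        (void)sink;
    }  // timer is destroyed here and writes elapsedMs
    return elapsedMs;
}
```

Because the measurement happens in the destructor, early returns and exceptions inside the timed scope are still reported.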

Benchmark Methodology

1. Establish Baselines

Create baseline measurements under controlled conditions:

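#include <chrono>
#include <cstdio>
#include <vector>

// Example of establishing performance baselines.
// processFrame() and the calculate* helpers are assumed to be defined elsewhere.
void runBaselines() {
    // Warm up the system (caches, clock governors, lazy initialization)
    for (int i = 0; i < 10; i++) {
        processFrame();
    }
    
    // Measure steady state performance
    std::vector<double> timings;
    timings.reserve(100);  // avoid reallocations inside the timed loop
    for (int i = 0; i < 100; i++) {
        auto start = std::chrono::steady_clock::now();  // monotonic clock
        processFrame();
        auto end = std::chrono::steady_clock::now();
        
        double ms = std::chrono::duration<double, std::milli>(end - start).count();
        timings.push_back(ms);
    }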
    
    // Calculate statistics
    double avg = calculateAverage(timings);
    double stddev = calculateStdDev(timings);
    double p95 = calculatePercentile(timings, 95);
    
    printf("Average: %.2f ms, StdDev: %.2f ms, P95: %.2f ms\n", 
           avg, stddev, p95);
}
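The baseline example calls `calculateAverage`, `calculateStdDev`, and `calculatePercentile` without defining them. A minimal sketch of these helpers follows. It assumes the sample standard deviation (dividing by n - 1) and the nearest-rank percentile definition; other conventions, such as linear interpolation between ranks, are equally valid and give slightly different values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of the samples.
double calculateAverage(const std::vector<double>& v) {
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

// Sample standard deviation (divides by n - 1); returns 0 for a single sample.
double calculateStdDev(const std::vector<double>& v) {
    assert(!v.empty());
    if (v.size() < 2) return 0.0;
    const double mean = calculateAverage(v);
    double sumSq = 0.0;
    for (double x : v) sumSq += (x - mean) * (x - mean);
    return std::sqrt(sumSq / static_cast<double>(v.size() - 1));
}

// Nearest-rank percentile: the smallest sample such that at least `pct` percent
// of the samples are less than or equal to it. Takes a copy so it can sort.
double calculatePercentile(std::vector<double> v, double pct) {
    assert(!v.empty() && pct > 0.0 && pct <= 100.0);
    std::sort(v.begin(), v.end());
    const auto rank = static_cast<std::size_t>(
        std::ceil(pct * static_cast<double>(v.size()) / 100.0));
    return v[std::max<std::size_t>(rank, 1) - 1];
}
```

Reporting P95 alongside the average matters because a few slow frames can cause visible stutter while barely moving the mean.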

2. Test Under Varied Conditions

Measure performance under different scenarios:

Various input sizes and complexities
Different thermal conditions (cold start vs. warm device)
Background application load
Battery levels (full vs. low)
Different device SKUs/variants
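The input-size scenario from the list above can be automated with a small sweep harness. The following is a sketch under stated assumptions: `SweepResult` and `sweepInputSizes` are hypothetical names, the workload is passed in as a callable, and the median is used as the summary because it is less sensitive to outliers than the mean.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

struct SweepResult {
    std::size_t inputSize;  // e.g. number of pixels in the test frame
    double medianMs;        // median latency over the timed runs
};

// For every size: `warmup` untimed calls, then `runs` timed calls of work(size).
std::vector<SweepResult> sweepInputSizes(const std::vector<std::size_t>& sizes,
                                         const std::function<void(std::size_t)>& work,
                                         int warmup = 3, int runs = 15) {
    assert(warmup >= 0 && runs > 0);
    std::vector<SweepResult> results;
    for (std::size_t size : sizes) {
        for (int i = 0; i < warmup; ++i) work(size);

        std::vector<double> ms;
        ms.reserve(static_cast<std::size_t>(runs));
        for (int i = 0; i < runs; ++i) {
            const auto t0 = std::chrono::steady_clock::now();
            work(size);
            const auto t1 = std::chrono::steady_clock::now();
            ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
        }

        // Partially sort so the middle element is the median (upper median for even counts).
        const auto mid = ms.begin() + static_cast<std::ptrdiff_t>(ms.size() / 2);
        std::nth_element(ms.begin(), mid, ms.end());
        results.push_back({size, *mid});
    }
    return results;
}
```

The other conditions in the list (thermal state, background load, battery level, device variant) cannot be set from inside the process, so they are best recorded alongside each sweep rather than automated here.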

3. Profile System Resources

Monitor resource utilization during benchmarking:

#!/bin/bash
# Example bash script for resource monitoring

# Start the application
adb shell am start -n com.example.visionapp/.MainActivity

# Take 60 samples roughly one second apart (adb round trips add to the interval)
for i in {1..60}; do
    # Cumulative CPU counters; diff consecutive snapshots to get utilization
    adb shell cat /proc/stat > "cpu_$i.txt"
    
    # Memory usage
    adb shell dumpsys meminfo com.example.visionapp > "mem_$i.txt"
    
    # Thermal zone readings (usually millidegrees Celsius). The quotes make the
    # wildcard expand on the device rather than on the host.
    adb shell 'cat /sys/devices/virtual/thermal/thermal_zone*/temp' > "thermal_$i.txt"
    
    sleep 1
done
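The `/proc/stat` snapshots hold cumulative counters, so utilization has to be derived from the difference between two of them. The sketch below shows one way to do that for the aggregate `cpu` line. `CpuTimes`, `parseCpuLine`, and `cpuUsagePercent` are hypothetical names, and the code assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal), counting idle and iowait as idle time.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

struct CpuTimes {
    std::uint64_t idle = 0;   // idle + iowait ticks
    std::uint64_t total = 0;  // sum of the first eight fields
};

// Parses the aggregate "cpu ..." line, which is the first line of /proc/stat.
// Guest fields are skipped because they are already included in user and nice.
CpuTimes parseCpuLine(const std::string& line) {
    std::istringstream in(line);
    std::string tag;
    in >> tag;
    assert(tag == "cpu");

    CpuTimes t;
    std::uint64_t value = 0;
    for (int field = 0; field < 8 && (in >> value); ++field) {
        t.total += value;
        if (field == 3 || field == 4) t.idle += value;  // idle, iowait
    }
    return t;
}

// Percentage of non-idle CPU time between two snapshots.
double cpuUsagePercent(const CpuTimes& before, const CpuTimes& after) {
    const std::uint64_t totalDelta = after.total - before.total;
    if (totalDelta == 0) return 0.0;
    const std::uint64_t idleDelta = after.idle - before.idle;
    return 100.0 * static_cast<double>(totalDelta - idleDelta) /
           static_cast<double>(totalDelta);
}
```

Feeding it the first line of `cpu_1.txt` and `cpu_2.txt` gives the system-wide utilization over that sampling interval. The per-core lines (`cpu0`, `cpu1`, and so on) use the same field layout but would need the tag check relaxed.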

Performance Analysis

1. Identifying Bottlenecks

Common bottlenecks in vision applications:

CPU-bound: High CPU utilization while the GPU and DSP sit mostly idle
Memory-bound: High memory traffic and frequent cache misses
I/O-bound: Stalls while waiting on camera input or storage
Thermal throttling: Performance degrades over time as the device heats up under sustained load
Power management: Frequency scaling lowers clock speeds to save power
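Thermal throttling and frequency scaling both show up as latency that drifts upward during a sustained run. A rough way to check for this is to compare the start and end of a per-frame timing series, as sketched below. `latencyDriftRatio` and `looksThrottled` are hypothetical helpers, and the 15% tolerance is an arbitrary assumption that should be tuned to the normal run-to-run variation of the workload.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Ratio of mean latency over the last `window` samples to the first `window` samples.
// 1.0 means no drift; 1.25 means the run ended 25% slower than it started.
double latencyDriftRatio(const std::vector<double>& frameMs, std::size_t window) {
    assert(window > 0 && frameMs.size() >= 2 * window);
    const auto w = static_cast<std::ptrdiff_t>(window);
    const double head = std::accumulate(frameMs.begin(), frameMs.begin() + w, 0.0);
    const double tail = std::accumulate(frameMs.end() - w, frameMs.end(), 0.0);
    assert(head > 0.0);
    return tail / head;  // equal window sizes, so the ratio of sums equals the ratio of means
}

// Flags a run whose latency grew by more than `tolerance` (0.15 means 15%).
bool looksThrottled(const std::vector<double>& frameMs, std::size_t window,
                    double tolerance = 0.15) {
    return latencyDriftRatio(frameMs, window) > 1.0 + tolerance;
}
```

A positive result only indicates drift. Confirming the cause still requires correlating it with the thermal zone readings or clock frequencies captured during the same run.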

2. Case Study: Frame Rate Optimization

Configuration	Frame Rate	CPU Usage	GPU Usage	Power
CPU Only	15 FPS	90%	5%	1.8W
GPU Offload	24 FPS	30%	65%	2.1W
DSP Offload	20 FPS	15%	10%	1.2W
Hybrid (GPU+DSP)	30 FPS	20%	40%	1.6W
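Frame rate and power can be combined into energy per frame, which is often the more useful figure for battery-powered devices: power in watts divided by frames per second gives joules per frame. The sketch below applies this to the table rows. The function names are illustrative, and the inputs are copied directly from the table.

```cpp
#include <cassert>
#include <cstdio>

// Energy per frame in millijoules: W / (frames/s) = J/frame, times 1000.
double energyPerFrameMilliJ(double powerWatts, double fps) {
    assert(powerWatts >= 0.0 && fps > 0.0);
    return 1000.0 * powerWatts / fps;
}

// Prints energy per frame for each configuration in the case study table.
void printCaseStudyEnergy() {
    struct Row { const char* name; double fps; double watts; };
    const Row rows[] = {
        {"CPU Only",         15.0, 1.8},
        {"GPU Offload",      24.0, 2.1},
        {"DSP Offload",      20.0, 1.2},
        {"Hybrid (GPU+DSP)", 30.0, 1.6},
    };
    for (const Row& r : rows) {
        std::printf("%-18s %6.1f mJ/frame\n", r.name, energyPerFrameMilliJ(r.watts, r.fps));
    }
}
```

This works out to 120.0, 87.5, 60.0, and about 53.3 mJ per frame respectively. On these numbers, GPU offload draws the most power yet still costs less energy per frame than the CPU-only configuration, and the hybrid configuration is both the fastest and the most energy-efficient per frame.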

3. Optimization Cycle

Follow this iterative process for performance optimization:

Measure: Establish baseline performance
Profile: Identify bottlenecks
Optimize: Make targeted improvements
Validate: Confirm impact of changes
Repeat: Continue until goals are met
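For the Validate step, it helps to check that a measured improvement is larger than the run-to-run noise before accepting it. The sketch below is a rough heuristic, not a formal statistical test: it treats a change as real when the difference in means exceeds `k` times the combined standard error. `RunSummary`, `speedup`, and `improvementExceedsNoise` are hypothetical names, and `k = 2.0` is an assumed default.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Summary statistics of one benchmark run (latencies in milliseconds).
struct RunSummary {
    double meanMs;
    double stddevMs;
    std::size_t samples;
};

// Values above 1.0 mean the candidate is faster than the baseline.
double speedup(const RunSummary& baseline, const RunSummary& candidate) {
    assert(candidate.meanMs > 0.0);
    return baseline.meanMs / candidate.meanMs;
}

// True when the candidate's mean is lower than the baseline's by more than
// k combined standard errors.
bool improvementExceedsNoise(const RunSummary& baseline, const RunSummary& candidate,
                             double k = 2.0) {
    assert(baseline.samples > 0 && candidate.samples > 0);
    const double standardError = std::sqrt(
        baseline.stddevMs * baseline.stddevMs / static_cast<double>(baseline.samples) +
        candidate.stddevMs * candidate.stddevMs / static_cast<double>(candidate.samples));
    return (baseline.meanMs - candidate.meanMs) > k * standardError;
}
```

Both runs should be collected under the same conditions (device, thermal state, background load), otherwise the comparison reflects the environment rather than the code change.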

Advanced Optimization Techniques

1. Workload Balancing

Distribute work across heterogeneous processors:

// Example: Workload balancing across processors.
// The *OnDSP/*OnGPU/*OnCPU/*OnNPU helpers and the result types are application-defined placeholders.
std::vector<Classification> balancedProcessing(const cv::Mat& frame) {
    // Pre-processing on DSP (most efficient for image manipulation)
    cv::Mat preprocessed = preprocessOnDSP(frame);
    
    // Feature extraction on GPU (parallel-friendly)
    std::vector<cv::KeyPoint> features = extractFeaturesOnGPU(preprocessed);
    
    // Complex decision logic on CPU (more suitable for branching code)
    std::vector<DetectionResult> results = analyzeOnCPU(features);
    
    // Neural network inference on NPU/AIP (specialized accelerator);
    // return the result so the caller can use it
    return classifyOnNPU(results);
}

2. Pipeline Parallelism

Process multiple frames simultaneously:

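// Example: Pipeline parallelization
// Requires <atomic> and <thread>; ThreadSafeQueue and the frame types are application-defined.
class VisionPipeline {
private:
    ThreadSafeQueue<Frame> captureQueue;
    ThreadSafeQueue<ProcessedFrame> preprocessQueue;
    ThreadSafeQueue<AnalyzedFrame> analysisQueue;
    
    std::atomic<bool> running{false};
    
    std::thread captureThread;
    std::thread preprocessThread;
    std::thread analysisThread;
    std::thread outputThread;
    
public:
    void start() {
        running = true;
        captureThread = std::thread(&VisionPipeline::captureLoop, this);
        preprocessThread = std::thread(&VisionPipeline::preprocessLoop, this);
        analysisThread = std::thread(&VisionPipeline::analysisLoop, this);
        outputThread = std::thread(&VisionPipeline::outputLoop, this);
    }
    
    // Signal the loops to exit, then join every thread. Any loop blocked on a
    // queue must also be woken up, or join() will wait forever.
    void stop() {
        running = false;
        for (std::thread* t : {&captureThread, &preprocessThread, &analysisThread, &outputThread}) {
            if (t->joinable()) t->join();
        }
    }
    
    // Destroying a joinable std::thread calls std::terminate, so always stop first
    ~VisionPipeline() { stop(); }
    
    // Thread function implementations: each loop should run while `running` is true
    void captureLoop() { /* Camera capture logic */ }
    void preprocessLoop() { /* Pre-processing logic */ }
    void analysisLoop() { /* Analysis/inference logic */ }
    void outputLoop() { /* Result handling logic */ }
};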
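`VisionPipeline` relies on a `ThreadSafeQueue` type that the example does not define. One possible shape for it is a bounded blocking queue built on a mutex and two condition variables, sketched below. The capacity default, the `close()` method, and the use of `std::optional` (which requires C++17) are assumptions made for this illustration, not part of the original design.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

template <typename T>
class ThreadSafeQueue {
public:
    // A small capacity applies back-pressure so a fast stage cannot run far ahead.
    explicit ThreadSafeQueue(std::size_t capacity = 4) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks while the queue is full. Returns false if the queue has been closed.
    bool push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return closed_ || items_.size() < capacity_; });
        if (closed_) return false;
        items_.push_back(std::move(item));
        notEmpty_.notify_one();
        return true;
    }

    // Blocks while the queue is empty. Returns std::nullopt once closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop_front();
        notFull_.notify_one();
        return item;
    }

    // Wakes all waiters. Later push() calls fail; pop() still drains queued items.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        notEmpty_.notify_all();
        notFull_.notify_all();
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable notEmpty_;
    std::condition_variable notFull_;
    std::deque<T> items_;
    bool closed_ = false;
};
```

Bounding the queue keeps memory use and end-to-end latency predictable when one stage is slower than the others. Closing each queue during shutdown lets stage loops that are blocked in `pop()` or `push()` return so their threads can be joined.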

Reporting and Documentation

Document performance characteristics for traceability:

# Performance Report Template

## Test Environment
- Device: Qualcomm Reference Board XYZ
- SoC: Snapdragon Series XXX
- OS Version: Android 12
- Test App Version: 1.2.3
- Ambient Temperature: 23°C

## Performance Metrics
- Average Inference Time: 45ms
- P95 Inference Time: 52ms
- Initialization Time: 230ms
- Peak Memory Usage: 120MB
- Average Power Draw: 1.4W

## Recommendations
1. Optimize model with 8-bit quantization
2. Enable FastCV acceleration for pre-processing
3. Implement frame skipping under thermal pressure
