tasks: Managing complex CI testing matrices for research software

How can I efficiently test my research software across multiple compiler versions, library dependencies, and target platforms?

Description

Research software, particularly performance-portable libraries and simulation codes, often needs to support extensive combinations of compilers, library versions, target architectures, and runtime environments. For example, accelerator abstraction libraries like Alpaka require testing across multiple GCC versions, Clang versions, CUDA SDK versions, CMake versions, and Boost versions. A naive approach testing all combinations can create thousands of test jobs, making CI pipelines impractically long and resource-intensive.

Consider this real-world example from accelerator development:

4 GCC compiler versions
6 Clang compiler versions
10 CUDA SDK versions
4 CMake versions
7 Boost library versions

This results in 2,800 potential combinations, requiring approximately ~9.3 hours of compute time at 6 minutes per job, even with 30 parallel runners.

Considerations

Combinatorial explosion: The number of possible combinations grows exponentially with each additional parameter (compilers × architectures × libraries × versions)
Resource constraints: CI runners have limited capacity, and excessive parallelization can monopolize shared infrastructure across multiple projects
Time constraints: Full test matrices can take many hours to complete, creating bottlenecks in development workflows
Hardware diversity: Different combinations may require specific hardware (NVIDIA GPUs, AMD GPUs, ARM processors, PowerPC architectures)
Invalid combinations: Some parameter combinations may be incompatible (e.g., older CUDA versions with newer GCC compilers)
Coverage vs efficiency: Need adequate test coverage without redundant or meaningless test combinations
Maintenance overhead: Large test matrices become difficult to update and debug when new versions are released

Solutions

Pairwise Testing Implementation

Use dynamic child pipelines: Dynamic child pipelines are essential for implementing pairwise testing as they allow programmatic generation of CI configurations at runtime. Leverage CI systems that support programmatically generated pipeline configurations based on computed test matrices, enabling runtime optimization based on available resources.
Implement pairwise testing algorithms: Use mathematical approaches that ensure every combination of two parameter values appears in at least one test job, dramatically reducing total tests while maintaining comprehensive interaction coverage.

Use specialized job generation libraries: Implement dynamic job generators using tools like allpairspy:

from allpairspy import AllPairs

parameters = {
    "host_compiler": ["gcc-8", "gcc-9", "gcc-10", "gcc-11"],
    "device_compiler": ["clang-10", "clang-11", "clang-12", "clang-13", "clang-14", "clang-15"],
    "cuda_sdk": ["cuda-10.2", "cuda-11.0", "cuda-11.1", "cuda-11.2", "cuda-11.3",
                 "cuda-11.4", "cuda-11.5", "cuda-11.6", "cuda-11.7", "cuda-11.8"],
    "cmake": ["cmake-3.18", "cmake-3.19", "cmake-3.20", "cmake-3.21"],
    "boost": ["boost-1.68", "boost-1.70", "boost-1.72", "boost-1.74",
              "boost-1.75", "boost-1.76", "boost-1.78"]
}

keys = list(parameters.keys())
values = [parameters[key] for key in keys]

for i, pairs in enumerate(AllPairs(values)):
    config = dict(zip(keys, pairs))
    print(f"Job {i+1}: {config}")

# This reduces 2800 combinations to ~60-100 jobs
# Each combination of compiler + CUDA version appears at least once

Develop domain-specific combination rules: Create libraries that encode your project’s specific compatibility requirements and testing priorities, such as the Alpaka approach.

Implement exclusion logic: Define rules to automatically exclude known incompatible combinations:

# Example exclusion rules for GPU computing
exclusions:
  - cuda_version: "11.0"
    gcc_version: "gcc-11"  # Incompatible combination
  - architecture: "ppc64le"
    cuda_version: "*"      # CUDA not available on PowerPC

Mathematical Optimization Analysis

Testing Approach	Total Jobs	Estimated Runtime (30 jobs parallel, 6 mins/job)	Coverage Type
Full matrix (naive)	2800	~9.3 hours	100% combinations
Pairwise testing	~60–100	~20–30 minutes	All 2-way interactions
Random sampling	~200	~40 minutes	Statistical coverage

How can I optimize CI pipeline performance while maintaining comprehensive testing?

Description

Even with reduced test matrices, complex research software CI pipelines face performance challenges. Multiple optimization strategies are needed to provide fast developer feedback while maintaining thorough testing coverage across diverse computing environments.

Considerations

Build time bottlenecks: Repeatedly compiling large dependency sets (like HPC libraries, scientific computing frameworks, or large C++ template libraries) wastes significant time
Resource competition: Simultaneous job execution can overwhelm shared CI infrastructure, affecting other projects
Failure feedback delays: Critical bugs may not be detected until late in pipeline execution
Development vs production workflows: Full test suites may be unnecessary during iterative development
Storage and bandwidth: Large scientific computing containers and datasets impact transfer times
Platform-specific testing: Different hardware platforms may have varying performance characteristics

Solutions

Container Optimization Strategies

Implement pre-built container strategies: Create and maintain container images with pre-compiled dependencies. Multi-stage builds allow you to separate dependency installation from application code, producing smaller, faster final images by copying only necessary artifacts from build to runtime:

# Multi-stage build for scientific computing dependencies
FROM nvidia/cuda:11.8-devel as builder
RUN apt-get update && apt-get install -y \
    gcc-10 g++-10 clang-12 \
    cmake libboost-all-dev \
    libomp-dev libfftw3-dev
  
FROM builder as runtime
COPY --from=builder /usr/local /usr/local
# Application-specific layers added dynamically

Deploy container registry optimization: Host container images in the same data center as CI runners to minimize transfer times and bandwidth costs. Use container registries that support layer caching (reusing unchanged layers between builds) and delta compression (only transferring changed parts of images). Check your registry documentation if it advertises support for these features - most modern registries like GitLab Container Registry, Harbor, and AWS ECR support them.
Optimize container layer caching: Structure container builds to maximize reuse of intermediate layers and minimize rebuild times. Group frequently changing components in separate layers from stable dependencies. For best practices, see Docker’s layer caching guide. You can also follow this tutorial for hands-on learning: Docker Layer Caching Tutorial by Earthly.

Wave Scheduling Implementation

Running all jobs simultaneously can overwhelm shared infrastructure and delay results. By grouping jobs into sequential stages (“waves”), critical tests can run earlier and free up resources faster. Hence, wave scheduling prevents resource monopolization by running jobs in stages, allowing other projects to use CI infrastructure between waves while providing early feedback on critical tests.

Use wave scheduling for resource management: Distribute jobs across pipeline stages to periodically release CI resources:

# GitLab CI wave scheduling example
stages:
  - wave1_critical
  - wave2_compatibility  
  - wave3_performance
  - wave4_extended
  
# Critical tests run first for fast feedback
test_core_functionality:
  stage: wave1_critical
  script: [run core unit tests]
  
# Extended testing runs after resources freed
test_gpu_performance:
  stage: wave4_extended
  script: [run performance benchmarks]

Implement intelligent job prioritization: Order jobs to maximize early failure detection:
- Place strict compiler configurations in early waves
- Run compatibility tests with cutting-edge tool versions first
- Schedule resource-intensive performance tests in later stages
Visualize wave scheduling structure:

Wave 1 ─────▶ Fast compile checks, style, small matrix
                   ↓
Wave 2 ─────▶ Medium-sized combinations, functional tests
                   ↓
Wave 3 ─────▶ Full matrix, slowest GPU/HPC tests

This structure helps fail early and frees resources for other users.

Development Workflow Optimization

Enable selective testing during development: Allow developers to run targeted subsets of CI pipeline during development using commit-message-based filtering to avoid running full pipeline for iterative development work. This reduces pipeline load during focused development:

# GitLab CI example - Allow developers to run tests based on commit message tags
rules:
  - if: '$CI_COMMIT_MESSAGE =~ /\[cuda-only\]/'
    variables:
      TEST_FILTER: "cuda"
  - if: '$CI_COMMIT_MESSAGE =~ /\[cpu-only\]/'
    variables:
      TEST_FILTER: "cpu"

# Example usage - commit message to run only CUDA tests
git commit -m "Add CUDA kernel optimization [ci:cuda-only]"

How can I manage the infrastructure complexity required for multi-platform research software testing?

Description

Supporting comprehensive test matrices for research software requires sophisticated CI infrastructure that can handle diverse hardware requirements, manage resources efficiently across multiple projects, and provide reliable service for computationally intensive workloads.

Considerations

Hardware diversity requirements: Research software often targets HPC systems, requiring testing on multiple CPU architectures (x86, ARM, PowerPC), GPU vendors (NVIDIA, AMD), and specialized accelerators
Resource scheduling complexity: Balancing competing demands from multiple research projects while ensuring fair resource allocation
Performance benchmarking: Validating not just correctness but also performance characteristics across different hardware configurations
HPC system integration: Connecting CI pipelines with production HPC environments for realistic performance testing
Cost and sustainability: Managing infrastructure costs while supporting open-source research software development
Reliability at scale: Maintaining consistent performance as research groups add more complex testing requirements

Solutions

Performance Testing Integration

Implement performance regression detection: Integrate performance benchmarking into CI pipelines to catch performance regressions early:

# Example performance testing job
performance_benchmark:
  stage: performance
  script:
    - cmake --build build --target benchmark
    - python benchmark_analysis.py --baseline previous_results.json
    - python performance_regression_check.py
  artifacts:
    reports:
      performance: performance_results.json

Configure performance thresholds: Establish automated performance regression detection with configurable thresholds for different hardware configurations and algorithm implementations.

Comprehensive Testing Strategy Implementation

Monitor and profile pipeline performance: Track job duration, resource usage, and failure patterns to continuously optimize the pipeline structure:

Metric	Target	Monitoring Method
Job Duration	<10 minutes average	Pipeline analytics
Queue Time	<5 minutes	Runner utilization metrics
Failure Rate	<5% for stable configurations	Historical trend analysis
Resource Utilization	70-90% of capacity	Real-time monitoring

These metrics can be obtained from your CI platform’s monitoring dashboard analytics (Gitlab CI/CD Analytics, GitHub Actions insights) or third-party monitoring tools like Prometheus or Grafana with GitLab Runner exporters, or APIs.

Use matrix optimization libraries: Leverage existing tools and libraries for combinatorial testing, such as specialized job matrix libraries developed for performance-portable software testing.

How can I implement this approach for my research software project?

Description

Transitioning from simple CI testing to comprehensive multi-platform testing matrices requires careful planning, tool selection, and gradual implementation to avoid disrupting existing development workflows.

Considerations

Current CI maturity: Existing testing infrastructure and team familiarity with CI/CD concepts
Project complexity: Size of parameter space and critical compatibility requirements
Resource availability: Access to diverse hardware platforms and CI infrastructure budgets
Team expertise: Developer familiarity with containerization, CI configuration, and testing strategies
Integration requirements: Compatibility with existing development tools and workflows

Solutions

Start with parameter identification: Systematically catalog all dimensions that require testing validation:

# Example parameter definition for scientific computing library
testing_parameters = {
    'compilers': ['gcc-9', 'gcc-10', 'gcc-11', 'clang-12', 'clang-13', 'clang-14'],
    'cuda_versions': ['11.0', '11.2', '11.4', '11.6', '11.8', '12.0'],
    'cmake_versions': ['3.18', '3.20', '3.22', '3.24'],
    'boost_versions': ['1.72', '1.75', '1.78', '1.80', '1.82'],
    'architectures': ['x86_64', 'arm64'],
    'build_types': ['Release', 'Debug']
}

Implement gradual migration strategy:
1. Begin with core compatibility testing using pairwise algorithms
2. Add specialized hardware testing incrementally
3. Introduce performance testing for stable configurations
4. Expand to full multi-platform validation
Use established toolchains: Leverage proven solutions from successful research software projects:
- Job matrix generation: Implement using libraries like allpairspy or domain-specific tools
- Container strategies: Base images on established scientific computing containers
- CI integration: Use GitLab dynamic child pipelines or GitHub Actions matrix strategies
Document testing rationale: Maintain clear documentation explaining testing parameter choices and exclusion rules to facilitate maintenance and onboarding.
Consider resource sustainability: Even with optimized matrices, extensive testing may be technically possible, but consumes computational resources and energy. Balance testing thoroughness with environmental impact by running full matrices only when necessary (e.g., before releases) and using smaller subsets for regular development work. Consider tradeoffs between coverage and efficiency when designing your matrix and scheduling jobs.

Acknowledgements

This approach was successfully implemented by the Helmholtz-Zentrum Dresden-Rossendorf for the Alpaka performance-portability library and PIConGPU particle-in-cell simulation code, demonstrating significant reductions in CI resource usage while maintaining comprehensive testing coverage across multiple compilers, accelerator platforms, and HPC architectures.

Training

EVERSE TeSS search results:

Srobona Ghosh, "Managing complex CI testing matrices for research software". everse.software. http://everse.software/RSQKit/ci_testing_matrices .

tasks: Managing complex CI testing matrices for research software

How can I efficiently test my research software across multiple compiler versions, library dependencies, and target platforms?

Description

Considerations

Solutions

Pairwise Testing Implementation

Mathematical Optimization Analysis

How can I optimize CI pipeline performance while maintaining comprehensive testing?

Description

Considerations

Solutions

Container Optimization Strategies

Wave Scheduling Implementation

Development Workflow Optimization

How can I manage the infrastructure complexity required for multi-platform research software testing?

Description

Considerations

Solutions

Performance Testing Integration

Comprehensive Testing Strategy Implementation

How can I implement this approach for my research software project?

Description

Considerations

Solutions

Acknowledgements

Further Reading

Related pages

Training