AMD has updated ROCm (Radeon Open Compute), its Linux software stack for GPU-accelerated computing on Radeon graphics cards.





ROCm 5.5.1 release

What's New in This Release

HIP Enhancements

The ROCm v5.5 release consists of the following HIP enhancements:

Enhanced Stack Size Limit

In this release, the stack size limit is increased from 16K to 131056 bytes (128K - 16).
Applications that need to change the stack size can use the hipDeviceSetLimit API.
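
As a quick check of the arithmetic above, and one way an application might request a larger stack, here is a minimal sketch. The names kMaxStackBytes, clampStackSize and requestStackSize are hypothetical helpers, not part of HIP. Clamping to the ceiling is a choice made in this sketch rather than documented runtime behavior, and the HIP call is only compiled when the file is built with hipcc.

```cpp
#include <cassert>
#include <cstddef>

// Ceiling quoted in the release notes: 128K minus 16 bytes.
constexpr std::size_t kMaxStackBytes = 128 * 1024 - 16;
static_assert(kMaxStackBytes == 131056, "128K - 16 should equal 131056 bytes");

// Hypothetical helper: cap a requested stack size at the documented ceiling.
inline std::size_t clampStackSize(std::size_t requested) {
    assert(requested > 0 && "a zero-byte stack request is not meaningful");
    return requested > kMaxStackBytes ? kMaxStackBytes : requested;
}

#if defined(__HIPCC__)
#include <hip/hip_runtime.h>

// Hypothetical wrapper, compiled only under hipcc: forwards the clamped
// value to the HIP runtime and returns its status code to the caller.
inline hipError_t requestStackSize(std::size_t requested) {
    return hipDeviceSetLimit(hipLimitStackSize, clampStackSize(requested));
}
#endif
```

A caller would typically check the returned hipError_t against hipSuccess before launching kernels that rely on the larger stack.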

hipcc Changes

The following hipcc changes are implemented in this release:

  • hipcc no longer implicitly links to libpthread and librt, as they are no longer a link-time dependency for HIP programs. Applications that depend on these libraries must explicitly link to them.
  • -use-staticlib and -use-sharedlib options are deprecated.

Future Changes

New HIP APIs in This Release

Note

This is a pre-official version (beta) release of the new APIs and may contain unresolved issues.

Memory Management HIP APIs

The new memory management HIP API is as follows:

  • Sets information on the specified pointer [BETA].

    hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attribute, hipDeviceptr_t ptr);

Module Management HIP APIs

The new module management HIP APIs are as follows:

  • Launches kernel f with launch parameters and shared memory on stream with arguments passed to kernelParams, where thread blocks can cooperate and synchronize as they execute.

    hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams);
    
  • Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute.

    hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags);
    

HIP Graph Management APIs

The new HIP Graph Management APIs are as follows:

  • Creates a memory allocation node and adds it to a graph [BETA]

    hipError_t hipGraphAddMemAllocNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, hipMemAllocNodeParams* pNodeParams);
  • Returns parameters for memory allocation node [BETA]

    hipError_t hipGraphMemAllocNodeGetParams(hipGraphNode_t node, hipMemAllocNodeParams* pNodeParams);
  • Creates a memory free node and adds it to a graph [BETA]

    hipError_t hipGraphAddMemFreeNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, void* dev_ptr);
  • Returns parameters for memory free node [BETA].

    hipError_t hipGraphMemFreeNodeGetParams(hipGraphNode_t node, void* dev_ptr);
  • Writes a DOT file describing graph structure [BETA].

    hipError_t hipGraphDebugDotPrint(hipGraph_t graph, const char* path, unsigned int flags);
  • Copies attributes from source node to destination node [BETA].

    hipError_t hipGraphKernelNodeCopyAttributes(hipGraphNode_t hSrc, hipGraphNode_t hDst);
  • Enables or disables the specified node in the given graphExec [BETA]

    hipError_t hipGraphNodeSetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int isEnabled);
  • Queries whether a node in the given graphExec is enabled [BETA]

    hipError_t hipGraphNodeGetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int* isEnabled);

OpenMP Enhancements

This release consists of the following OpenMP enhancements:

  • Additional support for OMPT functions get_device_time and get_record_type.
  • Add support for min/max fast fp atomics on AMD GPUs.
  • Fix the use of the abs function in C device regions.

Deprecations and Warnings

HIP Deprecation

The hipcc and hipconfig Perl scripts are deprecated. In a future release, compiled binaries will be available as hipcc.bin and hipconfig.bin as replacements for the Perl scripts.

Note

There will be a transition period during which both the Perl scripts and the compiled binaries are available, before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterparts. No user action is required. Once the binaries are available, users can optionally switch to hipcc.bin and hipconfig.bin. Eventually, the hipcc/hipconfig soft links will point to the respective compiled binaries by default.

Linux Filesystem Hierarchy Standard for ROCm

ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility.

New Filesystem Hierarchy

The following is the new filesystem hierarchy:

/opt/rocm-<ver>
    | --bin
        | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples
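
The library naming chain in the lib entry above (lib<soname>.so, then .so.major, then .so.major.minor.patch) can be illustrated with a small sketch. sonameChain is a hypothetical helper written for this illustration, and the library name and version numbers used with it are placeholders, not actual ROCm library versions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds the three file names in the chain
//   lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
// In the layout above, each name except the last is a link to the next one.
std::vector<std::string> sonameChain(const std::string& soname,
                                     unsigned verMajor,
                                     unsigned verMinor,
                                     unsigned verPatch) {
    assert(!soname.empty());
    const std::string devLink = "lib" + soname + ".so";
    const std::string abiLink = devLink + "." + std::to_string(verMajor);
    const std::string realFile = abiLink + "." + std::to_string(verMinor) +
                                 "." + std::to_string(verPatch);
    return {devLink, abiLink, realFile};
}
```

For a placeholder library "example" at version 1.2.3, the chain is libexample.so, libexample.so.1, libexample.so.1.2.3. Applications link against the first name, while the last one is the actual file.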
            

Note

ROCm will not support backward compatibility with the v5.1 (old) filesystem hierarchy in its next major release.

For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.

Backward Compatibility with Older Filesystems

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.

Note

ROCm will continue supporting backward compatibility until the next major release.

Wrapper header files

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include), as shown in the example below:

// Code snippet from hip_runtime.h
#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip")
#include "hip/hip_runtime.h"

The wrapper header files’ backward compatibility deprecation is as follows:

  • #pragma message announcing deprecation -- ROCm v5.2 release
  • #pragma message changed to #warning -- Future release
  • #warning changed to #error -- Future release
  • Backward compatibility wrappers removed -- Future release

Library files

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.

Example:

$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root   24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so

CMake Config files

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/ folder.
For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) contain a soft link to the new CMake config.

Example:

$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
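
To see where the relative soft links in the two listings above end up, the following sketch resolves a link target against the directory that holds the link. resolveLink is a hypothetical helper. It works purely on path strings using C++17 std::filesystem and does not touch the disk, so it only shows the intended mapping, not what is installed on a given system.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper: joins the directory containing a soft link with the
// link's relative target, then collapses the ".." components lexically.
std::string resolveLink(const std::string& linkDir, const std::string& target) {
    namespace fs = std::filesystem;
    assert(!linkDir.empty());
    return (fs::path(linkDir) / target).lexically_normal().generic_string();
}
```

With the values from the listings, resolveLink("/opt/rocm/hip/lib", "../../lib/libamdhip64.so") yields /opt/rocm/lib/libamdhip64.so, and the hip-config.cmake link resolves to /opt/rocm/lib/cmake/hip/hip-config.cmake, both under the new hierarchy.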

ROCm Support For Code Object V3 Deprecated

Support for Code Object v3 is deprecated and will be removed in a future release.

Comgr V3.0 Changes

The following APIs and macros have been marked as deprecated. They are expected to be removed in a future ROCm release, coinciding with the release of Comgr v3.0.

API Changes

  • amd_comgr_action_info_set_options()
  • amd_comgr_action_info_get_options()

Actions and Data Types

  • AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES
  • AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN

For replacements, see the AMD_COMGR_ACTION_INFO_GET/SET_OPTION_LIST APIs, and the AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC macros.
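
For code that still passes a single option string, a migration to the option-list API might look like the sketch below. splitOptions, asCStrings and setOptionsFromString are hypothetical helpers, and USE_AMD_COMGR is a hypothetical macro used here only to keep the Comgr call out of builds that lack the library. The sketch assumes the legacy string separates options with whitespace, which will not hold for options that themselves contain spaces, and the header path is an assumption that may differ between ROCm versions.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a legacy single-string option set on whitespace.
std::vector<std::string> splitOptions(const std::string& legacy) {
    std::istringstream in(legacy);
    std::vector<std::string> out;
    for (std::string token; in >> token;) {
        out.push_back(token);
    }
    return out;
}

// Hypothetical helper: borrow C-string pointers for a C API. The pointers
// stay valid only while `opts` is alive and unmodified.
std::vector<const char*> asCStrings(const std::vector<std::string>& opts) {
    std::vector<const char*> ptrs;
    ptrs.reserve(opts.size());
    for (const std::string& opt : opts) {
        ptrs.push_back(opt.c_str());
    }
    assert(ptrs.size() == opts.size());
    return ptrs;
}

#if defined(USE_AMD_COMGR)  // hypothetical macro, define it when building against Comgr
#include <amd_comgr/amd_comgr.h>  // assumed header location

// Hypothetical wrapper: forwards the split options to the option-list API.
amd_comgr_status_t setOptionsFromString(amd_comgr_action_info_t info,
                                        const std::string& legacy) {
    const std::vector<std::string> opts = splitOptions(legacy);
    std::vector<const char*> ptrs = asCStrings(opts);
    return amd_comgr_action_info_set_option_list(info, ptrs.data(), ptrs.size());
}
#endif
```

Keeping the std::string vector alive for the duration of the call is the main design point: the pointer array only borrows the strings' storage.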

Deprecated Environment Variables

The following environment variables are removed in this ROCm release:

  • GPU_MAX_COMMAND_QUEUES
  • GPU_MAX_WORKGROUP_SIZE_2D_X
  • GPU_MAX_WORKGROUP_SIZE_2D_Y
  • GPU_MAX_WORKGROUP_SIZE_3D_X
  • GPU_MAX_WORKGROUP_SIZE_3D_Y
  • GPU_MAX_WORKGROUP_SIZE_3D_Z
  • GPU_BLIT_ENGINE_TYPE
  • GPU_USE_SYNC_OBJECTS
  • AMD_OCL_SC_LIB
  • AMD_OCL_ENABLE_MESSAGE_BOX
  • GPU_FORCE_64BIT_PTR
  • GPU_FORCE_OCL20_32BIT
  • GPU_RAW_TIMESTAMP
  • GPU_SELECT_COMPUTE_RINGS_ID
  • GPU_USE_SINGLE_SCRATCH
  • GPU_ENABLE_LARGE_ALLOCATION
  • HSA_LOCAL_MEMORY_ENABLE
  • HSA_ENABLE_COARSE_GRAIN_SVM
  • GPU_IFH_MODE
  • OCL_SYSMEM_REQUIREMENT
  • OCL_CODE_CACHE_ENABLE
  • OCL_CODE_CACHE_RESET

Known Issues In This Release

The following are the known issues in this release.

DISTRIBUTED/TEST_DISTRIBUTED_SPAWN Fails

When user applications call ncclCommAbort to destruct communicators and then create new
communicators repeatedly, subsequent communicators may fail to initialize.

This issue is under investigation and will be resolved in a future release.

Failures In HIP Directed Tests

Multiple HIP directed tests fail.

Library Changes in ROCm 5.5.0

Library       Version
hipBLAS       0.53.0 ⇒ 0.54.0
hipCUB        2.13.0 ⇒ 2.13.1
hipFFT        1.0.10 ⇒ 1.0.11
hipSOLVER     1.6.0 ⇒ 1.7.0
hipSPARSE     2.3.3 ⇒ 2.3.5
rccl          2.13.4 ⇒ 2.15.5
rocALUTION    2.1.3 ⇒ 2.1.8
rocBLAS       2.46.0 ⇒ 2.47.0
rocFFT        1.0.21 ⇒ 1.0.22
rocPRIM       2.12.0 ⇒ 2.13.0
rocRAND       2.10.16 ⇒ 2.10.17
rocSOLVER     3.20.0 ⇒ 3.21.0
rocSPARSE     2.4.0 ⇒ 2.5.1
rocThrust     2.17.0
rocWMMA       0.9 ⇒ 1.0
Tensile       4.35.0 ⇒ 4.36.0

hipBLAS 0.54.0

hipBLAS 0.54.0 for ROCm 5.5.0

Added

  • added option to opt in to using __half for the hipblasHalf type in the API for C++ users who define HIPBLAS_USE_HIP_HALF
  • added scripts to plot performance for multiple functions
  • data driven hipblas-bench and hipblas-test execution via external yaml format data files
  • client smoke test added for quick validation using command hipblas-test --yaml hipblas_smoke.yaml

Fixed

  • fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes
  • fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0
  • fixed getrs to return successfully when given nullptrs with corresponding size = 0
  • fixed getrs to give info = -1 when transpose is not an expected type
  • fixed gels to return successfully when given nullptrs with corresponding size = 0
  • fixed gels to give info = -1 when transpose is not in ('N', 'T') for real cases or not in ('N', 'C') for complex cases

Changed

  • changed reference code for Windows to OpenBLAS
  • hipblas client executables all now begin with hipblas- prefix

hipCUB 2.13.1

hipCUB 2.13.1 for ROCm 5.5.0

Added

  • Benchmarks for BlockShuffle, BlockLoad, and BlockStore.

Changed

  • CUB backend references CUB and Thrust version 1.17.2.
  • Improved benchmark coverage of BlockScan by adding ExclusiveScan, benchmark coverage of BlockRadixSort by adding SortBlockedToStriped, and benchmark coverage of WarpScan by adding Broadcast.

Fixed

  • Windows HIP SDK support

Known Issues

  • BlockRadixRankMatch is currently broken under the rocPRIM backend.
  • BlockRadixRankMatch with a warp size that does not exactly divide the block size is broken under the CUB backend.

hipFFT 1.0.11

hipFFT 1.0.11 for ROCm 5.5.0

Fixed

  • Fixed old version rocm include/lib folders not removed on upgrade.

hipSOLVER 1.7.0

hipSOLVER 1.7.0 for ROCm 5.5.0

Added

  • Added functions
    • gesvdj
      • hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize
      • hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj
    • gesvdjBatched
      • hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize
      • hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched

hipSPARSE 2.3.5

hipSPARSE 2.3.5 for ROCm 5.5.0

Improved

  • Fixed an issue where the rocm folder was not removed on upgrade of meta packages
  • Fixed a compilation issue with cusparse backend
  • Added more detailed messages on unit test failures due to missing input data
  • Improved documentation
  • Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis)

rccl 2.15.5

RCCL 2.15.5 for ROCm 5.5.0

Changed

  • Compatibility with NCCL 2.15.5
  • Unit test executable renamed to rccl-UnitTests

Added

  • HW-topology aware binary tree implementation
  • Experimental support for MSCCL
  • New unit tests for hipGraph support
  • NPKit integration

Fixed

  • rocm-smi ID conversion
  • Support for HIP_VISIBLE_DEVICES for unit tests
  • Support for p2p transfers to non (HIP) visible devices

Removed

rocALUTION 2.1.8

rocALUTION 2.1.8 for ROCm 5.5.0

Added

  • Added build support for Navi32

Improved

  • Fixed a typo in MPI backend
  • Fixed a bug with the backend when HIP support is disabled
  • Fixed a bug in SAAMG hierarchy building on HIP backend
  • Improved SAAMG hierarchy build performance on HIP backend

Changed

  • LocalVector::GetIndexValues(ValueType*) is deprecated, use LocalVector::GetIndexValues(const LocalVector&, LocalVector*) instead
  • LocalVector::SetIndexValues(const ValueType*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead
  • LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*) instead
  • LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix*) instead
  • LocalMatrix::RugeStueben() is deprecated
  • LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, int) instead
  • LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*) instead

rocBLAS 2.47.0

rocBLAS 2.47.0 for ROCm 5.5.0

Added

  • added functionality rocblas_geam_ex for matrix-matrix minimum operations
  • added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3 (pointer mode host) functions
  • added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API
  • added support for vector initialization in the rocBLAS test framework with negative increments
  • added Windows build documentation for forthcoming support using ROCm HIP SDK
  • added scripts to plot performance for multiple functions

Optimizations

  • improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU.
  • improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU.
  • improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs.

Fixed

  • fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench
  • fixed deprecated API compatibility with Visual Studio compiler
  • fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory

Changed

  • install.sh internally runs rmake.py (also used on Windows) and rmake.py may be used directly by developers on Linux (use --help)
  • rocblas client executables all now begin with rocblas- prefix

Removed

  • install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default

rocFFT 1.0.22

rocFFT 1.0.22 for ROCm 5.5.0

Optimizations

  • Improved performance of 1D lengths < 2048 that use Bluestein's algorithm.
  • Reduced time for generating code during plan creation.
  • Optimized 3D R2C/C2R lengths 32, 84, 128.
  • Optimized batched small 1D R2C/C2R cases.

Added

  • Added gfx1101 to default AMDGPU_TARGETS.

Changed

  • Moved client programs to C++17.
  • Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled.
  • Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache.

Fixed

  • Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged.
  • Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present.

rocPRIM 2.13.0

rocPRIM 2.13.0 for ROCm 5.5.0

Added

  • New block level radix_rank primitive.
  • New block level radix_rank_match primitive.

Changed

  • Improved the performance of block_radix_sort and device_radix_sort.

Known Issues

  • Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.

Fixed

  • Fixed benchmark build on Windows

rocRAND 2.10.17

rocRAND 2.10.17 for ROCm 5.5.0

Added

  • MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
  • New benchmark for the device API using Google Benchmark, benchmark_rocrand_device_api, replacing benchmark_rocrand_kernel. benchmark_rocrand_kernel is deprecated and will be removed in a future version. Likewise, benchmark_curand_host_api is added to replace benchmark_curand_generate and benchmark_curand_device_api is added to replace benchmark_curand_kernel.
  • experimental HIP-CPU feature
  • ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".

Changed

  • Python 2.7 is no longer officially supported.

Fixed

  • Windows HIP SDK support

rocSOLVER 3.21.0

rocSOLVER 3.21.0 for ROCm 5.5.0

Added

  • SVD for general matrices using Jacobi algorithm:
    • GESVDJ (with batched and strided_batched versions)
  • LU factorization without pivoting for block tridiagonal matrices:
    • GEBLTTRF_NPVT (with batched and strided_batched versions)
  • Linear system solver without pivoting for block tridiagonal matrices:
    • GEBLTTRS_NPVT (with batched and strided_batched versions)
  • Product of triangular matrices
    • LAUUM
  • Added experimental hipGraph support for rocSOLVER functions

Optimized

  • Improved the performance of SYEVJ/HEEVJ.

Changed

  • STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use fully implemented Divide and Conquer approach.

Fixed

  • SYEVJ/HEEVJ should now be invariant under matrix scaling.
  • SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed.
  • Fixed GETF2_NPVT and GETRF_NPVT input data initialization in tests and benchmarks.
  • Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages.

rocSPARSE 2.5.1

rocSPARSE 2.5.1 for ROCm 5.5.0

Added

  • Added bsrgemm and spgemm for BSR format
  • Added bsrgeam
  • Added build support for Navi32
  • Added experimental hipGraph support for some rocSPARSE routines
  • Added csritsv, spitsv csr iterative triangular solve
  • Added mixed precisions for SpMV
  • Added batched SpMM for transpose A in COO format with atomic algorithm

Improved

  • Optimization to csr2bsr
  • Optimization to csr2csr_compress
  • Optimization to csr2coo
  • Optimization to gebsr2csr
  • Optimization to csr2gebsr
  • Fixes to documentation
  • Fixes a bug in COO SpMV gridsize
  • Fixes a bug in SpMM gridsize when using very large matrices

Known Issues

  • In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split.

rocWMMA 1.0

rocWMMA 1.0 for ROCm 5.5.0

Added

  • Added support for wave32 on gfx11+
  • Added infrastructure changes to support hipRTC
  • Added performance tracking system

Changed

  • Modified the assignment of hardware information
  • Modified the data access for unsigned datatypes
  • Added library config to support multiple architectures

Tensile 4.36.0

Tensile 4.36.0 for ROCm 5.5.0

Added

  • Add functions for user-driven tuning
  • Add GFX11 support: HostLibraryTests yamls, rearrange FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimization, separate vscnt and vmcnt, dual mac
  • Add binary search for Grid-Based algorithm
  • Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2)
  • Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4)
  • Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1)
  • Add GSU SingleBuffer algorithm for HSS/BSS
  • Add gfx900:xnack-, gfx1032, gfx1034, gfx1035
  • Enable gfx1031 support

Optimizations

  • Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1
  • Improve InitAccVgprOpt

Changed

  • Use global_atomic for GSU instead of flat and global_store for debug code
  • Replace flat_load/store with global_load/store
  • Use global_load/store for BufferLoad/Store=0 and enable scheduling
  • LocalSplitU support for HGEMM+HPA when MFMA disabled
  • Update Code Object Version
  • Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss
  • Update asm cap cache arguments
  • Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead
  • Change checks, error messages, assembly syntax, and coverage for DirectToLds
  • Remove unused cmake file
  • Clean up the LLVM dependency code
  • Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2
  • Update sgemm/hgemm test cases for DirectToLds and ThreadSeparateGlobalRead

Fixed

  • Add build-id to header of compiled source kernels
  • Fix solution index collisions
  • Fix h beta vectorwidth4 correctness issue for WMMA
  • Fix an error with BufferStore=0
  • Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2)
  • Fix MoveMIoutToArch bug
  • Fix flat load correctness issue on I8 and flat store correctness issue
  • Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes
  • Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop
  • Fix issues with DirectToVgpr + ScheduleIterAlg<3
  • Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2
  • Fix mismatch issue with PrefetchGlobalRead=2
  • Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size
  • Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1
  • Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case
  • Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical
  • Fix for failing CI tests due to CpuThreads=0
  • Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2
  • Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only)
  • Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only)


Release ROCm 5.5.1 release · RadeonOpenCompute/ROCm