LLVM-IR Dataset Utilities Documentation

LLVM-IR Dataset Utilities is a set of utilities for the construction of large LLVM IR-based datasets from multiple sources for the development of LLVM-focussed machine learning approaches. It is specifically designed to build corpora of bitcode out of language package indices. Built versions of the dataset are available from the LLVM-ML HuggingFace Organization.

Features

Scalability

Build infrastructure that scales with Ray to compile thousands of code bases rapidly across entire CPU clusters.

Vast Builder Support

Extensive support for building from a variety of sources including C, C++, Rust, Swift, Julia, and more.

Statistical Introspection

Enables cross-language statistical analysis across programming languages built on LLVM infrastructure, covering their primitive usage patterns, pass mutations, and beyond.
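
As a rough illustration of the kind of primitive-usage statistic described above, the following sketch counts instruction opcodes in a snippet of textual LLVM IR. It is not part of the project's API: the function name `opcode_histogram`, the sample IR, and the regex-based line parsing are all assumptions made for this example, and a real pipeline would more likely operate on bitcode through LLVM's own tooling.

```python
import re
from collections import Counter

# Hypothetical sample: a tiny function in textual LLVM IR.
SAMPLE_IR = """\
define i32 @add_twice(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  %twice = add i32 %sum, %sum
  ret i32 %twice
}
"""

# An indented line, an optional "%name = " result, then the opcode word.
# Simplification: modifiers such as "tail" in "tail call" are counted as the
# opcode, so this is only an approximation of real IR parsing.
_INSTRUCTION = re.compile(r"^\s+(?:%[\w.$-]+\s*=\s*)?([a-z_]+)\b")


def opcode_histogram(ir_text: str) -> Counter:
    """Count opcodes of instructions found inside function bodies."""
    counts = Counter()
    in_function = False
    for line in ir_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("define "):
            in_function = True   # function body starts after this line
            continue
        if stripped == "}":
            in_function = False  # function body ended
            continue
        if not in_function or stripped.startswith(";"):
            continue             # skip globals, metadata, and comments
        match = _INSTRUCTION.match(line)
        if match:                # block labels are unindented, so they never match
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for opcode, count in opcode_histogram(SAMPLE_IR).most_common():
        print(opcode, count)
```

Running the same function over IR emitted by different front ends would give per-language histograms that can be compared directly, which is the general shape of a cross-language analysis.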

IR-Mutation Interception

Intercepts the compilation process at every point where the IR is mutated, enabling in-depth analysis of compilation and the construction of datasets based on IR compilation stages.