llvm_ir_dataset_utils.tools

llvm_ir_dataset_utils.tools.aggregate_build_sizes

Tool for aggregating and providing statistics on bitcode size.

llvm_ir_dataset_utils.tools.audit_licenses

A script for analyzing the license distribution of an already built corpus.

llvm_ir_dataset_utils.tools.audit_package_list_licenses

A script for analyzing the license composition of a list of packages.

llvm_ir_dataset_utils.tools.build_crate_from_repository

Tool to build a crate given just a repository.

llvm_ir_dataset_utils.tools.build_julia_packages

Tool for building a list of Julia packages.

llvm_ir_dataset_utils.tools.build_spack_package_from_list

A tool for building individual Spack packages, or every package in a list of Spack packages and their dependencies.

llvm_ir_dataset_utils.tools.build_swift_packages

Tool for building a list of Swift packages.

llvm_ir_dataset_utils.tools.collect_license_information

Tool for collecting license information on all projects and putting it into a JSON file.

llvm_ir_dataset_utils.tools.collect_textual_ir

A script for collecting a large amount of textual IR into a single file, aimed primarily at training basic BPE tokenizers.

llvm_ir_dataset_utils.tools.corpus_from_description

Tool that builds a bitcode corpus from a description.

llvm_ir_dataset_utils.tools.count_values

A tool for counting various quantities like tokens from gathered statistics CSV files.

llvm_ir_dataset_utils.tools.delete_folder

Tool for deleting a lot of inodes in parallel.
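
As a rough illustration of the idea of parallel deletion, the sketch below removes the top-level entries of a directory concurrently and then removes the directory itself. The function name `delete_folder_parallel` and its `max_workers` parameter are hypothetical and chosen for this example; this is not the tool's actual implementation or command-line interface, which this page does not show.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def delete_folder_parallel(root, max_workers=8):
  """Delete root by removing its top-level entries concurrently."""

  # Hypothetical helper for illustration only, not the tool's own code.
  def remove(path):
    # Real directories are removed recursively; files and symlinks are
    # unlinked directly.
    if os.path.isdir(path) and not os.path.islink(path):
      shutil.rmtree(path)
    else:
      os.remove(path)

  with os.scandir(root) as entries:
    paths = [entry.path for entry in entries]

  with ThreadPoolExecutor(max_workers=max_workers) as pool:
    # list() waits for every task and re-raises any worker exception.
    list(pool.map(remove, paths))

  # All children are gone at this point, so the directory is empty.
  os.rmdir(root)
```

Splitting the work by top-level entry is the simplest way to parallelize; it only helps when the directory has many children of comparable size.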

llvm_ir_dataset_utils.tools.export_deduplicated_corpus

Tool for taking in a list of module hashes and extracting all deduplicated modules into a separate directory.

llvm_ir_dataset_utils.tools.extract_build_failure_logs

Tool to get build failure logs and copy them into a folder.

llvm_ir_dataset_utils.tools.get_bbs

Tool for extracting basic blocks from the corpus.

llvm_ir_dataset_utils.tools.get_build_failure_logs

Tool to find all the logs for targets that failed to build from a corpus directory.

llvm_ir_dataset_utils.tools.get_common_constants

Tool for getting common tokenizer constants from bitcode modules.

llvm_ir_dataset_utils.tools.get_julia_packages

Tool for getting Julia packages.

llvm_ir_dataset_utils.tools.get_spack_package_list

Tool for getting all Spack packages that are usable for producing LLVM bitcode.

llvm_ir_dataset_utils.tools.get_swift_packages

Tool for getting a list of Swift packages.

llvm_ir_dataset_utils.tools.grep_source

Tool for searching all the source files within a corpus.

llvm_ir_dataset_utils.tools.link_files

Tool for running llvm-link over all bitcode files in a corpus.

llvm_ir_dataset_utils.tools.module_statistics

Tool for getting statistics on bitcode modules.

llvm_ir_dataset_utils.tools.parse_crates_database

A tool for downloading and parsing the crates.io database to get repositories and corpus descriptions out.

llvm_ir_dataset_utils.tools.process_to_parquet

A script for converting a deduplicated dataset into a Parquet dataset for distribution.

llvm_ir_dataset_utils.tools.search_strings

Tool for searching for strings in the bitcode (.bc) files that will be in the dataset distribution.

llvm_ir_dataset_utils.tools.spack_analyze_failures

A tool for finding the Spack build failures that break the most dependent packages.

llvm_ir_dataset_utils.tools.squash_hf_history

A tool for squashing the Hugging Face (HF) history.

llvm_ir_dataset_utils.tools.top_x_constants

Tool for getting the top x constants from a constant frequency histogram.
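
To make "top x constants" concrete, here is a minimal sketch of selecting the most frequent entries from a histogram. The function name `top_x` and the histogram layout (a dictionary mapping each constant to its count) are assumptions made for this example; the tool's real input format and interface are not described on this page.

```python
import heapq


def top_x(histogram, x):
  """Return the x most frequent (constant, count) pairs, most frequent first."""
  # Hypothetical helper: assumes histogram maps each constant to its count.
  # Sorting on (-count, constant) breaks ties deterministically.
  return heapq.nsmallest(x, histogram.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == '__main__':
  example = {'0': 120, '1': 95, '-1': 40, '4096': 7}
  print(top_x(example, 2))
```

Using `heapq` avoids sorting the whole histogram when x is small relative to the number of distinct constants.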

llvm_ir_dataset_utils.tools.upload_dataset_hf

A script for uploading a dataset, in the form of a folder of Parquet files, to Hugging Face.

llvm_ir_dataset_utils.tools.validate_parquet_db

A script that loads a folder of Parquet files produced by the process_to_parquet.py script and validates some of the fields.