llvm_ir_dataset_utils.tools
Tool for aggregating and providing statistics on bitcode size.

A script for analyzing the license distribution of an already built corpus.

A script for analyzing the license composition of a list of packages.

Tool to build a crate given just a repository.

Tool for building a list of Julia packages.

A tool for building individual Spack packages, or every package in a list of Spack packages and their dependencies.

Tool for building a list of Cargo packages.

Tool for collecting license information on all projects and putting it into a JSON file.

A script for collecting a large amount of textual IR into a single file, aimed primarily at training basic BPE tokenizers.

Tool that builds a bitcode corpus from a description.

A tool for counting various quantities, such as tokens, from gathered statistics CSV files.

Tool for deleting a large number of inodes in parallel.

Tool that takes a list of module hashes and extracts all deduplicated modules into a separate directory.

Tool to get build failure logs and copy them into a folder.

Tool for extracting basic blocks from the corpus.

Tool to find all the logs for targets that failed to build from a corpus directory.

Tool for getting common tokenizer constants from bitcode modules.

Tool for getting Julia packages.

Tool for getting all Spack packages that are usable for producing LLVM bitcode.

Tool for getting the Swift package list.

Tool for searching all the source files within a corpus.

Tool for running llvm-link over all bitcode files in a corpus.

Tool for getting statistics on bitcode modules.

A tool for downloading and parsing the crates.io database to extract repositories and corpus descriptions.

A script for converting a deduplicated dataset into a Parquet dataset for distribution.

Tool for searching for strings in the .bc files that will be in the dataset distribution.

A tool for finding Spack build failures that break the most dependent packages.

A tool for squashing the Hugging Face (HF) history.

Tool for getting the top x constants from a constant frequency histogram.

A script for uploading a dataset, in the form of a folder of Parquet files, to Hugging Face.

A script that loads a folder of Parquet files produced by the process_to_parquet.py script and validates some of the fields.