ir_datasets: Command Line Interface
Data can be exported to stdout in various formats using the ir_datasets export command.
ir_datasets export [dataset-id] docs [--fields] [--format]
Exports documents
--fields: select which fields from the document to export (defaults to all)
--format: select output format to use: tsv (default) or jsonl
ir_datasets export [dataset-id] queries [--fields] [--format]
Exports queries
--fields: select which fields from the query to export (defaults to all)
--format: select output format to use: tsv (default) or jsonl
ir_datasets export [dataset-id] qrels [--fields] [--format]
Exports qrels
--fields: select which fields from the qrels to export (defaults to all)
--format: select output format to use: trec (default), tsv or jsonl
ir_datasets export [dataset-id] scoreddocs [--fields] [--format]
Exports scoreddocs
--fields: select which fields from the scoreddocs to export (defaults to all)
--format: select output format to use: trec (default), tsv or jsonl
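If you want to consume exported data from a script, one option is to run the export command with --format jsonl and parse its stdout line by line. The following is a minimal sketch, not part of ir_datasets itself: the helper names are hypothetical, and it assumes that field names are passed space-separated after --fields.

```python
import json
import subprocess

ENTITY_TYPES = ("docs", "queries", "qrels", "scoreddocs")


def build_export_command(dataset_id, entity, fields=None, fmt=None):
    """Assemble the argument list for `ir_datasets export` (hypothetical helper)."""
    if entity not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity!r}")
    cmd = ["ir_datasets", "export", dataset_id, entity]
    if fields:
        # Assumption: field names follow --fields as separate arguments.
        cmd += ["--fields", *fields]
    if fmt:
        cmd += ["--format", fmt]
    return cmd


def parse_jsonl(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def export_records(dataset_id, entity, fields=None):
    """Run the export as a subprocess and stream the parsed records."""
    cmd = build_export_command(dataset_id, entity, fields, fmt="jsonl")
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from parse_jsonl(proc.stdout)
```

For example, export_records("my-dataset", "queries") would iterate over the queries of a dataset as dicts, where "my-dataset" is a placeholder for a real dataset ID.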
You can look up documents by their doc_id using the ir_datasets lookup command.
ir_datasets lookup [dataset-id] [doc_ids ...] [--fields] [--format]
Efficiently finds documents that have the provided doc_ids
--fields: select which fields from the documents to export (defaults to all)
--format: select output format to use: trec (default), tsv or jsonl
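A lookup can be scripted in a similar way. The sketch below requests TSV output explicitly and pairs each tab-separated column with the requested field name. It rests on several assumptions: the helper names are hypothetical, the field name "text" is only an illustration, the columns are assumed to appear in the order given to --fields, and field values are assumed not to contain tabs.

```python
import subprocess


def build_lookup_command(dataset_id, doc_ids, fields=None, fmt=None):
    """Assemble the argument list for `ir_datasets lookup` (hypothetical helper)."""
    cmd = ["ir_datasets", "lookup", dataset_id, *doc_ids]
    if fields:
        # Assumption: field names follow --fields as separate arguments.
        cmd += ["--fields", *fields]
    if fmt:
        cmd += ["--format", fmt]
    return cmd


def parse_tsv_line(line, fields):
    """Pair the tab-separated values of one line with the requested field names."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(fields, values))


def lookup_docs(dataset_id, doc_ids, fields):
    """Run the lookup and return one dict per output line."""
    cmd = build_lookup_command(dataset_id, doc_ids, fields, fmt="tsv")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [parse_tsv_line(line, fields) for line in result.stdout.splitlines() if line]
```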
You can create output FIFOs suitable for Anserini indexing using the ir_datasets doc_fifos command.
Note that unlike export and lookup, this command always outputs JSONL in a format that Anserini can use to index (id and content fields). All selected fields are concatenated.
The command prints a command line that you can run to index with Anserini. The process keeps running until all documents have been sent to the FIFOs.
ir_datasets doc_fifos [dataset-id] [--fields] [--count] [--dir]
Creates a temporary directory with fifos
--fields: select which fields from the documents to export (defaults to all). These fields are concatenated.
--count: how many fifos to make? Defaults to 1 less than the number of processors (or 1).
--dir: where to put the fifos? Defaults to a new temp directory.
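The default for --count described above can be read as "one fewer than the number of processors, but never less than 1". The following sketch illustrates that rule only; the function name is hypothetical and this is not necessarily how ir_datasets computes the value internally.

```python
import os


def default_fifo_count(cpu_count=None):
    """Return one fewer than the processor count, with a minimum of 1."""
    n = os.cpu_count() if cpu_count is None else cpu_count
    # os.cpu_count() can return None when the count cannot be determined.
    n = n or 1
    return max(n - 1, 1)
```

On a machine with 8 processors this rule gives 7 fifos; on a single-processor machine it gives 1.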
The ir_datasets clean command removes data that can be re-downloaded or re-generated later. This is helpful for freeing up disk space.
ir_datasets clean [dataset-ids ...] [--list] [-y] [-H]
[dataset-ids]: one or more top-level dataset IDs to clean up.
--list: list out the sizes of the provided datasets (or all datasets if none are given).
-y: say "yes" to prompt (i.e., confirm delete).
-H: print sizes in total bytes, rather than human-readable format.
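To illustrate the difference between a total byte count and a human-readable size, here is a small sketch of one possible conversion. The function name, the binary units, and the rounding are assumptions made for illustration; the exact format that ir_datasets clean prints may differ.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (illustrative format only)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024
```

Under these assumptions, a total of 1536 bytes would be shown as 1.5 KiB in human-readable form.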