ir_datasets: Command Line Interface
Data can be exported to stdout in various formats using the ir_datasets export command.
ir_datasets export [dataset-id] docs [--fields] [--format]
Exports documents
--fields
: select which fields from the document to export (defaults to all)
--format
: select output format to use: tsv (default) or jsonl
ir_datasets export [dataset-id] queries [--fields] [--format]
Exports queries
--fields
: select which fields from the query to export (defaults to all)
--format
: select output format to use: tsv (default) or jsonl
ir_datasets export [dataset-id] qrels [--fields] [--format]
Exports qrels
--fields
: select which fields from the qrels to export (defaults to all)
--format
: select output format to use: trec (default), tsv or jsonl
ir_datasets export [dataset-id] scoreddocs [--fields] [--format]
Exports scoreddocs
--fields
: select which fields from the scoreddocs to export (defaults to all)
--format
: select output format to use: trec (default), tsv or jsonl
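The export commands above can also be driven from a script. The following is a minimal sketch, not part of ir_datasets itself: the helper names (build_export_cmd, parse_jsonl, export_records) are hypothetical, "vaswani" is used only as an example dataset ID, and the record keys in the usage note are assumptions about what the jsonl output contains.

```python
import json
import subprocess
from typing import Dict, Iterable, Iterator, List

# Entity types taken from the export synopses above.
ENTITIES = {"docs", "queries", "qrels", "scoreddocs"}


def build_export_cmd(dataset_id: str, entity: str, fmt: str = "jsonl") -> List[str]:
    """Assemble the argument list for `ir_datasets export` (hypothetical helper)."""
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity type: {entity!r}")
    return ["ir_datasets", "export", dataset_id, entity, "--format", fmt]


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Turn JSONL text lines into dicts, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def export_records(dataset_id: str, entity: str) -> Iterator[Dict]:
    """Run the CLI and stream parsed records.

    Requires the ir_datasets command to be installed and on PATH.
    """
    cmd = build_export_cmd(dataset_id, entity)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from parse_jsonl(proc.stdout)
```

For example, `for query in export_records("vaswani", "queries"): ...` would iterate over one dict per exported query, assuming the command is available and the dataset can be downloaded.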
You can look up documents by their doc_id using the ir_datasets lookup command.
ir_datasets lookup [dataset-id] [doc_ids ...] [--fields] [--format]
Efficiently finds documents that have the provided doc_ids
--fields
: select which fields from the documents to export (defaults to all)
--format
: select output format to use: trec (default), tsv or jsonl
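As a rough illustration of the lookup synopsis, the sketch below assembles a lookup command for several doc_ids and parses tab-separated output into rows. The helper names are hypothetical, and the column layout of the tsv output (one column per selected field) is an assumption.

```python
import csv
from typing import Iterable, List, Sequence


def build_lookup_cmd(dataset_id: str, doc_ids: Sequence[str], fmt: str = "tsv") -> List[str]:
    """Assemble the argument list for `ir_datasets lookup` (hypothetical helper)."""
    if not doc_ids:
        raise ValueError("at least one doc_id is required")
    # Positional doc_ids come before the options, as in the synopsis above.
    return ["ir_datasets", "lookup", dataset_id, *doc_ids, "--format", fmt]


def parse_tsv(lines: Iterable[str]) -> List[List[str]]:
    """Split tab-separated lines into lists of column values.

    QUOTE_NONE keeps quote characters inside document text untouched.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)`, and the captured stdout split into lines and handed to `parse_tsv`.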
You can create output FIFOs suitable for Anserini indexing using the ir_datasets doc_fifos command.
Note that unlike export and lookup, this command always outputs JSONL in a format that Anserini can index (id and content fields). All selected fields are concatenated.
This command will output a command you can run for indexing with Anserini. This process remains running until all documents are sent to fifos.
ir_datasets doc_fifos [dataset-id] [--fields] [--count] [--dir]
Creates a temporary directory with fifos
--fields
: select which fields from the documents to export (defaults to all). These fields are concatenated.
--count
: how many fifos to make? Defaults to 1 less than the number of processors (or 1).
--dir
: where to put the fifos? Defaults to a new temp directory.
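To illustrate why the process keeps running until a reader has consumed everything, here is a small, self-contained sketch of the FIFO mechanism on a POSIX system. It is not the ir_datasets implementation: the function names are hypothetical, the "id" and "contents" keys are assumed from Anserini's JSON collection format, and joining fields with a single space is an assumption about how concatenation works.

```python
import json
import os
import tempfile
import threading


def write_docs(fifo_path, docs):
    """Write one JSON line per document into the FIFO.

    Opening a FIFO for writing blocks until a reader opens the other end.
    """
    with open(fifo_path, "w") as fifo:
        for doc_id, fields in docs:
            # Assumption: selected fields are joined with a single space.
            record = {"id": doc_id, "contents": " ".join(fields)}
            fifo.write(json.dumps(record) + "\n")


def read_docs(fifo_path):
    """Stand-in for the indexer: read JSON lines until the writer closes."""
    with open(fifo_path) as fifo:
        return [json.loads(line) for line in fifo]


def demo():
    docs = [("d1", ["Title one", "Body one"]), ("d2", ["Title two", "Body two"])]
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "docs.fifo")
        os.mkfifo(fifo_path)  # POSIX only
        writer = threading.Thread(target=write_docs, args=(fifo_path, docs))
        writer.start()
        received = read_docs(fifo_path)
        writer.join()
    return received
```

Calling `demo()` returns the two documents as dicts once the writer thread has finished; with several FIFOs, an indexer could read them in parallel.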
The ir_datasets clean command removes data that can be re-downloaded or re-generated later, which is helpful for freeing up disk space.
ir_datasets clean [dataset-ids ...] [--list] [-y] [-H]
[dataset-ids]
: one or more top-level dataset IDs to clean up.
--list
: list out the sizes of the provided datasets (or all datasets if none are given).
-y
: say "yes" to prompt (i.e., confirm delete).
-H
: print sizes in total bytes, rather than human-readable format.