Github: allenai/ir_datasets

ir_datasets: Command Line Interface

export command

Data can be exported to stdout in various formats using the ir_datasets export command.

ir_datasets export [dataset-id] docs [--fields] [--format]

Exports documents

--fields: select which fields from the document to export (defaults to all)

--format: select output format to use: tsv (default) or jsonl

ir_datasets export [dataset-id] queries [--fields] [--format]

Exports queries

--fields: select which fields from the query to export (defaults to all)

--format: select output format to use: tsv (default) or jsonl
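When the default tsv format is used, the exported text can be read back with Python's csv module. The following is a minimal sketch, assuming each record is one line with no header row and that columns arrive in the order of the requested fields. The field names `query_id` and `text` and the sample rows are illustrative, not taken from a particular dataset.

```python
import csv
import io


def parse_tsv_export(tsv_text, fields):
    """Map each tab-separated row onto the field names that were requested.

    Assumes no header row and that columns follow the order of `fields`.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(fields):
            raise ValueError(f"expected {len(fields)} columns, got {len(row)}: {row!r}")
        records.append(dict(zip(fields, row)))
    return records


# Illustrative sample standing in for captured stdout of a queries export.
sample = "1\twhat is information retrieval\n2\tneural ranking models\n"
queries = parse_tsv_export(sample, ["query_id", "text"])
print(queries[0]["text"])  # what is information retrieval
```

Raising on a column-count mismatch is a deliberate choice here: it surfaces a wrong `fields` list immediately rather than silently dropping columns.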

ir_datasets export [dataset-id] qrels [--fields] [--format]

Exports qrels

--fields: select which fields from the qrels to export (defaults to all)

--format: select output format to use: trec (default), tsv or jsonl
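For qrels exported in the default trec format, a small parser is often enough. This sketch assumes the conventional four-column TREC qrels layout (query ID, iteration, document ID, relevance) separated by whitespace. The `Qrel` tuple, the function name, and the sample lines are hypothetical.

```python
from typing import NamedTuple


class Qrel(NamedTuple):
    query_id: str
    iteration: str
    doc_id: str
    relevance: int


def parse_trec_qrels(text):
    """Parse whitespace-separated TREC-style qrels lines into Qrel tuples."""
    qrels = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(parts)}")
        query_id, iteration, doc_id, relevance = parts
        qrels.append(Qrel(query_id, iteration, doc_id, int(relevance)))
    return qrels


# Illustrative sample standing in for captured stdout of a qrels export.
sample = "1 0 doc_a 1\n1 0 doc_b 0\n2 0 doc_c 2\n"
relevant = [q.doc_id for q in parse_trec_qrels(sample) if q.relevance > 0]
print(relevant)  # ['doc_a', 'doc_c']
```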

ir_datasets export [dataset-id] scoreddocs [--fields] [--format]

Exports scoreddocs

--fields: select which fields from the scoreddocs to export (defaults to all)

--format: select output format to use: trec (default), tsv or jsonl
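When the export command is driven from a script, it can help to assemble the argument list in one place and check the format against the choices listed for each entity type. The sketch below is one way to do that. The helper name is hypothetical, `vaswani` is only an example dataset ID, and passing field names space-separated after --fields is an assumption about the flag's syntax.

```python
import shlex

# Formats accepted for each entity type, as listed in the option descriptions.
EXPORT_FORMATS = {
    "docs": ("tsv", "jsonl"),
    "queries": ("tsv", "jsonl"),
    "qrels": ("trec", "tsv", "jsonl"),
    "scoreddocs": ("trec", "tsv", "jsonl"),
}


def build_export_command(dataset_id, entity, fields=None, fmt=None):
    """Return an argv list for `ir_datasets export`; omitted options use CLI defaults."""
    if entity not in EXPORT_FORMATS:
        raise ValueError(f"unknown entity type: {entity!r}")
    argv = ["ir_datasets", "export", dataset_id, entity]
    if fields:
        # Assumption: field names are passed space-separated after --fields.
        argv += ["--fields", *fields]
    if fmt is not None:
        if fmt not in EXPORT_FORMATS[entity]:
            raise ValueError(f"{fmt!r} is not a listed format for {entity}")
        argv += ["--format", fmt]
    return argv


cmd = build_export_command("vaswani", "docs", fields=["doc_id", "text"], fmt="jsonl")
print(shlex.join(cmd))  # ir_datasets export vaswani docs --fields doc_id text --format jsonl
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` to capture the exported text from stdout.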

lookup command

You can look up documents by their doc_id using the ir_datasets lookup command.

ir_datasets lookup [dataset-id] [doc_ids ...] [--fields] [--format]

Efficiently finds documents that have the provided doc_ids

--fields: select which fields from the documents to export (defaults to all)

--format: select output format to use: trec (default), tsv or jsonl
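Output from lookup is convenient to consume as JSON Lines. The following sketch indexes such output by document ID, assuming the command was run with --format jsonl and that each JSON object carries a `doc_id` key. The function name and the sample records are illustrative.

```python
import json


def index_by_doc_id(jsonl_text, id_field="doc_id"):
    """Build a {doc_id: record} dict from JSON Lines text."""
    docs = {}
    for line in jsonl_text.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines, including the trailing one
        record = json.loads(line)
        docs[record[id_field]] = record
    return docs


# Illustrative sample standing in for captured stdout of a lookup.
sample = '{"doc_id": "7", "text": "first"}\n{"doc_id": "9", "text": "second"}\n'
docs = index_by_doc_id(sample)
print(docs["9"]["text"])  # second
```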

doc_fifos command

You can create output FIFOs suitable for Anserini indexing using the ir_datasets doc_fifos command.

Note that unlike export and lookup, this command always outputs JSONL in a format that Anserini can index (id and content fields). All selected fields are concatenated.

The command prints an Anserini indexing command that you can run. The process keeps running until all documents have been sent to the fifos.

ir_datasets doc_fifos [dataset-id] [--fields] [--count] [--dir]

Creates a temporary directory with fifos

--fields: select which fields from the documents to export (defaults to all). These fields are concatenated.

--count: number of fifos to create. Defaults to 1 less than the number of processors (or 1).

--dir: where to put the fifos? Defaults to a new temp directory.
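To make the defaults and the record shape concrete, here is a rough sketch of the described behavior, not the tool's actual implementation. The body key is written as `contents`, the name used by Anserini's JSON collections, whereas the text above calls it the content field. The newline separator used to concatenate fields is likewise an assumption, and both function names are hypothetical.

```python
import json
import os


def default_fifo_count():
    """One fewer than the processor count, but never below 1."""
    return max((os.cpu_count() or 1) - 1, 1)


def to_index_record(doc_id, field_values, separator="\n"):
    """Serialize one document as a JSON line with an id and concatenated text.

    Assumptions: the "contents" key name and the separator are illustrative.
    """
    return json.dumps({"id": doc_id, "contents": separator.join(field_values)})


print(default_fifo_count())
print(to_index_record("d1", ["A title", "Some body text"]))
```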

clean command

This command cleans up data that can be re-downloaded or re-generated later, which is helpful for freeing up disk space.

ir_datasets clean [dataset-ids ...] [--list] [-y] [-H]

[dataset-ids]: one or more top-level dataset IDs to clean up.

--list: list the sizes of the provided datasets (or all datasets if none are given).

-y: say "yes" to prompt (i.e., confirm delete).

-H: print sizes in total bytes, rather than human-readable format.
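A cautious workflow is to list sizes first and delete only afterwards. The sketch below assembles both invocations from the flags described above. The helper name is hypothetical and `vaswani` is only an example dataset ID.

```python
def build_clean_command(dataset_ids=(), list_only=False, assume_yes=False, raw_bytes=False):
    """Return an argv list for `ir_datasets clean` using the documented flags."""
    argv = ["ir_datasets", "clean", *dataset_ids]
    if list_only:
        argv.append("--list")  # report sizes
    if assume_yes:
        argv.append("-y")  # answer "yes" to the confirmation prompt
    if raw_bytes:
        argv.append("-H")  # sizes in total bytes rather than human-readable
    return argv


# Step 1: inspect how much space the dataset occupies, in bytes.
print(" ".join(build_clean_command(["vaswani"], list_only=True, raw_bytes=True)))
# Step 2: delete without an interactive prompt.
print(" ".join(build_clean_command(["vaswani"], assume_yes=True)))
```

Leaving `assume_yes` off by default keeps the confirmation prompt in place unless a script opts out of it explicitly.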