ir_datasets: GOV
To use this dataset, you need a copy of GOV, provided by the University of Glasgow.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational agreement" and pay a fee to UoG to get a copy. The data are provided as hard drives that are shipped to you.
Once you have the data, ir_datasets will need the directories that look like the following:
ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/gov/corpus.
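As a minimal sketch of the linking step, the helper below symlinks a local copy of GOV to the `gov/corpus` location named above. The function name `link_gov_corpus`, its parameters, and the source directory are hypothetical and only serve the illustration. It does not check that the source has the directory layout ir_datasets expects.

```python
from pathlib import Path


def link_gov_corpus(source_dir, home=None):
    """Symlink a local GOV copy to <home>/gov/corpus and return the link path.

    source_dir is a hypothetical location of the copied hard-drive contents.
    home defaults to ~/.ir_datasets, the location named in the text above.
    """
    source = Path(source_dir).expanduser().resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"GOV source directory not found: {source}")

    home = Path(home).expanduser() if home is not None else Path.home() / ".ir_datasets"
    target = home / "gov" / "corpus"
    target.parent.mkdir(parents=True, exist_ok=True)

    if target.is_symlink() or target.exists():
        # Leave any existing corpus directory or link untouched.
        return target

    target.symlink_to(source, target_is_directory=True)
    return target
```

Calling `link_gov_corpus("/path/to/your/gov/copy")` would create the link under `~/.ir_datasets`. Copying the directories instead of linking them should work equally well, at the cost of disk space.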
GOV web document collection. Used for early TREC Web Tracks. Not to be confused with gov2.
The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 1247753,
    "fields": {
      "doc_id": {
        "max_len": 14,
        "common_prefix": "G"
      }
    }
  }
}
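The `doc_id` statistics in the metadata above can serve as a quick sanity check on identifiers before looking them up. The sketch below is one way to do that. The helper name `doc_id_matches_metadata` is an assumption for illustration and not part of ir_datasets, and passing the check does not mean the document exists in the corpus.

```python
import json

# Document statistics copied from the metadata block above.
GOV_METADATA = json.loads("""
{"docs": {"count": 1247753,
          "fields": {"doc_id": {"max_len": 14, "common_prefix": "G"}}}}
""")


def doc_id_matches_metadata(doc_id, metadata=GOV_METADATA):
    """Return True if doc_id is consistent with the recorded doc_id statistics."""
    stats = metadata["docs"]["fields"]["doc_id"]
    # Every GOV doc_id shares the common prefix and is at most max_len characters.
    return doc_id.startswith(stats["common_prefix"]) and len(doc_id) <= stats["max_len"]
```

A check like this can help catch identifiers from a different collection, such as gov2, being mixed into a run file.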
The TREC Web Track 2002 ad-hoc ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2002 queries
[query_id]    [title]    [description]    [narrative]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2002.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from gov
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2002 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2002')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Not Relevant | 55K | 97.2% | 
| 1 | Relevant | 1.6K | 2.8% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2002 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2002.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
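A common next step after iterating over qrels is to collect the relevant documents for each query. The following is a minimal sketch that works on any iterable of records with the `query_id`, `doc_id`, `relevance`, and `iteration` fields shown above. The `Qrel` namedtuple and the function `relevant_docs_by_query` are stand-ins defined here for illustration, not ir_datasets names.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as the qrels namedtuple shown above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevant_docs_by_query(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged at or above min_relevance."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)
```

With the corpus installed, passing `dataset.qrels_iter()` as the `qrels` argument should produce the mapping for the loaded subset. Queries with no relevant judgments do not appear in the result.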
Bibtex:
@inproceedings{Craswell2002TrecWeb, title={Overview of the TREC-2002 Web Track}, author={Nick Craswell and David Hawking}, booktitle={TREC}, year={2002} }
{
  "docs": {
    "count": 1247753,
    "fields": {
      "doc_id": {
        "max_len": 14,
        "common_prefix": "G"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 56650,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 55076,
          "1": 1574
        }
      }
    }
  }
}
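The percentages in the relevance table for this subset follow from the `counts_by_value` entries in the metadata above. As a short worked example, the hypothetical helper `relevance_shares` below divides each count by the total number of qrels and rounds to one decimal place.

```python
# Counts copied from the "qrels" metadata of gov/trec-web-2002 above.
COUNTS_2002 = {"0": 55076, "1": 1574}


def relevance_shares(counts_by_value):
    """Return the share of each relevance level as a percentage (one decimal)."""
    total = sum(counts_by_value.values())
    return {
        level: round(100 * count / total, 1)
        for level, count in counts_by_value.items()
    }
```

Applied to `COUNTS_2002`, the total is 56650 judgments, of which the 1574 relevant ones make up about 2.8%.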
The TREC Web Track 2002 named page ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2002/named-page queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2002.named-page.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from gov
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2002/named-page docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2002.named-page')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 1 | Name refers to this page | 170 | 100.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2002/named-page qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2002.named-page.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Craswell2002TrecWeb, title={Overview of the TREC-2002 Web Track}, author={Nick Craswell and David Hawking}, booktitle={TREC}, year={2002} }
{
  "docs": {
    "count": 1247753,
    "fields": {
      "doc_id": {
        "max_len": 14,
        "common_prefix": "G"
      }
    }
  },
  "queries": {
    "count": 150
  },
  "qrels": {
    "count": 170,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 170
        }
      }
    }
  }
}
The TREC Web Track 2003 ad-hoc ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2003 queries
[query_id]    [title]    [description]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2003.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from gov
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2003 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2003')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Not Relevant | 51K | 99.0% | 
| 1 | Relevant | 516 | 1.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2003 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2003.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Craswell2003TrecWeb, title={Overview of the TREC 2003 Web Track}, author={Nick Craswell and David Hawking and Ross Wilkinson and Mingfang Wu}, booktitle={TREC}, year={2003} }
{
  "docs": {
    "count": 1247753,
    "fields": {
      "doc_id": {
        "max_len": 14,
        "common_prefix": "G"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 51062,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 50546,
          "1": 516
        }
      }
    }
  }
}
The TREC Web Track 2003 named page ranking benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2003/named-page queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2003.named-page.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from gov
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2003/named-page docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2003.named-page')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 1 | Name refers to this page | 352 | 100.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2003/named-page qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2003.named-page.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Craswell2003TrecWeb, title={Overview of the TREC 2003 Web Track}, author={Nick Craswell and David Hawking and Ross Wilkinson and Mingfang Wu}, booktitle={TREC}, year={2003} }
{
  "docs": {
    "count": 1247753,
    "fields": {
      "doc_id": {
        "max_len": 14,
        "common_prefix": "G"
      }
    }
  },
  "queries": {
    "count": 300
  },
  "qrels": {
    "count": 352,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 352
        }
      }
    }
  }
}
The TREC Web Track 2004 ad-hoc ranking benchmark.
Queries include a combination of topic distillation, homepage finding, and named page finding.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2004 queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2004.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from gov
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2004 docs
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2004')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Not Relevant | 87K | 98.0% | 
| 1 | Relevant | 1.8K | 2.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export gov/trec-web-2004 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2004.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Craswell2004TrecWeb, title={Overview of the TREC-2004 Web Track}, author={Nick Craswell and David Hawking}, booktitle={TREC}, year={2004} }
{
  "docs": {
    "count": 1247753,
    "fields": {
      "doc_id": {
        "max_len": 14,
        "common_prefix": "G"
      }
    }
  },
  "queries": {
    "count": 225
  },
  "qrels": {
    "count": 88566,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 86803,
          "1": 1763
        }
      }
    }
  }
}
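The query and qrel counts reported in the metadata blocks on this page give a rough sense of how densely each subset is judged. The sketch below computes the average number of judgments per query from those figures. The names `SUBSET_COUNTS` and `qrels_per_query` are hypothetical, and the average assumes every counted query has judgments, which the metadata alone does not confirm.

```python
# Query and qrel counts copied from the metadata blocks on this page.
SUBSET_COUNTS = {
    "gov/trec-web-2002": {"queries": 50, "qrels": 56650},
    "gov/trec-web-2002/named-page": {"queries": 150, "qrels": 170},
    "gov/trec-web-2003": {"queries": 50, "qrels": 51062},
    "gov/trec-web-2003/named-page": {"queries": 300, "qrels": 352},
    "gov/trec-web-2004": {"queries": 225, "qrels": 88566},
}


def qrels_per_query(subset_counts):
    """Return the average number of qrels per query for each subset."""
    return {
        name: counts["qrels"] / counts["queries"]
        for name, counts in subset_counts.items()
    }


if __name__ == "__main__":
    for name, average in qrels_per_query(SUBSET_COUNTS).items():
        print(f"{name}: {average:.1f} qrels per query")
```

The named-page subsets average close to one judgment per query, while the ad-hoc subsets have hundreds or more per query.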