ir_datasets: ClueWeb09
To use this dataset, you need a copy of ClueWeb 2009, provided by CMU.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" and pay a fee to CMU to get a copy. The data are provided on hard drives that are shipped to you.
Once you have the data, ir_datasets will need the directories that look like the following:
ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/clueweb09/corpus.
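As a minimal sketch of the linking step, the helper below symlinks a list of source directories into a target directory. The function name `link_corpus` and the source path in the usage comment are hypothetical; only the target location `~/.ir_datasets/clueweb09/corpus` comes from the text above.

```python
from pathlib import Path


def link_corpus(source_dirs, target_root):
    """Symlink each source directory into target_root and return the links."""
    target_root = Path(target_root).expanduser()
    target_root.mkdir(parents=True, exist_ok=True)
    links = []
    for src in map(Path, source_dirs):
        link = target_root / src.name
        # Skip entries that already exist so the helper can be re-run safely.
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(src.resolve(), target_is_directory=True)
        links.append(link)
    return links


# Hypothetical usage; replace the source path with wherever the drives are mounted:
# link_corpus(Path("/mnt/clueweb09").iterdir(), "~/.ir_datasets/clueweb09/corpus")
```

Symlinks avoid duplicating the corpus on disk; copying the directories works equally well if links are not an option.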
ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.
The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
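Because the full collection holds over a billion documents, it is often useful to inspect only the first few records. The sketch below uses the standard library's `itertools.islice` for that; the helper name `take` is an assumption for illustration and is not part of ir_datasets.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(iterable, n))


# Hypothetical usage with the dataset loaded above:
# first_docs = take(dataset.docs_iter(), 5)
```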
ir_datasets export clueweb09 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 1040859705,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-"
      }
    }
  }
}
Subset of ClueWeb09 with only Arabic-language documents.
Language: ar
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ar")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ar docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ar')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 29192662,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-ar000"
      }
    }
  }
}
Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  }
}
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2009.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2009')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 9.1K | 69.5% | 
| 1 | relevant | 2.5K | 19.2% | 
| 2 | highly relevant | 1.5K | 11.3% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2009.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 13118,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 9116,
          "1": 2514,
          "2": 1488
        }
      }
    }
  }
}
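The percentages in the relevance table can be recomputed from the `counts_by_value` metadata. The sketch below does this for the counts listed above; the function name `relevance_shares` is an illustrative assumption.

```python
def relevance_shares(counts_by_value):
    """Return each label's share of all judgments, in percent (one decimal)."""
    total = sum(counts_by_value.values())
    return {
        label: round(100 * count / total, 1)
        for label, count in counts_by_value.items()
    }


# Counts copied from the qrels metadata above.
counts = {"0": 9116, "1": 2514, "2": 1488}
shares = relevance_shares(counts)
```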
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 12K | 75.0% | 
| 1 | relevant | 4.1K | 25.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2009/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2009.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
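Diversity qrels carry a `subtopic_id`, so a common first step is to group relevant documents per query and subtopic. The following is a sketch under stated assumptions: `Qrel` is a stand-in namedtuple with the field names shown above, the sample records are made up, and `relevant_docs_by_subtopic` is a hypothetical helper rather than an ir_datasets function.

```python
from collections import defaultdict, namedtuple

# Stand-in record type; field names follow the qrel namedtuple shown above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "subtopic_id"])


def relevant_docs_by_subtopic(qrels, min_relevance=1):
    """Map (query_id, subtopic_id) to the set of doc_ids judged relevant."""
    groups = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            groups[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(groups)


# Made-up records for illustration only.
sample = [
    Qrel("1", "doc-a", 1, "1"),
    Qrel("1", "doc-b", 0, "1"),
    Qrel("1", "doc-a", 1, "2"),
    Qrel("2", "doc-c", 1, "1"),
]
groups = relevant_docs_by_subtopic(sample)
```

With a loaded dataset, `dataset.qrels_iter()` could be passed in place of `sample`.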
Bibtex:
@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 16347,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 12266,
          "1": 4081
        }
      }
    }
  }
}
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2010.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2010')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 715 | 4.5% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 76.0% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.3K | 14.6% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 682 | 4.3% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 90 | 0.6% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
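Many evaluation measures expect binary or non-negative labels, so the graded scale above usually needs a mapping first. The sketch below shows one possible convention: treating level 1 and above as relevant, and clamping the junk label (-2) to a gain of 0. Both choices are assumptions for illustration, not something the source prescribes.

```python
# Short names for each level, copied from the table above.
LEVELS = {-2: "Junk", 0: "Non", 1: "Rel", 2: "HRel", 3: "Key", 4: "Nav"}


def to_binary(relevance, threshold=1):
    """Collapse a graded label to 0/1 (assumed threshold: 1 and above is relevant)."""
    return 1 if relevance >= threshold else 0


def to_gain(relevance):
    """Clamp negative labels to 0 so junk pages contribute no gain (assumption)."""
    return max(relevance, 0)


binary = {level: to_binary(level) for level in LEVELS}
gains = {level: to_gain(level) for level in LEVELS}
```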
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2010.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 15845,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 12040,
          "1": 2318,
          "-2": 715,
          "2": 682,
          "3": 90
        }
      }
    }
  }
}
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 0 | 0.0% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 0 | 0.0% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.5K | 100.0% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 0 | 0.0% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 0 | 0.0% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2010/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2010.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 5522,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 5522
        }
      }
    }
  }
}
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2011.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2011')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 499 | 3.8% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 11K | 83.5% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.1K | 8.4% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 354 | 2.7% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 208 | 1.6% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2011.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 13081,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 10920,
          "1": 1100,
          "2": 354,
          "-2": 499,
          "3": 208
        }
      }
    }
  }
}
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.7K | 3.9% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 38K | 85.8% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 3.0K | 6.9% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 919 | 2.1% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 556 | 1.3% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2011/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2011.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 43889,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 37665,
          "1": 3016,
          "2": 919,
          "-2": 1733,
          "3": 556
        }
      }
    }
  }
}
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2012.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2012')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 561 | 5.6% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 7.2K | 71.6% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 1.4K | 13.8% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 300 | 3.0% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 17 | 0.2% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 580 | 5.8% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2012.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 10022,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 561,
          "0": 7178,
          "1": 1386,
          "4": 580,
          "2": 300,
          "3": 17
        }
      }
    }
  }
}
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/catb
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 2.2K | 5.7% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 31K | 78.7% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 3.5K | 9.0% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 887 | 2.3% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 47 | 0.1% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 1.7K | 4.3% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/catb/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/catb/trec-web-2012/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.catb.trec-web-2012.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }
{
  "docs": {
    "count": 50220423,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 38992,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 2237,
          "0": 30669,
          "1": 3494,
          "4": 1658,
          "2": 887,
          "3": 47
        }
      }
    }
  }
}
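Diversity qrels carry a `subtopic_id` per judgment, so a typical first step is to group the relevant documents by query and subtopic. The following is a minimal sketch: `DiversityQrel` is a local stand-in for the namedtuple shown above, the records are made up, and treating IDs as strings is an assumption.

```python
from collections import defaultdict, namedtuple

# Local stand-in for the qrel namedtuple; the sample records are hypothetical.
DiversityQrel = namedtuple(
    "DiversityQrel", ["query_id", "doc_id", "relevance", "subtopic_id"]
)

sample_qrels = [
    DiversityQrel("q1", "doc-a", 1, "1"),
    DiversityQrel("q1", "doc-b", 0, "1"),
    DiversityQrel("q1", "doc-a", 2, "2"),
    DiversityQrel("q2", "doc-c", -2, "1"),
]


def relevant_by_subtopic(qrels, min_relevance=1):
    """Map (query_id, subtopic_id) to the set of doc_ids judged >= min_relevance."""
    grouped = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[(qrel.query_id, qrel.subtopic_id)].add(qrel.doc_id)
    return dict(grouped)


grouped = relevant_by_subtopic(sample_qrels)
print(grouped)
```

With the real data, `dataset.qrels_iter()` would take the place of `sample_qrels`.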
Subset of ClueWeb09 with only German-language documents.
Language: de
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/de")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/de docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.de')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 49814309,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-de00"
      }
    }
  }
}
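The `max_len` of 25 and the `common_prefix` values are consistent with document IDs of the form `clueweb09-<segment>-<file>-<record>`. The sketch below splits such an ID into parts; the exact layout encoded in the regular expression is an assumption, and the sample ID only illustrates the format.

```python
import re

# Assumed layout: clueweb09-<2-letter language><4-char segment>-<2-digit file>-<5-digit record>
DOC_ID_RE = re.compile(r"^clueweb09-([a-z]{2})([a-z0-9]{4})-(\d{2})-(\d{5})$")


def parse_doc_id(doc_id):
    """Split a ClueWeb09-style doc_id into its components, or raise ValueError."""
    match = DOC_ID_RE.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected doc_id format: {doc_id!r}")
    language, segment, file_no, record = match.groups()
    return {
        "language": language,
        "segment": language + segment,
        "file": file_no,
        "record": record,
    }


example_id = "clueweb09-de0000-00-00000"  # illustrative ID, 25 characters long
parts = parse_doc_id(example_id)
print(parts)
```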
Subset of ClueWeb09 with only English-language documents.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  }
}
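With more than 500 million documents in this subset, a full pass over `docs_iter()` is expensive, so it can help to look at only the first few records. A minimal sketch using `itertools.islice` follows; `preview` and `fake_docs` are illustrative names, and the generator stands in for the real iterator.

```python
from itertools import islice


def preview(iterable, n=3):
    """Return the first n items of an iterator; later items are never requested."""
    return list(islice(iterable, n))


def fake_docs():
    """Stand-in for dataset.docs_iter(); yields placeholder IDs indefinitely."""
    i = 0
    while True:
        yield f"doc-{i}"
        i += 1


first = preview(fake_docs(), n=3)
print(first)
```

With the corpus installed, `preview(dataset.docs_iter())` would be the equivalent call.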
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2009.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2009')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 17K | 70.9% | 
| 1 | relevant | 4.8K | 20.5% | 
| 2 | highly relevant | 2.0K | 8.6% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2009.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 23601,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 16743,
          "1": 4832,
          "2": 2026
        }
      }
    }
  }
}
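The 2009 qrels carry two extra fields, `method` and `iprob`. Assuming `iprob` is the probability with which a document was sampled for judging, an inverse-probability (Horvitz-Thompson style) estimate of the number of relevant documents can be sketched as below. The interpretation of `iprob`, the `Prel` stand-in type and all sample values are assumptions, not taken from the source.

```python
from collections import namedtuple

# Local stand-in for the qrel namedtuple; records and method codes are made up.
Prel = namedtuple("Prel", ["query_id", "doc_id", "relevance", "method", "iprob"])

sample_prels = [
    Prel("q1", "doc-a", 1, 1, 1.0),
    Prel("q1", "doc-b", 2, 1, 0.5),
    Prel("q1", "doc-c", 0, 1, 0.25),
    Prel("q1", "doc-d", 1, 1, 0.25),
]


def estimated_relevant(prels, min_relevance=1):
    """Sum 1/iprob over judged-relevant documents (inverse-probability weighting)."""
    total = 0.0
    for prel in prels:
        if prel.relevance >= min_relevance and prel.iprob > 0:
            total += 1.0 / prel.iprob
    return total


estimate = estimated_relevant(sample_prels)
print(estimate)
```

Each relevant document counts for the number of documents it represents in the sample, so a document sampled with probability 0.25 contributes 4 to the estimate.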
The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 21K | 76.8% | 
| 1 | relevant | 6.5K | 23.2% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2009/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2009/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2009.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2009TrecWeb, title={Overview of the TREC 2009 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff}, booktitle={TREC}, year={2009} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 27964,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 21465,
          "1": 6499
        }
      }
    }
  }
}
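Binary per-subtopic judgments like these are commonly used with subtopic recall, the fraction of a query's subtopics covered by the top k results. The sketch below is one straightforward way to compute it; the function name, the ranking and the judgments are hypothetical, and this is not presented as the official track measure.

```python
def subtopic_recall(ranking, relevant_by_subtopic, k):
    """Fraction of subtopics with at least one relevant document in the top k."""
    if not relevant_by_subtopic:
        return 0.0
    top_k = set(ranking[:k])
    covered = sum(1 for docs in relevant_by_subtopic.values() if top_k & docs)
    return covered / len(relevant_by_subtopic)


# Hypothetical judgments for one query: subtopic_id -> relevant doc_ids.
relevant = {"1": {"doc-a", "doc-b"}, "2": {"doc-c"}, "3": {"doc-d"}}
ranking = ["doc-a", "doc-x", "doc-c", "doc-d"]

for k in (2, 3, 4):
    print(k, subtopic_recall(ranking, relevant, k))
```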
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2010.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2010')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.4K | 5.6% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 19K | 73.7% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 4.0K | 15.9% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 1.1K | 4.3% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 138 | 0.5% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2010.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 25329,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 18665,
          "1": 4018,
          "-2": 1431,
          "2": 1077,
          "3": 138
        }
      }
    }
  }
}
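Because the label scale includes -2 (Junk), graded measures need a rule for negative labels. One common choice, assumed here rather than prescribed by the source, is to clamp negative labels to zero gain. The sketch below computes nDCG@k with linear gains under that assumption; the label lists are made up.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks from 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_labels, all_labels, k):
    """nDCG@k where negative labels (e.g. -2 for junk) count as zero gain."""
    gains = [max(label, 0) for label in ranked_labels[:k]]
    ideal = sorted((max(label, 0) for label in all_labels), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


judged = [3, 1, -2, 0]   # hypothetical labels of all judged docs for one query
system = [3, -2, 1]      # hypothetical labels of a system's top 3 results
score = ndcg_at_k(system, judged, k=3)
print(round(score, 4))
```

Other gain functions, such as the exponential 2^label - 1, fit the same structure by changing how `gains` and `ideal` are computed.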
The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 0 | 0.0% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 0 | 0.0% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 9.0K | 100.0% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 0 | 0.0% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 0 | 0.0% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2010/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2010/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2010.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2010TrecWeb, title={Overview of the TREC 2010 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Gordon V. Cormack}, booktitle={TREC}, year={2010} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 9006,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 9006
        }
      }
    }
  }
}
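The metadata shows that all 9006 judgments in this subset have relevance 1, so any other query, subtopic and document combination is simply absent. A lookup therefore needs a default for missing entries. The sketch below treats missing entries as 0, which is a convention assumed here; `SubQrel` and the records are illustrative stand-ins.

```python
from collections import namedtuple

# Local stand-in for the qrel namedtuple; the records are made up.
SubQrel = namedtuple("SubQrel", ["query_id", "doc_id", "relevance", "subtopic_id"])

sample = [
    SubQrel("q1", "doc-a", 1, "1"),
    SubQrel("q1", "doc-b", 1, "2"),
]


def build_lookup(qrels):
    """Index judgments by (query_id, subtopic_id, doc_id)."""
    return {(q.query_id, q.subtopic_id, q.doc_id): q.relevance for q in qrels}


def relevance_of(lookup, query_id, subtopic_id, doc_id, default=0):
    """Return the stored label, or `default` when no judgment is listed."""
    return lookup.get((query_id, subtopic_id, doc_id), default)


lookup = build_lookup(sample)
print(relevance_of(lookup, "q1", "1", "doc-a"))  # listed judgment
print(relevance_of(lookup, "q1", "1", "doc-b"))  # not listed for subtopic 1
```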
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2011.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2011')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 1.0K | 5.3% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 15K | 78.5% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.0K | 10.5% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 711 | 3.7% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 408 | 2.1% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2011.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 19381,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 15205,
          "2": 711,
          "1": 2038,
          "-2": 1019,
          "3": 408
        }
      }
    }
  }
}
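Qrels with the fields `query_id`, `doc_id`, `relevance` and `iteration` map directly onto the classic four-column TREC qrels layout (topic, iteration, document, relevance). The sketch below formats records that way; `Qrel` is a local stand-in for the namedtuple above, and the records and the iteration value "0" are made up.

```python
from collections import namedtuple

# Local stand-in for the qrel namedtuple; the sample records are hypothetical.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def to_trec_line(qrel):
    """Format one judgment as 'topic iteration docno relevance'."""
    return f"{qrel.query_id} {qrel.iteration} {qrel.doc_id} {qrel.relevance}"


sample = [Qrel("q1", "doc-a", 2, "0"), Qrel("q1", "doc-b", -2, "0")]
lines = [to_trec_line(q) for q in sample]
print("\n".join(lines))
```

Note that the column order in the file differs from the field order of the namedtuple: the iteration comes second.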
The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 3.4K | 5.3% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 53K | 81.8% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.5K | 8.4% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 1.8K | 2.8% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 1.1K | 1.7% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 0 | 0.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2011/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2011/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2011.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2011TrecWeb, title={Overview of the TREC 2011 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ian Soboroff and Ellen M. Voorhees}, booktitle={TREC}, year={2011} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 64868,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 53055,
          "2": 1828,
          "1": 5469,
          "-2": 3435,
          "3": 1081
        }
      }
    }
  }
}
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2012.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2012')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 858 | 5.3% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 12K | 72.7% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 2.2K | 13.8% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 405 | 2.5% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 52 | 0.3% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 858 | 5.3% | 
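The graded labels above are often collapsed to a binary judgment before set-based metrics are computed. The following is a minimal sketch of that step, assuming a threshold of 1 (Rel and above count as relevant; Junk and Non do not). The `Qrel` tuple is a stand-in that mirrors the qrel fields listed for this dataset so the snippet runs without the corpus; `relevant_docs`, `min_rel`, and the sample judgments are illustrative names and values, not part of ir_datasets.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Qrel(NamedTuple):
    """Stand-in with the same fields as the qrels of this dataset."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def relevant_docs(qrels: Iterable[Qrel], min_rel: int = 1) -> Dict[str, Set[str]]:
    """Map each query_id to the doc_ids judged at or above min_rel."""
    by_query: Dict[str, Set[str]] = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            by_query[qrel.query_id].add(qrel.doc_id)
    return dict(by_query)


# Hypothetical judgments, for illustration only.
sample_qrels = [
    Qrel("151", "doc-a", -2, "0"),  # Junk
    Qrel("151", "doc-b", 0, "0"),   # Non
    Qrel("151", "doc-c", 1, "0"),   # Rel
    Qrel("152", "doc-d", 4, "0"),   # Nav
]
binary = relevant_docs(sample_qrels)
```

With the real data, `dataset.qrels_iter()` could be passed in place of `sample_qrels`, since the function only reads the `query_id`, `doc_id`, and `relevance` attributes.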
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2012.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 16055,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 858,
          "0": 11674,
          "1": 2208,
          "4": 858,
          "2": 405,
          "3": 52
        }
      }
    }
  }
}
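The metadata block above can be cross-checked against the relevance table: the per-label counts should sum to the qrels count, and each share should match the percentage column. A small sketch, with the counts copied by hand from the block above (the variable names are illustrative):

```python
import json

# Relevance counts copied from the metadata block above.
metadata = json.loads("""
{"qrels": {"count": 16055, "fields": {"relevance": {"counts_by_value":
  {"-2": 858, "0": 11674, "1": 2208, "4": 858, "2": 405, "3": 52}}}}}
""")

qrels_meta = metadata["qrels"]
counts = qrels_meta["fields"]["relevance"]["counts_by_value"]
total = sum(counts.values())

# Percentage of judgments per label, rounded to one decimal place.
shares = {int(label): round(100 * n / total, 1) for label, n in counts.items()}
```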
The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, description, type, subtopics>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity queries
[query_id]    [query]    [description]    [type]    [subtopics]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09/en
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| -2 | Junk: This page does not appear to be useful for any reasonable purpose; it may be spam or junk | 3.4K | 5.4% | 
| 0 | Non: The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query. | 50K | 79.6% | 
| 1 | Rel: The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page. | 5.6K | 8.9% | 
| 2 | HRel: The content of this page provides substantial information on the topic. | 1.2K | 1.9% | 
| 3 | Key: This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine. | 130 | 0.2% | 
| 4 | Nav: This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site. | 2.5K | 4.0% | 
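Because the diversity qrels carry a `subtopic_id`, a common first step is to group the relevant documents by query and subtopic. The sketch below assumes a relevance threshold of 1 and uses a stand-in `DiversityQrel` tuple with the fields listed for this dataset; `docs_by_subtopic` and the sample judgments are hypothetical and only illustrate the grouping.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class DiversityQrel(NamedTuple):
    """Stand-in with the same fields as the diversity qrels."""
    query_id: str
    doc_id: str
    relevance: int
    subtopic_id: str


def docs_by_subtopic(
    qrels: Iterable[DiversityQrel], min_rel: int = 1
) -> Dict[str, Dict[str, Set[str]]]:
    """Return query_id -> subtopic_id -> doc_ids judged at or above min_rel."""
    covered: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            covered[qrel.query_id][qrel.subtopic_id].add(qrel.doc_id)
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {qid: dict(subtopics) for qid, subtopics in covered.items()}


# Hypothetical judgments, for illustration only.
sample_div_qrels = [
    DiversityQrel("151", "doc-a", 1, "1"),
    DiversityQrel("151", "doc-a", 0, "2"),  # not relevant to subtopic 2
    DiversityQrel("151", "doc-b", 2, "2"),
    DiversityQrel("151", "doc-c", -2, "1"),  # Junk
]
coverage = docs_by_subtopic(sample_div_qrels)
```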
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/en/trec-web-2012/diversity")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, subtopic_id>
You can find more details about the Python API here.
ir_datasets export clueweb09/en/trec-web-2012/diversity qrels --format tsv
[query_id]    [doc_id]    [relevance]    [subtopic_id]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.en.trec-web-2012.diversity.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Clarke2012TrecWeb, title={Overview of the TREC 2012 Web Track}, author={Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees}, booktitle={TREC}, year={2012} }
{
  "docs": {
    "count": 503903810,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-en"
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 62394,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-2": 3373,
          "0": 49653,
          "1": 5578,
          "4": 2486,
          "2": 1174,
          "3": 130
        }
      }
    }
  }
}
Subset of ClueWeb09 with only Spanish-language documents.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/es")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/es docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.es')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 79333950,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-es"
      }
    }
  }
}
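The `common_prefix` values reported for the language subsets suggest that a document's language code directly follows `clueweb09-` in its `doc_id`. The sketch below relies on that assumption, which is inferred from the metadata on this page rather than from a documented guarantee; `language_of` is a hypothetical helper and the IDs are made-up examples in the 25-character format.

```python
import re
from typing import Optional

# Assumed layout: "clueweb09-" followed by a two-letter language code.
_DOC_ID_LANG = re.compile(r"^clueweb09-([a-z]{2})")


def language_of(doc_id: str) -> Optional[str]:
    """Return the two-letter code embedded in doc_id, or None if absent."""
    match = _DOC_ID_LANG.match(doc_id)
    return match.group(1) if match else None


# Made-up IDs, for illustration only.
es_lang = language_of("clueweb09-es0000-00-00000")
ko_lang = language_of("clueweb09-ko0001-12-00042")
unknown = language_of("not-a-clueweb-id")
```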
Subset of ClueWeb09 with only French-language documents.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/fr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/fr docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.fr')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 50883172,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-fr"
      }
    }
  }
}
Subset of ClueWeb09 with only Italian-language documents.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/it")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/it docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.it')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 27250729,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-it"
      }
    }
  }
}
Subset of ClueWeb09 with only Japanese-language documents.
Language: ja
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ja")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ja docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ja')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 67337717,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-ja"
      }
    }
  }
}
Subset of ClueWeb09 with only Korean-language documents.
Language: ko
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/ko")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/ko docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.ko')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 18075141,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-ko000"
      }
    }
  }
}
Subset of ClueWeb09 with only Portuguese-language documents.
Language: pt
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/pt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/pt docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.pt')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 37578858,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-pt"
      }
    }
  }
}
TREC 2009 Million Query track.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.clueweb09.trec-mq-2009.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from clueweb09
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.trec-mq-2009')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | not relevant | 26K | 74.1% | 
| 1 | relevant | 5.9K | 17.0% | 
| 2 | highly relevant | 3.1K | 9.0% | 
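The Million Query qrels include `method` and `iprob` fields alongside the label. Assuming `iprob` is the probability with which a document was sampled for judging (an interpretation not stated on this page), one way to use it is an inverse-probability (Horvitz-Thompson style) estimate of how many relevant documents a query has. The sketch below uses a stand-in `MQQrel` tuple; `estimated_relevant` and the sample values are hypothetical, and this is not presented as the track's official evaluation procedure.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class MQQrel(NamedTuple):
    """Stand-in mirroring the qrel fields listed for this dataset."""
    query_id: str
    doc_id: str
    relevance: int
    method: int
    iprob: float


def estimated_relevant(qrels: Iterable[MQQrel], min_rel: int = 1) -> Dict[str, float]:
    """Sum 1/iprob over judged-relevant documents, per query."""
    totals: Dict[str, float] = defaultdict(float)
    for qrel in qrels:
        # Skip non-relevant documents and entries without a usable probability.
        if qrel.relevance >= min_rel and qrel.iprob > 0:
            totals[qrel.query_id] += 1.0 / qrel.iprob
    return dict(totals)


# Hypothetical judgments, for illustration only.
sample_mq_qrels = [
    MQQrel("20001", "doc-a", 1, 1, 0.5),
    MQQrel("20001", "doc-b", 0, 1, 0.25),
    MQQrel("20001", "doc-c", 2, 1, 0.25),
]
estimates = estimated_relevant(sample_mq_qrels)
```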
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/trec-mq-2009")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, method, iprob>
You can find more details about the Python API here.
ir_datasets export clueweb09/trec-mq-2009 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [method]    [iprob]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.clueweb09.trec-mq-2009.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Carterette2009MQ, title={Million Query Track 2009 Overview}, author={Ben Carterette and Virgil Pavlu and Hui Fang and Evangelos Kanoulas}, booktitle={TREC}, year={2009} }
{
  "docs": {
    "count": 1040859705,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-"
      }
    }
  },
  "queries": {
    "count": 40000
  },
  "qrels": {
    "count": 34534,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 25586,
          "1": 5856,
          "2": 3092
        }
      }
    }
  }
}
Subset of ClueWeb09 with only Chinese-language documents.
Language: zh
Examples:
import ir_datasets
dataset = ir_datasets.load("clueweb09/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, date, http_headers, body, body_content_type>
You can find more details about the Python API here.
ir_datasets export clueweb09/zh docs
[doc_id]    [url]    [date]    [http_headers]    [body]    [body_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.clueweb09.zh')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 177489357,
    "fields": {
      "doc_id": {
        "max_len": 25,
        "common_prefix": "clueweb09-zh"
      }
    }
  }
}