Cricca
Cricca datasets
This page contains links to the datasets produced during the research activities of the cricca group at the University of Trento. Follow the “Read more” link for each dataset to learn how to download and cite it.
WikiLinkGraphs
- WikiLinkGraphs' RawWikilinks dataset:
- This dataset contains wikilinks, i.e. links between Wikipedia articles, extracted by processing each revision of each Wikipedia article (namespace 0) from Wikimedia's history dumps for the languages de, en, es, fr, it, nl, pl, ru, sv. Complete list of fields:
  - page_id: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title: a string, the title of the Wikipedia article;
  - revision_id: an integer, the identifier of a revision of the article, also called a permanent id because it can be used to link to that specific revision of a Wikipedia article;
  - revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  - revision_timestamp: date and time of the edit that generated the revision under consideration;
  - user_type: a string ("registered" or "anonymous"), specifying whether the user making the revision was logged in or not;
  - user_username: a string, the username of the user that made the edit that generated the revision under consideration;
  - user_id: an integer, the identifier of the user that made the edit that generated the revision under consideration;
  - revision_minor: a boolean flag, with value 1 if the edit that generated the current revision was marked as minor by the user, 0 otherwise;
  - wikilink.link: a string, the page linked by the wikilink;
  - wikilink.anchor: a string, the anchor text of the wikilink;
  - wikilink.section_name: the name of the section wherein the wikilink appears;
  - wikilink.section_level: the level of the section wherein the wikilink appears;
  - wikilink.section_number: the number of the section wherein the wikilink appears.
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
- Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
- This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: wikilinkgraphs-rawwikilinks
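The fields listed for RawWikilinks can be pictured as a typed record. The following is only a minimal sketch: this page does not specify the on-disk format, so it assumes each row is already available as a mapping from field name to string (for instance from `csv.DictReader`). The `WikilinkRecord` class and the `parse_record` helper are hypothetical names introduced for illustration, and only a subset of the fields is shown.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WikilinkRecord:
    """Hypothetical container for a subset of the RawWikilinks fields."""
    page_id: int
    page_title: str
    revision_id: int
    revision_minor: bool
    link: str          # field wikilink.link
    anchor: str        # field wikilink.anchor
    section_name: str  # field wikilink.section_name


def parse_record(row: Mapping[str, str]) -> WikilinkRecord:
    """Convert a mapping of raw strings (assumed layout) into typed values."""
    return WikilinkRecord(
        page_id=int(row["page_id"]),
        page_title=row["page_title"],
        revision_id=int(row["revision_id"]),
        # revision_minor is described as 1 for minor edits, 0 otherwise.
        revision_minor=row["revision_minor"] == "1",
        link=row["wikilink.link"],
        anchor=row["wikilink.anchor"],
        section_name=row["wikilink.section_name"],
    )
```

Keeping the conversion in one function means that any difference between this assumed layout and the real files only needs to be handled in a single place.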
- WikiLinkGraphs' RawWikilinksSnapshots dataset:
- This dataset contains wikilink snapshots, i.e. links between Wikipedia articles, extracted by processing each revision of each Wikipedia article (namespace 0) from Wikimedia's history dumps for the languages de, en, es, fr, it, nl, pl, ru, sv. The snapshots were taken on March 1st of each year between 2001 and 2018 (inclusive). Complete list of fields:
  - page_id: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title: a string, the title of the Wikipedia article;
  - revision_id: an integer, the identifier of a revision of the article, also called a permanent id because it can be used to link to that specific revision of a Wikipedia article;
  - revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  - revision_timestamp: date and time of the edit that generated the revision under consideration;
  - user_type: a string ("registered" or "anonymous"), specifying whether the user making the revision was logged in or not;
  - user_username: a string, the username of the user that made the edit that generated the revision under consideration;
  - user_id: an integer, the identifier of the user that made the edit that generated the revision under consideration;
  - revision_minor: a boolean flag, with value 1 if the edit that generated the current revision was marked as minor by the user, 0 otherwise;
  - wikilink.link: a string, the page linked by the wikilink;
  - wikilink.anchor: a string, the anchor text of the wikilink;
  - wikilink.section_name: the name of the section wherein the wikilink appears;
  - wikilink.section_level: the level of the section wherein the wikilink appears;
  - wikilink.section_number: the number of the section wherein the wikilink appears;
  - wikilink.is_active: a boolean indicating whether the page targeted by the link existed at that moment.
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
- Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
- This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: wikilinkgraphs-rawwikilinks-snapshots
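As an illustration of the snapshot schedule and of the wikilink.is_active flag, the sketch below enumerates the snapshot dates stated above and filters out links to pages that did not exist at snapshot time. It assumes that records have already been loaded as dictionaries with the flag parsed to a Python bool; the `active_links` helper is a hypothetical name, not part of the dataset tooling.

```python
from datetime import date

# Snapshot dates as described above: March 1st of each year, 2001 to 2018 inclusive.
SNAPSHOT_DATES = [date(year, 3, 1) for year in range(2001, 2019)]


def active_links(records):
    """Keep only links whose target page existed at snapshot time.

    `records` is assumed to be an iterable of dicts in which the
    "wikilink.is_active" value has already been converted to a bool.
    """
    return [record for record in records if record["wikilink.is_active"]]
```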
- WikiLinkGraphs' RevisionLists dataset:
- This dataset contains lists of all revisions for each Wikipedia article (namespace 0) from Wikimedia's history dumps for the languages de, en, es, fr, it, nl, pl, ru, sv. Complete list of fields:
  - page_id: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title: a string, the title of the Wikipedia article;
  - revision_id: an integer, the identifier of a revision of the article, also called a permanent id because it can be used to link to that specific revision of a Wikipedia article;
  - revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  - revision_timestamp: date and time of the edit that generated the revision under consideration;
  - user_type: a string ("registered" or "anonymous"), specifying whether the user making the revision was logged in or not;
  - user_username: a string, the username of the user that made the edit that generated the revision under consideration;
  - user_id: an integer, the identifier of the user that made the edit that generated the revision under consideration;
  - revision_minor: a boolean flag, with value 1 if the edit that generated the current revision was marked as minor by the user, 0 otherwise.
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
- Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
- This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: wikilinkgraphs-revisionlist
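The revision_id field is described as a permanent id that can be used to link to a specific revision. A minimal sketch of how such a link could be built follows; it relies on MediaWiki's `oldid` query parameter, while the `permalink` helper and the `LANGUAGES` tuple are hypothetical names introduced here for illustration.

```python
# Language editions covered by the WikiLinkGraphs datasets listed on this page.
LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "pl", "ru", "sv")


def permalink(language: str, revision_id: int) -> str:
    """Build a URL pointing at one specific revision of a Wikipedia article."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language edition: {language!r}")
    return f"https://{language}.wikipedia.org/w/index.php?oldid={revision_id}"
```

For example, `permalink("it", 123)` returns a URL on it.wikipedia.org carrying `oldid=123`.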
- WikiLinkGraphs' Snapshots dataset:
- This dataset contains snapshots of Wikipedia articles (namespace 0) taken yearly on March 1st, for the years from 2001 to 2018 (inclusive), for the languages de, en, es, fr, it, nl, pl, ru, sv. The dataset has been produced by processing Wikimedia's history dumps. Complete list of fields:
  - page_id: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title: a string, the title of the Wikipedia article;
  - revision_id: an integer, the identifier of a revision of the article, also called a permanent id because it can be used to link to that specific revision of a Wikipedia article;
  - revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  - revision_timestamp: date and time of the edit that generated the revision under consideration.
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
- Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
- This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: wikilinkgraphs-snapshots
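One way to picture what a yearly snapshot represents is to pick, for every page, the latest revision made at or before the snapshot date. The sketch below does this over a revision list. It is an assumption about how a snapshot could be derived, not a description of the authors' pipeline, and the `snapshot_revisions` name and the tuple layout are hypothetical.

```python
from datetime import datetime


def snapshot_revisions(revisions, cutoff):
    """Return {page_id: revision_id} for the latest revision at or before `cutoff`.

    `revisions` is assumed to be an iterable of
    (page_id, revision_id, revision_timestamp) tuples with datetime timestamps.
    Pages with no revision at or before the cutoff are omitted.
    """
    latest = {}
    for page_id, revision_id, timestamp in revisions:
        if timestamp > cutoff:
            continue  # revision made after the snapshot date
        best = latest.get(page_id)
        if best is None or timestamp > best[1]:
            latest[page_id] = (revision_id, timestamp)
    return {page_id: revision_id for page_id, (revision_id, _) in latest.items()}
```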
- WikiLinkGraphs' Redirects dataset:
- This dataset contains snapshots of Wikipedia articles (namespace 0) taken yearly on March 1st, for the years from 2001 to 2018 (inclusive), for the languages de, en, es, fr, it, nl, pl, ru, sv. The dataset has been produced by processing Wikimedia's history dumps. Complete list of fields:
  - page_id: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title: a string, the title of the Wikipedia article;
  - revision_id: an integer, the identifier of a revision of the article, also called a permanent id because it can be used to link to that specific revision of a Wikipedia article;
  - revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  - revision_timestamp: date and time of the edit that generated the revision under consideration;
  - user_type: a string ("registered" or "anonymous"), specifying whether the user making the revision was logged in or not;
  - revision_minor: a boolean flag, with value 1 if the edit that generated the current revision was marked as minor by the user, 0 otherwise;
  - wikilink.section_number: the number of the section wherein the wikilink appears;
  - wikilink.is_active: a boolean indicating whether the page targeted by the link existed at that moment;
  - redirect.target: a string, the page to which the redirect points;
  - redirect.tosection: a string, the anchor text of the wikilink.
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
- Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
- This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: wikilinkgraphs-redirects
- WikiLinkGraphs' ResolvedRedirects dataset:
- This dataset contains snapshots of Wikipedia articles (namespace 0) taken yearly on March 1st, for the years from 2001 to 2018 (inclusive), for the languages de, en, es, fr, it, nl, pl, ru, sv. The dataset has been produced by processing Wikimedia's history dumps. Complete list of fields:
  - page_id: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title: a string, the title of the Wikipedia article;
  - revision_id: an integer, the identifier of a revision of the article, also called a permanent id because it can be used to link to that specific revision of a Wikipedia article;
  - revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  - revision_timestamp: date and time of the edit that generated the revision under consideration;
  - redirect_id: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - redirect_title: a string, the title of the Wikipedia article;
  - redirect_revision_id: an integer, the identifier of a revision of the article, also called a permanent id because it can be used to link to that specific revision of a Wikipedia article;
  - redirect_revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  - redirect_revision_timestamp: date and time of the edit that generated the revision under consideration.
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
- Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
- This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: wikilinkgraphs-resolved-redirects
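Resolving a redirect means following it, possibly through several hops, until a page that is not itself a redirect is reached. This page does not describe the procedure used to build the dataset, so the following is only a sketch of the general idea: the `resolve_redirect` helper, the title-to-target dictionary, and the `max_hops` limit are all assumptions made for illustration.

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirects from `title` until a non-redirect page is reached.

    `redirects` is assumed to map a redirect page title to its target title.
    Returns the final title, or None when a cycle is detected or the chain
    needs more than `max_hops` hops.
    """
    seen = {title}
    for _ in range(max_hops + 1):
        target = redirects.get(title)
        if target is None:
            return title  # not a redirect: this is the resolved page
        if target in seen:
            return None   # redirect cycle
        seen.add(target)
        title = target
    return None           # chain longer than max_hops
```

Tracking the visited titles guards against redirect loops, which would otherwise make the lookup run forever.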
- WikiLinkGraphs dataset:
- This dataset contains snapshots of Wikipedia articles (namespace 0) taken yearly on March 1st, for the years from 2001 to 2018 (inclusive), for the languages de, en, es, fr, it, nl, pl, ru, sv. The dataset has been produced by processing Wikimedia's history dumps. Complete list of fields:
  - page_id_from: an integer, the page identifier (used by MediaWiki) of the source article. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title_from: a string, the title of the source Wikipedia article;
  - page_id: an integer, the page identifier (used by MediaWiki) of the target page. This identifier is not necessarily progressive: there may be gaps in the enumeration;
  - page_title: a string, the title of the target Wikipedia article.
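Each record of the WikiLinkGraphs dataset describes one directed edge, from the source page (page_id_from) to the target page (page_id). A minimal sketch of turning such edges into an adjacency structure follows; it assumes the edges have already been read as pairs of integers, and `build_graph` is a hypothetical helper name.

```python
from collections import defaultdict


def build_graph(edges):
    """Build successor sets and in-degree counts from directed edges.

    `edges` is assumed to be an iterable of (page_id_from, page_id) pairs.
    Repeated pairs are counted once.
    """
    successors = defaultdict(set)
    in_degree = defaultdict(int)
    for source, target in edges:
        if target not in successors[source]:
            successors[source].add(target)
            in_degree[target] += 1
    return successors, in_degree
```

Because the page identifiers may have gaps, dictionaries keyed by id are used here instead of arrays indexed by id.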
Wikipedia's pagecounts
- Wikipedia pagecounts-raw sorted by page (years 2007-2016):
- This dataset consists of hourly pagecounts for Wikipedia pages sorted by article, that is, ordered by (project, page, timestamp). It has been created by processing Wikimedia’s pagecounts-raw dataset. Read more...
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni and Alberto Montresor, from the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy. This research has been supported by Microsoft Azure Research Award CRM:0518942 as part of the “Azure for Research Award: Data Science” program. This dataset has also been utilized in the research related to the ENGINEROOM project, in collaboration with David Laniado of Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: pagecounts-raw-sorted
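Because the records are ordered by (project, page, timestamp), all the hourly counts of one page are contiguous, so per-page aggregates can be computed in a single streaming pass without loading the whole dataset in memory. The sketch below illustrates the idea; the row layout (project, page, timestamp, count) and the `total_views_per_page` name are assumptions, since the file format is not described on this page.

```python
from itertools import groupby
from operator import itemgetter


def total_views_per_page(rows):
    """Yield (project, page, total) from rows sorted by (project, page, timestamp).

    `rows` is assumed to be an iterable of (project, page, timestamp, count)
    tuples. The result is only correct if the input is already sorted,
    because groupby merges adjacent rows only.
    """
    for (project, page), group in groupby(rows, key=itemgetter(0, 1)):
        yield project, page, sum(count for _, _, _, count in group)
```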
- Wikipedia pagecounts-ez (2007-12-09 – 2011-11-15):
- This dataset provides the pageview data of Wikimedia projects in a compressed format. It has been created by processing Wikimedia's pagecounts-raw dataset. Read more...
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni and Alberto Montresor, from the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy. This research has been supported by Microsoft Azure Research Award CRM:0518942 as part of the “Azure for Research Award: Data Science” program. This dataset has also been utilized in the research related to the ENGINEROOM project, in collaboration with David Laniado of Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: pagecounts-ez
- Wikipedia pagecounts-all-sites sorted by page (years 2014 – 2016):
- This dataset consists of hourly pagecounts for Wikipedia pages sorted by article, that is, ordered by (project, page, timestamp). It has been created by processing Wikimedia’s pagecounts-all-sites dataset. Read more...
- Read the instructions to download the data using HTTP(S) (recommended) or dat (experimental).
- Author info:
- This dataset has been produced by Cristian Consonni and Alberto Montresor, from the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy. This research has been supported by Microsoft Azure Research Award CRM:0518942 as part of the “Azure for Research Award: Data Science” program. This dataset has also been utilized in the research related to the ENGINEROOM project, in collaboration with David Laniado of Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
- Go to the dataset: pagecounts-all-sites-sorted