NTCIR Project
NTCIR-10 Crosslink-2
CEJK XML Corpora and Topics


NTCIR-10 Crosslink-2 CEJK XML Corpora

The NTCIR-10 Crosslink-2 CEJK XML Corpora are tagged versions of English, Chinese, Japanese, and Korean article collections converted from Wikipedia by the YAWN [1] system. They are intended to provide a standard document set for research on automated cross-lingual link discovery.

[1] Schenkel, R., F. Suchanek, and G. Kasneci, "YAWN: A Semantically Annotated Wikipedia XML Corpus." In Proceedings of BTW'2007, 2007.

NTCIR has made the Corpora and Topics publicly available under the Creative Commons Attribution-Share-Alike License 3.0 (Unported). Users are advised to read Wikipedia's copyright policy carefully to ensure proper usage.

Download

- CEJK XML Corpora

Chinese Files (3.7GB):
zhwiki_xml_pages.tar.gz

English Files (33GB):
enwiki_xml_pages0.tar.gz
enwiki_xml_pages1.tar.gz
enwiki_xml_pages2.tar.gz
enwiki_xml_pages3.tar.gz
enwiki_xml_pages4.tar.gz
enwiki_xml_pages5.tar.gz
enwiki_xml_pages6.tar.gz
enwiki_xml_pages7.tar.gz
enwiki_xml_pages8.tar.gz
enwiki_xml_pages9.tar.gz

Japanese Files (11GB):
jawiki_xml_pages.tar.gz

Korean Files (2.7GB):
kowiki_xml_pages.tar.gz

- Topics
Chinese (25 topics): zh-10crosslink-topics.zip
English (25 topics): en-10crosslink-topics.zip
Japanese (25 topics): ja-10crosslink-topics.zip
Korean (25 topics): ko-10crosslink-topics.zip
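
Because the corpus archives are several gigabytes each, it can be convenient to read them as a stream instead of unpacking them first. The following is a minimal Python sketch of that approach, not an official NTCIR tool. The internal layout of the archives is not described on this page, so the sketch assumes that article files are regular archive members whose names end in ".xml"; the function name iter_xml_members is hypothetical.

```python
import sys
import tarfile
from typing import Iterator, Tuple


def iter_xml_members(archive_path: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, raw bytes) for each .xml file in a .tar.gz archive.

    Assumption: article files are regular members ending in ".xml".
    """
    # "r|gz" opens the archive as a forward-only gzip stream, so members are
    # read one at a time and nothing is unpacked to disk.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".xml"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            yield member.name, handle.read()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python read_corpus.py kowiki_xml_pages.tar.gz
    for name, data in iter_xml_members(sys.argv[1]):
        print(name, len(data))
```

Streaming mode only moves forward through the archive, so each member has to be read when it is yielded. For random access to individual articles, extracting the archive to disk is likely the simpler choice.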


License


Use and/or redistribution of the NTCIR-10 Crosslink-2 CEJK XML Corpora and Topics is permitted under the conditions of the Creative Commons Attribution-Share-Alike License 3.0 (Unported).
Details can be found at http://creativecommons.org/licenses/by-sa/3.0/