NTCIR Project
NTCIR-9 CrossLink
CJK XML Corpora and Topics


NTCIR-9 CrossLink CJK XML Corpora

The NTCIR-9 CrossLink CJK XML Corpora are tagged versions of Chinese, Japanese and Korean article collections converted from Wikipedia by a YAWN [1] system. They are intended to provide a standard document set for research on automated cross-lingual link discovery.

[1] Schenkel, R., F. Suchanek, and G. Kasneci, "YAWN: A Semantically Annotated Wikipedia XML Corpus." In Proceedings of BTW'2007, 2007.


NTCIR-9 CrossLink Training and Test Topics

NTCIR has made the Corpora and Topics publicly available under the conditions of the Creative Commons Attribution-Share-Alike License 3.0 (Unported). Users of the Corpora and Topics are advised to read Wikipedia's copyright policy carefully to ensure proper usage.

* To obtain the Qrels, please visit this page.

Download

-CJK XML Corpora
Chinese Files (zh-pages.tar.bz2: 395,702,516 bytes (377.3MB))
Japanese Files (ja-pages.tar.bz2: 1,183,361,653 bytes (1128.5MB))
Korean Files (ko-pages.tar.bz2: 160,246,235 bytes (152.8MB))

-Training and Test Topics (added on May 09, 2012)
Training (en-training-topics.zip: 82,218 bytes)
Test (en-test-topics.zip: 149,421 bytes)
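
The byte counts listed above can serve as a quick integrity check after downloading. The following is a minimal Python sketch, not an official NTCIR tool: it assumes the archives have been saved to the current directory under the file names shown above, and the helper names `size_matches` and `list_members` are hypothetical. It makes no assumption about the directory layout inside the archives; it only lists the first few entries it finds.

```python
import os
import tarfile

# Byte counts copied from the download list above.
EXPECTED_SIZES = {
    "zh-pages.tar.bz2": 395_702_516,
    "ja-pages.tar.bz2": 1_183_361_653,
    "ko-pages.tar.bz2": 160_246_235,
}


def size_matches(path, expected_bytes):
    """Return True if the file at `path` is exactly `expected_bytes` long."""
    return os.path.getsize(path) == expected_bytes


def list_members(path, limit=5):
    """Return the names of the first `limit` regular files in a .tar.bz2 archive."""
    names = []
    with tarfile.open(path, mode="r:bz2") as archive:
        for member in archive:
            if member.isfile():
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names


if __name__ == "__main__":
    for filename, expected in EXPECTED_SIZES.items():
        if not os.path.exists(filename):
            print(f"{filename}: not found in current directory")
            continue
        ok = size_matches(filename, expected)
        print(f"{filename}: size {'OK' if ok else 'MISMATCH'}")
        if ok:
            # Peek at the archive without extracting it to disk.
            print("  first entries:", list_members(filename))
```

A size match only shows that the download is complete; it does not replace a checksum, which this page does not provide.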


License


Use and/or redistribution of the NTCIR-9 CrossLink CJK XML Corpora and Topics is permitted under the conditions of the Creative Commons Attribution-Share-Alike License 3.0 (Unported).
Details can be found at http://creativecommons.org/licenses/by-sa/3.0/