CJK XML Corpora and Topics
NTCIR-9 CrossLink CJK XML Corpora
The NTCIR-9 CrossLink CJK XML Corpora are tagged versions of Chinese, Japanese
and Korean article collections converted from Wikipedia by the YAWN system.
They are intended to provide a standard document set for research on automated
cross-lingual link discovery.
Schenkel, R., F. Suchanek, and G. Kasneci, "YAWN: A Semantically Annotated Wikipedia XML Corpus." In Proceedings of BTW'2007, 2007.
NTCIR-9 CrossLink Training and Test Topics
Only three topics are used for system training in the NTCIR-9 CrossLink task.
A set of 25 articles will be randomly chosen from the English Wikipedia
and used as formal test topics.
NTCIR has made the Corpora and Topics publicly available under the conditions of
the Creative Commons Attribution-Share-Alike License 3.0 (Unported). Users of the Corpora
and Topics are advised to read Wikipedia's copyright policy carefully to ensure proper usage.
* To obtain the Qrels, please visit this page.
-CJK XML Corpora
Chinese Files (zh-pages.tar.bz2: 395,702,516 bytes (377.3MB))
Japanese Files (ja-pages.tar.bz2: 1,183,361,653 bytes (1128.5MB))
Korean Files (ko-pages.tar.bz2: 160,246,235 bytes (152.8MB))
-Training and Test Topics (added on May 09, 2012)
Training (en-training-topics.zip: 82,218 bytes)
Test (en-test-topics.zip: 149,421 bytes)
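The byte counts listed above can serve as a quick check that a download completed. The following is a minimal Python sketch, assuming the files were saved under their listed names in a single directory. The helper names (`to_mib`, `size_matches`, `check_downloads`) are illustrative and not part of any NTCIR tooling. The MB figures in the list appear to use 1 MB = 1,048,576 bytes, which `to_mib` assumes. A matching size only suggests a complete transfer; it does not prove the file is uncorrupted.

```python
import os

# Byte counts copied from the download list above.
EXPECTED_BYTES = {
    "zh-pages.tar.bz2": 395_702_516,
    "ja-pages.tar.bz2": 1_183_361_653,
    "ko-pages.tar.bz2": 160_246_235,
    "en-training-topics.zip": 82_218,
    "en-test-topics.zip": 149_421,
}


def to_mib(num_bytes):
    """Convert a byte count to mebibytes (assumption: 1 MB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


def size_matches(path, expected_bytes):
    """Return True if the file at `path` has exactly `expected_bytes` bytes."""
    return os.path.getsize(path) == expected_bytes


def check_downloads(directory="."):
    """Map each listed file found in `directory` to whether its size matches."""
    results = {}
    for name, expected in EXPECTED_BYTES.items():
        path = os.path.join(directory, name)
        if os.path.exists(path):  # files not yet downloaded are skipped
            results[name] = size_matches(path, expected)
    return results


if __name__ == "__main__":
    for name, ok in check_downloads(".").items():
        status = "size OK" if ok else "size mismatch, consider re-downloading"
        print(f"{name}: {status}")
```

Run the script from the download directory; files that are not present are skipped rather than reported as errors.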
Use and/or redistribution of the NTCIR-9 CrossLink CJK XML Corpora and Topics is permitted under the conditions of the Creative Commons Attribution-Share-Alike License 3.0 (Unported).
Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.