Prepared by Vladimír Benko within the framework of a joint Project of
Main design decisions
- Slovak-Centric (languages spoken and/or taught in Slovakia and its neighbouring countries)
- Latin names denoting language and size
- Crawled by SpiderLing at (approximately) the same time
- Language-independent filtration by the same tools
- Language-dependent filtration by the same methodology
- PoS-tagged by open-source or free tools, with native tagsets mapped to the Araneum Universal Tagset
- Document-level deduplicated: duplicate and near-duplicate documents deleted
- Paragraph and/or sentence-level deduplicated: duplicate and near-duplicate segments marked
- Word sketches with compatible sketch grammars
- Accessible online via web interface (under NoSketch Engine) at unesco.uniba.sk or aranea.juls.savba.sk (no registration required in Guest mode)
- Also hosted (under KonText) at kontext.korpus.cz (free registration required), and (under Sketch Engine) at www.sketchengine.co.uk (paid access, 30-day free trial available)
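The two deduplication levels listed above (whole documents deleted, smaller segments only marked) can be illustrated with a minimal Python sketch. It is not the tool used to build the Aranea corpora, which this page does not name. The function names and the input layout (each document as a list of paragraph strings) are hypothetical. The sketch detects only exact duplicates after case and whitespace normalisation; near-duplicate detection would additionally need some form of n-gram overlap comparison.

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and lower-case so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text):
    """Return a hash of the normalised text (hypothetical helper)."""
    return hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Illustrative two-level deduplication (an assumption, not the Aranea tool).

    documents: list of documents, each a list of paragraph strings.
    Returns the kept documents, each as a list of (paragraph, is_duplicate)
    pairs. Duplicate documents are deleted; duplicate paragraphs are kept
    but marked.
    """
    seen_docs = set()
    seen_paragraphs = set()
    kept = []
    for paragraphs in documents:
        doc_fp = fingerprint("\n".join(paragraphs))
        if doc_fp in seen_docs:
            continue  # document level: drop the whole duplicate document
        seen_docs.add(doc_fp)
        marked = []
        for paragraph in paragraphs:
            par_fp = fingerprint(paragraph)
            # paragraph level: keep the text, only flag repeats
            marked.append((paragraph, par_fp in seen_paragraphs))
            seen_paragraphs.add(par_fp)
        kept.append(marked)
    return kept


docs = [
    ["Hello world.", "Second para."],
    ["hello   WORLD.", "Second para."],  # same document after normalisation
    ["Hello world.", "New text."],       # new document, one repeated paragraph
]
for doc in deduplicate(docs):
    print(doc)
```

In this toy input the second document is removed entirely, while the repeated paragraph in the third document is kept and flagged, mirroring the "deleted" versus "marked" distinction in the list.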
Aranea Corpora available (March 2019)
Credits
If you use the Aranea corpora for research purposes, or need to mention them for any reason,
please cite the following paper(s):
- Benko, Vladimír: Aranea: Yet Another Family of (Comparable) Web Corpora.
In Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (Eds.):
Text, Speech and Dialogue. 17th International Conference,
TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings.
LNCS 8655.
Springer International Publishing Switzerland, 2014. pp. 257-264.
ISBN: 978-3-319-10815-5 (Print), 978-3-319-10816-2 (Online).
- Benko, Vladimír: Compatible Sketch Grammars for Comparable Corpora.
In Andrea Abel, Chiara Vettori, Natascia Ralli (Eds.):
Proceedings of the XVI EURALEX International Congress: The User In Focus. 15–19 July 2014.
Bolzano/Bozen: Eurac Research, 2014. pp. 417-430. ISBN 978-88-88906-97-3.
As well as the paper on the NoSketch Engine:
- Rychlý, Pavel: Manatee/Bonito – A Modular Corpus Manager.
In 1st Workshop on Recent Advances in Slavonic Natural Language Processing.
Brno: Masaryk University, 2007, pp. 65-70. ISBN 978-80-210-4471-5.
Contact
If you need the source corpus data, please send a message to vladimir.benko at uniba.sk