Several standard text collections exist for text categorization, and the Reuters-21578 dataset is one of them. It has been used widely in information retrieval, machine learning, and other corpus-based research. The Reuters-21578 collection is freely available on the Internet. The files are in Standard Generalized Markup Language (SGML) format. SGML, defined by ISO 8879, is a metalanguage for defining markup languages for documents. It is a descendant of IBM's Generalized Markup Language (GML), created in the 1960s. A markup language defined with SGML has a specific vocabulary (elements and attributes) and a declared syntax (a defined grammar). In 1998, the World Wide Web Consortium (W3C) published Extensible Markup Language (XML) as a recommendation for the Internet community. XML is a profile, or subset, of SGML.
XML was designed to describe data and to focus on what the data is. For a number of technical reasons related to SGML, XML has become more widely accepted for serving documents over the web. The "Reuters-21578, Distribution 1.0" corpus consists of stories that appeared on the Reuters newswire in 1987. An earlier version of the corpus, Reuters-22173, was first used in the CONSTRUE text categorization system (Hayes & Weinstein, 1990). The Reuters-21578 version was introduced to fix problems such as duplicated stories and typographical errors. The Java programming language has no built-in API dedicated to parsing SGML files, but it provides several APIs for reading and writing XML. Older Java versions supported only the DOM (Document Object Model) API and the SAX (Simple API for XML) API. DOM can be used to read and write XML files, while SAX is a Java API for sequential reading of XML files. Newer Java versions add further XML-processing features.
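As a rough illustration of the two APIs mentioned above, the sketch below reads story titles with DOM and counts stories with SAX. It assumes the SGML files have already been converted to well-formed XML. The `SAMPLE` string is hand-written for this example rather than copied from the corpus, and the `COLLECTION` wrapper element, the class name, and the method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ReutersXmlSketch {

    // Hand-written sample (assumption): two stories wrapped in a single root
    // element so that the input is well-formed XML.
    static final String SAMPLE =
          "<COLLECTION>"
        + "<REUTERS NEWID=\"1\"><TITLE>First story</TITLE><BODY>Body one.</BODY></REUTERS>"
        + "<REUTERS NEWID=\"2\"><TITLE>Second story</TITLE><BODY>Body two.</BODY></REUTERS>"
        + "</COLLECTION>";

    // DOM: load the whole document into an in-memory tree, then query it.
    static List<String> titlesWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("TITLE");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getTextContent());
        }
        return titles;
    }

    // SAX: read the document sequentially and react to start-element events.
    static int countStoriesWithSax(String xml) throws Exception {
        int[] count = {0}; // array so the anonymous handler can update it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("REUTERS".equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesWithDom(SAMPLE));
        System.out.println(countStoriesWithSax(SAMPLE));
    }
}
```

Running `main` prints the two sample titles followed by the story count. DOM is convenient when the whole file fits in memory and needs to be queried or modified, while SAX is likely the better fit for scanning large files once from start to end.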
Contributions are always welcome!
See contributing.md for ways to get started.
Please adhere to this project's code of conduct.
I'm a Java developer. I graduated in 2021 and then worked for one year at Neptune Company. Since then, I have continued to work independently 🦾🔥 on my own projects.