The Reuters-21578 dataset in JSON and SGM formats, together with the conversion script.
The conversion script uses BeautifulSoup to parse the SGML files:
pip install BeautifulSoup
The entire original data can be found in other-files and sgm-data. You can find the original archive on archive.ics.uci.edu.
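For orientation, here is a minimal sketch of what an SGM-to-JSON conversion can look like. It is not the repository's script: it swaps BeautifulSoup for the standard-library `re` module so that it runs without extra installs, and it assumes the usual Reuters-21578 markup (`<REUTERS ... NEWID="...">` records containing `<TITLE>`, `<BODY>`, and `<TOPICS><D>...</D></TOPICS>`). The function names and the output field names (`id`, `split`, `title`, `body`, `topics`) are hypothetical and may not match the JSON files shipped here.

```python
import json
import re

# Hypothetical helpers for illustration; the actual conversion script may differ.
REUTERS_RE = re.compile(
    r"<REUTERS(?P<attrs>[^>]*)>(?P<body>.*?)</REUTERS>", re.DOTALL
)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def first_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag>, or None."""
    match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), text, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_sgm(sgm_text):
    """Turn the text of one .sgm file into a list of plain dicts."""
    docs = []
    for match in REUTERS_RE.finditer(sgm_text):
        attrs = dict(ATTR_RE.findall(match.group("attrs")))
        block = match.group("body")
        topics = first_tag(block, "TOPICS") or ""
        docs.append({
            "id": attrs.get("NEWID"),
            "split": attrs.get("LEWISSPLIT"),
            "title": first_tag(block, "TITLE"),
            "body": first_tag(block, "BODY"),
            # Each topic label sits in its own <D>...</D> element.
            "topics": re.findall(r"<D>(.*?)</D>", topics),
        })
    return docs


# A shortened record in the Reuters-21578 layout, used as sample input.
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week.</BODY>
</TEXT>
</REUTERS>"""

if __name__ == "__main__":
    print(json.dumps(parse_sgm(SAMPLE), indent=2))
```

To run this on a real file, read it into a string and pass it to `parse_sgm`. The .sgm files are not guaranteed to be valid UTF-8, so opening them with `encoding="latin-1"` is a cautious choice.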