CNN provides transcripts for its shows at http://edition.cnn.com/TRANSCRIPTS/.
The transcripts are available for shows starting 1999/10/01 (see http://edition.cnn.com/TRANSCRIPTS/1999.10.01.html). However, following the links for dates through 1999/12/31 returns a 'Page not found' error, so we started scraping the data from 2000/01/01.
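As a rough illustration of the date-based URL pattern above, the daily index pages can be enumerated as follows. This is a minimal sketch, not the scrapers' actual code: the helper names `index_url` and `daily_index_urls` are hypothetical, and the `YYYY.MM.DD.html` pattern is inferred from the single example URL given above.

```python
from datetime import date, timedelta

# Base URL of the CNN transcript archive, as given above.
BASE_URL = "http://edition.cnn.com/TRANSCRIPTS/"


def index_url(day):
    """Build the daily index URL for a date (hypothetical helper).

    Assumes every day follows the YYYY.MM.DD.html pattern seen in the
    1999.10.01 example link.
    """
    return f"{BASE_URL}{day:%Y.%m.%d}.html"


def daily_index_urls(start, end):
    """Yield one index URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield index_url(day)
        day += timedelta(days=1)


if __name__ == "__main__":
    # First three days of the scraped range.
    for url in daily_index_urls(date(2000, 1, 1), date(2000, 1, 3)):
        print(url)
```

The sketch only builds URLs; fetching and parsing each page is left to the scrapers.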
CNN changed the HTML style of its news transcripts a few times between 2000/01/01 and 2014, so there are two scrapers to parse the different HTML styles.
The parsed data are posted at http://dx.doi.org/10.7910/DVN/ISDPJU. For copyright reasons, access is restricted to research purposes only. The data are split into 7 files:
- cnn-1.csv. Data from 2000/01/01--2000/04/20. No. of transcripts = 7017
- cnn-2.csv. Data from 2000/04/21--2001/04/03. No. of transcripts = 21381
- cnn-3.csv. Data from 2001/04/04--2002/08/06. No. of transcripts = 35269
- cnn-4.csv. Data from 2002/08/07--2002/09/16. No. of transcripts = 2343
- cnn-5.csv. Data from 2002/09/17--2012/05/18. No. of transcripts = 101336
- cnn-6.csv. Data from 2012/05/19--2014/06/17. No. of transcripts = 23536
- cnn-7.csv. Data from 2014/06/18--2022/02/05. No. of transcripts = 102458

Total number of transcripts: 293,340
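
The file boundaries listed above can be turned into a small lookup that finds the file covering a given date and re-checks the total. This is only a sketch: the `FILES` table copies the date ranges and counts from the list, and the `file_for` helper is a hypothetical name, not part of the released data or scrapers.

```python
from datetime import date

# (file name, first date, last date, number of transcripts), copied from the list above.
FILES = [
    ("cnn-1.csv", date(2000, 1, 1), date(2000, 4, 20), 7017),
    ("cnn-2.csv", date(2000, 4, 21), date(2001, 4, 3), 21381),
    ("cnn-3.csv", date(2001, 4, 4), date(2002, 8, 6), 35269),
    ("cnn-4.csv", date(2002, 8, 7), date(2002, 9, 16), 2343),
    ("cnn-5.csv", date(2002, 9, 17), date(2012, 5, 18), 101336),
    ("cnn-6.csv", date(2012, 5, 19), date(2014, 6, 17), 23536),
    ("cnn-7.csv", date(2014, 6, 18), date(2022, 2, 5), 102458),
]


def file_for(day):
    """Return the name of the file whose date range covers `day`, or None."""
    for name, start, end, _count in FILES:
        if start <= day <= end:
            return name
    return None


# Sum of the per-file counts; should match the stated total.
total = sum(entry[3] for entry in FILES)

if __name__ == "__main__":
    print(file_for(date(2002, 8, 7)))
    print(total)
```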
- 2000-04-21 New format error
- 2000-04-22 content within and tag
- 2001-04-04 No URL prefix, subheader ==> h4, content next table tag
- Scripts from 2014