GitHub - pard187/COVID19_Tweets_Dataset: COVID-19 Tweets Dataset

Data Organization
Data Statistics
Hydrating Tweets
- Using our TWARC Notebook
Inquiries
Licensing
References

The repository contains an ongoing collection of tweets associated with the novel coronavirus COVID-19 since January 22nd, 2020.

As of 11/14/2020, a total of 1,027,974,005 tweets had been collected. The tweets are collected using Twitter’s trending topics and selected keywords. In addition, the tweets from Chen et al. (2020) were used to supplement the dataset by hydrating non-duplicated tweets.

Citation

Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1

Christian Lopez, Malolan Vasu, and Caleb Gallemore (2020) Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset. arXiv:cs.SI/2003.10359,2020 https://arxiv.org/abs/2003.10359

Data Organization

The dataset is organized by table, by month, and by hour (UTC). The description of all the features in all five tables is provided below. For example, the path “./Summary_Details/2020_01/2020_01_22_00_Summary_Details.csv” contains all the summary details of the tweets collected on January 22nd, 2020 at 00:00 UTC.
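The naming pattern in the example path can be sketched as a small helper. This is a minimal sketch, assuming every table follows the same pattern as the Summary_Details example; the function name `hourly_csv_path` and the `root` parameter are hypothetical and not part of the repository.

```python
from datetime import datetime, timezone

# Table folder names as they appear in the repository.
TABLES = {
    "Summary_Details",
    "Summary_Hashtag",
    "Summary_Mentions",
    "Summary_Sentiment",
    "Summary_NER",
}


def hourly_csv_path(table: str, when: datetime, root: str = ".") -> str:
    """Build the relative path of one hourly CSV file.

    Assumed pattern (taken from the Summary_Details example):
    <root>/<table>/<YYYY_MM>/<YYYY_MM_DD_HH>_<table>.csv
    Naive datetimes are treated as already being in UTC.
    """
    if table not in TABLES:
        raise ValueError(f"unknown table: {table}")
    if when.tzinfo is not None:
        # Convert timezone-aware datetimes to UTC before formatting.
        when = when.astimezone(timezone.utc)
    month_dir = when.strftime("%Y_%m")
    stamp = when.strftime("%Y_%m_%d_%H")
    return f"{root}/{table}/{month_dir}/{stamp}_{table}.csv"


print(hourly_csv_path("Summary_Details", datetime(2020, 1, 22, 0)))
```

For January 22nd, 2020 at 00:00 UTC this reproduces the example path given above.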

Features Description

Table	Feature Name	Description
Primary key	Tweet\_ID	Integer representation of the tweet's unique identifier
1.Summary\_Details	Language	When present, indicates a BCP47 language identifier corresponding to the machine-detected language of the Tweet text
	Geolocation\_cordinate	Indicates whether or not the geographic location of the tweet was reported
	RT	Indicates if the tweet is a retweet (YES) or original tweet (NO)
	Likes	Number of likes for the tweet
	Retweets	Number of times the tweet was retweeted
	Country	When present, indicates a list of uppercase two-letter country codes from which the tweet comes
	Date\_Created	UTC date and time the tweet was created
2.Summary\_Hashtag	Hashtag	Hashtag (\#) present in the tweet
3.Summary\_Mentions	Mentions	Mention (@) present in the tweet
4.Summary\_Sentiment	Sentiment\_Label	Most probable tweet sentiment (neutral, positive, negative)
	Logits\_Neutral	Non-normalized prediction for neutral sentiment
	Logits\_Positive	Non-normalized prediction for positive sentiment
	Logits\_Negative	Non-normalized prediction for negative sentiment
5.Summary\_NER	NER\_text	Text stating a named entity recognized by the NER algorithm
	Start\_Pos	Initial character position within the tweet of the NER\_text
	End\_Pos	End character position within the tweet of the NER\_text
	NER\_Label Prob	Label and probability of the named entity recognized by the NER algorithm
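Because Tweet_ID is the primary key shared by all five tables, rows from different tables can be combined on it. Below is a minimal sketch using only the standard library. It assumes the CSV headers match the feature names above and that Summary_Hashtag holds one row per hashtag; the sample rows are made up for illustration and are not real data.

```python
import csv
import io


def index_by_tweet_id(csv_text: str) -> dict:
    """Group the rows of one table by their Tweet_ID value."""
    rows_by_id = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows_by_id.setdefault(row["Tweet_ID"], []).append(row)
    return rows_by_id


# Tiny made-up samples (not real data) using the documented column names.
details_csv = "Tweet_ID,Language,RT\n1,en,NO\n2,es,YES\n"
hashtag_csv = "Tweet_ID,Hashtag\n1,#covid19\n1,#stayhome\n"

details = index_by_tweet_id(details_csv)
hashtags = index_by_tweet_id(hashtag_csv)

# Attach the hashtags of each tweet to its Summary_Details row.
joined = {
    tweet_id: {
        **rows[0],
        "Hashtags": [h["Hashtag"] for h in hashtags.get(tweet_id, [])],
    }
    for tweet_id, rows in details.items()
}

print(joined["1"]["Hashtags"])
```

Tweet IDs are kept as strings here to avoid any loss of precision if the data is later passed to tools that store large integers as floats.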

For more information, see the Twitter API and the documentation for the API Tweet object.

Data Statistics

General Statistics

As of 11/14/2020:

Total Number of tweets: 1,027,974,005

Average daily number of tweets: 140,124

Summary Statistics per Month

Month	Daily Avg. Original	Daily Avg. Retweets	Daily Avg. Tweets	Total of Original	Total of Retweets	Total of Tweets	Total with Geolocation	Max No. Retweets	Max No. Likes
1	5,947	30,576	35,501	1,958,346	7,852,504	9,810,850	1,773	674,151	334,802
2	10,978	29,918	40,604	7,624,648	21,944,443	29,568,948	8,103	469,739	637,589
3	13,095	44,714	56,283	12,610,824	46,659,589	59,270,412	19,952	1,064,693	1,255,858
4	30,091	89,513	119,859	20,591,357	60,301,889	80,893,244	38,213	649,823	662,005
5	35,163	99,928	135,709	26,258,213	73,618,083	99,876,289	47,684	1,007,616	929,811
6	51,033	142,569	193,096	34,786,076	95,171,388	129,957,461	58,138	790,652	882,693
7	53,720	155,042	209,738	39,611,015	111,876,344	151,487,359	56,808	615,768	1,287,117
8	51,330	143,551	195,142	37,596,182	103,098,588	140,694,770	55,837	2,183,434	860,162
9	50,068	132,040	182,947	35,861,979	92,957,247	128,819,226	32,381	1,925,489	839,689
10	54,716	137,722	200,741	39,945,510	102,236,659	141,886,653	318,121	946,810	785,385
11	61,923	108,904	171,746	20,175,866	35,532,927	55,708,793	12,820	577,095	619,643
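As a quick consistency check, the "Total with Geolocation" column can be summed and compared with the overall geolocation count reported in this section (649,830). The values below are copied from the table; the variable names are only illustrative.

```python
# "Total with Geolocation" per month, copied from the table above.
geolocated_per_month = {
    1: 1_773,
    2: 8_103,
    3: 19_952,
    4: 38_213,
    5: 47_684,
    6: 58_138,
    7: 56_808,
    8: 55_837,
    9: 32_381,
    10: 318_121,
    11: 12_820,
}

total_geolocated = sum(geolocated_per_month.values())
print(f"{total_geolocated:,}")  # 649,830
```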

There is a total of 649,830 tweets with geolocation information, which are shown on a map below:

Language Statistics

Tweets Language Summary

Languages	Total No. Tweets	Percentage of Tweets
English	701,293,369	68.24
Spanish; Castilian	129,562,757	12.61
Portuguese	38,128,984	3.71
Bahasa	29,024,708	2.82
French	26,766,298	2.60
Others	102,902,217	10.01

Sentiment Analysis

The sentiment of all the English tweets was estimated using a state-of-the-art Twitter sentiment algorithm, BB_twtr (see code here).
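The Summary_Sentiment table stores non-normalized predictions (logits) for each class. If probabilities are needed, one common option is to apply a softmax to the three logits. This is a sketch under the assumption that the logits are the classifier's pre-softmax outputs, which the dataset description does not state explicitly; the helper names and the example logits are made up.

```python
import math

LABELS = ("neutral", "positive", "negative")


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_probabilities(neutral, positive, negative):
    """Map Logits_Neutral, Logits_Positive, Logits_Negative to probabilities."""
    return dict(zip(LABELS, softmax([neutral, positive, negative])))


# Made-up logits for illustration only.
probs = sentiment_probabilities(0.2, 1.5, -0.7)
best_label = max(probs, key=probs.get)
print(best_label, round(probs[best_label], 3))
```

Under this assumption, the label with the largest probability should agree with Sentiment_Label, since softmax preserves the ordering of the logits.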

Named Entity Recognition, Mentions, and Hashtags

The Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 NER entities, Mentions (@), and Hashtags (#).

Top 5 Mentions, Hashtags, and NER

Mentions	Hashtags	NER Person	NER Location	NER Organization	NER Miscellaneous
@realDonaldTrump	\#covid19	trump	china	cdc	coronavirus
14,257,497	63,204,611	8,655,311	13,136,868	2,008,315	7,398,967
@realdonaldtrump	\#coronavirus	donald trump	us	trump	covid-19
4,506,178	34,454,147	941,112	7,008,033	1,026,534	7,002,960
@joebiden	\#covid	fauci	uk	nhs	chinese
2,082,192	6,139,596	662,128	2,453,384	477,361	3,375,965
@JoeBiden	\#covid\_19	god	wuhan	cnn	americans
1,961,173	2,264,110	560,171	2,427,141	343,216	2,928,792
@narendramodi	\#stayhome	obama	india	icu	covid19
1,064,830	1,291,185	336,070	1,861,257	266,064	718,674

Data Collection Process Inconsistencies

Only tweets in English were collected from 22 January to 31 January 2020. After that period, the algorithm collected tweets in all languages.

There are also some known gaps in the data, listed below:

Known gaps

Date	Time
2020-08-06	07:00 UTC
2020-08-08	07:00 UTC
2020-08-09	07:00 UTC
2020-08-14	07:00 UTC
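When iterating over the hourly files, the known gaps can be skipped up front. The sketch below assumes each entry in the table means that the single hourly file starting at that UTC time is missing; the source does not say how long each gap lasts. The names `KNOWN_GAPS` and `available_hours` are hypothetical.

```python
from datetime import datetime, timedelta

# Known gaps from the table above (UTC, naive datetimes).
KNOWN_GAPS = {
    datetime(2020, 8, 6, 7),
    datetime(2020, 8, 8, 7),
    datetime(2020, 8, 9, 7),
    datetime(2020, 8, 14, 7),
}


def available_hours(start: datetime, end: datetime):
    """Yield every hour from start (rounded down) up to, but excluding, end,
    skipping the hours listed as known gaps."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        if current not in KNOWN_GAPS:
            yield current
        current += timedelta(hours=1)


hours = list(available_hours(datetime(2020, 8, 6, 5), datetime(2020, 8, 6, 10)))
print([h.strftime("%Y_%m_%d_%H") for h in hours])
```

In this example the window covers 05:00 to 09:00 UTC on 2020-08-06, and the 07:00 hour is left out.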

Hydrating Tweets

Using our TWARC Notebook

The notebook Automatically_Hydrate_TweetsIDs_COVID19_v2.ipynb allows you to automatically hydrate the tweet IDs from our COVID19_Tweets_Dataset GitHub repository.

You can run this notebook directly in the cloud using Google Colab (see the how-to tutorials) and Google Drive.

In order to hydrate the tweet-IDs using TWARC you need to create a Twitter Developer Account.

The Twitter API’s rate limits make it difficult to fetch data from tweet IDs. We therefore recommend using Hydrator to convert the list of tweet IDs into a CSV file containing all data and metadata relating to the tweets. Hydrator also manages the Twitter API rate limits for you.

For those who prefer a command-line interface over a GUI, we recommend using Twarc.
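Both tools take a plain list of tweet IDs as input. The following is a minimal sketch of how such a list could be prepared from a Summary_Details file, optionally keeping only one language or only original tweets. It assumes the CSV headers match the feature names (Tweet_ID, Language, RT) and that Language uses codes such as "en"; the function name and the sample rows are made up for illustration.

```python
import csv
import io


def tweet_ids_for_hydration(csv_text, language=None, originals_only=False):
    """Return the Tweet_ID values of the rows that pass the optional filters."""
    ids = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if language is not None and row.get("Language") != language:
            continue
        if originals_only and row.get("RT") != "NO":
            continue
        ids.append(row["Tweet_ID"])
    return ids


# Made-up sample rows (not real data).
sample = "Tweet_ID,Language,RT\n111,en,NO\n222,es,NO\n333,en,YES\n"

ids = tweet_ids_for_hydration(sample, language="en", originals_only=True)

# One ID per line, ready to be saved as a text file.
id_file_body = "\n".join(ids) + "\n"
print(id_file_body, end="")
```

Saving `id_file_body` to a text file gives a one-ID-per-line list that can then be hydrated, for example with twarc's hydrate command or by loading it into Hydrator.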

Using Hydrator

Follow the instructions in the Hydrator GitHub repository.

Using Twarc

Follow the instructions in the Twarc GitHub repository.

Inquiries

For questions about the dataset, please contact Dr. Christian Lopez at lopezbec@lafayette.edu, Dr. Caleb Gallemore at gallemoc@lafayette.edu, or Malolan Vasu at vasum@lafayette.edu.

Licensing

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the work listed in the Citation section above.

References

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020

https://github.com/echen102/COVID-19-TweetIDs

