Abstract

The rapid growth of online information sharing raises many concerns for researchers working in computational social science. Online social network platforms have become a tool for sharing propagandistic information, which is now used as a lethal weapon to destabilize democracies and to distort political and religious events. COVID-19 affected almost every corner of the world, and many propagandistic tweets were shared on Twitter during its peak. In this paper, an improved artificial neural network algorithm is proposed to classify tweets as propagandistic or nonpropagandistic. The data are extracted using multiple ambiguous hashtags and manually annotated into two classes. Hybrid feature engineering is performed by combining “Term Frequency (TF)/Inverse Document Frequency (IDF),” “Bag of Words,” and Tweet Length. The proposed algorithm is compared with logistic regression, support vector machine, and multinomial Naive Bayes. The results show that the improved artificial neural network outperforms the other machine learning algorithms, with 77.15% accuracy, 77% recall, and 79% precision. In future, deep learning approaches such as LSTM may be used for this classification task.

1. Introduction

The 21st century has been marked by disinformation in the form of false news, fuelled by technologies such as social media networks that allow information to spread rapidly and to target individual views, prejudices, and emotions. Culture is one aspect that may be of significance. Cultures differ in many ways, including individualism vs. collectivism, power distance (the distance between those at the top of the social hierarchy and those at the bottom), masculinity vs. femininity, and avoidance of ambiguity [1]. It is fair to assume that different kinds of fake news spread more easily in different cultures. For example, in a highly collectivist society, a fake news story about a well-known person transgressing a social standard is likely to spread faster than in an individualistic community. If this is indeed the case, the question arises as to whether the writers of fake news target particular communities with the specific kinds of fake news that they think are optimal for that culture [2]. The use of heuristics in decision-making is an essential factor that promotes fake news. Many distinct heuristics have been identified, such as the secrecy heuristic, which leads us to believe that information presented as leaked or confidential is more likely to be true. The relationship between culture and heuristics is under-researched, but a link is plausible. For instance, in cultures with a high power distance, the secrecy heuristic may be more pronounced, as people are less used to learning details about those at the top of the power hierarchy [3]. In sum, whether news is fake or real, we should examine the issue in depth before believing or acting on it.

Online social networks have evolved into a communication tool and play a vital role in modern life. They have made communication so easy that a person can reach people in every corner of the world, and posts on these platforms spread worldwide almost instantly. Alongside their numerous benefits, these platforms have limitations. Some users exploit them to spread hate and offensive content. The offensive information that is shared or posted can be categorized into two classes: misinformation and disinformation. Misinformation is information shared by a person who does not know whether it is true, and it can harm society in many ways [4]. Misinformation can concern various topics such as elections, trending events, and religion. Disinformation, by contrast, is information posted or shared intentionally: the user knows that he or she is sharing false information. Such posts are very harmful to a whole society or country. Recently, during the COVID-19 pandemic, a large amount of misinformation was shared on social media platforms. Misinformation related to vaccines was posted by fearmongers, posing a threat to the lives of people across the globe [5].

Another type of information shared on social media is the propagandistic post. Propaganda is used as a weapon for manipulating sentiments by shaping a person’s behaviour [6]. The main aim of spreading propaganda is to create confusion in the mind of an ordinary person. Such posts can be graphical, textual, or multimedia. Propagandistic information typically relates to politics, religion, sectarian discussion, and celebrities [7]. These posts are deliberately crafted so that a user will not recognize the message as propagandistic. In some democratic states, propaganda is also spread through traditional media such as news channels, as political parties sponsor various news channels. These channels are used to spread hate and propagandistic news about world events. As people have become aware of these paid channels, they have shifted from traditional media to social media as their source of knowledge. However, these platforms carry even more propagandistic messages than traditional media. To the best of our knowledge, no system has yet been built to identify in real time whether a post is propagandistic or nonpropagandistic. The modern age is the age of artificial intelligence (AI), and AI has shown strong results in many fields, including medicine, education, and robotics [8, 9]. Machine learning is now widely used for analysing data and has shown good accuracy in classifying text and images. It is also used for identifying spammers and rumours on social media [10]. Machine learning can be supervised or unsupervised. In supervised learning, the algorithm is trained on data labelled with the output class [11]. In unsupervised learning, data are fed to the algorithm without the output class [12, 13].

There are various supervised machine learning algorithms that can be used for classification tasks, such as support vector machine, Naive Bayes, and decision tree. In this paper, we focus on classifying tweets using artificial neural networks. The novel contributions of this article are as follows:
(i) A framework is proposed that focuses on machine learning algorithms for detecting propaganda on online social networks.
(ii) A novel dataset was prepared by annotating tweets extracted using keywords such as #CORONAJIHAD, #CORONAMUSLIM, #CHINESEVIRUS, and #BIOWEAPON.
(iii) Data annotation is guided by propaganda techniques such as Name Calling, Bandwagon, and Testimonial.
(iv) Feature engineering is improved by merging different features, namely “TF/IDF,” “Bag of Words,” and Tweet Length, for training machine learning algorithms.
(v) An artificial neural network is fine-tuned and trained on the features selected through the proposed hybrid feature selection method.

This article consists of six sections. Section 1 gives a brief introduction to COVID-19 and the effect of misinformation and propaganda during the pandemic. Section 2 provides a detailed background on misinformation and propaganda on social networks. Section 3 provides a detailed background on machine learning algorithms. Section 4 discusses the proposed methodology. Section 5 discusses the results, and Section 6 concludes our work.

2. Related Work

The growing appeal of social media affects our daily lives, directly or indirectly, by persuading us to trust other people’s opinions and recommendations when making small or large choices, such as purchasing goods or voting in elections to establish a new government. It is unsurprising that social media has evolved into a tool for altering sentiment through the dissemination of disinformation. False material and propaganda are widely used on social media, and they must be recognized and countered. Gao et al. [14] examined the features of fraudulent accounts, with a particular emphasis on textual messages containing URLs. With the rise in social media users, political events and government programmes are debated more widely, and social media misuse is becoming more frequent. Ratkiewicz et al. [15] proposed a technique to detect and track political misuse on social networks. The network structure deviates slightly from its usual structure throughout an election. Halu et al. [16] examined the structure of social networks during election season. A model of opinion dynamics was proposed to show how parties with wide reach survive by accumulating a limited proportion of the vote during election season. Ramakrishnan et al. [17] mined and studied data from online social networks relating to civil unrest. They extracted data pertaining to a specific incident, identified contributing factors, and analysed the event’s evolution. Liu et al. [18] uncovered the inner circles of political power holders in governments that lie beneath official job relationships and tracked the formation and evolution of the selected political groups over time. They accomplished this through three distinct techniques: constructing networks, identifying communities, and monitoring community evolution. Bhat et al. [19] developed a novel technique for detecting stealthy accounts on online social networking sites. The process includes detecting communities at the node level, defining features, classifying them, and locating stealthy Sybils.

Khanday et al. [20] detected coronavirus from clinical textual reports using machine learning techniques. Wong et al. [21] systematically quantified the behaviour of Twitter users, using tweets, retweets, and retweeters to calculate political leanings. Many events unfold simultaneously on social media. Zarrinkalam and Bagheri [22] detected events on social networks. To find unspecified events, three models were used: latent Dirichlet allocation (LDA) for topic modelling, a document clustering method, and wavelet analysis for feature clustering. Lightfoot [23] investigated the impact of social bots on politics (political propaganda using social media bots) and argued that social bots have a solely negative impact. The study found that social bots play a critical role in the propagation of false news and that profiles that consistently disseminate disinformation are much more likely to be bots. Some people with extreme views use social media solely to promote hatred and terror. Ashcroft et al. [24] suggested a semantic graph-based approach to identifying radicalization via social media. Anti-ISIS users often discuss politics, geographical regions, and counter-ISIS activities, while pro-ISIS users frequently discuss religions, historic events, and ethnicity. The latter also posted dubious information on social media. Using cognitive psychology, Kumar et al. [25] detected disinformation on online social networks. The cognitive process is made up of the message’s consistency, coherence, source credibility, and general acceptance. Using the Twitter API, they gathered datasets with the hashtags #Syria, #Egypt, and others. Mazzoleni et al. [26] studied socially mediated populist communication, which was found to be significantly influenced by the nature of social media. “Appeal to the people,” “attacking the Elite,” and “ostracizing the others” were used to gauge the degree of populist communicative ideology.

With the spread of COVID-19, many propagandistic textual posts were published on Twitter under various hashtags. Mujahid et al. [27] analysed the sentiments of topics shared through tweets during the COVID-19 era; most of the tweets were related to online education. With the help of artificial intelligence, various models were developed for predicting COVID-19. Ashraf et al. [28] analysed published work related to COVID-19 and grouped it under nine categories, including age, gender, the outcome of the disease, the method used for prediction, and the role of underlying medical conditions. Table 1 summarizes recent works on propaganda; in Table 1, P, R, and F1 denote precision, recall, and F measure, respectively.

The following conclusions can be drawn from the above literature review:
(i) Clustering algorithms have been used for identifying propaganda; classification algorithms may improve the accuracy, and this needs further investigation.
(ii) The majority of the work has been done on news datasets, and it can be extended to online social media platforms.
(iii) Owing to the semantic nature of propagandistic posts, detecting propaganda is a challenge, and manual annotation is therefore necessary.
(iv) Feature engineering can be explored by choosing various features, such as semantic and emphatic features.

3. Background Knowledge

Machine learning emerged as a separate discipline of computer science owing to the rapid development of data mining techniques and procedures. It can be thought of as a subfield of artificial intelligence, with the primary principle being that a system (computer program, algorithm, etc.) may learn from its own activities. Arthur Samuel first described it in 1959 as a field of study that focuses on the ability of computers to learn without being explicitly programmed. A more formal definition states that a computer program is said to learn from experience E with respect to a class of tasks T and a performance metric P if its performance on tasks in T, as measured by P, improves with experience E [41]. The basic aim of a machine learning algorithm is to perform a specific task based on its training. The model is built using the input dataset as the basis for training and is then used to make predictions. Machine learning is used in many areas of computer science, such as the Internet of Things (IoT), image processing, and natural language processing (NLP). It is also used to detect various daily human activities [42]. Since the outbreak of COVID-19, machine learning has played a vital role in detecting the virus from chest X-ray images [43]. In text mining, machine learning algorithms are used for classifying the sentiment of text, which may be a document, a product review, a news item, or a tweet [44].

Propaganda detection can be framed as either a clustering or a classification problem in machine learning. In the clustering view, unknown propaganda text is grouped into clusters according to attributes detected by the algorithm. Alternatively, after training a model on a large dataset of propaganda and nonpropaganda texts, the problem reduces to classification. With a small set of classes, here one propaganda class and one nonpropaganda class, the task becomes binary classification. In a classification problem, it is easier to identify the right class, and the result is more accurate than with clustering methods. Some of the machine learning algorithms are described below.

3.1. Logistic Regression

Logistic regression predicts the class of an input from the relationship between its numerical features and the label [45]. The feature values selected through feature engineering are arranged in a table and given as input. In general, the method estimates the probability that a sample belongs to a given class. We have two classes here, $y \in \{0, 1\}$. Equation (2) can be used to compute the posterior probability for a feature vector $x$, where $w$ is the weight vector and $b$ is the bias:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T} x + b)}}. \tag{2}$$
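As an illustration of how a logistic regression model turns a feature vector into a class probability, the following minimal Python sketch evaluates the logistic (sigmoid) posterior and thresholds it at 0.5. The weights, bias, and feature values are hypothetical and are not taken from the trained model in this paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def posterior(features, weights, bias):
    """P(class = 1 | features) under a logistic regression model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical weights for three features (illustrative values only).
weights = [0.8, -0.5, 0.1]
bias = -0.2
tweet_features = [1.0, 0.0, 2.0]

p_propaganda = posterior(tweet_features, weights, bias)
label = 1 if p_propaganda >= 0.5 else 0  # 1 = propaganda, 0 = nonpropaganda
```

In practice the weights would be fitted on the annotated training tweets; the 0.5 threshold is the usual default for a two-class problem and is an assumption here.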

3.1.1. Multinomial Naïve Bayes (MNB)

Using the Bayes rule [46], MNB calculates the class probabilities of a given text. It can be used for both binary and multiclass classification. The essential assumption is that each feature is treated separately: the Naive Bayes algorithm uses the Bayes theorem to determine the probability of each attribute independently of any relationships among attributes. This is why the method is referred to as “naive”: features in real-world tasks frequently have some degree of association. In our case, let $A$ denote the set of classes. There are two classes: $a = 0$ and $a = 1$. Furthermore, $N$ denotes the set of features. Then, using the Bayes rule in equation (3), MNB selects the class with the highest probability for the test text $t$:

$$P(a \mid t) = \frac{P(a)\, P(t \mid a)}{P(t)}, \qquad a \in A. \tag{3}$$

$P(a)$ is calculated by dividing the number of textual records labelled as class $a$ by the total number of records. $P(t \mid a)$ is the likelihood of finding a text similar to $t$ in class $a$. It can be calculated as follows:

$$P(t \mid a) = \Big(\sum_{n} f_{n}\Big)! \prod_{n} \frac{P(n \mid a)^{f_{n}}}{f_{n}!}, \tag{4}$$

where $f_{n}$ denotes the number of instances of the word “n” in the text $t$, and $P(n \mid a)$ is the likelihood of the word “n” being present in class a. The latter probability is derived from the training data using equation (5):

$$P(n \mid a) = \frac{1 + F_{na}}{|N| + \sum_{x \in N} F_{xa}}, \tag{5}$$

where $F_{xa}$ is the number of instances of the word “x” in all training data corresponding to the class a. The Laplace estimator, which adds one to the count of each word, is used to avoid the zero-frequency problem.

Two of the main strengths of this method are its simplicity and the ease with which it can be understood. Furthermore, it performs well on datasets with irrelevant attributes, since the probability of these features affecting the outcome is low, so they have little influence on predictions. Additionally, the method has low resource consumption because it only involves calculating the probabilities of features and classes; no coefficients need to be fitted, as is the case with other methods. As mentioned previously, its key shortcoming is that all features are treated as independent, even though this rarely holds in practice.
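The estimation described above can be sketched in a few lines of Python. This is a minimal illustration on a toy corpus, assuming tokenized texts and add-one (Laplace) smoothing; the function names `train_mnb` and `predict_mnb` and the example data are hypothetical. The multinomial coefficient is omitted because it is the same for every class and does not change which class wins.

```python
import math
from collections import Counter


def train_mnb(docs, labels):
    """Estimate class priors and Laplace-smoothed word likelihoods."""
    vocab = {w for d in docs for w in d}
    priors, likelihoods = {}, {}
    for a in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == a]
        priors[a] = len(class_docs) / len(docs)
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        # Laplace estimator: add one to the count of every vocabulary word.
        likelihoods[a] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, likelihoods


def predict_mnb(doc, priors, likelihoods):
    """Return the class with the highest log-posterior for a tokenized text."""
    scores = {}
    for a in priors:
        score = math.log(priors[a])
        for w in doc:
            if w in likelihoods[a]:  # words outside the vocabulary are skipped
                score += math.log(likelihoods[a][w])
        scores[a] = score
    return max(scores, key=scores.get)


# Toy corpus: label 1 = propaganda, label 0 = nonpropaganda (illustrative only).
docs = [["virus", "hoax", "blame"], ["blame", "them"],
        ["wash", "hands"], ["stay", "home", "wash"]]
labels = [1, 1, 0, 0]

priors, likelihoods = train_mnb(docs, labels)
prediction = predict_mnb(["blame", "hoax"], priors, likelihoods)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.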

3.2. Support Vector Machine (SVM)

The support vector machine (SVM) algorithm is a supervised machine learning method for categorizing text [47]. The key idea is to find the hyperplane that best separates the classes. The term “support vectors” refers to the points nearest to the hyperplane that, if removed, would alter the hyperplane’s location. The margin is the distance between the support vectors and the hyperplane. SVM requires a predetermined number of features for each text, together with a predetermined label. The training set’s data points are denoted by $\{(x_k, y_k)\}_{k=1}^{M}$, with $x_k \in \mathbb{R}^{n}$ and $y_k \in \{-1, +1\}$, where “n” denotes the number of features extracted. The values associated with the features selected during feature engineering are arranged in a table and given as input. SVM’s primary objective is to build a classifier of the following form:

$$y(x) = \operatorname{sign}\Big[\sum_{k=1}^{M} \alpha_k\, y_k\, \psi(x, x_k) + b\Big],$$

where $\alpha_k$ are positive real constants, and $b$ is a real constant. $\psi(x, x_k)$ is the kernel function, whose parameters are constants.

The classifier is constructed on the basis of the following assumptions:

$$w^{T} \varphi(x_k) + b \ge +1 \ \text{ if } y_k = +1, \qquad w^{T} \varphi(x_k) + b \le -1 \ \text{ if } y_k = -1.$$

This is equivalent to

$$y_k \big[w^{T} \varphi(x_k) + b\big] \ge 1, \qquad k = 1, \ldots, M,$$

where $\varphi(\cdot)$ is the nonlinear function that maps the input space to a higher-dimensional space. The hyperplane is used to accomplish the classification. The hyperplane separates the two classes, which requires the introduction of a new variable $\xi_k$. The hyperplane equation is

$$y_k \big[w^{T} \varphi(x_k) + b\big] \ge 1 - \xi_k, \qquad \xi_k \ge 0, \quad k = 1, \ldots, M.$$

SVMs are often capable of achieving high accuracy, particularly on “cleaned” datasets. They also work well with high-dimensional datasets, even when the number of dimensions exceeds the number of samples. However, they may be less effective on huge datasets with a high level of noise or overlapping classes. Additionally, training time can be lengthy when dealing with huge datasets.
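To make the margin idea concrete, the sketch below trains a linear soft-margin SVM by batch subgradient descent on the regularized hinge loss. Note that this is a simplified linear variant chosen for brevity, not the kernel formulation discussed in this section; the hyperparameters `lam`, `epochs`, and `lr` and the toy data are assumptions made for illustration.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # the point violates the margin: hinge loss is active
                for j in range(n_features):
                    grad_w[j] -= yi * xi[j] / len(X)
                grad_b -= yi / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    """Classify by the side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data with labels in {-1, +1} (illustrative only).
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(x, w, b) for x in X]
```

Points whose margin is at least 1 contribute nothing to the update, which mirrors the fact that only the points near the hyperplane determine its position.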

4. Methodology

The proposed methodology consists of four stages: data collection and annotation, preprocessing, feature engineering, and classification. Figure 1 gives a graphical representation of the proposed methodology.

4.1. Data Collection and Annotation

Data are the main component of any research. Data can be extracted from online social network platforms using various methods, such as APIs and crawlers. We extracted data from Twitter, a social networking site, using its REST API. Data were extracted for specific events using various hashtags.

4.1.1. Data Collection

Since propagandistic text is semantic in nature, the first step is keyword-based extraction. We used the Twitter Application Program Interface (API) to extract tweets from Twitter [48]. We identified five ambiguous keywords that were widely used during the COVID-19 period: #CORONAJIHAD, #CORONATERRORSIM, #CORONAMUSLIM, #CHINESEVIRUS, and #BIOWEAPON. About 50 K tweets were extracted using these keywords.

4.1.2. Annotation

Owing to the semantic nature of propagandistic tweets, we preferred manual annotation over crowd-sourcing services such as Monkey Learn. Tweets were labelled into two classes based on various propaganda techniques: Name Calling, Bandwagon, Testimonial, and Glittering Generality. Two journalists labelled about 5 K tweets. Figure 2 shows the length of annotated tweets in the propaganda and nonpropaganda classes.

4.2. Data Preprocessing

Preprocessing plays a vital role in text classification problems. Since raw tweets contain many unnecessary spaces, links, and similar noise, the data need to be cleaned. The preprocessing techniques used for refining the text are as follows [49].

4.2.1. Tokenization

In this task, the tweets are split into tokens so that the text can be refined easily by removing unwanted words and digits.

4.2.2. Punctuation Removal

In this phase, punctuation marks such as commas, full stops, and semicolons are removed. Punctuation mainly helps humans communicate with each other and is not needed by the classifier.

4.2.3. Number Removal

Tweets often contain numbers, which do not carry any sentiment. In this phase, numbers are removed, as they have the least importance in text analysis [50].

4.2.4. Stop Words

This task removes all stop words, as they play a very small role in a classification task. We used the English stop word dictionary, as we worked only on tweets written in English.

4.2.5. Spell Checking

People tend to make grammatical or spelling mistakes while writing a post or tweet. In this phase, misspelled words are corrected using “pyspellchecker,” a Python library.

4.2.6. Stemming

Each word is reduced to its root form so that words sharing a root are treated alike. Stemming chops off the ends of words in the expectation of achieving this correctly most of the time, and it frequently involves the removal of derivational affixes.
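The preprocessing steps above can be chained as in the following minimal Python sketch. It uses only the standard library so that it runs anywhere: the short stop word set and the suffix-stripping `naive_stem` function are toy stand-ins (assumptions) for a full English stop word list and a real stemmer, and the spell-checking step is omitted.

```python
import re
import string

# Tiny illustrative stop word list; a real pipeline would use a full English list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "are", "this"}


def tokenize(text):
    """Lowercase the tweet and split it on whitespace."""
    return text.lower().split()


def strip_punctuation(tokens):
    table = str.maketrans("", "", string.punctuation)
    return [t.translate(table) for t in tokens]


def remove_numbers(tokens):
    return [re.sub(r"\d+", "", t) for t in tokens]


def remove_stop_words(tokens):
    # Empty strings left over from earlier steps are dropped here as well.
    return [t for t in tokens if t and t not in STOP_WORDS]


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = tokenize(text)
    tokens = [t for t in tokens if not t.startswith("http")]  # drop links
    tokens = strip_punctuation(tokens)
    tokens = remove_numbers(tokens)
    tokens = remove_stop_words(tokens)
    return [naive_stem(t) for t in tokens]


cleaned = preprocess("The hoax is spreading in 50 cities, they claimed! https://example.com")
```

As with real stemmers, the resulting roots (for example "citi" for "cities") need not be dictionary words; what matters is that related word forms map to the same token.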

4.3. Feature Engineering

Every machine learning algorithm needs features on which it can be trained. In this work, we combined several feature engineering methods to obtain the most relevant features. The methods used are as follows.

4.3.1. TF/IDF

Term Frequency/Inverse Document Frequency (TF/IDF) quantifies the significance of a word in a tweet relative to the entire corpus. It is determined using the following equation:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D), \qquad \mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|},$$

where $t$ denotes the word as a feature, $d$ denotes each tweet in the corpus, and $D$ denotes the document space, with $|D|$ being the dataset’s total number of tweets.
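The following short Python sketch computes this weighting on a toy corpus. It assumes the common unsmoothed variant, with relative term frequency and a natural logarithm; library implementations often apply additional smoothing and normalization, so their values may differ.

```python
import math
from collections import Counter


def tf(term, doc):
    """Relative frequency of a term within one tokenized tweet."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """log(|D| / number of tweets that contain the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# Toy corpus of three tokenized tweets (illustrative only).
corpus = [
    ["virus", "hoax", "virus"],
    ["virus", "vaccine"],
    ["wash", "hands"],
]

score_hoax = tfidf("hoax", corpus[0], corpus)
score_virus = tfidf("virus", corpus[0], corpus)
```

Although "virus" occurs twice in the first tweet, "hoax" receives the higher weight because it appears in only one tweet of the corpus.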

4.3.2. Bag of Words

Extracting information from text for use in machine learning algorithms is referred to as text mining. Bag of Words is a representation of text that describes the occurrence of words within a document. To extract more information from the text, we used bi-grams and tri-grams in this work.
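A Bag of Words representation over bi-grams and tri-grams can be sketched as follows. The helper names `ngrams`, `bag_of_ngrams`, and `vectorize` are hypothetical, and the two example tweets are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, sizes=(2, 3)):
    """Count the bi-grams and tri-grams occurring in one tokenized tweet."""
    bag = Counter()
    for n in sizes:
        bag.update(ngrams(tokens, n))
    return bag


def vectorize(corpus, sizes=(2, 3)):
    """Map every tweet to a count vector over a shared, sorted vocabulary."""
    bags = [bag_of_ngrams(doc, sizes) for doc in corpus]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[term] for term in vocab] for bag in bags]


corpus = [["virus", "is", "a", "hoax"], ["virus", "is", "real"]]
vocab, vectors = vectorize(corpus)
```

Each tweet becomes a fixed-length vector whose entries count how often each n-gram of the shared vocabulary occurs in it; word order is preserved only within each n-gram.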

4.3.3. Length

Another feature that improved accuracy is the length of the tweet. As a tweet is at most 280 characters long, it is difficult to detect propaganda in such a short message. We therefore use tweet length as a feature so that the classifier performs better.

We combined all of the features above and, from the combined set, chose the 40 most relevant features using the Gini coefficient.
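One possible reading of this selection step is sketched below: the three feature groups are concatenated per tweet, each feature is scored by the largest decrease in Gini impurity that a single threshold split on it can achieve, and the top k features are kept. The scoring rule, the helper names, and the toy values are assumptions made for illustration; the text does not specify how the Gini coefficient is applied, and the paper keeps k = 40 features rather than the k = 2 used here.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gini_gain(values, labels):
    """Largest impurity decrease over all threshold splits of one feature."""
    parent = gini(labels)
    best = 0.0
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = max(best, parent - weighted)
    return best


def select_top_k(rows, labels, k):
    """Indices of the k features with the highest Gini gain."""
    n_features = len(rows[0])
    gains = [gini_gain([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]


# Each row concatenates toy TF/IDF values, an n-gram count, and the tweet length.
tfidf_part = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.1], [0.0, 0.2]]
bow_part = [[1], [0], [1], [0]]
lengths = [[120], [140], [60], [45]]
rows = [t + b + l for t, b, l in zip(tfidf_part, bow_part, lengths)]
labels = [1, 1, 0, 0]

selected = select_top_k(rows, labels, k=2)
```

In this toy data, the first TF/IDF column and the tweet length each separate the two classes perfectly, so they are the two features retained.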

4.4. Classification

Classification is carried out to assign the given content to one of two classes: propaganda (a tweet that is propagandistic) and nonpropaganda (a normal tweet). An artificial neural network algorithm is used to classify the content into these classes.

The multilayer perceptron classifier (MLPC) is a classification algorithm based on the feedforward artificial neural network. MLPC uses backpropagation for learning [51]. It consists of several layers of nodes: the input layer, the hidden layers (also known as middle layers), and the output layer. Each layer is connected to the next layer in the network. Nodes in the input layer represent the input data, and their output is the same as their input. All other nodes map inputs to outputs by a linear combination of the inputs with the node’s weights and bias, followed by an activation function. The hidden layers lie between the input and output layers, and their number ranges from one to many. The activation function of each computing layer is responsible for mapping a node’s input to its output. Nodes in the hidden layers use the sigmoid (logistic) function.

The output layer is the final layer of the neural network and is responsible for producing the final result. During training, the error observed at the output layer is also propagated back to the preceding layers, which adjust their weights according to how well they have learned the input. Nodes in the output layer use the softmax function.
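The forward pass of such a network, with sigmoid hidden nodes and a softmax output layer, can be sketched as follows. The layer sizes and parameter values are hypothetical and untrained; in practice they would be learned by backpropagation, and the input would be the 40 selected features rather than the three used here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # subtract the max for stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Linear combination of the inputs with each node's weights and bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    hidden = [sigmoid(z) for z in dense(features, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical, untrained parameters: 3 input features, 2 hidden nodes, 2 classes.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]

probs = forward([1.0, 0.0, 2.0], hidden_w, hidden_b, out_w, out_b)
predicted_class = probs.index(max(probs))
```

The softmax output has one entry per class and sums to one, so the predicted class is simply the index of the largest entry.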

The output layer’s node count N equals the number of classes (see Algorithm 1).

Algorithm 1
Require: filtered tweets
Ensure: propaganda tweets and nonpropaganda tweets
Notation:
    F: filtered tweets
    P: propaganda tweets
    NP: nonpropaganda tweets
    T: tokenization
    SW: stop word removal
    S: stemming
    n: total number of tweets
    TF/IDF: term frequency/inverse document frequency
    B: Bag of Words
START
for each tweet in F do
    F = P + NP    //manual annotation
end for
for i = 1 to n do
    F_i = T(F_i)
    F_i = SW(F_i)
    F_i = S(F_i)
end for
for i = 1 to n do
    X_i = TF/IDF(F_i)
    X_i = X_i + B(F_i)
end for
ANN(X)    (CLASSIFIER)
END

5. Results and Discussion

The experiments were performed on a workstation with 16 GB of RAM and a GPU. The dataset was split in several ratios. First, the data were split 50 : 50; that is, 50% of the data was used for training and 50% for testing. The accuracy was lower than when the data were split 60 : 40. After performing classification on these splits, the best accuracy was achieved when the data were split 80 : 20, which means that 80% of the data was used for training and 20% for testing. The models were evaluated on precision, recall, F-measure, and accuracy, using equations (14), (15), and (16):

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{14}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{15}$$

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{16}$$

where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.
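These metrics can be computed directly from the confusion counts, as in the minimal Python sketch below. The label vectors are made up for illustration and do not reproduce the results reported in this paper; accuracy is taken here as the share of correctly classified tweets.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


# Illustrative labels only: 1 = propaganda, 0 = nonpropaganda.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

metrics = evaluate(y_true, y_pred)
```

The guards against empty denominators return 0.0 when a classifier never predicts the positive class, which avoids a division-by-zero error on degenerate splits.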

In training and testing, the ANN outperformed the other algorithms, with an accuracy of 77.15%. The objective of the statistical analysis is to gather as much information as possible about the data that will be used in the classification task. The work was carried out in the Python programming language using the Spyder IDE. The data are presented as a table of values for the 40 selected attributes. Several libraries were used, among them Pandas, scikit-learn, and NLTK. Table 2 compares the different methods for binary classification. Figure 3 illustrates the confusion matrices of the conventional machine learning methods, whereas Figure 4 shows the confusion matrix of ANNBPI.

5.1. Comparative Study

To validate the results of the proposed approach against existing work, the proposed methodology was applied to a dataset similar to the one used by other researchers. After fine-tuning logistic regression, decision tree, support vector machine, multinomial Naive Bayes, and artificial neural network based on the proposed methodology, it was found that these algorithms performed better than in previous studies. The evaluation metrics showed that ANNBPI achieved the highest performance on this dataset, with a precision of 78%, recall of 76%, F1-score of 75%, and accuracy of 76.15%. Table 3 compares the proposed work with existing approaches in this domain.

6. Conclusion

Propaganda is used as a lethal weapon on online social networks. Data were retrieved from the social media platform Twitter using its Application Program Interface. The data were labelled manually using various propaganda identification techniques. Hybrid feature engineering was performed by combining Bag of Words, TF/IDF, and Tweet Length, and the 40 most relevant features were chosen. Machine learning classifiers were trained and tested on the labelled dataset using a 70 : 30 split. The artificial neural network performed better than the other algorithms, giving 77.15% accuracy, 74% F-measure, 79% precision, and 77% recall. In the future, more data may be supplied to these algorithms, and deep learning approaches may be explored.

Data Availability

The data used to support this study are available from the author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.