Twitter Bot Classifier
Machine Learning
Kaggle
Python
Scikit Learn
Twitter API

For the project in my Machine Learning course, I built a Twitter bot classifier that reached 98.606% accuracy and placed in the top 10% of its Kaggle competition.

Abstract:

According to Wikipedia, a Twitterbot [7] is a bot program used to produce automated posts on the Twitter microblogging service, or to automatically follow Twitter users. A 2017 study [2] conducted at the University of Southern California suggests that 9 to 15% of Twitter accounts are bots controlled by software rather than by humans. While some of these bots are beneficial, for example by disseminating news at critical times, many can be used for malicious activities such as promoting terrorist propaganda and influencing public opinion. One example is the 2016 US election, where around 1 million [6] automated tweets in favour of the candidates were recorded between the first and second debates. Social networks evidently played a big part in how that election panned out. Such malicious influence needs to be controlled, and the first step towards that is identifying which accounts are bots.

In this project, we study the problem of identifying bots on Twitter. Many factors help determine whether an account is a bot: whether it declares itself to be a bot, whether it tweets the same thing to everybody, or whether its tweets are posted through an API. We consider all these factors and use different machine learning techniques to train and test on our data. We review previous work in this domain in section 3. In sections 4 and 5, we describe the dataset used in this project and the machine learning techniques we tried for classifying the accounts.
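To make the three factors concrete, here is a minimal sketch of how they could be turned into numeric features. It is not the feature set used in the project: the account layout (a dict with `screen_name`, `description`, and a `tweets` list of `text`/`source` pairs), the `HUMAN_CLIENTS` set, and the feature names are all assumptions made for illustration.

```python
import re

# Posting clients treated as "human"; this set is an illustrative assumption.
HUMAN_CLIENTS = {"Twitter Web Client", "Twitter for iPhone", "Twitter for Android"}


def extract_features(account):
    """Map one account record (hypothetical dict layout) to three numeric features."""
    # Factor 1: does the account tell you it is a bot in its name or bio?
    profile = f"{account['screen_name']} {account['description']}".lower()
    says_bot = 1 if re.search(r"bots?\b", profile) else 0

    tweets = account["tweets"]
    if not tweets:
        return {"says_bot": says_bot, "duplicate_ratio": 0.0, "api_ratio": 0.0}

    # Factor 2: share of tweets that repeat an earlier text (case-insensitive).
    distinct_texts = {t["text"].strip().lower() for t in tweets}
    duplicate_ratio = 1 - len(distinct_texts) / len(tweets)

    # Factor 3: share of tweets posted from a client outside the "human" set.
    api_ratio = sum(t["source"] not in HUMAN_CLIENTS for t in tweets) / len(tweets)

    return {
        "says_bot": says_bot,
        "duplicate_ratio": duplicate_ratio,
        "api_ratio": api_ratio,
    }


# Made-up account, not taken from the project's dataset.
example = {
    "screen_name": "hourly_news_bot",
    "description": "Automated headlines every hour.",
    "tweets": [
        {"text": "Read the latest headlines!", "source": "ExampleScheduler"},
        {"text": "Read the latest headlines!", "source": "ExampleScheduler"},
        {"text": "read the latest headlines! ", "source": "ExampleScheduler"},
        {"text": "Good morning", "source": "Twitter Web Client"},
    ],
}
features = extract_features(example)
print(features)
```

Features like these would then be fed to whichever classifier is being trained; a real pipeline would likely need many more signals than these three.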

We also compare the results of the different techniques on accuracy, precision, recall, and the time taken for training and testing. Based on these observations, we try to find the technique best suited to this problem. Finally, we look at some potential areas of application of this classification.
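The comparison criteria can be sketched in a few lines of plain Python. The labels, the `api_ratios` values, and the threshold rule below are invented for illustration and are not results from the project; in practice, scikit-learn's `accuracy_score`, `precision_score`, and `recall_score` in `sklearn.metrics` compute the same quantities.

```python
import time


def evaluate(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = bot, 0 = human)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bots caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # humans flagged as bots
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bots missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # humans left alone
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def threshold_rule(values, cutoff=0.5):
    """Stand-in 'model' (an assumption): predict bot when the value reaches the cutoff."""
    return [1 if v >= cutoff else 0 for v in values]


# Toy data, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
api_ratios = [0.9, 0.8, 0.2, 0.1, 0.0, 0.7, 0.3, 1.0]

y_pred, seconds = timed(threshold_rule, api_ratios)
scores = evaluate(y_true, y_pred)
print(scores, f"prediction took {seconds:.6f}s")
```

Accuracy alone can mislead when bots are a small share of accounts, which is one reason to report precision and recall alongside it.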
