When I was studying to become a software developer, I had to tweet at least once every day to accomplish one of the goals of the school where I was taking the program. To do that, I decided to create a bot that would perform the operation automatically every day. That is how I found Tweepy, a Python library that lets you use the Twitter API.
The purpose of the bot is to gather fascinating facts from a website that publishes daily historical events. It accomplishes this by employing a web scraper that extracts data from the website's HTML files. The extracted information is then stored in a file for further processing. Using the Tweepy library, the bot accesses the stored data and crafts tweets, which are published on a regular basis, sharing curious facts with the bot's followers.
Process to create your first Twitter bot:
What is Tweepy?
Before we start, let's briefly look at what this Python library that lets you interact with the Twitter API actually is. Tweepy is an open source project that provides a convenient and easy way to interact with the Twitter API using Python. For that purpose, Tweepy includes different classes and methods to represent Twitter's models and API endpoints. With these methods and classes you can encode and decode data, make HTTP requests, paginate search results and implement an OAuth authentication mechanism, among other things. With that said, let's start.
Tweepy installation
According to the official repository, the easiest way to install the latest Tweepy version is by using pip:
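```bash
pip install tweepy
```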
You can also use Git to clone the repository and install the latest Tweepy development branch:
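```bash
git clone https://github.com/tweepy/tweepy.git
cd tweepy
pip install .
```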
Finally, you can also install Tweepy directly from the GitHub repository:
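```bash
pip install git+https://github.com/tweepy/tweepy.git
```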
Authentication Credentials for Twitter API
First of all, you need to apply for a Twitter developer account. To do that, follow the steps described in the Twitter developer account support documentation.
Finally, once you complete your application, go to the developer portal dashboard to review your account status and setup. If it's successfully registered, the next step is to create your first app.
Create the application
Twitter issues authentication credentials to applications, not accounts, so you need to create an app to be able to make API requests. In this case, our app will be a bot that scrapes data from a website and publishes it as a tweet on your Twitter account.
To create your application, visit the developer portal dashboard, select +Add App, and provide the following information: app name, application description, website URL, and all related information about how users will use your application.
Authentication credentials
To create your authentication credentials, go to the developer portal dashboard, select your application, and then click on "Keys and tokens". Once you are on your project page, select "Generate API Key and Secret" and also "Access Token and Secret". Keep in mind that the access token should be created with read and write permissions, which guarantees that our bot can write tweets for you using the Twitter API. Don't forget to store the keys so you can use them later in our configuration file for Twitter authentication.
You may want to test your credentials with a short Python script.
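A minimal sketch, using Tweepy's OAuth 1.0a handler:

```python
import tweepy

# Replace the square-bracket fields with your own credentials
auth = tweepy.OAuthHandler("[API_KEY]", "[API_SECRET_KEY]")
auth.set_access_token("[ACCESS_TOKEN]", "[ACCESS_TOKEN_SECRET]")
api = tweepy.API(auth)

try:
    api.verify_credentials()
    print("Authentication OK")
except Exception:
    print("Error during authentication")
```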
Make sure you replace the square bracket fields with your credentials. Once you're done, we can continue with the next step.
Create your configuration file to authenticate our bot
Here's an explanation of the authentication script:
The script starts with a shebang, `#!/usr/bin/python3`, which specifies the interpreter to be used to execute the script. In this case, it's set to Python 3.

The script imports the necessary modules, `tweepy` and `os`. Alongside Tweepy, we'll use `os`, which allows interaction with the operating system, particularly to retrieve environment variables. We need that module because we will store our Twitter keys as environment variables.

The `create_api()` function is defined. This function is responsible for creating and returning an instance of the Tweepy API, which will be used to interact with the Twitter API.

Inside the `create_api()` function, the script retrieves several environment variables using `os.getenv()`. These environment variables are expected to contain the necessary Twitter API credentials: `API_KEY`, `API_SECRET_KEY`, `ACCESS_TOKEN`, and `ACCESS_TOKEN_SECRET`.

The script uses the credentials obtained from the environment variables to initialize an instance of `tweepy.OAuthHandler`. This class is responsible for handling the OAuth 1.0a authentication process required by the Twitter API.

The `auth.set_access_token()` method is called to set the access token and access token secret obtained from the environment variables.

An instance of `tweepy.API` is created, passing the `auth` object as an argument. This instance represents the authenticated connection to the Twitter API.

The script then attempts to verify the credentials by calling `api.verify_credentials()`. If the verification is successful, it prints "Authentication OK". Otherwise, it catches the exception raised, prints "Error during authentication", and re-raises the exception.

Finally, the created `api` object is returned from the `create_api()` function.

In summary, this script defines a function `create_api()` that creates and returns an authenticated instance of `tweepy.API`. It retrieves the necessary Twitter API credentials from environment variables, sets up the authentication, and verifies the credentials. This function can be used to establish a connection to the Twitter API for further interactions in your Twitter bot.
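Putting those pieces together, here's a minimal sketch of what the configuration module could look like (stored as `bots/config.py`, the module path used later in this post):

```python
#!/usr/bin/python3
# bots/config.py - build an authenticated Tweepy API instance
import os

import tweepy


def create_api():
    # Credentials are read from environment variables
    api_key = os.getenv("API_KEY")
    api_secret_key = os.getenv("API_SECRET_KEY")
    access_token = os.getenv("ACCESS_TOKEN")
    access_token_secret = os.getenv("ACCESS_TOKEN_SECRET")

    # OAuth 1.0a authentication handler
    auth = tweepy.OAuthHandler(api_key, api_secret_key)
    auth.set_access_token(access_token, access_token_secret)

    # Authenticated connection to the Twitter API
    api = tweepy.API(auth)
    try:
        api.verify_credentials()
        print("Authentication OK")
    except Exception:
        print("Error during authentication")
        raise
    return api
```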
Create your bot
Bot modules: `bot_v1.py` and `split_string.py`
Let's go through the script step by step:
The script imports the `random` module, which will be used to choose a random line from the file, and the `datetime` module from the standard library, which will be used to get the current date.

We also import the `create_api()` function that we created in the previous step. It is imported from the `bots.config` module and is responsible for creating an authenticated instance of `tweepy.API` that connects to the Twitter API.

Additionally, we import specific functions from the `babel.dates` and `babel.numbers` modules. These functions will be used for formatting the date and decimal numbers in a localized manner.

The script also imports the `split_string` function from the `utils.split_string` module. This function will be used to split the tweet into multiple parts if its length exceeds Twitter's character limit.

The `tweet_job(api)` function is defined. This function takes the `api` object (the authenticated instance of `tweepy.API`) as an argument. Inside the `tweet_job()` function, the script opens a file named `hoy_en_la_historia.txt` located at the specified path. It reads all the lines from the file and stores them in the `lines` list.

The script randomly selects a line from the `lines` list using `random.choice()` and assigns it to the `myline` variable.

We also get the current date using `datetime.now()` and format the date components separately: the day as a two-digit decimal number and the month as the full month name in Spanish, using the `format_decimal()` and `format_date()` functions from the Babel library, respectively.

The script creates a formatted date string by combining the day, the month, and a custom text. If the length of the tweet string `mystr` is less than or equal to 240 characters (the Twitter character limit), it tweets `mystr` directly using `api.update_status()` and prints the tweet.

If the length of the tweet string exceeds 240 characters, it splits the tweet using the `split_string()` function, which splits the string into two parts while considering word boundaries. It adds a marker to indicate the order of the tweets. We'll go through it in the next step.

The script tweets the first part of the split string using `api.update_status()` and assigns the tweet object to `original_tweet`. It then tweets the second part as a reply to the original tweet using `api.update_status()` with the `in_reply_to_status_id` parameter set to the ID of the original tweet.

The `main()` function is also defined; it calls the `create_api()` function to create an authenticated API instance and passes it to the `tweet_job()` function.

Finally, we use the entry point, which checks whether the current module is the main module via the `if __name__ == "__main__":` condition. If it is, it calls the `main()` function to start the execution of the script.

In summary, this script defines a function `tweet_job()` that reads lines from a file, selects a random line, formats the current date, and tweets the content. If the tweet exceeds the character limit, it splits the tweet into multiple parts. The script also defines a `main()` function that creates an authenticated API instance and calls `tweet_job()`. When executed as the main module, it sets everything in motion.
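Here's a minimal sketch of `bot_v1.py` consistent with the walkthrough above. The exact wording of the date string and the (1/2)/(2/2) order markers are assumptions for illustration; the file path comes from the scraper's output directory shown later in this post:

```python
#!/usr/bin/python3
# bots/bot_v1.py - pick a random historical fact and tweet it
import os
import random
from datetime import datetime

from babel.dates import format_date
from babel.numbers import format_decimal

from bots.config import create_api
from utils.split_string import split_string


def tweet_job(api):
    # Path taken from the scraper's output directory
    path = os.path.expanduser("~/tweepy_bot/scrapers/hoy_en_la_historia.txt")
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    # Pick one fact at random
    myline = random.choice(lines).strip()

    # Format the date in Spanish, e.g. "09 de julio"
    now = datetime.now()
    day = format_decimal(now.day, format="00", locale="es")
    month = format_date(now, "MMMM", locale="es")
    mystr = f"Hoy {day} de {month} en la historia: {myline}"

    if len(mystr) <= 240:
        api.update_status(mystr)
        print(mystr)
    else:
        # Split at a word boundary and post the rest as a threaded reply
        first, second = split_string(mystr)
        original_tweet = api.update_status(first + " (1/2)")
        api.update_status("(2/2) " + second,
                          in_reply_to_status_id=original_tweet.id)


def main():
    api = create_api()
    tweet_job(api)


if __name__ == "__main__":
    main()
```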
Let's check out the `split_string()` function
The `split_string()` function takes a string as input and splits it into two parts while ensuring that the total length of the resulting strings does not exceed a certain limit. Here's a breakdown of the function's logic:

If the length of the input string is greater than 234 characters (slightly below the Twitter character limit of 240), the function proceeds to split the string. The first 234 characters of the input string are assigned to the variable `first_string`.

The function searches for the last space within the first 234 characters using the `rfind()` method. This helps ensure that the split occurs at a word boundary. If a space is found within the first 234 characters, `first_string` is truncated at the last space, ensuring that it doesn't cut off words. The remaining characters from the input string, after the split point, are assigned to the variable `second_string`; any leading or trailing whitespace is stripped using the `strip()` method. If the length of the input string is less than or equal to 234 characters, the entire input string is assigned to `first_string`, and `second_string` is set to an empty string.

Finally, the function returns a tuple containing `first_string` and `second_string`.
In summary, the `split_string()` function splits a given string into two parts, with the first part having a maximum length of 234 characters (accounting for the Twitter character limit) while ensuring that the split occurs at a word boundary. It provides a convenient way to split long strings for tweeting purposes, maintaining readability and coherence.
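A minimal sketch of the function, following that logic:

```python
def split_string(text):
    """Split text into two parts at a word boundary, keeping the
    first part within 234 characters."""
    if len(text) <= 234:
        return text, ""

    first_string = text[:234]
    # Look for the last space so we don't cut a word in half
    split_point = first_string.rfind(" ")
    if split_point != -1:
        first_string = first_string[:split_point]
    else:
        split_point = 234

    # Whatever follows the split point becomes the second tweet
    second_string = text[split_point:].strip()
    return first_string, second_string
```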
Now let's create our data scraper using bash!
Data scraper
Here's a brief summary of the functionality of the bash script. The script takes a URL as an argument and assigns it to the `url` variable. It sets the directory path where the data will be stored in the `dir` variable. The script removes any existing `hoy_en_la_historia.txt` file in the specified directory. It retrieves data from the specified URL using `curl`, processes the HTML response using `htmlq` and `html2text` (make sure you install both packages first; follow the hyperlinks for the installation instructions), and stores the result in a temporary file called `data.txt`.
The script performs various transformations on the `data.txt` file using `sed`, `tr`, and `grep` to extract and format the desired data. Let's go through the commands used in the script one by one:

`url=$1`: This command assigns the first argument passed to the script to the variable `url`. It allows you to provide a URL as an argument when executing the script.

`dir="$HOME/tweepy_bot/scrapers"`: This command sets the directory path where the data will be stored, assigning the specified path to the variable `dir`.

`rm -rf "$dir"/hoy_en_la_historia.txt`: This command removes any existing `hoy_en_la_historia.txt` file in the specified directory `$dir`.

`echo $(curl --silent "$url" | htmlq --text | html2text) | tr -s ' ' | sed '/./G' > "$dir"/data.txt`: This command retrieves the HTML content from the specified URL using `curl`, processes it using `htmlq` and `html2text`, and saves the result in a temporary file called `data.txt`. The `echo` command and subsequent pipeline manipulate the text by squeezing repeated spaces and adding line breaks.

`sed -i -E 's/([0-9]{3,4} -)/\n\1/g' "$dir"/data.txt`: `([0-9]{3,4} -)` is the pattern that matches either a 3-digit or 4-digit sequence followed by a space and a dash. The captured group is then inserted into the replacement string `\n\1` to add a newline before the matched pattern.

`sed -i '/^\s*$/d' "$dir"/data.txt`: This command deletes empty lines in the file. `/^\s*$/` is a regular expression pattern that matches empty lines: `^` represents the start of a line, `\s*` matches zero or more whitespace characters, and `$` represents the end of a line. `d` is the sed command that deletes the matched lines.

`sed -i '$ s/\./.\n/' "$dir"/data.txt`: `$` matches the last line of the file. `s/\./.\n/` finds the first occurrence of a dot `\.` on the last line and replaces it with the dot followed by a newline `.\n`.

`grep -v -e 'See All' -e 'SHOW' -e 'Efemérides' "$dir"/data.txt | grep -vE '^.{,60}$' > "$dir"/hoy_en_la_historia.txt`: This command filters out lines in `data.txt` that contain certain keywords (See All, SHOW, Efemérides), removes lines that are 60 characters long or shorter, and writes the result to `hoy_en_la_historia.txt`.

`sed -i 's/ - /, /g' "$dir"/hoy_en_la_historia.txt`: This substitution command searches for the pattern "space-dash-space" ` - ` and replaces it with a comma and a space `, `.

`rm -rf "$dir"/data*`: This command removes all temporary files starting with `data` in the specified directory `$dir` to clean up our workspace.

The script saves the final formatted data into the `hoy_en_la_historia.txt` file, which is the file our bot reads its data from.
Now let's create the script that calls our main bot module, which we will automate with the crontab Linux utility.
Bot runner
Here's the script explanation:
The script imports the `main` function from the `bots.bot_v1` module, which is the function we defined when we created the bot.

The variable `maxtries` is set to 8, indicating the maximum number of attempts the script will make to execute the `main()` function. The script enters a loop that iterates `maxtries` times using the `range()` function. Within each iteration, the `main()` function is called. If the execution of the `main()` function is successful (no exception is raised), the loop is terminated using the `break` statement, and the script finishes.

If an exception occurs during the execution of the `main()` function, the script pauses for 900 seconds (15 minutes) using the `time.sleep()` function. After the pause, the loop continues to the next iteration, and the process repeats until either the `main()` function succeeds or the maximum number of tries is reached. The script prints "fail" followed by the iteration number `i`, indicating the failed attempt.

In summary, this Python script repeatedly calls the `main()` function from the `bots.bot_v1` module with a maximum number of tries. If an exception occurs, it waits for a specific duration before retrying. The script provides a mechanism to handle failures from the cronjob call and ensures that the `main()` function gets several chances to run within a specified time frame.
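A minimal sketch of `bot_runner.py` matching that description (whether "fail" is printed on every failed attempt or only the last one isn't specified, so here it's printed each time):

```python
#!/usr/bin/python3
# bot_runner.py - retry wrapper around the bot's main() function
import time

from bots.bot_v1 import main

maxtries = 8

for i in range(maxtries):
    try:
        main()
        break  # the tweet went out, stop retrying
    except Exception:
        print("fail", i)
        # Wait 15 minutes before the next attempt
        time.sleep(900)
```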
We are about to finish. Let's check the bash script that we will use to automate our bot with crontab.
Cron Tweet script
url="https://www.hoyenlahistoria.com/efemerides.php"
: This line assigns the website url to the variable url
. It specifies the website from which the scraper will fetch data.
cd $HOME/tweepy_bot || exit
: This line changes the current directory to out tweepy_bot
project workspace. If the directory change is unsuccessful (for example, if the directory doesn't exist), the script exits.
./scrapers/scraper.sh $url
: This line executes the shell script scraper.sh
located in the scrapers directory. The script takes the value of the url
variable as an argument and it will be the script that performs all text file operations that allows us to have a clean file to make out bot tweet for us in readability way.
python3 bot_runner.py
: This line executes the Python script bot_runner.py
. It runs the script responsible for running the bot, which performs the operations based on the data fetched by the scraper.
In summary, this script sets the URL, changes the directory to the appropriate location, executes the scraper script with the specified URL, and then executes the bot runner script to perform operations based on the scraped data.
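Putting those lines together, `cron_script.sh` looks like this (the shebang and comments are additions):

```bash
#!/bin/bash
# cron_script.sh - scrape today's facts, then run the bot

url="https://www.hoyenlahistoria.com/efemerides.php"

# Work from the project root, or bail out if it's missing
cd $HOME/tweepy_bot || exit

# Refresh hoy_en_la_historia.txt with today's events
./scrapers/scraper.sh $url

# Tweet one of the scraped facts (with retries)
python3 bot_runner.py
```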
Finally, let's check our cronjob file.
Crontab job
In order to automate our bot we'll use a cron job. The cron daemon is a built-in Linux utility that runs processes on your system at a scheduled time. Cron reads the crontab (cron tables) for predefined commands and scripts. By using a specific syntax, you can configure a cron job to schedule scripts or other commands to run automatically.
The cron daemon uses a specific syntax to interpret the lines in the crontab configuration file. Let's look at that syntax to understand the basic components that make it work.
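Schematically, each crontab line consists of five time fields followed by the command to run:

```
m h dom mon dow  command
│ │  │   │   │
│ │  │   │   └── day of week (0-6, Sunday = 0)
│ │  │   └────── month (1-12)
│ │  └────────── day of month (1-31)
│ └───────────── hour (0-23)
└─────────────── minute (0-59)
```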
Here you can find a detailed explanation of cron jobs in Linux, but for the purposes of our lab, I just want you to keep the file components in mind. Now let's check our cron file.
To edit this file, just type `crontab -e` in your terminal, which will open an editor with the default configuration. Just copy and paste the content at the end of the file. Make sure to replace your keys within the fields.
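A sketch of that content: the schedule line explained below, plus the credentials set as environment variables at the top of the crontab (an assumption, since cron jobs don't inherit your shell's variables; replace the bracketed fields with your own keys):

```
API_KEY=[API_KEY]
API_SECRET_KEY=[API_SECRET_KEY]
ACCESS_TOKEN=[ACCESS_TOKEN]
ACCESS_TOKEN_SECRET=[ACCESS_TOKEN_SECRET]

0 6 * * * $HOME/tweepy_bot/cron_script.sh
```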
Let's briefly check the crontab script:
The `m h dom mon dow` fields represent the minute, hour, day of the month, month, and day of the week, respectively. In this case, `0 6 * * *` indicates that the specified command should run at 6:00 AM every day. `$HOME/tweepy_bot/cron_script.sh` is the path to the shell script that contains the necessary commands to run the bot; in our case, it is the Cron Tweet script that we created previously.
In summary, the `crontab` file is configured to execute the `cron_script.sh` shell script at 6:00 AM every day, which in turn runs the necessary commands to automate the bot.
This is how our project tree should look, based on the paths we've used throughout this post:
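```
tweepy_bot/
├── bot_runner.py
├── cron_script.sh
├── bots/
│   ├── config.py
│   └── bot_v1.py
├── scrapers/
│   ├── scraper.sh
│   └── hoy_en_la_historia.txt
└── utils/
    └── split_string.py
```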
Now we're done with our first Twitter bot!
Conclusion
Creating a Twitter bot using Tweepy and the Twitter API has been a rewarding experience. Through the use of Tweepy's Python library and the authentication credentials provided by the Twitter developer account, I was able to automate the process of tweeting daily. By fetching data from a website, composing tweets, and formatting them appropriately, the bot script fulfilled the requirements of my school program. The implementation of Crontab jobs ensured the bot's automation, allowing for consistent and timely tweets. Overall, this project has not only deepened my understanding of APIs, data scraping, and automation but has also strengthened my skills as a software developer.
If you want to see the repo with all the configuration files, you can follow this link.