When I was studying to become a software developer, I had to tweet at least once every day to accomplish one of the goals of the school where I was taking the program. To do that, I decided to create a bot that would perform the operation automatically every day. That is how I found Tweepy, a Python library that lets you use the Twitter API.
The purpose of the bot is to gather fascinating facts from a website that publishes daily historical events. It accomplishes this by employing a web scraper that extracts data from the website's HTML files. The extracted information is then stored in a file for further processing. Using the Tweepy library, the bot accesses the stored data and crafts tweets, which are published on a regular basis, sharing curious facts with the bot's followers.
Process to create your first Twitter bot:
What is Tweepy?
Before we start, let's briefly look at what this Python library that lets you interact with the Twitter API actually is. Tweepy is an open source project that provides a convenient and easy way to interact with the Twitter API using Python. For that purpose, Tweepy includes different classes and methods to represent Twitter's models and API endpoints. With these methods and classes you can encode and decode data, make HTTP requests, paginate search results and implement an OAuth authentication mechanism, among other things. With that said, let's start.
Tweepy installation
According to the official repository, the easiest way to install the latest Tweepy version is by using pip:
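```bash
pip install tweepy
```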
You can also use Git to clone the repository and install the latest Tweepy development branch:
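```bash
git clone https://github.com/tweepy/tweepy.git
cd tweepy
pip install .
```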
Finally, you can also install Tweepy directly from the GitHub repository:
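```bash
pip install git+https://github.com/tweepy/tweepy.git
```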
Authentication Credentials for Twitter API
First of all, you need to apply for a Twitter developer account. To do that, follow the steps described in the Twitter developer account support documentation.
Finally, once you complete your application, go to the developer portal dashboard to review your account status and setup. If it's successfully registered, the next step is to create your first app.
Create the application
Twitter issues authentication credentials to applications, not accounts, so you need to create an app to be able to make API requests. In this case, our app will be a bot that scrapes data from a website and publishes it as a tweet on your Twitter account.
To create your application, visit the developer portal dashboard, select +Add App, and provide the following information: app name, application description, website URL, and all related information about how users will use your application.
Authentication credentials
To create your authentication credentials, go to the developer portal dashboard, select your application, and then click on "Keys and tokens". Once you are on your project page, select "Generate API Key and Secret" and also "Access Token and Secret". Keep in mind that the access token should be created with read and write permissions, which guarantees that our bot can write tweets for you using the Twitter API. Don't forget to store the keys so you can use them later in our configuration file for Twitter authentication.
You may want to test your credentials with a short Python script.
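A minimal sketch, using Tweepy's OAuth 1.0a handler:

```python
import tweepy

# Replace the square-bracket fields with your own credentials
auth = tweepy.OAuthHandler("[API_KEY]", "[API_SECRET_KEY]")
auth.set_access_token("[ACCESS_TOKEN]", "[ACCESS_TOKEN_SECRET]")
api = tweepy.API(auth)

try:
    api.verify_credentials()
    print("Authentication OK")
except Exception:
    print("Error during authentication")
```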
Make sure you replace the square bracket fields with your credentials. Once you're done, we can continue with the next step.
Create your configuration file to authenticate our bot
Here's an explanation of the authentication script:
The script starts with a shebang, `#!/usr/bin/python3`, which specifies the interpreter to be used to execute the script. In this case, it's set to Python 3.

The script imports the necessary modules, `tweepy` and `os`. Alongside Tweepy, we'll use `os`, which allows interaction with the operating system, particularly to retrieve environment variables. We need that module because we will store our Twitter keys as environment variables.

The `create_api()` function is defined. This function is responsible for creating and returning an instance of the Tweepy API, which will be used to interact with the Twitter API.

Inside the `create_api()` function, the script retrieves several environment variables using `os.getenv()`. These environment variables are expected to contain the necessary Twitter API credentials: `API_KEY`, `API_SECRET_KEY`, `ACCESS_TOKEN`, and `ACCESS_TOKEN_SECRET`.

The script uses the credentials obtained from the environment variables to initialize an instance of `tweepy.OAuthHandler`. This class is responsible for handling the OAuth 1.0a authentication process required by the Twitter API.

The `auth.set_access_token()` method is called to set the access token and access token secret obtained from the environment variables.

An instance of `tweepy.API` is created, passing the `auth` object as an argument. This instance represents the authenticated connection to the Twitter API.

The script then attempts to verify the credentials by calling `api.verify_credentials()`. If the verification is successful, it prints "Authentication OK". Otherwise, it catches the exception raised, prints "Error during authentication", and re-raises the exception.

Finally, the created `api` object is returned from the `create_api()` function.

In summary, this script defines a function `create_api()` that creates and returns an authenticated instance of `tweepy.API`. It retrieves the necessary Twitter API credentials from environment variables, sets up the authentication, and verifies the credentials. This function can be used to establish a connection to the Twitter API for further interactions in your Twitter bot.
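Putting those pieces together, here's a minimal sketch of what the configuration module could look like (stored as `bots/config.py`, the module path used later in this post):

```python
#!/usr/bin/python3
# bots/config.py - build an authenticated Tweepy API instance
import os

import tweepy


def create_api():
    # Credentials are read from environment variables
    api_key = os.getenv("API_KEY")
    api_secret_key = os.getenv("API_SECRET_KEY")
    access_token = os.getenv("ACCESS_TOKEN")
    access_token_secret = os.getenv("ACCESS_TOKEN_SECRET")

    # OAuth 1.0a authentication handler
    auth = tweepy.OAuthHandler(api_key, api_secret_key)
    auth.set_access_token(access_token, access_token_secret)

    # Authenticated connection to the Twitter API
    api = tweepy.API(auth)
    try:
        api.verify_credentials()
        print("Authentication OK")
    except Exception:
        print("Error during authentication")
        raise
    return api
```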
Create your bot
Bot modules: `bot_v1.py` and `split_string.py`
Let's go through the script step by step:
The script imports the `random` module, which will be used to choose a random line from the file, and the `datetime` module from the standard library, which will be used to get the current date.

We also import the `create_api()` function that we created in the previous step. It is imported from the `bots.config` module and is responsible for creating an authenticated instance of `tweepy.API` that connects to the Twitter API.

Additionally, we import specific functions from the `babel.dates` and `babel.numbers` modules. These functions will be used for formatting the date and decimal numbers in a localized manner.

The script also imports the `split_string` function from the `utils.split_string` module. This function will be used to split the tweet into multiple parts if its length exceeds Twitter's character limit.

The `tweet_job(api)` function is defined. This function takes the `api` object (the authenticated instance of `tweepy.API`) as an argument. Inside the `tweet_job()` function, the script opens a file named `hoy_en_la_historia.txt` located at the specified path. It reads all the lines from the file and stores them in the `lines` list.

The script randomly selects a line from the `lines` list using `random.choice()` and assigns it to the `myline` variable.

We also get the current date using `datetime.now()` and format the date components separately: the day as a two-digit decimal number and the month as the full month name in Spanish, using the `format_decimal()` and `format_date()` functions from the Babel library, respectively.

The script creates a formatted date string by combining the day, the month, and a custom text. If the length of the tweet string `mystr` is less than or equal to 240 characters (the Twitter character limit), it tweets `mystr` directly using `api.update_status()` and prints the tweet.

If the length of the tweet string exceeds 240 characters, it splits the tweet using the `split_string()` function, which splits the string into two parts while considering word boundaries. It adds a marker to indicate the order of the tweets. We'll go through it in the next step.

The script tweets the first part of the split string using `api.update_status()` and assigns the tweet object to `original_tweet`. It then tweets the second part as a reply to the original tweet using `api.update_status()` with the `in_reply_to_status_id` parameter set to the ID of the original tweet.

The `main()` function is also defined; it calls the `create_api()` function to create an authenticated API instance and passes it to the `tweet_job()` function.

Finally, we use the entry point, which checks whether the current module is the main module via the `if __name__ == "__main__":` condition. If it is, it calls the `main()` function to start the execution of the script.

In summary, this script defines a function `tweet_job()` that reads lines from a file, selects a random line, formats the current date, and tweets the content. If the tweet exceeds the character limit, it splits the tweet into multiple parts. The script also defines a `main()` function that creates an authenticated API instance and calls `tweet_job()`. When executed as the main module, it sets everything in motion.
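Here's a minimal sketch of `bot_v1.py` consistent with the walkthrough above. The exact wording of the date string and the (1/2)/(2/2) order markers are assumptions for illustration; the file path comes from the scraper's output directory shown later in this post:

```python
#!/usr/bin/python3
# bots/bot_v1.py - pick a random historical fact and tweet it
import os
import random
from datetime import datetime

from babel.dates import format_date
from babel.numbers import format_decimal

from bots.config import create_api
from utils.split_string import split_string


def tweet_job(api):
    # Path taken from the scraper's output directory
    path = os.path.expanduser("~/tweepy_bot/scrapers/hoy_en_la_historia.txt")
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    # Pick one fact at random
    myline = random.choice(lines).strip()

    # Format the date in Spanish, e.g. "09 de julio"
    now = datetime.now()
    day = format_decimal(now.day, format="00", locale="es")
    month = format_date(now, "MMMM", locale="es")
    mystr = f"Hoy {day} de {month} en la historia: {myline}"

    if len(mystr) <= 240:
        api.update_status(mystr)
        print(mystr)
    else:
        # Split at a word boundary and post the rest as a threaded reply
        first, second = split_string(mystr)
        original_tweet = api.update_status(first + " (1/2)")
        api.update_status("(2/2) " + second,
                          in_reply_to_status_id=original_tweet.id)


def main():
    api = create_api()
    tweet_job(api)


if __name__ == "__main__":
    main()
```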
Let's check out the `split_string()` function
The `split_string()` function takes a string as input and splits it into two parts while ensuring that the total length of the resulting strings does not exceed a certain limit. Here's a breakdown of the function's logic:

If the length of the input string is greater than 234 characters (slightly below the Twitter character limit of 240), the function proceeds to split the string. The first 234 characters of the input string are assigned to the variable `first_string`.

The function searches for the last space within the first 234 characters using the `rfind()` method. This helps ensure that the split occurs at a word boundary. If a space is found within the first 234 characters, `first_string` is truncated at the last space, ensuring that it doesn't cut off words. The remaining characters from the input string, after the split point, are assigned to the variable `second_string`; any leading or trailing whitespace is stripped using the `strip()` method. If the length of the input string is less than or equal to 234 characters, the entire input string is assigned to `first_string`, and `second_string` is set to an empty string.

Finally, the function returns a tuple containing `first_string` and `second_string`.
In summary, the `split_string()` function splits a given string into two parts, with the first part having a maximum length of 234 characters (accounting for the Twitter character limit) while ensuring that the split occurs at a word boundary. It provides a convenient way to split long strings for tweeting purposes, maintaining readability and coherence.
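A minimal sketch of the function, following that logic:

```python
def split_string(text):
    """Split text into two parts at a word boundary, keeping the
    first part within 234 characters."""
    if len(text) <= 234:
        return text, ""

    first_string = text[:234]
    # Look for the last space so we don't cut a word in half
    split_point = first_string.rfind(" ")
    if split_point != -1:
        first_string = first_string[:split_point]
    else:
        split_point = 234

    # Whatever follows the split point becomes the second tweet
    second_string = text[split_point:].strip()
    return first_string, second_string
```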
Now let's create our data scraper using bash!
Data scraper
Here's a brief summary of the functionality of the bash script. The script takes a URL as an argument and assigns it to the `url` variable. It sets the directory path where the data will be stored in the `dir` variable. The script removes any existing `hoy_en_la_historia.txt` file in the specified directory. It retrieves data from the specified URL using `curl`, processes the HTML response using `htmlq` and `html2text` (make sure you install both packages first; follow the hyperlinks for the installation instructions), and stores the result in a temporary file called `data.txt`.
The script performs various transformations on the `data.txt` file using `sed`, `tr`, and `grep` to extract and format the desired data. Let's go through the commands used in the script one by one:

`url=$1`: This command assigns the first argument passed to the script to the variable `url`. It allows you to provide a URL as an argument when executing the script.

`dir="$HOME/tweepy_bot/scrapers"`: This command sets the directory path where the data will be stored, assigning the specified path to the variable `dir`.

`rm -rf "$dir"/hoy_en_la_historia.txt`: This command removes any existing `hoy_en_la_historia.txt` file in the specified directory `$dir`.

`echo $(curl --silent "$url" | htmlq --text | html2text) | tr -s ' ' | sed '/./G' > "$dir"/data.txt`: This command retrieves the HTML content from the specified URL using `curl`, processes it using `htmlq` and `html2text`, and saves the result in a temporary file called `data.txt`. The `echo` command and subsequent pipeline manipulate the text by squeezing repeated spaces and adding line breaks.

`sed -i -E 's/([0-9]{3,4} -)/\n\1/g' "$dir"/data.txt`: `([0-9]{3,4} -)` is the pattern that matches either a 3-digit or 4-digit sequence followed by a space and a dash. The captured group is then inserted into the replacement string `\n\1` to add a newline before the matched pattern.

`sed -i '/^\s*$/d' "$dir"/data.txt`: This command deletes empty lines in the file. `/^\s*$/` is a regular expression pattern that matches empty lines: `^` represents the start of a line, `\s*` matches zero or more whitespace characters, and `$` represents the end of a line. `d` is the sed command that deletes the matched lines.

`sed -i '$ s/\./.\n/' "$dir"/data.txt`: `$` matches the last line of the file. `s/\./.\n/` finds the first occurrence of a dot `\.` on the last line and replaces it with the dot followed by a newline `.\n`.

`grep -v -e 'See All' -e 'SHOW' -e 'Efemérides' "$dir"/data.txt | grep -vE '^.{,60}$' > "$dir"/hoy_en_la_historia.txt`: This command filters out lines in `data.txt` that contain certain keywords (See All, SHOW, Efemérides), removes lines that are 60 characters long or shorter, and writes the result to `hoy_en_la_historia.txt`.

`sed -i 's/ - /, /g' "$dir"/hoy_en_la_historia.txt`: This substitution command searches for the pattern "space-dash-space" ` - ` and replaces it with a comma and a space `, `.

`rm -rf "$dir"/data*`: This command removes all temporary files starting with `data` in the specified directory `$dir` to clean up our workspace.

The script saves the final formatted data into the `hoy_en_la_historia.txt` file, which is the file our bot reads its data from.
Now let's create the script that calls our main bot module, which we will automate with the crontab Linux utility.
Bot runner
Here's the script explanation:
The script imports the `main` function from the `bots.bot_v1` module, which is the function we defined when we created the bot.

The variable `maxtries` is set to 8, indicating the maximum number of attempts the script will make to execute the `main()` function. The script enters a loop that iterates `maxtries` times using the `range()` function. Within each iteration, the `main()` function is called. If the execution of the `main()` function is successful (no exception is raised), the loop is terminated using the `break` statement, and the script finishes.

If an exception occurs during the execution of the `main()` function, the script pauses for 900 seconds (15 minutes) using the `time.sleep()` function. After the pause, the loop continues to the next iteration, and the process repeats until either the `main()` function succeeds or the maximum number of tries is reached. The script prints "fail" followed by the iteration number `i`, indicating the failed attempt.

In summary, this Python script repeatedly calls the `main()` function from the `bots.bot_v1` module with a maximum number of tries. If an exception occurs, it waits for a specific duration before retrying. The script provides a mechanism to handle failures from the cronjob call and ensures that the `main()` function gets several chances to run within a specified time frame.
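A minimal sketch of `bot_runner.py` matching that description (whether "fail" is printed on every failed attempt or only the last one isn't specified, so here it's printed each time):

```python
#!/usr/bin/python3
# bot_runner.py - retry wrapper around the bot's main() function
import time

from bots.bot_v1 import main

maxtries = 8

for i in range(maxtries):
    try:
        main()
        break  # the tweet went out, stop retrying
    except Exception:
        print("fail", i)
        # Wait 15 minutes before the next attempt
        time.sleep(900)
```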
We are about to finish. Let's check the bash script that we will use to automate our bot with crontab.
Cron Tweet script
url="https://www.hoyenlahistoria.com/efemerides.php"
: This line assigns the website url to the variable url
. It specifies the website from which the scraper will fetch data.
cd $HOME/tweepy_bot || exit
: This line changes the current directory to out tweepy_bot
project workspace. If the directory change is unsuccessful (for example, if the directory doesn't exist), the script exits.
./scrapers/scraper.sh $url
: This line executes the shell script scraper.sh
located in the scrapers directory. The script takes the value of the url
variable as an argument and it will be the script that performs all text file operations that allows us to have a clean file to make out bot tweet for us in readability way.
python3 bot_runner.py
: This line executes the Python script bot_runner.py
. It runs the script responsible for running the bot, which performs the operations based on the data fetched by the scraper.
In summary, this script sets the URL, changes the directory to the appropriate location, executes the scraper script with the specified URL, and then executes the bot runner script to perform operations based on the scraped data.
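Putting those lines together, `cron_script.sh` looks like this (the shebang and comments are additions):

```bash
#!/bin/bash
# cron_script.sh - scrape today's facts, then run the bot

url="https://www.hoyenlahistoria.com/efemerides.php"

# Work from the project root, or bail out if it's missing
cd $HOME/tweepy_bot || exit

# Refresh hoy_en_la_historia.txt with today's events
./scrapers/scraper.sh $url

# Tweet one of the scraped facts (with retries)
python3 bot_runner.py
```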
Finally, let's check our cronjob file.
Crontab job
In order to automate our bot we'll use a cron job. The cron daemon is a built-in Linux utility that runs processes on your system at a scheduled time. Cron reads the crontab (cron tables) for predefined commands and scripts. By using a specific syntax, you can configure a cron job to schedule scripts or other commands to run automatically.
The cron daemon uses a specific syntax to interpret the lines in the crontab configuration file. Let's look at that syntax to understand the basic components that make it work.
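Schematically, each crontab line consists of five time fields followed by the command to run:

```
m h dom mon dow  command
│ │  │   │   │
│ │  │   │   └── day of week (0-6, Sunday = 0)
│ │  │   └────── month (1-12)
│ │  └────────── day of month (1-31)
│ └───────────── hour (0-23)
└─────────────── minute (0-59)
```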
Here you can find a detailed explanation of cron jobs in Linux, but for the purposes of our lab, I just want you to keep the file components in mind. Now let's check our cron file.
To edit this file, just type `crontab -e` in your terminal, which will open an editor with the default configuration. Just copy and paste the content at the end of the file. Make sure to replace your keys within the fields.
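A sketch of that content: the schedule line explained below, plus the credentials set as environment variables at the top of the crontab (an assumption, since cron jobs don't inherit your shell's variables; replace the bracketed fields with your own keys):

```
API_KEY=[API_KEY]
API_SECRET_KEY=[API_SECRET_KEY]
ACCESS_TOKEN=[ACCESS_TOKEN]
ACCESS_TOKEN_SECRET=[ACCESS_TOKEN_SECRET]

0 6 * * * $HOME/tweepy_bot/cron_script.sh
```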
Let's briefly check the crontab script:
The `m h dom mon dow` fields represent the minute, hour, day of the month, month, and day of the week, respectively. In this case, `0 6 * * *` indicates that the specified command should run at 6:00 AM every day. `$HOME/tweepy_bot/cron_script.sh` is the path to the shell script that contains the necessary commands to run the bot; in our case, it is the Cron Tweet script that we created previously.
In summary, the `crontab` file is configured to execute the `cron_script.sh` shell script at 6:00 AM every day, which in turn runs the necessary commands to automate the bot.
This is how our project tree should look, based on the paths we've used throughout this post:
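```
tweepy_bot/
├── bot_runner.py
├── cron_script.sh
├── bots/
│   ├── config.py
│   └── bot_v1.py
├── scrapers/
│   ├── scraper.sh
│   └── hoy_en_la_historia.txt
└── utils/
    └── split_string.py
```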
Now we're done with our first Twitter bot!
Conclusion
Creating a Twitter bot using Tweepy and the Twitter API has been a rewarding experience. Through the use of Tweepy's Python library and the authentication credentials provided by the Twitter developer account, I was able to automate the process of tweeting daily. By fetching data from a website, composing tweets, and formatting them appropriately, the bot script fulfilled the requirements of my school program. The implementation of Crontab jobs ensured the bot's automation, allowing for consistent and timely tweets. Overall, this project has not only deepened my understanding of APIs, data scraping, and automation but has also strengthened my skills as a software developer.
If you want to see the repo with all the configuration files, you can follow this link.