Automating Dam Data Scraping and Storage with Python and GitHub Actions

In this tutorial, we'll walk through the process of automating the scraping of dam data from the Kerala State Electricity Board (KSEB) website. We'll use Python for scraping and GitHub Actions for automation. This project lets us collect daily updates on dam levels and store them for further analysis, visualization, or use as an API endpoint.

P.S.: You can get the entire code here.

Introduction

The goal of this project is to scrape dam data from a specific website and store it in a JSON file. We will use Python for the scraping and data processing, and GitHub Actions to automate the process. This ensures that the data is updated regularly without manual intervention.

Python Script for Scraping Data

Libraries Used

We use the following libraries in our Python script:

  • requests: To make HTTP requests to the website.

  • BeautifulSoup from bs4: To parse the HTML content of the webpage.

  • json: To handle JSON data.

  • datetime: To get the current date.

Scraping Function

The scrape_dam_data function is responsible for scraping the data from the website. Let's break down the function step by step.

import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime

def scrape_dam_data():
    date = datetime.today().strftime('%d.%m.%Y')
    print(date)

    try:
        page_url = "https://dams.kseb.in/?page_id=45"
        response = requests.get(page_url)
        response.raise_for_status()  # Raise an exception for bad status codes

        soup = BeautifulSoup(response.content, 'html.parser')
        article = soup.find('article')
        if not article:
            raise ValueError("Article not found")

        date_link = article.find('h3').find('a')
        if not date_link:
            raise ValueError("Date link not found")

        link = date_link['href']
        print(f"Date: {date}, Link: {link}")

        if not link:
            raise ValueError(f"No data found for {date}")

        data_response = requests.get(link)
        data_response.raise_for_status()  # Raise an exception for bad status codes

        data_soup = BeautifulSoup(data_response.content, 'html.parser')
        table = data_soup.find('table')
        if not table:
            raise ValueError("Data table not found")

        rows = table.find_all('tr')
        if len(rows) < 2:  # Check if table has at least two rows (header and data)
            raise ValueError("No rows found")

        header = [th.text.strip().encode('ascii', 'ignore').decode('ascii') for th in rows[0].find_all('td')]
        data = []
        for row in rows[2:]:  # skip the first two rows (header rows)
            cols = row.find_all('td')
            cols = [col.text.strip().replace(u'\u00a0', ' ') for col in cols]
            data.append(dict(zip(header, cols)))

        json_data = json.dumps(data, indent=4)
        print(json_data)

        return {
            "date": date,
            "data": data
        }
    except Exception as e:
        print(f"Error occurred: {e}")
        return {
            "date": date,
            "error": str(e)
        }

Explanation

  1. Fetching the Webpage: We use requests.get to fetch the webpage content. The raise_for_status method ensures that an exception is raised for any HTTP error codes.

  2. Parsing the HTML: We use BeautifulSoup to parse the HTML content and locate the relevant article and date link.

  3. Fetching Data: We follow the link to the data page, fetch the content, and parse it to find the data table.

  4. Processing the Table: We extract the header and data rows from the table, clean the data, and store it in a list of dictionaries.

  5. Returning the Data: The function returns a dictionary containing the date and the scraped data.
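
Before wiring up the automation, it can be useful to run the scraper once by hand. A minimal check might look like this (assuming the function lives in scrape.py, which is what the workflow below expects):

from scrape import scrape_dam_data

# Run the scraper once and report whether it succeeded.
result = scrape_dam_data()
if "error" in result:
    print(f"Scrape failed on {result['date']}: {result['error']}")
else:
    print(f"Scraped {len(result['data'])} rows on {result['date']}")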

Saving Data to JSON

The save_to_json function saves the scraped data to a JSON file.

def save_to_json(data, filename):
    try:
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)
    except Exception as e:
        print(f"Error saving to JSON: {e}")

Explanation

  1. Opening the File: We open the file in write mode.

  2. Writing the Data: We use json.dump to write the data to the file with an indentation of 2 spaces for readability.

  3. Error Handling: Any exceptions during the file operation are caught and printed.
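
To double-check what was written, the file can be read back with the same json module. A small sketch (the exact fields inside data depend on the layout of the KSEB table):

import json

# Load the saved file and print a quick summary.
with open("dam_data.json") as f:
    saved = json.load(f)

print("Report date:", saved["date"])
print("Rows scraped:", len(saved.get("data", [])))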

Main Function

The main function orchestrates the scraping and saving process.

if __name__ == "__main__":
    dam_data = scrape_dam_data()
    if dam_data and "error" not in dam_data:  # the scraper returns an error key on failure
        filename = "dam_data.json"
        save_to_json(dam_data, filename)
        print(f"Data saved to {filename}")
    else:
        print("Failed to scrape data")

Explanation

  1. Scraping the Data: We call scrape_dam_data to get the data.

  2. Saving the Data: If the scrape succeeds (the returned dictionary contains no error key), we save it to dam_data.json.

  3. Error Handling: If scraping fails, an error message is printed.

GitHub Actions Workflow

Workflow Configuration

The GitHub Actions workflow is defined in a YAML file. This workflow schedules the scraping script to run at specific times and commits the updated data to the repository.

name: Scrape Dam Data

on:
  schedule:
    - cron: '30 23 * * *'  # 5:00 AM IST
    - cron: '0 2 * * *'    # 7:30 AM IST
    - cron: '30 2 * * *'   # 8:00 AM IST
    - cron: '30 4 * * *'   # 10:00 AM IST
    - cron: '02 5 * * *'   # 10:32 AM IST
    - cron: '03 5 * * *'   # 10:33 AM IST
    - cron: '04 5 * * *'   # 10:34 AM IST
    - cron: '05 5 * * *'   # 10:35 AM IST
    - cron: '06 5 * * *'   # 10:36 AM IST
    - cron: '30 5 * * *'   # 11:00 AM IST
    - cron: '30 7 * * *'   # 1:00 PM IST
    - cron: '0 12 * * *'   # 5:30 PM IST

  workflow_dispatch:  # Allow manual triggering

Explanation

  1. Schedule: The schedule key defines when the workflow runs, using cron syntax. GitHub Actions evaluates these expressions in UTC, so each entry is offset by 5 hours 30 minutes from the intended IST time (see the conversion sketch after this list). The times are chosen so the data is refreshed several times a day.

  2. Manual Trigger: The workflow_dispatch key allows the workflow to be triggered manually from the GitHub Actions tab.
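
Because the schedule runs in UTC, it is easy to get the offsets wrong. A throwaway sketch (not part of the project) to verify a conversion:

from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))

# '30 23 * * *' fires at 23:30 UTC, which is 05:00 IST the next morning.
utc_run = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
print(utc_run.astimezone(IST).strftime('%I:%M %p IST'))  # 05:00 AM IST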

Job Steps

The scrape job defines the steps to execute the scraping script and commit the changes.

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install requests beautifulsoup4
    - name: Run scraper
      run: python scrape.py
    - name: Commit and push if changed
      run: |
        git config --local user.email "action@github.com"
        git config --local user.name "GitHub Action"
        git add dam_data.json
        git diff --quiet && git diff --staged --quiet || (git commit -m "Update dam data" && git push)

Explanation

  1. Checkout Repository: The actions/checkout@v4 action checks out the repository.

  2. Set Up Python: The actions/setup-python@v5 action sets up a Python environment.

  3. Install Dependencies: We install the required Python packages (requests and beautifulsoup4).

  4. Run Scraper: We run the scrape.py script to scrape the data.

  5. Commit and Push: If dam_data.json has changed, we commit the updated file and push it to the repository. The chained git diff checks mean the step simply does nothing, rather than failing, on runs where the data is unchanged.

How It All Works Together

  1. The GitHub Actions workflow is triggered according to the schedule or manually.

  2. It sets up the environment and runs the scrape.py script.

  3. The script fetches the latest dam data from the KSEB website.

  4. The data is saved to dam_data.json.

  5. If there are changes to the JSON file, it's committed and pushed to the repository.

This automation ensures that we have up-to-date dam data stored in our repository, which can be used for further analysis, visualization, or as an API endpoint.
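
For example, because the JSON file is committed to the repository, it can be fetched over GitHub's raw file URL and treated as a read-only endpoint. The URL below is a placeholder; substitute your own username and repository name:

import requests

# Placeholder URL: replace <user> and <repo> with your own GitHub account and
# repository; this assumes the default branch is named main.
RAW_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/dam_data.json"

response = requests.get(RAW_URL)
response.raise_for_status()
dam_data = response.json()
print(dam_data["date"], "-", len(dam_data["data"]), "rows")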

Conclusion

In this blog post, we have walked through a project that automates the scraping and storage of dam data using Python and GitHub Actions. We covered the Python script for scraping and processing the data, and the GitHub Actions workflow for automating the process. This setup ensures that the data is updated regularly without manual intervention, making it a robust solution for data collection and storage.