In this tutorial, we'll walk through automating the scraping of dam data from the Kerala State Electricity Board (KSEB) website. We'll use Python for scraping and GitHub Actions for automation. This project lets us collect daily updates on dam levels and store them for further analysis, visualization, or use as a simple API endpoint.
PS: You can get the entire code here.
Introduction
The goal of this project is to scrape dam data from a specific website and store it in a JSON file. We will use Python for the scraping and data processing, and GitHub Actions to automate the process. This ensures that the data is updated regularly without manual intervention.
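The project itself is small: a scraper script, the JSON file it produces, and a GitHub Actions workflow file. A rough layout looks like this (the workflow file name is an assumption; any name under .github/workflows/ works):
.
├── scrape.py           # the Python scraper walked through below
├── dam_data.json       # latest scraped data, committed by the workflow
└── .github/
    └── workflows/
        └── scrape.yml  # the GitHub Actions workflow (file name assumed)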
Python Script for Scraping Data
Libraries Used
We use the following libraries in our Python script:
requests: To make HTTP requests to the website.
BeautifulSoup (from bs4): To parse the HTML content of the webpage.
json: To handle JSON data.
datetime: To get the current date.
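json and datetime are part of the Python standard library. If you want to run the script locally, the two third-party packages can be installed with pip, mirroring the install step the workflow uses later:
python -m pip install requests beautifulsoup4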
Scraping Function
The scrape_dam_data function is responsible for scraping the data from the website. Let's break down the function step by step.
import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime
def scrape_dam_data():
    date = datetime.today().strftime('%d.%m.%Y')
    print(date)
    try:
        page_url = "https://dams.kseb.in/?page_id=45"
        response = requests.get(page_url)
        response.raise_for_status()  # Raise an exception for bad status codes
        soup = BeautifulSoup(response.content, 'html.parser')

        article = soup.find('article')
        if not article:
            raise ValueError("Article not found")

        date_link = article.find('h3').find('a')
        if not date_link:
            raise ValueError("Date link not found")

        link = date_link['href']
        print(f"Date: {date}, Link: {link}")
        if not link:
            raise ValueError(f"No data found for {date}")

        data_response = requests.get(link)
        data_response.raise_for_status()  # Raise an exception for bad status codes
        data_soup = BeautifulSoup(data_response.content, 'html.parser')

        table = data_soup.find('table')
        if not table:
            raise ValueError("Data table not found")

        rows = table.find_all('tr')
        if len(rows) < 2:  # Check if table has at least two rows (header and data)
            raise ValueError("No rows found")

        header = [th.text.strip().encode('ascii', 'ignore').decode('ascii') for th in rows[0].find_all('td')]

        data = []
        for row in rows[2:]:  # skip the header rows (rows[0] and rows[1])
            cols = row.find_all('td')
            cols = [col.text.strip().replace(u'\u00a0', ' ') for col in cols]
            data.append(dict(zip(header, cols)))

        json_data = json.dumps(data, indent=4)
        print(json_data)

        return {
            "date": date,
            "data": data
        }
    except Exception as e:
        print(f"Error occurred: {e}")
        return {
            "date": date,
            "error": str(e)
        }
Explanation
Fetching the Webpage: We use requests.get to fetch the webpage content. The raise_for_status method ensures that an exception is raised for any HTTP error codes.
Parsing the HTML: We use BeautifulSoup to parse the HTML content and locate the relevant article and date link.
Fetching Data: We follow the link to the data page, fetch the content, and parse it to find the data table.
Processing the Table: We extract the header and data rows from the table, clean the data, and store it in a list of dictionaries.
Returning the Data: The function returns a dictionary containing the date and the scraped data.
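For reference, the returned dictionary has roughly the shape sketched below. The column names and values here are only placeholders; the real keys come from whatever header row the KSEB table uses on a given day.
# Illustrative shape only -- keys and values are placeholders, not real readings.
example_result = {
    "date": "20.08.2024",
    "data": [
        {"Dam": "Idukki", "Water Level (ft)": "2374.50"},
        {"Dam": "Idamalayar", "Water Level (ft)": "161.20"},
    ],
}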
Saving Data to JSON
The save_to_json function saves the scraped data to a JSON file.
def save_to_json(data, filename):
    try:
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)
    except Exception as e:
        print(f"Error saving to JSON: {e}")
Explanation
Opening the File: We open the file in write mode.
Writing the Data: We use json.dump to write the data to the file with an indentation of 2 spaces for readability.
Error Handling: Any exceptions during the file operation are caught and printed.
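A small optional hardening: because the workflow commits dam_data.json, an interrupted write could leave a truncated file behind. Writing to a temporary file and then swapping it into place avoids that. This is a sketch of an alternative, not the function used in the project:
import json
import os

def save_to_json_atomic(data, filename):
    # Hypothetical variant: write to a temporary file first, then swap it
    # into place so a failed run never leaves a half-written JSON file.
    tmp_path = filename + ".tmp"
    try:
        with open(tmp_path, 'w') as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, filename)  # single-step rename over the old file
    except Exception as e:
        print(f"Error saving to JSON: {e}")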
Main Function
The main function orchestrates the scraping and saving process.
if __name__ == "__main__":
    dam_data = scrape_dam_data()
    if dam_data:
        filename = "dam_data.json"
        save_to_json(dam_data, filename)
        print(f"Data saved to {filename}")
    else:
        print("Failed to scrape data")
Explanation
Scraping the Data: We call scrape_dam_data to get the data.
Saving the Data: If data is successfully scraped, we save it to dam_data.json.
Error Handling: If scraping fails, an error message is printed.
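Since each run overwrites dam_data.json, one optional extension is to also keep a running history keyed by date. A minimal sketch, assuming one entry per date in a separate file (the file name and helper below are hypothetical):
import json
import os

def append_to_history(entry, filename="dam_history.json"):
    # Hypothetical helper: maintain one entry per date in a history file.
    history = []
    if os.path.exists(filename):
        with open(filename) as f:
            history = json.load(f)
    # Drop any earlier entry for the same date before appending the new one.
    history = [e for e in history if e.get("date") != entry.get("date")]
    history.append(entry)
    with open(filename, 'w') as f:
        json.dump(history, f, indent=2)
It could be called right after save_to_json in the main block, e.g. append_to_history(dam_data).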
GitHub Actions Workflow
Workflow Configuration
The GitHub Actions workflow is defined in a YAML file. This workflow schedules the scraping script to run at specific times and commits the updated data to the repository.
name: Scrape Dam Data
on:
  schedule:
    - cron: '30 23 * * *'  # 5:00 AM IST
    - cron: '0 2 * * *'    # 7:30 AM IST
    - cron: '30 2 * * *'   # 8:00 AM IST
    - cron: '30 4 * * *'   # 10:00 AM IST
    - cron: '02 5 * * *'   # 10:32 AM IST
    - cron: '03 5 * * *'   # 10:33 AM IST
    - cron: '04 5 * * *'   # 10:34 AM IST
    - cron: '05 5 * * *'   # 10:35 AM IST
    - cron: '06 5 * * *'   # 10:36 AM IST
    - cron: '30 5 * * *'   # 11:00 AM IST
    - cron: '30 7 * * *'   # 1:00 PM IST
    - cron: '0 12 * * *'   # 5:30 PM IST
    - cron: '30 23 * * *'  # 5:00 AM IST (duplicate of the first entry)
  workflow_dispatch:  # Allow manual triggering
Explanation
Schedule: The schedule key defines the times at which the workflow should run using cron syntax. Note that GitHub Actions evaluates cron expressions in UTC, so each entry is 5 hours 30 minutes behind the IST time noted in its comment; see the example after this list. The times are set to ensure the data is updated multiple times a day.
Manual Trigger: The workflow_dispatch key allows the workflow to be triggered manually from the GitHub Actions tab.
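For instance, a run at 6:00 AM IST corresponds to 00:30 UTC, so it would be scheduled like this (an illustrative entry, not one of the times used above):
schedule:
  - cron: '30 0 * * *'  # 00:30 UTC = 6:00 AM IST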
Job Steps
The scrape job defines the steps to execute the scraping script and commit the changes.
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install requests beautifulsoup4
      - name: Run scraper
        run: python scrape.py
      - name: Commit and push if changed
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add dam_data.json
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update dam data" && git push)
Explanation
Checkout Repository: The actions/checkout@v4 action checks out the repository.
Set Up Python: The actions/setup-python@v2 action sets up a Python environment.
Install Dependencies: We install the required Python packages (requests and beautifulsoup4).
Run Scraper: We run the scrape.py script to scrape the data.
Commit and Push: If the data has changed, we commit the updated dam_data.json file and push the changes to the repository.
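One practical note: the push in the last step relies on the workflow's GITHUB_TOKEN having write access to the repository contents. Whether that is granted by default depends on the repository's workflow permission settings; if the push is rejected, adding an explicit permissions block to the job is one way to grant it (this block is an illustrative addition, not part of the workflow above):
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # let the workflow's GITHUB_TOKEN push the updated JSON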
How It All Works Together
The GitHub Actions workflow is triggered according to the schedule or manually.
It sets up the environment and runs the scrape.py script.
The script fetches the latest dam data from the KSEB website.
The data is saved to dam_data.json.
If there are changes to the JSON file, it is committed and pushed to the repository.
This automation ensures that we have up-to-date dam data stored in our repository, which can be used for further analysis, visualization, or as a simple API endpoint, as sketched below.
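Because dam_data.json is committed to the repository, it can be fetched over HTTP much like a read-only API, for example through GitHub's raw content URL. A minimal sketch; the username, repository, and branch below are placeholders you would substitute with your own:
import requests

# Placeholder URL -- replace <username>, <repository>, and the branch name with your own.
RAW_URL = "https://raw.githubusercontent.com/<username>/<repository>/main/dam_data.json"

response = requests.get(RAW_URL)
response.raise_for_status()
dam_data = response.json()
print(dam_data["date"], "-", len(dam_data.get("data", [])), "records")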
Conclusion
In this blog post, we have walked through a project that automates the scraping and storage of dam data using Python and GitHub Actions. We covered the Python script for scraping and processing the data, and the GitHub Actions workflow for automating the process. This setup ensures that the data is updated regularly without manual intervention, making it a robust solution for data collection and storage.