Today, I had an interesting exercise with ChatGPT: writing Python code to preserve the contents of one of my old blog sites at www.example.com/blogs before it gets retired. I wanted to convert all the blog posts into fully formatted PDFs without writing a single line of code myself. The goal was to produce PDFs that contained real text, formatting, and images, not just screenshots, so they would remain searchable and copyable in the future. For the next two hours, I worked with ChatGPT in a clear, interactive dialogue.

The Journey Began

We started by fetching blog post titles and URLs from the website, using Python to scrape the listing pages and organize the data into a CSV file. This was our first program. At first, ChatGPT struggled to read the HTML directly from the internet, so I pasted the page's source code into the chat. That let ChatGPT analyze the markup and generate the correct code to write every post's title and URL into a CSV file.

Adding Complexity

The next step was to write a second program that would read the CSV file, fetch the HTML content for each URL, and convert each page into a PDF. ChatGPT suggested using wkhtmltopdf for this task. After downloading and installing the executable, I added its folder to my system path. The aim was to preserve the posts' links, images, and formatting, keeping the PDFs true to their online form.
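
If you would rather not modify the system PATH, pdfkit can also be pointed at the wkhtmltopdf executable directly through its configuration object. Here is a minimal sketch; the installation path and the sample URL are only examples and will vary by setup:

# A minimal sketch: point pdfkit at wkhtmltopdf explicitly instead of relying on PATH.
# The executable path below is typical for a default Windows install; adjust as needed.
import pdfkit

config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
pdfkit.from_url('https://www.example.com/blogs/sample-post/',
                'sample-post.pdf',
                configuration=config,
                options={'encoding': 'UTF-8'})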

Overcoming Hurdles

Early versions of the code ran into problems such as formatting issues and errors during PDF generation. By pasting the error messages back into the chat, I gave ChatGPT what it needed to refine the code in subsequent iterations. Each exchange improved the approach, demonstrating the iterative nature of coding and problem-solving.

Final Touches

Eventually, we enhanced the PDFs by setting each file's creation date to the original blog post's publication date, adding an authentic touch to the digital documents.
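
As an aside, if you only need a cross-platform timestamp rather than the Windows creation date, a minimal sketch using os.utime could look like the following (the filename and date are placeholders). os.utime only changes access and modification times, which is why the full script further below resorts to a PowerShell command for the creation date:

# A minimal sketch, assuming the modification/access times are enough.
# os.utime is cross-platform but cannot change the creation time itself.
import datetime
import os

def set_file_times(filepath, post_date):
    # Convert the post's publication date to a Unix timestamp and apply it
    # to both the access and modification times of the generated PDF.
    timestamp = post_date.timestamp()
    os.utime(filepath, (timestamp, timestamp))

set_file_times('sample-post.pdf', datetime.datetime(2020, 5, 17, 10, 30))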

Tools of the Trade

Throughout this journey, several Python packages were our tools:
  • Beautiful Soup for HTML parsing.
  • Pandas for data management.
  • pdfkit/wkhtmltopdf for web to PDF conversion.
  • Requests for web requests.
  • Subprocess for system operations in Python.

Wrap-Up

In 17 interactions, we went from concept to a working script capable of turning any blog post into a well-crafted PDF. This experience wasn’t just about technical success; it was about leveraging AI as a partner in the creative process, making complex tasks approachable and manageable.

Here’s the flowchart that outlines our process:

  1. Start Process
  2. Fetch all blog listing pages and generate CSV with titles and URLs of all posts
  3. Read URLs from CSV file
  4. Fetch the web page for each URL using requests
  5. Extract blog title and publication date from HTML
  6. Convert each HTML to PDF using WKHTMLtoPDF
  7. Save PDF with blog title as filename
  8. Set PDF creation date to match blog post date
  9. Handle errors during URL fetch or PDF conversion
  10. Log successes and failures
  11. End Process

This experience shows how, with clear guidance and a bit of patience, you can use Generative AI to handle even complex coding tasks, transforming an idea into a practical solution in a few hours.

FetchAllPosts_TitlesAndURL.py

# pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup
import csv

# URL format to iterate over pages
base_url = "https://www.example.com/blogs/?paged={}"

# Function to fetch and parse a single page
def fetch_page(page_number):
    try:
        url = base_url.format(page_number)
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Surface HTTP errors so the except block can report them
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except requests.exceptions.RequestException as e:
        print(f"Error fetching page {page_number}: {e}")
        return None

# Function to extract headings and URLs from a given BeautifulSoup object
def extract_posts(soup):
    if soup is None:
        return []
    articles = soup.find_all('article')
    posts = []
    for article in articles:
        headline = article.find('h1', class_='entry-title')
        if headline and headline.a:
            posts.append((headline.get_text(strip=True), headline.a['href']))
    return posts

# List to store all posts across all pages
all_posts = []

# Iterate over each page
for page_number in range(1, 17):  # 16 pages total
    soup = fetch_page(page_number)
    posts = extract_posts(soup)
    all_posts.extend(posts)
    print(f"Page {page_number} processed, found {len(posts)} posts.")

# Saving to CSV file
csv_filename = 'Example_Blogs_Posts.csv'  # Saved to the current working directory
with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'URL'])  # Writing header
    writer.writerows(all_posts)

print(f"All posts have been extracted and saved to {csv_filename}.")
 

SaveAllURLs_As_PDF.py


# pip install pandas pdfkit
# pip install pytz
# Install wkhtmltopdf from wkhtmltopdf.org and add its folder to your system PATH

import pandas as pd
import pdfkit
import requests
import os
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import datetime
import subprocess
import logging

def sanitize_filename(title):
    # Keep only characters that are safe to use in a filename.
    import string
    valid_chars = f"-_.() {string.ascii_letters}{string.digits}"
    return ''.join(c for c in title if c in valid_chars)

# Build a requests session that automatically retries transient server errors.
def requests_retry_session(retries=3, backoff_factor=0.3, status_forcelist=(500, 502, 504), session=None):
    session = session or requests.Session()
    retry = Retry(total=retries, read=retries, connect=retries, backoff_factor=backoff_factor, status_forcelist=status_forcelist, allowed_methods=frozenset(['GET', 'POST']))
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def read_data_from_csv(csv_file):
    try:
        data = pd.read_csv(csv_file)
        logging.info(f"Read {len(data)} entries from {csv_file}")
        return data
    except Exception as e:
        logging.error(f"Failed to read CSV file: {e}")
        return pd.DataFrame()

def fetch_and_parse_html(url):
    # Fetch a post and extract its title and publication date from the HTML.
    session = requests_retry_session()
    try:
        response = session.get(url, timeout=60)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.find('h1', class_='entry-title').get_text(strip=True)
        date_string = soup.find('time', class_='entry-date published')['datetime']
        date = datetime.datetime.strptime(date_string, '%Y-%m-%dT%H:%M:%S%z')
        return title, date
    except Exception as e:
        logging.error(f"Failed to fetch or parse HTML from {url}: {e}")
        return "Untitled", datetime.datetime.now()

def convert_to_pdf(url, output_directory, default_title):
    # Convert a single blog post URL to a PDF named after the post's title.
    title, date = fetch_and_parse_html(url)
    filename = sanitize_filename(title if title != "Untitled" else default_title) + '.pdf'
    output_path = os.path.join(output_directory, filename)
    options = {
        'page-size': 'Letter',
        'minimum-font-size': 12,
        'encoding': "UTF-8",
        'custom-header': [('User-Agent', 'Mozilla/5.0')],
        'no-outline': None
    }
    try:
        logging.info(f"Converting {url} to PDF as {output_path}")
        pdfkit.from_url(url, output_path, options=options)
        check_pdf_content(output_path)
        # Only stamp the creation date if the PDF was actually generated.
        set_file_creation_date(output_path, date)
    except Exception as e:
        logging.error(f"Failed to convert {url} to PDF: {e}")

def check_pdf_content(filepath):
    if os.path.getsize(filepath) == 0:
        logging.warning(f"Generated PDF is empty: {filepath}")

def set_file_creation_date(filepath, creation_date):
    # Windows-only: use PowerShell to set the file's creation time to the post's publication date.
    formatted_date = creation_date.strftime('%Y-%m-%d %H:%M:%S')
    command = f'(Get-Item "{filepath}").creationtime=$(Get-Date "{formatted_date}")'
    subprocess.run(['powershell', '-command', command], check=True)

def main():
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    csv_filename = 'Example_Blogs_Posts.csv'
    output_directory = 'pdf_output'
    data = read_data_from_csv(csv_filename)
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
    for index, row in data.iterrows():
        default_title = f'Untitled_{index}'
        convert_to_pdf(row['URL'], output_directory, row.get('Title', default_title))

if __name__ == "__main__":
    main()
    logging.info("All pages have been processed and saved as PDFs.")
