Extracting Blogs from URLs
We will show how to extract blog content from a URL using BeautifulSoup, together with a few LLM calls to identify the main article content and reformat it as markdown.
This content is heavily inspired by Jason Liu's document segmentation example.
Step 1: Line numbering
First, we'll use the `requests` library to fetch the content from a given URL and `BeautifulSoup` to extract all the text from the page.
After extracting the text, we'll add a number at the start of each line. This step is crucial as it allows us to ask the LLM to predict the start and end lines of the main article content.
```python
import requests
from bs4 import BeautifulSoup


def doc_with_lines(document):
    document_lines = document.split("\n")
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document_lines):
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line
    return document_with_line_numbers, line2text


text = requests.get("https://miniblog.ai/cookbooks/extract-blogs-from-url").text
soup = BeautifulSoup(text, "html.parser")
text = soup.get_text()
text_with_lines, line2text = doc_with_lines(text)
```
Step 2: Start and end line identification for the article
Normally, `text_with_lines` is small enough to be processed by an LLM in a single request.
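When a page is too long for a single request, one simple fallback (a sketch, not part of the original recipe; the chunk size and overlap values are illustrative) is to split the numbered lines into overlapping chunks and run the delimiter prediction on each chunk:

```python
def chunk_lines(document_lines, chunk_size=500, overlap=50):
    """Split a list of lines into overlapping chunks so a delimiter
    near a chunk boundary is still fully visible in one chunk.
    Assumes overlap < chunk_size."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(document_lines), step):
        chunks.append(document_lines[start:start + chunk_size])
        if start + chunk_size >= len(document_lines):
            break
    return chunks
```

Each chunk can then go through the same delimiter-prediction call; predictions that fall inside an overlap region can be deduplicated afterwards.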
```python
import instructor
import openai
from pydantic import BaseModel, Field

client = instructor.from_openai(openai.OpenAI())

# A minimal example system prompt for delimiter identification; adapt as needed.
system_prompt = (
    "You are given a document where each line is prefixed with its line number. "
    "Identify the title of the main article and the line numbers where it begins and ends."
)


class ArticleDelimiters(BaseModel):
    title: str = Field(description="title of the article")
    start_index: int = Field(description="line number where the article begins")
    end_index: int = Field(description="line number where the article ends")


article_delimiters = client.chat.completions.create(
    model="command-r-plus",
    response_model=ArticleDelimiters,
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": text_with_lines,
        },
    ],
)
```
Step 3: Get markdown-formatted text for the article
With the start and end lines identified, we can confidently extract the article content. After obtaining the relevant text, we make an additional LLM call to reformat the content.
```python
system_prompt = """
You are an expert in markdown formatting. You are given a document with line numbers and the content of the document.
You need to extract the article content between the start and end lines.
The returned article should be in markdown format, with headers for the sections, etc.
"""
```
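The extraction step itself can be sketched as follows, reusing the `line2text` mapping from Step 1 and the predicted delimiters from Step 2 (the helper name `extract_article_text` is our own, not from the original):

```python
def extract_article_text(line2text, start_index, end_index):
    """Join the lines between the predicted start/end delimiters back into text."""
    return "\n".join(
        line2text[i] for i in range(start_index, end_index + 1) if i in line2text
    )


# Toy example; in the real pipeline the mapping comes from doc_with_lines
# and the indices from article_delimiters.start_index / end_index.
lines = {0: "nav", 1: "Title", 2: "First paragraph.", 3: "footer"}
article = extract_article_text(lines, 1, 2)
```

The extracted text is then sent in a second `client.chat.completions.create` call, mirroring Step 2 but with the markdown-formatting `system_prompt` above.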
Additional steps: image extraction, tags, descriptions, etc.
Next, you can focus on extracting additional meta information from the article. This involves several steps:
- Image extraction:
  - Collect all images from the article using `href` and `a` tags in the HTML text (between the delimiters)
  - Filter images based on size criteria: images smaller than 100 KB are usually not from the blog post
  - Generate a description / title for those images using gpt-4o / gpt-4o-mini
- Tag generation:
  - Option 1: Recreate tags from scratch based on the article content
  - Option 2: Extract existing tags if available (may be simpler than recreating them)
- Description creation with an LLM
- Date handling:
  - For the article date, use the existing date in the web page if available, or fall back to a recent date
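The image-collection step can be sketched as follows; the file-extension filter is an illustrative heuristic, and the size filter would additionally need a HEAD request per URL (omitted here to keep the sketch offline):

```python
from bs4 import BeautifulSoup


def collect_image_urls(html):
    """Collect candidate image URLs from <img src> and <a href> attributes,
    keeping only URLs that look like image files."""
    soup = BeautifulSoup(html, "html.parser")
    urls = [img.get("src") for img in soup.find_all("img")]
    urls += [a.get("href") for a in soup.find_all("a")]
    image_exts = (".png", ".jpg", ".jpeg", ".webp", ".gif")
    return [u for u in urls if u and u.lower().endswith(image_exts)]
    # Size filtering (e.g. dropping files under ~100 KB) could then check the
    # Content-Length header of a HEAD request for each surviving URL.
```

Each surviving URL can then be passed to gpt-4o / gpt-4o-mini to generate a title and description.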