Extracting Blogs from URLs
We will show how to extract blog content from a URL using BeautifulSoup, together with a few LLM calls to identify the main article content and reformat it as markdown.
This content is heavily inspired by Jason Liu's document segmentation example.
Step 1: Line numbering
First, we'll use the `requests` library to fetch the content from a given URL and `BeautifulSoup` to extract all the text from the page.
After extracting the text, we'll add a number at the start of each line. This step is crucial as it allows us to ask the LLM to predict the start and end lines of the main article content.
```python
import requests
from bs4 import BeautifulSoup


def doc_with_lines(document):
    document_lines = document.split("\n")
    document_with_line_numbers = ""
    line2text = {}
    for i, line in enumerate(document_lines):
        document_with_line_numbers += f"[{i}] {line}\n"
        line2text[i] = line
    return document_with_line_numbers, line2text


text = requests.get("https://miniblog.ai/cookbooks/extract-blogs-from-url").text
soup = BeautifulSoup(text, "html.parser")
text = soup.get_text()
text_with_lines, line2text = doc_with_lines(text)
```
Step 2: Start and end line identification for the article
Normally, `text_with_lines` is small enough to be processed by an LLM in a single request.
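When a page is too long for a single request, one simple fallback (a sketch, not part of the original recipe; the chunk size and overlap values are illustrative) is to split the numbered lines into overlapping chunks and run the delimiter prediction on each chunk:

```python
def chunk_lines(document_lines, chunk_size=500, overlap=50):
    """Split a list of lines into overlapping chunks so a delimiter
    near a chunk boundary is still fully visible in one chunk.
    Assumes overlap < chunk_size."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(document_lines), step):
        chunks.append(document_lines[start:start + chunk_size])
        if start + chunk_size >= len(document_lines):
            break
    return chunks
```

Each chunk can then go through the same delimiter-prediction call; predictions that fall inside an overlap region can be deduplicated afterwards.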
```python
import instructor
import openai
from pydantic import BaseModel, Field

client = instructor.from_openai(openai.OpenAI())

# A minimal example system prompt for delimiter identification; adapt as needed.
system_prompt = (
    "You are given a document where each line is prefixed with its line number. "
    "Identify the title of the main article and the line numbers where it begins and ends."
)


class ArticleDelimiters(BaseModel):
    title: str = Field(description="title of the article")
    start_index: int = Field(description="line number where the article begins")
    end_index: int = Field(description="line number where the article ends")


article_delimiters = client.chat.completions.create(
    model="command-r-plus",
    response_model=ArticleDelimiters,
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": text_with_lines,
        },
    ],
)
```
Step 3: Get markdown-formatted text for the article
With the start and end lines identified, we can confidently extract the article content. After obtaining the relevant text, we make an additional LLM call to reformat the content.
```python
system_prompt = """
You are an expert in markdown formatting. You are given a document with line numbers and the content of the document.
You need to extract the article content between the start and end lines.
The returned article should be in markdown format, with headers for the sections, etc.
"""
```
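The extraction step itself can be sketched as follows, reusing the `line2text` mapping from Step 1 and the predicted delimiters from Step 2 (the helper name `extract_article_text` is our own, not from the original):

```python
def extract_article_text(line2text, start_index, end_index):
    """Join the lines between the predicted start/end delimiters back into text."""
    return "\n".join(
        line2text[i] for i in range(start_index, end_index + 1) if i in line2text
    )


# Toy example; in the real pipeline the mapping comes from doc_with_lines
# and the indices from article_delimiters.start_index / end_index.
lines = {0: "nav", 1: "Title", 2: "First paragraph.", 3: "footer"}
article = extract_article_text(lines, 1, 2)
```

The extracted text is then sent in a second `client.chat.completions.create` call, mirroring Step 2 but with the markdown-formatting `system_prompt` above.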
Additional steps: image extraction, tags, descriptions, etc.
Next, you can focus on extracting additional meta information from the article. This involves several steps:
- Image extraction:
  - Collect all images from the article using `href` and `a` tags in the HTML text (between the delimiters)
  - Filter images based on size criteria: images smaller than 100 KB are usually not from the blog post
  - Generate a description / title for those images using gpt-4o / gpt-4o-mini
- Tag generation:
  - Option 1: Recreate tags from scratch based on the article content
  - Option 2: Extract existing tags if available (may be simpler than recreating them)
- Description creation with an LLM
- Date handling:
  - For the article date, use the existing date in the web page if available, or fall back to a recent date
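The image-collection step can be sketched as follows; the file-extension filter is an illustrative heuristic, and the size filter would additionally need a HEAD request per URL (omitted here to keep the sketch offline):

```python
from bs4 import BeautifulSoup


def collect_image_urls(html):
    """Collect candidate image URLs from <img src> and <a href> attributes,
    keeping only URLs that look like image files."""
    soup = BeautifulSoup(html, "html.parser")
    urls = [img.get("src") for img in soup.find_all("img")]
    urls += [a.get("href") for a in soup.find_all("a")]
    image_exts = (".png", ".jpg", ".jpeg", ".webp", ".gif")
    return [u for u in urls if u and u.lower().endswith(image_exts)]
    # Size filtering (e.g. dropping files under ~100 KB) could then check the
    # Content-Length header of a HEAD request for each surviving URL.
```

Each surviving URL can then be passed to gpt-4o / gpt-4o-mini to generate a title and description.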