The challenge of web scraping often revolves around the repetitive task of writing custom code for each target website. Whether dealing with Medium, Substack, or independent blogs, developers typically maintain separate handlers for different platforms. PluckMD, an open-source tool, reimagines this process by generating scraping logic dynamically at runtime rather than relying on predefined configurations.
A data-driven approach to extraction
Instead of embedding domain-specific rules into the codebase, PluckMD introduces a concept called AdapterSpec, a simple data structure that defines how to locate and extract content from any webpage. This structure includes three key specifications (listing, article extraction, and pagination) plus an evidence field that records what supports the spec. Each specification is designed to be flexible, so the same configuration can be generated through heuristics, large language models (LLMs), or even manual input.
The AdapterSpec interface looks like this:
    interface AdapterSpec {
      listing: ListingExtractionSpec; // Locates article links on a page
      article: ArticleExtractionSpec; // Extracts the main content body
      pagination: PaginationSpec; // Handles multi-page content
      evidence: string; // Validation evidence
    }

Because the AdapterSpec is purely data, it standardizes the way different sources and generation methods contribute to the scraping process. This uniformity ensures that whether a developer, an algorithm, or an LLM generates the spec, the downstream systems process it identically.
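To make the "purely data" point concrete, here is a minimal sketch of what a filled-in spec might look like. The article only names the four top-level fields, so every field inside the sub-specs (linkSelector, urlPattern, bodySelector, kind, nextSelector) is a hypothetical placeholder rather than PluckMD's actual schema.

```typescript
// Hypothetical sub-spec shapes: the article names these types but does not
// show their fields, so everything inside them is an assumption.
interface ListingExtractionSpec {
  linkSelector: string; // CSS selector for article links (assumed field)
  urlPattern: string; // expected URL shape, e.g. "/blog/*" (assumed field)
}

interface ArticleExtractionSpec {
  bodySelector: string; // CSS selector for the main content (assumed field)
}

interface PaginationSpec {
  kind: "none" | "next-link"; // assumed field
  nextSelector?: string; // assumed field
}

interface AdapterSpec {
  listing: ListingExtractionSpec;
  article: ArticleExtractionSpec;
  pagination: PaginationSpec;
  evidence: string;
}

// An illustrative spec for an imaginary blog.
const exampleSpec: AdapterSpec = {
  listing: { linkSelector: "main a", urlPattern: "/blog/*" },
  article: { bodySelector: "article" },
  pagination: { kind: "next-link", nextSelector: "a.next" },
  evidence: "12 links matched /blog/* inside <main>",
};

// Plain data survives a JSON round trip, which is what lets a cache file,
// a heuristic, an LLM, or a person all produce the same kind of object.
const restoredSpec: AdapterSpec = JSON.parse(JSON.stringify(exampleSpec));
```

Nothing in the object is executable, so storing it, diffing it, or handing it to a validator needs no special handling per producer.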
A tiered resolution system for efficiency
PluckMD employs a multi-stage resolution process to generate or retrieve AdapterSpecs in the most cost-effective way possible. The system prioritizes speed and reliability by checking caches and local heuristics before resorting to more expensive methods like LLMs.
The resolution pipeline follows this order:
- Cache check: A previously successful AdapterSpec is retrieved and validated against the current page structure.
- Local heuristics: If no cached spec exists, the system applies built-in heuristics to infer the correct extraction rules.
- LLM fallback: Only when heuristics fail does the system engage an LLM to generate a spec dynamically.
Every successfully validated spec is automatically cached, meaning that subsequent scraping operations on the same site are nearly instantaneous. This approach minimizes computational overhead while maintaining accuracy.
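The tiered order can be sketched as a single function. This is an illustration under assumptions, not PluckMD's implementation: the names resolveSpec, ResolveDeps, and Spec are invented here, the cache is a plain in-memory object keyed by a site identifier, and everything is synchronous for brevity (a real LLM call would be asynchronous).

```typescript
// Stand-in for AdapterSpec; only the field this sketch needs.
type Spec = { evidence: string };
type Generator = (siteKey: string) => Spec | null;
type Tier = "cache" | "heuristics" | "llm";

// Hypothetical dependency bundle; none of these names come from PluckMD.
interface ResolveDeps {
  cache: Record<string, Spec>;
  heuristics: Generator;
  llm?: Generator; // optional: an LLM may not be configured
  validate: (spec: Spec) => boolean;
}

function resolveSpec(
  siteKey: string,
  deps: ResolveDeps
): { spec: Spec; tier: Tier } | null {
  // Tier 1: a cached spec is reused only if it still validates.
  const cached = deps.cache[siteKey];
  if (cached !== undefined && deps.validate(cached)) {
    return { spec: cached, tier: "cache" };
  }

  // Tiers 2 and 3: cheapest generator first, LLM last.
  const tiers: Array<[Tier, Generator | undefined]> = [
    ["heuristics", deps.heuristics],
    ["llm", deps.llm],
  ];
  for (const [tier, generate] of tiers) {
    if (generate === undefined) continue;
    const candidate = generate(siteKey);
    if (candidate !== null && deps.validate(candidate)) {
      deps.cache[siteKey] = candidate; // cache every validated spec
      return { spec: candidate, tier };
    }
  }
  return null; // nothing produced a valid spec
}
```

Calling resolveSpec twice for the same site shows the effect described above: the first call may reach the LLM tier, and the second is served from the cache without invoking any generator.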
How heuristics identify article lists without prior knowledge
The heuristic engine in PluckMD operates without any site-specific assumptions. It analyzes the structure of a webpage by examining all links, normalizing their paths, and identifying repeating patterns. For example, URLs like /blog/post-1 and /blog/post-2 are grouped under /blog/*, while unrelated paths like /about are excluded.
The system scores potential article lists based on multiple factors:
- Number of matching links
- Proportion of page links that match the pattern
- Path depth (e.g., /blog/2024/post is deeper than /blog/post)
- Location within semantic HTML elements like <main>
A pattern must repeat at least three times to be considered valid. The system automatically treats numeric segments and hashes as variable, preventing fragmentation of similar URL structures. This ensures that variations in article IDs or timestamps do not disrupt the grouping process.
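A rough sketch of the grouping step follows. It is a simplification under stated assumptions: toPattern, groupLinks, and PatternGroup are names invented for this example, a "hash" is taken to mean a hex string of eight or more characters, and the final path segment is always treated as the variable slug. Only the first two scoring factors (match count and share of page links) are computed; path depth and DOM location are left out.

```typescript
interface PatternGroup {
  pattern: string;
  links: string[];
  share: number; // fraction of all links on the page that match
}

// Numeric segments (ids, years) and long hex strings (hashes) count as
// variable. The 8-character threshold is an assumption for this sketch.
function isVariableSegment(segment: string): boolean {
  return /^\d+$/.test(segment) || /^[0-9a-f]{8,}$/i.test(segment);
}

// "/blog/post-1" -> "/blog/*", "/blog/2024/post" -> "/blog/*/*"
function toPattern(path: string): string {
  const segments = path.split("/").filter((s) => s.length > 0);
  if (segments.length === 0) return "/";
  const parents = segments
    .slice(0, -1)
    .map((s) => (isVariableSegment(s) ? "*" : s));
  return "/" + parents.concat("*").join("/");
}

function groupLinks(paths: string[], minRepeats = 3): PatternGroup[] {
  const buckets: Record<string, string[]> = {};
  for (const path of paths) {
    const pattern = toPattern(path);
    const existing = buckets[pattern];
    if (existing !== undefined) {
      existing.push(path);
    } else {
      buckets[pattern] = [path];
    }
  }

  const groups: PatternGroup[] = [];
  for (const pattern of Object.keys(buckets)) {
    const links = buckets[pattern];
    // A pattern must repeat at least minRepeats times to be kept.
    if (links !== undefined && links.length >= minRepeats) {
      groups.push({ pattern, links, share: links.length / paths.length });
    }
  }
  // Largest group first as a crude stand-in for the full scoring.
  groups.sort((a, b) => b.links.length - a.links.length);
  return groups;
}
```

With the three /blog/post-N links plus /about and /contact as input, only /blog/* survives: the two top-level pages fall into a group of two, below the three-repeat minimum.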
Validation: the safety gate for dynamic scraping
Before any AdapterSpec is used or cached, it must pass a rigorous validation process. This step ensures that dynamically generated rules do not introduce errors or inconsistencies into the scraping pipeline.
The validation criteria include:
- The link selector must match at least three links on the page.
- At least 50% of those links must conform to the expected URL pattern.
- If the extraction method relies on CSS selectors, the retrieved body content must contain at least 80 characters.
Specs that fail validation are discarded, protecting the system from unreliable or malicious inputs. This gate applies uniformly, regardless of whether the spec was generated by heuristics, an LLM, or a developer.
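The three criteria translate directly into a small gate function. The thresholds (3 links, 50%, 80 characters) come from the list above; the function name, the input shape, and the choice to trim whitespace before counting body characters are assumptions made for this sketch.

```typescript
// Hypothetical input: what the spec produced when run against the page.
interface ValidationInput {
  matchedLinks: string[]; // paths matched by the spec's link selector
  urlPattern: RegExp; // expected URL shape (use a non-global regex)
  usesCssSelectors: boolean; // whether the body was extracted via CSS
  bodyText: string; // text extracted by the article spec
}

interface ValidationResult {
  ok: boolean;
  reason?: string;
}

const MIN_LINKS = 3;
const MIN_PATTERN_SHARE = 0.5;
const MIN_BODY_CHARS = 80;

function validateSpec(input: ValidationInput): ValidationResult {
  if (input.matchedLinks.length < MIN_LINKS) {
    return { ok: false, reason: "link selector matched fewer than 3 links" };
  }

  const conforming = input.matchedLinks.filter((link) =>
    input.urlPattern.test(link)
  ).length;
  if (conforming / input.matchedLinks.length < MIN_PATTERN_SHARE) {
    return { ok: false, reason: "under 50% of links fit the URL pattern" };
  }

  // Trimming before counting is an assumption, not a documented rule.
  if (input.usesCssSelectors && input.bodyText.trim().length < MIN_BODY_CHARS) {
    return { ok: false, reason: "extracted body is shorter than 80 characters" };
  }

  return { ok: true };
}
```

Returning a reason alongside the verdict is a design choice for the sketch: it gives a later stage (or a person reading logs) something to act on when a spec is discarded.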
Unified interfaces for diverse content sources
Webpages can originate from static HTML, dynamically rendered content via headless browsers, or even authenticated user sessions. PluckMD abstracts these differences behind a single interface, allowing the same extraction logic to work across multiple backends.
The tool supports three primary sources:
- Static fetch: Direct HTTP requests to retrieve raw HTML
- Headless browser (Playwright): Renders JavaScript-heavy pages
- Live browser (logged-in Chrome): Captures authenticated content
Additionally, a DomEvaluator interface enables live operations such as scrolling and clicking pagination buttons. This modular design means adding support for a new source type only requires implementing one interface, reducing maintenance overhead.
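As an illustration of the "one interface per source" idea, the sketch below defines a hypothetical PageSource interface and a static implementation. Only the name DomEvaluator appears in the article; its methods, along with PageSource, StaticSource, FetchText, and countLinks, are assumptions made for this example. The fetch function is injected so the sketch does not depend on any particular HTTP client.

```typescript
type SourceKind = "static" | "headless" | "live";

// Hypothetical common interface: every backend returns page HTML.
interface PageSource {
  readonly kind: SourceKind;
  load(url: string): Promise<string>;
}

// Hypothetical shape for live operations; the method names are guesses.
interface DomEvaluator {
  scrollToBottom(): Promise<void>;
  click(selector: string): Promise<boolean>;
}

type FetchText = (url: string) => Promise<string>;

// Static backend: delegates to whatever HTTP function the caller supplies.
class StaticSource implements PageSource {
  readonly kind: SourceKind = "static";
  private readonly fetchText: FetchText;

  constructor(fetchText: FetchText) {
    this.fetchText = fetchText;
  }

  load(url: string): Promise<string> {
    return this.fetchText(url);
  }
}

// Source-agnostic step: works on HTML no matter which backend produced it.
function countLinks(html: string): number {
  const matches = html.match(/<a\s[^>]*href=/gi);
  return matches === null ? 0 : matches.length;
}

// The same call works for any PageSource implementation.
function countLinksAt(source: PageSource, url: string): Promise<number> {
  return source.load(url).then(countLinks);
}
```

A headless or live-browser backend would be another class implementing PageSource (and, where it supports interaction, DomEvaluator), while countLinksAt and everything downstream stays unchanged.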
Falling back to agent-based generation
When local heuristics fail to produce a valid AdapterSpec and no LLM is configured, PluckMD does not abandon the task. Instead, it generates a structured request file containing the page’s DOM structure and a list of candidate selectors. This file can then be processed by an external coding agent, which produces the necessary extraction logic.
The workflow for agent-assisted scraping involves:
- Heuristics identify candidate selectors but fail validation
- A request file is generated with page structure and candidate rules
- An agent generates an AdapterSpec based on the request
- The spec is validated and cached for future use
This approach gives PluckMD a path forward on sites that defeat the local heuristics, without a developer having to hand-write selectors for them.
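The request file might be assembled along these lines. The article says only that it contains the page's DOM structure and candidate selectors, so the JSON layout, the field names, and the buildAgentRequest helper are all hypothetical; the failureReasons field in particular is an addition made for this sketch.

```typescript
// Hypothetical request layout; PluckMD's real file format is not shown
// in the article.
interface AgentRequest {
  url: string;
  domOutline: string; // condensed view of the page structure
  candidateSelectors: string[]; // selectors the heuristics proposed
  failureReasons: string[]; // assumed extra: why validation rejected them
}

function buildAgentRequest(
  url: string,
  domOutline: string,
  candidateSelectors: string[],
  failureReasons: string[]
): string {
  const request: AgentRequest = {
    url,
    domOutline,
    candidateSelectors,
    failureReasons,
  };
  // Pretty-printed so a person or a coding agent can read it directly.
  return JSON.stringify(request, null, 2);
}

const requestJson = buildAgentRequest(
  "https://example.com/blog",
  "main > ul.posts > li > a",
  ["ul.posts a", "main a"],
  ["under 50% of links fit the URL pattern"]
);
```

Whatever spec the agent sends back would still go through the same validation gate as any other spec before being cached.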
A flexible foundation for future enhancements
PluckMD’s core innovation lies in its ability to treat extraction rules as interchangeable data rather than rigid code. By decoupling the generation of scraping logic from its execution, the tool enables seamless integration with various input methods and backend systems.
The confidence threshold between trusting local heuristics and invoking an LLM remains an area of exploration. While the current system balances speed and accuracy effectively, future iterations may refine the decision-making process to further optimize performance and reduce reliance on expensive model calls. Developers interested in alternative approaches to generic web extraction are encouraged to contribute or share insights.
As web content continues to diversify, tools like PluckMD highlight the importance of adaptable scraping frameworks. By shifting from code-based to data-based extraction, developers can reduce maintenance burdens and focus on building scalable, reliable systems for content aggregation.
AI summary
A single JSON structure can be used to extract content from any website. Discover how pluckmd works and the advantages of data-driven approaches over code.