Extraction Rules

Extraction Rules give you precise control over data extraction using CSS selectors. Use them when you know exactly which elements hold the data you need and want fast, deterministic extraction without AI overhead.

How Extraction Rules Work

Extraction Rules allow you to:

Define CSS selectors for specific data elements
Map selectors to meaningful field names
Extract text content directly from matching elements
Receive structured data in your response

Unlike AI extraction, extraction rules are fast, predictable, and don't consume additional credits.

Basic Usage

Include the extractRules parameter as an object mapping field names to CSS selectors:

curl -X POST https://your-domain.com/api/v1/scrape \
  -H "X-API-KEY: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/product",
    "extractRules": {
      "title": "h1.product-title",
      "price": ".price-current",
      "description": ".product-description p"
    }
  }'

CSS Selector Examples

Basic Selectors

{
  "extractRules": {
    "title": "h1",                    // First h1 element
    "price": ".price",                // Element with class 'price'
    "description": "#description",    // Element with id 'description'
    "category": "nav .breadcrumb li:last-child"  // Last breadcrumb item
  }
}

Advanced Selectors

{
  "extractRules": {
    "product_name": "h1.product-title, h1.item-title",  // Multiple selectors
    "rating": "[data-rating]",                          // Attribute selector
    "reviews": "span:contains('reviews')",              // Text content selector (:contains is non-standard CSS; see Limitations)
    "availability": ".stock-status span:first-child",   // Nested selection
    "brand": "meta[property='product:brand']",          // Meta tag content
    "image_url": ".product-image img[src]"              // Image source
  }
}

Real-World Examples

E-commerce Product Page

{
  "url": "https://store.example.com/product/123",
  "renderJs": true,
  "extractRules": {
    "name": "h1.product-name, .product-title h1",
    "current_price": ".price-now, .current-price, .sale-price",
    "original_price": ".price-was, .original-price, .list-price",
    "discount": ".discount-percent, .savings-percent",
    "availability": ".stock-status, .availability-text",
    "rating": ".rating-value, [data-rating]",
    "review_count": ".review-count, .reviews-total",
    "brand": ".brand-name, .manufacturer",
    "model": ".model-number, .product-code",
    "main_image": ".product-image img",
    "description": ".product-description, .item-details",
    "features": ".product-features li",
    "specifications": ".specs-table td:nth-child(2)",
    "category": ".breadcrumb li:last-child"
  }
}

News Article Extraction

{
  "url": "https://news.example.com/article/123",
  "extractRules": {
    "headline": "h1.article-title, .headline",
    "subtitle": ".article-subtitle, .deck",
    "author": ".author-name, .byline a",
    "publish_date": "time[datetime], .publish-date",
    "content": ".article-body, .content p",
    "category": ".category-link, .section-name",
    "tags": ".tag-list a, .article-tags li",
    "read_time": ".read-time, .estimated-time",
    "image_caption": ".featured-image figcaption"
  }
}

Job Listing Extraction

{
  "url": "https://jobs.example.com/listing/123",
  "extractRules": {
    "job_title": "h1.job-title, .position-title",
    "company": ".company-name, .employer",
    "location": ".job-location, .location-text",
    "salary": ".salary-range, .compensation",
    "job_type": ".employment-type, .job-category",
    "posted_date": ".posted-date, .listing-date",
    "description": ".job-description, .role-summary",
    "requirements": ".requirements li, .qualifications li",
    "benefits": ".benefits li, .perks li",
    "contact_email": ".contact-info [href^='mailto:']",
    "application_url": ".apply-button[href], .application-link"
  }
}

Response Format

Extracted data is returned in the extracted_data field:

{
  "cost": 1,
  "creditsLeft": 999,
  "initial-status-code": 200,
  "resolved-url": "https://store.example.com/product/123",
  "type": "html",
  "body": "<!DOCTYPE html>...",
  "extracted_data": {
    "title": "Premium Wireless Headphones",
    "price": "$199.99",
    "description": "High-quality audio with noise cancellation",
    "category": "Electronics"
  }
}
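
Every value in extracted_data arrives as a string, so numeric fields such as the price usually need a parsing step on the client. The sketch below shows one way to do that in Python. parse_price is a hypothetical helper name, and it assumes prices use a dot as the decimal separator and optional commas as thousands separators; adjust it for other locales.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Return the first number found in an extracted price string, or None."""
    # Assumes "1,299.00"-style formatting: commas group thousands, dot is decimal.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


# A trimmed-down response body shaped like the example above.
response_body = {
    "extracted_data": {"title": "Premium Wireless Headphones", "price": "$199.99"}
}
extracted = response_body.get("extracted_data", {})
price = parse_price(extracted.get("price", ""))
print(price)
```

Decimal is used rather than float so that currency amounts are not subject to binary rounding.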

Combining with Other Features

With JavaScript Rendering

{
  "url": "https://spa-app.com/product",
  "renderJs": true,
  "waitFor": 3000,
  "extractRules": {
    "dynamic_price": ".price-loaded",
    "stock_status": ".stock-dynamic"
  }
}

With JavaScript Instructions

{
  "url": "https://example.com/product",
  "renderJs": true,
  "javascriptInstruction": [
    {
      "action": "clickElement",
      "selector": {"type": "css", "value": ".show-more-details"}
    },
    {
      "action": "wait",
      "delay": 2000
    }
  ],
  "extractRules": {
    "detailed_specs": ".expanded-details li",
    "technical_data": ".tech-specs td"
  }
}
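
If you issue many click-then-extract requests, it can help to build the request body with a small function. The following is a minimal sketch that reproduces the JSON above; click_then_extract is a hypothetical helper name, and the field names are taken from the example rather than from a full parameter reference.

```python
import json
from typing import Dict, List


def click_then_extract(url: str, click_selector: str, delay_ms: int,
                       rules: Dict[str, str]) -> dict:
    """Build a request body that clicks an element, waits, then extracts."""
    instructions: List[dict] = [
        {"action": "clickElement",
         "selector": {"type": "css", "value": click_selector}},
        {"action": "wait", "delay": delay_ms},
    ]
    return {
        "url": url,
        "renderJs": True,  # set to true, as in the example above
        "javascriptInstruction": instructions,
        "extractRules": rules,
    }


payload = click_then_extract(
    "https://example.com/product",
    ".show-more-details",
    2000,
    {"detailed_specs": ".expanded-details li"},
)
print(json.dumps(payload, indent=2))
```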

With Screenshots

{
  "url": "https://example.com/page",
  "renderJs": true,
  "screenshot": true,
  "extractRules": {
    "visible_text": ".main-content",
    "sidebar_info": ".sidebar-widget"
  }
}

Advanced Techniques

Handling Multiple Elements

When a selector matches multiple elements, the API returns the text from the first match:

{
  "extractRules": {
    "first_paragraph": "p",           // Gets first <p> element
    "first_heading": "h2, h3, h4"     // Gets only the first element that matches h2, h3, or h4
  }
}
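
Because only the first match is returned, one possible workaround for short lists is to generate a separate rule for each position with :nth-child, which the specifications example earlier on this page already uses. The sketch below is an illustration under that assumption; indexed_rules is a hypothetical helper, and the number of items has to be guessed in advance.

```python
from typing import Dict


def indexed_rules(prefix: str, item_selector: str, count: int) -> Dict[str, str]:
    """Create one rule per list position, e.g. feature_1, feature_2, ..."""
    return {
        f"{prefix}_{i}": f"{item_selector}:nth-child({i})"
        for i in range(1, count + 1)
    }


rules = indexed_rules("feature", ".product-features li", 3)
for name, selector in rules.items():
    print(name, "->", selector)
```

Note that :nth-child counts all sibling elements, so this works cleanly only when the list items are the sole children of their parent. Positions beyond the end of the list should come back as empty strings, as described under Error Handling.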

Attribute Extraction

Extract attribute values instead of text content:

{
  "extractRules": {
    "image_url": "img.product-image",     // Gets src attribute
    "link_url": "a.product-link",         // Gets href attribute
    "data_id": "[data-product-id]"        // Gets data-product-id attribute
  }
}

Fallback Selectors

List several alternative selectors, separated by commas, so the rule still matches when a site uses a different class name:

{
  "extractRules": {
    "price": ".price-current, .price-now, .current-price, .sale-price"
  }
}
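
In CSS, a comma-separated list is a single selector, so the order of the alternatives does not express a preference; in browsers, querySelector returns whichever matching element appears first in the document. If the API behaves the same way (an assumption, not something this page confirms) and you need strict priority, one option is to send each candidate as its own field and choose on the client. The helper names and the double-underscore field naming below are hypothetical.

```python
from typing import Dict, List


def fallback_rules(field: str, selectors: List[str]) -> Dict[str, str]:
    """Expand one logical field into numbered rules, highest priority first."""
    return {f"{field}__{i}": sel for i, sel in enumerate(selectors)}


def resolve_fallback(extracted: Dict[str, str], field: str, count: int) -> str:
    """Return the first non-empty candidate value in priority order."""
    for i in range(count):
        value = extracted.get(f"{field}__{i}", "").strip()
        if value:
            return value
    return ""


selectors = [".price-sale", ".price-current", ".price"]
rules = fallback_rules("price", selectors)
# Simulated extracted_data: the sale price is absent, the other two are present.
sample = {"price__0": "", "price__1": "$199.99", "price__2": "$249.99"}
print(resolve_fallback(sample, "price", len(selectors)))
```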

Best Practices

1. Use Specific Selectors

// ❌ Too generic - might match unintended elements
{"title": "h1"}

// ✅ Specific and reliable
{"title": "h1.product-title, .main-title h1"}

2. Handle Dynamic Content

{
  "url": "https://spa-app.com",
  "renderJs": true,
  "waitFor": 3000,
  "extractRules": {
    "loaded_content": ".content-loaded"
  }
}

3. Test Selectors First

Before implementing, test your CSS selectors in browser dev tools:

// Test in browser console
document.querySelector('.price-current')?.textContent
document.querySelectorAll('.product-features li')
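
A quick local check of the rule set can also catch simple mistakes before a request is sent. The sketch below is not a CSS parser and is not part of the API; check_rules is a hypothetical helper that flags only empty values, unbalanced brackets, and the jQuery-only :contains() pseudo-selector.

```python
from typing import Dict, List


def check_rules(rules: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in an extractRules mapping."""
    problems: List[str] = []
    for field, selector in rules.items():
        if not field.strip():
            problems.append("empty field name")
        if not isinstance(selector, str) or not selector.strip():
            problems.append(f"{field}: selector is empty")
            continue
        if selector.count("[") != selector.count("]"):
            problems.append(f"{field}: unbalanced square brackets")
        if selector.count("(") != selector.count(")"):
            problems.append(f"{field}: unbalanced parentheses")
        if ":contains(" in selector:
            # :contains() is a jQuery extension, not a standard CSS selector.
            problems.append(f"{field}: :contains() is not standard CSS")
    return problems


candidate = {
    "title": "h1.product-title",
    "reviews": "span:contains('reviews')",
    "rating": "[data-rating",
}
for problem in check_rules(candidate):
    print(problem)
```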

4. Use Fallback Strategies

{
  "extractRules": {
    "price": ".price-sale, .price-current, .price, [data-price]"
  }
}

Error Handling

When a selector matches no element, the field is still returned, with an empty string as its value:

{
  "extracted_data": {
    "title": "Found Product Title",
    "price": "",              // Empty string when not found
    "description": "Found Description",
    "missing_field": ""       // Empty string for missing elements
  }
}
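
Since a missing element produces an empty string rather than an error, client code has to check for empty values itself. A minimal sketch, with missing_fields as a hypothetical helper name:

```python
from typing import Dict, List, Optional


def missing_fields(extracted: Dict[str, Optional[str]]) -> List[str]:
    """List the field names whose selector produced no text."""
    # Treat None defensively as missing too, although the page documents "".
    return [name for name, value in extracted.items() if not (value or "").strip()]


extracted = {
    "title": "Found Product Title",
    "price": "",
    "description": "Found Description",
    "missing_field": "",
}
print(missing_fields(extracted))
```

An empty result can mean that the selector is wrong or that the element is genuinely absent on that page, so it is worth logging the URL alongside the missing field names.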

Credit Costs

Extraction rules don't add extra credits to your request:

Basic scraping with extraction: 1 credit
With JavaScript rendering: 4 credits total
With screenshot: varies based on rendering
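
As a rough budgeting aid, the figures above can be turned into a small calculation. This sketch assumes a flat 1 credit per basic request and 4 credits per request with JavaScript rendering, as listed; screenshots are left out because their cost varies. estimate_credits is a hypothetical helper name.

```python
def estimate_credits(request_count: int, render_js: bool = False) -> int:
    """Estimate total credits for a batch of scrape requests with extraction rules."""
    per_request = 4 if render_js else 1  # figures from the list above
    return request_count * per_request


print(estimate_credits(500))                  # 500 basic requests
print(estimate_credits(500, render_js=True))  # 500 requests with JS rendering
```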

Limitations

CSS Selector Limitations

Standard CSS3 selectors only
No CSS4 selectors or non-standard pseudo-selectors such as :contains()
No JavaScript execution in selectors

Content Limitations

Extracts text content or, for the cases shown under Attribute Extraction, an attribute value; raw HTML is not returned
First matching element only
No array/list extraction from multiple elements

Performance Considerations

Very fast compared to AI extraction
No additional API calls or processing
Minimal impact on response time

Comparison: Extraction Rules vs AI Extraction

Feature       Extraction Rules          AI Extraction
Speed         Very fast                 Slower
Cost          No extra credits          +5 credits
Precision     Exact CSS targeting       Natural language
Flexibility   Fixed selectors           Adaptive understanding
Setup         Requires CSS knowledge    Natural descriptions
Reliability   Breaks if HTML changes    Adapts to changes

Programming Language Examples

Python

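import requests

# Send the scrape request; the timeout keeps a stalled connection from hanging.
response = requests.post(
    'https://your-domain.com/api/v1/scrape',
    headers={'X-API-KEY': 'your-api-key'},
    json={
        'url': 'https://example.com/product',
        'extractRules': {
            'title': 'h1.product-title',
            'price': '.price-current',
            'availability': '.stock-status'
        }
    },
    timeout=60,
)
response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body

data = response.json()
extracted = data['extracted_data']
# Fields whose selector matched nothing come back as empty strings.
print(f"Title: {extracted['title']}")
print(f"Price: {extracted['price']}")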

JavaScript/Node.js

const response = await fetch('https://your-domain.com/api/v1/scrape', {
  method: 'POST',
  headers: {
    'X-API-KEY': 'your-api-key',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/product',
    extractRules: {
      title: 'h1.product-title',
      price: '.price-current',
      availability: '.stock-status'
    }
  })
});

const data = await response.json();
console.log('Extracted:', data.extracted_data);

PHP

$response = file_get_contents('https://your-domain.com/api/v1/scrape', false, stream_context_create([
    'http' => [
        'method' => 'POST',
        'header' => [
            'X-API-KEY: your-api-key',
            'Content-Type: application/json'
        ],
        'content' => json_encode([
            'url' => 'https://example.com/product',
            'extractRules' => [
                'title' => 'h1.product-title',
                'price' => '.price-current'
            ]
        ])
    ]
]));

$data = json_decode($response, true);
echo $data['extracted_data']['title'];

Next Steps

Learn about AI Extraction for flexible, natural language extraction
Explore JavaScript Instructions for dynamic content
Check out Screenshot Capture for visual monitoring
See API Reference for complete parameter details
