AI-Powered Data Extraction

WebscrapingHQ's AI extraction feature uses machine learning models to identify and extract structured data from web pages. Instead of writing CSS selectors or XPath expressions, you describe the data you want in natural language.

How AI Extraction Works

The AI extraction system:

Analyzes the entire page content after rendering
Understands your data requirements through natural language descriptions
Identifies relevant content using advanced pattern recognition
Extracts structured data in JSON format
Returns clean, formatted results ready for your application

Basic AI Extraction

To use AI extraction, include the aiScraping parameter in your request:

curl -X POST https://your-domain.com/api/v1/scrape \
  -H "X-API-KEY: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/product",
    "renderJs": true,
    "aiScraping": [
      {
        "name": "title",
        "value": "The main product title"
      },
      {
        "name": "price", 
        "value": "Current product price"
      },
      {
        "name": "availability",
        "value": "Whether the product is in stock"
      }
    ]
  }'
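
The same request body can be assembled programmatically. The following is a minimal sketch, assuming you keep your fields as a mapping from output name to description; the helper name build_ai_payload is hypothetical and not part of the API.

```python
from typing import Any


def build_ai_payload(url: str, fields: dict[str, str], render_js: bool = True) -> dict[str, Any]:
    """Build a scrape request body with one aiScraping entry per field.

    `fields` maps each output name to its natural language description.
    """
    return {
        "url": url,
        "renderJs": render_js,
        "aiScraping": [
            {"name": name, "value": description}
            for name, description in fields.items()
        ],
    }


# Same fields as the curl example above
payload = build_ai_payload(
    "https://example-store.com/product",
    {
        "title": "The main product title",
        "price": "Current product price",
        "availability": "Whether the product is in stock",
    },
)
```

The resulting dictionary can be sent as the JSON body of the POST request.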

Field Configuration

Each AI extraction field supports these properties:

`name` (required)

The field name for the extracted data in the response.

`value` (optional)

A natural language description of what to extract. The more specific, the better the results.

`context` (optional)

Additional context to help the AI understand the data location or format.
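
One way to catch malformed fields before sending a request is a small client-side model of the three properties. This is a sketch only: the AiField class is a hypothetical helper, and the check it performs (a non-empty name) reflects the required/optional markers above rather than any documented server-side validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AiField:
    """Client-side model of one aiScraping entry (hypothetical helper)."""

    name: str
    value: Optional[str] = None
    context: Optional[str] = None

    def __post_init__(self) -> None:
        # name is the only property marked as required
        if not self.name or not self.name.strip():
            raise ValueError("name is required")

    def to_dict(self) -> dict[str, str]:
        """Serialize the field, leaving out optional properties that are unset."""
        entry = {"name": self.name}
        if self.value is not None:
            entry["value"] = self.value
        if self.context is not None:
            entry["context"] = self.context
        return entry
```

Calling to_dict() on each field gives the list to place under aiScraping.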

Advanced Examples

E-commerce Product Extraction

{
  "url": "https://store.example.com/product/123",
  "renderJs": true,
  "aiScraping": [
    {
      "name": "product_name",
      "value": "The main product title or name"
    },
    {
      "name": "price",
      "value": "Current selling price in any currency"
    },
    {
      "name": "original_price",
      "value": "Original price before discount, if available"
    },
    {
      "name": "availability",
      "value": "Stock status - in stock, out of stock, limited quantity"
    },
    {
      "name": "rating",
      "value": "Customer rating or review score"
    },
    {
      "name": "review_count",
      "value": "Number of customer reviews"
    },
    {
      "name": "description",
      "value": "Product description or key features"
    },
    {
      "name": "category",
      "value": "Product category or department"
    },
    {
      "name": "brand",
      "value": "Brand or manufacturer name"
    },
    {
      "name": "images",
      "value": "Product image URLs as a list"
    }
  ]
}

News Article Extraction

{
  "url": "https://news.example.com/article/123",
  "renderJs": true,
  "aiScraping": [
    {
      "name": "headline",
      "value": "Main article headline or title"
    },
    {
      "name": "author",
      "value": "Article author name"
    },
    {
      "name": "publish_date",
      "value": "When the article was published"
    },
    {
      "name": "summary",
      "value": "Article summary or excerpt"
    },
    {
      "name": "content",
      "value": "Full article text content"
    },
    {
      "name": "tags",
      "value": "Article tags or categories as a list"
    },
    {
      "name": "read_time",
      "value": "Estimated reading time"
    }
  ]
}

Job Listing Extraction

{
  "url": "https://jobs.example.com/posting/123",
  "renderJs": true,
  "aiScraping": [
    {
      "name": "job_title",
      "value": "Job position title"
    },
    {
      "name": "company",
      "value": "Company name"
    },
    {
      "name": "location",
      "value": "Job location - city, state, country"
    },
    {
      "name": "salary_range",
      "value": "Salary or compensation range"
    },
    {
      "name": "employment_type",
      "value": "Full-time, part-time, contract, etc."
    },
    {
      "name": "requirements",
      "value": "Required skills and qualifications as a list"
    },
    {
      "name": "benefits",
      "value": "Job benefits and perks as a list"
    },
    {
      "name": "description",
      "value": "Full job description"
    },
    {
      "name": "posted_date",
      "value": "When the job was posted"
    }
  ]
}

Response Format

AI extraction results are returned in the aiResponse field, keyed by the name of each requested field:

{
  "cost": 6,
  "creditsLeft": 994,
  "aiResponse": {
    "title": "Premium Wireless Headphones",
    "price": "$199.99",
    "availability": "In Stock",
    "rating": "4.5",
    "review_count": "1,247",
    "category": "Electronics > Audio"
  },
  "initial-status-code": 200,
  "resolved-url": "https://store.example.com/product/123",
  "type": "html",
  "body": "<!DOCTYPE html>..."
}
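
In the sample response above, price, rating and review_count come back as strings ("$199.99", "4.5", "1,247"). If your application needs numbers, a conversion step helps. The sketch below assumes "," is a thousands separator and "." is the decimal point, which will not hold for every locale; parse_number is a hypothetical helper.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_number(text: Optional[str]) -> Optional[Decimal]:
    """Pull the first number out of a string such as "$199.99" or "1,247".

    Assumes "," separates thousands and "." marks the decimal point.
    Returns None when the input is None or contains no digits.
    """
    if text is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))
```

Decimal is used instead of float so that prices keep their exact value.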

Best Practices

1. Be Specific in Descriptions

// ❌ Too vague
{"name": "data", "value": "important information"}

// ✅ Specific and clear
{"name": "product_price", "value": "Current selling price including currency symbol"}

2. Use Context for Complex Scenarios

{
  "name": "discount_percentage",
  "value": "Percentage discount from original price",
  "context": "Look for text like '20% off' or 'Save 25%'"
}

3. Handle Lists and Arrays

{
  "name": "features",
  "value": "Product features and specifications as a list",
  "context": "Return as JSON array format ['feature1', 'feature2']"
}
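
The exact shape of a list field in the response is not specified here, so client code may want to accept more than one form. The sketch below assumes the value could arrive as a real JSON array, as a string containing an array (with double or single quotes), or as a comma-separated string; as_list is a hypothetical helper.

```python
import ast
import json
from typing import Any


def as_list(value: Any) -> list[str]:
    """Normalize a list-like field value into a list of trimmed strings."""
    if value is None:
        return []
    if isinstance(value, list):
        return [str(item).strip() for item in value]
    text = str(value).strip()
    if text.startswith("["):
        # Try strict JSON first, then a Python-style literal with single quotes
        for parser in (json.loads, ast.literal_eval):
            try:
                parsed = parser(text)
            except (ValueError, SyntaxError):
                continue
            if isinstance(parsed, list):
                return [str(item).strip() for item in parsed]
    # Fall back to splitting on commas
    return [part.strip() for part in text.split(",") if part.strip()]
```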

4. Combine with JavaScript Rendering

AI extraction works best with fully rendered pages:

{
  "url": "https://spa-app.com/data",
  "renderJs": true,
  "waitFor": 3000,
  "aiScraping": [...]
}

Error Handling

If AI extraction fails to find specific data:

{
  "aiResponse": {
    "title": "Product Name Found",
    "price": null,
    "availability": "Could not determine"
  }
}

The AI returns null or descriptive text when data cannot be found, so check for both before using a value.
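
Because a missing value can appear either as null or as free text, a small check can flag fields that need attention. This is a sketch under assumptions: only "Could not determine" appears in the example above, and the other placeholder phrases in NOT_FOUND_PHRASES are guesses you should adapt to what you observe.

```python
from typing import Any, Iterable

# Assumed placeholder phrases; only "could not determine" comes from the example above
NOT_FOUND_PHRASES = ("could not determine", "not found", "n/a")


def missing_fields(ai_response: dict[str, Any], expected: Iterable[str]) -> list[str]:
    """Return the expected field names that are absent, null, empty, or placeholders."""
    missing = []
    for name in expected:
        value = ai_response.get(name)
        if value is None:
            missing.append(name)
        elif isinstance(value, str):
            normalized = value.strip().lower()
            if not normalized or normalized in NOT_FOUND_PHRASES:
                missing.append(name)
    return missing
```

For the response shown above, this reports price and availability as missing while leaving title alone.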

Credit Costs

AI extraction adds 5 credits to your base request cost. For a request that also uses JavaScript rendering:

Basic scraping: 1 credit
JavaScript rendering: +3 credits
AI extraction: +5 credits
Total: 9 credits
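
The arithmetic above can be expressed as a small estimator. This sketch covers only the three line items listed in this section; other features may carry their own costs, and estimate_cost is a hypothetical helper rather than an API call.

```python
BASE_COST = 1           # basic scraping
RENDER_JS_COST = 3      # added when renderJs is enabled
AI_EXTRACTION_COST = 5  # added when aiScraping is used


def estimate_cost(render_js: bool, ai_scraping: bool) -> int:
    """Estimate the credit cost of one request from the figures listed above."""
    cost = BASE_COST
    if render_js:
        cost += RENDER_JS_COST
    if ai_scraping:
        cost += AI_EXTRACTION_COST
    return cost


print(estimate_cost(render_js=True, ai_scraping=True))  # prints 9
```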

Combining with Other Features

AI extraction can be combined with other features, such as screenshots and JavaScript instructions, in the same request:

{
  "url": "https://example.com/page",
  "renderJs": true,
  "screenshot": true,
  "aiScraping": [...],
  "javascriptInstruction": [
    {
      "action": "clickElement",
      "selector": {"type": "css", "value": ".load-more"}
    }
  ]
}

Language Support

The AI extraction system supports content in multiple languages and can extract data from:

English websites
International e-commerce sites
Multi-language news sources
Global job boards
Social media platforms

Limitations

Maximum 20 fields per request
Best results with well-structured HTML
Requires JavaScript rendering for dynamic content
May have difficulty with heavily obfuscated content

Programming Language Examples

Python

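import requests

# One aiScraping entry per field to extract
response = requests.post(
    'https://your-domain.com/api/v1/scrape',
    headers={'X-API-KEY': 'your-api-key'},
    json={
        'url': 'https://example.com/product',
        'renderJs': True,
        'aiScraping': [
            {'name': 'title', 'value': 'Product title'},
            {'name': 'price', 'value': 'Current price'}
        ]
    },
    timeout=60,  # arbitrary client-side limit; adjust as needed
)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error body

data = response.json()
extracted = data['aiResponse']
# Fields the AI could not find may come back as None
print(f"Title: {extracted.get('title')}")
print(f"Price: {extracted.get('price')}")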

JavaScript/Node.js

const response = await fetch('https://your-domain.com/api/v1/scrape', {
  method: 'POST',
  headers: {
    'X-API-KEY': 'your-api-key',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/product',
    renderJs: true,
    aiScraping: [
      {name: 'title', value: 'Product title'},
      {name: 'price', value: 'Current price'}
    ]
  })
});

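// Stop on HTTP errors before trying to parse the body
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}

const data = await response.json();
console.log('Extracted data:', data.aiResponse);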

PHP

$response = file_get_contents('https://your-domain.com/api/v1/scrape', false, stream_context_create([
    'http' => [
        'method' => 'POST',
        'header' => [
            'X-API-KEY: your-api-key',
            'Content-Type: application/json'
        ],
        'content' => json_encode([
            'url' => 'https://example.com/product',
            'renderJs' => true,
            'aiScraping' => [
                ['name' => 'title', 'value' => 'Product title'],
                ['name' => 'price', 'value' => 'Current price']
            ]
        ])
    ]
]));

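// file_get_contents returns false when the request fails
if ($response === false) {
    die('Request failed');
}

$data = json_decode($response, true);
echo "Title: " . $data['aiResponse']['title'];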

Next Steps

Learn about Extraction Rules for CSS-based extraction
Explore JavaScript Instructions for complex interactions
Check out SERP Scraping for search engine results
See API Reference for complete parameter details
