Data Access & Reliability
Some data sources may return incomplete responses, rate-limit automated requests, or require JavaScript rendering before content is available. This guide covers the tools available in the connectors codebase for improving data access reliability.
Strategy Overview
| Technique | Difficulty | Cost | Best For |
|---|---|---|---|
| Header tuning & rate limiting | Low | Free | Most sources |
| Headless browser (chromedp) | Medium | Free | JS-rendered pages |
| Proxy rotation | Medium | $$ | Rate-limited sources |
| Managed scraping API (e.g. Firecrawl) | Low | $$$ | Complex JS rendering |
| Hybrid (proxy + headless) | High | $$ | Difficult sources |
1. Header Tuning & Rate Limiting (Already In Use)
Most connectors already apply this baseline:
- Rotating User-Agent strings via `extensions.RandomUserAgent(c)` (Colly)
- Standard browser headers: `Accept`, `Accept-Language`, `Sec-Ch-Ua`, `Upgrade-Insecure-Requests`
- Referer headers via `extensions.Referer(c)`
- Rate limiting with jitter through Colly's `LimitRule` with `RandomDelay`
- Session warm-up: visiting the homepage to collect cookies before deeper pages
- Adaptive backoff: `AdaptiveRateLimiter` handles 429 responses with exponential backoff
This is sufficient for most sources and should always be the starting point.
2. Headless Browser (chromedp)
Some sources render content client-side via JavaScript. The shared utility at `connectors/common/headless.go` uses headless Chrome to fetch fully-rendered HTML:
```go
import "connectors/common"

html, err := common.FetchRenderedHTML(ctx, url, 20*time.Second, nil)
```
Options
Pass `HeadlessFetchOptions` to configure proxy and additional flags:
```go
html, err := common.FetchRenderedHTML(ctx, url, 20*time.Second, &common.HeadlessFetchOptions{
	ProxyURL:   "http://user:pass@proxy.example.com:8080",
	ExtraFlags: []string{"--disable-blink-features=AutomationControlled"},
})
```
Limitations
- Slow (~2-20s per page) — not suitable for high-volume discovery
- Requires Chrome/Chromium installed in the container image
3. Proxy Rotation
When a source rate-limits by IP, rotating through multiple egress addresses improves throughput. The utility at `connectors/common/proxy.go` provides a `ProxyRotator` that integrates with both Colly and `net/http`.
Configuration
Set the `PROXY_URLS` environment variable (comma-separated) or pass URLs directly:
```sh
export PROXY_URLS="http://user:pass@us.proxy.example.com:8080,http://user:pass@eu.proxy.example.com:8080"
```
Usage with Colly
Section titled “Usage with Colly”rotator := common.NewProxyRotatorFromEnv()
c := colly.NewCollector()if rotator != nil { c.SetProxyFunc(rotator.CollyProxyFunc())}Usage with net/http
```go
rotator := common.NewProxyRotatorFromEnv()
transport := rotator.HTTPTransport()
client := &http.Client{Transport: transport}
```
Proxy Providers
Several commercial proxy providers offer residential and datacenter pools with geographic targeting. Evaluate based on:
- Geographic coverage — choose a provider with IPs in the regions your connectors target
- Pricing model — per-GB vs. per-request vs. monthly flat rate
- Sticky sessions — needed when cookies must persist across requests on the same IP
- Managed vs. raw — some providers bundle JS rendering and header management into a single API
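Round-robin rotation itself is simple. The stdlib-only sketch below shows the idea behind plugging a rotator into `http.Transport.Proxy`; it is illustrative, not the actual `common/proxy.go` implementation:

```go
package main

import (
	"net/http"
	"net/url"
	"sync/atomic"
)

// rotator cycles through a fixed list of proxy URLs round-robin.
type rotator struct {
	proxies []*url.URL
	next    atomic.Uint64
}

func newRotator(raw []string) (*rotator, error) {
	r := &rotator{}
	for _, s := range raw {
		u, err := url.Parse(s)
		if err != nil {
			return nil, err
		}
		r.proxies = append(r.proxies, u)
	}
	return r, nil
}

// proxyFunc satisfies http.Transport.Proxy: each request gets the next proxy.
func (r *rotator) proxyFunc(_ *http.Request) (*url.URL, error) {
	i := r.next.Add(1) - 1
	return r.proxies[i%uint64(len(r.proxies))], nil
}

func main() {
	r, err := newRotator([]string{
		"http://user:pass@us.proxy.example.com:8080",
		"http://user:pass@eu.proxy.example.com:8080",
	})
	if err != nil {
		panic(err)
	}
	_ = &http.Client{Transport: &http.Transport{Proxy: r.proxyFunc}}
}
```

The atomic counter keeps rotation safe under concurrent requests without a mutex.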
Sticky Sessions
For multi-page fetches where cookies must persist, use sticky sessions (same IP for a session window):
```go
rotator := common.NewProxyRotator(proxyURLs, common.ProxyRotatorOptions{
	StickySessionDuration: 5 * time.Minute,
})
```
4. Managed Scraping APIs
Services like Firecrawl handle JavaScript rendering and return clean HTML or Markdown. They can be useful when headless Chrome alone isn’t sufficient, or when you want to avoid managing browser infrastructure.
```sh
export FIRECRAWL_API_KEY="fc-your-api-key"
export FIRECRAWL_API_URL="https://api.firecrawl.dev" # optional, defaults to this
```
The utility at `connectors/common/firecrawl.go` provides a client:
```go
fc := common.NewFirecrawlClientFromEnv()
if fc == nil {
	// Firecrawl not configured, fall back to direct fetching
}

result, err := fc.ScrapeURL(ctx, "https://example.com/listings", nil)
if err != nil {
	return err
}
// result.HTML contains the rendered page HTML
// result.Markdown contains clean markdown extraction
```
Options
```go
result, err := fc.ScrapeURL(ctx, url, &common.FirecrawlScrapeOptions{
	Formats: []string{"html", "markdown"},
	WaitFor: 3000, // wait 3s for JS to render
})
```
When to Consider a Managed API
Section titled “When to Consider a Managed API”- The source requires complex JS rendering that headless Chrome struggles with
- You want clean structured output without writing CSS selectors
- Rapid prototyping of new connectors (get data flowing first, optimize later)
5. Choosing an Approach
Start with the lightest technique and escalate only as needed:
- Start simple: Header tuning and rate limiting work for most sources
- Add proxies: If requests are being rate-limited, add `ProxyRotator`
- Try headless: If the source requires JS rendering, use `FetchRenderedHTML`
- Consider a managed API: If headless rendering is insufficient, try Firecrawl or a similar service
- Monitor and adapt: Source behaviour changes over time; review connector health regularly
6. Environment Variables Reference
| Variable | Description | Example |
|---|---|---|
| `PROXY_URLS` | Comma-separated proxy URLs for rotation | `http://user:pass@proxy:8080,...` |
| `FIRECRAWL_API_KEY` | Firecrawl API authentication key | `fc-abc123...` |
| `FIRECRAWL_API_URL` | Firecrawl API base URL (optional) | `https://api.firecrawl.dev` |
These can be set in the connector’s Kubernetes CronJob manifest or in `mill/config/.env` for local development.