
Data Access & Reliability

Some data sources may return incomplete responses, rate-limit automated requests, or require JavaScript rendering before content is available. This guide covers the tools available in the connectors codebase for improving data access reliability.

| Technique | Difficulty | Cost | Best For |
|---|---|---|---|
| Header tuning & rate limiting | Low | Free | Most sources |
| Headless browser (chromedp) | Medium | Free | JS-rendered pages |
| Proxy rotation | Medium | $$ | Rate-limited sources |
| Managed scraping API (e.g. Firecrawl) | Low | $$$ | Complex JS rendering |
| Hybrid (proxy + headless) | High | $$ | Difficult sources |

1. Header Tuning & Rate Limiting (Already In Use)


Most connectors already apply this baseline:

  • Rotating User-Agent strings via extensions.RandomUserAgent(c) (Colly)
  • Standard browser headers: Accept, Accept-Language, Sec-Ch-Ua, Upgrade-Insecure-Requests
  • Referer headers via extensions.Referer(c)
  • Rate limiting with jitter through Colly’s LimitRule with RandomDelay
  • Session warm-up: visiting the homepage to collect cookies before deeper pages
  • Adaptive backoff: AdaptiveRateLimiter handles 429 responses with exponential backoff

This is sufficient for most sources and should always be the starting point.

2. Headless Browser (chromedp)

Some sources render content client-side via JavaScript. The shared utility at connectors/common/headless.go uses headless Chrome to fetch fully-rendered HTML:

import "connectors/common"
html, err := common.FetchRenderedHTML(ctx, url, 20*time.Second, nil)

Pass HeadlessFetchOptions to configure proxy and additional flags:

html, err := common.FetchRenderedHTML(ctx, url, 20*time.Second, &common.HeadlessFetchOptions{
	ProxyURL:   "http://user:pass@proxy.example.com:8080",
	ExtraFlags: []string{"--disable-blink-features=AutomationControlled"},
})
Trade-offs:

  • Slow (~2-20s per page) — not suitable for high-volume discovery
  • Requires Chrome/Chromium installed in the container image

3. Proxy Rotation

When a source rate-limits by IP, rotating through multiple egress addresses improves throughput. The utility at connectors/common/proxy.go provides a ProxyRotator that integrates with both Colly and net/http.

Set the PROXY_URLS environment variable (comma-separated) or pass URLs directly:

export PROXY_URLS="http://user:pass@us.proxy.example.com:8080,http://user:pass@eu.proxy.example.com:8080"
With Colly:

rotator := common.NewProxyRotatorFromEnv()
c := colly.NewCollector()
if rotator != nil {
	c.SetProxyFunc(rotator.CollyProxyFunc())
}

With net/http:

rotator := common.NewProxyRotatorFromEnv()
transport := rotator.HTTPTransport()
client := &http.Client{Transport: transport}
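At its core, a rotating proxy function is just a thread-safe round-robin over the pool. This sketch shows that core logic only; it is a hypothetical stand-in for what `CollyProxyFunc` does internally, not the actual ProxyRotator code (Colly's real proxy function also receives the outgoing `*http.Request`):

```go
package main

import (
	"fmt"
	"net/url"
	"sync/atomic"
)

// roundRobin returns a function that yields the next proxy URL on each call,
// cycling through the pool. The atomic counter makes it safe for concurrent
// collector goroutines.
func roundRobin(proxies []string) func() (*url.URL, error) {
	var n uint64
	return func() (*url.URL, error) {
		i := atomic.AddUint64(&n, 1)
		return url.Parse(proxies[(i-1)%uint64(len(proxies))])
	}
}

func main() {
	next := roundRobin([]string{"http://us.proxy:8080", "http://eu.proxy:8080"})
	for i := 0; i < 3; i++ {
		u, _ := next()
		fmt.Println(u.Host) // us.proxy:8080, eu.proxy:8080, us.proxy:8080
	}
}
```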

Several commercial proxy providers offer residential and datacenter pools with geographic targeting. Evaluate based on:

  • Geographic coverage — choose a provider with IPs in the regions your connectors target
  • Pricing model — per-GB vs. per-request vs. monthly flat rate
  • Sticky sessions — needed when cookies must persist across requests on the same IP
  • Managed vs. raw — some providers bundle JS rendering and header management into a single API

For multi-page fetches where cookies must persist, use sticky sessions (same IP for a session window):

rotator := common.NewProxyRotator(proxyURLs, common.ProxyRotatorOptions{
	StickySessionDuration: 5 * time.Minute,
})

4. Managed Scraping API (Firecrawl)

Services like Firecrawl handle JavaScript rendering and return clean HTML or Markdown. They can be useful when headless Chrome alone isn’t sufficient, or when you want to avoid managing browser infrastructure.

export FIRECRAWL_API_KEY="fc-your-api-key"
export FIRECRAWL_API_URL="https://api.firecrawl.dev" # optional, defaults to this

The utility at connectors/common/firecrawl.go provides a client:

fc := common.NewFirecrawlClientFromEnv()
if fc == nil {
	// Firecrawl not configured; fall back to direct fetching
}
result, err := fc.ScrapeURL(ctx, "https://example.com/listings", nil)
if err != nil {
	return err
}
// result.HTML contains the rendered page HTML
// result.Markdown contains the clean Markdown extraction

Pass FirecrawlScrapeOptions to control the output formats and the JS render wait:

result, err := fc.ScrapeURL(ctx, url, &common.FirecrawlScrapeOptions{
	Formats: []string{"html", "markdown"},
	WaitFor: 3000, // wait 3s for JS to render
})
A managed API is worth the cost when:

  • The source requires complex JS rendering that headless Chrome struggles with
  • You want clean structured output without writing CSS selectors
  • You are rapidly prototyping a new connector (get data flowing first, optimize later)

Decision Guide

Start with the lightest technique and escalate only as needed:

  1. Start simple: Header tuning and rate limiting work for most sources
  2. Add proxies: If requests are being rate-limited, add ProxyRotator
  3. Try headless: If the source requires JS rendering, use FetchRenderedHTML
  4. Consider a managed API: If headless rendering is insufficient, try Firecrawl or a similar service
  5. Monitor and adapt: Source behaviour changes over time; review connector health regularly
Environment Variables

| Variable | Description | Example |
|---|---|---|
| PROXY_URLS | Comma-separated proxy URLs for rotation | http://user:pass@proxy:8080,... |
| FIRECRAWL_API_KEY | Firecrawl API authentication key | fc-abc123... |
| FIRECRAWL_API_URL | Firecrawl API base URL (optional) | https://api.firecrawl.dev |

These can be set in the connector’s Kubernetes CronJob manifest or in mill/config/.env for local development.