Data Access & Reliability
Some data sources may return incomplete responses, rate-limit automated requests, or require JavaScript rendering before content is available. This guide covers the tools available in the connectors codebase for improving data access reliability.
Strategy Overview
| Technique | Difficulty | Cost | Best For |
|---|---|---|---|
| Header tuning & rate limiting | Low | Free | Most sources |
| Headless browser (chromedp) | Medium | Free | JS-rendered pages |
| Proxy rotation | Medium | $$ | Rate-limited sources |
| Managed scraping API (e.g. Firecrawl) | Low | $$$ | Complex JS rendering |
| Hybrid (proxy + headless) | High | $$ | Difficult sources |
1. Header Tuning & Rate Limiting (Already In Use)
Most connectors already apply this baseline:
- Rotating User-Agent strings via `extensions.RandomUserAgent(c)` (Colly)
- Standard browser headers: `Accept`, `Accept-Language`, `Sec-Ch-Ua`, `Upgrade-Insecure-Requests`
- Referer headers via `extensions.Referer(c)`
- Rate limiting with jitter through Colly's `LimitRule` with `RandomDelay`
- Session warm-up: visiting the homepage to collect cookies before deeper pages
- Adaptive backoff: `AdaptiveRateLimiter` handles 429 responses with exponential backoff
This is sufficient for most sources and should always be the starting point.
2. Headless Browser (chromedp)
Some sources render content client-side via JavaScript. The shared utility at `connectors/common/headless.go` uses headless Chrome to fetch fully-rendered HTML:
```go
import "connectors/common"

html, err := common.FetchRenderedHTML(ctx, url, 20*time.Second, nil)
```
Options
Pass `HeadlessFetchOptions` to configure proxy and additional flags:
```go
html, err := common.FetchRenderedHTML(ctx, url, 20*time.Second, &common.HeadlessFetchOptions{
	ProxyURL:   "http://user:pass@proxy.example.com:8080",
	ExtraFlags: []string{"--disable-blink-features=AutomationControlled"},
})
```
Limitations
- Slow (~2-20s per page) — not suitable for high-volume discovery
- Requires Chrome/Chromium installed in the container image
3. Proxy Rotation
When a source rate-limits by IP, rotating through multiple egress addresses improves throughput. The utility at `connectors/common/proxy.go` provides a `ProxyRotator` that integrates with both Colly and `net/http`.
Configuration
Set the `PROXY_URLS` environment variable (comma-separated) or pass URLs directly:
```sh
export PROXY_URLS="http://user:pass@us.proxy.example.com:8080,http://user:pass@eu.proxy.example.com:8080"
```
Usage with Colly
Section titled “Usage with Colly”rotator := common.NewProxyRotatorFromEnv()
c := colly.NewCollector()if rotator != nil { c.SetProxyFunc(rotator.CollyProxyFunc())}Usage with net/http
```go
rotator := common.NewProxyRotatorFromEnv()
transport := rotator.HTTPTransport()
client := &http.Client{Transport: transport}
```
Proxy Providers
Several commercial proxy providers offer residential and datacenter pools with geographic targeting. Evaluate based on:
- Geographic coverage — choose a provider with IPs in the regions your connectors target
- Pricing model — per-GB vs. per-request vs. monthly flat rate
- Sticky sessions — needed when cookies must persist across requests on the same IP
- Managed vs. raw — some providers bundle JS rendering and header management into a single API
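Round-robin rotation itself is simple. The stdlib-only sketch below shows the idea behind plugging a rotator into `http.Transport.Proxy`; it is illustrative, not the actual `common/proxy.go` implementation:

```go
package main

import (
	"net/http"
	"net/url"
	"sync/atomic"
)

// rotator cycles through a fixed list of proxy URLs round-robin.
type rotator struct {
	proxies []*url.URL
	next    atomic.Uint64
}

func newRotator(raw []string) (*rotator, error) {
	r := &rotator{}
	for _, s := range raw {
		u, err := url.Parse(s)
		if err != nil {
			return nil, err
		}
		r.proxies = append(r.proxies, u)
	}
	return r, nil
}

// proxyFunc satisfies http.Transport.Proxy: each request gets the next proxy.
func (r *rotator) proxyFunc(_ *http.Request) (*url.URL, error) {
	i := r.next.Add(1) - 1
	return r.proxies[i%uint64(len(r.proxies))], nil
}

func main() {
	r, err := newRotator([]string{
		"http://user:pass@us.proxy.example.com:8080",
		"http://user:pass@eu.proxy.example.com:8080",
	})
	if err != nil {
		panic(err)
	}
	_ = &http.Client{Transport: &http.Transport{Proxy: r.proxyFunc}}
}
```

The atomic counter keeps rotation safe under concurrent requests without a mutex.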
Sticky Sessions
For multi-page fetches where cookies must persist, use sticky sessions (same IP for a session window):
```go
rotator := common.NewProxyRotator(proxyURLs, common.ProxyRotatorOptions{
	StickySessionDuration: 5 * time.Minute,
})
```
4. Managed Scraping APIs
Services like Firecrawl handle JavaScript rendering and return clean HTML or Markdown. They can be useful when headless Chrome alone isn’t sufficient, or when you want to avoid managing browser infrastructure.
```sh
export FIRECRAWL_API_KEY="fc-your-api-key"
export FIRECRAWL_API_URL="https://api.firecrawl.dev" # optional, defaults to this
```
The utility at `connectors/common/firecrawl.go` provides a client:
```go
fc := common.NewFirecrawlClientFromEnv()
if fc == nil {
	// Firecrawl not configured, fall back to direct fetching
}

result, err := fc.ScrapeURL(ctx, "https://example.com/listings", nil)
if err != nil {
	return err
}
// result.HTML contains the rendered page HTML
// result.Markdown contains clean markdown extraction
```
Options
```go
result, err := fc.ScrapeURL(ctx, url, &common.FirecrawlScrapeOptions{
	Formats: []string{"html", "markdown"},
	WaitFor: 3000, // wait 3s for JS to render
})
```
When to Consider a Managed API
Section titled “When to Consider a Managed API”- The source requires complex JS rendering that headless Chrome struggles with
- You want clean structured output without writing CSS selectors
- Rapid prototyping of new connectors (get data flowing first, optimize later)
5. Choosing an Approach
Start with the lightest technique and escalate only as needed:
- Start simple: Header tuning and rate limiting work for most sources
- Add proxies: If requests are being rate-limited, add `ProxyRotator`
- Try headless: If the source requires JS rendering, use `FetchRenderedHTML`
- Consider a managed API: If headless rendering is insufficient, try Firecrawl or a similar service
- Monitor and adapt: Source behaviour changes over time; review connector health regularly
6. Environment Variables Reference
| Variable | Description | Example |
|---|---|---|
| `PROXY_URLS` | Comma-separated proxy URLs for rotation | `http://user:pass@proxy:8080,...` |
| `FIRECRAWL_API_KEY` | Firecrawl API authentication key | `fc-abc123...` |
| `FIRECRAWL_API_URL` | Firecrawl API base URL (optional) | `https://api.firecrawl.dev` |
These can be set in the connector’s Kubernetes CronJob manifest or in `mill/config/.env` for local development.