Creating a New Connector
This guide will walk you through creating a new connector for the unified property connector system. Connectors are responsible for scraping property data from real estate websites and submitting it to the Mill API.
Prerequisites
Before you begin, ensure you have:
- Go 1.24.9+ installed
- Understanding of the common.Connector interface
- Familiarity with web scraping (we use Colly)
- Access to the target website’s structure
Step 1: Create the Connector Directory
Create a new directory for your connector in connectors/sources/:

    cd connectors/sources
    mkdir my-new-connector
    cd my-new-connector

Naming Convention:
- Use lowercase with hyphens (e.g., realestate-com-au)
- Match the domain name when possible
- Keep names descriptive and unique
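The naming convention can be applied mechanically when the connector maps to a single domain. The helper below is a minimal sketch; connectorDirName is a hypothetical name and is not part of the connector system.

```go
package main

import "strings"

// connectorDirName is a hypothetical helper that turns a domain such as
// "www.realestate.com.au" into a directory name such as "realestate-com-au".
func connectorDirName(domain string) string {
	name := strings.ToLower(strings.TrimSpace(domain))
	name = strings.TrimPrefix(name, "www.")
	return strings.ReplaceAll(name, ".", "-")
}
```

For example, connectorDirName("homes.co.nz") yields "homes-co-nz".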
Step 2: Implement the Connector Interface
Create connector.go in your new directory. All connectors must implement the common.Connector interface:

    package my_new_connector

    import (
        "fmt"
        "log"
        "strings"
        "sync"
        "time"

        "github.com/gocolly/colly/v2"

        "connectors/common"
    )

    // Connector implements the common.Connector interface
    type Connector struct {
        collector *colly.Collector
        source    string
        rateLimit time.Duration
        stats     common.ConnectorStats
        mu        sync.RWMutex
    }

    // NewConnector creates a new connector instance
    func NewConnector() common.Connector {
        c := colly.NewCollector(
            colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"),
        )

        // Set up rate limiting
        c.Limit(&colly.LimitRule{
            DomainGlob:  "*example.com*",
            Parallelism: 1,
            Delay:       2 * time.Second,
        })

        // Add headers to appear browser-like
        c.OnRequest(func(r *colly.Request) {
            r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
            r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
        })

        return &Connector{
            collector: c,
            source:    "example.com",
            rateLimit: 2 * time.Second,
            stats:     common.ConnectorStats{},
        }
    }

    // GetName returns the connector name (used in CLI)
    func (h *Connector) GetName() string {
        return "my-new-connector"
    }

    // GetSource returns the source website domain
    func (h *Connector) GetSource() string {
        return h.source
    }

    // SetRateLimit configures rate limiting
    func (h *Connector) SetRateLimit(delay time.Duration) {
        h.mu.Lock()
        defer h.mu.Unlock()
        h.rateLimit = delay
        h.collector.Limit(&colly.LimitRule{
            DomainGlob:  "*example.com*",
            Parallelism: 1,
            Delay:       delay,
        })
    }

    // GetStats returns current collection statistics
    func (h *Connector) GetStats() common.ConnectorStats {
        h.mu.RLock()
        defer h.mu.RUnlock()
        return h.stats
    }

    // HealthCheck verifies the connector can access the target website
    func (h *Connector) HealthCheck() error {
        testURL := "https://www.example.com"
        err := h.collector.Visit(testURL)
        if err != nil {
            return fmt.Errorf("failed to access %s: %w", testURL, err)
        }
        return nil
    }

    // ScrapeProperties implements the main scraping logic
    func (h *Connector) ScrapeProperties(opts common.ConnectorOptions) ([]common.Property, error) {
        var properties []common.Property
        var mu sync.Mutex

        // Reset stats
        h.mu.Lock()
        h.stats = common.ConnectorStats{}
        h.mu.Unlock()

        // Set up collector callbacks.
        // Note: Colly appends callbacks, so they accumulate if ScrapeProperties
        // is called more than once on the same connector.
        h.collector.OnHTML(".property-listing", func(e *colly.HTMLElement) {
            property := h.parseProperty(e)
            if property == nil {
                return
            }
            // opts is not visible inside parseProperty, so set the region here
            property.Region = opts.Location

            mu.Lock()
            properties = append(properties, *property)
            mu.Unlock()

            h.mu.Lock()
            h.stats.PropertiesCollected++
            h.mu.Unlock()
        })

        // Handle errors
        h.collector.OnError(func(r *colly.Response, err error) {
            h.mu.Lock()
            h.stats.ErrorsEncountered++
            h.mu.Unlock()
            if opts.Verbose {
                log.Printf("Error scraping %s: %v", r.Request.URL, err)
            }
        })

        // Build search URL based on location
        searchURL := h.buildSearchURL(opts.Location, opts.Pages)

        // Visit search pages
        for page := 1; page <= opts.Pages && len(properties) < opts.MaxProperties; page++ {
            url := fmt.Sprintf("%s&page=%d", searchURL, page)
            if opts.Verbose {
                log.Printf("Scraping page %d: %s", page, url)
            }
            if err := h.collector.Visit(url); err != nil && opts.Verbose {
                log.Printf("Visit failed for %s: %v", url, err)
            }
            time.Sleep(h.rateLimit) // Respect rate limits
        }

        return properties[:min(len(properties), opts.MaxProperties)], nil
    }

    // parseProperty extracts property data from an HTML element.
    // Region is filled in by the caller, which has access to the options.
    func (h *Connector) parseProperty(e *colly.HTMLElement) *common.Property {
        // Extract property details from HTML
        address := strings.TrimSpace(e.ChildText(".address"))
        priceText := strings.TrimSpace(e.ChildText(".price"))

        // Parse price (remove currency symbols, commas)
        price := h.parsePrice(priceText)

        // Extract other fields
        bedrooms := h.parseInt(e.ChildText(".bedrooms"))
        bathrooms := h.parseFloat(e.ChildText(".bathrooms"))

        // Get property URL
        propertyURL := e.Request.AbsoluteURL(e.ChildAttr("a", "href"))

        // Collect image URLs
        var imageURLs []string
        e.ForEach(".property-image img", func(_ int, img *colly.HTMLElement) {
            imgURL := e.Request.AbsoluteURL(img.Attr("src"))
            imageURLs = append(imageURLs, imgURL)
        })

        return &common.Property{
            SourceID:       h.extractSourceID(propertyURL),
            Source:         h.source,
            Address:        address,
            Price:          price,
            PriceCurrency:  h.detectCurrency(priceText),
            Bedrooms:       bedrooms,
            Bathrooms:      bathrooms,
            PropertyType:   strings.TrimSpace(e.ChildText(".property-type")),
            Description:    strings.TrimSpace(e.ChildText(".description")),
            ImageURLs:      imageURLs,
            SourceURL:      propertyURL,
            CollectionTime: time.Now(),
        }
    }

    // Helper functions

    func (h *Connector) buildSearchURL(location string, pages int) string {
        // Build the search URL based on your target website's structure
        return fmt.Sprintf("https://www.example.com/search?location=%s", location)
    }

    func (h *Connector) parsePrice(priceText string) int {
        // Remove currency symbols, commas, spaces
        // Convert to integer (cents or base unit)
        // Return 0 if parsing fails
        return 0 // Implement your parsing logic
    }

    func (h *Connector) parseInt(text string) int {
        // Extract integer from text
        return 0 // Implement your parsing logic
    }

    func (h *Connector) parseFloat(text string) float64 {
        // Extract float from text
        return 0.0 // Implement your parsing logic
    }

    func (h *Connector) extractSourceID(url string) string {
        // Extract unique ID from URL
        return "" // Implement your extraction logic
    }

    func (h *Connector) detectCurrency(priceText string) string {
        // Detect currency from price text
        return "USD" // Implement your detection logic
    }

Step 3: Create Tests
Create connector_test.go to test your connector:

    package my_new_connector

    import (
        "testing"
        "time" // needed for the Delay option below

        "connectors/common"
    )

    func TestNewConnector(t *testing.T) {
        h := NewConnector()
        if h == nil {
            t.Fatal("NewConnector returned nil")
        }

        if h.GetName() != "my-new-connector" {
            t.Errorf("Expected name 'my-new-connector', got '%s'", h.GetName())
        }

        if h.GetSource() != "example.com" {
            t.Errorf("Expected source 'example.com', got '%s'", h.GetSource())
        }
    }

    func TestHealthCheck(t *testing.T) {
        h := NewConnector()
        err := h.HealthCheck()
        if err != nil {
            t.Logf("Health check failed (may be expected): %v", err)
        }
    }

    func TestScrapeProperties(t *testing.T) {
        h := NewConnector()
        opts := common.ConnectorOptions{
            Location:      "test-location",
            Pages:         1,
            MaxProperties: 5,
            Delay:         2 * time.Second,
            Verbose:       true,
            DryRun:        true,
        }

        properties, err := h.ScrapeProperties(opts)
        if err != nil {
            t.Fatalf("ScrapeProperties failed: %v", err)
        }

        if len(properties) == 0 {
            t.Log("No properties found (may be expected)")
        }
    }

Create compliance_test.go to ensure your connector implements the interface correctly:

    package my_new_connector

    import (
        "testing"

        "connectors/common"
    )

    // TestConnectorInterface ensures the connector implements all required methods
    func TestConnectorInterface(t *testing.T) {
        var _ common.Connector = (*Connector)(nil)
    }

Step 4: Register the Connector
Add your connector to connectors/main.go:

    import (
        // ... existing imports
        my_new_connector "connectors/sources/my-new-connector"
    )

    func main() {
        registry := common.NewConnectorRegistry()

        // ... existing registrations
        registry.Register(my_new_connector.NewConnector())

        // ... rest of main function
    }

Also add a default location in the main() function’s location switch statement:

    case "my-new-connector":
        *location = "default-location"

Step 5: Test Your Connector
Build and test your connector:

    # From connectors directory
    cd connectors

    # Build the binary
    go build -o bin/connector .

    # Test your connector
    ./bin/connector -connector my-new-connector -location test -pages 1 -max 5 -dry-run -verbose

    # Run health check
    ./bin/connector -connector my-new-connector -health-check

    # Run tests
    go test ./sources/my-new-connector/...

Step 6: Best Practices
Rate Limiting
Always implement respectful rate limiting:

    // Default: 2 seconds between requests
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*example.com*",
        Parallelism: 1,
        Delay:       2 * time.Second,
    })

Error Handling
Handle errors gracefully:

    h.collector.OnError(func(r *colly.Response, err error) {
        h.mu.Lock()
        h.stats.ErrorsEncountered++
        h.mu.Unlock()
        if opts.Verbose {
            log.Printf("Error scraping %s: %v", r.Request.URL, err)
        }
    })

Data Validation
Validate extracted data:

    func (h *Connector) parseProperty(e *colly.HTMLElement) *common.Property {
        address := strings.TrimSpace(e.ChildText(".address"))
        if address == "" {
            return nil // Skip invalid properties
        }

        price := h.parsePrice(e.ChildText(".price"))
        if price <= 0 {
            return nil // Skip properties without valid price
        }

        // ... rest of parsing
    }

User-Agent and Headers
Use realistic browser headers:

    c := colly.NewCollector(
        colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"),
    )

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
        r.Headers.Set("Referer", "https://www.example.com")
    })

Property Data Completeness
Ensure all required fields are populated:
- SourceID: Unique identifier from the source
- Source: Website domain
- Address: Full property address
- Price: Price as integer (in smallest currency unit)
- PriceCurrency: Currency code (USD, EUR, etc.)
- SourceURL: Direct link to property listing
- CollectionTime: Timestamp when data was collected
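A completeness check over the required fields could look like the sketch below. The property struct is a hypothetical stand-in for common.Property limited to the fields listed above, and missingFields is an illustrative name, not part of the connector system.

```go
package main

import (
	"strings"
	"time"
)

// property is a hypothetical stand-in for common.Property, limited to the
// required fields listed above.
type property struct {
	SourceID       string
	Source         string
	Address        string
	Price          int
	PriceCurrency  string
	SourceURL      string
	CollectionTime time.Time
}

// missingFields returns the names of required fields that are empty or zero,
// in the order they are listed above.
func missingFields(p property) []string {
	checks := []struct {
		name  string
		empty bool
	}{
		{"SourceID", strings.TrimSpace(p.SourceID) == ""},
		{"Source", strings.TrimSpace(p.Source) == ""},
		{"Address", strings.TrimSpace(p.Address) == ""},
		{"Price", p.Price <= 0},
		{"PriceCurrency", strings.TrimSpace(p.PriceCurrency) == ""},
		{"SourceURL", strings.TrimSpace(p.SourceURL) == ""},
		{"CollectionTime", p.CollectionTime.IsZero()},
	}

	var missing []string
	for _, c := range checks {
		if c.empty {
			missing = append(missing, c.name)
		}
	}
	return missing
}
```

A connector might skip, or log in verbose mode, any property for which missingFields returns a non-empty slice.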
Step 7: Integration with Mill API
Your connector automatically integrates with Mill API when run with API credentials:

    ./bin/connector \
        -connector my-new-connector \
        -location test \
        -mill-api "http://localhost:4000/api/v1" \
        -mill-api-key "your-token"

The connector will:
- Scrape properties
- Submit them to Mill API in batches
- Handle authentication automatically via token service
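Batching itself is handled for you, but the idea can be illustrated with a small generic helper. This is only a sketch: splitIntoBatches is a hypothetical name, and the batch size the system actually uses is not stated in this guide.

```go
package main

// splitIntoBatches divides items into consecutive slices of at most size
// elements. It returns nil when size is not positive.
func splitIntoBatches[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var batches [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		batches = append(batches, items[start:end])
	}
	return batches
}
```

With five properties and a batch size of two, this produces batches of two, two, and one.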
Common Patterns
Pagination
Handle paginated results:

    for page := 1; page <= opts.Pages && len(properties) < opts.MaxProperties; page++ {
        url := fmt.Sprintf("%s?page=%d", baseURL, page)
        h.collector.Visit(url)
        time.Sleep(h.rateLimit)
    }

Dynamic Content
For JavaScript-rendered content, consider:
- Using browser automation (Playwright, Selenium)
- Finding API endpoints the site uses
- Using headless browsers if necessary
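When a site loads its listings from a JSON endpoint, decoding that response is often simpler than rendering the page. The sketch below assumes a made-up response shape; the listings, id, address, and price keys are hypothetical and must be replaced with whatever the target site actually returns.

```go
package main

import "encoding/json"

// apiListing mirrors one entry of a hypothetical JSON response.
type apiListing struct {
	ID      string `json:"id"`
	Address string `json:"address"`
	Price   int    `json:"price"`
}

// apiResponse is the assumed top-level shape: {"listings": [...]}.
type apiResponse struct {
	Listings []apiListing `json:"listings"`
}

// decodeListings parses a response body into listings and returns an error
// when the body is not valid JSON of the assumed shape.
func decodeListings(body []byte) ([]apiListing, error) {
	var resp apiResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	return resp.Listings, nil
}
```

The decoded listings can then be mapped onto common.Property values in the same way parseProperty does for HTML elements.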
Anti-Bot Measures
If you encounter anti-bot measures:
- Increase delays between requests
- Rotate user agents
- Use proxy rotation (if allowed)
- Respect robots.txt
- Consider official APIs if available
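User-agent rotation can be sketched as a small round-robin picker. The userAgentRotator type below is hypothetical and uses only the standard library; which user-agent strings to rotate through is left to you.

```go
package main

import "sync"

// userAgentRotator hands out user-agent strings in round-robin order.
// It is safe for use from multiple goroutines.
type userAgentRotator struct {
	mu     sync.Mutex
	agents []string
	next   int
}

// Next returns the next user agent, or "" when none are configured.
func (r *userAgentRotator) Next() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.agents) == 0 {
		return ""
	}
	ua := r.agents[r.next%len(r.agents)]
	r.next++
	return ua
}
```

Inside the OnRequest callback shown earlier, the value could be applied with r.Headers.Set("User-Agent", rotator.Next()).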
Troubleshooting
Common Issues
Issue: No properties found
- Check CSS selectors match the website structure
- Verify the search URL is correct
- Enable verbose mode: -verbose
- Check if the site requires authentication
Issue: Rate limiting / 429 errors
- Increase delay: -delay 5s
- Reduce parallelism
- Check if site has API limits
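Rather than picking a fixed larger delay, a connector can back off progressively after repeated 429 responses. The function below is a minimal sketch of capped exponential backoff; backoffDelay is a hypothetical helper, not an existing part of the connector system.

```go
package main

import "time"

// backoffDelay doubles base once per failed attempt and caps the result at
// maxDelay. Attempt 0 returns base unchanged (or maxDelay if base exceeds it).
func backoffDelay(base, maxDelay time.Duration, attempt int) time.Duration {
	delay := base
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}
```

With a 2 second base and a 30 second cap, successive attempts wait 2s, 4s, 8s, 16s, and then 30s. The result could be passed to SetRateLimit before retrying.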
Issue: Parsing errors
- Validate HTML structure hasn’t changed
- Add defensive parsing with fallbacks
- Log raw HTML for debugging
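As one illustration of defensive parsing, the sketch below extracts a price from free-form text and falls back to 0 when nothing usable is found. It assumes a two-decimal currency and returns the amount in the smallest unit (cents); parsePriceCents and priceDigits are hypothetical names, and real sites may need extra handling for ranges or suffixes such as "k".

```go
package main

import (
	"math"
	"regexp"
	"strconv"
	"strings"
)

// priceDigits matches the first number in the text, allowing comma grouping
// and an optional decimal part, e.g. "1,250,000" or "499,000.50".
var priceDigits = regexp.MustCompile(`\d[\d,]*(?:\.\d+)?`)

// parsePriceCents returns the first price found in text, in cents.
// It returns 0 when the text contains no parsable number.
func parsePriceCents(text string) int {
	match := priceDigits.FindString(text)
	if match == "" {
		return 0
	}
	value, err := strconv.ParseFloat(strings.ReplaceAll(match, ",", ""), 64)
	if err != nil {
		return 0
	}
	return int(math.Round(value * 100))
}
```

Returning 0 on failure pairs with the price <= 0 check in the Data Validation section, so listings without a usable price are skipped instead of causing an error.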
Issue: Health check fails
- Verify website is accessible
- Check network connectivity
- Ensure URL is correct
Next Steps
After creating your connector:
- Test thoroughly with various locations and parameters
- Run health checks regularly
- Monitor error rates in production
- Update selectors if website structure changes
- Document any special requirements or limitations
Example: Complete Connector
See existing connectors for reference:
- connectors/sources/harcourts/ - Simple HTML scraping
- connectors/sources/homes-co-nz/ - Complex pagination
- connectors/sources/zillow/ - API-based approach