Harvesters
The unified harvester system is a single binary for harvesting real estate property data from multiple sources. It replaces the individual harvester binaries with one command-line interface that supports multiple property websites.
Overview
The harvester system provides:
- Single Binary: One executable for all harvesters
- Unified CLI: Consistent command-line interface across all sources
- Extensible Architecture: Easy to add new harvesters
- Common Data Format: Standardized property data structure
- Rate Limiting: Configurable rate limiting per source
- Mill API Integration: Built-in support for Mill API submission
- Health Checks: Monitor harvester status
Quick Start
List Available Harvesters

    cd harvesters
    ./bin/harvester -list

Run a Harvester

    ./bin/harvester \
      -harvester homes-co-nz \
      -location auckland \
      -pages 2 \
      -max 10 \
      -dry-run

Health Check

    ./bin/harvester -harvester homes-co-nz -health-check

Documentation
Getting Started
- Creating a New Harvester - Step-by-step guide to building a new harvester
- Implemented Harvesters - Complete list of all implemented harvesters
- Sources & Enrichers - Deep dive into scraper vs. enrichment architecture
- Harvester Health - Current health status of all harvesters
- Harvester TODO List - Comprehensive list of harvesters to implement
Current Status
- Total Implemented: 30 harvesters
- Healthy: 18 harvesters
- Unhealthy: 12 harvesters
See the Harvester Health page for detailed status information.
Features
Unified Interface
All harvesters share the same command-line interface:

    ./bin/harvester \
      -harvester <name> \
      -location <location> \
      -pages <number> \
      -max <number> \
      -delay <duration> \
      -verbose \
      -dry-run \
      -mill-api "<url>" \
      -mill-api-key "<key>"

Standardized Data Format
All harvesters return properties in a standardized format:

    {
      "source_id": "12345",
      "source": "homes.co.nz",
      "address": "123 Example Street, Auckland",
      "price": 750000,
      "price_currency": "NZD",
      "bedrooms": 3,
      "bathrooms": 2.0,
      "car_spaces": 2,
      "property_type": "House",
      "description": "Beautiful family home...",
      "image_urls": ["https://example.com/image1.jpg"],
      "agent_name": "John Smith",
      "agent_phone": "+64 21 123 4567",
      "agency": "Example Realty",
      "listing_date": "2024-01-15T10:00:00Z",
      "source_url": "https://homes.co.nz/property/12345",
      "collection_time": "2024-01-20T15:30:00Z",
      "region": "Auckland",
      "features": ["garage", "garden", "deck"]
    }

Rate Limiting
Each harvester implements respectful rate limiting:
- Default: 2 seconds between requests
- Configurable via the -delay flag
- Domain-specific limits
- Respects website policies
Mill API Integration
Automatic integration with Mill API:

    ./bin/harvester \
      -harvester homes-co-nz \
      -location auckland \
      -mill-api "http://localhost:4000/api/v1" \
      -mill-api-key "your-token"

The harvester will:
- Scrape properties from the target website
- Submit them to Mill API in batches
- Handle authentication automatically
- Report success/failure statistics
Common Use Cases
Development Testing

    # Dry run without submitting to API
    ./bin/harvester \
      -harvester homes-co-nz \
      -location auckland \
      -pages 1 \
      -max 5 \
      -dry-run \
      -verbose

Production Harvesting

    # Full harvest with API submission
    ./bin/harvester \
      -harvester homes-co-nz \
      -location auckland \
      -pages 10 \
      -max 100 \
      -mill-api "https://api.garagejs.com/api/v1" \
      -mill-api-key "your-token"

Health Monitoring

    # Check all harvesters
    cd harvesters
    ./scripts/check_harvester_health.sh

    # Check specific harvester
    ./bin/harvester -harvester homes-co-nz -health-check

Architecture
Project Structure

    harvesters/
    ├── main.go              # Unified CLI application
    ├── go.mod               # Go module definition
    ├── Makefile             # Build and development commands
    ├── common/              # Shared types and interfaces
    │   ├── types.go         # Common data structures
    │   └── mill_client.go   # Mill API client
    └── sources/             # Individual harvester implementations
        ├── harcourts/       # Harcourts harvester
        ├── homes-co-nz/     # homes.co.nz harvester
        └── ...

Harvester Interface
All harvesters implement the common.Harvester interface:

    type Harvester interface {
        GetName() string
        GetSource() string
        ScrapeProperties(opts HarvesterOptions) ([]Property, error)
        GetStats() HarvesterStats
        SetRateLimit(delay time.Duration)
        HealthCheck() error
    }

Contributing
Adding a New Harvester
- Review Creating a New Harvester
- Create harvester directory in sources/
- Implement the Harvester interface
- Register in main.go
- Add tests
- Update documentation
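As a rough starting point for the implementation and registration steps above, the sketch below implements the interface listed under Harvester Interface for a made-up source. The method set comes from this page; everything else is assumed. The fields of `HarvesterOptions`, `Property`, and `HarvesterStats` are placeholders because the real definitions in `common/types.go` are not shown here, and the `registry` map is a hypothetical stand-in for however `main.go` registers harvesters.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Stand-ins for types from the common package. Their real fields are not
// shown in this document, so the ones below are assumptions.
type HarvesterOptions struct {
	Location string
	MaxPages int
}

type Property struct {
	SourceID string
	Source   string
}

type HarvesterStats struct {
	PropertiesFound int
}

// Harvester mirrors the interface listed under "Harvester Interface".
type Harvester interface {
	GetName() string
	GetSource() string
	ScrapeProperties(opts HarvesterOptions) ([]Property, error)
	GetStats() HarvesterStats
	SetRateLimit(delay time.Duration)
	HealthCheck() error
}

// exampleHarvester is a hypothetical source used only for illustration.
type exampleHarvester struct {
	delay time.Duration
	stats HarvesterStats
}

func (h *exampleHarvester) GetName() string   { return "example-site" }
func (h *exampleHarvester) GetSource() string { return "example.com" }

func (h *exampleHarvester) ScrapeProperties(opts HarvesterOptions) ([]Property, error) {
	if opts.Location == "" {
		return nil, errors.New("location is required")
	}
	// A real implementation would fetch and parse listing pages here,
	// waiting h.delay between requests. This stub returns one fixed record.
	props := []Property{{SourceID: "12345", Source: h.GetSource()}}
	h.stats.PropertiesFound += len(props)
	return props, nil
}

func (h *exampleHarvester) GetStats() HarvesterStats         { return h.stats }
func (h *exampleHarvester) SetRateLimit(delay time.Duration) { h.delay = delay }
func (h *exampleHarvester) HealthCheck() error               { return nil }

// Compile-time check that exampleHarvester satisfies the interface.
var _ Harvester = (*exampleHarvester)(nil)

// registry is a hypothetical stand-in for registration in main.go.
var registry = map[string]Harvester{}

func register(h Harvester) { registry[h.GetName()] = h }

func main() {
	h := &exampleHarvester{}
	h.SetRateLimit(2 * time.Second)
	register(h)

	opts := HarvesterOptions{Location: "auckland", MaxPages: 1}
	props, err := registry["example-site"].ScrapeProperties(opts)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Printf("%s returned %d properties\n", h.GetName(), len(props))
}
```

The compile-time assertion catches a missing or mistyped method before the harvester is ever run, which is useful when adding tests in the next step.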
Fixing Unhealthy Harvesters
- Review Harvester Health for issues
- Identify the root cause
- Implement fixes
- Test thoroughly
- Update health status
Resources
- Colly Documentation - Web scraping framework
- Mill API Documentation - API reference
- Go Documentation - Go language reference
Support
For issues or questions:
- Check the troubleshooting guide
- Review harvester health status
- See implemented harvesters for examples