
Harvesters

The unified harvester system collects real estate property data from multiple property websites through a single binary. It replaces the individual per-source harvester binaries with one command-line interface.

The harvester system provides:

  • Single Binary: One executable for all harvesters
  • Unified CLI: Consistent command-line interface across all sources
  • Extensible Architecture: Easy to add new harvesters
  • Common Data Format: Standardized property data structure
  • Rate Limiting: Configurable rate limiting per source
  • Mill API Integration: Built-in support for Mill API submission
  • Health Checks: Monitor harvester status
List the available harvesters:
cd harvesters
./bin/harvester -list
Run a small dry-run harvest:
./bin/harvester \
-harvester homes-co-nz \
-location auckland \
-pages 2 \
-max 10 \
-dry-run
Check the health of a single harvester:
./bin/harvester -harvester homes-co-nz -health-check
  • Total Implemented: 30 harvesters
  • Healthy: 18 harvesters
  • Unhealthy: 12 harvesters

See the Harvester Health page for detailed status information.

All harvesters share the same command-line interface:

./bin/harvester \
-harvester <name> \
-location <location> \
-pages <number> \
-max <number> \
-delay <duration> \
-verbose \
-dry-run \
-mill-api "<url>" \
-mill-api-key "<key>"
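The single-dash flag style above matches Go's standard `flag` package, so the options could be parsed roughly as follows. This is a minimal sketch, not the project's actual `main.go`: the `Config` struct and `ParseArgs` function are hypothetical, and every default except the 2-second delay documented on this page is an assumption.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// Config holds the parsed command-line options. The struct and its field
// names are hypothetical; only the flag names come from the usage text.
type Config struct {
	Harvester   string
	Location    string
	Pages       int
	Max         int
	Delay       time.Duration
	Verbose     bool
	DryRun      bool
	MillAPI     string
	MillAPIKey  string
	List        bool
	HealthCheck bool
}

// ParseArgs parses the documented flags from args (without the program name).
func ParseArgs(args []string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("harvester", flag.ContinueOnError)
	fs.StringVar(&c.Harvester, "harvester", "", "name of the harvester to run")
	fs.StringVar(&c.Location, "location", "", "location to search")
	// The defaults for -pages and -max are assumptions made for this sketch.
	fs.IntVar(&c.Pages, "pages", 1, "number of result pages to fetch")
	fs.IntVar(&c.Max, "max", 0, "maximum properties to collect (0 assumed to mean no limit)")
	fs.DurationVar(&c.Delay, "delay", 2*time.Second, "delay between requests")
	fs.BoolVar(&c.Verbose, "verbose", false, "enable verbose logging")
	fs.BoolVar(&c.DryRun, "dry-run", false, "scrape without submitting to the API")
	fs.StringVar(&c.MillAPI, "mill-api", "", "Mill API base URL")
	fs.StringVar(&c.MillAPIKey, "mill-api-key", "", "Mill API key")
	fs.BoolVar(&c.List, "list", false, "list available harvesters")
	fs.BoolVar(&c.HealthCheck, "health-check", false, "run a health check and exit")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return c, nil
}

func main() {
	cfg, err := ParseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Because `-delay` is parsed as a `time.Duration`, values such as `500ms` or `3s` would be accepted under this sketch.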

All harvesters return properties in a standardized format:

{
  "source_id": "12345",
  "source": "homes.co.nz",
  "address": "123 Example Street, Auckland",
  "price": 750000,
  "price_currency": "NZD",
  "bedrooms": 3,
  "bathrooms": 2.0,
  "car_spaces": 2,
  "property_type": "House",
  "description": "Beautiful family home...",
  "image_urls": ["https://example.com/image1.jpg"],
  "agent_name": "John Smith",
  "agent_phone": "+64 21 123 4567",
  "agency": "Example Realty",
  "listing_date": "2024-01-15T10:00:00Z",
  "source_url": "https://homes.co.nz/property/12345",
  "collection_time": "2024-01-20T15:30:00Z",
  "region": "Auckland",
  "features": ["garage", "garden", "deck"]
}
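For consumers written in Go, the record above could be decoded into a struct like the one below. This is a sketch under assumptions: the real type in `common/types.go` is not shown on this page, so the struct name, Go field names, and field types here are inferred from the sample JSON only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Property mirrors the JSON keys of the sample record. The Go field names
// and types are assumptions inferred from that sample.
type Property struct {
	SourceID       string    `json:"source_id"`
	Source         string    `json:"source"`
	Address        string    `json:"address"`
	Price          int64     `json:"price"`
	PriceCurrency  string    `json:"price_currency"`
	Bedrooms       int       `json:"bedrooms"`
	Bathrooms      float64   `json:"bathrooms"`
	CarSpaces      int       `json:"car_spaces"`
	PropertyType   string    `json:"property_type"`
	Description    string    `json:"description"`
	ImageURLs      []string  `json:"image_urls"`
	AgentName      string    `json:"agent_name"`
	AgentPhone     string    `json:"agent_phone"`
	Agency         string    `json:"agency"`
	ListingDate    time.Time `json:"listing_date"`
	SourceURL      string    `json:"source_url"`
	CollectionTime time.Time `json:"collection_time"`
	Region         string    `json:"region"`
	Features       []string  `json:"features"`
}

// ParseProperty decodes a single property record from JSON.
// Timestamps must be RFC 3339 strings, as in the sample.
func ParseProperty(data []byte) (Property, error) {
	var p Property
	err := json.Unmarshal(data, &p)
	return p, err
}

func main() {
	sample := []byte(`{
		"source_id": "12345",
		"source": "homes.co.nz",
		"address": "123 Example Street, Auckland",
		"price": 750000,
		"price_currency": "NZD",
		"bathrooms": 2.0,
		"listing_date": "2024-01-15T10:00:00Z",
		"features": ["garage", "garden", "deck"]
	}`)
	p, err := ParseProperty(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s: %s, %d %s\n", p.Source, p.Address, p.Price, p.PriceCurrency)
}
```

Using `float64` for `bathrooms` follows the `2.0` in the sample, which suggests half bathrooms may be representable; `price` is modelled as an integer because the sample shows a whole number.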

Each harvester implements respectful rate limiting:

  • Default: 2 seconds between requests
  • Configurable via -delay flag
  • Domain-specific limits
  • Respects website policies

The harvester submits results to the Mill API automatically when given an API URL and key:

./bin/harvester \
-harvester homes-co-nz \
-location auckland \
-mill-api "http://localhost:4000/api/v1" \
-mill-api-key "your-token"

The harvester will:

  1. Scrape properties from the target website
  2. Submit them to Mill API in batches
  3. Handle authentication automatically
  4. Report success/failure statistics
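The batch-and-report flow in the steps above might look like the following sketch. Several details are assumptions that this page does not confirm: the `/properties` endpoint path, the `Bearer` authorization scheme, the batch size, and the names `Batches` and `SubmitAll`. The demo runs against a local test server that stands in for the Mill API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Property is trimmed to a few fields for this sketch.
type Property struct {
	SourceID string `json:"source_id"`
	Source   string `json:"source"`
	Address  string `json:"address"`
}

// Batches splits props into slices of at most size elements.
func Batches(props []Property, size int) [][]Property {
	if size <= 0 {
		size = 1
	}
	var out [][]Property
	for start := 0; start < len(props); start += size {
		end := start + size
		if end > len(props) {
			end = len(props)
		}
		out = append(out, props[start:end])
	}
	return out
}

// SubmitAll posts each batch as JSON and counts submitted and failed properties.
// The "/properties" path and Bearer scheme are assumptions, not the documented API.
func SubmitAll(client *http.Client, baseURL, apiKey string, props []Property, batchSize int) (submitted, failed int) {
	for _, batch := range Batches(props, batchSize) {
		body, err := json.Marshal(batch)
		if err != nil {
			failed += len(batch)
			continue
		}
		req, err := http.NewRequest(http.MethodPost, baseURL+"/properties", bytes.NewReader(body))
		if err != nil {
			failed += len(batch)
			continue
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Authorization", "Bearer "+apiKey)
		resp, err := client.Do(req)
		if err != nil {
			failed += len(batch)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode >= 200 && resp.StatusCode < 300 {
			submitted += len(batch)
		} else {
			failed += len(batch)
		}
	}
	return submitted, failed
}

func main() {
	// Local stand-in for the Mill API: accepts only the expected token.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer your-token" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusCreated)
	}))
	defer srv.Close()

	props := make([]Property, 5)
	submitted, failed := SubmitAll(srv.Client(), srv.URL, "your-token", props, 2)
	fmt.Printf("submitted=%d failed=%d\n", submitted, failed)
}
```

Counting failures per batch rather than aborting on the first error lets one bad batch fail without discarding the rest of the harvest.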
# Dry run without submitting to API
./bin/harvester \
-harvester homes-co-nz \
-location auckland \
-pages 1 \
-max 5 \
-dry-run \
-verbose
# Full harvest with API submission
./bin/harvester \
-harvester homes-co-nz \
-location auckland \
-pages 10 \
-max 100 \
-mill-api "https://api.garagejs.com/api/v1" \
-mill-api-key "your-token"
# Check all harvesters
cd harvesters
./scripts/check_harvester_health.sh
# Check specific harvester
./bin/harvester -harvester homes-co-nz -health-check
harvesters/
├── main.go                 # Unified CLI application
├── go.mod                  # Go module definition
├── Makefile                # Build and development commands
├── common/                 # Shared types and interfaces
│   ├── types.go            # Common data structures
│   └── mill_client.go      # Mill API client
└── sources/                # Individual harvester implementations
    ├── harcourts/          # Harcourts harvester
    ├── homes-co-nz/        # homes.co.nz harvester
    └── ...

All harvesters implement the common.Harvester interface:

type Harvester interface {
    GetName() string
    GetSource() string
    ScrapeProperties(opts HarvesterOptions) ([]Property, error)
    GetStats() HarvesterStats
    SetRateLimit(delay time.Duration)
    HealthCheck() error
}
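To make the interface concrete, here is a minimal stub that satisfies it. The method set is taken from the interface above, but the fields of `HarvesterOptions`, `Property`, and `HarvesterStats` are not shown on this page, so the ones below are placeholders, and `ExampleHarvester` returns canned data instead of scraping a site.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Placeholder types: the real definitions live in common/types.go and are
// not shown here, so these fields are assumptions.
type Property struct {
	SourceID string
	Source   string
	Address  string
}

type HarvesterOptions struct {
	Location      string
	MaxProperties int
}

type HarvesterStats struct {
	PropertiesFound int
	Errors          int
}

// Harvester repeats the interface from the documentation.
type Harvester interface {
	GetName() string
	GetSource() string
	ScrapeProperties(opts HarvesterOptions) ([]Property, error)
	GetStats() HarvesterStats
	SetRateLimit(delay time.Duration)
	HealthCheck() error
}

// ExampleHarvester is a hypothetical stub backed by in-memory listings.
type ExampleHarvester struct {
	delay    time.Duration
	stats    HarvesterStats
	listings []Property
}

// Compile-time check that the stub satisfies the interface.
var _ Harvester = (*ExampleHarvester)(nil)

func (h *ExampleHarvester) GetName() string   { return "example" }
func (h *ExampleHarvester) GetSource() string { return "example.com" }

// ScrapeProperties returns the canned listings, capped at MaxProperties.
// A real harvester would fetch and parse pages here, pausing h.delay between requests.
func (h *ExampleHarvester) ScrapeProperties(opts HarvesterOptions) ([]Property, error) {
	if opts.Location == "" {
		h.stats.Errors++
		return nil, errors.New("location is required")
	}
	props := h.listings
	if opts.MaxProperties > 0 && len(props) > opts.MaxProperties {
		props = props[:opts.MaxProperties]
	}
	h.stats.PropertiesFound += len(props)
	return props, nil
}

func (h *ExampleHarvester) GetStats() HarvesterStats         { return h.stats }
func (h *ExampleHarvester) SetRateLimit(delay time.Duration) { h.delay = delay }

// HealthCheck reports an error when the stub has nothing to serve.
func (h *ExampleHarvester) HealthCheck() error {
	if len(h.listings) == 0 {
		return errors.New("no listings available")
	}
	return nil
}

func main() {
	h := &ExampleHarvester{listings: []Property{
		{SourceID: "1", Source: "example.com", Address: "1 Example Street"},
		{SourceID: "2", Source: "example.com", Address: "2 Example Street"},
	}}
	h.SetRateLimit(2 * time.Second)
	props, err := h.ScrapeProperties(HarvesterOptions{Location: "auckland", MaxProperties: 1})
	fmt.Println(len(props), err, h.GetStats().PropertiesFound)
}
```

The `var _ Harvester = (*ExampleHarvester)(nil)` line makes the compiler reject the file if a method is missing or has the wrong signature.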

To add a new harvester:

  1. Review Creating a New Harvester
  2. Create harvester directory in sources/
  3. Implement the Harvester interface
  4. Register in main.go
  5. Add tests
  6. Update documentation
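The "Register in main.go" step could be backed by a name-to-constructor registry such as the sketch below. How `main.go` actually registers harvesters is not shown on this page, so `Registry`, `Register`, `Lookup`, and `Names` are hypothetical, and the `Harvester` interface is trimmed to a single method to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// Harvester is trimmed to one method for this sketch.
type Harvester interface {
	GetName() string
}

// Factory builds a fresh harvester instance.
type Factory func() Harvester

// Registry maps harvester names to factories. It is a hypothetical helper,
// not the project's actual registration mechanism.
type Registry struct {
	factories map[string]Factory
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]Factory)}
}

// Register adds a harvester under name and rejects duplicates.
func (r *Registry) Register(name string, f Factory) error {
	if _, exists := r.factories[name]; exists {
		return fmt.Errorf("harvester %q already registered", name)
	}
	r.factories[name] = f
	return nil
}

// Lookup builds the harvester selected by a -harvester value.
func (r *Registry) Lookup(name string) (Harvester, error) {
	f, ok := r.factories[name]
	if !ok {
		return nil, fmt.Errorf("unknown harvester %q", name)
	}
	return f(), nil
}

// Names returns the registered names in sorted order, e.g. for a -list flag.
func (r *Registry) Names() []string {
	names := make([]string, 0, len(r.factories))
	for name := range r.factories {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// stub is a placeholder harvester used only by the demo.
type stub struct{ name string }

func (s stub) GetName() string { return s.name }

func main() {
	reg := NewRegistry()
	for _, name := range []string{"homes-co-nz", "harcourts"} {
		name := name // capture a per-iteration copy for the closure
		if err := reg.Register(name, func() Harvester { return stub{name: name} }); err != nil {
			fmt.Println(err)
		}
	}
	fmt.Println(reg.Names())
}
```

Keeping the registry as a value rather than a package-level map makes it easy to build a separate instance in tests.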

To fix an unhealthy harvester:

  1. Review Harvester Health for issues
  2. Identify the root cause
  3. Implement fixes
  4. Test thoroughly
  5. Update health status

For issues or questions: