Property Merge Strategy
When a connector submits a property that already exists in the database (identified by its address_hash), Mill does not blindly overwrite the existing record. Instead it applies a field-level merge strategy that preserves enriched or higher-quality data while still accepting fresh updates from the source.
This page explains the rules, the reasoning behind each one, and the edge cases to be aware of.
Why a merge strategy is necessary
Multiple connectors can submit data for the same physical address. For example, a single property might appear in both homes.co.nz and a direct MLS feed. Each source assigns its own internal ID and may hold only partial information.
A naive upsert (overwrite everything) creates two problems:
- Primary key collisions: if source B’s ID for a property is the same string as the ID of a different, already-existing property, overwriting id would violate the primary key constraint.
- Data regression: geocoding enriches a record with precise coordinates. A later connector message with no location data would wipe those coordinates, degrading quality.
The merge strategy solves both problems by choosing the best available value for every field on each update.
Deduplication key
Properties are deduplicated by address_hash, an MD5 hash derived from the normalised street, city, state, postal code, and country. When two records share the same hash, they are considered the same physical property, regardless of source ID.
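As a rough illustration of how such a key could be derived, the sketch below hashes five normalised address parts with MD5. The `normalise` and `addressHash` helpers, the `|` separator, and the normalisation rules (lowercasing, whitespace collapsing) are assumptions made for this example; Mill's actual normalisation may differ.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// normalise is a hypothetical normalisation step: lowercase the input and
// collapse all runs of whitespace to single spaces. Mill's real rules may differ.
func normalise(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// addressHash returns the hex-encoded MD5 of the normalised address parts.
// The "|" separator and the field order are assumptions for illustration.
func addressHash(street, city, state, postalCode, country string) string {
	parts := []string{street, city, state, postalCode, country}
	for i, p := range parts {
		parts[i] = normalise(p)
	}
	sum := md5.Sum([]byte(strings.Join(parts, "|")))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := addressHash("12 Queen St", "Auckland", "AUK", "1010", "NZ")
	b := addressHash("  12  queen st ", "AUCKLAND", "auk", "1010", "nz")
	// Both spellings normalise to the same string, so the keys match.
	fmt.Println(a == b)
}
```

The point of hashing normalised values is that cosmetic differences between sources (case, stray spaces) do not produce two records for one address.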
Field-level merge rules
Fields that are always preserved
| Field | Reason |
|---|---|
| id | Primary key. Never changed after initial insert. Prevents PK collisions when a different source assigns a new ID to the same address. |
| created_at | Records when this property first appeared in the system. Always reflects the original insertion time. |
Fields that always take the incoming value
| Field | Reason |
|---|---|
| updated_at | Always reflects the timestamp of the most recent message, so consumers can detect recently changed records. |
| status | Listing status changes (e.g. active → sold) must propagate immediately, so any non-empty incoming value wins. The one exception is an empty string from the connector, which falls back to the existing value. |
String fields — incoming wins if non-empty
For all text fields, the rule is:

    new_value = incoming != '' ? incoming : existing

Fields covered: title, description, mls_number, property_name, property_subtype, currency, property_type, data_source, address_street, address_unit, address_city, address_state, address_postal_code, address_country, address_full, source_url.
This means:
- A connector that sends a richer description will update the record.
- A connector that sends an empty string (because the field was missing in its feed) will not wipe an existing value.
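The actual merge runs inside the SQL upsert, but the string rule can be modelled in a few lines of Go. The `mergeString` helper below is a hypothetical name used only for this sketch.

```go
package main

import "fmt"

// mergeString models the string rule: the incoming value wins unless it is
// empty, in which case the existing value is kept.
func mergeString(incoming, existing string) string {
	if incoming != "" {
		return incoming
	}
	return existing
}

func main() {
	// A richer description replaces the old one.
	fmt.Println(mergeString("Sunny three-bedroom villa", "Villa"))
	// An empty incoming value leaves the existing description untouched.
	fmt.Println(mergeString("", "Villa"))
}
```

One consequence of this rule is that a connector cannot deliberately clear a text field by sending an empty string.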
Numeric fields — incoming wins if non-zero
For all numeric detail and price fields, the rule is:

    new_value = incoming != 0 ? incoming : existing

Fields covered: price, price_current, bedrooms_total, bathrooms_total, square_meters_total, square_meters_living, lot_size_sqm, year_built, parking_total_spaces, levels_total.
This protects against connectors that omit optional fields (Go’s zero value for int/float64 is 0), which would otherwise overwrite real data.
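A minimal Go model of the numeric rule, using a generic helper so the same function covers integer and floating-point fields. The `number` constraint and `mergeNumber` name are assumptions for this sketch; the real merge happens in SQL.

```go
package main

import "fmt"

// number lists the numeric kinds this sketch supports.
type number interface {
	~int | ~int64 | ~float64
}

// mergeNumber models the numeric rule: the incoming value wins unless it is
// zero, which is treated as "not provided".
func mergeNumber[T number](incoming, existing T) T {
	if incoming != 0 {
		return incoming
	}
	return existing
}

func main() {
	// A connector that omits bedrooms_total sends Go's zero value; 3 is kept.
	fmt.Println(mergeNumber(0, 3))
	// A real price replaces a missing one.
	fmt.Println(mergeNumber(650000.0, 0.0))
}
```

Because zero doubles as "missing", a genuine zero (for example, a property with no parking spaces) cannot replace an existing non-zero value under this rule.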
data_quality_score — highest value wins

    data_quality_score = GREATEST(incoming, existing)

The score can only increase over time. Once a high-quality source has enriched a record, a lower-quality source cannot degrade it.
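To make the "can only increase" behaviour concrete, the sketch below replays a sequence of updates through a Go equivalent of GREATEST. The `mergeScore` helper and the use of float64 for the score are assumptions for illustration.

```go
package main

import "fmt"

// mergeScore mirrors GREATEST(incoming, existing) for two non-null values.
func mergeScore(incoming, existing float64) float64 {
	if incoming > existing {
		return incoming
	}
	return existing
}

func main() {
	score := 0.0
	// Three messages arrive with different quality scores.
	for _, incoming := range []float64{0.4, 0.9, 0.6} {
		score = mergeScore(incoming, score)
		fmt.Println(score) // 0.4, then 0.9, then 0.9 again
	}
}
```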
Location — incoming wins if non-zero

    latitude/longitude/s2_cell_id = incoming != 0 ? incoming : existing
    geom = incoming IS NOT NULL ? incoming : existing

Geocoding is expensive. If a record has already been enriched with precise coordinates (e.g. from the geocoding service), a later message that arrived without location data will not wipe those coordinates.
Conversely, if a record has no coordinates and a new message provides them, the coordinates are updated.
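Both directions of the location rule can be sketched in Go as follows. The `Location` struct, its field types, and the `mergeLocation` helper are hypothetical; a nil pointer stands in for SQL NULL on geom.

```go
package main

import "fmt"

// Location is a hypothetical struct for this sketch; the field names and
// types are assumptions, not Mill's actual model.
type Location struct {
	Latitude  float64
	Longitude float64
	S2CellID  uint64
	Geom      *string // nil stands in for SQL NULL; holds WKT text here
}

// mergeLocation applies the rule field by field: non-zero (or non-nil)
// incoming values win, everything else keeps the existing value.
func mergeLocation(incoming, existing Location) Location {
	out := existing
	if incoming.Latitude != 0 {
		out.Latitude = incoming.Latitude
	}
	if incoming.Longitude != 0 {
		out.Longitude = incoming.Longitude
	}
	if incoming.S2CellID != 0 {
		out.S2CellID = incoming.S2CellID
	}
	if incoming.Geom != nil {
		out.Geom = incoming.Geom
	}
	return out
}

func main() {
	wkt := "POINT(174.7645 -36.8509)"
	// Illustrative values; the S2 cell ID is a placeholder, not a real cell.
	geocoded := Location{Latitude: -36.8509, Longitude: 174.7645, S2CellID: 42, Geom: &wkt}

	// A later message with no location data leaves the geocoded values intact.
	kept := mergeLocation(Location{}, geocoded)
	fmt.Println(kept.Latitude, kept.Longitude, *kept.Geom)

	// A record with no coordinates picks them up from a new message.
	filled := mergeLocation(geocoded, Location{})
	fmt.Println(filled.Latitude, filled.Longitude, *filled.Geom)
}
```

Note that each field is merged independently, so a message carrying only some of the location fields updates just those fields.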
listing_date — incoming wins if after year 1900

    listing_date = incoming > '1900-01-01' ? incoming : existing

Go’s zero value for time.Time is 0001-01-01 00:00:00 UTC. Without this guard, a connector that doesn’t populate listing_date would overwrite a real listing date with the Go zero time.
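The guard can be modelled in Go with a fixed cutoff. The `mergeListingDate` and `day` helpers and the `cutoff` variable are names invented for this sketch; the real comparison happens in SQL.

```go
package main

import (
	"fmt"
	"time"
)

// cutoff is the guard date from the rule above.
var cutoff = time.Date(1900, time.January, 1, 0, 0, 0, 0, time.UTC)

// day builds a UTC midnight timestamp; a small helper for the examples.
func day(year, month, dayOfMonth int) time.Time {
	return time.Date(year, time.Month(month), dayOfMonth, 0, 0, 0, 0, time.UTC)
}

// mergeListingDate keeps the existing date unless the incoming one falls
// after the cutoff, which filters out Go's zero time.
func mergeListingDate(incoming, existing time.Time) time.Time {
	if incoming.After(cutoff) {
		return incoming
	}
	return existing
}

func main() {
	existing := day(2024, 3, 15)

	var unset time.Time // Go zero value: 0001-01-01 00:00:00 UTC
	fmt.Println(mergeListingDate(unset, existing).Format("2006-01-02"))

	relisted := day(2024, 6, 1)
	fmt.Println(mergeListingDate(relisted, existing).Format("2006-01-02"))
}
```

The first call prints the existing date because the unset value is before 1900; the second prints the new listing date.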
Summary table
| Field group | Rule |
|---|---|
| id, created_at | Always preserve existing |
| updated_at | Always take incoming |
| status | Incoming if non-empty, else existing |
| All other string fields | Incoming if non-empty, else existing |
| All numeric detail & price fields | Incoming if non-zero, else existing |
| data_quality_score | GREATEST(incoming, existing) |
| latitude, longitude, s2_cell_id | Incoming if non-zero, else existing |
| geom | Incoming if non-null, else existing |
| listing_date | Incoming if after 1900-01-01, else existing |
Effect on property images
Property images are stored in a separate property_images table that references properties.id via a foreign key. Because id is always preserved, images are never orphaned by a subsequent upsert from a different source. This was a secondary bug fixed by the same change: previously, attempting to change a row’s id would trigger a FK violation from the images table.
Implementation reference
The merge is implemented as a single PostgreSQL INSERT ... ON CONFLICT (address_hash) DO UPDATE SET statement in:

    mill/internal/database/postgres.go → CreatePropertyComprehensive()

The integration tests covering every merge rule live in:

    mill/internal/integration_test/property_merge_on_upsert_test.go