
Property Merge Strategy

When a connector submits a property that already exists in the database (identified by its address_hash), Mill does not blindly overwrite the existing record. Instead it applies a field-level merge strategy that preserves enriched or higher-quality data while still accepting fresh updates from the source.

This page explains the rules, the reasoning behind each one, and the edge cases to be aware of.

Multiple connectors can submit data for the same physical address. For example, the same property might appear in both homes.co.nz and a direct MLS feed. Each source assigns its own internal ID and may carry only partial information.

A naive upsert (overwrite everything) creates two problems:

  1. Primary key collisions: if source B assigns a property an ID that is already the id of a different existing property, overwriting id would violate the primary key constraint.
  2. Data regression: geocoding enriches a record with precise coordinates. A later connector message with no location data would wipe those coordinates, degrading quality.

The merge strategy solves both problems by choosing the best available value for every field on each update.

Properties are deduplicated by address_hash — an MD5 hash derived from the normalised street, city, state, postal code, and country. When two records share the same hash, they are considered the same physical property, regardless of source ID.
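As a rough illustration of how such a hash could be derived, here is a minimal Go sketch. The Address type, the normalise helper, the lower-casing and whitespace-collapsing rules, and the "|" separator are all assumptions made for this example; the actual normalisation used by Mill is not described on this page.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// Address holds the five components the hash is derived from.
// The type itself is hypothetical.
type Address struct {
	Street, City, State, PostalCode, Country string
}

// normalise lower-cases a component and collapses runs of whitespace.
// This particular normalisation is an assumption for illustration only.
func normalise(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// addressHash returns the hex-encoded MD5 of the normalised components,
// joined with "|" (the separator is also an assumption).
func addressHash(a Address) string {
	parts := []string{
		normalise(a.Street), normalise(a.City), normalise(a.State),
		normalise(a.PostalCode), normalise(a.Country),
	}
	sum := md5.Sum([]byte(strings.Join(parts, "|")))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := Address{"12 Queen Street", "Auckland", "Auckland", "1010", "NZ"}
	b := Address{"  12  queen street ", "AUCKLAND", "auckland", "1010", "nz"}
	// Both addresses normalise to the same input, so the hashes match.
	fmt.Println(addressHash(a) == addressHash(b)) // prints true
}
```

Whatever the real normalisation looks like, the important property is the one shown here: two submissions that differ only in casing or spacing map to the same hash and are therefore merged rather than inserted twice.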

Fields that always keep the existing value

| Field | Reason |
| --- | --- |
| id | Primary key. Never changed after initial insert. Prevents PK collisions when a different source assigns a new ID to the same address. |
| created_at | Records when this property first appeared in the system. Always reflects the original insertion time. |

Fields that always take the incoming value

| Field | Reason |
| --- | --- |
| updated_at | Always reflects the timestamp of the most recent message, so consumers can detect recently changed records. |
| status | Listing status changes (e.g. active → sold) must propagate immediately. The one exception is an empty string from the connector, which falls back to the existing value. |

String fields — incoming wins if non-empty


For all text fields, the rule is:

new_value = incoming != '' ? incoming : existing

Fields covered: title, description, mls_number, property_name, property_subtype, currency, property_type, data_source, address_street, address_unit, address_city, address_state, address_postal_code, address_country, address_full, source_url.

This means:

  • A connector that sends a richer description will update the record.
  • A connector that sends an empty string (because the field was missing in its feed) will not wipe an existing value.
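The string rule can be sketched in Go as follows. The function name mergeString is hypothetical; the real merge runs inside the SQL statement, so this only mirrors the logic for illustration.

```go
package main

import "fmt"

// mergeString returns incoming unless it is empty, in which case the
// existing value is kept. The name is hypothetical.
func mergeString(incoming, existing string) string {
	if incoming != "" {
		return incoming
	}
	return existing
}

func main() {
	// A richer description replaces the stored one.
	fmt.Println(mergeString("Sunny 3-bed villa with garden", "3-bed villa"))
	// An empty incoming value leaves the stored one untouched.
	fmt.Println(mergeString("", "3-bed villa"))
}
```

Note that the rule as stated compares against the empty string only, so a whitespace-only value would count as non-empty and replace the existing value.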

Numeric fields — incoming wins if non-zero


For all numeric detail and price fields, the rule is:

new_value = incoming != 0 ? incoming : existing

Fields covered: price, price_current, bedrooms_total, bathrooms_total, square_meters_total, square_meters_living, lot_size_sqm, year_built, parking_total_spaces, levels_total.

This protects against connectors that omit optional fields (Go’s zero value for int/float64 is 0), which would otherwise overwrite real data.
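A minimal Go sketch of the numeric rule, using a generic helper so the same function covers integer and floating-point fields. The name mergeNumber and the choice of type constraint are assumptions for this example.

```go
package main

import "fmt"

// mergeNumber returns incoming unless it is zero, in which case the
// existing value is kept. The name and constraint are hypothetical.
func mergeNumber[T int | int64 | float64](incoming, existing T) T {
	if incoming != 0 {
		return incoming
	}
	return existing
}

func main() {
	// bedrooms_total omitted by the connector (Go zero value): keep 3.
	fmt.Println(mergeNumber(0, 3))
	// bedrooms_total supplied: take the new value.
	fmt.Println(mergeNumber(4, 3))
	// square_meters_total omitted: keep 120.5.
	fmt.Println(mergeNumber(0.0, 120.5))
}
```

One consequence of this rule is that a genuine zero cannot replace a stored non-zero value. For example, a source reporting 0 parking spaces would not overwrite an existing 2, because zero is indistinguishable from "not provided".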

data_quality_score = GREATEST(incoming, existing)

The score can only increase over time. Once a high-quality source has enriched a record, a lower-quality source cannot degrade it.
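The monotonic behaviour is easy to see in a short Go sketch. The function name and the use of float64 for the score are assumptions; this page does not state the column type.

```go
package main

import "fmt"

// mergeQualityScore mirrors GREATEST(incoming, existing) for two
// non-null values. The float64 type is an assumption.
func mergeQualityScore(incoming, existing float64) float64 {
	if incoming > existing {
		return incoming
	}
	return existing
}

func main() {
	score := 0.4
	// Apply three updates in sequence; the stored score never drops.
	for _, incoming := range []float64{0.9, 0.3, 0.7} {
		score = mergeQualityScore(incoming, score)
		fmt.Println(score) // prints 0.9 on every iteration
	}
}
```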

latitude/longitude/s2_cell_id = incoming != 0 ? incoming : existing
geom = incoming IS NOT NULL ? incoming : existing

Geocoding is expensive. If a record has already been enriched with precise coordinates (e.g. from the geocoding service), a later message that arrives without location data will not wipe those coordinates.

Conversely, if a record has no coordinates and a new message provides them, the coordinates are updated.
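Both directions of the location rule can be sketched in Go. The Location type, its field types, and the use of a nil pointer to stand in for SQL NULL are assumptions for illustration; the rule is applied to each field independently, as written above.

```go
package main

import "fmt"

// Location groups the location columns. The field types are assumptions.
type Location struct {
	Latitude  float64
	Longitude float64
	S2CellID  uint64
	Geom      *string // nil stands in for SQL NULL; the value is WKT text here
}

// mergeLocation applies the per-field rule: numeric fields take the
// incoming value when non-zero, geom takes it when non-null.
func mergeLocation(incoming, existing Location) Location {
	merged := existing
	if incoming.Latitude != 0 {
		merged.Latitude = incoming.Latitude
	}
	if incoming.Longitude != 0 {
		merged.Longitude = incoming.Longitude
	}
	if incoming.S2CellID != 0 {
		merged.S2CellID = incoming.S2CellID
	}
	if incoming.Geom != nil {
		merged.Geom = incoming.Geom
	}
	return merged
}

func main() {
	wkt := "POINT(174.7645 -36.8509)"
	geocoded := Location{
		Latitude:  -36.8509,
		Longitude: 174.7645,
		S2CellID:  42, // placeholder, not a real S2 cell ID
		Geom:      &wkt,
	}
	// A later message without location data keeps the geocoded values.
	merged := mergeLocation(Location{}, geocoded)
	fmt.Println(merged.Latitude, merged.Longitude, *merged.Geom)
}
```

As with the other numeric fields, a coordinate of exactly 0 is treated as missing, so a point lying exactly on the equator or the prime meridian could not overwrite an existing value under this rule.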

listing_date — incoming wins if after year 1900

listing_date = incoming > '1900-01-01' ? incoming : existing

Go’s zero value for time.Time is 0001-01-01 00:00:00 UTC. Without this guard, a connector that doesn’t populate listing_date would overwrite a real listing date with the Go zero time.
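The guard can be reproduced in Go to show why the zero time never wins. The names mergeListingDate, minListingDate, and day are hypothetical helpers for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// minListingDate is the cutoff used by the rule above.
var minListingDate = time.Date(1900, time.January, 1, 0, 0, 0, 0, time.UTC)

// mergeListingDate returns incoming only if it is after the cutoff;
// otherwise the existing value is kept.
func mergeListingDate(incoming, existing time.Time) time.Time {
	if incoming.After(minListingDate) {
		return incoming
	}
	return existing
}

// day builds a UTC midnight timestamp; a convenience for the example.
func day(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	existing := day(2024, time.March, 15)
	var unset time.Time // Go zero value: 0001-01-01 00:00:00 UTC

	// The unset date is before 1900, so the existing date survives.
	fmt.Println(mergeListingDate(unset, existing).Equal(existing)) // prints true
	// A real, later date replaces it.
	fmt.Println(mergeListingDate(day(2024, time.June, 1), existing).Format("2006-01-02"))
}
```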

| Field group | Rule |
| --- | --- |
| id, created_at | Always preserve existing |
| updated_at | Always take incoming |
| status | Incoming if non-empty, else existing |
| All other string fields | Incoming if non-empty, else existing |
| All numeric detail & price fields | Incoming if non-zero, else existing |
| data_quality_score | GREATEST(incoming, existing) |
| latitude, longitude, s2_cell_id | Incoming if non-zero, else existing |
| geom | Incoming if non-null, else existing |
| listing_date | Incoming if after 1900-01-01, else existing |

Property images are stored in a separate property_images table that references properties.id via a foreign key. Because id is always preserved, images are never orphaned by a subsequent upsert from a different source. This was a secondary bug fixed by the same change — previously, attempting to change a row’s id would trigger a FK violation from the images table.

The merge is implemented as a single PostgreSQL INSERT ... ON CONFLICT (address_hash) DO UPDATE SET statement in:

mill/internal/database/postgres.go → CreatePropertyComprehensive()
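To give a feel for how these rules can be expressed in an ON CONFLICT clause, the Go sketch below assembles a shortened SET list for a handful of columns. It is not the statement from CreatePropertyComprehensive(): the helper names, the COALESCE/NULLIF formulation, and the column selection are assumptions chosen to match the rules on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// table is the target table; EXCLUDED refers to the incoming row.
const table = "properties"

// keepIfEmpty builds "incoming if non-empty, else existing" for a text column.
func keepIfEmpty(col string) string {
	return fmt.Sprintf("%s = COALESCE(NULLIF(EXCLUDED.%s, ''), %s.%s)", col, col, table, col)
}

// keepIfZero builds "incoming if non-zero, else existing" for a numeric column.
func keepIfZero(col string) string {
	return fmt.Sprintf("%s = COALESCE(NULLIF(EXCLUDED.%s, 0), %s.%s)", col, col, table, col)
}

// upsertSetClause returns an illustrative conflict clause. id and created_at
// are deliberately absent from the SET list, so they are never changed.
func upsertSetClause() string {
	clauses := []string{
		"updated_at = EXCLUDED.updated_at",
		keepIfEmpty("status"),
		keepIfEmpty("title"),
		keepIfZero("price"),
		"data_quality_score = GREATEST(EXCLUDED.data_quality_score, properties.data_quality_score)",
		"geom = COALESCE(EXCLUDED.geom, properties.geom)",
		"listing_date = CASE WHEN EXCLUDED.listing_date > '1900-01-01' " +
			"THEN EXCLUDED.listing_date ELSE properties.listing_date END",
	}
	return "ON CONFLICT (address_hash) DO UPDATE SET\n  " + strings.Join(clauses, ",\n  ")
}

func main() {
	fmt.Println(upsertSetClause())
}
```

In this formulation a NULL incoming value also falls back to the existing value, which is slightly broader than the "non-empty" and "non-zero" wording of the rules.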

The integration tests covering every merge rule live in:

mill/internal/integration_test/property_merge_on_upsert_test.go