Identification Attributes
These attributes uniquely identify properties and track their data sources.
Core Identifiers
Section titled “Core Identifiers”- Type:
VARCHAR(50) - Description: Primary unique identifier for the property
- Example:
"prop-12345" - Required: Yes
- Indexed: Yes (primary key)
property_uuid
Section titled “property_uuid”- Type:
VARCHAR(100) - Description: Universally unique identifier (UUID) for the property
- Example:
"123e4567-e89b-12d3-a456-426614174000" - Required: No
- Indexed: Yes (unique index)
source_id
Section titled “source_id”- Type:
VARCHAR(100) - Description: Identifier from the original source system (harvester)
- Example:
"realestate-com-au-12345" - Required: No
- Use Case: Links property back to original listing source
mls_number
Section titled “mls_number”- Type:
VARCHAR(50) - Description: Multiple Listing Service (MLS) number, if applicable
- Example:
"ML123456" - Required: No
- Note: Only applicable in regions with MLS systems
property_id_alt
Section titled “property_id_alt”- Type:
VARCHAR(50) - Description: Alternative identifier for integrations or external systems
- Example:
"EXT-789" - Required: No
- Use Case: Integration with third-party systems
Deduplication
Section titled “Deduplication”address_hash
Section titled “address_hash”- Type:
VARCHAR(64) - Description: MD5 hash of normalized address for deduplication
- Example:
"5d41402abc4b2a76b9719d911017c592" - Required: Yes
- Indexed: Yes (unique key in Doris)
- Note: Must be first column in table for Doris UNIQUE KEY constraint
Data Quality & Source Tracking
Section titled “Data Quality & Source Tracking”data_source
Section titled “data_source”- Type:
VARCHAR(100) - Description: Name of the harvester or data source that provided this property
- Example:
"realestate-com-au","zillow","manual-entry" - Required: No
- Use Case: Track data provenance and quality by source
data_quality_score
Section titled “data_quality_score”- Type:
DOUBLE - Description: Quality score from 0.0 to 1.0 indicating completeness and accuracy
- Example:
0.85 - Range: 0.0 (poor) to 1.0 (excellent)
- Calculation: Based on field completeness, validation, and consistency
data_confidence_level
Section titled “data_confidence_level”- Type:
VARCHAR(50) - Description: Confidence level in the data accuracy
- Values:
"high","medium","low","verified" - Example:
"high" - Use Case: Filter properties by data reliability
last_verified
Section titled “last_verified”- Type:
DATETIME - Description: Timestamp when property data was last verified or updated from source
- Example:
"2024-01-15T10:30:00Z" - Required: No
- Use Case: Track data freshness
Usage Examples
Section titled “Usage Examples”Query by ID
Section titled “Query by ID”SELECT * FROM properties WHERE id = 'prop-12345';Find Duplicates by Address Hash
Section titled “Find Duplicates by Address Hash”SELECT address_hash, COUNT(*) as countFROM propertiesGROUP BY address_hashHAVING count > 1;Filter by Data Quality
Section titled “Filter by Data Quality”SELECT * FROM propertiesWHERE data_quality_score >= 0.8ORDER BY data_quality_score DESC;Track Properties by Source
Section titled “Track Properties by Source”SELECT data_source, COUNT(*) as count, AVG(data_quality_score) as avg_qualityFROM propertiesGROUP BY data_sourceORDER BY count DESC;Best Practices
Section titled “Best Practices”- Always set
id: Generate a unique ID for every property - Use
address_hashfor deduplication: Hash normalized addresses to identify duplicates - Track
data_source: Always record which harvester provided the data - Maintain
data_quality_score: Calculate and update quality scores during ingestion - Set
last_verified: Update timestamp when refreshing data from source
Related Attributes
Section titled “Related Attributes”- Basic Information - Property type and status
- Address Attributes - Address components used for hashing
- Timestamps - Audit timestamps