Data integration challenges have plagued analytics teams for years. Different tables with similar data often have columns in different orders, varying column names, or missing fields entirely. Snowflake's "Union by Name" feature addresses these pain points by intelligently matching columns based on their names rather than their positions, making data consolidation more intuitive and robust.
What is Union by Name?
Union by Name is a powerful enhancement to Snowflake's traditional UNION operation that matches columns by their names instead of their ordinal positions. This feature eliminates the need for complex column reordering and explicit NULL handling when combining datasets with different structures.
Traditional UNION vs. Union by Name
Traditional UNION requires:
- Exact column count match
- Identical column order
- Manual NULL handling for missing columns
Union by Name provides:
- Automatic column matching by name
- Flexible column ordering
- Intelligent handling of missing columns
Syntax and Basic Usage
Basic Syntax
-- Basic Union by Name syntax
SELECT * FROM table1
UNION [ALL] BY NAME
SELECT * FROM table2;
-- With explicit column selection
SELECT col1, col2, col3 FROM table1
UNION BY NAME
SELECT col1, col2, col3 FROM table2;
Simple Example
-- Table A structure: id, name, email, created_date
-- Table B structure: email, id, last_login, name
-- Traditional approach (would fail or require restructuring)
SELECT id, name, email, NULL as last_login, created_date, NULL as last_login
FROM table_a
UNION ALL
SELECT id, name, email, last_login, NULL as created_date
FROM table_b;
-- Union by Name approach (automatic matching)
SELECT *
FROM table_a
UNION ALL BY NAME
SELECT *
FROM table_b;
Advanced Features and Capabilities
1. Handling Missing Columns
Union by Name automatically handles missing columns by filling them with NULL values.
-- Customer data from different sources
-- Source 1: customer_id, first_name, last_name, email
-- Source 2: customer_id, full_name, phone, email, registration_date
SELECT
customer_id,
first_name,
last_name,
email,
NULL as phone,
NULL as registration_date,
'source_1' as data_source
FROM customer_source_1
UNION ALL BY NAME
SELECT
customer_id,
NULL as first_name,
NULL as last_name,
full_name,
phone,
email,
registration_date,
'source_2' as data_source
FROM customer_source_2;
2. Column Type Compatibility
Union by Name handles compatible data types automatically and provides clear error messages for incompatible types.
-- Compatible types example
WITH sales_2023 AS (
SELECT
order_id::VARCHAR as order_id,
amount::DECIMAL(10,2) as amount,
order_date::DATE as order_date
FROM sales_2023_raw
),
sales_2024 AS (
SELECT
order_id::STRING as order_id,
amount::NUMBER(12,2) as amount,
order_date::TIMESTAMP as order_date
FROM sales_2024_raw
)
SELECT * FROM sales_2023
UNION BY NAME
SELECT * FROM sales_2024;
3. Working with Complex Data Types
-- Handling VARIANT and ARRAY columns
SELECT
product_id,
product_name,
attributes::VARIANT as product_attributes,
tags::ARRAY as product_tags
FROM product_catalog_v1
UNION ALL BY NAME
SELECT
product_id,
product_name,
specifications::VARIANT as product_attributes,
categories::ARRAY as product_tags
FROM product_catalog_v2;
Real-World Use Cases
1. Multi-Source Data Integration
-- Consolidating customer data from multiple CRM systems
CREATE OR REPLACE VIEW unified_customers AS
SELECT
customer_id,
first_name,
last_name,
email,
phone,
address,
city,
state,
country,
registration_date,
'salesforce' as source_system
FROM salesforce_customers
UNION ALL BY NAME
SELECT
customer_id,
first_name,
last_name,
email_address as email,
phone_number as phone,
street_address as address,
city,
state_province as state,
country,
created_date as registration_date,
'hubspot' as source_system
FROM hubspot_customers
UNION ALL BY NAME
SELECT
id as customer_id,
fname as first_name,
lname as last_name,
email,
phone,
address_line_1 as address,
city,
state,
country,
signup_date as registration_date,
'internal_db' as source_system
FROM internal_customers;
2. Historical Data Migration
-- Combining historical data with different schema versions
-- Schema V1: id, name, email, created_at
-- Schema V2: id, first_name, last_name, email, phone, created_at
-- Schema V3: user_id, first_name, last_name, email, phone, address, created_at
CREATE OR REPLACE TABLE complete_user_history AS
SELECT
id,
name as first_name,
NULL as last_name,
email,
NULL as phone,
NULL as address,
created_at,
'v1' as schema_version
FROM users_v1_archive
UNION ALL BY NAME
SELECT
id,
first_name,
last_name,
email,
phone,
NULL as address,
created_at,
'v2' as schema_version
FROM users_v2_archive
UNION ALL BY NAME
SELECT
user_id as id,
first_name,
last_name,
email,
phone,
address,
created_at,
'v3' as schema_version
FROM users_v3_current;
3. A/B Testing Data Consolidation
-- Consolidating A/B test results from different experiments
WITH experiment_a AS (
SELECT
user_id,
experiment_name,
variant,
conversion_rate,
revenue,
session_duration,
'experiment_a' as test_type
FROM ab_test_results_a
),
experiment_b AS (
SELECT
user_id,
experiment_name,
variant,
click_through_rate as conversion_rate,
total_revenue as revenue,
avg_session_time as session_duration,
bounce_rate,
'experiment_b' as test_type
FROM ab_test_results_b
)
SELECT * FROM experiment_a
UNION ALL BY NAME
SELECT * FROM experiment_b;
Performance Optimization Strategies
1. Column Pruning
-- Optimize performance by selecting only needed columns
SELECT
customer_id,
email,
registration_date,
source_system
FROM (
SELECT customer_id, email, registration_date, 'crm_a' as source_system
FROM crm_system_a
UNION ALL BY NAME
SELECT customer_id, email, signup_date as registration_date, 'crm_b' as source_system
FROM crm_system_b
);
2. Efficient Data Type Handling
-- Ensure consistent data types for better performance
CREATE OR REPLACE VIEW normalized_sales AS
SELECT
order_id::VARCHAR(50) as order_id,
customer_id::NUMBER(38,0) as customer_id,
amount::DECIMAL(15,2) as amount,
order_date::DATE as order_date,
'online' as channel
FROM online_sales
UNION ALL BY NAME
SELECT
order_id::VARCHAR(50) as order_id,
customer_id::NUMBER(38,0) as customer_id,
amount::DECIMAL(15,2) as amount,
order_date::DATE as order_date,
'retail' as channel
FROM retail_sales;
3. Partitioning Strategy
-- Leverage clustering for better performance
CREATE OR REPLACE TABLE unified_transactions (
transaction_id VARCHAR(100),
customer_id NUMBER(38,0),
amount DECIMAL(15,2),
transaction_date DATE,
payment_method VARCHAR(50),
source_system VARCHAR(50)
) CLUSTER BY (transaction_date, source_system);
-- Insert data using Union by Name
INSERT INTO unified_transactions
SELECT * FROM payment_processor_a
UNION ALL BY NAME
SELECT * FROM payment_processor_b
UNION ALL BY NAME
SELECT * FROM payment_processor_c;
Error Handling and Troubleshooting
1. Common Error Scenarios
-- Handling data type mismatches
SELECT
product_id,
TRY_CAST(price AS DECIMAL(10,2)) as price,
product_name
FROM product_source_1
UNION ALL BY NAME
SELECT
product_id,
TRY_CAST(price_string AS DECIMAL(10,2)) as price,
product_name
FROM product_source_2;
2. Validation and Data Quality Checks
-- Data quality validation after Union by Name
CREATE OR REPLACE PROCEDURE validate_union_results()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
duplicate_count INTEGER;
null_count INTEGER;
result_message STRING;
BEGIN
-- Check for duplicates
SELECT COUNT(*) INTO duplicate_count
FROM (
SELECT customer_id, COUNT(*) as cnt
FROM unified_customers
GROUP BY customer_id
HAVING COUNT(*) > 1
);
-- Check for excessive NULLs
SELECT COUNT(*) INTO null_count
FROM unified_customers
WHERE email IS NULL OR customer_id IS NULL;
-- Generate report
result_message := 'Validation Results: ' ||
'Duplicates: ' || duplicate_count || ', ' ||
'Null Records: ' || null_count;
RETURN result_message;
END;
$$;
CALL validate_union_results();
Best Practices and Guidelines
1. Schema Design Considerations
-- Establish consistent naming conventions
CREATE OR REPLACE VIEW standardized_events AS
SELECT
event_id,
user_id,
event_timestamp,
event_type,
event_properties,
'mobile_app' as source
FROM mobile_events
UNION ALL BY NAME
SELECT
event_id,
user_id,
timestamp as event_timestamp,
type as event_type,
properties as event_properties,
'web_app' as source
FROM web_events;
2. Documentation and Monitoring
-- Create metadata tracking for Union by Name operations
CREATE OR REPLACE TABLE union_operations_log (
operation_id STRING,
source_tables ARRAY,
target_table STRING,
column_mapping VARIANT,
execution_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
row_count NUMBER,
status STRING
);
-- Log Union by Name operations
INSERT INTO union_operations_log (
operation_id,
source_tables,
target_table,
column_mapping,
row_count,
status
)
VALUES (
'UNION_CUSTOMERS_001',
['salesforce_customers', 'hubspot_customers'],
'unified_customers',
OBJECT_CONSTRUCT('email_mapping', 'email->email_address'),
(SELECT COUNT(*) FROM unified_customers),
'SUCCESS'
);
3. Testing and Validation Framework
-- Create comprehensive testing framework
CREATE OR REPLACE PROCEDURE test_union_by_name()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
test_results STRING DEFAULT '';
BEGIN
-- Test 1: Column count validation
CREATE OR REPLACE TEMPORARY TABLE test_union_result AS
SELECT * FROM table_a UNION ALL BY NAME SELECT * FROM table_b;
-- Test 2: Data type consistency
SELECT
CASE
WHEN COUNT(DISTINCT TYPEOF(column_name)) = 1 THEN 'PASS'
ELSE 'FAIL'
END as type_consistency_test
FROM test_union_result;
-- Test 3: NULL handling validation
SELECT
COUNT(*) as null_count,
COUNT(*) / (SELECT COUNT(*) FROM test_union_result) as null_percentage
FROM test_union_result
WHERE column_name IS NULL;
RETURN 'Union by Name tests completed successfully';
END;
$$;
Advanced Integration Patterns
1. Dynamic Union Generation
-- Generate Union by Name statements dynamically
CREATE OR REPLACE PROCEDURE generate_dynamic_union(table_pattern STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
union_sql STRING DEFAULT '';
table_cursor CURSOR FOR
SELECT table_name
FROM information_schema.tables
WHERE table_name LIKE table_pattern;
BEGIN
FOR table_rec IN table_cursor DO
IF (union_sql != '') THEN
union_sql := union_sql || ' UNION ALL BY NAME ';
END IF;
union_sql := union_sql || 'SELECT * FROM ' || table_rec.table_name;
END FOR;
RETURN union_sql;
END;
$$;
2. CDC (Change Data Capture) Integration
-- Implement CDC with Union by Name
CREATE OR REPLACE STREAM customer_changes ON TABLE customers;
-- Merge changes using Union by Name
MERGE INTO customer_master cm
USING (
SELECT * FROM customer_changes WHERE metadata$action = 'INSERT'
UNION ALL BY NAME
SELECT * FROM customer_changes WHERE metadata$action = 'UPDATE'
) src
ON cm.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
cm.first_name = src.first_name,
cm.last_name = src.last_name,
cm.email = src.email
WHEN NOT MATCHED THEN INSERT VALUES (
src.customer_id,
src.first_name,
src.last_name,
src.email
);
Migration Strategies
1. Gradual Migration from Traditional UNION
-- Phase 1: Identify existing UNION operations
CREATE OR REPLACE VIEW union_inventory AS
SELECT
query_text,
query_hash,
execution_count,
avg_execution_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%UNION%'
AND query_text NOT ILIKE '%UNION BY NAME%'
AND start_time >= CURRENT_DATE - INTERVAL '30 days';
-- Phase 2: Test Union by Name compatibility
CREATE OR REPLACE PROCEDURE test_union_compatibility()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
-- Test existing UNION queries with BY NAME
-- Log results and performance differences
RETURN 'Compatibility testing completed';
END;
$$;
2. Performance Comparison Framework
-- Compare performance between traditional UNION and Union by Name
CREATE OR REPLACE TABLE performance_comparison (
test_name STRING,
traditional_union_time NUMBER,
union_by_name_time NUMBER,
performance_improvement_pct NUMBER,
test_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
-- Performance testing procedure
CREATE OR REPLACE PROCEDURE compare_union_performance()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
start_time TIMESTAMP;
end_time TIMESTAMP;
traditional_time NUMBER;
by_name_time NUMBER;
BEGIN
-- Test traditional UNION
start_time := CURRENT_TIMESTAMP();
CREATE OR REPLACE TEMPORARY TABLE test_traditional AS
SELECT id, name, email FROM table_a
UNION ALL
SELECT id, name, email FROM table_b;
end_time := CURRENT_TIMESTAMP();
traditional_time := DATEDIFF(millisecond, start_time, end_time);
-- Test Union by Name
start_time := CURRENT_TIMESTAMP();
CREATE OR REPLACE TEMPORARY TABLE test_by_name AS
SELECT * FROM table_a
UNION ALL BY NAME
SELECT * FROM table_b;
end_time := CURRENT_TIMESTAMP();
by_name_time := DATEDIFF(millisecond, start_time, end_time);
-- Log results
INSERT INTO performance_comparison VALUES (
'basic_union_test',
traditional_time,
by_name_time,
((traditional_time - by_name_time) / traditional_time) * 100
);
RETURN 'Performance comparison completed';
END;
$$;
Future Enhancements and Roadmap
Snowflake continues to enhance Union by Name with upcoming features:
1. Enhanced Column Mapping
- Fuzzy matching for similar column names
- Custom mapping rules and transformations
- Automated schema evolution handling
2. Performance Optimizations
- Improved query planning for Union by Name operations
- Better predicate pushdown optimization
- Enhanced parallel processing capabilities
3. Integration Improvements
- Better integration with Snowflake's data sharing features
- Enhanced support for external tables
- Improved handling of semi-structured data
Conclusion
Union by Name represents a significant advancement in Snowflake's data integration capabilities. By eliminating the rigid constraints of traditional UNION operations, it enables more flexible, maintainable, and robust data consolidation workflows.
The feature particularly excels in scenarios involving:
- Multi-source data integration where schema consistency is challenging
- Historical data consolidation across different system versions
- Agile development environments where schema changes are frequent
- Data warehouse modernization projects requiring flexible data movement
As organizations continue to deal with increasingly complex data landscapes, Union by Name provides the flexibility and intelligence needed to streamline data integration processes while maintaining data quality and performance.
The combination of automatic column matching, intelligent NULL handling, and performance optimization makes Union by Name an essential tool for modern data engineering workflows. By adopting this feature, organizations can reduce development time, improve data quality, and build more resilient data pipelines that adapt to changing business requirements.
Whether you're consolidating customer data from multiple CRMs, migrating historical data with evolving schemas, or building flexible ETL pipelines, Union by Name provides the foundation for more intelligent and adaptable data integration strategies.
