Working with CHECKSUM Function in SQL Server

SQL Server offers many powerful built-in functions and one of them is CHECKSUM function. CHECKSUM function is a great feature and can be used for data validation by writing simple T-SQL scripts.
Last day one of the developers who was working on data validation, approached me saying the CHECKSUM function is not giving expected results and is producing different checksum value even though the data that is being compared is same. After looking into it, I figured out that the problem was with the data types being used to store the data which was being compared.
Let me explain it with an example as shown below.
Create a source table Employee to hold the employee related information.
CREATE TABLE Employee (
EmployeeId INT NOT NULL,

EmployeeName NVARCHAR(255) NOT NULL,
ExpertiseId INT NOT NULL
)
GO

Load some data into Employee table.
INSERT INTO Employee (EmployeeId, EmployeeName, ExpertiseId)
SELECT 1 AS EmployeeId, ‘Keaton Hyde’ AS EmployeeName, 2 AS ExpertiseId

UNION ALL
SELECT 2 AS EmployeeId, ‘Jesse Marks’ AS EmployeeName, 1 AS ExpertiseId
UNION ALL
SELECT 3 AS EmployeeId, ‘Barrett Castaneda’ AS EmployeeName, 2 AS ExpertiseId
UNION ALL
SELECT 4 AS EmployeeId, ‘Oren Hunter’ AS EmployeeName, 4 AS ExpertiseId
UNION ALL
SELECT 5 AS EmployeeId, ‘Eric Cooper’ AS EmployeeName, 2 AS ExpertiseId
UNION ALL
SELECT 6 AS EmployeeId, ‘Yoshio Battle’ AS EmployeeName, 2 AS ExpertiseId
UNION ALL
SELECT 7 AS EmployeeId, ‘Phelan Stein’ AS EmployeeName, 1 AS ExpertiseId
UNION ALL
SELECT 8 AS EmployeeId, ‘Jack Gallagher’ AS EmployeeName, 1 AS ExpertiseId
UNION ALL
SELECT 9 AS EmployeeId, ‘Jin Lowery’ AS EmployeeName, 1 AS ExpertiseId
UNION ALL
SELECT 10 AS EmployeeId, ‘Vincent Fulton’ AS EmployeeName, 1 AS ExpertiseId
GO

Create another source table Expertise to hold the expertise related information and load data.

CREATE TABLE Expertise (
ExpertiseId INT NOT NULL,
ExpertiseDesc NVARCHAR(255) NOT NULL
)
GO

INSERT INTO Expertise (ExpertiseId, ExpertiseDesc)
SELECT 1 AS ExpertiseId, ‘SQL Server – Beginner’
UNION ALL
SELECT 2 AS ExpertiseId, ‘SQL Server – Intermediate’
UNION ALL
SELECT 3 AS ExpertiseId, ‘SQL Server – Expert’
UNION ALL
SELECT 4 AS ExpertiseId, ‘SQL Server – Guru’
GO

Data is usually stored into dimension tables from the staging tables using an ETL tool (SSIS in case of Microsoft BI).

Now let us create the target table DimEmployee and load some data. This process is usually done using an ETL process, but for demo purposes we will load it using T-SQL to keep it simple.

CREATE TABLE DimEmployee(
EmployeeId INT NOT NULL,
EmployeeName NVARCHAR(255) NOT NULL,
ExpertiseDesc NVARCHAR(255) NOT NULL
)
GO
INSERT INTO DimEmployee(EmployeeId, EmployeeName, ExpertiseDesc)
SELECT 1 AS EmployeeId, ‘Keaton Hyde’ AS EmployeeName, ‘SQL Server – Intermediate’ AS ExpertiseDesc
UNION ALL
SELECT 2 AS EmployeeId, ‘Jesse Marks’ AS EmployeeName, ‘SQL Server – Beginner’ AS ExpertiseDesc
UNION ALL
SELECT 3 AS EmployeeId, ‘Barrett Castaneda’ AS EmployeeName, ‘SQL Server – Intermediate’ AS ExpertiseDesc
UNION ALL
SELECT 4 AS EmployeeId, ‘Oren Hunter’ AS EmployeeName, ‘SQL Server – Guru’ AS ExpertiseDesc
UNION ALL
SELECT 5 AS EmployeeId, ‘Eric Cooper’ AS EmployeeName, ‘SQL Server – Intermediate’ AS ExpertiseDesc
UNION ALL
SELECT 6 AS EmployeeId, ‘Yoshio Battle’ AS EmployeeName, ‘SQL Server – Intermediate’ AS ExpertiseDesc
UNION ALL
SELECT 7 AS EmployeeId, ‘Phelan Stein’ AS EmployeeName, ‘SQL Server – Beginner’ AS ExpertiseDesc
UNION ALL
SELECT 8 AS EmployeeId, ‘Jack Gallagher’ AS EmployeeName, ‘SQL Server – Beginner’ AS ExpertiseDesc
UNION ALL
SELECT 9 AS EmployeeId, ‘Jin Lowery’ AS EmployeeName, ‘SQL Server – Beginner’ AS ExpertiseDesc
UNION ALL
SELECT 10 AS EmployeeId, ‘Vincent Fulton’ AS EmployeeName, ‘SQL Server – Expert’ AS ExpertiseDesc
GO
Now that the data is loaded into the target table, we need to compare the data between source/staging tables and target/dimension tables to make sure the data in the target is in sync with the source.
Use the following query to test the data between source and target tables.
WITH SourceCTE (EmployeeId, EmployeeName, ExpertiseDesc) AS(
SELECT
Emp.EmployeeId
, Emp.EmployeeName
, CASE Emp.ExpertiseId
WHEN 1 THEN ‘SQL Server – Beginner’
WHEN 2 THEN ‘SQL Server – Intermediate’
WHEN 3 THEN ‘SQL Server – Expert’
WHEN 4 THEN ‘SQL Server – Guru’
END AS ExpertiseDesc
FROM
[dbo].[Employee] AS [Emp]
LEFT JOIN [dbo].[Expertise] AS [Exp] ON [Emp].[ExpertiseId] = [Exp].[ExpertiseId]
)
, TargetCTE (EmployeeId, EmployeeName, ExpertiseDesc) AS (
SELECT EmployeeId, EmployeeName, ExpertiseDesc
FROM DimEmployee
)

SELECT
SCTE.EmployeeId
, SCTE.EmployeeName
, SCTE.ExpertiseDesc AS SourceExpertiseDesc
, CHECKSUM(SCTE.EmployeeId, SCTE.ExpertiseDesc) AS SourceExpertiseChecksum
, TCTE.ExpertiseDesc AS TargetExpertiseDesc
, CHECKSUM(SCTE.EmployeeId, TCTE.ExpertiseDesc) AS TargetExpertiseChecksum
FROM
SourceCTE SCTE
INNER JOIN TargetCTE TCTE ON SCTE.EmployeeId = TCTE.EmployeeId

As we can see that, we are deriving the ExpertiseDesc from Expertise table using a CASE statement in the SourceCTE. When we run the above query, we get the following output.

Click to view full size image

Notice that in the above query output, even though the ExpertiseDesc matches between source (SourceExpertiseDesc) and target (TargetExpertiseDesc) for EmployeeId from 1 through 9, the corresponding checksum between source (SourceExpertiseChecksum) and target (TargetExpertiseChecksum) does not match. This is quite surprising right ? Well, this is the usual behaviour of CHECKSUM function. Let us take a closer look and understand this behaviour.

Let us modify the CASE statement slightly in the above query as shown below. Let us run this query and check out the results.

WITH SourceCTE (EmployeeId, EmployeeName, ExpertiseDesc) AS(

SELECT
Emp.EmployeeId
, Emp.EmployeeName
, CASE Emp.ExpertiseId
WHEN 1 THEN N‘SQL Server – Beginner’
WHEN 2 THEN N‘SQL Server – Intermediate’
WHEN 3 THEN N‘SQL Server – Expert’
WHEN 4 THEN N‘SQL Server – Guru’
END AS ExpertiseDesc
FROM
[dbo].[Employee] AS [Emp]
LEFT JOIN [dbo].[Expertise] AS [Exp] ON [Emp].[ExpertiseId] = [Exp].[ExpertiseId]
)
, TargetCTE (EmployeeId, EmployeeName, ExpertiseDesc) AS (
SELECT EmployeeId, EmployeeName, ExpertiseDesc
FROM DimEmployee
)

SELECT
SCTE.EmployeeId
, SCTE.EmployeeName
, SCTE.ExpertiseDesc AS SourceExpertiseDesc
, CHECKSUM(SCTE.EmployeeId, SCTE.ExpertiseDesc) AS SourceExpertiseChecksum
, TCTE.ExpertiseDesc AS TargetExpertiseDesc
, CHECKSUM(SCTE.EmployeeId, TCTE.ExpertiseDesc) AS TargetExpertiseChecksum
FROM
SourceCTE SCTE
INNER JOIN TargetCTE TCTE ON SCTE.EmployeeId = TCTE.EmployeeId
Notice that, I have added N to the strings in the CASE statement to make it as NVARCHAR/NCHAR instead of VARCHAR/CHAR.

Click to view full size image

Now, as we can see that the same data yields same checksum value provided the data type of the two strings/data being compared is same.

Add the following WHERE clause to the above query to get only those records where there is a mismatch between source and target tables.

CHECKSUM(SCTE.EmployeeId,SCTE.ExpertiseDesc) <> CHECKSUM(SCTE.EmployeeId,TCTE.ExpertiseDesc)

Here is the output that you would get when you run the above query (modified query with N added to the string) with the WHERE clause shown above.

Click to view full size image

Following are few key points to note while using CHECKSUM function:

  • CHECKSUM function calculates a hash value for each unique combination of columns/arguments passed to the function. This hash value is called checksum value.
  • The return type of CHECKSUM function is integer (both positive and negative values).
  • The checksum generated by the CHECKSUM function depends on the order of columns specified in the function.
  • CHECKSUM function takes both data and the data type into account while generating the checksum value. If the data being compared is same but the data type is different then CHECKSUM function returns different values for each of these as demonstrated above.
  • NULL values of same data type are treated as equal by CHECKSUM function.
  • For less number of records, CHECKSUM is unique based on my experience. However, CHECKSUM is not guaranteed to be unique. While dealing with large amount of data consider using HASHBYTES instead.

Let me know if you like this article by leaving a comment below.

Reference: SQL Server Books Online (BOL)

Advertisements

About Dattatrey Sindol (Datta)

Datta is a Microsoft BI Enthusiast, passionate developer, and a blogger. View Full Profile

Posted on February 27, 2011, in SQL Server, T-SQL. Bookmark the permalink. 2 Comments.

  1. I had a similar issue and after going through this article, i modified my checksum query and it worked like charm…This saved me a lot of time… ThanksJohn

  2. Nice article Datta…

What are your thoughts?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s