Data Profiling Using T-SQL: Part 1 – Column Null Ratio

Data Profiling/Data Analysis is one of the important phases of any database/data warehouse projects and helps us understand the data better so that we can make better design decisions.
I will be writing a series of blog posts in the coming days on data profiling in SQL Server. This series of blog posts is inspired by the SSIS Data Profiling Task which was introduced as part of SQL Server 2008 Integration Services.
In this blog post, I will be showing you a simple T-SQL script to capture the Column Null Ratio/Column Null Percentage.
Below is the T-SQL script which takes Database Name, Schema Name and Column Name as inputs and produces the Column Null Ratio/Null Percentage as output.
— Provide the Schema Name and the Table/View Name
DECLARE @DatabaseName NVARCHAR(255) = N’AdventureWorksDW’

DECLARE @SchemaName NVARCHAR(255) = N’dbo’
DECLARE @TableName NVARCHAR(255) = N’DimProduct’

BEGIN

SET NOCOUNT ON

— Declare the parameters internal to query
DECLARE @SQLString NVARCHAR(MAX) = N”
DECLARE @ParamDefinition NVARCHAR(MAX) = N”

DECLARE @ColumnList TABLE (
ColumnId INT IDENTITY(1,1)
, ColumnName NVARCHAR(255)
, ColumnNullCount INT
, ColumnNullPercentage NUMERIC(10,4)
)

DECLARE @TableRecordCount INT
DECLARE @ColumnNullCount INT
DECLARE @ColumnCount INT = 0

DECLARE @LoopCounter INT = 1
DECLARE @ColumnName NVARCHAR(255)

SET @SQLString =
N’SELECT @TableRecordCount = COUNT(*) FROM ‘
+ @DatabaseName + N’.’ + @SchemaName + N’.’ + @TableName
SET @ParamDefinition = N’@TableRecordCount INT OUTPUT’
EXECUTE sp_executesql @SQLString, @ParamDefinition
, @TableRecordCount OUTPUT

SET @SQLString =
N’SELECT COLUMN_NAME FROM ‘
+ @DatabaseName + N’.’ + N’INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_SCHEMA = @SchemaName
AND TABLE_NAME = @TableName’

SET @ParamDefinition = N’@SchemaName NVARCHAR(255), @TableName NVARCHAR(255)’

INSERT INTO @ColumnList (ColumnName)
EXECUTE sp_executesql @SQLString, @ParamDefinition
, @SchemaName, @TableName

SELECT @ColumnCount = COUNT(*) FROM @ColumnList

WHILE (@LoopCounter <= @ColumnCount)
BEGIN

SELECT @ColumnName = ColumnName
FROM @ColumnList
WHERE ColumnId = @LoopCounter

SET @SQLString =
N’SELECT @ColumnNullCount = COUNT(*) FROM ‘
+ @DatabaseName + N’.’ + @SchemaName + N’.’ + @TableName
+ N’ WHERE ‘ + @ColumnName + N’ IS NULL’

SET @ParamDefinition = N’@ColumnNullCount INT OUTPUT’

EXECUTE sp_executesql @SQLString, @ParamDefinition
, @ColumnNullCount OUTPUT

UPDATE @ColumnList
SET
ColumnNullCount = @ColumnNullCount
, ColumnNullPercentage = @ColumnNullCount * 100.0/@TableRecordCount
WHERE ColumnId = @LoopCounter

SET @LoopCounter += 1

END

SELECT
ColumnName AS [Column Name]
, @TableRecordCount AS [Table Record Count]
, ColumnNullCount AS [Column Null Count]
, CAST(ColumnNullPercentage AS NVARCHAR(20)) + N’ %’ AS [Column Null Percentage]
FROM @ColumnList
ORDER BY [Column Name]

END

Below is a sample output using the above T-SQL script as well as from the SSIS Data Profiling Task.
This sample output is taken by passing the Database Name as AdventureWorksDW, Schema Name as dbo and Table Name as DimProduct. AdventureWorksDW sample database can be downloaded from CodePlex.
T-SQL Script Output
SSIS Data Profiling Task Output
Here are some of the highlights of the above T-SQL script:
  • Takes Database Name, Schema Name and Column Name as inputs and produces the Column Null Ratio/Null Percentage as output.
  • With few minor changes, this script can be used in any of the SQL Server versions.
  • This script is a one step process to get the Null Ratio/Null Percentage whereas data profiling using SSIS Data Profiling Task is a two step process – First generate the xml file with the profile and then use Data Profile Viewer to view the generated profile.
  • Only SQL Server Management Studio (SSMS) with necessary connectivity to the required server/database is sufficient to capture the profile whereas in case of SSIS Data Profiling Task, it requires an ADO.Net Connection Manager.
  • Knowledge of T-SQL is sufficient whereas the SSIS Data Profiling Task requires some basic knowledge of SSIS.
Well, it’s ultimately a matter of choice on whether T-SQL script like above should be used or SSIS Data Profiling Task. But, I personally prefer doing things using T-SQL πŸ™‚
Note: This demonstration is based on SQL Server 2008 R2.
Advertisements

About Dattatrey Sindol (Datta)

Datta is a Microsoft BI Enthusiast, passionate developer, and a blogger. View Full Profile

Posted on March 6, 2011, in Code Snippets, SQL Data Profiling, SQL Server, T-SQL, T-SQL Snippets. Bookmark the permalink. 1 Comment.

What are your thoughts?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s