Andre's Blog

Personal blog of Andre Perusse

Deleting Large Quantities of Data Using an Exponential Batch Reduction Algorithm

Recently, I was involved with a project that collected a vast quantity of data daily and stored it in a SQL Server database. The data was retained for an extended period, after which it was purged on a regular schedule as old and no longer relevant.

The problem here is that SQL DELETE operations are rather expensive, and the amount of data being deleted was so large that the purging process took far too long to complete, resulting in database command timeouts and unacceptably long table locks. To fix the problem, I borrowed a technique described to me by a colleague: specify the maximum number of rows to delete in any single transaction, and if timeouts still occur, automatically reduce (throttle) that number until the transaction succeeds.

The following code blocks demonstrate a C# method calling a stored procedure to perform this task. The stored procedure accepts a parameter specifying the maximum number of rows to delete and returns a value indicating how many rows were actually deleted. This way, the C# method knows whether more rows still need to be deleted, and the proc is called iteratively (with a sleep interval to allow the database to perform other work) until no rows remain.

Also note that the C# method checks for a SQL Timeout error on each command execution. If it detects a timeout, it automatically reduces the maximum number of rows to delete by half and tries again. It keeps doing this until it reaches a pre-defined lower threshold at which point it throws up its hands in failure.
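To make the throttling schedule concrete, here's a quick sketch (in JavaScript, purely for illustration - throttleSchedule is a made-up helper, not part of the purge code) showing how a starting batch size of 10,000 with a floor of 1,000 would be reduced on repeated timeouts:

```javascript
// Sketch of the throttling schedule: the batch size is halved after each
// timeout, never dropping below the configured minimum.
function throttleSchedule(batchSize, minimumBatchSize) {
  const sizes = [batchSize];
  while (sizes[sizes.length - 1] > minimumBatchSize) {
    sizes.push(Math.max(Math.floor(sizes[sizes.length - 1] / 2), minimumBatchSize));
  }
  return sizes;
}

console.log(throttleSchedule(10000, 1000)); // [10000, 5000, 2500, 1250, 1000]
```

Once a batch succeeds, the real code below restores the full batch size rather than staying throttled forever.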

Here is the SQL procedure that gets called:

CREATE PROCEDURE dbo.MyTable_Delete
    @BatchSize int,
    @RowsAffected int output
AS
-- Note: a real purge would include a WHERE clause identifying the old,
-- no-longer-relevant rows; it is omitted here to keep the example simple.
DELETE TOP (@BatchSize)
FROM dbo.MyTable;
SET @RowsAffected = @@ROWCOUNT;

And here is the C# method that calls the proc. You have to be careful that each call to the proc executes in its own transaction. If all the iterative delete operations are wrapped in the same transaction you'll have gained nothing and perhaps even made the problem worse.

public void Delete_OldData(int batchSize, int minimumBatchSize, int transactionDelay)
{
    // If timeouts occur, we'll reduce the batch size.
    int workingBatchSize = batchSize;
    using (DbConnection dbConnection = CreateConnection())
    {
        dbConnection.Open();
        using (DbCommand dbCommand = dbConnection.CreateCommand())
        {
            try
            {
                int rowsAffected = 1;
                // We use a batch size to limit the number of deleted rows to prevent command
                // time-outs and avoid excessive table locking.
                while (rowsAffected > 0)
                {
                    dbCommand.CommandText = "[dbo].[MyTable_Delete]";
                    dbCommand.CommandType = CommandType.StoredProcedure;
                    dbCommand.Parameters.Clear();
                    dbCommand.Parameters.Add(new SqlParameter("@BatchSize", workingBatchSize));
                    SqlParameter rowsAffectedParam = new SqlParameter("@RowsAffected", SqlDbType.Int);
                    rowsAffectedParam.Direction = ParameterDirection.Output;
                    dbCommand.Parameters.Add(rowsAffectedParam);
                    try
                    {
                        dbCommand.ExecuteNonQuery();
                        rowsAffected = (int)rowsAffectedParam.Value;
                        // The batch succeeded, so restore the original batch size.
                        workingBatchSize = batchSize;
                    }
                    catch (SqlException ex)
                    {
                        if (ex.Number != -2)    // -2 means client timeout
                        {
                            // Not a timeout - nothing to throttle, so let it bubble up.
                            throw;
                        }
                        if (workingBatchSize <= minimumBatchSize)
                        {
                            throw new Exception(string.Format("Proc MyTable_Delete experienced a timeout exception. Batch size is already at minimum {0}. Unable to throttle further.", workingBatchSize));
                        }
                        // Reduce the batch size by half and try again.
                        workingBatchSize = Math.Max(workingBatchSize / 2, minimumBatchSize);
                        rowsAffected = 1;
                    }
                    // Allow the database to perform other work between batches.
                    Thread.Sleep(transactionDelay);
                }
            }
            catch (Exception)
            {
                // Log the error here, then re-throw so the caller knows the purge failed.
                throw;
            }
        }
    }
}

Dates and Time Zones in Javascript, C#, and SQL Server

An issue that always seems to plague developers who write software used in multiple geographically dispersed locations is that of dates and time zones. If a user in the Eastern time zone updates a record that contains a date and time field (say, 1/15/2014 4:35 PM), and a user in the Pacific time zone then views that record, what date and time should they see? The time as it was entered in Eastern time, or their local time instead (1/15/2014 1:35 PM)?

Generally speaking (there are, of course, exceptions), date and time information should be stored in the database as UTC (Coordinated Universal Time) and automatically converted for display in the user's current time zone by the client application. But that is often not as easy as it seems to implement. Below, I'll outline a process that has worked well for me in several apps.

Step 1 - Getting data from the client to the server


In the past, web apps collected data in an HTML <form> element that was submitted to the server, causing a full page refresh. Today, I prefer to use single-page application frameworks where almost all client-server communication is done using the Javascript XMLHttpRequest (XHR) object. I find jQuery nicely abstracts away the details of XHR with its $.ajax() API method. But how do we wrap up user-entered date-time data in an XHR request so that time zone data is preserved and can be interpreted by our server?

The Javascript Date object does have time zone support, but it's not immediately obvious how to make it work properly with the server. Data sent in an XHR request is usually JSON formatted, meaning just about everything is a string - there is no date type. Javascript Date objects have a handy toString() method that outputs a date like this:

Sun Feb 23 2014 15:33:09 GMT-0400 (Atlantic Standard Time)

This would seem to contain all the information we require, but alas, the default ASP.NET WebAPI configuration cannot translate this into a C# DateTime instance. But there is a solution - the ISO-8601 date format, which looks like this:

2014-02-23T15:33:09-04:00

In fact, most modern browsers can output this format using the toISOString() method, but there are a couple of drawbacks. First, toISOString() is not supported by all (particularly older) browsers, and second, it discards the local time zone offset and simply outputs the instant as UTC (such as 2014-02-23T19:33:09Z). Sometimes it is desirable to preserve the originating time zone data.
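You can see the UTC behavior for yourself; no matter what the browser's local zone is, toISOString() reports the same instant with a "Z" (UTC) suffix:

```javascript
// toISOString() always reports the instant in UTC; the local offset is discarded.
const entered = new Date(Date.UTC(2014, 1, 23, 19, 33, 9)); // JS months are zero-based
console.log(entered.toISOString()); // "2014-02-23T19:33:09.000Z"
```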

Thus, I prefer to use the excellent moment.js library for formatting date and time values. It is incredibly flexible and powerful and you'll find yourself using it everywhere in your Javascript code when dealing with dates. The default format() method of moment.js outputs dates in ISO format.
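If you're curious what an offset-preserving format involves, here is a rough sketch of the kind of ISO-8601 string moment's default format() produces - one that keeps the local offset instead of converting to UTC. The toIsoWithOffset helper below is my own illustration, not part of moment.js:

```javascript
// Format a Date as ISO-8601 with its local UTC offset preserved,
// e.g. "2014-02-23T15:33:09-04:00" for a machine in Atlantic Standard Time.
function toIsoWithOffset(date) {
  const pad = n => String(n).padStart(2, '0');
  const offset = -date.getTimezoneOffset(); // minutes east of UTC
  const sign = offset >= 0 ? '+' : '-';
  const abs = Math.abs(offset);
  return date.getFullYear() + '-' + pad(date.getMonth() + 1) + '-' + pad(date.getDate()) +
    'T' + pad(date.getHours()) + ':' + pad(date.getMinutes()) + ':' + pad(date.getSeconds()) +
    sign + pad(Math.floor(abs / 60)) + ':' + pad(abs % 60);
}
```

In practice, moment.js handles many more edge cases (parsing, arithmetic, locales), which is why I just use the library.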

Step 2 - Processing on the Server


Actually, with WebAPI and its default use of Newtonsoft's Json.NET library, there is nothing else you have to do on the server. Using the default model binder, any date that is submitted in ISO-8601 format will be automatically converted to local server time and, most importantly, the Kind property of the DateTime instance will be set properly to DateTimeKind.Local.

Step 3 - Saving in SQL Server


Now here's one step I see a lot of people miss - saving the datetime properly in SQL Server. We want the date stored as a UTC value so that it can be converted to any time zone for display purposes. To do this, you must convert the C# DateTime instance to UTC when adding it to the parameters of your stored procedure, like so:

parameters.Add("myDateField", myDateTime.ToUniversalTime());

Step 4 - Retrieving from SQL Server


Another step often missed by developers is the requirement to properly stamp the Kind property of the C# DateTime instance when retrieving it from the database. You must mark the value as UTC like this:

myDateField = DateTime.SpecifyKind((DateTime)dbReader["myDateField"], DateTimeKind.Utc);

Step 5 - Displaying on the Web Page


Actually, there is no step 5. The Json.NET serializer will format the date value in ISO-8601 format which is automatically converted to local time in the client Javascript code. If you output this value to the web page, it will be displayed in client local time.
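For example, a UTC value serialized by the server parses straight into a Javascript Date, which then renders in whatever zone the browser is running in:

```javascript
// The server serializes the stored UTC value in ISO-8601 format.
const fromServer = "2014-02-23T19:33:09Z";  // UTC, as stored in the database
const when = new Date(fromServer);          // parsed to the exact same instant
// toLocaleString() renders it in the user's local time zone - for example,
// a user in Atlantic Standard Time would see 3:33:09 PM.
console.log(when.toLocaleString());
```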

By following this process you can ensure that all date and time data in your database is consistently represented as UTC and can be reliably converted to the time zone used by the client application. One gotcha to watch for, however, is filtering on date values. Just like a date value being stored in the database, a client-specified date filter must be converted to UTC before it is compared against the SQL data.
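Concretely, if the user picks a calendar date as a filter boundary, convert the local midnight they mean into UTC before sending it to the server (the variable names here are just for illustration):

```javascript
// The user picks "Jan 15, 2014" as a filter boundary. Construct midnight
// in the user's local zone, then send the UTC equivalent to the server
// so the comparison happens against the UTC values in the database.
const localStart = new Date(2014, 0, 15);    // midnight, local time
const utcFilter = localStart.toISOString();  // same instant, expressed in UTC
console.log(utcFilter);
```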

Happy coding!

SQL Server 2005 Syntax Incompatible with SQL Server 2000

On a recent project we used SQL Server 2005 for development, but the product officially supports installation on both SQL Server 2005 and SQL Server 2000. During development we create tables in the database (using SQL 2005) with the GUI tools in either Visual Studio or Management Studio. When it comes time to create the installation scripts, we "Generate CREATE scripts" from these same tools. With SQL 2005 (well, at least the version we're using, which is SP2), the CREATE TABLE script uses SQL 2005-specific syntax that will not work on SQL 2000.
 
For example, here is an auto-generated script that runs on SQL 2005, but not on SQL 2000:
 
CREATE TABLE [dbo].[Order](
 [orderID] [int] NOT NULL,
 [customerID] [int] NOT NULL,
 [orderDate] [datetime] NOT NULL,
 [shipDate] [datetime] NOT NULL,
 [shipperID] [int] NOT NULL,
 [shipperTrackingNumber] [varchar](50) NULL,
 CONSTRAINT [PK_Order] PRIMARY KEY CLUSTERED
(
 [orderID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
 
If you try to run this on SQL 2000, you'll get the following error:
 
Server: Msg 170, Level 15, State 1, Line 11
Line 11: Incorrect syntax near '('.
 
It would appear that SQL 2000 does not understand the WITH (...) list of index options attached to the inline primary key constraint in the CREATE TABLE statement. Alternatively, the following script works on both SQL 2000 and SQL 2005:
 
CREATE TABLE [dbo].[Order] (
 [orderID] [int] NOT NULL ,
 [customerID] [int] NOT NULL ,
 [orderDate] [datetime] NOT NULL ,
 [shipDate] [datetime] NOT NULL ,
 [shipperID] [int] NOT NULL ,
 [shipperTrackingNumber] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
GO
 
ALTER TABLE [dbo].[Order] ADD
 CONSTRAINT [PK_Order] PRIMARY KEY  CLUSTERED
 (
  [orderID]
 )  ON [PRIMARY]
GO
 
Now, if that was the only problem I could probably live with that. But wait! There's more! When you create an object in SQL Server, it's generally good practice to first make sure the object doesn't already exist. In my day-to-day use, my scripts will often check for general objects, or foreign-key constraints. SQL 2005 uses the following code for these operations:
 
IF  EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id = OBJECT_ID(N'[dbo].[FK_Order_Customer]') AND parent_object_id = OBJECT_ID(N'[dbo].[Order]'))
 
Run this on SQL 2000 and you'll get:
 
Server: Msg 208, Level 16, State 1, Line 1
Invalid object name 'sys.foreign_keys'.
 
Also, the following code is used by SQL 2005:
 
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Order]') AND type in (N'U'))
DROP TABLE [dbo].[Order]
GO
 
which will give you the following on SQL 2000:
 
Server: Msg 208, Level 16, State 1, Line 1
Invalid object name 'sys.objects'.
 
Microsoft changed the way that meta-data is stored in SQL 2005 to improve security, amongst other things, and that's why these scripts won't work on SQL 2000. Thankfully, however, SQL Server 2005 provides compatibility views (such as dbo.sysobjects and dbo.sysforeignkeys) which mimic the old SQL 2000 system tables. To fix these errors, you can use the following syntax, which works on both SQL 2005 and SQL 2000:
 
IF  EXISTS (SELECT * FROM dbo.sysforeignkeys WHERE constid = OBJECT_ID(N'[dbo].[FK_Order_Customer]'))
ALTER TABLE [dbo].[Order] DROP CONSTRAINT [FK_Order_Customer]
GO
 
IF  EXISTS (SELECT * FROM dbo.sysobjects WHERE id = OBJECT_ID(N'[dbo].[Order]') AND type in (N'U'))
DROP TABLE [dbo].[Order]
GO
 
Thankfully, there is a way around this if you're using SQL Management Studio. Instead of right-clicking on a table to generate a CREATE script, right-click on the Database and select Tasks -> Generate Scripts. This will open the Script Wizard dialog. On the Choose Script Options page, set the "Script for Server Version" property to "SQL Server 2000" and your CREATE scripts will now be fully compatible. A little more clicking is required, but at least your scripts will work on both server versions.