Comments on ...juggling with .net & sql: SQL Server # TSQL to Convert STRING in PROPER format

After more testing, I'm concluding that you ar...

2012-10-17T08:32:58.154-07:00

After more testing, I'm concluding that you are correct. My approach is to pivot and concatenate the string back together, which is killing performance.

Hello Marc, It is really good to see your approac...

2012-10-17T00:37:55.834-07:00

Hello Marc,

It is really good to see your approach and totally agree to the point that SQL Server is better at processing SET-BASED data. And believe me, I was also first planning to almost similar approach what you have suggested but with a small difference - i.e. I was planning to split the string/text to be converted by WORDS unlike your approach of splitting them by characters.

I am mainly concerned on two points in your approach,
First is that the need for this kind of function will mostly be SCALAR based and hence will have to called for each row.

Secondly, after converting your code to a scalar function and after comparing the results with my approach, the following were the figures -

For 100 Rows, Comparative execution plan showed
My Approach : 50%
Your Approach : 50%

For 1000 Rows, Comparative execution plan showed
My Approach : 50%
Your Approach : 50%

For 3000 Rows, Comparative execution plan showed
My Approach : 50%
Your Approach : 50%

Please share your views for the same.

;WITH CTE_String_By_Char AS -- break each s...

2012-10-16T06:23:06.741-07:00

;WITH CTE_String_By_Char AS -- break each string into numbered characters
( SELECT [s].[id] as [id],
[s].[string] as [string],
[n].[n] as [character_num],
SUBSTRING([s].[string], [n].[n], 1) as [the_character]
FROM [dbo].[strings] [s],
[dbo].[Tally] [n]
WHERE LEN([s].[string]) >= [n].[n]
),
CTE_Chars_and_Prev AS -- determine the previous character
( SELECT [main].[id],
[main].[string],
[main].[character_num],
[main].[the_character],
[prev].[character_num] as [prev_char_num],
[prev].[the_character] as [prev_char]
FROM [CTE_String_By_Char] [main]
LEFT JOIN [CTE_String_By_Char] [prev] ON [main].[id] = [prev].[id] AND [main].[character_num] = [prev].[character_num] + 1
),
CTE_Chars_And_Caps AS -- if the previous character is NULL (the start of the string) or the previous character is a space, make the character uppercase
( SELECT [data].[id] as [id],
[data].[character_num] as [character_num],
[data].[the_character] as [the_character],
CASE
WHEN [data].[prev_char_num] IS NULL OR
[data].[prev_char] = ' ' THEN UPPER([data].[the_character])
ELSE [data].[the_character]
END as [corrected_char]
FROM [CTE_Chars_and_Prev] [data]
),
CTE_Concatenate AS -- put the strings back together again.
( SELECT DISTINCT [main].[id],
REPLACE(REPLACE(REPLACE(REPLACE(
(SELECT [caps].[corrected_char] [c] FROM [CTE_Chars_And_Caps] [caps] WHERE [caps].[id] = [main].[id] ORDER BY [caps].[character_num] FOR XML PATH ('')) ,
'', ''), '', ''), '', ''), ' ', ' ')
as [proper_string]
FROM [CTE_Chars_And_Caps] [main]
)
SELECT [original].[id],
[original].[string],
[results].[proper_string]
FROM [dbo].[strings] [original]
INNER JOIN [CTE_Concatenate] [results] ON [original].[id] = [results].[id]
ORDER BY [original].[id]

First, looping through a string will work, but it ...

2012-10-16T06:22:41.905-07:00

First, looping through a string will work, but it takes an iterative approach, which is less performant than a set-based approach (SQL Server excels at set-based operations).

Second, the use of a function is correct from a functional decomposition perspective (isolate functionality into reusable pieces), but from a performance perspective is sub-optimal. You will trigger table or clustered index scans as the function is called for each row you feed it.

Instead, leverage a tally table and a self-join, like this:

First, you need a tally table. I don't much care how it's created or where its stored (if at all), but these are so useful that every instance should have one available. This will create a tally table with 65,536 entries, starting with 1.

CREATE TABLE [dbo].[Tally]
(
[n] [bigint]
)

;WITH
lv0 AS (SELECT 0 AS g UNION ALL SELECT 0),
lv1 AS (SELECT 0 AS g FROM lv0 AS a CROSS JOIN lv0 AS b), -- 4
lv2 AS (SELECT 0 AS g FROM lv1 AS a CROSS JOIN lv1 AS b), -- 16
lv3 AS (SELECT 0 AS g FROM lv2 AS a CROSS JOIN lv2 AS b), -- 256
lv4 AS (SELECT 0 AS g FROM lv3 AS a CROSS JOIN lv3 AS b), -- 65,536
--lv5 AS (SELECT 0 AS g FROM lv4 AS a CROSS JOIN lv4 AS b), -- 4,294,967,296
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM lv4)
INSERT INTO [dbo].[Tally]([n])
SELECT [n] FROM [Nums];
GO

Next, throw in a few test strings:

CREATE TABLE [dbo].[strings]
(
[id] [int] IDENTITY,
[string] [varchar](max)
)
GO

INSERT INTO [dbo].[strings] ([string])
VALUES
('This is my first string'),
('This is my second string');
GO

Now you just need a query that will leverage the Tally table to break each string into it's individual characters and check if the previous character is NULL (start of a string) or if the previous character is a space. If so, make the current character uppercase; then put the string back together.

The CTE's should be commented well enough to let you know what's going on at each step. This could be further optimized, but a set-based operation will almost always perform better than an RBAR operation. (see next comment for the code)