About This Blog

.net & SQL Samples, programming tips and tricks, performance tips, guidelines, and best practices

Sunday 14 October 2012

SQL Server # TSQL to Convert STRING in PROPER format

Problem Statement

SQL Server has got in-built functions to convert the given string into LOWER() or UPPER() format but it does not provides any direct way to convert it to PROPER format. A string/text is said to be in a PROPER format if all the words in that string/text starts with a CAPITAL letter.

E.g. If a string is - “Hello, how are you?”,

String in Lower format = “hello, how are you?”

String in Upper format = “HELLO, HOW ARE YOU?”

and String in Proper format = “Hello, How Are You?”

 

Implementation

Ideally, SQL Server is not the right place to implement this kind of logic, as string operations are costlier in SQL from performance perspective. it should be either implemented in the Front-End language or the Reporting Tool as this more related to the formatting. However, if this is to be implemented in SQL, the more preferred way is to use SQL-CLR function. It does not mean that we can not achieve this with T-SQL.

Today, I will share a simple T-SQL function, which could be used to convert any given string in PROPER format. Below is the script -

CREATE FUNCTION [dbo].[PROPER]
(
  @StrToConvertToProper AS VARCHAR(MAX)
) 
RETURNS VARCHAR(MAX) 
AS
BEGIN
  --Trim the Text
  SET @StrToConvertToProper = LTRIM(RTRIM(@StrToConvertToProper))
 
  --Find the No. of Words in the Text
  DECLARE @WordCount AS INT
  SET @WordCount = LEN(@StrToConvertToProper) - LEN(REPLACE(@StrToConvertToProper,' ','')) + 1
 
  --Variable to track the space position
  DECLARE @LastSpacePosition AS INT = 0
 
  --Loop through all the words
  WHILE(@WordCount > 0)
    BEGIN
      --Set the Space Position
      SET @LastSpacePosition = CHARINDEX(' ',@StrToConvertToProper,@LastSpacePosition + 1)
      
      --Replace the Character
      SET @StrToConvertToProper = STUFF(@StrToConvertToProper,
                                        @LastSpacePosition + 1,
                                        1,
                                        UPPER(SUBSTRING(@StrToConvertToProper, @LastSpacePosition + 1, 1)))
      --Decrement the Loop counter                                      
      SET @WordCount = @WordCount - 1
    END
    
  RETURN @StrToConvertToProper
END  

When the above script is used as –



SELECT dbo.PROPER('hello, how are you?')


we get the following result - Hello, How Are You?


Conclusion


The given script could be used to convert any string in PROPER format using T-SQL. However, I would personally prefer converting the string at the Front-End or in the Reporting tool to display the string in this format.

4 comments:

  1. First, looping through a string will work, but it takes an iterative approach, which is less performant than a set-based approach (SQL Server excels at set-based operations).

    Second, the use of a function is correct from a functional decomposition perspective (isolate functionality into reusable pieces), but from a performance perspective is sub-optimal. You will trigger table or clustered index scans as the function is called for each row you feed it.

    Instead, leverage a tally table and a self-join, like this:

    First, you need a tally table. I don't much care how it's created or where its stored (if at all), but these are so useful that every instance should have one available. This will create a tally table with 65,536 entries, starting with 1.

    CREATE TABLE [dbo].[Tally]
    (
    [n] [bigint]
    )

    ;WITH
    lv0 AS (SELECT 0 AS g UNION ALL SELECT 0),
    lv1 AS (SELECT 0 AS g FROM lv0 AS a CROSS JOIN lv0 AS b), -- 4
    lv2 AS (SELECT 0 AS g FROM lv1 AS a CROSS JOIN lv1 AS b), -- 16
    lv3 AS (SELECT 0 AS g FROM lv2 AS a CROSS JOIN lv2 AS b), -- 256
    lv4 AS (SELECT 0 AS g FROM lv3 AS a CROSS JOIN lv3 AS b), -- 65,536
    --lv5 AS (SELECT 0 AS g FROM lv4 AS a CROSS JOIN lv4 AS b), -- 4,294,967,296
    Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM lv4)
    INSERT INTO [dbo].[Tally]([n])
    SELECT [n] FROM [Nums];
    GO

    Next, throw in a few test strings:

    CREATE TABLE [dbo].[strings]
    (
    [id] [int] IDENTITY,
    [string] [varchar](max)
    )
    GO

    INSERT INTO [dbo].[strings] ([string])
    VALUES
    ('This is my first string'),
    ('This is my second string');
    GO

    Now you just need a query that will leverage the Tally table to break each string into it's individual characters and check if the previous character is NULL (start of a string) or if the previous character is a space. If so, make the current character uppercase; then put the string back together.

    The CTE's should be commented well enough to let you know what's going on at each step. This could be further optimized, but a set-based operation will almost always perform better than an RBAR operation. (see next comment for the code)

    ReplyDelete
  2. ;WITH CTE_String_By_Char AS -- break each string into numbered characters
    ( SELECT [s].[id] as [id],
    [s].[string] as [string],
    [n].[n] as [character_num],
    SUBSTRING([s].[string], [n].[n], 1) as [the_character]
    FROM [dbo].[strings] [s],
    [dbo].[Tally] [n]
    WHERE LEN([s].[string]) >= [n].[n]
    ),
    CTE_Chars_and_Prev AS -- determine the previous character
    ( SELECT [main].[id],
    [main].[string],
    [main].[character_num],
    [main].[the_character],
    [prev].[character_num] as [prev_char_num],
    [prev].[the_character] as [prev_char]
    FROM [CTE_String_By_Char] [main]
    LEFT JOIN [CTE_String_By_Char] [prev] ON [main].[id] = [prev].[id] AND [main].[character_num] = [prev].[character_num] + 1
    ),
    CTE_Chars_And_Caps AS -- if the previous character is NULL (the start of the string) or the previous character is a space, make the character uppercase
    ( SELECT [data].[id] as [id],
    [data].[character_num] as [character_num],
    [data].[the_character] as [the_character],
    CASE
    WHEN [data].[prev_char_num] IS NULL OR
    [data].[prev_char] = ' ' THEN UPPER([data].[the_character])
    ELSE [data].[the_character]
    END as [corrected_char]
    FROM [CTE_Chars_and_Prev] [data]
    ),
    CTE_Concatenate AS -- put the strings back together again.
    ( SELECT DISTINCT [main].[id],
    REPLACE(REPLACE(REPLACE(REPLACE(
    (SELECT [caps].[corrected_char] [c] FROM [CTE_Chars_And_Caps] [caps] WHERE [caps].[id] = [main].[id] ORDER BY [caps].[character_num] FOR XML PATH ('')) ,
    '', ''), '', ''), '', ''), ' ', ' ')
    as [proper_string]
    FROM [CTE_Chars_And_Caps] [main]
    )
    SELECT [original].[id],
    [original].[string],
    [results].[proper_string]
    FROM [dbo].[strings] [original]
    INNER JOIN [CTE_Concatenate] [results] ON [original].[id] = [results].[id]
    ORDER BY [original].[id]

    ReplyDelete
  3. Hello Marc,

    It is really good to see your approach and totally agree to the point that SQL Server is better at processing SET-BASED data. And believe me, I was also first planning to almost similar approach what you have suggested but with a small difference - i.e. I was planning to split the string/text to be converted by WORDS unlike your approach of splitting them by characters.

    I am mainly concerned on two points in your approach,
    First is that the need for this kind of function will mostly be SCALAR based and hence will have to called for each row.

    Secondly, after converting your code to a scalar function and after comparing the results with my approach, the following were the figures -

    For 100 Rows, Comparative execution plan showed
    My Approach : 50%
    Your Approach : 50%

    For 1000 Rows, Comparative execution plan showed
    My Approach : 50%
    Your Approach : 50%

    For 3000 Rows, Comparative execution plan showed
    My Approach : 50%
    Your Approach : 50%

    Please share your views for the same.

    ReplyDelete
  4. After more testing, I'm concluding that you are correct. My approach is to pivot and concatenate the string back together, which is killing performance.

    ReplyDelete