Improving SQL Queries
Through Proper Indexing
September 8, 2014
As a newbie to SQL, I'm still trying to get the hang of building schemas and accessing data, but one of the easiest things you can do to cut down the time it takes to run complex queries against a large dataset is to build proper indexing structures into your schema.
One of the best explanations of indexing that I’ve come across compares a database to a library. Think of each piece of data as a book. Searching data that is not properly indexed is like asking a librarian to physically walk through every shelf in the library looking for the particular book you need. In a library of a thousand books this could take a very long time. But consider a library of ten million books!
#1 Creating A Primary Key
In the example above, the library has no way of organizing its books. This represents a table that contains no primary key.
One of the simplest ways for the librarian to organize the books would be to assign each book a number as it comes in. Giving a table a primary key is like organizing a library by the Dewey Decimal System: it gives you a reference number to ask for when looking for a certain book, or a certain piece of data. You number each book and arrange the shelves according to those numbers, so the librarian knows which range of numbers sits on which shelf. If a patron comes in looking for a book with a particular number, the librarian can go straight to the shelf containing that number and then scan that one shelf for the book. It will now take two minutes to find the book rather than two or more hours.
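To make this concrete, here is a minimal sketch of a hypothetical dbo.Library table (the table and column names are my own assumptions, chosen to match the examples later in this post). In SQL Server, the PRIMARY KEY constraint creates a clustered index by default, so the rows are physically ordered by BookID, just like shelving books by their assigned numbers.

-- Hypothetical example table; names are assumptions for illustration.
CREATE TABLE dbo.Library
(
    BookID       INT           NOT NULL PRIMARY KEY, -- creates a clustered index by default
    BookTitle    NVARCHAR(200) NOT NULL,
    AuthorName   NVARCHAR(100) NOT NULL,
    Published_on DATE          NULL
);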
#2 Creating A Non-Clustered Index
Creating a primary key, however, only solves one problem in our data search. What if the library patron knows the name of the book they want, but not the book’s serial number?
The librarian now has to go through all the books individually to find the book title that the patron wants. To solve this problem, the librarian can create a catalog or index that contains the book title and its corresponding serial number. This index can be alphabetically organized and books can be grouped by the first letter of their title.
The next time a patron asks for a book titled, say, “Walking With Dinosaurs”, the librarian merely has to go to section “W” of the catalog, look up “Walking With Dinosaurs”, and find its corresponding serial number. Once she has that, it will take her two minutes to walk to the correct shelf in the library and find the book.
In a library, you would want to create a catalogue or index for every way that a patron may want to find a book. What if the patron knows the author, but doesn’t know the book title? What if they know the subject, but not the title or the author?
In database terms, each non-clustered index page contains a range of book titles along with their corresponding primary key values. When a book is searched for by title, the database engine first seeks the non-clustered index tree on book title to find the primary key value of that book. Once found, the database engine then seeks the clustered index tree with that primary key to find the actual row being searched for.
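As a rough sketch (assuming the hypothetical dbo.Library table from earlier), the non-clustered index below plays the role of the title catalog; the index name is my own choice for illustration:

CREATE NONCLUSTERED INDEX IX_Library_BookTitle
ON dbo.Library (BookTitle);

-- A title search now seeks the index for the matching primary key,
-- then looks up the full row in the clustered index:
SELECT BookID, AuthorName, Published_on
FROM dbo.Library
WHERE BookTitle = 'Walking With Dinosaurs';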
But where should we create non-clustered indexes?
According to codeproject.com, the following criteria should guide where you create non-clustered indexes (a short sketch follows the list). Non-clustered indexes should be put on columns that are:
- Frequently used in the search criteria
- Used to join other tables
- Used as foreign key fields
- Highly selective (a particular value matches only a low percentage, roughly 0-5%, of the total rows)
- Used in the ORDER BY clause
- Of type XML
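For example, here is a sketch of the join/foreign-key case using a hypothetical dbo.Loans table (not part of the original post): BookID is both a foreign key and a common join column, which makes it a natural candidate for a non-clustered index.

-- Hypothetical child table; names are assumptions for illustration.
CREATE TABLE dbo.Loans
(
    LoanID   INT  NOT NULL PRIMARY KEY,
    BookID   INT  NOT NULL REFERENCES dbo.Library (BookID), -- foreign key to the books table
    LoanedOn DATE NOT NULL
);

-- BookID is used as a foreign key and in joins with dbo.Library,
-- so it is a good candidate for a non-clustered index:
CREATE NONCLUSTERED INDEX IX_Loans_BookID
ON dbo.Loans (BookID);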
#3 Creating Covering Indexes
A covering index specifies which additional column values should be stored in the index pages alongside the primary key values. Covering indexes should include only a few columns that are frequently used in SELECT queries.
For example, if you are creating an index on author name, you may also want to store the values for the book title and publication date. If you include too many columns, more memory and storage are required to hold them, and you risk over-consuming memory and slowing performance.
To create this particular index you can use the following code:
CREATE NONCLUSTERED INDEX Library_AuthorName
ON dbo.Library (AuthorName)
INCLUDE (BookTitle, Published_on);
How this works:
Say you want to know the book title and publication date for every book written by a particular author. Your SELECT statement would look like this:
SELECT BookTitle, Published_on FROM dbo.Library WHERE AuthorName = 'Edgar Allan Poe';
Without the included columns, here is what would happen: the SQL Server engine first seeks the non-clustered index tree on AuthorName to find the primary key value of every matching row. Then, for each of those primary keys, it seeks into the clustered index tree to find the actual row location in the corresponding page and reads the BookTitle and Published_on values from that row.
Please note that in those steps, the engine has to search the clustered index tree once for every primary key entry matching AuthorName = 'Edgar Allan Poe'; if there are a thousand books authored by Edgar Allan Poe, it has to traverse the clustered index tree a thousand times just to retrieve the BookTitle values. The covering index skips that step entirely: the engine seeks the non-clustered index tree on AuthorName once and reads BookTitle and Published_on directly from the index pages.
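If you want to see the difference yourself, one rough approach (assuming you are running the queries in SQL Server Management Studio or a similar client) is to compare the logical reads reported with and without the covering index:

SET STATISTICS IO ON;

-- With the covering index Library_AuthorName in place, this query can be
-- answered from the non-clustered index alone, so the "logical reads"
-- reported in the Messages tab should drop noticeably compared to the
-- lookup-per-row plan described above.
SELECT BookTitle, Published_on
FROM dbo.Library
WHERE AuthorName = 'Edgar Allan Poe';

SET STATISTICS IO OFF;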
IN SUMMARY:
The most basic way to fix query speed problems for databases containing large amounts of data is to create the appropriate primary keys, non-clustered indexes, and covering indexes.
Database optimization does not stop there, however. There are a number of other techniques we must look at, but I will save those for a later date.
References:
http://www.codeproject.com/Articles/34372/Top-steps-to-optimize-data-access-in-SQL-Server