Database Programming Notes: June 2011

Tuesday, June 28, 2011

Normalization

A database in the 1st normal form does not have multivalued columns like the one below.

Name | PhoneNumber

John | 122345, 456879

In a database which is already in the 1st normal form, if all columns are functional dependent on the whole key, then the database is said to be in the 2nd normal form.

In a database, if all the columns are directly dependent on the whole key then the database is said to be in the 3rd normal form.

Transactions

Transactions in SQL provide the ACI of ACID properties in relational databases.

A - Atomocity

Atomicity means all or nothing. Imagine, we have an application for a bank, and we need to implement the logic for a transfer of money from one account to another account. This procedure will reduce the balance in one account and increase the balance in another account. We have to make sure that either both these steps happen or none happen. It would be very undesirable if we reduced the balance of one account without increasing the balance of the other. If both these sql statements are executed in a transaction then we are guaranteed that either both will be executed or none will be executed, because transactions guarantee Atomicity.

C - Consistency

Imagine a forum application where we have a table for a post, and a table for user such that the post table has a foreign key for the user. What would happen if we delete a use who has several entries in the post table? We will end up with rows in the post table which no longer refer to a valid user. This violates a constraint in the post table, leaving the database in an inconsistent state. Well in an RDBMS this will not happen because it guarantees consistency.

I - Isolation

Isolation means the results of a transactions are invisible to another transaction until that transaction completes.

Monday, June 27, 2011

Complex SQL queries

In simple SQL queries we usually query one or multiple tables, and possibly with different types of joins.

The notion of a complex query is rather arbitrary, some people would consider any query with a join to be complex. However for this discussion we consider a complex query, one in which we need to group results, perform some aggregations, etc.

Imagine we have a database containing tables for books and authors. We want to find the number of authors for each book. We can write this query with a GroupBy clause and the COUNT aggregate function.

SELECT Book.title,         COUNT(*) AS Authors     FROM  Book JOIN Book_author        ON Book.isbn = Book_author.isbn     GROUP BY Book.title;

Now let's look at another query which is a bit more complex than the previous query.

select users.user_id,         users.email,         count(*) as how_many,        max(classified_ads.posted) as how_recent from users, classified_ads where users.user_id = classified_ads.user_id group by users.user_id, users.email order by how_recent desc, how_many desc;

The query above joins the users table with the classified_ads table. It selects user_id and email from users, while joining the user_id field in users and classified_ads tables. So basically the result is going to contain the user_id and email of all users who are in the classified_ads table (basically those users who have published classified ads). The group by users.user_id, users.email clause specifies that we want to bundle together all rows which have the same values in the users.user_id and users.email column. What this means is that all these rows will be represented as one row. Representing multiple rows as one row would in make cases make sense only if we run some aggregate function on these rows. In the above case, we use two aggregate functions count, and max. The count will compute the number of rows clumped by due to the group by clause. The max aggregate has been specified on the classified_ads.posted field. What this will do is compute the row which has the maximum value in the classified_ads.posted field. The order by clause specifies the order in which the rows will be listed. The above sql specifies that we want to order by how_recent desc, which will display the most recent classified ad posting at the top. For those rows which have the same how_recent value, the rows will be ordered by the how_many computed value. One thing to remember here is that most databases will record timestamps with a subsecond precision. So two dates showing the same data (upto day level precision) may not be the same in reality.

Having:

The where clause works on rows, whereas the having clause works on groups.

Simple SQL queries

The following sql will give us the number of rows in the user table:

select count(*) from users;

The following sql will tell us the schema of the user table:

describe users

A simple query from one table will typically have the following structure:

the select list (which columns in our report)
the name of the table
the where clauses (which rows we want to see)
the order by clauses (how we want the rows arranged)

select email from users where email like '%ail.com';

select email from users where email like '%ail.com' order by email;
select count(*) from users where registration_date >= '1998-09-01' and registration_date < '1998-10-01' and email like '%mit.edu';
select user_id, email  from users  where 0 < (select count(*)             from classified_ads            where classified_ads.user_id = users.user_id);
select user_id, email  from users  where exists (select 1               from classified_ads               where classified_ads.user_id = users.user_id);
AND A JOIN:
select users.user_id, users.email, classified_ads.posted from users, classified_ads where users.user_id = classified_ads.user_id order by users.email, posted;
The above query will not show users who have not posted any classifieds. 
If we want to see users who have not posted any classifieds, then we need an outer join.
select users.user_id, users.email, classified_ads.posted from users, classified_ads where users.user_id = classified_ads.user_id(+) order by users.email, posted;

Extending a simple query into a JOIN

Suppose that you have a query from one table returning almost everything that you need, except for one column that's in another table. Here's a way to develop the JOIN without risking breaking your application:

add the new table to your FROM clause
add a WHERE constraint to prevent Oracle from building a Cartesian product
hunt for ambiguous column names in the SELECT list and other portions of the query; prefix these with table names if necessary
test that you've not broken anything in your zeal to add additional info
add a new column to the SELECT list

Database modelling

When we think of database design, we need to think of the following:

Table design:

What rows will the table contain?
What will the primary key be?
Do we need to apply any constraints to any of the rows?
How will this table relate to other existing tables in the system?

Normalization:

What normal form is our design in?
Can we improve it?

This page in Phillip Greenspun's online book, explains his own process of modelling his application's database. Excellent page with a lot of learnings.

Why are databases better than text files

A text file is the simplest database in the world. If you want to store your friend's contact numbers, all you need is a simple file - contacts.txt, where you can store a number per line, and use the space character as a delimiter.

Joe Smith 4014

Jack Brown 7864

July Dustin 8768

This is fine, until you decide to create a multiuser application which will allow your entire family to read and write phone numbers. Because this is a multi-user application, it is possible that 2 or more people may write numbers to the file at the same time. Because we do not have concurrency control in files, we might get into a scenario where if two writes happen simultaneously, one write may override the other write, or they both might even get garbled.

It is also possible that your power gets tripped while your process is writing data, and only half the data gets written. Now you have a database which could be corrupt in a variety of ways.

If you choose a Relational Database Management System, it will give you ACID, Indexes (also here is a nice introduction to indexes), and the ability to use SQL to manage (retrieve and insert) the data. These properties will be immensely useful as your software and it's requirements grow.

Databases keep their information in tables, where each table can be thought of as a spreadsheet, which has rows and columns. The columns describe one aspect of the data being described, and each row describes one object of data. We create these tables with SQL, and very specifically the DDL (data definition language) part of SQL.

Once we have defined the database structure, we insert and retrieve data from it. This is also done using SQL, specifically the DML (data manipulation language) part of it.

The following SQL statement is the simplest join:

select * from mailing_list, phone_number;

This returns the Cartesian product of the rows in both the tables. In most practical cases, we will restrict the query with a where clause.

select * from mailing_list, phone_number where mailing_list.email == phone_number.email;

The above query will only return those rows together if the email fields match.

Things to understand:

What are the ACID properties of databases and what are their advantages?
What is concurrency and how is it handled?
Why is it a huge benefit to have SQL as a declarative language?
What is the advantage of using Indexes?
What is a table?
What is a primary key, and why do we need one?
What relationships can exist between tables, and what do they mean?

Pages