September 2, 2020

Data Model

A database design usually starts by the high-level description of the schema. The idea is to analyse the organisation requirements and frame it on a schema that a variety of users would understand.

The outcome of this exercise, is a data model. A data model is a representation (often in a diagram) of the data relationship in an organisation. This representation will describe the data and provide a guideline of how this data could be manipulated and retrieved.

The data model is composed of three components

Structural: A collection of relationships that construct the data. This includes the diagrams, the entities and their dependencies, what their names are, what types of data those entities host, etc.
Manipulative: The available operations on the data. Directives to instruct what data can be retrieved, updated, or deleted.
Integrity constraints: Checks that guarantee data integrity.

The optimal data model can be found if the following criteria is satisfied:

Consistent: The model represents a clear mapping of the organisation information.
Minimal: No entity or property is dispensable. No non-necessary dependencies are present. It is also easy for all the organisation users to understand the data model.
Distinguishable: Different data can be told apart without confusion.
Non-redundant: There is only one way to represent an entity and its attributes.
Not application bound: The data model should not change based on what database system is being used.
Extensible: New features can be added without breaking the existent schema.
Easy to draw: If the diagram of a said data-model is difficult to draw, one of the criteria above hasn't been completely satisfied.

Entity-Relationship (ER)

An entity is an object in the organisation that can be represented in the database (a person, location, thing, concept, event, etc...). An entity can have any number of attributes. Those attributes are properties that describe the data stored in that entity. An entity that represents a location, could have a set of coordinates as one of its attributes. Entities are then connected via relationships.

Database Design Phases

The process of designing a database can be split into three different phases:

Conceptual Phase: The creation of a data model concept that is put to proof against users' requirements. This phase does not involve any particular implementation details. This phase will present the entities and their attributes.
Logical Phase: The model concept is enhanced with the relationship logic. This means that the relationship between data, its constrains, and data retrieval/update/deletion queries are added. The model should then be put to proof against users' requirements once more.
Physical Phase: Decision of how the database is implemented. It takes into account which database management system (DBMS) will be used, what tables will be created, what security protocol will be implemented, what constraints and indexes will be created, and so on.

Note that the database design is an organic process. The database will change as the organisation grows and requires updates.

Functional Dependency

A functional dependency between attributes describes how those attributes are related.

Formal definition: If A and B are attributes of a relation R, B is functionally dependent on A if each value of A is associated with exactly one value of B. In this case, A is called the determinant Notation: A ® B (or A -> B)

Example: Consider an entity Player with the following attributes: playerId, fName, position. We can determine the position of a certain player, by knowing what the value of playerId is. In other words, playerId determines the position (playerId -> position). But the opposite is not true, knowing the position doesn't imply knowing the playerId.

Functional dependencies can be broken down into 3 different types:

Partial Functional Dependency: A -> B is a partial functional dependency if there is an attribute that can be removed from A and yet the dependency still holds. Consider the entity Player with the attributes playerId, name, and teamId. We have that:
```
playerId, name -> teamId
```
But the name attribute is not really necessary to determine teamId. Therefore, the name attribute could be removed but the functional dependency would still exist between playerId and teamId.
Full Functional Dependency: A -> B is a full functional dependency if removal of any attribute from A results in the dependency no longer existing. Consider the entity Player with attributes playerId and position. We then have that:
```
playerId -> position
```
If we remove the playerId attribute, the dependency no longer exists.
Transitive Dependency: Given that A, B, and C are attributes of a relation and A -> B and B -> C, then C is a transivite dependency of A via B. Consider the entity Player:
```
playerId -> name, position, teamId, teamName
teamId -> teamName
```
playerId determines the teamName attribute via the teamId attribute. In this case, teamName is transitively dependent on playerId via teamId.

Normalisation

Normalisation is one of the most important aspects of database design. It is common to find databases that have more entities than necessary. The same redundancy often happens to entities' attributes as well. The normalisation technique aims to minimise those redundancies by finding the minimal set of entities and attributes required to meet users' requirements.

A normalised database is easier to manipulate and to understand. Normalisation provides easy access to the database, prevents maintenance overheads, and enhances user experience in general. Additionally, new users will have a more compact learning curve when being introduced to the database for the first time.

Normalisation rules can be broken down into the following normal forms:

First Normal Form
Second Normal Form
Third Normal Form
Boyce and Codd Normal Form (BCNF)
Fourth Normal Form

First Normal Form (1NF)

For a table to be in the 1NF, it must conform to the following 4 rules:

Single Valued Attributes: Each column in a table should not contain multiple values. As a hypothetical example, on a table defining a Location entity, an entry in the column "country" cannot be "New Zealand & Brazil".
Attribute Domain does not change: In each column, the values stored must be of the same type. On a table defining a Location entity, an entry in the column "country" cannot be a number, while the other entries are a text.
Unique name for Attributes/Columns: No repetition of columns names within the same entity. The entity "Country" cannot have two columns called "population".
Order doesn't matter: The order that the data is stored in the table does not affect any database operation.

Second Normal Form (2NF)

A table is in the 2NF if:

It is in the First Normal Form (1NF)
Has Partial Dependency: An attribute in a table depends on a part of the primary key, but not on the whole key. As an example, if we have an entity "Student Score" which has the attributes: "student_id", "subject_id", "score", and "teacher_name". The attribute "teacher_name" only depends on the "subject_id", and the primary key of this table is the combination of "student_id + subject_id". Note that to remove this dependency we only need to move the "teacher_name" property to the "Subject" entity.

Alternative Definition: A table is in the first normal form and every non-primary-key attribute is fully functionally dependent on the primary key.

Note: A table with transitive dependencies can be in the 2NF.

Third Normal Form (3NF)

A table is in the 3NF if:

It is in the Second Normal Form (2NF)
Does not have Transitive Dependency (TD): The TD happens when a non-prime attribute has a dependency on another non-prime attribute. If a "Score" entity has the properties: "score_id", "student_id", "subject_id", "marks", "exam_name", "total_marks". The prime keys are "student_id" and "subject_id". But the "total_marks" depends on the "exam_name", that does not belong to the prime keys set. In this case, the solution is to create a new "Exam" entity with the attributes: "exam_name" and "total_marks"

Alternative Definition: A table is in the first and second normal form and no non-primary-key attribute is transitively dependent on the primary-key.

Boyce and Codd Normal Form (BCNF)

A table is in the BCNF if:

It is in the Third Normal Form (3NF)
For each functional dependency (X -> Y), X should be a super key. If we have an entity "Students Subjects" with the attributes: "student_id", "subject", and "professor", we have "student_id" and "subject" as the primary keys. Given that a subject can have multiple professors, there is a dependency between professor -> subject. So we have an attribute depending on a prime/super key. To solve this problem, the solution is to create a "Professor" entity with the columns "professor_id", "professor_name", and "subject".

Fourth Normal Form (4NF)

A table is in the BCNF if:

It is in the Boyce and Codd Normal Form (BCNF)
There is no Multi-Valued Dependency Given that A -> B (B depends on A), if multiple values of B exist, then the table has a multi-valued dependency. Imagine we have an entity with three columns: "student_id", "course", and "hobby". Both "course" and "hobby" depend on the student id. A way to fix it is to create two entities: "Student Hobby", and "Student Course". This dependency usually infers that the table has a combination of columns that don't belong together. (Why are hobby and course in the same table?)