(CSE201) Advanced Databases
Yuxuan Wu Lv13

Storage

Building a Database: High-Level

  • Design conceptual schema using a data model.

    • Design relational schema, e.g., student(sid, name)
  • Data Definition Language (DDL)

    • CREATE TABLE student (sid char(8) primary key, name varchar(32))
  • Data Manipulation Language (DML)

    • INSERT INTO student VALUES (‘00112233’, ‘Paul’)

Access Time

Access time is the time it takes from when a read or write a block request is issued to when data transfer begins (or ends).

Access time = Seek time + Rotational latency + (Transfer time)

  • Seek time – time it takes to position the arm over the right track.

    • Average seek time is about of the worst case seek time (e.g., from the innermost to the outermost)
  • Rotational latency – time it takes for the sector to be accessed to appear under the head.

    • Average latency is 1/2 of the worst case latency (e.g., nearly 360 degree rotation)
  • Transfer time – time to actually read/write the data in the sector once the head is positioned, that is, the time for the disk to rotate over the sectors.

    • In most cases, transfer time is much less than the seek time and rotational latency.

Disk Block

  • A contiguous sequence of sectors from a single track

  • Data is transferred between disk and main memory in blocks

  • Sizes range from 512 bytes to several kilobytes

Organization of Records in Files

Heap – a record can be placed anywhere in the file where there is space

Sequential – store records in sequential order, based on the value of the search key of each record

Hashing – a hash function computed on some attribute of each record; the result specifies in which block of the file the record should be placed

Records of each relation may be stored in a separate file. In a multi-table clustering file organization records of several different relations can be stored in the same file

Motivation: store related records on the same block to minimize I/O

![image-20210219194326826](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210219194326826.png)

Index

Indexing mechanisms used to speed up access to the desired data.

Indexing is a data structure technique which allows you to quickly retrieve records from a database file.

An Index is a small table having only two columns.

  • The first column comprises a copy of the primary or candidate key of a table.
  • Its second column contains a set of pointers for holding the address of the disk block where that specific key value stored.

Types of Indexing

img

The Structure of Index

Data file: collection of blocks holding records on disk

Index file: a data structure allowing the DBMS to find particular records in a data file more efficiently.

  • An index file consists of records (called index entries) of the form:

    • Search-key - pointer
  • Search Key: one or set of attributes used to look up records in a file.

Relationship: a search key K in the index file is associated with a pointer to a data-file record that has search key K.

Index Techniques

  • Depending on the organisation of index file, an index can be:
    • an ordered Index where index entries are sorted on the search key value. (Primary indexing)
    • a hashing Index where hashing technique is employed to organize index entries (Secondary indexing)

Ordered Indices (Primary index)

  • Dense index: index record appears for every search-key value in the file.

  • Sparse Index: contains index records for only some search-key values.

Dense Index vs. Sparse Index

  • Index size

    • Sparse index is smaller
  • Requirement on data file

    • The data file must be sequential file
  • Lookup

    • Sparse index is smaller and may fit in memory
    • Dense index can directly tell if a record exists.
  • Update

    • Sparse index requires less space and maintenance for insertion and deletion.
  • Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-keyvalue in the block.

Dense index files

![image-20210219210300684](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210219210300684.png)

Sparse index files

![image-20210219210345875](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210219210345875.png)

An ordered index can also be:

  • Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.

    • Also called clustering index. The search key of a primary index is usually but not necessarily the primary key.
    • Can be sparse
  • Secondary index: an index whose search key specifies an order different from the sequential order of the file.

    • Also called non-clustering index.

    • Can not be sparse

  • Index-sequential file: ordered sequential file with a primary index.

![image-20210219211429706](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210219211429706.png)

  • Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
  • Secondary indices have to be dense

Primary and Secondary Indices

  • Indices offer substantial benefits when searching for records.

  • But: updating indices imposes overhead on database modification - when a file is modified, every index on the file must be updated,

  • Sequential scan using primary index is efficient

  • But a sequential scan using a secondary index is expensive

    • Each record access may fetch a new block from disk
    • Block fetch requires about 5 to 10 milliseconds; versus about 100 nanoseconds for memory access

Clustering Index

In a clustered index, records themselves are stored in the Index and not pointers. Sometimes the Index is created on non-primary key columns which might not be unique for each record. In such a situation, you can group two or more columns to get the unique values and create an index which is called clustered Index. This also helps you to identify the record faster.

Example:

Let’s assume that a company recruited many employees in various departments. In this case, clustering indexing in DBMS should be created for all employees who belong to the same dept.

It is considered in a single cluster, and index points point to the cluster as a whole. Here, Department _no is a non-unique key.

What is Multilevel Index?

Multilevel Indexing in Database is created when a primary index does not fit in memory. In this type of indexing method, you can reduce the number of disk accesses to short any record and kept on a disk as a sequential file and create a sparse base on that file.

img

B-Tree Index

B-tree index is the widely used data structures for tree based indexing in DBMS. It is a multilevel format of tree based indexing in DBMS technique which has balanced binary search trees. All leaf nodes of the B tree signify actual data pointers.

Moreover, all leaf nodes are interlinked with a link list, which allows a B tree to support both random and sequential access.

img

  • Lead nodes must have between 2 and 4 values.
  • Every path from the root to leaf are mostly on an equal length.
  • Non-leaf nodes apart from the root node have between 3 and 5 children nodes.
  • Every node which is not a root or a leaf has between n/2] and n children.

Index Definition in SQL

  • Create an index
1
2
3
4
create index <index-name> on <relation-name>(<attribute-list>)

create index b-index on branch(branch_name)

  • To drop an index
1
n drop index <index-name>
  • Most database systems allow specification of type of index.

Advantages of Indexing

Important pros/ advantage of Indexing are:

  • It helps you to reduce the total number of I/O operations needed to retrieve that data, so you don’t need to access a row in the database from an index structure.
  • Offers Faster search and retrieval of data to users.
  • Indexing also helps you to reduce tablespace as you don’t need to link to a row in a table, as there is no need to store the ROWID in the Index. Thus you will able to reduce the tablespace.
  • You can’t sort data in the lead nodes as the value of the primary key classifies it.

Disadvantages of Indexing

Important drawbacks/cons of Indexing are:

  • To perform the indexing database management system, you need a primary key on the table with a unique value.
  • You can’t perform any other indexes in Database on the Indexed data.
  • You are not allowed to partition an index-organized table.
  • SQL Indexing Decrease performance in INSERT, DELETE, and UPDATE query.

Summary:

  • Indexing is a small table which is consist of two columns.
  • Two main types of indexing methods are 1)Primary Indexing 2) Secondary Indexing.
  • Primary Index is an ordered file which is fixed length size with two fields.
  • The primary Indexing is also further divided into two types 1)Dense Index 2)Sparse Index.
  • In a dense index, a record is created for every search key valued in the database.
  • A sparse indexing method helps you to resolve the issues of dense Indexing.
  • The secondary Index in DBMS is an indexing method whose search key specifies an order different from the sequential order of the file.
  • Clustering index is defined as an order data file.
  • Multilevel Indexing is created when a primary index does not fit in memory.
  • The biggest benefit of Indexing is that it helps you to reduce the total number of I/O operations needed to retrieve that data.
  • The biggest drawback to performing the indexing database management system, you need a primary key on the table with a unique value.

B+ tree index

B+-Tree is “short” and “Fat”:

  • Disk-based: one node per block; large fan-out

  • Balanced (more or less): good performance guarantee

https://www.guru99.com/introduction-b-plus-tree.html

Rules for B+ Tree

Here are essential rules for B+ Tree.

  • Leaves are used to store data records.
  • It stored in the internal nodes of the Tree.
  • If a target key value is less than the internal node, then the point just to its left side is followed.
  • If a target key value is greater than or equal to the internal node, then the point just to its right side is followed.
  • The root has a minimum of two children.
B + Tree B Tree
Search keys can be repeated. Search keys cannot be redundant.
Data is only saved on the leaf nodes. Both leaf nodes and internal nodes can store data
Data stored on the leaf node makes the search more accurate and faster. Searching is slow due to data stored on Leaf and internal nodes.
Deletion is not difficult as an element is only removed from a leaf node. Deletion of elements is a complicated and time-consuming process.
Linked leaf nodes make the search efficient and quick. You cannot link leaf nodes.

In a B+-Tree:

  • n is the number of pointers in a node; pointers: P1, P2, …Pn

  • Search keys: K1 < K2 < K3 < . . . < Kn–1

  • All paths (from root to leaf) have same length (balanced trees specific character)

  • Root must have at least two children

  • In each non-leaf node (inner node), more than half (≥⎡n/2⎤ ) pointers must be used

  • Each leaf node must contain at least ⎡(n-1)/2)⎤ keys

![image-20210219214406436](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210219214406436.png)

Example

![image-20210219214509314](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210219214509314.png)

Characters of the B+ trees

  • Since the inner-node connections are done by pointers, “logically” close blocks need not be “physically” close.

  • The non-leaf levels of the B+-tree form a hierarchy of sparse indices.

  • If there are K search-key values in the file

    • The B+-tree height is no more than ⎡log⎡n/2⎤(K)⎤ .

    • Level below root has at least 2* ⎡n/2⎤ values

    • Next level has at least 2* ⎡n/2⎤ * ⎡n/2⎤ values

  • Searching can be conducted efficiently

  • Insertion and deletion to the main file can be handled efficiently, as the index can be restructured in logarithmic time

Search operation

  • To find the required record, you need to execute the binary search on the available records in the Tree.
  • In case of an exact match with the search key, the corresponding record is returned to the user.
  • In case the exact key is not located by the search in the parent, current, or leaf node, then a “not found message” is displayed to the user.
  • The search process can be re-run for better and more accurate results.
1
2
3
4
5
1. Call the binary search method on the records in the B+ Tree.
2. If the search parameters match the exact key
The accurate result is returned and displayed to the user
Else, if the node being searched is the current and the exact key is not found by the algorithm
Display the statement "Recordset cannot be found.

Procedure in updates on B+ tree

  1. Find the leaf node in which the search-key value would appear

  2. If the search-key value is already present in the leaf node

    2.1. Add record to the file

    2.2. If necessary add a pointer to the bucket.

  3. If the search-key value is not present, then

    3.1. add the record to the main file (and create a bucket if necessary)

    3.2. If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node

    3.3. Otherwise, split the node (along with the new (key-value, pointer) entry)

  4. Splitting a leaf node:

    4.1. take the n (search-key value, pointer) pairs (including the one being inserted) an in-memory area M in sorted order. Place the first ⎡n/2⎤ in the original node, and the rest in a new node.

    4.2. let the new node be p, and let k be the least key value in p. Insert (k,p) in the parent of the node being split.

    4.3. If the parent is full, split it and propagate the split further up.

  5. Splitting of nodes proceeds upwards till a node that is not full is found.

    • In the worst case the root node may be split increasing the height of the tree by 1

img

![image-20210220082907159](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210220082907159.png)

![image-20210220082923546](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210220082923546.png)

Delete Operation

The complexity of the delete procedure in the B+ Tree surpasses that of the insert and search functionality.

The following algorithm is applicable while deleting an element from the B+ Tree:

  • Firstly, we need to locate a leaf entry in the Tree that is holding the key and pointer. , delete the leaf entry from the Tree if the Leaf fulfills the exact conditions of record deletion.
  • In case the leaf node only meets the satisfactory factor of being half full, then the operation is completed; otherwise, the Leaf node has minimum entries and cannot be deleted.
  • The other linked nodes on the right and left can vacate any entries then move them to the Leaf. If these criteria is not fulfilled, then they should combine the leaf node and its linked node in the tree hierarchy.
  • Upon merging of leaf node with its neighbors on the right or left, entries of values in the leaf node or linked neighbor pointing to the top-level node are deleted.

img

The example above illustrates the procedure to remove an element from the B+ Tree of a specific order.

  • Firstly, the exact locations of the element to be deleted are identified in the Tree.
  • Here the element to be deleted can only be accurately identified at the leaf level and not at the index placement. Hence, the element can be deleted without affecting the rules of deletion, which is the value of the bare-minimum key.

img

  • In the above example, we have to delete 31 from the Tree.
  • We need to locate the instances of 31 in Index and Leaf.
  • We can see that 31 is available in both Index and Leaf node level. Hence, we delete it from both instances.
  • But, we have to fill the index pointing to 42. We will now look at the right child under 25 and take the minimum value and place it as an index. So, 42 being the only value present, it will become the index.

Delete Operation Algorithm

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
1) Start at the root and go up to leaf node containing the key K
2) Find the node n on the path from the root to the leaf node containing K
A. If n is root, remove K
a. if root has more than one key, done
b. if root has only K
i) if any of its child nodes can lend a node
Borrow key from the child and adjust child links
ii) Otherwise merge the children nodes. It will be a new root
c. If n is an internal node, remove K
i) If n has at least ceil(m/2) keys, done!
ii) If n has less than ceil(m/2) keys,
If a sibling can lend a key,
Borrow key from the sibling and adjust keys in n and the parent node
Adjust child links
Else
Merge n with its sibling
Adjust child links
d. If n is a leaf node, remove K
i) If n has at least ceil(M/2) elements, done!
In case the smallest key is deleted, push up the next key
ii) If n has less than ceil(m/2) elements
If the sibling can lend a key
Borrow key from a sibling and adjust keys in n and its parent node
Else
Merge n and its sibling
Adjust keys in the parent node

Output:

The Key “K” is deleted, and keys are borrowed from siblings for adjusting values in n and its parent nodes if needed.

Summary:

  • B+ Tree is a self-balancing data structure for executing accurate and faster searching, inserting and deleting procedures on data
  • We can easily retrieve complete data or partial data because going through the linked tree structure makes it efficient.
  • The B+ tree structure grows and shrinks with an increase/decrease in the number of stored records.
  • Storage of data on the leaf nodes and subsequent branching of internal nodes evidently shortens the tree height, which reduces the disk input and output operations, ultimately consuming much less space on the storage devices.

Hash Index

What is Hashing in DBMS?

  • In DBMS, hashing is a technique to directly search the location of desired data on the disk without using index structure.

  • Hashing method is used to index and retrieve items in a database as it is faster to search that specific item using the shorter hashed key instead of using its original value.

  • Data is stored in the form of data blocks whose address is generated by applying a hash function in the memory location where these records are stored known as a data block or data bucket.

Why do we need Hashing?

Here, are the situations in the DBMS where you need to apply the Hashing method:

  • For a huge database structure, it’s tough to search all the index values through all its level and then you need to reach the destination data block to get the desired data.
  • Hashing method is used to index and retrieve items in a database as it is faster to search that specific item using the shorter hashed key instead of using its original value.
  • Hashing is an ideal method to calculate the direct location of a data record on the disk without using index structure.
  • It is also a helpful technique for implementing dictionaries

Structure of Static Hashing

  • A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

  • Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.

  • Hash function is used to locate records for access, insertion as well as deletion.

  • Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.

Important Terminologies using in Hashing

Here, are important terminologies which are used in Hashing:

  • Data bucket – Data buckets are memory locations where the records are stored. It is also known as Unit Of Storage.
  • Key: A DBMS key is an attribute or set of an attribute which helps you to identify a row(tuple) in a relation(table). This allows you to find the relationship between two tables.
  • Hash function: A hash function, is a mapping function which maps all the set of search keys to the address where actual records are placed.
  • Linear Probing – Linear probing is a fixed interval between probes. In this method, the next available data block is used to enter the new record, instead of overwriting on the older record.
  • Quadratic probing- It helps you to determine the new bucket address. It helps you to add Interval between probes by adding the consecutive output of quadratic polynomial to starting value given by the original computation.
  • Hash index – It is an address of the data block. A hash function could be a simple mathematical function to even a complex mathematical function.
  • Double Hashing –Double hashing is a computer programming method used in hash tables to resolve the issues of has a collision.
  • Bucket Overflow: The condition of bucket-overflow is called collision. This is a fatal stage for any static has to function.

There are mainly two types of SQL hashing methods:

  1. Static Hashing
  2. Dynamic Hashing

Static Hashing

In the static hashing, the resultant data bucket address will always remain the same.

Therefore, if you generate an address for say Student_ID = 10 using hashing function mod(3), the resultant bucket address will always be 1. So, you will not see any change in the bucket address.

Therefore, in this static hashing method, the number of data buckets in memory always remains constant.

Static Hash Functions

  • Inserting a record: When a new record requires to be inserted into the table, you can generate an address for the new record using its hash key. When the address is generated, the record is automatically stored in that location.
  • Searching: When you need to retrieve the record, the same hash function should be helpful to retrieve the address of the bucket where data should be stored.
  • Delete a record: Using the hash function, you can first fetch the record which is you wants to delete. Then you can remove the records for that address in memory.

Static hashing is further divided into

  1. Open hashing
  2. Close hashing.

Open Hashing

In Open hashing method, Instead of overwriting older one the next available data block is used to enter the new record, This method is also known as linear probing.

For example, A2 is a new record which you wants to insert. The hash function generates address as 222. But it is already occupied by some other value. That’s why the system looks for the next data bucket 501 and assigns A2 to it.

imgHow Open Hash Works

Close Hashing

In the close hashing method, when buckets are full, a new bucket is allocated for the same hash and result are linked after the previous one.

Dynamic Hashing

Dynamic hashing offers a mechanism in which data buckets are added and removed dynamically and on demand. In this hashing, the hash function helps you to create a large number of values.

Comparison of Ordered Indexing and Hashing

Parameters Order Indexing Hashing
Storing of address Addresses in the memory are sorted according to a key value called the primary key Addresses are always generated using a hash function on the key value.
Performance It can decrease when the data increases in the hash file. As it stores the data in a sorted form when there is any (insert/delete/update) operation performed which decreases its performance. Performance of hashing will be best when there is a constant addition and deletion of data. However, when the database is huge, then hash file organization and its maintenance will be costlier.
Use for Preferred for range retrieval of data- which means whenever there is retrieval data for a particular range, this method is an ideal option. This is an ideal method when you want to retrieve a particular record based on the search key. However, it will only perform well when the hash function is on the search key.
Memory management There will be many unused data blocks because of the delete/update operation. These data blocks can’t be released for re-use. That’s why regular maintenance of the memory is required. In static and dynamic hashing methods, memory is always managed. Bucket overflow is also handled perfectly to extend static hashing.

What is Collision (Bucket Overflow)?

Hash collision is a state when the resultant hashes from two or more data in the data set, wrongly map the same place in the hash table.

Bucket overflow can occur because of

  • Insufficient buckets

  • Skew in distribution of records. This can occur due to two reasons:

    • multiple records have same search-key value
    • chosen hash function produces non-uniform distribution of key value

Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.

Hash Indices

  • Hashing can be used not only for file organization, but also for index-structure creation.

  • A hash index organizes the search keys, with their associated record pointers, into a hash file structure.

  • Strictly speaking, hash indices are always secondary indices.

  • We use the term hash index to refer to both secondary index structures and hash organized files.

How to deal with Hashing Collision?

There are two technique which you can use to avoid a hash collision:

  1. Rehashing: This method, invokes a secondary hash function, which is applied continuously until an empty slot is found, where a record should be placed.
  2. Chaining: Chaining method builds a Linked list of items whose key hashes to the same value. This method requires an extra link field to each table position.

Deficiencies of Static Hashing

  • In static hashing, function h maps search-key values to a fixed set of B of bucket addresses. Databases grow or shrink with time.

    • If initial number of buckets is too small, and file grows, performance will degrade due to too many overflows.
    • If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
    • If database shrinks, again space will be wasted.
  • One solution: periodic re-organisation of the file with a new hash function

    • Expensive, disrupts normal operations
  • Better solution: allow the number of buckets to be modified dynamically - Dynamic Hashing!

Dynamic Hashing

  • Good for database that grows and shrinks in size

  • Allows the hash function to be modified dynamically

  • Extendable hashing – one form of dynamic hashing

    • Hash function generates values over a large range — typically b-bit integers, with b = 32.

    • At any time use only a prefix of the hash function to index into a table of bucket addresses.

    • Let the length of the prefix be i bits, 0 ≤ i ≤ 32.

    • Bucket address table size = 2i. Initially i = 0

    • Value of i grows and shrinks as the size of the database grows and shrinks.

    • Multiple entries in the bucket address table may point to the same bucket

    • Thus, actual number of buckets is < 2i

    • The number of buckets also changes dynamically due to coalescing and splitting of buckets.

![image-20210220111058629](/Users/yuxuan/Library/Application Support/typora-user-images/image-20210220111058629.png)

Summary:

  • In DBMS, hashing is a technique to directly search the location of desired data on the disk without using index structure.
  • Hashing method is used to index and retrieve items in a database as it is faster to search that specific item using the shorter hashed key instead of using its original value.
  • Data bucket, Key , Hash function, Linear Probing, Quadratic probing , Hash index, Double Hashing, Bucket Overflow are important terminologies used in hashing
  • Two types of hashing methods are 1) static hashing 2) dynamic hashing
  • In the static hashing, the resultant data bucket address will always remain the same.
  • Dynamic hashing offers a mechanism in which data buckets are added and removed dynamically and on demand.
  • In order Indexing addresses in the memory are sorted according to a critical value while in hashing addresses are always generated using a hash function on the key value.
  • Hash collision is a state when the resultant hashes from two or more data in the data set, wrongly map the same place in the hash table.
  • Rehashing and chaining are two methods which help you to avoid hashing collision.
  • Post title:(CSE201) Advanced Databases
  • Post author:Yuxuan Wu
  • Create time:2021-02-19 06:00:03
  • Post link:yuxuanwu17.github.io2021/02/19/2021-02-19-(CSE201)-Advanced-Databases/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.