Extensible Storage Engine
The Extensible Storage Engine (ESE) of Windows, also known as JET Blue, is an indexed sequential access method (ISAM), i.e. a data storage technology. It is the core database engine used by products such as Exchange Server and Active Directory. It is built on the ACID principles of relational database management systems.
ESE lets applications keep their data in a consistent state through updates and retrievals that take place as transactions.
Transactions in ESE are highly concurrent, which makes ESE suitable for server applications. ESE caches data intelligently to provide high-performance access to it.
A crash recovery mechanism is provided so that data is preserved after a system crash.
Moreover, ESE is lightweight, which makes it suitable for auxiliary applications.
Database
A database is both a physical and a logical grouping of data. An ESE database appears to Windows as a single file. Internally, the database is a collection of 2, 4 or 8 KB pages. These pages contain metadata describing the data held in the database, the data itself, indexes that persist interesting orders of the data, and other information. This information is intermixed within the database file, but efforts are made to keep data that is used together clustered together within the database. An ESE database can contain up to 2^31 pages, or 16 terabytes of data for 8 KB pages.
ESE databases are organized into groups called instances. Most applications use a single instance, but any application can also use multiple instances. The importance of the instance is that it associates a single recovery log series with one or more databases. Currently, up to 6 user databases can be attached to an ESE instance at any time. Each process using ESE can have up to 1024 ESE instances.
A database is portable in that it can be detached from a running ESE instance and later attached to the same or a different instance. While detached, a database can be copied using standard Windows utilities. The database cannot be copied while it is in active use, since ESE opens database files exclusively. A database can physically reside on any device that Windows supports for directly addressable I/O operations.
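The size limit quoted above follows directly from the page-count limit. The short Python sketch below only restates that arithmetic; the function name is illustrative and "TB" is taken here to mean 2^40 bytes.

```python
# Maximum database size implied by the page-count limit quoted in the text.
MAX_PAGES = 2**31  # page-count limit stated above


def max_database_bytes(page_size_kb: int) -> int:
    """Return the largest database size, in bytes, for a page size given in KB."""
    return MAX_PAGES * page_size_kb * 1024


for kb in (2, 4, 8):
    tb = max_database_bytes(kb) / 2**40
    print(f"{kb} KB pages -> {tb:g} TB")
```

Under these assumptions the 4 KB and 8 KB cases give 8 TB and 16 TB respectively.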
Tables
A table is a homogeneous collection of records, where each record has the same set of columns. Each table is identified by a table name, whose scope is local to the database that contains the table. The amount of disk space allocated to a table within a database is determined by a parameter given when the table is created with the CreateTable operation. Tables grow automatically as data (records) is created.
Tables have one or more indexes. There must be at least one clustered index for the record data. When the application defines no clustered index, an artificial index is used that orders and clusters records in chronological order of insertion. Indexes are defined to persist interesting orders of the data, and allow both sequential access to records in index order and direct access to records by index column values. Clustered indexes in ESE must also be primary, meaning that the index key must be unique.
Clustered and non-clustered indexes are both represented using B+ trees. If an insert or update operation causes a page to overflow, the page is split: a new page is allocated and logically chained between the two previously adjacent pages. Because this new page is not physically adjacent to its logical neighbors, access to it is not as efficient. ESE has an on-line compaction feature that re-compacts data. If a table is expected to be updated frequently, space can be reserved for future insertions by specifying an appropriate page density when the table or index is created. This allows page splits to be avoided or postponed.
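The page split described above can be pictured with a small Python sketch. This is a simplified model, not ESE's implementation: it tracks only leaf pages, the `Page` class, `insert` function and `capacity` parameter are hypothetical, and the "file" is a Python list whose positions stand in for physical page numbers.

```python
from bisect import insort


class Page:
    def __init__(self, keys=None, next_page=None):
        self.keys = keys or []      # sorted keys held by this leaf page
        self.next_page = next_page  # physical number of the logically next page


def insert(file, page_no, key, capacity=4):
    """Insert key into leaf page_no of file; split the page on overflow."""
    page = file[page_no]
    insort(page.keys, key)
    if len(page.keys) <= capacity:
        return
    mid = len(page.keys) // 2
    # The new page is allocated at the end of the file, so it is not
    # physically adjacent to its logical neighbours.
    new_page = Page(page.keys[mid:], page.next_page)
    page.keys = page.keys[:mid]
    file.append(new_page)
    page.next_page = len(file) - 1


def logical_order(file, first=0):
    """Yield physical page numbers in logical (key) order."""
    page_no = first
    while page_no is not None:
        yield page_no
        page_no = file[page_no].next_page


# Two adjacent full pages; inserting 5 overflows page 0.
file = [Page([1, 3, 7, 9], next_page=1), Page([20, 30, 40, 50])]
insert(file, 0, 5)
print(list(logical_order(file)))
```

After the split, the logical chain visits physical pages 0, 2, 1, which is the loss of physical adjacency the text refers to. Had page 0 been created with spare room (a lower page density), the insertion would not have triggered a split.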
Records and Columns
A record is an associated set of column values. Records are inserted and updated via Update operations and can be deleted via Delete operations. Columns are set and retrieved via SetColumns and RetrieveColumns operations, respectively. The maximum size of a record is 8110 bytes for 8 KB pages, with the exception of long value columns. Columns of type LongText and LongBinary do not contribute significantly to this size limit, and records can hold data much larger than a database page when that data is stored in long value columns. When a long value reference is stored in a record, only 9 bytes of in-record data are required. These long values can themselves be up to 2 GB in size.
Records are typically uniform in that each record has a set of values for the same set of columns. In ESE, it is also possible to define a large number of columns for a table and yet have any given record contain non-NULL values for only a few of them. In this sense, a table can also be a collection of heterogeneous records.
ESE supports a wide range of column values, from a single bit up to values of 2 GB. Choosing the correct column type is important because the type of a column determines many of its properties, including its ordering for indexes. The column types supported in ESE are described below.
Column Types
Fixed, Variable and Tagged Columns
Each ESE table can define up to 127 fixed length columns, 128 variable length columns and 64,993 tagged columns.
Fixed columns are essentially columns that take up the same amount of space in each record, regardless of their value. Fixed columns use 1 bit to represent a NULL value and a fixed amount of space in each record.
Variable columns are essentially columns that take up a variable amount of space in each record in which they are set, depending on the size of the particular column value. Variable columns use 2 bytes to indicate a NULL value, and a variable amount of space in each record in which the column is set.
Tagged columns are columns that take no space at all if they are not set in a record. They can be single valued or multi-valued: the same tagged column can hold multiple values in a single record. When tagged columns are set in a record, each instance of a tagged column takes approximately 4 bytes of space in addition to the size of the instance value. When the number of instances of a single tagged column is large, the overhead per instance is approximately 2 bytes. Tagged columns are ideal for sparse columns because they take no space if they are not set. If a multi-valued tagged column is indexed, the index will contain one entry for the record for each value of the tagged column.
Long Values
Columns of type Long Text and Long Binary are large binary objects. They are stored in a B+ tree separate from the clustered index, keyed by long value id and byte offset. ESE supports append, byte range overwrite, and set size for these columns. ESE also has a single instance store feature where multiple records may reference the same large binary object as though each record had its own copy of the information, i.e. without inter-record locking conflicts. The maximum size of a Long Text or Long Binary column value is 2 GB.
Version, Auto-increment and Escrow Columns
Version columns are automatically incremented by ESE each time a record containing this column is modified via an Update operation. This column cannot be set by the application; it can only be read. One use of version columns is to determine whether an in-memory copy of a given record needs to be refreshed. If the value in a table record is greater than the value in a cached copy, then the cached copy is known to be out of date. Version columns must be of type Long.
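The cache check described above amounts to a single comparison. The following Python sketch models it with a hypothetical `Record` class whose `version` attribute stands in for a version column; it is an illustration of the idea, not ESE code.

```python
def is_stale(cached_version: int, table_version: int) -> bool:
    """A cached copy is out of date when the table's version value is greater."""
    return table_version > cached_version


class Record:
    def __init__(self, data):
        self.data = data
        self.version = 0  # stands in for a version column

    def update(self, data):
        self.data = data
        self.version += 1  # bumped by the engine on every update, never by the application


record = Record({"name": "a"})
cached = (record.version, dict(record.data))  # in-memory copy plus the version it was read at
record.update({"name": "b"})
print(is_stale(cached[0], record.version))
```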
Auto-increment columns are automatically set by ESE such that the value contained in the column is unique for every record in the table. These columns, like version columns, cannot be set by the application. Auto-increment columns are read only, and are automatically set when a new record is inserted into a table via an Update operation. The value in the column remains constant for the life of the record, and only one auto-increment column is allowed per table. Auto-increment columns may be of type Long or type Currency.
Escrow columns can be modified via an EscrowUpdate operation. Escrowed updates are numeric delta operations. Escrow columns must be of type Long. Examples of numeric delta operations include adding 2 to a value or subtracting 1 from a value. ESE tracks the change in a value rather than the end value of an update. Multiple sessions may each have outstanding changes made via EscrowUpdate to the same value, because ESE can determine the actual end value regardless of which transactions commit and which transactions roll back. This allows multiple users to update a column concurrently by making numeric delta changes.
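A minimal Python sketch of delta tracking may make the escrow idea more concrete. The `EscrowColumn` class and its method names are hypothetical; the sketch only shows why the final value does not depend on the order in which sessions commit or roll back.

```python
class EscrowColumn:
    def __init__(self, value=0):
        self.value = value   # last committed value
        self.pending = {}    # session id -> sum of uncommitted deltas

    def escrow_update(self, session, delta):
        """Record a relative change; it does not conflict with other sessions."""
        self.pending[session] = self.pending.get(session, 0) + delta

    def commit(self, session):
        self.value += self.pending.pop(session, 0)

    def rollback(self, session):
        self.pending.pop(session, None)


counter = EscrowColumn(10)
counter.escrow_update("s1", +2)
counter.escrow_update("s2", -1)
counter.escrow_update("s3", +5)
counter.commit("s1")
counter.rollback("s2")   # the -1 is simply dropped
counter.commit("s3")
print(counter.value)
```

Because addition is commutative, committing "s3" before "s1" would leave the same final value.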
Indexes
An index is a persisted ordering of the records in a table. Indexes are used both for sequential access to rows in the defined order and for direct navigation to records based on indexed column values. The order defined by an index is described in terms of an array of columns, in precedence order. This array of columns is also called the index key. Each column is called an index segment. Each index segment can be either ascending or descending in terms of its ordering contribution. Any number of indexes can be defined for a table. ESE provides a rich set of indexing features.
Clustered Index
One index may be specified as the clustered, or primary, index. In ESE, the clustered index must be unique and is referred to as the primary index. Other indexes are described as non-clustered, or secondary, indexes. Primary indexes differ from secondary indexes in that the index entry is the record itself, and not a logical pointer to the record. Secondary indexes have primary keys at their leaves to logically link to the record in the primary index. In other words, the table is physically clustered in primary index order. Retrieval of non-indexed record data in primary index order is generally much faster than in secondary index order. This is because a single disk access can bring into memory multiple records that will be accessed close together in time. The same disk access satisfies multiple record access operations. However, the insertion of a record into the middle of an index, as determined by the primary index order, may be very much slower than appending it to the end of an index. Update frequency must be carefully weighed against retrieval patterns when designing tables. If no primary index is defined for a table, then an implicit primary index, called a database key (DBK) index, is created.
The DBK is simply a unique ascending number incremented each time a record is inserted. As a result, the physical order of records in a DBK index is chronological insertion order, and new records are always added at the end of the table. If an application wishes to cluster data on a non-unique index, this is possible by adding an auto-increment column to the end of the non-unique index definition.
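The technique of clustering on a non-unique key by appending an auto-increment column can be sketched in a few lines of Python. The table, column names and `insert` helper are invented for illustration; a real engine would keep the rows in a B+ tree rather than re-sorting a list.

```python
from itertools import count

auto_increment = count(1)   # stands in for an auto-increment column
table = []                  # records kept in primary key order


def insert(city, name):
    """Insert a record clustered on (city, auto-increment value)."""
    key = (city, next(auto_increment))  # unique even when city repeats
    table.append((key, name))
    table.sort(key=lambda row: row[0])  # keep the table in primary key order
    return key


insert("Paris", "Ann")
insert("Lyon", "Bob")
insert("Paris", "Cid")
print(table)
```

Records sharing the same city end up next to each other, while the trailing counter keeps every primary key unique.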
Indexing Over Multi-valued Columns
Indexes can be defined over multi-valued columns. Multiple entries may exist in these indexes for records with multiple values for the indexed column. Multi-valued columns may be indexed in conjunction with single-valued columns. When two or more multi-valued columns are indexed together, the multi-valued property is only honored for the first multi-valued column in the index. Lower precedence columns are treated as though they were single valued.
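The rule about the first multi-valued column can be illustrated with a Python sketch. The record layout and function are hypothetical, and one detail is an assumption not stated in the text: a later multi-valued column is represented here by its first value.

```python
def index_entries(record_id, record, key_columns):
    """Build index entries for one record.

    Only the first multi-valued key column is expanded; later ones are
    treated as single valued (assumption: their first value is used).
    """
    expanded = None
    for column in key_columns:
        if isinstance(record[column], list):
            expanded = column
            break
    if expanded is None:
        return [(tuple(record[c] for c in key_columns), record_id)]
    entries = []
    for value in record[expanded]:
        key = []
        for column in key_columns:
            cell = record[column]
            if column == expanded:
                key.append(value)
            elif isinstance(cell, list):
                key.append(cell[0])
            else:
                key.append(cell)
        entries.append((tuple(key), record_id))
    return entries


record = {"tags": ["red", "blue"], "sizes": ["S", "M"], "owner": "ann"}
print(index_entries(7, record, ["tags", "sizes", "owner"]))
```

The record yields two entries, one per value of `tags`, rather than four.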
Sparse Index
Indexes can also be defined to be sparse. Sparse indexes do not necessarily have an entry for each record in the table. There are a number of options in defining a sparse index. Options exist to exclude records from an index when the entire index key is NULL, when any key segment is NULL, or when just the first key segment is NULL. Indexes can also have conditional columns. These columns never appear within an index but can cause a record not to be indexed when the conditional column is either NULL or non-NULL.
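The three exclusion options listed above can be expressed as a small predicate. In this Python sketch the parameter names are illustrative and `None` stands in for NULL; it shows the decision only, not how ESE stores the options.

```python
def is_indexed(key, ignore_null=False, ignore_any_null=False, ignore_first_null=False):
    """Decide whether a record's key gets an entry in a sparse index."""
    if ignore_null and all(segment is None for segment in key):
        return False  # entire key is NULL
    if ignore_any_null and any(segment is None for segment in key):
        return False  # at least one segment is NULL
    if ignore_first_null and key[0] is None:
        return False  # first segment is NULL
    return True


print(is_indexed((None, None), ignore_null=True))
print(is_indexed((None, 1), ignore_null=True))
print(is_indexed((1, None), ignore_any_null=True))
```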
Tuple Index
Indexes can also be defined to include one entry for each sub-string of a Text or Long Text column. These indexes are called tuple indexes. They are used to speed up queries with sub-string matching predicates. Tuple indexes can only be defined for Text columns. For example, if a Text column value is “I love JET Blue”, and the index is configured to have a minimum tuple size of 4 characters and a maximum tuple length of 10 characters, then the following sub-strings will be indexed:
“I love JET”
“ love JET ”
“love JET B”
“ove JET Bl”
“ve JET Blu”
“e JET Blue”
“ JET Blue”
“JET Blue”
“ET Blue”
“T Blue”
“ Blue”
“Blue”
As you can see, tuple indexes tend to be very large. However, these indexes can dramatically speed up queries of the form: find all records containing “JET Blue”. They can be used for sub-strings longer than the maximum tuple length by dividing the search sub-string into search strings of maximum tuple length and intersecting the results. They can be used for exact matches for strings as long as the maximum tuple length or as short as the minimum tuple length, with no index intersection. For more information on performing index intersection in ESE, see Index Intersection. Tuple indexes cannot speed up queries where the search string is shorter than the minimum tuple length.
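The sub-string list in the example above can be reproduced with a short Python function. It assumes one entry per starting offset, truncated to the maximum tuple length and dropped when shorter than the minimum; the function name and defaults are illustrative.

```python
def tuple_substrings(text, min_len=4, max_len=10):
    """List the sub-strings a tuple index would hold for one column value."""
    entries = []
    for start in range(len(text)):
        piece = text[start:start + max_len]
        if len(piece) >= min_len:
            entries.append(piece)
    return entries


for entry in tuple_substrings("I love JET Blue"):
    print(repr(entry))
```

For the 15-character value in the example this produces 12 entries, which gives a sense of how quickly such an index grows.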
Transactions
A transaction is a logical unit of processing delimited by BeginTransaction and CommitTransaction, or Rollback, operations. Transactions may be nested up to 7 levels, with one additional level reserved for ESE internal use. This means that part of a transaction may be rolled back without the need to roll back the entire transaction; a CommitTransaction of a nested transaction merely signifies the success of one phase of processing, and the outer transaction may yet fail. Changes are committed to the database only when the outermost transaction is committed. This is known as committing to transaction level 0. When the transaction commits to transaction level 0, data describing the transaction is synchronously flushed to the log to ensure that the transaction will be completed even in the event of a subsequent system crash. Synchronously flushing the log makes ESE transactions durable. However, in some cases applications wish to order their updates but not immediately guarantee that the changes will be durable. In that case, applications can commit changes with JET_bitCommitLazyFlush.
ESE supports a concurrency control mechanism called multi-versioning. In multi-versioning, every transaction queries a consistent view of the entire database as it was at the time the transaction started. The only updates it encounters are those made by itself. In this way, each transaction operates as though it were the only active transaction running on the system, except in the case of write conflicts. Since a transaction may make changes based on data it has read that has already been updated in another transaction, multi-versioning by itself does not guarantee serializable transactions. However, serializability can be achieved when desired by using explicit record read locks to lock the read data that updates are based upon.
Both read and write locks may be explicitly requested with the GetLock operation.
In addition, ESE supports an advanced concurrency control feature known as escrow locking. Escrow locking is an extremely concurrent form of update in which a numeric value is changed in a relative fashion, i.e. by adding or subtracting another numeric value. Escrow updates are non-conflicting even with other concurrent escrow updates to the same datum. This is possible because the supported operations are commutative and can be independently committed or rolled back. As a result, they do not interfere with concurrent update transactions. This feature is often used for maintained aggregations.
ESE also extends transaction semantics from data manipulation operations to data definition operations. It is possible to add an index to a table and have concurrently running transactions update the same table without any transaction lock contention whatsoever. Later, when these transactions are complete, the newly created index is available to all transactions and has entries for record updates made by other transactions that could not perceive the presence of the index when the updates took place. Data definition operations may be performed with all the features expected of the transaction mechanism for record updates. Data definition operations supported in this fashion include AddColumn, DeleteColumn, CreateIndex, DeleteIndex, CreateTable and DeleteTable.
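The nesting rules described at the start of this section (partial rollback, inner commits that are not yet permanent, and commit to level 0) can be modelled with a small Python sketch. The `Session` class is hypothetical and keeps a full working copy per level for clarity; it says nothing about how ESE implements versioning or logging.

```python
class Session:
    MAX_LEVEL = 7  # nesting limit quoted in the text

    def __init__(self):
        self.committed = {}   # state visible after a level-0 commit
        self.levels = []      # one working copy per open transaction level

    def begin(self):
        if len(self.levels) >= self.MAX_LEVEL:
            raise RuntimeError("transactions nested too deeply")
        base = self.levels[-1] if self.levels else self.committed
        self.levels.append(dict(base))

    def set(self, key, value):
        self.levels[-1][key] = value

    def commit(self):
        changes = self.levels.pop()
        if self.levels:
            self.levels[-1] = changes    # success of one phase only
        else:
            self.committed = changes     # commit to level 0: now permanent

    def rollback(self):
        self.levels.pop()                # discard this level's changes only


s = Session()
s.begin()
s.set("a", 1)
s.begin()
s.set("b", 2)
s.rollback()         # undoes only the inner change
s.begin()
s.set("c", 3)
s.commit()           # inner commit: nothing is permanent yet
print(s.committed)
s.commit()           # outermost commit, i.e. commit to level 0
print(s.committed)
```

The first print shows an empty committed state even though an inner transaction has committed; only the outermost commit publishes the changes.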
Cursor Navigation and the Copy Buffer
A cursor is a logical pointer within a table index. The cursor may be positioned on a record, before the first record, after the last record, or even between records. If a cursor is positioned before or after a record, there is no current record. It is possible to have multiple cursors on the same table index. Many record and column operations are based on the cursor position. The cursor position can be moved sequentially by Move operations or directly using index keys with Seek operations. Cursors can also be moved to a fractional position within an index. In this way, the cursor can be moved quickly to a scroll bar thumb position. This operation is performed at the same speed as a Seek operation. No intervening data needs to be accessed.
Query Processing
ESE applications invariably query their data.
Sorts and Temporary Tables
ESE provides a sort capability in the form of temporary tables. The application inserts data records into the sort process one record at a time, and then retrieves them one record at a time in sorted order. Sorting is actually performed between the last record insertion and the first record retrieval. Temporary tables can be used for partial and complete result sets as well. These tables can offer the same features as base tables, including the ability to navigate sequentially or directly to rows using index keys matching the sort definition. Temporary tables can also be updatable for computation of complex aggregates. Simple aggregates can be computed automatically with a feature similar to sorting, where the desired aggregate is a natural result of the sort process.
Covering Index
Retrieving column data directly from secondary indexes is an important performance optimization. Columns may be retrieved directly from secondary indexes, without accessing the data records, via the RetrieveFromIndex flag on the RetrieveColumns operation. When navigating by the index, it is much more efficient to retrieve columns from a secondary index than from the record. If the column data were retrieved from the record, an additional navigation would be necessary to locate the record by the primary key. This may result in additional disk accesses. When an index provides all the columns needed, it is called a covering index. Note that columns defined in the table's primary index are also found in secondary indexes and can be similarly retrieved using JET_bitRetrieveFromPrimaryBookmark.
Index keys are stored in normalized form, which can in many cases be denormalized to the original column value. Normalization is not always reversible. For example, Text and Long Text column types cannot be denormalized. In addition, index keys may be truncated when column data is very long. In cases where columns cannot be retrieved directly from secondary indexes, the record can always be accessed to retrieve the necessary data.
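The saving a covering index brings can be shown by counting record accesses. The Python sketch below uses an invented two-row table with a secondary index on numeric columns (chosen because, as noted above, text keys cannot always be denormalized); all names are hypothetical.

```python
# Hypothetical table: primary key -> record.
primary = {
    1: {"dept": 20, "grade": 3, "name": "Ann"},
    2: {"dept": 10, "grade": 5, "name": "Bob"},
}
INDEX_COLUMNS = ("dept", "grade")
# Secondary index entries: (index key, primary key), kept in key order.
secondary = sorted(
    (tuple(row[c] for c in INDEX_COLUMNS), pk) for pk, row in primary.items()
)


def retrieve(columns):
    """Return rows in secondary-index order and the number of record reads."""
    covered = set(columns) <= set(INDEX_COLUMNS)
    rows, record_reads = [], 0
    for key, pk in secondary:
        if covered:
            source = dict(zip(INDEX_COLUMNS, key))   # answered from the index entry
        else:
            source = primary[pk]                     # extra navigation by primary key
            record_reads += 1
        rows.append({c: source[c] for c in columns})
    return rows, record_reads


print(retrieve(["dept", "grade"]))
print(retrieve(["dept", "name"]))
```

The first query is covered by the index and needs no record reads; the second needs one record read per row because `name` is not part of the index key.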
Index Intersection
Queries often involve a combination of restrictions on data. An efficient means of processing a restriction is to use an available index. However, if a query involves multiple restrictions, applications often process them by walking the full index range of the most restrictive predicate that can be satisfied by a single index. Any remaining predicate, called the residual predicate, is processed by applying the predicate to the record itself. This is a simple method but has the disadvantage of potentially having to perform many disk accesses to bring records into memory in order to apply the residual predicate.
Index intersection is an important query mechanism in which multiple indexes are used together to process a complex restriction more efficiently. Instead of using only a single index, index ranges on multiple indexes are combined to produce a much smaller number of records to which any residual predicate can be applied. ESE makes this easy by supplying an IntersectIndexes operation. This operation accepts a series of index ranges on indexes from the same table and returns a temporary table of primary keys that can be used to navigate to the base table records that satisfy all index predicates.
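The idea behind index intersection can be sketched in Python as a set intersection over the primary keys found in each index range. The table, helper names and sorted-list indexes below are illustrative assumptions and do not reflect the signature of the ESE IntersectIndexes operation.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table: primary key -> record.
records = {1: {"age": 25, "score": 90}, 2: {"age": 31, "score": 72},
           3: {"age": 28, "score": 95}, 4: {"age": 45, "score": 91}}


def build_index(column):
    """Secondary index: sorted (value, primary key) pairs."""
    return sorted((row[column], pk) for pk, row in records.items())


def index_range(index, low, high):
    """Primary keys of entries with low <= value <= high."""
    start = bisect_left(index, (low,))
    stop = bisect_right(index, (high, float("inf")))
    return {pk for _, pk in index[start:stop]}


def intersect_indexes(*ranges):
    """Primary keys present in every index range."""
    return sorted(set.intersection(*ranges))


by_age = build_index("age")
by_score = build_index("score")
keys = intersect_indexes(index_range(by_age, 20, 30),
                         index_range(by_score, 90, 100))
print(keys)
```

Only the records whose keys survive the intersection need to be read from the base table, so any residual predicate is applied to far fewer records than a single index range would return.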
Pre-joined Tables
A join is a common operation on a normalized table design, where logically related data is brought back together for use in an application. Joins can be expensive operations because many data accesses may be needed to bring related data into memory. This effort can be optimized in some cases by defining a single base table that contains data for two or more logical tables. The column set of the base table is the union of the column sets of these logical tables. Tagged columns make this possible because of their good handling of both multi-valued and sparse valued data. Since related data is stored together in the same record, it is accessed together, thereby minimizing the number of disk accesses needed to perform the join. This process can be extended to a large number of logical tables, as ESE can support up to 64,993 tagged columns. Since indexes can be defined over multi-valued columns, it is still possible to index ‘interior’ tables. However, some limitations exist and applications should consider pre-joining carefully before employing this technique.
Logging and Crash Recovery
The logging and recovery features of ESE guarantee data integrity and consistency in the event of a system crash. Logging is the process of redundantly recording database update operations in a log file. The log file structure is very robust against system crashes. Recovery is the process of using this log to restore databases to a consistent state after a system crash.
Backup and Restore
Logging and recovery also play a role in protecting data from media failure. ESE supports on-line backup, where one or more databases are copied along with log files in a manner that does not affect database operations. Databases can continue to be queried and updated while the backup is being made. The backup is referred to as a ‘fuzzy backup’ because the recovery process must be run as part of backup restoration to restore a consistent set of databases. Both streaming and shadow copy backups are supported.
Streaming backup is a backup method where copies of all desired database files and the necessary log files are made during the backup process. File copies may be saved directly to tape or can be made to any other storage device. No quiescing of activity of any kind is required with streaming backups. Both the database and log files are checksummed to ensure that no data corruption exists within the data set during the backup process. Streaming backups may also be incremental backups. Incremental backups are ones in which only the log files are copied and which can be restored along with a previous full backup to bring all databases to a recent state.
Shadow copy backups are a new high-speed backup method. Shadow copy backups are dramatically faster because the copy is made virtually, after a brief period of quiescing an application. As subsequent updates are made to the data, the virtual copy is materialized. In some cases, hardware support for shadow copy backups means that actually saving the virtual copies is unnecessary. Shadow copy backups are always full backups.
Restore can be used to apply a single backup, or it can be used to apply a combination of a single full backup with one or more incremental backups. Further, any existing log files can be replayed as well, to recreate an entire data set all the way up to the last transaction logged as committed to transaction level 0. A backup can be restored to any system capable of supporting the original application. It need not be the same machine, or even the same machine configuration. The location of files can be changed as part of the restoration process.
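Restoring a full backup and then replaying logs up to the last committed transaction can be modelled in a few lines of Python. The log record format (`"set"` and `"commit"` tuples) and the `restore` function are hypothetical simplifications, not the ESE log format.

```python
def restore(full_backup, log):
    """Rebuild a database from a full backup plus replayed log records.

    Only transactions with a commit record in the log are applied.
    """
    committed = {txn for op, txn, *_ in log if op == "commit"}
    database = dict(full_backup)
    for op, txn, *args in log:
        if op == "set" and txn in committed:
            key, value = args
            database[key] = value
    return database


backup = {"a": 1}
log = [("set", 1, "a", 2), ("commit", 1),
       ("set", 2, "b", 9),                 # transaction 2 never committed
       ("set", 3, "c", 3), ("commit", 3)]
print(restore(backup, log))
```

The update from transaction 2 is left out of the restored state because the log holds no commit record for it.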
History
JET Blue was originally developed by Microsoft as a prospective upgrade for the Microsoft Jet Database Engine (JET Red), the database engine of Microsoft Access, but was never used in that role. Instead, it went on to be used by Exchange Server, Active Directory, and a number of other Microsoft services and applications. For years it was a private API used only by Microsoft, but it later became a published API that anyone can use. JET Blue first shipped in 1994 as an ISAM for WINS, DHCP, and the now defunct RPL services in Windows NT 3.5. It shipped again as the storage engine for Microsoft Exchange in 1996. Additional Windows services chose JET Blue as their storage technology, and by 2000 every version of Windows began to ship with JET Blue. JET Blue was used by Active Directory and became part of a special set of Windows code called the Trusted Computing Base (TCB). The number of Microsoft applications using JET Blue continues to grow, and the JET Blue API was published in 2005 to facilitate use by an ever increasing number of applications and services both within and beyond Windows.
Developers who have contributed towards the success of JET Blue include Cheen Liao, Stephen Hecht, Matthew Bellew, Ian Jose, Balasubramanian Sriram, Jonathan Liem, Andrew Goodsell, Laurion Burchall, Andrei Marinescu, and Brett Shirley.
Future
Windows Mail and Desktop Search in Windows Vista will also use ESE to store indexes and property information respectively. In Exchange, the ESE API has only shipped as 32-bit, because this is the only supported platform for the Exchange product, even though Windows XP x64 Edition had a native x64 ESE a few years ago. With E12 the ESE engine will ship natively as 64-bit. This is in stark contrast to the JET Red database engine, which will likely never be ported to 64-bit.
Comparison to JET Red
While they share a common lineage, there are vast differences between JET Red and ESE.
- JET Red is a file sharing technology, while ESE is designed to be embedded in a server application and does not share files.
- JET Red makes a best-effort attempt at file recovery, while ESE has write-ahead logging and snapshot isolation for guaranteed crash recovery.
- JET Red before version 4.0 supports only page level locking, while ESE and JET Red version 4.0 support record level locking.
- JET Red supports a wide variety of query interfaces, including ODBC and OLE DB. ESE does not ship with a query engine but instead relies on applications to write their own queries as C ISAM code.
- JET Red has a maximum database file size of 2 GB, while ESE has a maximum database file size of 8 TB with 4 KB pages, and 16 TB with 8 KB pages.