Wednesday, January 30, 2013

Data layer requirements for applications with cloud hosting and NoSQL databases - Cloudant

I'm reading: Data layer requirements for applications with cloud hosting and NoSQL databases - CloudantTweet this !
MVC is one of the most famous and widely adopted application development design patterns. Typically data layer of applications starts off with features that requests actions related to data manipulations. 
  • If an application is public facing where users can sign-up on the app and consume services, there are bright changes that with the success of the application, user data would grow linearly. 
  • On the top of it if the application is related to social media / networking / large scale e-commerce, then the app is a future candidate to mature in the big data processing requirements. This puts an obligation on the data layer to be more mature, if you want to keep your operational costs (incurred by dedicated DBAs, infrastructure procurement and maintenance, etc) under check. 
  • As data grows big, and tends towards being unstructured from structured, database choices tends to be towards specialized NoSQL databases from generalized relational RDBMS.
  • To manage data load, data management methods like replication, sharding, master-slave configuration, partitioning etc are employed. As most of these techniques require flexibly scalable infrastructure, data is hosted mostly on cloud and this means that data access layer should be mature enough to work with cloud and almost manage the infrastructure administration using the cloud provider api. 
  • Different cloud providers have their own pros and cons. An application downtime is directly proportional to equivalent loss is business in today's world. So you might want your data layer to be flexible enough, such that it can access data from some of the leading cloud hosting environments.
These specifications are too ambitious for an application to develop just for it's own use.  But at the same time to keep frequent revamping of data layer at bay or to avoid create duplicate data layers for different optimal data access and management methods, one needs a data layer that can manage the above mentioned challenges.

One interesting company provider a data layer solution that can meet the above specifications with even more features - Cloudant. It provides a data layer to 
  • Manage data in CouchDB NoSQL database, 
  • Hosted on choice of cloud providers like AWS, Azure, RackSpace etc
  • RESTful CouchDB compliant APIs
  • JSON based data exchange format
  • Lucene based text search and much more.
NoSQL databases, Cloud providers of the scale of Azure, AWS etc, MPP systems like Hadoop etc are quite famous. But the data layer that manages data access along with data management on cloud can be a very interesting option, than starting your app from scratch with a fat sized cloud account. Checkout the technical whitepaper on Cloudant for a detailed understanding of this service and this article.

Sunday, January 27, 2013

Database Sharding, SQL Azure Federation, SQL Server 2012 and Sequence

I'm reading: Database Sharding, SQL Azure Federation, SQL Server 2012 and SequenceTweet this !
Almost everyone who is a SQL Server professional directly or indirectly knows the newly introduced Sequence object in SQL Server 2012. This post is not about explaining how to use it. Just google what most famous blogger have to say about Sequence object, and you would find that the only content in the post is how to use it. I was hoping that someone would see it from the angle that I see it, but I didn't find any. So I thought of sharing my viewpoint on Sequence object.

One can manage generating incremental keys/ids using the IDENTITY property. Is Sequence just another means to flexibly manage automated incremental ID generation ? I would say it's one of the means, but I don't see it as the ultimate purpose of it.

Applications enabling answer to complex queries often needs data of a huge size. How easily would you get to read data, depends on strategically how-and-where you write data. Network bandwidth, computing capacity and memory availability are some of the driving factors in read-write operations. When apps face performance pressure, the first step that is typically followed is scaling up. After scale up reaches a limit, scale out is the next measure. Partitioning, Replication, Federation are different approaches to scale out databases. If data is distributed and stored in a way that the query is likely to fetch all the data from a single server / partition, this strategy might work till each database size remains in gigabyte sized limits. If each partition / database is multi terabyte sized, even if entire data to be returned to database is figured out, it would face a bottleneck due to network bandwidth available to a single server due to the amount of data to be returned.

Database sharding is partitioning data horizontally across many servers on commodity hardware. Some areas where sharding is used include:

  • NoSQL databases running on Hadoop and MapReduce based infrastructure with database sharding have been able to solve database scalability related challenges on some of the worlds biggest and most aggressively growing datasets. 
  • SQL Azure Federation on Windows Azure platform is one of the awaited features in SQL Azure, that would allow multi tenant databases with sharding. 
  • Cloud databases often have to be multi-tenant / federated, due to the limits on database sizes as a result of cloud based storage architecture.
The interesting part is, database sharding on relational databases is not available out-of-box, at least in SQL Server. Consider a scenario where the need is to insert a couple of records in order but in different databases. One would need a key/id generation mechanism that is independent of tables, unlike IDENTITY property.

Sequence is independent of tables, which makes it much more scalable. From the feature set it seems that Sequence can help to generate a ordered set of keys for storing data across shards in a sorted format. This can be confirmed only with a PoC, but for now I am hopeful that when SQL Azure Federation  would go RTM, Sequence would definitely make a place in the T-SQL features for the same.

If you feel that database sharding for relational OLTP databases is not a popular practice, then you might want to check out this company that just runs its shop on database sharding - An interesting whitepaper on database sharding can be read from here.

Thursday, January 24, 2013

70-463 Exam Guide - Part 2

I'm reading: 70-463 Exam Guide - Part 2Tweet this !

How does dimensional modeling of a data warehouse enable / empower analytics ?

1) Data analysis often requires pivoting huge amounts of data. Attributes of a dimension (i.e. columns in a dimension table) ideally contains non-continuous data. Such data is often suitable to slice / pivot data and can be used as a scale axis to visualize categorized data on a graph. Even if the data in attributes is continuous (i.e. data contains lots of distinct values), OLAP technologies like SSAS provides was to categorize continuous data into discrete categories.

2) For example, if we try to plot data on a graph use values in Age attribute for the population in a city, on X or Y axis, then there would be inevitably 100 bars on a graph which is almost impossible to analyze. A better approach would be to divide age attribute in few distinct age groups and use those groups on the graph. Each group would represent a bar, and upon drilling down the bar, actual data can be brought on the graph.

Why do dimension tables often include columns that cannot be used for pivoting / filtering / analytics ?

1) Often dimensions like Geography, Employee, Account, Products etc have attributes describing the dimension member. For example Employee dimension can have attributes like Gender, Date of Birth, Marital Status, Number of children etc.. Such attributes are neither good candidates for pivoting nor for any kind of roll-ups or aggregation.

2) However broad or narrow be the analytical spectrum, inevitably almost any such system would have reporting on the top of it. If the dimensions have only pivotable or aggregatable attributes, then such data would be sufficient only to plot on a graph but not on a detailed report. Such data can facilitate drill-down, but not drill-through of the problem. For drill-through, one would require different attributes based on which a report can be sketched out.

3) For an Employee report, one might start to analyze a report based on salary attribute as it is pivotable. For example, list all employees whose gross income is less than 30k USD. But post that one would want to check which of those are women having more than two kids and are aged over forty, to balance work load on them. For such a report, if dimensions do not contain such attributes, then a cross-database query joining data warehouse and transactional OLTP database would be required. This results in poor performance and other reporting issues. Hence such attributes are included in a dimension.

Such attributes in a dimension are called member properties.

What are hierarchies and how they help in analytics ?

1) Attributes within a dimension as well as across dimensions can be related and well as unrelated. For example, date of birth and gender are unrelated attributes, but country and city are related attributes. Related attributes are often pivotable and form a hierarchy called natural hierarchies. Unrelated attributes have lesser probability of forming a hierarchy, and if they do, then it's called unnatural hierarchies.

2) Hierarchies provide a drill-down path for data analysis are the data can be pivoted. Aggregatable attributes can only be rolled-up but cannot be drilled down to another level based on another related / unrelated attribute. For example salary can be rolled up from daily to the highest possible granularity. But to drill down salary from Year-Month-Week, a hierarchy made up of Year, Month and Date attributes is required which would slice rolled-up values of Salary attribute. Hierarchies formed of two or more related / unrelated attributes can be drilled-down which are very useful for analytics.

How to accommodate changing attribute values of a dimension in a dimensional model ?

1) Data archival policy is often overlooked or considered a trivial aspect in database modeling. But in data warehouse it has a very deep impact on the modeling itself. Facts should ideally contain transactional data, so there is no question of data update. Dimension attributes often have changing values, which might be worth keeping a track. For example, an organization might not be interested in keep a track of every time when an employee had children but only total number of children as of the latest status. But say, if the employee is a sales person, an organization would want to keep track of the field area where the sales executive was posted at any point in time. Without tracking this history, reports would always show that sales executive has operated in one area which might be his/her present area of operation, as the attribute value is always updated / overwritten. Such changing values of attributes in a dimension are addressed by a design aspect known as Slowly Changing Dimension (SCD).

2) SCD are of three types, based on how attribute history has to be preserved.

SCD Type 1: Attributes for which values are updated / overwritten, and history is not maintained is know as SCD Type 1.

SCD Type 2: To maintain history, at a minimum three new columns are added to the dimension - ValidFrom, ValidTill, and IsCurrent. ValidFrom and ValidTill defines the time scope during which the value was valid, and IsCurrent helps queries to easily identify the latest value of the attribute(s). When a value changes for an attribute, the ValidTill value is updated with the current date, and IsCurrent is marked as "N". A new record with the latest value of the attribute is added to the dimension, having ValueFrom as the current date, ValidTill as NULL and IsCurrent as "Y".

SCD Type 3: If one decided to maintain history of attribute values, and if the attribute values are frequently changing, then the dimension can have rapidly changing values (often known as rapidly changing dimension). Due to the same, dimension would have huge amount of dimension members effectively translating into performance issues in dimension processing as well as querying. Type 3 is a method where one can limit the amount of history to be preserved. Only the latest value and the one before that are preserved. Due to this design, the tracking and scoping fields required in Type 2 would not be required in Type 3.

How are dimensions (especially slowly changing dimensions) associated with facts in a dimensional model ?

1) Any dimension would have a business key, which is one of the candidates for primary key, for example ProductID in Product dimension. If a dimension does not have any SCD attributes, then Business key can be used as the primary key for the dimension. But in other case, whenever a new record is inserted for a SCD attribute, the business key would get duplicated. So to manage this, SCD dimensions have another field that acts as a placeholder of primary key - popularly known as Surrogate Key. This field would typically have auto-increment identity values in most of the cases.

2) Primary keys in a dimension act as a foreign keys in fact tables, which enables pivoting data based on dimension attributes for analytics. Business keys in a dimension helps in maintaining lineage with OLTP data sources, from where data is collected and warehoused typically using ETL methods. One can't correctly slice data in a fact with a dimension whose key is not present directly / indrectly in the fact table. A combination of all the foreign keys in a fact table can be used as a composite primary key in many cases.

3) If a fact table is not associated with a dimension table, and still one wants to associate a dimension with a fact without adding primary key from dimension into the fact table, one can create bridge fact tables. These bridge tables contain only keys from dimension and fact, which creates an indirect association between dimension and fact tables.

What are measures and where are they stored ?

1) Attributes of a fact table are known as measures. Fact table should contain calculable data, which is measure of different attributes of an entity. For example, for a Sales fact table would have order value, number of units in order, tax on order value etc, profit margin percentage etc.

2) Additivity of different measures can be different - some can be fully additive like order value, some can be partially additive like tax, and some can be non-additive like profit margin percentage. Based on the nature of addivity, different aggregation functions can be applied on the attributes for roll-ups.

Wednesday, January 09, 2013

70-463 Certification Exam Guide / Notes - Part I

I'm reading: 70-463 Certification Exam Guide / Notes - Part ITweet this !
This blog has been silent almost across the previous year with very little activity. I got married, got 7 projects at a time on my back, worked as a Senior Architect, reviewed a book and lots more. I struggled striking a balance between personal life and professional life, and now I am trying to come back to business.

As a part of my role, I am pursuing and mentoring a group to pass the exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012. Through the series of posts I would share my notes for this exam preparation. This is Part I of the notes.

Why do we need a Data Warehouse ?

1) Data stored in normalized schema in OLTP systems can have hundreds to thousands of tables for an enterprise. OLTP systems have a probability, that a portion of these tables can be often less descriptive due to lack of self relevant naming conventions. This makes designing queries harder for reporting purposes.

2) As normalised schema has a lot of tables, a single piece of related information is split and stored into various tables with referential integrity constraints. This means that reading this data requires creating joins with many tables. For reporting and analysis purposes over a very huge dataset spanning thousands to millions of records of historical data, such queries would perform very poorly.

3) OLTP systems tend to archive data time to time on a scheduled basis. It might be in the form of hard / soft delete. Lack of historical data can limit the level of analysis that can be performed on the data.

4) OLTP systems tend to store most updated version of data only. For ex employee address, product name, martial status etc. Generally history for such attributes would not be preserved, which results in loss of historical data. Lack of history for such attributes can limit the level of analysis that can be performed on the data.

5) OLTP data in an enterprise can be federated across different departments / business units / associate group companies. Each of it would typically have their own set of applications. There would be inevitably as set of common master data duplicated across the units. Due to duplicate and non-synchronized master data, consolidated reporting and analysis that combines data across the enterprise becomes almost impossible.

6) As each working unit of an enterprise can have their own OLTP system, different attributes can have data representation in different forms for the same data. For ex, an employee is permanent can be represented by Yes / No, Y / N, 1 / 0 etc. Even this makes data extremely difficult to interpret and hence raising the challenges to a centralized reporting over the enterprise data.

7) OLTP systems can contain data in free form, which can lead to degraded data quality. For example, free form address entry can lead to a very poor quality of geographic information due to typographic mistakes, effectively leading to unusable data and inviting data cleansing exercise.

8) OLTP systems tend to have a very good classification of business entities, but the some entities required for analysis are not normalized to the most detailed extent. For example, attributes related to date would would stored in datetime format. Entities like date and time has constituents like day / week / month / quater / year / fiscal year / sec / minute / hour etc. Mostly reporting and analytics works on a time scale. To report data based on any particular constituent of date / time, it would require extracting this constituent from the date value on a very huge dataset. This could lead to great performance issues.

What is a Star schema and when it should be used ?

1) Data model / schema used to create a data warehouse is also known as dimensional model. In this model, reference / lookup / master tables are called dimensions. Measureable details of data are called measures / facts and the tables that host the same are called Fact Tables.

2) Typcially, a simplified denormalized schema that covers the most granular section of a business entity can be represented by a star schema. Such a schema would have a single fact table containing measurable / aggregatable values. This fact table would have foreign keys from all dimension tables that completes identity of that record and provides different angles of analysis.

If you look at AdventureworksDW 2012 schema, FactInternetSales / FactResellerSales is a Fact Table. This table has reference from different dimensions like Date, Time, Product, Customers etc. Together these Fact and Dimensions form a schema known as Star Schema.

What is a Snowflake schema and when it should be used

1) A typical dimension is a highly denormalized business entity. If the same is normalized to 2nd or 3rd normal form, then the same star schema is termed as snowflake schema. For example, Product-ProductCategory-ProductSubCategory, Country-State-City etc are examples of snowflaked Product and Geography dimension respectively.

2) From a warehousing perspective, any data analysis done on data sourced from warehouse directly would suffer performance due to more joins as a result of more tables in a snowflake schema. From the perspective of creating data marts, snowflake schema would provide options to source data more selectively to datamarts.

3) Any typical enterprise class dimensional model would inevitably contain a snowflaked model of schema. Even if one selects not to normalize business entity into normalized dimensions, still dimensions would contain some common attributes like date keys, time keys, status keys etc. So Date dimension would be related Patent dimension by the date key for patenteddate field.

What is a Conformed Dimension ?

1) Theoretically considering if each discrete business area has a star schema, then different business areas typically tend to be connected to be each other. So if a star schema is created for each business unit in a enterprise, all these star schemas would have to be connected too.

2) Two schemas can be connected when they have something in common. If a dimension is shared across more than one schemas, then such dimensions are known as conformed dimensions. For ex, Date dimension would be shared by almost all the Fact tables in the schema.

What is a Data mart ?

1) Using a data warehouse as the source, technologies like SQL Server Analysis Services creates multidimensional data structures known as Cube. A cube contains dimensions and measures that can be used for large scale real-time analysis. These systems are known as Online Analytical Processing (OLAP) applications.

2) Typically a data warehouse can contain data from the entire enterprise along with history. A data mart would source data from a data warehouse and build specialized data structures on the top of the same for reporting and analytics.

What is Dimensionality / Granularity ?

1) If two business entities contain data at the same scale, then they are said to be at the same level of granularity. For example, if sales data is stores weekly in facts and orders data is stored daily, then order data can be said to have a lower granularity compared to sales.

2) To associate two entities in a dimensional model, they need to be calculated as the same level of granularity.

Feel free to share you queries on this posts, and I would try to answer the same in best possible way.
Related Posts with Thumbnails