Monday, June 16, 2014

SQL Server vs MongoDB vs MySQL

Microsoft SQL Server is one of the mainstream databases used in most operational systems built on the Microsoft technology stack. One of its biggest shortcomings is the lack of built-in support for horizontal scaling / sharding. The next logical choice closest to SQL Server would be MySQL.

If you are looking for horizontal scaling / sharding, you are probably gearing up to deal with Big Data. MongoDB is arguably the first logical step into the NoSQL world for anyone considering experimenting with NoSQL to handle Big Data.

At this stage, one needs to compare all these databases. Below is a quick comparison of these databases, with limitations highlighted in red and product strengths in blue.


Saturday, June 14, 2014

Elasticsearch vs Solr vs Endeca vs Sharepoint FAST vs Google Search Appliance ( GSA ) vs Autonomy vs Semaphore

Enterprise search is a huge market. Fortunately, there are just a handful of products that cater to this business; unfortunately, no single product fits all requirements.

There are specific categories of features expected from an enterprise search product, which make it suitable for one requirement or another. Some of them are listed below:

1) Crawling
  • Web Crawling: An enterprise has most of its content on portals in the form of HTML and media documents. A crawler is the basic means to create an index out of this content.
  • DB Crawling: Data stored in databases often needs to be crawled or imported into the search inventory.
2) Taxonomy

Taxonomy is the logical organization of content in the enterprise content management system. Some refer to it as the metadata, structure, or term stores of the index maintained in the system. It is a method of framing structure around the content, so that information can be retrieved more effectively and precisely.

For example, a very simple way of implementing taxonomy can be the ability to tag content using a set of keywords defined centrally at the organization level.

3) Specialized OOB Search
  • Faceted search (like the set of categories that appears on the left side when you search on Amazon)
  • Dictionary based search (where you look for a word and its synonyms)
  • Auto-suggest (for example, when you type terms in Google and it suggests a few phrases)
4) Pluggability
  • Ability to index SMTP server
  • Ability to index LDAP server
  • Out-of-box ability to index any such external systems
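
The dictionary based search and auto-suggest features listed under item 3 can be illustrated with a small sketch. This is a toy model in Python, not how any of the products discussed here implement these features; the synonym dictionary, the phrase list, and the function names `expand_query` and `suggest` are all made up for illustration.

```python
from bisect import bisect_left

# Hypothetical, centrally maintained synonym dictionary (illustrative data only).
SYNONYMS = {
    "laptop": {"notebook"},
    "notebook": {"laptop"},
}


def expand_query(term):
    """Return the term plus its dictionary synonyms, sorted for stable output."""
    term = term.lower()
    return sorted({term} | SYNONYMS.get(term, set()))


def suggest(prefix, sorted_phrases, limit=3):
    """Return up to `limit` phrases starting with `prefix`.

    `sorted_phrases` must be a sorted list of lowercase phrases, so that all
    phrases sharing a prefix sit next to each other.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_phrases, prefix)
    matches = []
    for phrase in sorted_phrases[start:]:
        if not phrase.startswith(prefix):
            break
        matches.append(phrase)
        if len(matches) == limit:
            break
    return matches


PHRASES = sorted(
    ["elastic band", "elasticsearch", "elasticsearch tutorial", "endeca", "solr"]
)

print(expand_query("Laptop"))        # ['laptop', 'notebook']
print(suggest("elastics", PHRASES))  # ['elasticsearch', 'elasticsearch tutorial']
```

A real product would back both features with its index rather than in-memory structures; the sketch only shows the behavior a user sees.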
Systems like Google Search Appliance, Oracle Endeca, HP Autonomy, Microsoft Sharepoint search, and Solr are the leaders in this category. Products like Smartlogic Semaphore add a value-added layer on top of them.

But the big question is: where do products like Elasticsearch fit here ?

While these products are attractive because they provide the features mentioned above, they also have downsides / limitations, and this is where Elasticsearch or even Solr steps in.

1) None of these products is economical. For example, HP Autonomy is said to have a base price of more than half a million dollars. Not every enterprise has the budget to afford it.

2) Some products do not support database indexing easily. For example, GSA does not make it easy to use complex delta-detection queries for indexing data from databases.

3) Most of these products are not horizontally scalable. Apart from appliance solutions, products like Endeca are resource intensive and, because of their scalability architecture, not suitable for managing big data volumes.

4) Custom development to extend the product through APIs is not as easy as it is with open source products.

Custom search for applications is inevitable. Though the enterprise search platform space may be dominated by these products, products like Elasticsearch and Solr will continue to find their place in powering custom applications that manage big data using specialized search functionality (for example, e-commerce sites).

The limitation of products like Elasticsearch is that they lack enterprise-scale features, for example OOB crawlers and the information visualization and reporting layers required for e-discovery and reporting, and they have very limited taxonomy support, which is crucial for an enterprise search platform. But as the product is still very young and evolving, these features can hopefully be expected over the next couple of years.

Thursday, June 12, 2014

Elasticsearch with .NET : NEST Library Code Example

Elasticsearch can be used with a number of programming languages, one of them being Microsoft .NET. Two clients are available: Elasticsearch.NET (the low-level client) and NEST (the high-level client).

NEST provides a strongly typed wrapper around the Elasticsearch.NET API, and allows a fully object-oriented programming approach to interfacing with Elasticsearch. It also has nice documentation for learning the APIs.

The first program I generally want to write is one that indexes a structured document into Elasticsearch using C# code and the NEST APIs. One only needs any version of Visual Studio and the NEST NuGet package installed. Below is the very first console application I wrote to test the .NET integration with Elasticsearch. Let me know whether you liked the code, whether it worked for you, and whether you need any help with programming.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

using Nest;
using Nest.Domain.Connection;

namespace ESConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            // Connect to a local node and use "contacts" as the default index.
            var uri = new Uri("http://localhost:9200");
            var settings = new ConnectionSettings(uri).SetDefaultIndex("contacts");
            var client = new ElasticClient(settings);

            // Check cluster health first, then make sure the index is there.
            if (client.Health(HealthLevel.Cluster).ConnectionStatus.Success)
            {
                Console.WriteLine("Connection Successful");
                if (client.IndexExists("contacts").Exists)
                {
                    Console.WriteLine("Index Exists");
                    Program.UpsertArticle(client, new Article("The Last Airbender", "Siddharth"), "blog", "article", 1);
                    Program.UpsertContact(client, new Contacts("Siddharth Mehta", "India"), "contacts", "contacts", 2);
                    Console.WriteLine("Data Indexed Successfully");
                }
                else
                {
                    Console.WriteLine("Index Does Not Exist");
                }
            }
            else
            {
                Console.Write("Connection Failed");
            }
        }

        // Simple document classes; property names become the JSON field names.
        public class Article
        {
            public string title { get; set; }
            public string artist { get; set; }
            public Article(string Title, string Artist)
            {
                title = Title; artist = Artist;
            }
        }

        public class Contacts
        {
            public string name { get; set; }
            public string country { get; set; }
            public Contacts(string Name, string Country)
            {
                name = Name; country = Country;
            }
        }

        // Index (insert or overwrite) an article under the given index, type and id.
        public static void UpsertArticle(ElasticClient client, Article article, string index, string type, int id)
        {
            var RecordInserted = client.Index(article, index, type, id).Id;
            if (RecordInserted.ToString() != "")
            {
                Console.WriteLine("Transaction Successful !");
            }
            else
            {
                Console.WriteLine("Transaction Failed");
            }
        }

        // Index (insert or overwrite) a contact under the given index, type and id.
        public static void UpsertContact(ElasticClient client, Contacts contact, string index, string type, int id)
        {
            var RecordInserted = client.Index(contact, index, type, id).Id;

            if (RecordInserted.ToString() != "")
            {
                Console.WriteLine("Transaction Successful !");
            }
            else
            {
                Console.WriteLine("Transaction Failed");
            }
        }
    }
}

Monday, June 09, 2014

Elasticsearch with SQL Server

Elasticsearch is a very powerful value addition to any relational DBMS like SQL Server, Oracle, DB2 etc., provided it is used wisely. Before we look at how to use Elasticsearch with SQL Server, we should ask why to use Elasticsearch with SQL Server. This question holds the key to the answer.

SQL Server holds data either in relational form or in multi-dimensional form (through SSAS). Full Text Search (FTS) in SQL Server provides some out-of-box search features, but when search queries require exhaustive searching over huge datasets and the search definition itself is complex, the performance impact becomes evident. Elasticsearch is primarily a search engine, but loaded with features like facets and the aggregation framework, it helps solve many data analysis problems. For example, most of us have visited large e-commerce sites. Whenever we search for a product, the site builds all the dynamic categories, ranges and values on the fly. For such features, a product like Elasticsearch can be extremely helpful. One such real project example can be read from here.
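
To make the idea of dynamic categories and ranges concrete, here is a minimal sketch in Python of what a terms facet and a range facet compute over a set of search hits. It is a toy model that runs in memory, not the Elasticsearch facet or aggregation API; the product data, field names, and the helpers `search`, `term_facet` and `range_facet` are all hypothetical.

```python
from collections import Counter

# Illustrative product documents; field names and values are made up.
PRODUCTS = [
    {"name": "Phone A", "brand": "Acme", "price": 199},
    {"name": "Phone B", "brand": "Acme", "price": 549},
    {"name": "Phone C", "brand": "Globex", "price": 899},
    {"name": "Case D", "brand": "Globex", "price": 15},
]


def search(docs, text):
    """Naive substring match on the name field, standing in for a real query."""
    return [doc for doc in docs if text.lower() in doc["name"].lower()]


def term_facet(docs, field):
    """Count how many documents carry each distinct value of `field`."""
    return dict(Counter(doc[field] for doc in docs))


def range_facet(docs, field, bounds):
    """Count documents per half-open range [low, high) over a numeric field."""
    counts = {}
    for low, high in bounds:
        label = f"{low}-{high}"
        counts[label] = sum(1 for doc in docs if low <= doc[field] < high)
    return counts


hits = search(PRODUCTS, "phone")
print(term_facet(hits, "brand"))
print(range_facet(hits, "price", [(0, 200), (200, 600), (600, 1000)]))
```

The point of the sketch is that the categories and ranges are computed from the hits of the current query, which is why they change with every search.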

How to use Elasticsearch with SQL Server ?

Elasticsearch JDBC River is the best means (to the best of my knowledge as of date) to load data from SQL Server into an Elasticsearch index. One of the best explanations of setting up the Elasticsearch JDBC river with SQL Server can be read from here.

One point to keep in view is that if you set up a river and restart the Elasticsearch server, the river will execute its configured query again. This could reload the entire data set into the index. If the IDs are fetched from the source, all existing records get updated. But if IDs are auto-generated in Elasticsearch, the reload creates new records, which ultimately leads to duplicate data. So use the river cautiously. You can also delete the river once data is loaded into the index, in case it's a one-time activity for a one-time data migration.
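
The difference between the two cases can be simulated with a small in-memory model. This is a sketch in Python under simplifying assumptions, not the JDBC river or the Elasticsearch indexing API: `ToyIndex` and `run_river` are hypothetical names, and the rows are made-up sample data.

```python
import uuid


class ToyIndex:
    """In-memory stand-in for an index: maps document id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def index(self, source, doc_id=None):
        # Without an explicit id, generate one, so every call creates a new document.
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        # Re-indexing an existing id overwrites it and bumps the version.
        version = self.docs[doc_id][0] + 1 if doc_id in self.docs else 1
        self.docs[doc_id] = (version, source)
        return doc_id, version


def run_river(index, rows, use_source_ids):
    """Simulate one execution of the river's query against the source table."""
    for row in rows:
        index.index({"name": row["name"]}, row["id"] if use_source_ids else None)


ROWS = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

with_ids = ToyIndex()
auto_ids = ToyIndex()
for _ in range(2):  # the second pass stands in for the re-run after a restart
    run_river(with_ids, ROWS, use_source_ids=True)
    run_river(auto_ids, ROWS, use_source_ids=False)

print(len(with_ids.docs), with_ids.docs[1][0])  # 2 documents, each at version 2
print(len(auto_ids.docs))                       # 4 documents: duplicates
```

With source IDs the second run only bumps versions; with generated IDs the same two rows end up stored four times.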

Thursday, June 05, 2014

Elasticsearch Architecture Design - Considerations for using Elasticsearch in custom solutions

Elasticsearch has its own use cases for which one may want to consider it as part of the technology stack in a solution. While considering Elasticsearch in the scope of the solution, there are certain aspects of the product which should be kept in view from a solution design perspective.

Below is a list of some of the most important architecture design considerations from my perspective.

1) Elasticsearch is a document-oriented database system. Document-oriented stores are useful where the need is for a scalable system and queries are driven by the content of the data, not by a key as is the case in key-value stores.

2) Elasticsearch stores data in the form of JSON documents. JSON has a limited set of data types. So if the need is for a very strongly typed system, one should choose the data migration or storage strategy wisely.

3) Elasticsearch is a database system that exposes REST-based APIs for data access as well as server administration. This means that Elasticsearch can be seen as a DB server as well as a web server. So when the infrastructure landscape is being designed, one should carefully consider whether to place Elasticsearch in the web zone or in the DB zone.

4) Elasticsearch does not ship with any authentication module as of date. There are some community plugins available though. Keeping this in view, one would want to keep Elasticsearch behind a web application and not expose it directly over the internet.

5) Elasticsearch does not ship with any GUI tools or editors for development purposes. For this purpose, Elasticsearch has a concept of plugins, which can be developed using Elasticsearch APIs. A huge number of web-based plugins for Elasticsearch are available on GitHub, which can be easily installed and used for development and data analysis purposes.

6) Elasticsearch client wrappers are quite popular among developers, as they provide familiar syntax and eliminate the need to learn the Elasticsearch API. For example, NEST is a .NET wrapper on top of the Elasticsearch API. The risk with such wrapper frameworks is that they must continuously update their API to stay compliant with the Elasticsearch API. Elasticsearch has a very aggressive release schedule, typically with a minimum of 3-4 releases every year, some of which include breaking changes.

7) Elasticsearch, like many other NoSQL products, applies default behavior and automated data management until it is explicitly overridden. For example, if a document is inserted with Id 1, and the same document is then inserted again into the same index and type, it won't raise an exception the way relational databases do. It silently updates the document and increments its version number. As another example, Elasticsearch can place any document on any shard, which may reside on any node. So unless a specific placement is requested, a document can end up on any machine, node and shard.

8) Elasticsearch never updates a document in place, and it uses version numbers for concurrency control. Whenever an update request is received, Elasticsearch writes a new copy of the document with an incremented version number and marks the old copy as deleted. The old copies are physically purged only when the underlying segments are merged, so this characteristic can result in high storage requirements if there are huge and frequent updates on the system.

9) Elasticsearch has a concept of "rivers" for integrating with external systems like CouchDB, Twitter etc. But out-of-box, it does not ship any river to import data into Elasticsearch from SQL Server, Oracle, MySQL, DB2 and other relational databases. There are community plugins like JDBC River which can help with one-time imports of data. But for continued extraction and loading of data into Elasticsearch, a distributed ETL system like Apache Kafka or Twitter Storm would be advisable.

10) Elasticsearch has some very rigid settings related to data management, which should be decided even before creating the system. For example, when an index is created, by default it is configured with 5 primary shards. Once the index is created, the number of primary shards cannot be changed for the life of the index. So if you planned to manage 50 million records with 5 shards, and the requirements changed so that you now need to load 150 million, you would still have to manage 150 million records with only 5 shards.
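
Why the shard count in point 10 is fixed becomes clearer with a small sketch. Elasticsearch picks the shard for a document by hashing its routing value (the document id by default) modulo the number of primary shards. The Python below is a toy model of that rule: `zlib.crc32` is only a stand-in for the hash Elasticsearch actually uses, and `shard_for` and `moved_fraction` are hypothetical helper names.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Pick a shard as hash(id) modulo the shard count (crc32 is a stand-in hash)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_shards


def moved_fraction(doc_ids, old_shards, new_shards):
    """Fraction of documents whose shard would differ if the shard count changed."""
    moved = sum(
        1 for d in doc_ids if shard_for(d, old_shards) != shard_for(d, new_shards)
    )
    return moved / len(doc_ids)


DOC_IDS = range(1, 10001)

# Same shard count: every document is still found where it was written.
print(moved_fraction(DOC_IDS, 5, 5))

# One extra shard: most documents would now be looked up on the wrong shard.
print(moved_fraction(DOC_IDS, 5, 6))
```

Because the shard is derived from the shard count itself, changing that count would make most existing documents unfindable by id unless all the data were re-indexed, which is why the number has to be planned up front.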

These are some of the initial considerations to keep in view while fitting Elasticsearch into your solution. In time, I will share another part of this architecture design article, with more points to consider.

Sunday, June 01, 2014

Elasticsearch Tutorial - Elasticsearch Storage Architecture : Analysis and Inverted Indexes

Elasticsearch uses the Apache Lucene engine for almost all of its operations. One of the primary differences between relational databases and NoSQL systems is the way they store data. When it comes to the storage architecture of Elasticsearch, two terms are key to the storage mechanism: the analysis process and inverted indexes.

What is Analysis process in elasticsearch ? 

In Part 1, I already explained what a tokenizer and a filter are in Elasticsearch. Whenever an index is created, a default mapping and analyzer are attached to it. Depending on the configuration of the analyzer, a tokenizer and filters are configured for it.

When Elasticsearch receives a request to index a document, which is in turn handled by Lucene, the document is converted into a stream of tokens. After the tokens are generated, they are filtered by the configured filters. This entire process is called the analysis process, and it is applied to every document that gets indexed.

Below is an example of the analysis process. Consider an HTML tag with an embedded sentence as the input document. When it passes through a set of filters and tokenizers, it gets converted into a set of tokens, which finally get indexed.
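
The same flow can be sketched in a few lines of Python. This is a toy pipeline that only mimics the stages (strip the markup, tokenize, filter); it is not the Lucene implementation, and the stop-word list and the function names are assumptions made for illustration.

```python
import re
from html import unescape

# Illustrative stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}


def strip_html(text):
    """Character filter: drop tags and decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Tokenizer: split on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filters(tokens):
    """Token filters: lowercase, then remove stop words."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


def analyze(document):
    """Run the three stages in order and return the tokens to be indexed."""
    return token_filters(tokenize(strip_html(document)))


print(analyze("<p>The Quick brown fox is a <b>FAST</b> runner</p>"))
# ['quick', 'brown', 'fox', 'fast', 'runner']
```

Changing any stage (for example, keeping stop words or adding stemming) changes the tokens that reach the index, and therefore which searches will match.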

What is the storage data structure of elasticsearch ? Inverted Indexes

In SQL Server, for example, the data structure for an index is a B-tree. After the analysis process, when the data has been converted into tokens, these tokens are stored in an internal structure called an inverted index. This structure maps each unique term in an index to the documents that contain it. This data structure allows for faster search and text analytics. Attributes like term count, term position and others are associated with the term. Below is a sample visualization of what an inverted index may look like.
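
As a rough illustration of the structure, here is a minimal inverted index in Python. It is a sketch under simplifying assumptions (documents are already analyzed into tokens, everything lives in memory), not Lucene's on-disk format; `build_inverted_index` and `search_all` are hypothetical names.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for already-analyzed documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_all(index, terms):
    """Return ids of documents containing every term (an AND query)."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


DOCS = {
    1: ["quick", "brown", "fox"],
    2: ["lazy", "brown", "dog"],
    3: ["quick", "quick", "dog"],
}

index = build_inverted_index(DOCS)
print(index["quick"])                       # {1: [0], 3: [0, 1]}
print(search_all(index, ["quick", "dog"]))  # [3]
```

The term count for a document is simply the length of its position list, which is how attributes like frequency and position hang off each term.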

After the tokens are mapped, the document is stored on disk. One can choose to store the original input of the document along with the analyzed document. The original input gets stored in a system field named "_source". One can even choose not to analyze the input, and store the document without any analysis. The structure of the inverted index depends entirely on the analyzer chosen for indexing.

Summary: The key to an efficient storage and retrieval process is the analysis process defined on the index, as per the application needs.

Saturday, May 31, 2014

Elasticsearch Tutorial - Questions - Download Elasticsearch GUI Tools - Part 2

Elasticsearch has a very simple installation mechanism. It requires a JVM installed on the host OS; one then executes the elasticsearch.bat file to start it. When integrating products like Elasticsearch, front-end tools are often required. Some such requirements are mentioned below:

1) How to import data in bulk from existing data repositories like Excel files, SQL Server, Oracle, MySQL, DB2, MongoDB, and others ?

2) How to visually explore data stored in elasticsearch using GUI tools ?

3) How to equip support and monitoring teams with required tools for their day to day operations ?

4) How to equip analysts with tools for executing ad-hoc search queries on data stored in elasticsearch ?

Elasticsearch supports interoperability through a feature called plugins. Plugins can be installed using a simple plugin command. Elasticsearch itself provides a number of plugins, and a huge number more are supported by the community. An exhaustive list of such plugins is available here.

If you are new to Elasticsearch, and just setting up your development environment, below is the list of some of the plugins that you might particularly find useful to speed up your development process.

1) Elasticsearch GUI - A web based elasticsearch administration console written in AngularJS.

2) Elastichead - A web based front-end for elasticsearch, that lets you browse data in a tabular format, provides an interface to see metadata, and lets you fire ad-hoc queries.

3) Elasticsearch HQ - A web based elasticsearch monitoring and management console for instances and clusters.

4) Bigdesk - A web based elasticsearch plugin that lets you monitor a huge list of performance counters using charts and graphs.

5) Elasticsearch segmentspy - A web based elasticsearch plugin that specializes in monitoring segment-related features like merges, additions, deletes etc.

6) Elasticsearch whatson - A web based elasticsearch plugin that specializes in providing comparative analysis of data stored across indices, shards, nodes and cluster.

7) Elasticsearch FS River - Elasticsearch plugin to bulk import content. Though this plugin has a few open bugs, it is still very useful for a one-time bulk import. With some fixes and workarounds, this plugin can be used to warehouse huge amounts of content into elasticsearch.

8) Marvel - A commercial monitoring and analytics tool from elasticsearch.

9) Elasticsearch JDBC River - Elasticsearch plugin to bulk import data from a variety of systems into elasticsearch. If wisely used, this is the most useful plugin to start pumping data into elasticsearch.

Monday, May 19, 2014

ElasticSearch Tutorial - Questions - Basics - Part I

1) ElasticSearch uses Apache Lucene as the underlying technology.

2) Relational databases map the values of fields in a table to indexes. During a search operation, indexes are used to locate records. Lucene uses inverted indexes that store the values (terms) in a field, which are used to find related records (documents).


3) What is an index in ElasticSearch ? 

An index is similar to a table in relational databases. The difference is that relational databases would store actual values, which is optional in ElasticSearch. An index can store actual and/or analyzed values.

4) What is a document in ElasticSearch ? 

A document is similar to a row in relational databases. The difference is that each document in an index can have a different structure (fields), but should have the same data type for common fields.

Each field can occur multiple times in a document with different data types. Fields can contain other documents too.

5) Does ElasticSearch have a schema ?

Yes, ElasticSearch can have mappings which can be used to enforce a schema on documents.

6) What is a document type in ElasticSearch ?

A document type can be seen as the document schema / mapping definition, which has the mapping of all the fields in the document along with their data types.

7) What is indexing in ElasticSearch ?

The process of storing data in an index is called indexing in ElasticSearch. Data in ElasticSearch can be divided into write-once and read-many segments. Whenever an update is attempted, a new version of the document is written to the index.

8) What is a node in ElasticSearch ?

Each instance of ElasticSearch is called a node. Multiple nodes can work in harmony to form an ElasticSearch Cluster.

9) What is a shard in ElasticSearch ?

Due to resource limitations like RAM, vCPU etc, for scale-out, applications need to employ multiple instances of ElasticSearch on separate machines. Data in an index can be divided into multiple partitions, each handled by a separate node (instance) of ElasticSearch. Each such partition is called a shard. By default an ElasticSearch index has 5 shards.

10) What is a replica in ElasticSearch ?

By default, each primary shard in ElasticSearch has one replica, so there are 2 copies of each shard. The additional copies are called replicas. They serve the purpose of high-availability and fault-tolerance.

11) What is an Analyzer in ElasticSearch ?

While indexing data in ElasticSearch, data is transformed internally by the Analyzer defined for the index, and then indexed. An analyzer is built from a tokenizer and filters. The following types of Analyzers are available in ElasticSearch 1.10.

12) What is a Tokenizer in ElasticSearch ?

A Tokenizer breaks down the field values of a document into a stream of tokens. Inverted indexes are created and updated using these values, and this stream of values is stored in the document.

13) What is a Filter in ElasticSearch ?

After data is processed by the Tokenizer, it is processed by the Filter before indexing. The following types of Filters are available in ElasticSearch 1.10.

14) What is the query language of ElasticSearch ?

ElasticSearch provides a JSON-based query language called Query DSL, which is built on top of Apache Lucene queries.
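
To give a feel for what a Query DSL request looks like, the sketch below builds the JSON body of a simple `match` query in Python and prints it. The field name `title` and the search text are made-up examples, and `match_query` is a hypothetical helper; the sketch only constructs the body that would be sent to an index's `_search` endpoint and does not talk to a server.

```python
import json


def match_query(field, text, size=10):
    """Build the JSON body of a simple full-text match query."""
    return {"query": {"match": {field: text}}, "size": size}


body = match_query("title", "last airbender")
print(json.dumps(body, indent=2))
```

More complex searches are expressed by nesting further JSON objects inside the `query` element, rather than by writing a query string.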

In the next part of the ElasticSearch Tutorial, we will see how to install ElasticSearch, and how to use ElasticSearch tools and technologies to administer it.

Sunday, May 18, 2014

World Famous Architectures : Facebook, WhatsApp, Amazon, Twitter, YouTube, Google, ESPN, Salesforce, FarmVille and other world famous architectures

Experience is the biggest teacher, and no book or coach can teach better than learning from experience. We often hear about different architecture designs and practices from various sources in the professional world around us, and still most of us have inevitably attended a performance optimization training at least once in the past 2-3 years.

In my opinion, if you really want to learn scalability and performance, just take a look at the top architectures of the world mentioned below. I bet that if you can follow and implement even two of them to the extent these organizations have, you are set to build a new world famous architecture.

1) WhatsApp Architecture

Chef for Microsoft Azure, Amazon, OpenStack, Rackspace, Google Compute Engine, or Linode

What is DevOps ? Beginners who are not aware of what DevOps is can read this page to get an idea of it.

DevOps is a branch of architecture design that is often considered trivial by many architects or development leads. For traditional applications, infrastructure provisioning, capacity management, monitoring, and operations support generally get taken care of by dedicated IT teams bound by pre-agreed SLAs.

But when architects are dealing with cloud-scale applications, DevOps is no longer a trivial area or outside the solution definition. Automating infrastructure management using script-based templates is one of the standard industry practices and is supported by almost all the major cloud vendors. It came as a surprise to me when I found that one of the Microsoft Azure cloud trainers was not aware of what DevOps automation is, and I had to educate the trainer on it.

Some of the major players in this area are mentioned below, and Chef is leading the way in this area.

Vagrant still stands as a choice for VMware lovers, but Chef is much more sophisticated compared to Vagrant. A good starting point for beginners is to get an overview of Chef and to pursue some of the free webinars and free trainings provided by Chef. Chef in itself is a comprehensive framework with concepts like Knife, Cookbook, Chef-repo, Ohai etc. A picture is worth a thousand words. Below is the architecture diagram of Chef.

Chef Architecture

Picking a cloud automation vendor is not the end of DevOps. Huge businesses often have hybrid infrastructure environments made up of private datacenters, physical and virtual environments, and multi-tenant cloud environments. Companies like RightScale offer specialized solutions to deal with such use cases. Below is an interesting architecture diagram of the RightScale solution model.

RightScale MultiCloud Platform

Friday, May 16, 2014

How to drive a project on NoSQL, Big data, Elasticsearch, MongoDB, Hadoop, and other such technologies

I am writing this blog post after quite a long break from blogging. Once one gets married, gets promoted in the organization at the same time, and is made responsible for more than 20 projects as the Lead Architect for a portfolio, it's not easy to keep up with blogging.

These days, I work on projects spanning technologies like Sharepoint 2013, .NET, jQuery, SQL Server, SSIS, SSAS, SSRS, Powerpivot, Powerview, mobile web apps using Bootstrap and jQueryMobile, native iOS apps using Xcode, and NoSQL based technologies like Elasticsearch and MongoDB. Working as a solution architect with a broad range of projects and technologies is like working as a chef in a kitchen. I get to mix and merge various technology combinations, to create various solution recipes that cater to project requirements. The only difference is that bad recipes are not tolerated easily, as significant cost rides on an architect's decision.

I have spent my career working with technologies that were predominantly from the Microsoft space. But the world is changing, and so is the focus on technologies. I have been taking a lot of personal interest in studying the NoSQL based technologies that can tap intelligence from unstructured data as well as big data.

One of the most common traits of developers and architects is the typical punch line "I can't learn by reading, I need hands-on experience of the technology I need to manage". If you are working with a multi-national organization, it's not that easy to land in a project where you have no experience in the driving technology and, in most cases, neither does the organization. When organizations don't find or recognize use-cases for a particular technology, any attempt to push or propose it is seen as trying to sell the technology, a solution in search of a problem.

So the big question is, how to bag an entry ticket into the NoSQL world and drive a project using NoSQL technologies ?

Professionals seeking to build competency in NoSQL, as well as those intending to drive NoSQL based projects, can consider the following points:

1) Setup a personal lab: Virtualization has made it easy to create a VM. Most of the NoSQL technologies require very modest resources (like 2 GB RAM and a single core) to run the software. This can be a playground to start practicing the technology.

2) Join the global community: Platforms like Github and Stackoverflow have a lot of community projects and real-life questions. By being an active observer as well as participant on these platforms, one can mature on the technology very fast as well as make oneself globally visible as an active professional in the technology of choice.

3) Create a community within your organization: Organizations feel comfortable adopting technologies that can be easily managed by the pool of people available in the organization. If you are one of the few with a grip on the technology, you may classify yourself in the niche bracket, but that does not increase the organization's confidence to deal in the technology. To deal with this issue, you should conduct awareness sessions to bring people at various levels up to speed with the technology, and create a community of practice in the organization.

4) Pursue a professional training: Once you have successfully pursued points 2 and 3, you can confidently ask for a budget from the organization to pursue professional training on the subject. Not everyone can afford to pay for training from one's own pocket !!

5) Develop and publish POCs: Confidence to adopt a technology, and confidence in a professional's ability to manage it, is reflected by the professional's ability to justify the use-case for the technology. Identifying use-cases and justifying them through POCs are the best means for the same.

By following these 5 steps, I believe that one can establish oneself as well as one's organization in a position to make an entry in the NoSQL world. Let me know what you think.