Thursday, February 19, 2015

Big Unstructured Data vs. Structured Relational Data


This blog introduces the differences between structured data and unstructured data, and what those differences mean for organizations in the current context of big data and data warehouses.

Structured Data vs. Unstructured Data
There are several ways to tell the two apart. Let us watch the video below first.

This video provides several ways to tell the difference. The first is to ask whether the data can fit into a pre-defined model. If the answer is yes, the data is structured; otherwise, it is unstructured. A pre-defined model can be a table, a database, and so on.

Unstructured data, on the other hand, is typically text-heavy information that is not organized in any pre-defined manner.
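The pre-defined-model test can be sketched in a few lines of Python. This is only an illustration under assumptions: the `CUSTOMER_SCHEMA` fields and the `fits_schema` helper are hypothetical names introduced here, not something taken from the video.

```python
# Hypothetical pre-defined model: field name -> expected Python type.
CUSTOMER_SCHEMA = {"id": int, "name": str, "balance": float}


def fits_schema(record, schema):
    """Return True if record is a dict with exactly the schema's fields and types."""
    if not isinstance(record, dict):
        return False
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], kind) for field, kind in schema.items())


row = {"id": 7, "name": "Ada", "balance": 12.5}  # fits the model
note = "Customer called about a late delivery."   # free text, no fields

print(fits_schema(row, CUSTOMER_SCHEMA))   # True
print(fits_schema(note, CUSTOMER_SCHEMA))  # False
```

The row matches the model field by field, so it counts as structured; the free-text note has no fields to match, so by this test it is unstructured.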

Another way to decide whether data is structured is to ask whether the pieces of data are meaningful to us. Data is raw material waiting for us to dig useful information out of it, and before that happens the raw data makes little sense to us. At this stage, the raw data is the so-called unstructured data. Information, on the other hand, is meaningful data; it can also be described as structured, organized, and processed data. The last approach is to rely on experience. Normally, the types of data listed below are unstructured.
1. Word documents, PDFs, and text files
  • Unstructured data
  • Examples: books, articles

2. Audio files
  • Unstructured data
  • Example: call center conversations

3. Email body
  • Unstructured data
  • Example: you don’t need an example here!

4. Videos
  • Unstructured data
  • Example: video footage of a criminal interrogation

On the other hand, the following types of data are usually structured.
1. A data mart or data warehouse
  • Structured data


Data & Organizations & Warehouses

Having understood both structured and unstructured data, the next thing to remember is that data is extremely important to business nowadays, no matter what type of data it is.

  • 80 percent of business-relevant information originates in unstructured form. 

                         –  Justin Langseth

With the dramatically decreasing cost of storage, the data warehouse, as a product born in the context of big data, can extract, transform, load, and store huge amounts of varied data for business organizations. Built on ETL, OLAP, and other BI applications, the data warehouse is the best assistant for dealing with big unstructured data, and users can easily mine useful resources from the big data pool.
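As a rough illustration of the ETL step, the sketch below extracts rows from a small CSV text, transforms them, and loads them into an in-memory SQLite table. The sample data, the `sales` table, and the `run_etl` function are all hypothetical and exist only for this example; a real warehouse pipeline would use dedicated tooling.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = "region,amount\n east ,100\nWest,250\neast,50\n"


def run_etl(raw_text):
    """Extract rows from CSV text, clean them, and load them into SQLite."""
    # Extract: parse the CSV text into dictionaries.
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    # Transform: normalize region names and convert amounts to integers.
    cleaned = [(r["region"].strip().lower(), int(r["amount"])) for r in rows]
    # Load: write the cleaned rows into a (hypothetical) sales table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    conn.commit()
    return conn


conn = run_etl(RAW_CSV)
# A simple aggregate of the kind an OLAP-style report would ask for.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('east', 150), ('west', 250)]
```

Once the data is loaded in a consistent shape, the aggregate query at the end is a one-liner, which is the practical benefit the ETL step buys.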

Limitations
However, data warehouses also have some drawbacks, especially when it comes to analyzing different types of data.

  • Data is hosted on various systems, which creates silos of information. It is time-consuming to get the data and compile it. Some of the data comes in the form of Excel spreadsheets or PowerPoint presentations. There is no easy way to get access to the data, and gathering it and creating reports requires intensive manual processing. There is no ability to perform custom analysis or to drill down.
  • There is no central place to view the data required for reporting and analysis.
  • There are no automated ways to get reports, and no dashboards are available.
  • Reports are all Excel-based; spreadsheets silo and skew the information.
  • There is no ability to do a quick analysis or “what if” modeling.

Outlook of Data Warehousing
There is no doubt that big data is the most popular term that everyone is talking about right now. It reflects the way people are gradually coming to see and understand the world: the world is composed of unstructured things, and unstructured data is simply what reality looks like. In the past, people always emphasized the accuracy of the dataset and the sampling method, because the amount of data was so limited that it had to be handled with great care. However, things are changing. With the development of cloud computing and the decreasing cost of computing and storage, people can see the big picture of the world by accepting, and even embracing, unstructured data and missing values.
Data warehousing is the best carrier for storing big data and getting the greatest use out of it. Within the next few years, it will be one of the most powerful tools not only in statistics and business analytics, but also across business, medicine, and science.

Monday, February 2, 2015

Business Intelligence & Analysis Products Scan & Evaluation

MIS 587
Blog 1
Due: 2/03/2015
Rui He

The figure below is the “Magic Quadrant”, a scatter plot developed by Gartner to show how the major players in the current Business Intelligence and Analytics market perform.

Figure 1. Magic Quadrant for Business Intelligence and Analytics Platforms

In this figure, Gartner analysts use two criteria for their judgment: ability to execute and completeness of vision. From the figure, we can easily get a rough idea of how the major players perform against these two criteria.
Today, I will choose five BI analysis products from the figure (not the top five) and compare their overall performance on five criteria.

The five products I will examine today are
Tableau,
QlikView,
MicroStrategy,
Information Builders,
and Jaspersoft.
The five criteria used for judgment are
Functionality,
Ease of Use,
Performance,
Productivity,
and Cost.

Below is the table that displays my evaluation of the five chosen products according to my criteria.
 
Criterion       Weight   Tableau   QlikView   MicroStrategy   Information Builders   Jaspersoft
Functionality   25%      8         9          9               9                      8
Ease of Use     15%      8         10         8               8                      8
Performance     30%      9         8          7               8                      8
Productivity    20%      8         8          7               9                      7
Cost            10%      7         8          7               9                      9
Points          100%     8.20      8.30       7.40            8.30                   7.90
Rank                     3         1          5               1                      4
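The Points row is a weighted sum of the five scores, and the Rank row orders the products by that sum. A minimal sketch of the calculation follows, with the weights and scores copied from the table; the names `WEIGHTS_PERCENT`, `SCORES`, and `weighted_points` are illustrative only. Computed this way from the scores as listed, Tableau (8.20) and Jaspersoft (7.90) match the table, while QlikView, MicroStrategy, and Information Builders come out 0.25 higher than the Points row (8.55, 7.65, and 8.55). The ranking is the same either way.

```python
# Weights (in percent) and scores copied from the evaluation table.
WEIGHTS_PERCENT = {
    "Functionality": 25,
    "Ease of Use": 15,
    "Performance": 30,
    "Productivity": 20,
    "Cost": 10,
}

# Scores are listed in the same order as the criteria above.
SCORES = {
    "Tableau": [8, 8, 9, 8, 7],
    "QlikView": [9, 10, 8, 8, 8],
    "MicroStrategy": [9, 8, 7, 7, 7],
    "Information Builders": [9, 8, 8, 9, 9],
    "Jaspersoft": [8, 8, 8, 7, 9],
}


def weighted_points(scores, weights_percent):
    """Return the weighted sum of scores, with weights given in percent."""
    return sum(s * w for s, w in zip(scores, weights_percent)) / 100


points = {
    product: weighted_points(scores, WEIGHTS_PERCENT.values())
    for product, scores in SCORES.items()
}

# Competition ranking: tied products share a rank, and the next rank is skipped.
ranks = {
    product: 1 + sum(other > value for other in points.values())
    for product, value in points.items()
}

for product in SCORES:
    print(f"{product}: {points[product]:.2f} (rank {ranks[product]})")
```

Keeping the weights as whole percentages and dividing once at the end avoids floating-point noise, so the tie between QlikView and Information Builders is detected exactly.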

What does each criterion mean here?
1.      Functionality
Functionality means capability and extensibility. Capability refers to the range of BI functions that the product can support: the more functions the product has, the higher its score for this criterion. Extensibility here mainly means integration ability. For example, the more easily the product can be integrated into a web portal, the higher its score. Likewise, the more applications and environments, such as R and Java, that the product can be embedded into, the higher its score.
2.      Ease of Use
It includes three aspects: how difficult the product is to install, how intuitive its GUI or user interface design is, and how convenient future upgrades and maintenance are. The easier the product is to install, the more intuitive it is to explore and use, and the easier it is to maintain and upgrade, the higher its score for this criterion.
3.      Performance
It includes two aspects. The first is the level of business functions that the product can support. This is a different requirement from functionality, which focuses on standard BI analysis technologies such as ETL and OLAP. Performance, by contrast, concentrates on extra or distinctive features, such as ad-hoc slice and dice and insights embedded across the vendor's product family. The second is stability: the more stable the product is, the higher the score it obtains.
4.      Productivity
This criterion focuses on the soft technologies that help users improve their productivity effectively. Business dashboards and data visualizations are good examples of what productivity means here.
5.      Cost
Cost is also a critical factor to evaluate when considering which BI tool to choose, regardless of the size of the organization or company.

Detailed Evaluations and Explanations

Tableau
Tableau is often the first supplier that comes to mind when businesses consider data visualization tools. While the product is easy to use and produces very attractive visuals, it is not particularly sophisticated and may prove inadequate as needs mature. It remains a classic trade-off between ease of use and sophistication. However, Tableau is likely to lose market share as more competitors enter the market in the future.

QlikView
The QlikView BI platform has the ability to be all things to all people, and will satisfy business users, developers and enterprise needs. It sets the right balance between ease-of-use and sophistication. Great extensibility and the ability to create new chart types and BI apps make QlikView win the game.


MicroStrategy
MicroStrategy is in many ways a meeting of the old and new in business intelligence, and takes the positives from both. It is truly an enterprise solution meeting the less glamorous demands for regular reporting, complex dashboards and extensive admin, while offering up the sexier self-service BI users now expect from a BI solution. It is expensive, and for organizations with less demanding requirements other options will be more economical.


Information Builders      
Information Builders is a long established supplier of BI, analytics and integration technologies.
The integration of BI and data mining is quite unique and puts IB ahead of the crowd. Its maturity, sophistication, and value for money are very hard to beat.


Jaspersoft
Jaspersoft does not particularly distinguish itself in any way, but neither does it have any striking inadequacies. This BI suite will address the BI needs of many organizations without a great deal of fuss, and can be extended to meet bespoke requirements.


Conclusion
Overall, I recommend either QlikView or Information Builders as your BI analysis tool.