Thursday, February 19, 2015

Big Unstructured Data vs. Structured Relational Data


This post introduces the differences between structured data and unstructured data and what they mean for organizations in the current context of big data and data warehouses.

Structured Data vs. Unstructured Data
There are many ways to tell the two apart. Let us start with the video below.

The video offers several ways to tell them apart. The first is to ask whether the data fits into a pre-defined model. If it does, the data is structured; otherwise, it is unstructured. A pre-defined model can be a table, a database schema, and so on.

Unstructured data, by contrast, is typically text-heavy information that is not organized in any pre-defined way.
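
To make the "pre-defined model" test concrete, here is a minimal Python sketch. The `CUSTOMER_MODEL` schema, the `fits_model` helper, and the sample values are all hypothetical, made up for illustration and not taken from the video. The idea is only that a record with the expected fields and types counts as structured, while free text does not.

```python
# Hypothetical pre-defined model: field name -> expected Python type.
CUSTOMER_MODEL = {"id": int, "name": str, "balance": float}


def fits_model(item, model):
    """Return True if item is a record with exactly the model's fields and types."""
    if not isinstance(item, dict):
        return False  # free text, audio bytes, etc. have no fields at all
    if set(item) != set(model):
        return False  # missing or unexpected fields
    return all(isinstance(item[field], kind) for field, kind in model.items())


row = {"id": 1, "name": "Alice", "balance": 120.5}          # fits the model
note = "Alice called on Monday about her account balance."  # free text

print(fits_model(row, CUSTOMER_MODEL))   # True
print(fits_model(note, CUSTOMER_MODEL))  # False
```

A real database enforces this kind of check through its table definitions; the sketch only mimics that behavior in plain Python.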

The second way to decide whether data is structured is to ask whether the pieces of data are meaningful to us. Data is raw material waiting for us to dig useful information out of it. Until we extract that information, the raw data does not make much sense to us; at that stage, the raw data is the so-called unstructured data. Information, on the other hand, is meaningful data, and it can also be described as structured, organized, and processed data. The last approach is to rely on experience. Normally, the types of data listed below are unstructured.
1. Word docs, PDFs, and text files

•  Unstructured data

•  Examples: books, articles

2. Audio files

•  Unstructured data

•  Example: call center conversations

3. Email body

•  Unstructured data

•  Example: you don’t need an example here!

4. Videos

•  Unstructured data

•  Example: video footage of a criminal interrogation

By contrast, the types of data below are usually structured.
1. A data mart / data warehouse

•  Structured data


Data & Organizations & Warehouses

Now that we understand both structured and unstructured data, the next thing to remember is that data is extremely important to business nowadays, whatever its type.

  • 80 percent of business-relevant information originates in unstructured form. 

                         –  Justin Langseth

With the cost of storage falling dramatically, the data warehouse, a product widely used in the context of big data, can store huge amounts of varied data for business organizations, and extract, transform, and load it. Built on ETL, OLAP, and other BI applications, the data warehouse is the best assistant for dealing with big unstructured data, and users can easily mine useful resources from the big data pool.
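
As a rough illustration of the ETL flow (extract, transform, load), here is a minimal Python sketch that uses the standard-library `sqlite3` module as a stand-in for a warehouse. The `call_notes` input, the `calls` table, and the three helper functions are hypothetical names chosen for this example; a real warehouse pipeline would be far more involved.

```python
import sqlite3

# Hypothetical raw input: loosely formatted call-center notes.
call_notes = [
    "2015-02-17 | alice | Asked about a late delivery",
    "2015-02-18 | bob | Wants to upgrade his plan",
]


def extract(lines):
    """Extract: pull the raw pieces out of each line."""
    for line in lines:
        yield [part.strip() for part in line.split("|")]


def transform(rows):
    """Transform: normalize the pieces into (date, customer, word_count)."""
    for date, customer, text in rows:
        yield date, customer.title(), len(text.split())


def load(conn, records):
    """Load: write the structured records into a table."""
    conn.execute("CREATE TABLE calls (date TEXT, customer TEXT, word_count INTEGER)")
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real warehouse
load(conn, transform(extract(call_notes)))

# Once loaded, the data can be queried like any other structured table.
result = conn.execute("SELECT customer, word_count FROM calls ORDER BY date").fetchall()
print(result)
```

The point of the sketch is the shape of the pipeline: text goes in one end, and rows that fit a table definition come out the other, ready for OLAP-style queries.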

Limitations
However, data warehousing also has some drawbacks, especially when it comes to analyzing different types of data.

•  Data is hosted on various systems, which creates silos of information. It is time-consuming to get the data and compile it. Some of the data comes in the form of Excel spreadsheets or PowerPoint presentations. There is no easy way to access the data, and gathering it and creating reports requires intensive manual processing. There is no ability to perform custom analysis or to drill down.
•  No central place to view the data required for reporting and analysis.
•  No automated ways to get reports, and no dashboards are available.
•  Reports are all Excel-based, and spreadsheets silo and skew the information.
•  No ability to do a quick analysis or “what if” modeling.

Outlook of Data Warehousing
There is no doubt that big data is the term everyone is talking about right now. It reflects the way people are gradually coming to see and understand the world: the world is composed of unstructured stuff, and unstructured data is simply what reality looks like. In the past, people always emphasized the accuracy of the dataset and the sampling method, because data was so limited in size that we had to take great care of it. However, things are changing. With the development of cloud computing and the falling cost of computing and storage, people can see the big picture of the world by accepting, and even embracing, unstructured data and even missing values.
Data warehousing is the best carrier for storing big data and getting the greatest use out of it. Within a few years, it will be one of the most powerful tools not only in statistics and business analytics, but also across business, medicine, and science.
