This post introduces the differences between structured and unstructured
data and what they mean for organizations in today's big data and data
warehousing context.
Structured Data vs. Unstructured Data
There are several ways to tell the two apart.
Let us watch the video below first.
The video offers several ways to tell them apart. The first is to ask
whether the data fits into a pre-defined model. If it does, the data is
structured; otherwise, it is unstructured. A pre-defined model can be a table,
a database schema, and so on. Unstructured data, by contrast, is typically
text-heavy information that is not organized in any pre-defined manner.
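The "does it fit a pre-defined model?" test can be sketched in a few lines of Python. This is only an illustration: the schema `CUSTOMER_SCHEMA`, the helper `fits_model`, and the sample values are hypothetical names invented here, not something taken from the video.

```python
# Hypothetical pre-defined model: field name -> expected Python type.
CUSTOMER_SCHEMA = {"id": int, "name": str, "balance": float}


def fits_model(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    if not isinstance(record, dict):
        return False  # free text, audio bytes, etc. have no fields at all
    if set(record) != set(schema):
        return False  # missing or extra fields
    return all(isinstance(record[field], expected)
               for field, expected in schema.items())


# A table-like row versus a free-text note about the same customer.
row = {"id": 7, "name": "Ada", "balance": 120.5}
note = "Ada called on Monday and asked about her balance."

print(fits_model(row, CUSTOMER_SCHEMA))   # the row fits the model: structured
print(fits_model(note, CUSTOMER_SCHEMA))  # the note does not: unstructured
```

The note still contains useful information, but nothing in it maps onto the model's fields without further processing.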
The second way is to ask whether the pieces of data are already
meaningful to us. Data is raw material from which useful information has yet
to be extracted. Until that happens, the raw data tells us little, and this
raw data is the so-called unstructured data. Information, on the other hand,
is meaningful data; it can also be described as structured, organized, and
processed data. The last approach is to rely on experience. Normally, the
types of data listed below are unstructured.
1. Word documents, PDFs, and text files
   - Unstructured data
   - Examples: books, articles
2. Audio files
   - Unstructured data
   - Example: call center conversations
3. Email body
   - Unstructured data
   - Example: you don't need an example here!
4. Videos
   - Unstructured data
   - Example: video footage of a criminal interrogation
On the other hand, the types of data below are often structured.
1. A data mart or data warehouse
   - Structured data
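As a rough illustration of the "use experience" approach, the rule of thumb from the lists above can be written as a simple lookup. The file extensions and the function name `guess_by_experience` are assumptions made for this sketch; a real classifier would need to look at the content, not just the file name.

```python
from pathlib import Path

# Illustrative extensions for the unstructured types listed above
# (documents, audio, email, video). The exact set is an assumption.
UNSTRUCTURED_EXTENSIONS = {
    ".doc", ".docx", ".pdf", ".txt",  # Word documents, PDFs, text files
    ".mp3", ".wav",                   # audio files
    ".eml",                           # email messages
    ".mp4", ".avi",                   # videos
}


def guess_by_experience(filename):
    """Guess 'unstructured' from the file extension, else report 'unknown'."""
    suffix = Path(filename).suffix.lower()
    if suffix in UNSTRUCTURED_EXTENSIONS:
        return "unstructured"
    return "unknown"


print(guess_by_experience("interrogation.mp4"))
print(guess_by_experience("Annual Report.PDF"))
print(guess_by_experience("sales_table"))
```

The sketch deliberately returns "unknown" rather than "structured" for everything else, since an extension alone cannot show that data fits a pre-defined model.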
Data & Organizations & Warehouses
Having looked at both structured and unstructured data, the next thing to
remember is that data is extremely important to business nowadays, no matter
what type it is.
- 80 percent of business-relevant information originates in unstructured form.
– Justin Langseth
With the cost of storage falling dramatically, the data warehouse, a
product that has come to prominence in the context of big data, can extract,
transform, load, and store huge amounts of varied data for business
organizations. Built on ETL (extract, transform, load), OLAP, and other BI
applications, a data warehouse is a strong assistant for dealing with large
volumes of unstructured data, letting users mine useful resources from the big
data pool more easily.
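To make the ETL idea concrete, here is a minimal sketch using only Python's standard library. It assumes a tiny in-memory CSV export and uses SQLite as a stand-in for a warehouse; the table name `sales`, the sample rows, and the three helper functions are all hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """region,amount
east,100.50
west,80
east,19.50
"""


def extract(text):
    """Extract: parse the raw CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize region names and convert amounts to numbers."""
    return [(r["region"].strip().upper(), float(r["amount"])) for r in rows]


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite stands in for the warehouse
load(conn, transform(extract(RAW_CSV)))

# An OLAP-style rollup: total sales per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Once the data sits in a table with a fixed schema, the rollup at the end is a single query, which is the kind of analysis BI tools build on.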
Limitations
However, a data warehouse also has some drawbacks, especially when it
comes to analyzing different types of data.
- Data is hosted on various systems, which creates silos of information. It is
  time-consuming to get the data and compile it. Some of the data comes in the
  form of Excel spreadsheets or PowerPoint presentations. There is no easy way
  to access the data, and gathering it and creating reports requires intensive
  manual processing. There is no ability to perform custom analysis or to
  drill down.
- There is no central place to view the data required for reporting and
  analysis.
- There are no automated ways to get reports, and no dashboards are available.
- Reports are all Excel-based, and spreadsheets silo and skew the information.
- There is no ability to do a quick analysis or "what if" modeling.
Outlook of Data Warehousing
There is no doubt that big data is one of the most talked-about terms
right now. It reflects how people are gradually coming to see and understand
the world: the world is composed of unstructured stuff, and unstructured data
is simply what reality looks like. In the past, people emphasized the accuracy
of the dataset and the sampling method, because the amount of data available
was very limited and had to be handled with great care. However, things are
changing. With the development of cloud computing and the falling cost of
computing and storage, people can see the big picture of the world by
accepting, and even embracing, unstructured data and missing values.
Data warehousing is a strong candidate for storing big data and getting
the most use out of it. It is likely to become one of the most powerful tools,
not only in statistics and business analytics but also across business,
medicine, and science, within the next few years.