What is Big Data?
Big Data is data whose scale, distribution, diversity, and/or timeliness require the
use of new technical architectures and analytics to enable insights that unlock new
sources of business value.
McKinsey & Co.; Big Data: The Next Frontier for Innovation, Competition, and
As a data scientist, or in any other data-related role, you will encounter several types of data:
- Structured data: Data containing a defined data type, format, and structure (for example, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS tables, CSV files, and even simple spreadsheets).
- Semi-structured data: Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
- Unstructured data: Data that has no inherent structure, which may include text documents, PDFs, images, and video.
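To make the three categories concrete, here is a minimal sketch that uses only Python's standard library. The sample records, field names, and review text are made up for illustration and do not come from any real dataset.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: a CSV extract with a fixed set of columns (hypothetical data).
csv_text = "order_id,amount\n1001,25.50\n1002,14.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: XML is self-describing, so a parser can walk its tags.
xml_text = "<orders><order id='1001'><item>book</item></order></orders>"
root = ET.fromstring(xml_text)
items = [order.findtext("item") for order in root.findall("order")]

# Unstructured: free text has no schema; any structure has to be imposed by us.
review = "Great book, arrived quickly. Would buy again."
word_count = len(review.split())

print(total, items, word_count)
```

The structured and semi-structured samples can be queried by column or tag name, while the free-text review only yields something as crude as a word count until more analysis is applied.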
What are Data Repositories?
- Spreadsheets and data marts: Spreadsheets and low-volume databases for recordkeeping. Analysts depend on data extracts such as Excel sheets or Google Sheets.
- Data Warehouses: Centralized data containers in a purpose-built space. Supports BI and reporting, but restricts robust analyses. Analysts are dependent on IT and DBAs for data access and schema changes. Analysts must spend significant time getting aggregated and disaggregated data extracts from multiple sources. Examples include Amazon Redshift and Azure SQL Data Warehouse.
- Analytic Sandbox: Data assets gathered from multiple sources and technologies for analysis. Enables flexible, high-performance analysis in a nonproduction environment; can leverage in-database processing. Reduces costs and risks associated with data replication into “shadow” file systems. “Analyst owned” rather than “DBA owned”.
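In-database processing, mentioned for the analytic sandbox, means running the computation where the data lives instead of exporting an extract first. The sketch below is only an illustration: it assumes an in-memory SQLite database as a stand-in for the sandbox, and the table and column names are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the sandbox (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 50.0)],
)

# The aggregation runs inside the database; only the summary comes back.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(summary)
```

Only the two summary rows leave the database, which is the point of the approach: no copy of the raw table ends up in a separate file.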
Data Science vs Business Intelligence
The image above explains everything on the topic of Data Science vs Business Intelligence
Data Analytics Lifecycle
- What is Data Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
- What is Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. ELT and ETL are sometimes combined under the abbreviation ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
- What is Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
- What is Model building: In Phase 4, the team builds and runs the models selected during model planning.
- How to Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
- What is Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
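The extract, transform, and load step from the data preparation phase can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the CSV content, the `transform` helper, and the `customers` table are all hypothetical, and an in-memory SQLite database plays the role of the analytic sandbox.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (an in-memory CSV here, made up).
raw = "customer,signup_date,spend\n alice ,2023-01-05,100\nbob,2023-02-11,n/a\n"
records = list(csv.DictReader(io.StringIO(raw)))

def transform(record):
    """Condition one record: trim the name, convert spend, reject bad values."""
    try:
        spend = float(record["spend"])
    except ValueError:
        return None  # unparseable spend, so the row is dropped
    return (record["customer"].strip(), record["signup_date"], spend)

# Transform: keep only the records that survive conditioning.
clean = [row for row in map(transform, records) if row is not None]

# Load: write the conditioned rows into the sandbox table.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE customers (name TEXT, signup_date TEXT, spend REAL)")
sandbox.executemany("INSERT INTO customers VALUES (?, ?, ?)", clean)
loaded = sandbox.execute("SELECT name, spend FROM customers").fetchall()
sandbox.close()

print(loaded)
```

In an ELT variant the raw rows would be loaded first and the cleaning would happen inside the sandbox, which keeps the original data available for later inspection.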
Two Major Types of Machine Learning Models
Supervised Learning: It uses known, labeled data as input. In a supervised model, both the input and the output variables are given.
Common applications of supervised machine learning models are:
- Predictive analytics (house prices, stock exchange prices, etc.)
- Text recognition
- Spam detection
- Customer sentiment analysis
- Object detection (e.g. face detection)
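As a small illustration of supervised learning, the sketch below fits a straight line to labeled examples and then predicts an unseen case, in the spirit of the house price use case. The sizes and prices are invented, and ordinary least squares is just one simple model choice among many.

```python
# Labeled training data: each input (size in square meters) has a known
# output (price in thousands). The numbers are invented for illustration.
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 210.0, 270.0, 330.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

# Training: learn the relationship between inputs and known outputs.
slope, intercept = fit_line(sizes, prices)

# Prediction: apply the learned relationship to an input not seen in training.
predicted = slope * 100.0 + intercept
print(round(predicted, 2))
```

The known prices are what make this supervised: the model is corrected against them during fitting.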
Unsupervised Learning: It uses unlabeled data as input. In an unsupervised learning model, only the input data is given.
Types of Unsupervised Machine Learning Models are:
- Dimensionality reduction
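Dimensionality reduction can be illustrated with a bare-bones principal component analysis. The sketch below is an assumption-laden toy: the points are made up, and power iteration is used in place of a library routine so that the example needs nothing beyond the standard library.

```python
import math

# Unlabeled 2-D points (made up); no output variable is provided.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]

def first_component(data, iterations=100):
    """Find the direction of greatest variance by power iteration on the covariance matrix."""
    n, dims = len(data), len(data[0])
    means = [sum(p[j] for p in data) / n for j in range(dims)]
    centered = [[p[j] - means[j] for j in range(dims)] for p in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vec = [1.0] * dims
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec, centered

direction, centered = first_component(points)

# Project each centered 2-D point onto the component: two features become one.
reduced = [sum(r[j] * direction[j] for j in range(2)) for r in centered]
print(direction, reduced)
```

No labels are involved at any point; the structure is found from the inputs alone, which is what distinguishes this from the supervised case.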