Data Science and Analytics Laboratory Guide

DEPARTMENT OF COMPUTER ENGINEERING

FACULTY OF ENGINEERING

UNIVERSITY OF UYO

CPE 221

DATA SCIENCE AND ANALYTICS

(LABORATORY GUIDE)

OUTLINE

● LAB 1: Programming Fundamentals for Data Science and Analytics (Python)

● LAB 2: Data Acquisition and Management

● LAB 3: Exploratory Data Analysis (EDA) & Visualization

● LAB 4: Data Pre-processing: Cleaning and Preparation

● LAB 5: Data Pre-processing: Feature Engineering & Transformation, Workflow, and Best Practices

● LAB 6: Introduction to Machine Learning

● LAB 7: Supervised Learning: Regression

● LAB 8: Supervised Learning: Classification

● LAB 9: Deploying a Linear Regression Model with Streamlit and GitHub

● LAB 10: End-to-End Mini Project

MINIMUM REQUIREMENTS

S/n   Component       Minimum Version Required
1     Python          3.11.5
2     Joblib          1.5.1
3     Matplotlib      3.10.3
4     NumPy           2.3.1
5     Pandas          2.3.1
6     Scikit-Learn    1.7.0
7     SciPy           1.16.0
8     Seaborn         0.13.2
9     Streamlit       1.46.1
10    Git             2.43.0
11    Jupyter Notebook/Google Colab/Visual Studio Code (with Jupyter Extension)
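
One way to confirm that an environment meets the versions listed above is a short script. This is only a sketch: the helper names `as_tuple` and `check_environment` are hypothetical, the package names are the ones normally used with pip, and the comparison handles plain numeric versions such as 2.3.1 rather than every possible version format. Git is not a Python package, so it is checked separately with `git --version` on the command line.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements table.
MINIMUMS = {
    "joblib": "1.5.1",
    "matplotlib": "3.10.3",
    "numpy": "2.3.1",
    "pandas": "2.3.1",
    "scikit-learn": "1.7.0",
    "scipy": "1.16.0",
    "seaborn": "0.13.2",
    "streamlit": "1.46.1",
}
PYTHON_MINIMUM = (3, 11, 5)


def as_tuple(text):
    """Turn '2.3.1' into (2, 3, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_environment(minimums=MINIMUMS):
    """Return a dict mapping each package to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, required in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        meets_minimum = as_tuple(installed) >= as_tuple(required)
        report[name] = "ok" if meets_minimum else "too old"
    return report


if __name__ == "__main__":
    python_ok = sys.version_info[:3] >= PYTHON_MINIMUM
    print("python:", "ok" if python_ok else "too old")
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

Any package reported as missing or too old can usually be installed or upgraded with pip before starting the labs.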

Tools/Libraries Descriptions

1. Python: A versatile and widely used high-level programming language, often employed for data analysis, machine learning, web development, and automation.

2. Joblib: A Python library offering a collection of utilities that simplify common data processing workflows. Its main strengths lie in two areas: efficiently saving and loading Python objects (such as trained models), and running tasks in parallel.

3. Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python.

4. NumPy: A fundamental package for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

5. Pandas: A powerful and flexible open-source data analysis and manipulation library for Python, built on top of NumPy. It's excellent for working with tabular data.

6. Scikit-Learn: A popular and robust open-source machine learning library for Python, offering various classification, regression, clustering, and model selection algorithms.

7. SciPy: An open-source Python library for scientific and technical computing, building on NumPy and providing functions for optimization, linear algebra, integration, interpolation, and more.

8. Seaborn: A Python data visualization library based on Matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics.

9. Streamlit: An open-source app framework for machine learning and data science, allowing you to turn data scripts into shareable web apps in minutes.

10. Git: A distributed version control system for tracking changes in source code during software development, facilitating collaboration among developers.
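
As a small illustration of the saving and loading role described for Joblib in the list above, the sketch below writes a Python object to disk and reads it back. The dictionary and the file name `settings.joblib` are made up for the example; in later labs the saved object would more likely be a trained model.

```python
import os
import tempfile

import joblib

# A stand-in for any Python object worth saving, e.g. a trained model.
settings = {"model": "linear_regression", "coefficients": [0.5, 1.25, -3.0]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "settings.joblib")  # hypothetical file name
    joblib.dump(settings, path)    # save the object to disk
    restored = joblib.load(path)   # load it back into memory

print(restored == settings)
```

The temporary directory keeps the example from leaving files behind; in a real project the file would be saved next to the code that loads it.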

Integrated Development Environments (IDEs) / Notebook Environments:

1. Jupyter Notebook: An open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. It's widely used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, and machine learning.

2. Google Colab (Colaboratory): A free cloud-based Jupyter Notebook environment from Google that requires no setup and runs entirely in the cloud, providing access to computing resources like GPUs and TPUs.

3. Visual Studio Code (with Jupyter Extension): A lightweight but powerful source code editor that, with the Jupyter extension, provides rich support for Jupyter Notebooks directly within the editor, combining the benefits of a code editor with the interactive capabilities of notebooks.

Overview

Data Science and Analytics Laboratory Guide

The Data Science and Analytics Laboratory Guide provides a comprehensive overview of essential programming and analytical techniques for students in computer engineering. It covers key topics such as Python programming fundamentals, data acquisition, exploratory data analysis, and machine learning methodologies. Each lab is designed to enhance practical skills, including data cleaning, feature engineering, and model deployment using tools like Streamlit and GitHub. This guide is ideal for university students pursuing data science and analytics courses, offering hands-on experience through ten detailed laboratory exercises.

Key Points

● Covers ten laboratory exercises in data science and analytics, including Python programming and machine learning.

● Includes practical applications for da…

FAQs

What are the minimum requirements for the Data Science and Analytics labs?

The labs require specific minimum versions of Python, its libraries, and Git. Python must be version 3.11.5 or later. The required libraries are Joblib (1.5.1), Matplotlib (3.10.3), NumPy (2.3.1), Pandas (2.3.1), Scikit-Learn (1.7.0), SciPy (1.16.0), Seaborn (0.13.2), and Streamlit (1.46.1). Git (2.43.0), which is a version control tool rather than a Python library, must also be installed. Users also need Jupyter Notebook, Google Colab, or Visual Studio Code with the Jupyter Extension for coding and visualization.

What is the focus of Lab 3 in the guide?

Lab 3 focuses on Exploratory Data Analysis (EDA) and Visualization. In this lab, learners develop skills to perform comprehensive EDA, summarizing datasets to understand their key characteristics. They also master the creation of informative data visualizations to identify patterns, detect anomalies, and test hypotheses. This lab emphasizes the importance of EDA in the data science workflow, providing foundational skills for analyzing and interpreting data effectively.
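
A first pass at EDA of the kind described above might look like the sketch below. The dataset is made up for illustration, and the 1.5 × IQR rule used to flag anomalies is a common rule of thumb rather than a method the guide is known to prescribe.

```python
import pandas as pd

# Small made-up dataset, used only to illustrate the calls.
df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10, 3],
    "score": [50, 58, 71, 80, 300, 55],  # 300 is a deliberate anomaly
})

# Summary statistics: count, mean, std, min, quartiles, max per column.
summary = df.describe()

# Pearson correlation between the two columns.
correlation = df["hours_studied"].corr(df["score"])

# Flag scores more than 1.5 * IQR outside the quartiles.
q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(summary)
print(outliers)
```

A histogram or scatter plot drawn with Matplotlib or Seaborn would normally follow, so that the flagged row can be inspected visually before deciding what to do with it.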

How does Lab 1 address Python programming fundamentals?

Lab 1 introduces Programming Fundamentals for Data Science and Analytics using Python. It covers essential concepts such as syntax, data types, control flow structures, functions, and object-oriented principles. The lab is designed to provide a solid foundation specifically geared towards applications in data science and analytics tasks. By the end of this lab, learners will have a foundational understanding of Python programming necessary for subsequent labs.
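
The concepts listed above can be seen together in a few lines. This is an illustrative sketch only; the class `Measurement` and the function `mean` are invented for the example and are not taken from the lab itself.

```python
class Measurement:
    """A labelled numeric reading (a minimal class with two attributes)."""

    def __init__(self, label, value):
        self.label = label          # str
        self.value = float(value)   # stored as float regardless of input type


def mean(values):
    """Average a list of numbers with an explicit loop (control flow)."""
    if not values:
        return 0.0
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


readings = [Measurement("a", 2), Measurement("b", 4), Measurement("c", 9)]
values = [m.value for m in readings]       # list comprehension
average = mean(values)
above = [m.label for m in readings if m.value > average]

print(average, above)
```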

What techniques are used for handling missing values in the guide?

The guide outlines several techniques for handling missing values. These include dropping rows with missing data using the dropna() method, filling missing values with constants, means, medians, or modes, and employing forward or backward fill methods. Additionally, it suggests model-based imputation methods, such as KNN and regression imputation, to estimate missing values based on other features. The document emphasizes the importance of addressing missing values to maintain data quality.
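
The techniques named above map onto pandas and scikit-learn calls roughly as follows. This is a sketch on a made-up four-row table; the column names are hypothetical, and regression imputation is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 41.0],
    "income": [1000.0, 1200.0, np.nan, 2000.0],
})

dropped = df.dropna()                   # drop rows with any missing value
constant_filled = df.fillna(0)          # fill with a constant
mean_filled = df.fillna(df.mean())      # fill each column with its mean
median_filled = df.fillna(df.median())  # fill each column with its median
forward = df.ffill()                    # carry the last valid value forward
backward = df.bfill()                   # pull the next valid value backward

# Model-based imputation: estimate each gap from the 2 nearest rows.
knn = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(knn.fit_transform(df), columns=df.columns)

print(mean_filled)
print(knn_filled)
```

Dropping rows is the simplest option but discards data, while forward and backward fill only make sense when row order carries meaning, as in a time series.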

What is the significance of feature scaling in the data preprocessing workflow?

Feature scaling is crucial in the data preprocessing workflow because it puts numerical features on comparable ranges, so that no feature dominates a model simply because of its units. This matters most for models that are sensitive to the scale of their inputs. The guide explains different scaling techniques, including standard scaling (Z-score normalization), min-max scaling, and robust scaling. Which technique suits a dataset depends on its distribution, and choosing well helps improve model performance.
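
The three scalers named above are available in scikit-learn. The sketch below applies each to the same made-up column, which includes one outlier so that the difference between them is visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # (x - mean) / standard deviation
minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(X)      # (x - median) / interquartile range

print(standard.ravel())
print(minmax.ravel())
print(robust.ravel())
```

With min-max scaling the outlier squeezes the other four values close to zero, whereas robust scaling uses the median and interquartile range, so the ordinary values keep a usable spread.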

What is the purpose of Lab 6 in the Data Science and Analytics Laboratory Guide?

Lab 6 serves as an introduction to Machine Learning, providing foundational knowledge about core concepts and workflows in the field. It covers the paradigm of supervised learning, explaining its subtypes such as regression and classification, and outlines the typical stages involved in a machine learning project. Learners gain insights into features, labels, training and test sets, and the significance of generalization, overfitting, and underfitting in model development.
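
The ideas of features, labels, and training and test sets can be sketched with scikit-learn as below. The data is synthetic, generated from an assumed linear relationship with noise, and the 80/20 split is a common default rather than a value stated in the guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: one feature, label = 3x + 2 plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # features
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 100)     # labels

# Hold out 20% of the rows to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on unseen data

print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A training score far above the test score suggests overfitting, while low scores on both suggest underfitting.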
