The Data Science and Analytics Laboratory Guide provides a comprehensive overview of essential programming and analytical techniques for students in computer engineering. It covers key topics such as Python programming fundamentals, data acquisition, exploratory data analysis, and machine learning methodologies. Each lab is designed to enhance practical skills, including data cleaning, feature engineering, and model deployment using tools like Streamlit and GitHub. This guide is ideal for university students pursuing data science and analytics courses, offering hands-on experience through ten detailed laboratory exercises.

Key Points

  • Covers ten laboratory exercises in data science and analytics, including Python programming and machine learning.
  • Includes practical applications for data acquisition, exploratory data analysis, and model deployment.
  • Focuses on essential tools such as Pandas, NumPy, and Scikit-Learn for data manipulation and analysis.
  • Designed for computer engineering students to gain hands-on experience in data science techniques.
Ekemini Tom
90 pages
Language: English
Type: Lab Report
DEPARTMENT OF COMPUTER ENGINEERING
FACULTY OF ENGINEERING
UNIVERSITY OF UYO
===============================
======= CPE 221 =======
================================
DATA SCIENCE AND ANALYTICS
(LABORATORY GUIDE)
OUTLINE
LAB 1: Programming Fundamentals for Data Science and Analytics (Python)
LAB 2: Data Acquisition and Management
LAB 3: Exploratory Data Analysis (EDA) & Visualization
LAB 4: Data Pre-processing: Cleaning and Preparation
LAB 5: Data Pre-processing: Feature Engineering & Transformation, Workflow, and Best Practices
LAB 6: Introduction to Machine Learning
LAB 7: Supervised Learning: Regression
LAB 8: Supervised Learning: Classification
LAB 9: Deploying a Linear Regression Model with Streamlit and GitHub
LAB 10: End-to-End Mini Project
MINIMUM REQUIREMENTS
S/n   Component       Minimum Version Required
1     Python          3.11.5
2     Joblib          1.5.1
3     Matplotlib      3.10.3
4     NumPy           2.3.1
5     Pandas          2.3.1
6     Scikit-Learn    1.7.0
7     SciPy           1.16.0
8     Seaborn         0.13.2
9     Streamlit       1.46.1
10    Git             2.43.0
11    Jupyter Notebook/Google Colab/Visual Studio Code (with Jupyter Extension)
Tools/Libraries Descriptions
1. Python: A versatile and widely used high-level programming language, often employed for data analysis,
machine learning, web development, and automation.
2. Joblib: A Python library offering a collection of utilities that simplify common data processing
workflows. Its main strengths lie in two areas: efficiently saving and loading Python objects (particularly
those holding large NumPy arrays) and running simple parallel computations.
3. Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python.
4. NumPy: A fundamental package for numerical computing in Python, providing support for large,
multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these
arrays.
5. Pandas: A powerful and flexible open-source data analysis and manipulation library for Python, built on top of
NumPy. It's excellent for working with tabular data.
6. Scikit-Learn: A popular and robust open-source machine learning library for Python, offering various
classification, regression, clustering, and model selection algorithms.
7. SciPy: An open-source Python library for scientific and technical computing, building on
NumPy and providing functions for optimization, linear algebra, integration, interpolation, and more.
8. Seaborn: A Python data visualization library based on Matplotlib, providing a high-level interface for drawing
attractive and informative statistical graphics.
9. Streamlit: An open-source app framework for machine learning and data science, allowing you to turn data
scripts into shareable web apps in minutes.
10. Git: A distributed version control system for tracking changes in source code during software development,
facilitating collaboration among developers.
Integrated Development Environments (IDEs) / Notebook Environments:
1. Jupyter Notebook: An open-source web application that allows you to create and share documents containing
live code, equations, visualizations, and narrative text. It's widely used for data cleaning and transformation,
numerical simulation, statistical modeling, data visualization, and machine learning.
2. Google Colab (Colaboratory): A free cloud-based Jupyter Notebook environment from Google that requires
no setup and runs entirely in the cloud, providing access to computing resources like GPUs and TPUs.
3. Visual Studio Code (with Jupyter Extension): A lightweight but powerful source code editor that, with the
Jupyter extension, provides rich support for Jupyter Notebooks directly within the editor, combining the
benefits of a code editor with the interactive capabilities of notebooks.

FAQs

What are the minimum requirements for the Data Science and Analytics labs?
The minimum requirements for the Data Science and Analytics labs are specific software and library versions. Python 3.11.5 or later is required. The following libraries must also be installed: Joblib (1.5.1), Matplotlib (3.10.3), NumPy (2.3.1), Pandas (2.3.1), Scikit-Learn (1.7.0), SciPy (1.16.0), Seaborn (0.13.2), and Streamlit (1.46.1), together with the Git version control system (2.43.0). Users also need Jupyter Notebook, Google Colab, or Visual Studio Code with the Jupyter Extension for coding and visualization.
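One way to confirm that an environment meets these minimums is sketched below. It is not part of the guide: the helper names (`parse_version`, `at_least`, `check_requirements`) are hypothetical, and the package names are assumed to be the usual PyPI distribution names. It uses only the standard library and does not check Git, which is a separate program rather than a Python package.

```python
import sys
from importlib import metadata

# Minimum versions copied from the requirements table in this guide.
# The keys are assumed to be the PyPI distribution names.
MINIMUM_VERSIONS = {
    "joblib": "1.5.1",
    "matplotlib": "3.10.3",
    "numpy": "2.3.1",
    "pandas": "2.3.1",
    "scikit-learn": "1.7.0",
    "scipy": "1.16.0",
    "seaborn": "0.13.2",
    "streamlit": "1.46.1",
}


def parse_version(text):
    """Turn '1.16.0' into (1, 16, 0), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, required):
    """True if the installed version is greater than or equal to the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.3' and '2.3.0' compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums):
    """Return {package: (installed_version_or_None, required_version, ok)}."""
    report = {}
    for name, required in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and at_least(installed, required)
        report[name] = (installed, required, ok)
    return report


if __name__ == "__main__":
    python_ok = sys.version_info[:3] >= (3, 11, 5)
    print(f"Python {sys.version.split()[0]}: {'OK' if python_ok else 'too old'}")
    for name, (installed, required, ok) in check_requirements(MINIMUM_VERSIONS).items():
        status = "OK" if ok else "missing or too old"
        print(f"{name:<14} installed={installed} required>={required} {status}")
```

The simple numeric comparison ignores pre-release suffixes such as `rc1`, which is adequate for a quick check but not a full version-specifier implementation.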
What is the focus of Lab 3 in the guide?
Lab 3 focuses on Exploratory Data Analysis (EDA) and Visualization. In this lab, learners develop skills to perform comprehensive EDA, summarizing datasets to understand their key characteristics. They also master the creation of informative data visualizations to identify patterns, detect anomalies, and test hypotheses. This lab emphasizes the importance of EDA in the data science workflow, providing foundational skills for analyzing and interpreting data effectively.
How does Lab 1 address Python programming fundamentals?
Lab 1 introduces Programming Fundamentals for Data Science and Analytics using Python. It covers essential concepts such as syntax, data types, control flow structures, functions, and object-oriented principles. The lab is designed to provide a solid foundation specifically geared towards applications in data science and analytics tasks. By the end of this lab, learners will have a foundational understanding of Python programming necessary for subsequent labs.
What techniques are used for handling missing values in the guide?
The guide outlines several techniques for handling missing values. These include dropping rows with missing data using the dropna() method, filling missing values with constants, means, medians, or modes, and employing forward or backward fill methods. Additionally, it suggests model-based imputation methods, such as KNN and regression imputation, to estimate missing values based on other features. The document emphasizes the importance of addressing missing values to maintain data quality.
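The simpler techniques listed above can be sketched with Pandas as follows. This is an illustrative example rather than code taken from the guide; the small DataFrame and its column names are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both a numeric and a text column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "city": ["Uyo", "Lagos", None, "Uyo", "Uyo"],
})

# 1. Drop every row that contains at least one missing value.
dropped = df.dropna()

# 2. Fill with a constant, chosen per column.
filled_const = df.fillna({"age": 0, "city": "unknown"})

# 3. Fill a numeric column with its mean or median.
filled_mean = df["age"].fillna(df["age"].mean())
filled_median = df["age"].fillna(df["age"].median())

# 4. Fill a categorical column with its most frequent value (the mode).
filled_mode = df["city"].fillna(df["city"].mode()[0])

# 5. Forward fill copies the last seen value down; backward fill copies
#    the next value up. A trailing gap stays missing after a backward fill.
ffilled = df["age"].ffill()
bfilled = df["age"].bfill()

print(dropped)
print(filled_mean.tolist())
print(ffilled.tolist())
```

For the model-based approach, scikit-learn's `KNNImputer` (in `sklearn.impute`) is one available option; whether the guide uses that class specifically is an assumption.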
What is the significance of feature scaling in the data preprocessing workflow?
Feature scaling is crucial in the data preprocessing workflow as it ensures that all numerical features contribute equally to machine learning models, particularly those sensitive to the scale of input data. The guide explains different scaling techniques, including standard scaling (Z-score normalization), min-max scaling, and robust scaling. Each technique has its advantages depending on the data distribution, helping to improve model performance and reduce errors during training.
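To make the arithmetic behind the three techniques explicit, here is a minimal standard-library sketch. The function names are hypothetical and the sample values are made up; in the labs themselves, scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler` would be the likely tools, though that is an assumption about the guide's code.

```python
import statistics


def standard_scale(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Subtract the median, divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Made-up feature with one extreme outlier (100).
data = [1, 2, 3, 4, 100]

print(standard_scale(data))
print(min_max_scale(data))
print(robust_scale(data))
```

With this data, min-max scaling squeezes the four ordinary values into the range 0 to about 0.03 because the outlier defines the maximum, while robust scaling keeps them spread between -1 and 0.5. The sketch assumes the feature is not constant; a constant feature would cause a division by zero in all three functions.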
What is the purpose of Lab 6 in the Data Science and Analytics Laboratory Guide?
Lab 6 serves as an introduction to Machine Learning, providing foundational knowledge about core concepts and workflows in the field. It covers the paradigm of supervised learning, explaining its subtypes such as regression and classification, and outlines the typical stages involved in a machine learning project. Learners gain insights into features, labels, training and test sets, and the significance of generalization, overfitting, and underfitting in model development.
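The idea of features, labels, and separate training and test sets can be illustrated with a small hold-out split. This is a standard-library sketch with a hypothetical function name and made-up data, not the guide's own code; scikit-learn provides `train_test_split` in `sklearn.model_selection` for the same purpose.

```python
import random


def simple_train_test_split(features, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices, then hold out a fraction of the rows for testing."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up regression data: one feature (hours studied), one label (score).
features = [[hours] for hours in range(1, 9)]
labels = [10 * hours for hours in range(1, 9)]

x_train, x_test, y_train, y_test = simple_train_test_split(features, labels)
print(len(x_train), "training rows;", len(x_test), "test rows")
```

A model is fitted on the training rows only and then scored on the held-out test rows. A large gap between training and test performance suggests overfitting, while poor performance on both suggests underfitting.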