Understanding Data Loading in Data Science
Welcome back to the Data Science Series!
Once we understand the business problem and the constraints around it, the next major stage in any Data Science project is data understanding. Before we can explore or analyze anything, though, we first need to load the data into our runtime environment. That first step, getting data into our environment so we can work with it effectively, is what this chapter is about.
Data can come from many different sources, and loading it efficiently is essential for smooth analysis later. Let’s walk through the most common formats we work with and how we load them.
Reading CSV Files with Pandas
CSV (Comma-Separated Values) files are one of the simplest and most common formats for storing tabular data.
With Pandas, loading a CSV into a DataFrame is very straightforward:
import pandas as pd
df = pd.read_csv("data.csv")
CSV files are lightweight, easy to use, and perfect for quick experiments.
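In practice, real-world CSV files often need a little extra guidance. Here is a minimal sketch using a few common read_csv options; the semicolon delimiter, the missing header row, and the row limit are assumptions for illustration:

# Hypothetical file that uses ";" as the delimiter and has no header row
df = pd.read_csv(
    "data.csv",
    sep=";",          # column delimiter (the default is ",")
    header=None,      # tell Pandas the file has no header row
    nrows=1000,       # read only the first 1000 rows for a quick look
)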
Reading Excel Files with Pandas
Datasets often come in Excel format, especially when shared by business teams or collected manually. These files usually have the .xlsx extension.
df = pd.read_excel("data.xlsx")
This converts the sheet into a DataFrame, ready for exploration or cleaning.
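Workbooks often contain more than one sheet. Here is a small sketch, assuming the workbook has a sheet named "Sales" (reading .xlsx files also requires the openpyxl package to be installed):

# Read a specific sheet; "Sales" is a hypothetical sheet name
df = pd.read_excel("data.xlsx", sheet_name="Sales")

# Or read every sheet at once into a dictionary of DataFrames
sheets = pd.read_excel("data.xlsx", sheet_name=None)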
Reading JSON Files with Pandas
JSON stands for JavaScript Object Notation. It stores information as key-value pairs and is commonly used for APIs, logs, and nested data.
Pandas supports several ways to read JSON:
- Using pd.read_json()
- Using Python’s json module with pd.json_normalize()
- Creating a DataFrame with pd.DataFrame()
JSON is helpful when working with hierarchical or semi-structured datasets.
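Here is a minimal sketch of all three approaches; the file name "data.json" and the sample records are assumptions for illustration:

import json
import pandas as pd

# Option 1: read a JSON file directly into a DataFrame
df = pd.read_json("data.json")

# Option 2: load with the json module, then flatten nested fields
with open("data.json") as f:
    records = json.load(f)
df = pd.json_normalize(records)  # nested keys become dotted column names

# Option 3: build a DataFrame from a list of dictionaries
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
df = pd.DataFrame(records)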
Reading SQL Tables with Pandas Using SQLAlchemy
When data is stored in a relational database, we can load entire tables into a DataFrame using SQLAlchemy.
df = pd.read_sql_table("table_name", con=engine)
This reads an entire SQL table without writing raw SQL queries. Note that read_sql_table only accepts a SQLAlchemy connectable (an engine or connection), not a plain DBAPI connection.
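For completeness, here is a sketch that creates the engine first; the SQLite connection string and the table name are placeholders for your own database:

from sqlalchemy import create_engine
import pandas as pd

# The connection string format depends on your database (SQLite shown here)
engine = create_engine("sqlite:///example.db")

# Load the whole table (hypothetical name) into a DataFrame
df = pd.read_sql_table("table_name", con=engine)

# When you do need a custom query, read_sql works with the same engine
df = pd.read_sql("SELECT * FROM table_name LIMIT 100", con=engine)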
Reading Data from MongoDB
MongoDB is a NoSQL database that stores information in documents instead of tables. These documents are similar to JSON but stored in BSON format.
To load data from MongoDB into Pandas, we usually follow these steps:
- Import required modules
- Create a connection
- Access the database
- Access the collection
- Fetch documents
- Convert the cursor into a DataFrame
This allows us to work efficiently with semi-structured or unstructured data.
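Here is a minimal sketch of those steps using the pymongo driver; the connection URI, database name, and collection name are placeholders for your own setup:

from pymongo import MongoClient
import pandas as pd

# Connect to a local MongoDB instance (the URI is an assumption)
client = MongoClient("mongodb://localhost:27017/")
db = client["my_database"]          # access the database (hypothetical name)
collection = db["my_collection"]    # access the collection (hypothetical name)

# Fetch all documents and convert the cursor into a DataFrame
cursor = collection.find({})
df = pd.DataFrame(list(cursor))

# Each document carries MongoDB's _id field; drop it if it is not needed
df = df.drop(columns=["_id"])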
Final Thoughts
No matter where the data comes from, whether it is CSV, Excel, JSON, SQL databases, or MongoDB, we usually convert it into a Pandas DataFrame. Once the data becomes a DataFrame, the rest of the workflow becomes much smoother.
Different sources may look complicated, but they all lead to one destination. Once the data is in a DataFrame, our Data Science journey moves forward with clarity.
In the next chapter, we will explore how raw data begins its transformation into insights.