10.2 Pandas Explained

Key Concepts

Pandas is a powerful Python library for data manipulation and analysis. It provides data structures and functions needed to work with structured data efficiently. Key concepts include:

DataFrame
Series
Indexing and Selection
Data Cleaning
Grouping and Aggregation
Merging and Joining

1. DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used data structure in Pandas.

Example:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

Analogy: Think of a DataFrame as a spreadsheet where each column can contain different types of data.
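The terms "heterogeneous" and "size-mutable" in the definition above can be illustrated with a minimal sketch. The column name 'Senior' and the age threshold are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Heterogeneous: each column keeps its own data type
print(df.dtypes)

# Size-mutable: a new column can be added to an existing DataFrame
# ('Senior' is a hypothetical column used only for this example)
df['Senior'] = df['Age'] >= 30
print(df.shape)  # (3, 3)
```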

2. Series

A Series is a one-dimensional labeled array capable of holding any data type. Each column of a DataFrame is a Series.

Example:

import pandas as pd

data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

Analogy: Think of a Series as a single column in a spreadsheet.
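The "labeled" part of the definition is easier to see with an explicit index. The following is a small sketch; the labels 'a', 'b', 'c' and the variable name prices are arbitrary choices for illustration.

```python
import pandas as pd

# Pass index= to attach custom labels instead of the default 0, 1, 2, ...
prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(prices['b'])      # access by label: 20
print(prices.iloc[0])   # access by integer position: 10
print(prices.index.tolist())
```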

3. Indexing and Selection

Indexing and selection in Pandas allow you to access specific rows and columns of a DataFrame. You can select data by label (with .loc), by integer position (with .iloc), or with a boolean array.

Example:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Selecting a single column
print(df['Name'])

# Selecting multiple columns
print(df[['Name', 'City']])

# Selecting a row by label (the default index labels are 0, 1, 2)
print(df.loc[0])

# Selecting a row by integer position (the second row)
print(df.iloc[1])

Analogy: Think of indexing and selection as navigating through a spreadsheet to find specific cells or ranges of cells.
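The example above covers selection by label and by position. Selection with a boolean array might look like the following sketch, where the threshold of 28 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# A comparison on a column yields a boolean Series (one True/False per row)
mask = df['Age'] > 28

# Passing the mask in square brackets keeps only the rows marked True
older = df[mask]
print(older)

# .loc accepts a mask for rows together with a column label
names = df.loc[mask, 'Name']
print(names)
```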

4. Data Cleaning

Data cleaning involves handling missing data, removing duplicates, and correcting inconsistencies in the dataset. Pandas provides various functions to clean and preprocess data.

Example:

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', np.nan, 'Charlie'],
    'Age': [25, 30, np.nan, 35],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Chicago']
}

df = pd.DataFrame(data)

# Dropping missing values
df_cleaned = df.dropna()
print(df_cleaned)

# Filling missing values with 0 (applies to every column, including 'Name')
df_filled = df.fillna(0)
print(df_filled)

# Removing duplicate rows (this sample has no fully identical rows, so nothing is dropped)
df_unique = df.drop_duplicates()
print(df_unique)

Analogy: Think of data cleaning as tidying up a messy room by removing unnecessary items and organizing the rest.
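The example above handles missing values and duplicates but not the third task, correcting inconsistencies. One common case is text that differs only in case or surrounding whitespace. The sketch below uses made-up city values and is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical messy data: the same cities written in different ways
df = pd.DataFrame({'City': [' new york', 'New York ', 'CHICAGO', 'chicago']})

# Strip surrounding whitespace, then normalize capitalization
df['City'] = df['City'].str.strip().str.title()
print(df['City'].tolist())

# After normalization, duplicate rows can be detected and removed
deduped = df.drop_duplicates()
print(deduped)
```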

5. Grouping and Aggregation

Grouping and aggregation allow you to group data based on one or more columns and apply aggregate functions like sum, mean, count, etc.

Example:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}

df = pd.DataFrame(data)

# Grouping by Name and calculating the mean salary for each person
grouped = df.groupby('Name')['Salary'].mean()
print(grouped)

Analogy: Think of grouping and aggregation as organizing items in a store by category and calculating the total sales for each category.
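Several aggregate functions can also be applied in one pass. The sketch below reuses the salary data from the example and passes a list of function names to agg; the variable name summary is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Salary': [50000, 60000, 70000, 55000, 65000],
})

# One row per name, one column per aggregate function
summary = df.groupby('Name')['Salary'].agg(['sum', 'mean', 'count'])
print(summary)
```

The result is a DataFrame indexed by Name, so a single value can be read with, for example, summary.loc['Alice', 'sum'].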

6. Merging and Joining

Merging and joining allow you to combine multiple DataFrames based on common columns or indices. This is useful when working with relational data.

Example:

import pandas as pd

data1 = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}

data2 = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merging DataFrames on 'Name'
merged = pd.merge(df1, df2, on='Name')
print(merged)

Analogy: Think of merging and joining as combining two spreadsheets based on a common column, like merging customer information from two different sources.
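When the two DataFrames do not contain exactly the same keys, the how parameter controls which rows are kept. The sketch below also shows a join on indices. The name 'Dana' and her salary are invented so that the two tables differ.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
# 'Dana' is a hypothetical entry that has no match in df1
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'], 'Salary': [50000, 60000, 80000]})

# Default (how='inner'): only names present in both DataFrames
inner = pd.merge(df1, df2, on='Name')
print(inner)

# how='left': keep every row of df1; missing salaries become NaN
left = pd.merge(df1, df2, on='Name', how='left')
print(left)

# Joining on indices: move 'Name' into the index, then use join
by_index = df1.set_index('Name').join(df2.set_index('Name'), how='outer')
print(by_index)
```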