Skip to main content

File Handling in Python

In data science, being able to work with data from various external sources is crucial. Python provides built-in modules to handle files in formats like CSV, Excel, and JSON, which are widely used in data storage, exchange, and manipulation. This article will focus on using Python's standard libraries to read and write these file formats efficiently.

1. Working with CSV Files

CSV (Comma Separated Values) is a popular file format used to store tabular data. Python’s built-in csv module allows you to read from and write to CSV files.

Reading CSV Files

You can read a CSV file by opening it in read mode and using the csv.reader to parse the file.

import csv

# Open and read a CSV file
with open('data.csv', mode='r') as file:
csv_reader = csv.reader(file)
for row in csv_reader:
print(row) # Each row is a list of values

In this example, each line in the file is converted to a list, where each value corresponds to a column in the CSV file.

Writing CSV Files

To write data to a CSV file, use the csv.writer method. You can specify how the data is formatted, including custom delimiters if needed.

import csv

# Writing data to a CSV file
with open('output.csv', mode='w', newline='') as file:
csv_writer = csv.writer(file)

# Write the header
csv_writer.writerow(['Name', 'Age', 'City'])

# Write data rows
csv_writer.writerow(['Alice', 30, 'New York'])
csv_writer.writerow(['Bob', 25, 'Los Angeles'])

Custom Delimiters

If your CSV uses a delimiter other than a comma (e.g., a tab or semicolon), you can specify it using the delimiter parameter.

with open('data.csv', mode='r') as file:
csv_reader = csv.reader(file, delimiter='\t') # Tab-separated file
for row in csv_reader:
print(row)

2. Working with Excel Files

Although Excel files aren’t natively supported by Python, you can use the openpyxl library to read from and write to Excel files. This library is ideal for handling .xlsx files and allows for more advanced manipulation of Excel sheets.

Installing openpyxl

If you don’t have openpyxl installed, you can install it using pip:

pip install openpyxl

Reading Excel Files

You can use openpyxl to load an Excel workbook and select a specific sheet to work with.

from openpyxl import load_workbook

# Load the Excel workbook
wb = load_workbook('data.xlsx')

# Select a specific sheet by name
sheet = wb['Sheet1']

# Read a specific cell
print(sheet['A1'].value) # Accesses the value in cell A1

# Iterate over the rows and columns
for row in sheet.iter_rows(min_row=1, max_col=3, max_row=5):
for cell in row:
print(cell.value)

Writing to Excel Files

You can also create or modify Excel files by writing new data to a sheet.

from openpyxl import Workbook

# Create a new Excel workbook and add a sheet
wb = Workbook()
sheet = wb.active
sheet.title = "MySheet"

# Write data to cells
sheet['A1'] = 'Name'
sheet['B1'] = 'Age'
sheet['A2'] = 'Alice'
sheet['B2'] = 30

# Save the workbook to a file
wb.save('output.xlsx')

This code demonstrates creating a new Excel file and writing data into specific cells.


3. Working with JSON Files

JSON (JavaScript Object Notation) is a widely-used format for storing and exchanging data, particularly with APIs. Python's built-in json module allows you to easily convert between Python objects and JSON format.

Reading JSON Files

To read a JSON file, use the json.load() method, which parses the file and converts it into a Python dictionary.

import json

# Open and load the JSON file
with open('data.json', 'r') as file:
data = json.load(file)

# Access the data from the JSON file
print(data)

In this example, the data.json file is read into a Python dictionary, which allows you to access the key-value pairs.

Writing JSON Files

To write data to a JSON file, use json.dump(), which serializes Python objects into a JSON-formatted string.

import json

# Data to be written to the JSON file
data = {
"name": "Alice",
"age": 30,
"city": "New York"
}

# Write the data to a JSON file
with open('output.json', 'w') as file:
json.dump(data, file, indent=4) # indent=4 for pretty printing

This writes the data dictionary into a output.json file, and the indent=4 ensures that the JSON is nicely formatted.

Working with JSON from an API or URL

Often, you’ll encounter JSON data returned from APIs. You can use Python’s requests module to fetch data and load it as a JSON object.

import requests

# Fetch JSON data from an API
response = requests.get('https://api.example.com/data')
data = response.json()

# Access the data from the API
print(data)

This code fetches data from an API, which is automatically converted into a Python dictionary using the json() method.


Conclusion

File handling is a fundamental skill for data ingestion and storage in data science projects. Using Python’s built-in modules, you can easily read from and write to common file formats like CSV, Excel (using openpyxl), and JSON. Mastering file handling allows you to work with external datasets and prepare data for analysis, making it a crucial part of your data science toolkit.