Intake#
Intake is a data cataloging system that provides a consistent API for accessing data in a variety of formats (such as CSV, Parquet, and SQL databases) stored in diverse locations (e.g., local files, S3 buckets, databases). It integrates with popular data science tools like Pandas, Dask, s3fs, and Xarray.
Use: Work with multiple datasets stored across different locations
Here is their website: link
A full list of file types it can open is here: link
Intake-xarray#
Core Intake does not ship drivers that load Zarr, NetCDF, Rasterio, or OPeNDAP data into xarray objects. Intake-xarray is a plugin for the Intake library that adds them, so you can load data into xarray containers. Xarray is a powerful library for working with labeled multi-dimensional arrays, commonly used in scientific and engineering applications.
Although Zarr on its own is excellent for handling array data, Intake-xarray adds a catalog layer on top: Zarr stores and the other supported formats can be described once in a catalog and then opened through the same interface.
Documentation: link
Intake-parquet#
Apache Parquet is a columnar storage file format designed for efficient data processing and storage. I need to learn more about it.
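To make "columnar" concrete, here is a minimal pure-Python sketch of the difference between a row-oriented and a column-oriented layout. It is only an illustration of the idea, not how Parquet is actually implemented; the sample records are the ones from the dummy CSV used later in this notebook.

```python
# Row-oriented layout: each record is stored together (like lines in a CSV file).
rows = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "London"},
    {"Name": "Charlie", "Age": 35, "City": "Paris"},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Reading one column from the row layout has to visit every record ...
ages_from_rows = [row["Age"] for row in rows]

# ... whereas the columnar layout can hand back that column directly.
ages_from_columns = columns["Age"]

# Column-wise work (here, a mean) only touches the values it needs.
mean_age = sum(ages_from_columns) / len(ages_from_columns)
print(columns)
print(mean_age)
```

This is why columnar formats tend to suit analytical queries that read a few columns out of many.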
Intake-parquet is an Intake plugin for efficiently loading and managing Parquet files within the Intake framework.
GitHub repo: link
Use intake with MinIO#
Create an Intake Catalog (catalog.yaml)#
!cat data/catalog.yaml
sources:
  dummy_data:
    description: "Dummy data for testing"
    driver: csv
    args:
      urlpath: "s3://my-bucket-2/data/data.csv"
      storage_options:
        key: "minioadmin"
        secret: "minioadmin"
        client_kwargs:
          endpoint_url: "http://localhost:9000"
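The catalog above hard-codes the MinIO credentials. One possible alternative, sketched here with only the standard library, is to generate the same catalog text from environment variables. The variable names `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `MINIO_ENDPOINT` and the helper `render_catalog` are hypothetical, chosen for this illustration; the defaults are the values used in this notebook.

```python
import os

# Same structure as data/catalog.yaml, with placeholders for the secrets.
CATALOG_TEMPLATE = """\
sources:
  dummy_data:
    description: "Dummy data for testing"
    driver: csv
    args:
      urlpath: "s3://my-bucket-2/data/data.csv"
      storage_options:
        key: "{key}"
        secret: "{secret}"
        client_kwargs:
          endpoint_url: "{endpoint_url}"
"""


def render_catalog(key, secret, endpoint_url):
    """Fill the credential and endpoint placeholders in the catalog template."""
    return CATALOG_TEMPLATE.format(key=key, secret=secret, endpoint_url=endpoint_url)


# Hypothetical environment variable names; fall back to this notebook's values.
catalog_text = render_catalog(
    key=os.environ.get("MINIO_ACCESS_KEY", "minioadmin"),
    secret=os.environ.get("MINIO_SECRET_KEY", "minioadmin"),
    endpoint_url=os.environ.get("MINIO_ENDPOINT", "http://localhost:9000"),
)
print(catalog_text)
```

Writing `catalog_text` to `data/catalog.yaml` would then give a file with the same layout as the one shown above.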
Load Data Using Intake#
import intake
# Load the catalog
catalog = intake.open_catalog('data/catalog.yaml')
# Show the catalog object (list(catalog) would give the dataset names)
print(catalog)
<Intake catalog: data>
# Access the dummy data
dummy_data = catalog.dummy_data.read()
print(dummy_data)
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
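For comparison, the same file can probably be read without a catalog by passing the same storage options straight to pandas. This is a sketch under assumptions: it presumes `pandas` and `s3fs` are installed and the MinIO server from the catalog is running, and the helper names `minio_storage_options` and `read_csv_from_minio` are made up for this example. The actual read is left commented out because it needs the live server.

```python
def minio_storage_options(key, secret, endpoint_url):
    """Build the same storage_options mapping that the catalog entry uses."""
    return {
        "key": key,
        "secret": secret,
        "client_kwargs": {"endpoint_url": endpoint_url},
    }


def read_csv_from_minio(urlpath, storage_options):
    """Read a CSV object from S3-compatible storage into a pandas DataFrame."""
    import pandas as pd  # imported lazily; reading s3:// URLs also requires s3fs

    return pd.read_csv(urlpath, storage_options=storage_options)


options = minio_storage_options("minioadmin", "minioadmin", "http://localhost:9000")
print(options)

# Needs a running MinIO server, so the call is left commented out:
# df = read_csv_from_minio("s3://my-bucket-2/data/data.csv", options)
```

The catalog approach keeps these connection details in one YAML file instead of repeating them in every notebook.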