Intake

Intake is a data cataloging system that provides a consistent API for accessing data in many formats (such as CSV, Parquet, and SQL databases) and from many locations (e.g., local files, S3 buckets, databases). It integrates with popular data science tools such as Pandas, Dask, s3fs, and Xarray.

Use: work with multiple datasets stored across different locations.

Here is their website: link

A full list of file types it can open is here: link

Intake-xarray

On its own, Intake cannot load Zarr, NetCDF, Rasterio, or OPeNDAP data. Intake-xarray is a plugin for the Intake library that adds drivers for these formats and loads the data into xarray containers. Xarray is a powerful library for working with labeled multi-dimensional arrays, commonly used in scientific and engineering applications.

Although Zarr is excellent for handling array data, Intake-xarray adds tools and flexibility for managing and accessing a wide variety of data formats in a more organized and efficient manner.

Documentation: link
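As a rough sketch of how a plugin driver slots into a catalog, the snippet below writes a small catalog whose source uses the `zarr` driver from Intake-xarray and then opens it lazily. The source name `dummy_zarr`, the store path `data/example.zarr`, and the file name `xarray_catalog.yaml` are placeholders invented for illustration, and the code assumes the classic (pre-2.0) Intake API with `intake` and `intake-xarray` installed.

```python
import tempfile
from pathlib import Path

# Hypothetical catalog entry: the source name and the store path are placeholders.
CATALOG_TEXT = """\
sources:
  dummy_zarr:
    description: "Hypothetical Zarr store for illustration"
    driver: zarr
    args:
      urlpath: "data/example.zarr"
"""


def write_catalog(directory=None):
    """Write the catalog text to xarray_catalog.yaml and return its path."""
    directory = Path(directory or tempfile.mkdtemp())
    path = directory / "xarray_catalog.yaml"
    path.write_text(CATALOG_TEXT)
    return path


def open_dummy_zarr(catalog_path):
    """Open the Zarr source lazily; assumes the store at urlpath exists."""
    # Imported here so the helpers above work without the plugin installed.
    import intake

    catalog = intake.open_catalog(str(catalog_path))
    # to_dask() is expected to return a lazily loaded xarray.Dataset.
    return catalog.dummy_zarr.to_dask()
```

Calling `open_dummy_zarr(write_catalog())` only succeeds if a Zarr store really exists at the placeholder path.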

Intake-parquet

Apache Parquet is a columnar storage file format designed for efficient data processing and storage. I need to learn more about them.

Intake-parquet is an Intake plugin for efficiently loading and managing Parquet files within the Intake framework.

Github repo: link
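A minimal sketch of what using the plugin might look like, assuming the classic (pre-2.0) Intake API with `intake` and `intake-parquet` installed. The helper names `catalog_entry` and `load_parquet`, and any path passed to them, are hypothetical and not part of either library.

```python
def catalog_entry(name, urlpath, description=""):
    """Build a catalog 'sources' entry that selects the parquet driver."""
    return {
        name: {
            "description": description,
            "driver": "parquet",
            "args": {"urlpath": urlpath},
        }
    }


def load_parquet(urlpath):
    """Load a Parquet file into a pandas DataFrame through Intake."""
    # Imported here so catalog_entry() works without the plugin installed.
    import intake

    # open_parquet is made available by the intake-parquet plugin.
    source = intake.open_parquet(urlpath)
    # read() loads everything into memory; source.to_dask() stays lazy.
    return source.read()
```

The dictionary returned by `catalog_entry` has the same shape as one entry under `sources:` in a catalog YAML file.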

Use Intake with MinIO

Create an Intake Catalog (catalog.yaml)

!cat data/catalog.yaml
sources:
  dummy_data:
    description: "Dummy data for testing"
    driver: csv
    args:
      urlpath: "s3://my-bucket-2/data/data.csv"
      storage_options:
        key: "minioadmin"
        secret: "minioadmin"
        client_kwargs:
          endpoint_url: "http://localhost:9000"
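
The catalog assumes that `data/data.csv` already exists in the `my-bucket-2` bucket. One possible way to put it there is sketched below with s3fs, reusing the credentials and endpoint from the catalog and the three rows printed at the end of this page. The helper names `make_dummy_csv` and `upload_dummy_csv` are made up for this example, and the upload only works if a MinIO server is actually listening on `http://localhost:9000`.

```python
import csv
import io

BUCKET = "my-bucket-2"
OBJECT_PATH = f"{BUCKET}/data/data.csv"

# Same values as storage_options in the catalog above.
STORAGE_OPTIONS = {
    "key": "minioadmin",
    "secret": "minioadmin",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}

ROWS = [("Alice", 25, "New York"), ("Bob", 30, "London"), ("Charlie", 35, "Paris")]


def make_dummy_csv():
    """Return the dummy table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "Age", "City"])
    writer.writerows(ROWS)
    return buffer.getvalue()


def upload_dummy_csv():
    """Create the bucket if needed and write the CSV to MinIO."""
    # Imported here so make_dummy_csv() works without s3fs installed.
    import s3fs

    fs = s3fs.S3FileSystem(**STORAGE_OPTIONS)
    if not fs.exists(BUCKET):
        fs.mkdir(BUCKET)
    with fs.open(OBJECT_PATH, "w") as f:
        f.write(make_dummy_csv())
```

Run `upload_dummy_csv()` once before opening the catalog. Hard-coded `minioadmin` credentials are only suitable for a local test server.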

Load Data Using Intake

import intake

# Load the catalog
catalog = intake.open_catalog('data/catalog.yaml')

# Show the catalog; use list(catalog) to list the dataset names
print(catalog)
<Intake catalog: data>
# Access the dummy data
dummy_data = catalog.dummy_data.read()

print(dummy_data)
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris