Workflow-Aware Autonomous Data Lifecycle Management in IBM GPFS Using Bioinformatics Metadata-Driven ILM Policies
Abstract
Large-scale genomics pipelines generate petabyte-class datasets whose I/O access patterns are governed by discrete biological processing stages rather than by file recency or age. Conventional time-based Information Lifecycle Management (ILM) policies in IBM General Parallel File System (GPFS) are therefore structurally misaligned with these workloads, producing unnecessary I/O latency and avoidable storage cost. This paper presents the design and implementation of a workflow-aware, autonomous ILM policy engine for GPFS that replaces temporal predicates with bioinformatics workflow metadata as the primary data-placement trigger. The system comprises five tightly integrated components: a Genomics Workflow Metadata Store (GWMS), a Workflow Event Interceptor (WEI), a Metadata-to-Policy Translation Layer (MPTL), a GPFS ILM Policy Engine Interface (GPEI), and an Extended Attribute Annotation Subsystem (EAAS). Upon detection of a workflow stage transition event (spanning the Raw Sequencing, Alignment, Variant Calling, and Long-term Archive stages), the MPTL dynamically generates and submits scoped GPFS ILM MIGRATE policy rules driven by POSIX extended attribute predicates, autonomously moving genomics datasets across NVMe, SSD, HDD, and object storage tiers. A Tier Promotion Mechanism (TPM) handles reverse migration via RECALL policies upon pipeline re-activation. A configurable Stage-to-Tier Mapping Table encodes per-stage storage pool identifiers, compression settings, replication factors, and migration eligibility thresholds without requiring software recompilation. The system is validated across single-site HPC, multi-site federated, and hybrid cloud deployment configurations, and is shown to eliminate the two principal failure modes of time-based ILM, namely premature migration of active-stage data and delayed migration of completed-stage data, which systematically degrade I/O performance and inflate storage cost in production genomics environments.
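The translation step described in the abstract, from a workflow stage to a scoped MIGRATE rule, can be illustrated with a minimal sketch. It is not the authors' MPTL implementation. The one-to-one mapping of stages to tiers, the pool names, the attribute name `user.workflow.stage`, and the function `build_migrate_rule` are all assumptions made for illustration. The emitted text follows the general shape of a GPFS MIGRATE rule with an `XATTR` predicate and should be checked against the IBM policy-rule documentation before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSpec:
    pool: str       # storage pool name (placeholder, not from the paper)
    compress: bool  # whether to request compression on migration
    replicas: int   # data replication factor


# Hypothetical Stage-to-Tier Mapping Table. A real deployment would load
# this from configuration so that it can change without recompilation.
STAGE_TO_TIER = {
    "raw_sequencing": TierSpec("nvme", False, 2),
    "alignment": TierSpec("ssd", False, 2),
    "variant_calling": TierSpec("hdd", True, 2),
    "long_term_archive": TierSpec("object", True, 1),
}

# Assumed extended attribute written by the annotation subsystem.
STAGE_XATTR = "user.workflow.stage"


def build_migrate_rule(stage: str, fileset: str) -> str:
    """Return policy text that moves files tagged with `stage` to its pool."""
    if stage not in STAGE_TO_TIER:
        raise ValueError(f"unknown workflow stage: {stage!r}")
    if "'" in fileset:
        raise ValueError("fileset name must not contain a single quote")
    tier = STAGE_TO_TIER[stage]
    compress = "yes" if tier.compress else "no"
    return (
        f"RULE 'to_{stage}' MIGRATE COMPRESS('{compress}')\n"
        f"  TO POOL '{tier.pool}'\n"
        f"  REPLICATE({tier.replicas})\n"
        f"  FOR FILESET('{fileset}')\n"
        f"  WHERE XATTR('{STAGE_XATTR}') = '{stage}'"
    )


if __name__ == "__main__":
    # Example: a dataset in the hypothetical fileset 'project_a' has just
    # entered the Variant Calling stage.
    print(build_migrate_rule("variant_calling", "project_a"))
```

In this sketch the rule is scoped by fileset and selects files purely by their stage attribute, with no time predicate. Writing the text to a policy file and submitting it to the GPFS policy engine is deliberately left out.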
Citation Information
@article{tusharpathare2026,
title={Workflow-Aware Autonomous Data Lifecycle Management in IBM GPFS Using Bioinformatics Metadata-Driven ILM Policies},
author={Tushar Pathare and Sandeep Patil and Frank Lee},
journal={The Journal of Supercomputing},
year={2026},
doi={https://doi.org/10.21203/rs.3.rs-8987940/v1}
}
SinoXiv