The healthcare company is a giant in the genomics research field and the healthcare sector, with services spanning all of India. Their expertise includes:
Due to the broad scope of their research model, they gather huge volumes of data from the sources listed below and beyond.
The organization pulls genomics data from External Research Databases:
Partnered Hospitals/Healthcare Institutions:
These file types represent different stages of the genomic analysis pipeline. The process starts with FASTQ files, which are aligned to a reference genome to produce BAM files; various analyses are then performed on the aligned data.
Current Total Monthly Data Ingestion: Approximately 15-20 TB per month
Data from all these sources is ingested into Amazon EKS using APIs provided by the third parties.
Amazon EFS is used to store various types of genomic data crucial for research and analysis. This includes large volumes of raw sequencing data in FASTQ files generated by sequencing machines, which must be accessed quickly and concurrently by multiple researchers and pods for quality control and preprocessing.
During sequence alignment, worker nodes and pods in the Amazon EKS cluster process these raw data files using tools like BWA and Bowtie, producing aligned BAM files.
These BAM files, along with intermediate results and reference genomes, must be immediately available to other nodes and pods for further processing and analysis, ensuring data consistency and eliminating the need for data duplication or transfer.
This data cannot be stored in Amazon EFS for an extended period due to the high cost and potential performance degradation over time, especially as the volume of data grows. Therefore, it is moved to Amazon S3 via an AWS DataSync job for more cost-effective, scalable, and durable long-term storage.
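A minimal sketch of how such a transfer could be started with boto3, assuming a pre-created AWS DataSync task (the task ARN and region below are placeholders):

```python
import boto3

# Hypothetical example: start an existing AWS DataSync task that copies
# data from the Amazon EFS location to the Amazon S3 location.
datasync = boto3.client("datasync", region_name="ap-south-1")

response = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:ap-south-1:111122223333:task/task-EXAMPLE"  # placeholder ARN
)
print("Started task execution:", response["TaskExecutionArn"])
```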
Dynamic provisioning allocates storage resources precisely when they are needed, based on the application's requirements. This avoids the inefficiencies of pre-allocating fixed storage sizes with static provisioning. Dynamic provisioning allows seamless adjustment of storage allocation without downtime or manual resizing, optimizing the entire volume provisioning process.
In this scenario, the uncertainty around how much storage would need to be provisioned made dynamic provisioning the ideal choice. Dynamic provisioning uses Kubernetes StorageClasses to automatically create Persistent Volumes (PVs) when Persistent Volume Claims (PVCs) are made.
For example, if an application needs a 100 GB volume, a PVC specifying this size can be created, and Kubernetes will automatically provision a new 100 GB PV using the defined StorageClass. This eliminates the manual creation and management of PVs, reducing administrative overhead and ensuring that storage is always available when needed.
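For illustration only, a PVC of this kind could be created with the official Kubernetes Python client; the PVC name, namespace, and the efs-sc StorageClass name are assumptions:

```python
from kubernetes import client, config

# Load cluster credentials from the local kubeconfig (inside the cluster,
# config.load_incluster_config() would be used instead).
config.load_kube_config()

# PVC requesting 100Gi against an EFS-backed StorageClass; the EFS CSI
# driver dynamically provisions a matching PV when this claim is made.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="genomics-data-claim"),  # assumed name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],   # EFS supports shared access from many pods
        storage_class_name="efs-sc",      # assumed StorageClass name
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="genomics",  # assumed namespace
    body=pvc,
)
```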
Ancrew installed the Amazon EFS CSI driver in the Amazon EKS cluster through the Amazon EFS CSI Driver add-on, which automated the installation of the necessary utilities and manages Amazon EFS volumes. Using this add-on, the dynamic provisioning solution was implemented.
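A hedged sketch of how the add-on could be installed programmatically with boto3; the cluster name and IAM role ARN are placeholders, and in practice this may also be done through the console, eksctl, or infrastructure as code:

```python
import boto3

eks = boto3.client("eks", region_name="ap-south-1")

# Install the Amazon EFS CSI driver as a managed EKS add-on.
eks.create_addon(
    clusterName="genomics-cluster",  # assumed cluster name
    addonName="aws-efs-csi-driver",
    serviceAccountRoleArn="arn:aws:iam::111122223333:role/EfsCsiDriverRole",  # placeholder
)
```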
Amazon S3: Amazon S3 is an object storage service that does not support POSIX-compliant file system semantics. It is designed for storing and retrieving large objects but does not provide the fine-grained control and real-time file access needed for computational tasks that require standard file operations.
Amazon EBS: Amazon EBS supports POSIX-compliant file systems but cannot be used in this case because a volume cannot be mounted on multiple Amazon EC2 instances (worker nodes). In addition, Amazon EBS volumes are AZ-specific and cannot share data across AZs.
Amazon EFS: Amazon EFS, by contrast, is POSIX-compliant, supporting the standard file system semantics required by bioinformatics tools.
Amazon EFS can be mounted on multiple Amazon EC2 instances across different AZs, providing shared, scalable, and distributed file storage that ensures data redundancy, high availability, and seamless access to shared data across the Amazon EKS cluster.
This makes it ideal for the high-performance, real-time data processing needs of the genomics analysis platform.
Data coming into Amazon S3 is categorized and stored in buckets with specific tags.
Users were given permissions based on their requirements, but only at the bucket level, not at the object level.
This increased the risk of users being able to see all the different kinds of data present in a folder.
Ancrew suggested implementing tag-based object access, allowing users to access only the required bucket, the required folder, and, within it, only the required objects, thus enforcing fine-grained access control all the way down to the object level; a sketch of such a policy follows.
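A minimal sketch of what such a tag-based, object-level policy could look like, created here with boto3; the bucket name, tag key and value, and policy name are illustrative and not the customer's actual values:

```python
import json
import boto3

# Illustrative policy: allow GetObject only on objects carrying a specific tag,
# e.g. data_type=raw_sequence. Bucket, tag key/value, and policy name are assumptions.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::genomics-data-bucket/*",
            "Condition": {
                "StringEquals": {"s3:ExistingObjectTag/data_type": "raw_sequence"}
            },
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="GenomicsRawSequenceReadOnly",
    PolicyDocument=json.dumps(policy_document),
)
```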
The Ancrew team wrote an AWS Lambda function to tag new uploads arriving in the Amazon S3 buckets through AWS DataSync.
This AWS Lambda function runs after the AWS DataSync job completes successfully and is triggered by Amazon EventBridge.
The function, granted permissions such as s3:PutObjectTagging and s3:ListBucket, extracts the bucket name and the folder prefix under which the data is stored in Amazon S3.
The objects follow unique naming conventions, such as processed_genome (pg) and raw_sequence (rs).
Using unique parameters such as the naming convention and the destination folder prefix, tagging logic was built so that objects landing in the Amazon S3 bucket are tagged accordingly.
The AWS Lambda function checks whether tagging is present on the objects in the bucket it is written for, and if not, it tags them based on parameters such as project_name and the naming convention, as sketched below.
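A minimal sketch of this tagging logic; the bucket, prefix, tag values, and prefix-to-tag mapping shown here are placeholders, while the real function extracts the bucket name and folder prefix as described above:

```python
import boto3

s3 = boto3.client("s3")

# Assumed mapping from filename prefixes (naming convention) to tag values.
NAME_TO_TAG = {"pg": "processed_genome", "rs": "raw_sequence"}

BUCKET = "genomics-data-bucket"  # placeholder; derived from the DataSync destination in practice
PREFIX = "incoming/"             # placeholder folder prefix


def lambda_handler(event, context):
    """Tag newly synced objects that do not already carry tags."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip objects that are already tagged.
            existing = s3.get_object_tagging(Bucket=BUCKET, Key=key)["TagSet"]
            if existing:
                continue
            # Derive the data_type tag from the naming convention, e.g. "pg_sample01.bam".
            short = key.split("/")[-1].split("_")[0]
            data_type = NAME_TO_TAG.get(short, "unclassified")
            s3.put_object_tagging(
                Bucket=BUCKET,
                Key=key,
                Tagging={"TagSet": [
                    {"Key": "data_type", "Value": data_type},
                    {"Key": "project_name", "Value": "genomics-research"},  # assumed tag value
                ]},
            )
```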
Ancrew also wrote an AWS Lambda function that runs daily to check whether the objects in the buckets have been tagged.
If any untagged objects are found, an Amazon SNS alert with the list of untagged objects is sent to the team so that immediate action can be taken (see the sketch below).
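A sketch of the daily audit function under the same assumptions; the bucket list and SNS topic ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

BUCKETS = ["genomics-data-bucket"]                                  # placeholder bucket list
TOPIC_ARN = "arn:aws:sns:ap-south-1:111122223333:untagged-objects"  # placeholder topic


def lambda_handler(event, context):
    """Daily audit: report any objects that are still untagged."""
    untagged = []
    for bucket in BUCKETS:
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                tags = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])["TagSet"]
                if not tags:
                    untagged.append(f"s3://{bucket}/{obj['Key']}")

    if untagged:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Untagged S3 objects detected",
            Message="\n".join(untagged),
        )
```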

"Enforce in-transit encryption for all clients"
Implemented Container Insights to obtain performance metrics for the Amazon EKS pods.
Important metrics such as pod restarts and node CPU/memory utilization were monitored through the Amazon CloudWatch automatic dashboards created when Container Insights was deployed on the Amazon EKS cluster.
Used Amazon CloudWatch metrics for Amazon EFS, such as DataReadIOBytes, PercentIOLimit, and PermittedThroughput, to gauge the performance of Amazon EFS (see the sketch below).
Similarly, Ancrew used Amazon CloudWatch metrics for monitoring Amazon S3, such as request metrics and replication metrics.
Important metrics used for AWS DataSync included BytesTransferred, BytesPreparedSource, and BytesWritten.
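As an example of how these metrics can be pulled programmatically, a minimal boto3 sketch querying PercentIOLimit for one file system (the file system ID is a placeholder):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-south-1")

# Pull the last 24 hours of PercentIOLimit for one EFS file system.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="PercentIOLimit",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=3600,  # one datapoint per hour
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```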
Suggested enabling API server logging for Amazon EKS (a sketch follows below).
Implemented Container Insights so that all pod logs are shipped to Amazon CloudWatch Logs.
This enabled central logging for all pods, nodes, and Amazon EKS components present in the environment.
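A minimal sketch of enabling EKS control plane logging with boto3; the cluster name and the exact set of log types enabled are assumptions:

```python
import boto3

eks = boto3.client("eks", region_name="ap-south-1")

# Enable control plane logging (API server, audit, and related logs) for the cluster.
eks.update_cluster_config(
    name="genomics-cluster",  # assumed cluster name
    logging={
        "clusterLogging": [
            {
                "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
                "enabled": True,
            }
        ]
    },
)
```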
Ancrew suggested using Amazon SNS topics to send customized alerts based on subscription filter policies set up for the Amazon EKS environment, reducing response time for error and interruption events.
An alert is sent to the concerned personnel whenever the AWS DataSync job is triggered by AWS Lambda.
Once the AWS DataSync job completes, another Amazon SNS notification is sent to the team.
Alerting was also implemented for when the AWS Lambda cleaning job starts or completes (see the sketch below).
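A hedged sketch of the SNS pattern described above: a topic, a filtered email subscription, and a publish call whose message attributes match the filter policy. The topic name, email address, and attribute values are illustrative:

```python
import json
import boto3

sns = boto3.client("sns", region_name="ap-south-1")

# Create (or look up) the alerts topic.
topic_arn = sns.create_topic(Name="genomics-platform-alerts")["TopicArn"]

# Subscribe an email endpoint that should only receive DataSync-related alerts.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="email",
    Endpoint="storage-team@example.com",  # placeholder address
    Attributes={"FilterPolicy": json.dumps({"event_source": ["datasync"]})},
)

# Publishing with matching message attributes delivers only to matching subscriptions.
sns.publish(
    TopicArn=topic_arn,
    Subject="DataSync job triggered",
    Message="The EFS-to-S3 DataSync job was started by AWS Lambda.",
    MessageAttributes={
        "event_source": {"DataType": "String", "StringValue": "datasync"}
    },
)
```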
Backups of Amazon S3 data were taken using Amazon S3 cross-region replication.
Once set up, this replication configuration runs continuously, automatically replicating new objects added to the source bucket to the destination bucket in the secondary region.
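A minimal sketch of such a replication configuration applied with boto3; the bucket names and replication role ARN are placeholders, and versioning must already be enabled on both buckets:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="genomics-data-bucket",  # placeholder source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-crr-role",  # placeholder role
        "Rules": [
            {
                "ID": "replicate-all-objects",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::genomics-data-bucket-dr"},  # placeholder
            }
        ],
    },
)
```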
For Amazon EBS volumes, AWS Backup was used to copy snapshots from ap-south-1 (Mumbai) to ap-south-2 (Hyderabad).
The backups were scheduled to be taken daily after business hours.
AWS Backup was configured to automatically delete the snapshots after one week, managing the backup lifecycle without additional scripts or manual intervention.
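A hedged sketch of an equivalent AWS Backup plan created with boto3; the vault names, schedule, and account ID are assumptions:

```python
import boto3

backup = boto3.client("backup", region_name="ap-south-1")

plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "ebs-daily-after-hours",
        "Rules": [
            {
                "RuleName": "daily-snapshot",
                "TargetBackupVaultName": "Default",            # assumed vault name
                "ScheduleExpression": "cron(30 15 * * ? *)",   # ~21:00 IST, after business hours
                "Lifecycle": {"DeleteAfterDays": 7},           # keep snapshots for one week
                "CopyActions": [
                    {
                        # Copy each snapshot to the vault in ap-south-2 (Hyderabad).
                        "DestinationBackupVaultArn": (
                            "arn:aws:backup:ap-south-2:111122223333:backup-vault:Default"
                        ),
                        "Lifecycle": {"DeleteAfterDays": 7},
                    }
                ],
            }
        ],
    }
)
print("Created backup plan:", plan["BackupPlanId"])
```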
Ancrew provided AWS Lambda scripts for Amazon EBS cost optimization that delete idle, unattached Amazon EBS volumes.
These scripts run monthly, giving unattached volumes a one-month window in case they still hold necessary data (see the sketch below).
Similar AWS Lambda scripts, written in Python, were scheduled for Amazon EC2 instances after testing hours.
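A minimal sketch of the unattached-volume cleanup logic, assuming a 30-day window; the region and the reliance on CreateTime as the age signal are simplifications:

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

# Volumes that have been unattached ("available") longer than this are deleted.
RETENTION = timedelta(days=30)


def lambda_handler(event, context):
    """Monthly cleanup of idle, unattached EBS volumes (illustrative sketch)."""
    cutoff = datetime.now(timezone.utc) - RETENTION
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]

    for vol in volumes:
        # CreateTime is the simplest age signal; a production version might
        # record the detach time in a tag before deciding to delete.
        if vol["CreateTime"] < cutoff:
            print("Deleting unattached volume:", vol["VolumeId"])
            ec2.delete_volume(VolumeId=vol["VolumeId"])
```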
The data transfer job from Amazon EFS to Amazon S3 runs throughout the day, and during non-business hours the temporary data must be cleaned up to maintain the performance of Amazon EFS.
For this, a data-cleaning AWS Lambda function is triggered weekly at a specific time by a cron schedule in Amazon EventBridge to clean the data from Amazon EFS.
A notification is sent to the concerned personnel, and the AWS Lambda function performs the cleanup.
Once the cleanup is done, another Amazon SNS notification is sent to let the team know that the job has completed.
This provides a one-week retention period for the temporary data and frees Amazon EFS of unnecessary temporary or intermediate result data.
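A minimal sketch of such a cleanup function, assuming the EFS file system is mounted into the Lambda function at /mnt/efs via an access point; the mount path and SNS topic ARN are placeholders:

```python
import os
import time

import boto3

sns = boto3.client("sns")

EFS_MOUNT = "/mnt/efs"                                         # assumed Lambda EFS mount path
TOPIC_ARN = "arn:aws:sns:ap-south-1:111122223333:efs-cleanup"  # placeholder topic
MAX_AGE_SECONDS = 7 * 24 * 3600                                # one-week retention for temp data


def lambda_handler(event, context):
    """Weekly cleanup of temporary/intermediate files on the EFS mount."""
    removed = 0
    cutoff = time.time() - MAX_AGE_SECONDS
    for root, _dirs, files in os.walk(EFS_MOUNT):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1

    # Notify the team once the cleanup has completed.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="EFS temporary data cleanup completed",
        Message=f"Removed {removed} files older than 7 days from {EFS_MOUNT}.",
    )
```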
Ancrew provided a robust, cost-effective, and secure infrastructure that facilitated advanced genomics research by automating storage provisioning, ensuring high availability and fault tolerance, optimizing costs, and enabling collaborative data access.
Ancrew helped address the organization's storage pain points while keeping compliance requirements and security in check.