Nowadays, many customers have built their data lakes as the core of their data analytics systems. In a typical use case of data lakes, many concurrent queries run to retrieve consistent snapshots of business insights by aggregating query results. A large volume of data constantly comes from different data sources into the data lakes. There is also a common demand to reflect the changes occurring in the data sources into the data lakes. This means that not only inserts but also updates and deletes need to be replicated into the data lakes.

Apache Iceberg provides the capability of ACID transactions on your data lakes, which allows concurrent queries to add or delete records in isolation from any existing queries, with read consistency for queries. Iceberg is an open table format designed for large analytic workloads on huge datasets. You can perform ACID transactions against your data lakes by using simple SQL expressions. It also enables time travel, rollback, hidden partitioning, and schema evolution changes such as adding, dropping, renaming, updating, and reordering columns.

AWS Glue is one of the key elements of building data lakes. It extracts data from multiple sources and ingests your data into your data lake built on Amazon Simple Storage Service (Amazon S3) using both batch and streaming jobs. To expand the accessibility of your AWS Glue extract, transform, and load (ETL) jobs to Iceberg, AWS Glue provides an Apache Iceberg connector. The connector allows you to build Iceberg tables on your data lakes and run Iceberg operations such as ACID transactions, time travel, and rollbacks from your AWS Glue ETL jobs.

In this post, we give an overview of how to set up the Iceberg connector for AWS Glue and configure the relevant resources to use Iceberg with AWS Glue jobs. We also demonstrate how to run typical Iceberg operations on AWS Glue interactive sessions with an example use case.

With the Apache Iceberg connector for AWS Glue, you can take advantage of the following Iceberg capabilities:

- Basic operations on Iceberg tables – This includes creating Iceberg tables in the AWS Glue Data Catalog and inserting, updating, and deleting records with ACID transactions in the Iceberg tables.
- Inserting and updating records – You can run UPSERT (update and insert) queries for your Iceberg table.
- Time travel on Iceberg tables – You can read a specific version of an Iceberg table from table snapshots that Iceberg manages.
- Rollback of table versions – You can revert an Iceberg table back to a specific version of the table.

Iceberg offers additional useful capabilities such as hidden partitioning; schema evolution with add, drop, update, and rename support; automatic data compaction; and more. For more details about Iceberg, refer to the Apache Iceberg documentation.

Next, we demonstrate how the Apache Iceberg connector for AWS Glue works for each Iceberg capability based on an example use case.

Let's assume that an ecommerce company sells products on their online platform. Customers can buy products and write reviews for each product. Customers can add, update, or delete their reviews at any time. The customer reviews are an important source for analyzing customer sentiment and business trends.

In this scenario, we have the following teams in our organization:

- Data engineering team – Responsible for building and managing data platforms.
- Data analyst team – Responsible for analyzing customer reviews and creating business reports. This team queries the reviews daily, creates a business intelligence (BI) report, and shares it with the sales team.
- Customer support team – Responsible for replying to customer inquiries. This team queries the reviews when they get inquiries about the reviews.

Our solution has the following requirements:

- Query scalability is important because the website is huge.
- Individual customer reviews can be added, updated, and deleted.
- The data analyst team needs to use both notebooks and ad hoc queries for their analysis.
- The customer support team sometimes needs to view the history of the customer reviews.
- Customer reviews can always be added, updated, and deleted, even while one of the teams is querying the reviews for analysis. This means that any result in a query isn't affected by uncommitted customer review write operations.
- Any changes in customer reviews that are made by the organization's various teams need to be reflected in BI reports and query results.

To meet these requirements, we build a data lake of customer review data on top of Amazon S3 and introduce Apache Iceberg to enable adding, updating, and deleting records with ACID transactions and time travel queries.
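The Iceberg capabilities described above map onto ordinary Spark SQL statements once the connector and catalog are configured. The following is a rough sketch only, run inside a Glue Spark session; the catalog, database, and table names (`glue_catalog.reviews_db.customer_reviews`), the staging table `updated_reviews`, and the snapshot ID are hypothetical placeholders, not values from this post:

```sql
-- Basic operation: create an Iceberg table registered in the AWS Glue Data Catalog
CREATE TABLE glue_catalog.reviews_db.customer_reviews (
    review_id  bigint,
    product_id bigint,
    rating     int,
    comment    string)
USING iceberg;

-- UPSERT: update an existing review or insert a new one from a staging table
MERGE INTO glue_catalog.reviews_db.customer_reviews t
USING updated_reviews s
ON t.review_id = s.review_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: read the table as of a past snapshot
-- (snapshot IDs can be found in the table's metadata tables)
SELECT * FROM glue_catalog.reviews_db.customer_reviews
VERSION AS OF 1234567890;  -- placeholder snapshot ID

-- Rollback: revert the table to a previous snapshot
CALL glue_catalog.system.rollback_to_snapshot(
    'reviews_db.customer_reviews', 1234567890);
```

Because every statement runs as an ACID transaction against the table's snapshot history, the analyst and support teams can query consistent versions of the reviews even while writes are in flight.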