
Dell Data Lakehouse
Dell Data Lakehouse Technical Solution Guide
A Modern Data Platform with an integrated data lakehouse built on Dell Hardware and a full-service software suite.
Abstract
This document describes the Dell Data Lakehouse, a revolutionary advancement in modern data platform architecture. Combining the functionalities of a fully integrated data lakehouse with Dell hardware and a comprehensive software suite, the Dell Data Lakehouse redefines the standards of data management. Departing from traditional methodologies, its distributed query processing approach enables seamless data federation for analytics with minimal data movement, while also offering centralized data estate management without compromising SQL performance. At its core lies the Dell Data Analytics Engine, augmented by Starburst technology, facilitating the discovery, querying, and processing of enterprise-wide data assets regardless of their physical locations. By significantly reducing data movement requirements and enhancing query efficiency, the Dell Data Lakehouse sets a new benchmark in data platform optimization and performance.
Table of contents
Executive Summary
Introduction
Business Challenges
Data Preparation Approaches
Dell Data Lakehouse
Partner Technology Overview
Dell Data Lakehouse Architecture
Dell Data Lakehouse Performance
Dell Data Lakehouse Sizing
Conclusion
References
The information in this publication is provided as is. Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. Copyright © 2024 Dell Inc. or its subsidiaries. Published in the USA March 2024 [H19955]. Dell Inc. believes the information in this document is accurate as of its publication date. The information is subject to change without notice.
Executive Summary
Overview
The Dell Data Lakehouse provides the best experience for a modern data platform: a fully integrated data lakehouse built on Dell hardware with a full-service software suite. Its distributed query processing approach enables organizations to federate data for analytics with minimal data movement. Alternatively, they can centralize their data estate and still benefit from performant SQL. The Dell Data Lakehouse employs the Dell Data Analytics Engine powered by Starburst, a unique query engine that enables the discovery, querying, and processing of all enterprise data, irrespective of location. The Dell Data Analytics Engine reduces data movement and enhances query performance and efficiency.
The Dell Data Lakehouse includes:
· A Lakehouse Compute cluster comprising compute hardware, Dell Data Analytics Engine software, and Dell Data Lakehouse System Software
· A Lakehouse Storage cluster
Audience
This document is intended for enterprises with data lakes, or a data lake strategy, that are interested in empowering their organizations to act more quickly, effectively, and efficiently on their data, as well as in modernizing to a data lakehouse. Audience roles include:
· Data and application administrators
· Data engineers
· Data scientists
· Hadoop administrators
· IT decision-makers
A data lakehouse can not only assist traditional analytics customers looking to modernize their data collection, but also help analytics teams get more value from their data or standardize their data for modern analytics workloads.
Revisions
Date: March 2024
Part Number/Revision: H19955
Description: Initial release
Note: This document may contain language from third-party content that is not under Dell Technologies' control and is not consistent with current guidelines for Dell Technologies' own content. When such third-party content is updated by the relevant third parties, this document will be revised accordingly.
Note: This document may contain language that is not consistent with Dell Technologies' current guidelines. Dell Technologies plans to update the document over subsequent future releases to revise the language accordingly.
We value your feedback
Dell Technologies and the authors of this document welcome your feedback. Contact the Dell Technologies team by email. Author: Kirankumar Bhusanurmath. Note: For links to other documentation for this topic, see dell.com/datamanagement.
Introduction
Dell Data Lakehouse
The Dell Data Lakehouse is a turnkey solution comprising the Dell Data Analytics Engine, a powerful federated and data lake query engine powered by Starburst; the Dell Lakehouse System Software, which provides lifecycle management; and tailor-made compute hardware, all integrated into one. For storing and processing large datasets in open file and table formats, Dell's leading S3 storage platforms such as ECS, ObjectScale, and PowerScale offer exceptional performance, reliability, and security.

The Dell Data Lakehouse is a revolutionary advancement in modern data platform architecture. By combining the functionalities of a fully integrated data lakehouse with Dell hardware and a comprehensive software suite, the Dell Data Lakehouse redefines the standards of data management. Departing from traditional methodologies, its distributed query processing approach enables seamless data federation for analytics with minimal data movement, while also offering centralized data estate management without compromising SQL performance.

At the core of the Dell Data Lakehouse lies the Dell Data Analytics Engine, facilitating the discovery, querying, and processing of enterprise-wide data assets regardless of their physical locations. By reducing data movement requirements and enhancing query efficiency, the Dell Data Lakehouse helps accelerate time to insight, and helps IT teams consolidate the most important data sets, for isolation or performance reasons, behind the scenes without disrupting end users.
Figure 1. Dell Data Lakehouse Diagram
Business challenges
Market environment
Rapid advancements in artificial intelligence (AI) and machine learning (ML) have transformed the business landscape. Given the growing popularity of hybrid cloud computing, businesses face the challenge of establishing an enterprise data platform that integrates with both on-premises infrastructure and public cloud providers. They also seek self-service analytics and data mesh principles in a cost-effective way. While these technologies offer exciting opportunities for innovation and data-driven decision-making, they also introduce complexities in storage, integration, and processing.
The types of data generated and collected by businesses have expanded. Organizations have historically dealt with structured data from traditional databases. Today, multinational organizations also deal with semi-structured and unstructured data, such as text, logs, images, audio, video, social media content, and sensor data that spans borders. These types of data often cannot be placed in the same repository as structured data, or even in the same region, due to sovereignty regulations.

Managing and analyzing this diverse data requires a modern data stack. The stack consists of data lakehouse storage and a lakehouse compute engine, the Dell Data Analytics Engine powered by Starburst, which federates queries across multiple data sources while maintaining performance. As data volumes and processing demands increase, traditional data management systems may struggle to scale efficiently to meet the growing needs of businesses. An enterprise-wide data management solution has become a must-have. However, integrating these disparate approaches can be complex and impeded by siloed data environments. These challenges require adopting new scalable and high-performance data management solutions that complement the traditional data stack.

Data mesh is an example of a newer data architecture that treats data as a product, with each domain or business unit responsible for managing and owning its data. By implementing a data mesh architecture, organizations can associate distinct data architecture designs across various hyperscalers or on-premises environments. Doing so enables each team to leverage the best technology for its domain while ensuring compatibility and cohesiveness at the company level. This modern data stack addresses the issue of data silos, streamlines the data-sharing process, and enables the organization to fully use its data assets. It encourages cooperation and innovation within diverse functional areas and minimizes costs.

Addressing these challenges requires adopting modern data management strategies, embracing innovative data technologies, and empowering organizations with the necessary tools and skills for effective data utilization. Businesses that navigate these complexities and demands successfully are better positioned to leverage data-driven insights and gain a competitive advantage in their respective markets.
Data analytics processing platforms
Broad data architectures have evolved over time to address large-scale data collection and analysis as increasingly diverse types of data are generated, and their uses have grown exponentially and diversified. See Data storage methodologies for analytics.
Figure 2. Data storage methodologies for analytics
NOTE: The graphic above was adapted from "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics," 11th Annual Conference on Innovative Data Systems Research, January 11-15, 2021, p. 2.
Data warehouse
The data warehouse was designed to handle massive amounts of information. Data warehouses were optimized for business intelligence and decision support, dealing primarily with well-defined and structured data. Using the foundational extract-transform-load (ETL) process, data could be presented to applications in the consistent manner that they require. ETL introduces processing overhead, however, and limits flexibility. Businesses adapted to data warehouse requirements because of the valuable insights that a unified collection of many structured datasets could deliver. Large organizations continue to use the data warehouse methodology: it provides a consolidated and structured repository for diverse data sources and data analysis, and it offers optimized query performance for complex analytics, helping deliver rapid insights to business stakeholders.
Data lake
Data lake technology came into existence driven by the evolving data landscape and the need for more flexible and scalable data storage and processing solutions. It has become popular for several reasons:
· A data lake is designed to store data from multiple sources and in multiple data structures (structured, semi-structured, and unstructured) in the same repository. A data lake also eliminates the need for data modeling at the time of ingestion, providing a unified platform capable of storing and managing diverse data types without predefined schemas.
· A data lake can use both cloud-based and on-premises storage options. It can scale horizontally to accommodate massive amounts of data, making it a cost-effective solution for storing and processing vast datasets. This scalability is important in the era of big data, where organizations need to handle data growth efficiently.
· A data lake integrates well with various big data processing frameworks and tools, such as Hadoop, Spark, and Trino. This capability enables complex data processing and analytics, making it attractive to organizations looking to leverage their existing infrastructure.
Data lake and data warehouse
To continue using data warehouse technology while maintaining the flexibility and scalability of data lake storage, organizations turned to using both a data lake and a data warehouse. These two architectures serve different needs and use cases. This two-tier approach gives organizations Atomicity, Consistency, Isolation, and Durability (ACID) guarantees along with dynamic and flexible operations on increasing quantities and varieties of data. Data lakes excel at handling diverse and raw data types, including structured, semi-structured, and unstructured data, enabling them to store large amounts of data without strict schemas. Data warehouses, in contrast, are optimized for querying and analyzing structured data while maintaining strong data governance and compliance features. This strategy, however, results in less reliability and the potential for stale data, and analytics users must now deal with two sources, which increases complexity.
Data lakehouse
A data lakehouse is a modern data architecture that combines the structure and performance of a data warehouse with the flexibility of a data lake. It is designed to provide a unified and scalable platform for storing, processing, and analyzing large volumes of structured, semi-structured, and unstructured data.
Traditionally, data lakes and data warehouses have been separate concepts with distinct purposes. A data lake is a centralized repository that stores raw and unprocessed data from various sources in its original format. It allows for the storage of diverse data types and provides flexibility for data exploration and analysis. A data warehouse, in contrast, is a structured and organized repository that stores processed and transformed data optimized for querying and analytics.

The data lakehouse concept emerged as a response to the limitations and challenges of these traditional architectures. It aims to combine the strengths of data lakes and data warehouses, bridging the gap between raw data storage and structured analytics. In a data lakehouse, data is stored in its raw form, but it is also curated and organized to enable efficient analytics. Advancements in certain technologies make a data lakehouse possible. These technologies include:
· Metadata layers for data lakes
· New query engine designs providing high-performance SQL execution on data lake storage
· Advanced analytics and machine learning tools

The primary benefits of adopting a data lakehouse can be summarized as:
· Flexible and simpler architecture--A data lakehouse provides flexibility in terms of storage solutions like object storage, S3, or HDFS. It provides fast queries without the need to copy or move data to meet BI requirements. This architecture minimizes data extracts, reduces the challenge of managing multiple copies of data, and streamlines cost while improving agility to support changing business requirements.
· Workload consolidation--A data lakehouse supports both BI and data science, enabling organizations to consolidate workloads. This consolidation eliminates the need for two separate platforms while maintaining an open architecture that interoperates with other tools.

Data lakehouse core architecture layers illustrates the core architectural layers of a data lakehouse.
Figure 3. Data lakehouse core architecture layers
Data preparation approaches
Data Pipelines
Data pipelines using Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) are two common techniques in data analytics. They prepare and move data from source systems to a target data warehouse or data store for analysis.

The ETL process is typically used in a data warehouse architecture. Data is first extracted from source systems and then goes through a transformation process that cleanses it and prepares it for analysis. Once the data has been transformed, it is loaded into a target data warehouse or data mart. This load process typically maps the transformed data into tables and structures in the target systems.

The ELT process, typically used in a modern data lakehouse, also starts with data extraction from source systems. Data is extracted and loaded into the target storage as is, with minimal transformation. The transformation step is performed directly on the target systems using integrated processing capabilities, without cleansing the data first. This approach takes advantage of the compute power of a data lakehouse to transform the data as needed during the analysis phase. The transformed data is stored in an open table format such as Apache Iceberg, Delta Lake, or Apache Hudi; the Dell Data Lakehouse is validated to support the Iceberg and Delta Lake table formats. Any data modeling technique can be used, such as the Medallion Architecture or One Big Table (OBT). ELT compared to ETL shows the differences between the ETL and ELT processes.
Figure 4. ELT compared to ETL

Both ETL and ELT methods can be used simultaneously, and each has its own benefits. The choice between them depends on factors such as data volume, data transformation complexity, and the capabilities of the data warehouse or storage system being used for analysis.
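To make the ELT pattern concrete, the following Trino-style SQL is a minimal sketch, not the exact pipeline used by the product. The catalog, schema, table, and column names (lake.raw.orders_raw, lake.curated.orders) are hypothetical.

```sql
-- Extract/Load: raw CSV data has already been landed in object storage
-- and exposed as the hypothetical table lake.raw.orders_raw.

-- Transform in place with a CTAS statement: cleanse and recast columns,
-- writing the result as a Parquet-backed, date-partitioned Iceberg table.
CREATE TABLE lake.curated.orders
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['order_date']
)
AS
SELECT
    CAST(order_id AS BIGINT)            AS order_id,
    CAST(customer_id AS BIGINT)         AS customer_id,
    CAST(order_ts AS DATE)              AS order_date,
    CAST(order_total AS DECIMAL(12, 2)) AS order_total
FROM lake.raw.orders_raw
WHERE order_id IS NOT NULL;
```

Because the transformation is a SQL statement executed by the lakehouse engine itself, no separate transformation infrastructure is needed between extract and load.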
Dell Data Lakehouse
Overview
In today's data-driven world, organizations face the challenge of efficiently managing and delivering value from vast and diverse datasets. To address this challenge, a comprehensive and robust data management solution is essential.
The Dell Data Analytics Engine is the core engine of the Dell Data Lakehouse. Powered by Starburst, it is a distributed query engine that serves as both a single point of access to all data across the enterprise and the data lakehouse query engine. The Dell Data Lakehouse leverages Dell's industry-proven PowerEdge R660 servers, tailor-made for lakehouse workloads and referred to as Dell Data Analytics Engine 660 (DDAE660) nodes. The entire stack is managed and orchestrated by the new Dell Data Lakehouse System Software, based on Kubernetes, which ensures maximum scalability, control, and simplified lifecycle management.
[Figure: the Dell Data Lakehouse solution stack comprises the Dell Data Analytics Engine powered by Starburst, the Dell Lakehouse System Software (with fully embedded Kubernetes), and Dell scale-out Lakehouse compute, backed by ECS, ObjectScale, or PowerScale storage and wrapped with Dell support, deployment, and professional services. Callouts highlight: simplified deployment by Dell experts to integrate the solution stack and connect to analytics applications of your choice; lower risk with advanced security features; improved visibility into cluster health via alerts and logs; peace of mind with easy upgrades and patches on the entire stack from Dell; quick scaling by adding or removing nodes; dedicated residency services to future-proof your IT environment; and reduced management effort with the Lakehouse System Software.]

Figure 5. Dell Data Lakehouse Compute Nodes and Lakehouse Software Components
The Dell Data Lakehouse delivers key benefits for the entire organization.
· Separate compute from storage, easily scale with more DDAE compute nodes--The cluster can be scaled easily from a two-worker-node minimum up to 20 nodes in the first release. Storage can be scaled independently based on the workload requirements.
· Out-of-the-box automation and orchestration for day-to-day IT Ops--The Dell Data Lakehouse comes with automation out of the box that helps automate installation and ongoing lifecycle management. The turnkey stack, when deployed by Dell experts, includes all software components, the management control plane, user management, and associated databases, including the Hive Metastore. Updates, security patches, and feature additions for any component within the Dell Data Lakehouse stack are delivered as a consolidated payload. This approach eliminates the complexity of separately sourcing drivers, firmware, software, patches, and updates.
· Reduced effort and cost of managing middleware--Middleware such as operating systems, virtualization or container runtimes, and orchestration tools is included. The turnkey solution delivers a data lakehouse in a box with everything built in, so IT teams do not have to incur the costs of managing multiple layers of the stack, while maintaining flexibility and openness in data formats.
· Best-in-class integration with Dell storage (ECS, ObjectScale, PowerScale)--Based on thorough validation and integration by Dell engineering teams, organizations receive an enhanced support experience when using the Dell Data Lakehouse with Dell storage.
· Single vendor for E2E transaction--Procurement teams can take advantage of Dell's broad portfolio and procure the complete solution, including all hardware (server, storage, and networking), software, deployment and support services, and curated professional services.
· Single vendor for E2E guidance across the lifecycle--Dell Services provide deep expertise throughout the lifecycle. From aligning a winning strategy, validating data, and quickly implementing a data platform to ensuring secure, optimized operations, trusted experts help accelerate time to value in effectively leveraging enterprise data to power AI projects.

Strategize--Establish a winning strategy to drive secure business outcomes:
· ProConsult Advisory for GenAI--Build a strategy and roadmap, gain consensus on high-priority use cases, and align a plan to achieve them.
· Accelerator Workshop for GenAI--A great first step.
· Advisory Services for GenAI Data Security--Help mitigate data security risks throughout the lifecycle.

Implement--Prepare data for GenAI and implement the platform:
· ProDeploy for Infrastructure--Deploy faster with less effort and more control.
· Accelerator Services for Dell Data Lakehouse--Implement a fully operational data lakehouse platform to accelerate AI and data analytics.
· Implementation Services for Data Preparation--Validate data sets and align data for use by AI.

Adopt and Scale--Improve operations and extend capabilities:
· Advisory Services Subscription for Data Analytics Engine--Provides a dedicated expert to maximize value from the data analytics engine within the lakehouse (powered by Starburst), delivered over a predefined period.
· ProSupport Infrastructure Suite--24/7 support from trained experts to maximize value from investments.
Dell Data Lakehouse Components
The Dell Data Lakehouse solution is made up of four key components:
1. Dell Data Lakehouse Compute Nodes (control plane, coordinator, worker)
2. Dell Data Analytics Engine powered by Starburst
3. Dell Data Lakehouse System Software
4. Dell Lakehouse storage cluster (fully validated with Dell ECS; compatible with other S3-compliant storage like Dell ObjectScale and Dell PowerScale)
Dell Data Lakehouse Compute Nodes
Built on the Dell PowerEdge R660 server (1U), the Dell Data Lakehouse compute cluster is tailor-made for data lakehouse workloads. The cluster of nodes is connected to data lake storage, as well as the rest of the external environment, through customer-provided network equipment. The compute nodes are of two types:
i. Control Plane Nodes: a fixed set of three nodes that run the essential management software and services, such as the Hive Metastore.
ii. Coordinator and Worker Nodes: based on a coordinator-worker architecture, these nodes run the Dell Data Analytics Engine and scale with additional worker nodes to handle complex or high-volume queries.
Figure 6. Dell Data Lakehouse Compute Nodes
Dell Data Analytics Engine powered by Starburst
The Dell Data Analytics Engine, powered by Starburst, is a fully supported, enterprise-grade distributed SQL query engine designed for high-performance analytics. It allows users to query large amounts of data stored in various data sources throughout an organization using standard SQL syntax. It delivers two key benefits for data and IT teams. First, it gives users the ability to query different data sources simultaneously, within the same query; these sources include relational databases, NoSQL databases, object storage systems, and more. Second, it can query data directly off a data lake storage system like Dell ECS, ObjectScale, or PowerScale. Such data may be stored in a variety of open formats, including Parquet, Avro, and ORC, with metadata in open table formats such as Iceberg and Delta Lake.

Data users can use this query engine to process data across multiple data systems and data sources. With the integrated query engine, administrators can implement a layer on top of data that abstracts away details of location, connectivity, language variations, and APIs. This layer of abstraction is critical to simplifying data analytics over a diverse set of data sources.
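A short sketch illustrates the federation capability described above. Assuming a hypothetical crm catalog backed by PostgreSQL and a lake catalog backed by Iceberg tables on Dell object storage (neither name comes from the product itself), a single standard SQL query can join across both:

```sql
-- Join a dimension table in an operational PostgreSQL database with
-- fact data in the Iceberg-backed lakehouse, in one federated query.
SELECT
    c.region,
    sum(o.order_total) AS total_sales
FROM crm.public.customers AS c
JOIN lake.curated.orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.region
ORDER BY total_sales DESC;
```

The engine pushes work down to each source where possible and combines the results, so no data needs to be copied into a central store before analysis.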
Key Benefits
Today, an enterprise data landscape consists of several data sources spread across on-premises data centers, cloud, and edge. Most data users spend a considerable amount of time locating data, gaining access to it, exploring it, and consuming it.
Data users can obtain different benefits from the Dell Data Lakehouse. For example:
· Data analysts, data scientists, and data engineers can:
  · Access and explore data using a high-performance query engine.
  · Leverage in-place exploration of data without having to move it into a centralized store like a data warehouse or data lake.
  · Reduce dependency on data engineering or IT teams to provision data in a centralized repository.
  · Use a simplified, single point of access to connect to various data sources.
· Data engineers can:
  · Create data products for frequently used data assets.
  · Explore, sample, and model data assets without creating data pipelines and moving data.
· Data stewards, data governance administrators, and data platform administrators can:
  · Define policies and ensure access control in a uniform manner. Access control at the row and column level is provided, along with data masking capabilities.
· The Dell Data Lakehouse supports both interactive and ETL workloads with fault-tolerant execution, ensuring queries do not have to be completely restarted in the event of failures.
· The Dell Data Lakehouse provides smart cache views with table redirection and materialized view functionality (sketched below).
· IT and infrastructure teams can:
  · Provision access to data quickly, without making users wait until data pipelines are set up.
  · Track data usage and use this insight to consolidate the highest-value, frequently used data into higher-performance, lower-latency data stores, like data lakes.
  · Flexibly scale and isolate workloads (BI, ad hoc, AI/ML, operational) for optimal performance.
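The materialized view functionality mentioned above can be exercised in standard SQL. The following is a minimal sketch in Trino-style SQL; the catalog, schema, and table names (lake.curated.orders, daily_sales) are illustrative, not part of the product.

```sql
-- Create a materialized view over a frequently run aggregation so the
-- engine can serve repeated queries from the stored result instead of
-- recomputing it from the base table each time.
CREATE MATERIALIZED VIEW lake.curated.daily_sales AS
SELECT
    order_date,
    count(*)         AS orders,
    sum(order_total) AS revenue
FROM lake.curated.orders
GROUP BY order_date;

-- Refresh after new data lands (or on a schedule).
REFRESH MATERIALIZED VIEW lake.curated.daily_sales;
```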
Dell Data Lakehouse System Software
This new system software is the central nervous system of the Dell Data Lakehouse. It simplifies lifecycle management of the entire stack, drives down IT OpEx with prebuilt automation and integrated user management, provides visibility into cluster health, ensures high availability, enables easy upgrades and patches, and lets admins control all aspects of the cluster from one convenient control center. Based on Kubernetes, it converts an otherwise DIY data lakehouse into an easy button for enterprises of all sizes.
The Dell Data Lakehouse offers three user interfaces tailored to different users' needs.
Dell Data Lakehouse System Software user interface
The Dell Data Lakehouse offers a web-based administrator user interface known as the Dell Data Lakehouse System Software UI. The Lakehouse System Software UI allows users to manage and monitor system features and settings remotely, from any location, over a network.
Dell Data Analytics Engine user interface
The Dell Data Analytics Engine user interface is powered by the Starburst Enterprise Platform (SEP), a commercial distribution of Trino.
Dell Data Lakehouse System Software - User Management user interface
Administrators use the Dell Data Lakehouse System Software's User Management interface to manage local users and LDAP-configured users. These users gain access to both the Dell Data Lakehouse System Software and Dell Data Analytics Engine interfaces through this management system.
Dell Lakehouse storage cluster
An object storage system is connected to function as data lake storage, storing large amounts of data in an array of industry-standard formats. Dell ECS has been fully validated with the Dell Data Lakehouse. Other S3-compliant storage, like Dell ObjectScale and Dell PowerScale, is also compatible with the Dell Data Lakehouse.
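In Trino-based engines such as the Dell Data Analytics Engine, object storage is typically exposed to queries through a catalog definition. The Dell management plane may handle this through its own UI, so treat the following as an illustrative sketch only. It uses open-source Trino SQL syntax and connector property names; the catalog name, metastore URI, and endpoint are placeholders, and credentials are omitted.

```sql
-- Illustrative sketch: register an S3-compatible Dell object store as an
-- Iceberg catalog. Property names follow the open-source Trino Iceberg
-- connector; endpoint and metastore values are placeholders.
CREATE CATALOG lake USING iceberg
WITH (
    "iceberg.catalog.type" = 'hive_metastore',
    "hive.metastore.uri" = 'thrift://metastore.example.internal:9083',
    "fs.native-s3.enabled" = 'true',
    "s3.endpoint" = 'https://ecs.example.internal:9021', -- S3 endpoint on the object store
    "s3.path-style-access" = 'true',                     -- common for on-premises S3 endpoints
    "s3.region" = 'us-east-1'                            -- placeholder region value
);
```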
Dell ECS
Dell ECS, the world's most cybersecure object storage, delivers unmatched scalability, performance, resilience, and economics. ECS delivers rich S3 compatibility on a globally distributed architecture, empowering organizations to support enterprise workloads such as AI, analytics, and archiving at scale. ECS customers have reduced TCO by up to 76% compared to public cloud.
Dell ObjectScale
ObjectScale is high-performance containerized object storage built for the toughest applications and workloads--Generative AI, analytics, and more. Innovate faster, at any scale, with a global namespace, strong S3 compatibility, and enterprise-class security that is ready on day one. Expanding its software-defined options, ObjectScale is now also available as the world's most powerful object storage appliance purpose-built for Kubernetes.
Dell PowerScale
The world's most flexible, secure, and efficient scale-out file storage.
Partner technology overview
Delta Lake from the Linux Foundation
Delta Lake has been a Linux Foundation project since 2019 and is an independent project controlled by a development community rather than any single technology vendor. The Dell Data Lakehouse enables reliable deployment and operation of Delta Lake in the solution. More than 150 developers from over 50 organizations, working across multiple storage repositories, are engaged daily to push the project's goals forward. Key Delta Lake capabilities include:
· ACID transactions--Ensure data consistency and reliability, isolating transactions at the strongest isolation level, serializable.
· Time travel and data versioning--Each data write to a Delta table creates a version number. This feature enables users to query a Delta Lake table as of a specific time. Users can view and revert to previous versions of the data using a timestamp or a version number.
· Scalable metadata--Leverages Spark's distributed processing power to easily handle all the metadata for petabyte-scale tables with billions of files.
· Schema evolution and enforcement--Perform automatic schema validation by checking against a set of rules to determine the compatibility of a write from a DataFrame to a table.
· DML operations--Support Data Manipulation Language (DML) operations like updates, deletes, and merges by using transaction logs. They enable easy handling of complex use cases like change data capture, slowly changing dimension (SCD) operations, and streaming upserts (see the sketch below).
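The DML capabilities above map directly to standard SQL. As a minimal sketch, assuming a Delta table registered in a hypothetical lake catalog, a MERGE statement performs the kind of upsert used for change data capture; Trino's Delta Lake connector supports MERGE, UPDATE, and DELETE.

```sql
-- Upsert change records into a Delta table (e.g., CDC or SCD handling).
-- Table and column names are hypothetical.
MERGE INTO lake.curated.customers AS t
USING lake.staging.customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET email = s.email, region = s.region
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, region)
  VALUES (s.customer_id, s.email, s.region);
```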
Apache Iceberg
Apache Iceberg is a new open-source, high-performance data table format designed for large-scale data platforms. Its primary goal is to bring the reliability and simplicity of SQL tables to big data while providing a scalable and efficient
way to store and manage data. This performance is especially important in the context of big data workloads. Iceberg makes it possible for engines such as Spark, Trino, Flink, Presto, Hive, and Impala to safely work with the same tables, simultaneously. It addresses some of the limitations of traditional data storage formats such as Apache Parquet and ORC.
Some key Apache Iceberg features and concepts include:
· ACID transactions--Iceberg tables guarantee atomic, consistent, isolated, and durable (ACID) transactions for write operations, ensuring data integrity.
· Catalog--A mechanism that houses metadata pointers for Iceberg tables.
· Table metadata--Maintains comprehensive metadata information about tables, including schema details, partitioning information, and datafile locations. The metadata layer consists of three categories:
  · Metadata files that define the table
  · Manifest lists that define a snapshot of the table
  · Manifests that define groups of datafiles that may be part of one or more snapshots
· Time travel--Supports the ability to query data at different points in time, enabling historical analysis and data rollback to specific versions.
· Schema evolution--Enables schema evolution by adding, renaming, or deleting columns without requiring expensive and time-consuming data migrations.
· Partitioning--Enables data to be partitioned into logical segments based on specific columns, such as date or region. Partitioning helps improve query performance by enabling efficient data pruning and reducing the amount of data scanned during queries. Partitioning and time travel are sketched in SQL below.
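As a short illustration of the partitioning and time-travel features above, the following Trino-style SQL is a sketch with hypothetical table and column names:

```sql
-- Partition an Iceberg table by date at creation time; queries that
-- filter on event_date can then prune entire partitions.
CREATE TABLE lake.curated.events (
    event_id   BIGINT,
    event_date DATE,
    payload    VARCHAR
)
WITH (partitioning = ARRAY['event_date']);

-- Time travel: read the table as of an earlier point in time.
SELECT count(*)
FROM lake.curated.events
FOR TIMESTAMP AS OF TIMESTAMP '2024-03-01 00:00:00 UTC';
```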
Future technologies
The Dell Data Lakehouse v1.0 is the first release of a new product offering that helps organizations accelerate insights from data. With an exciting roadmap ahead, and based on customer feedback, the solution will continuously evolve to deliver even better automation, performance, and security. Along with the product offering, Dell will also provide reference designs for integrating the Dell Data Lakehouse with other tools in the data ecosystem.
Dell Data Lakehouse architecture
Dell Data Lakehouse overall framework
Designed for today and tomorrow, the Dell Data Lakehouse architecture consists of the following layers:

Storage layer--This layer keeps all types of data in high-performance, scalable, and secure object storage. By integrating the other layers with this data lake storage, as shown in the figure below, client tools can query (through a query engine) objects residing directly on object storage, eliminating ETL.

Compute layer--This layer provides scale-out compute horsepower for running the distributed query engine as well as associated essential services, such as a metadata service (Hive Metastore).

Lakehouse System Software layer--While a data lakehouse stack can be assembled from individual pieces, such a stack is hard to deploy, maintain, and secure. The Lakehouse System Software delivers a managed compute experience with automated deployment and simplified lifecycle management of the entire lakehouse stack.

Distributed Query Engine layer--This layer pulls data from various sources, including databases, IoT devices, applications, logs, and more, and delivers it to the end user as a query outcome, or writes the data into the storage layer in an open format where it can be further used to transform data (ELT).
Metadata layer--The Distributed Query Engine layer relies on the metadata layer to understand the schema and layout of the data stored in the data lakehouse so that it can query and update efficiently. Commonly referred to as a metastore or catalog, this layer provides information on directory structure, file format, and metadata about the stored data, and it enables ACID compliance, data versioning, caching, indexing, and more. Increasingly popular open-source projects such as Iceberg and Delta Lake are now the de facto sources of metadata that define the structure of data stored in open file formats.

Consumption layer--This layer hosts different tools and applications, such as BI tools (for example, Tableau and Power BI), AI tools (Python, Spark, ML tools), and other data ecosystem tools for data cataloging, data security, access control, and so on. This layer is where users perform analytics activities, such as running SQL queries, BI, and data preparation for AI modeling, or govern data access, relying on data stored in the data lakehouse.

Dell Data Lakehouse layers illustrates the essential layers of a data lakehouse.
Figure 7. Dell Data Lakehouse layers
Infrastructure Overview
Dell provides the necessary infrastructure (compute, memory, storage, and network resources) for the platform. Dell Data Lakehouse infrastructure illustrates the required infrastructure components and their roles in the Lakehouse cluster.
Figure 8. Dell Data Lakehouse infrastructure
The Dell Data Lakehouse includes tailor-made nodes configured specifically to handle data lakehouse workloads. They come in two different configurations to support the control plane and the coordinator/worker node architecture. Individual node configurations include node-specific guidance according to their role. More guidance on cluster-level sizing and scaling is provided in Dell Data Lakehouse sizing.
Server Infrastructure
The server infrastructure provides compute, memory, and some of the storage resources that are required to run customer workloads. Dell Data Lakehouse comes with two types of nodes, control plane nodes and analytics engine nodes (coordinator and worker).
Details of the node configurations can be found in Table 1. Node configurations for the Dell Data Lakehouse.

Table 1. Node configurations for the Dell Data Lakehouse

Node type: Control Plane x 3 / Analytics Engine x N
Server model: Dell Data Analytics Engine Compute 660, based on the PowerEdge R660 rack server (both node types)
CPU (dual-socket): Intel Xeon Gold 5416S 2G, 16C/32T, 16GT/s, 30M Cache, Turbo, HT (150W) DDR5-4400 (both node types)
Total CPU cores (threads): 32 (64 hyperthreads) (both node types)
Memory: 128 GB at 4800 MT/s (Control Plane) / 256 GB at 4800 MT/s (Analytics Engine)
Data rate (MT/s): 4800 (running at 4400) (both node types)
Hard drives: 960 GB SSD, RAID 5 (Control Plane) / 480 GB SATA SSD, RAID 1 (Analytics Engine)
Network OCP (SFP): Intel E810-XXV Dual Port 10/25GbE SFP28, OCP NIC 3.0 (both node types)
Network PCIe (SFP): Intel E810-XXV Dual Port 10/25GbE SFP28 Adapter, PCIe Low Profile (both node types)
Storage Infrastructure
The Dell Data Lakehouse is fully validated with Dell ECS and is compatible with S3-compliant storage like Dell ObjectScale and Dell PowerScale. ObjectScale, ECS, and PowerScale are deployed as cluster-level systems. The node recommendations here can be used as guidance for new clusters, for verifying compatibility with existing clusters, or for expanding existing clusters.
ECS Node
Dell Technologies recommends using the ECS EX500 or ECS EXF900 node configuration for the primary S3 object lakehouse.
Details of the node configurations can be found in Table 2. ECS EX500 and EXF900 node configurations.

Table 2. ECS EX500 and EXF900 node configurations

Model: ECS EX500 / ECS EXF900
Chassis: 2U node / 2U node
Nodes per rack: 16 / 16
Node storage: 384 TB (twenty-four 16 TB NL-SAS drives) / 184 TB (twenty-four 7.68 TB NVMe drives)
Node cache: 960 GB SSD / N/A
Usable capacity per chassis: Slightly less than 384 TB / Slightly less than 184 TB
Front-end networking: Two 25 GbE (SFP28) / Two 25 GbE (SFP28)
Infrastructure (back-end) networking: Two 25 GbE (SFP28) / Two 25 GbE (SFP28)
The ECS EX500 configuration provides a good balance of storage density and performance for lakehouse usage. The ECS EXF900 configuration is an all-flash configuration that provides the highest performance for lakehouse usage.
Two Ethernet network ports per node are included for connection to the Cluster data network or an ECS storage network. Two additional network ports are included for connection to the ECS back-end network.
ObjectScale infrastructure
Dell Technologies recommends the configuration shown in Table 3. ObjectScale all-flash configuration for storage clusters that use ObjectScale for primary lakehouse storage with the s3a:// protocol.
Customers can use either of the following options to deploy ObjectScale: · Software-defined storage to deploy ObjectScale software on Red Hat OpenShift · An XF960 appliance that is based on the latest generation of Dell PowerEdge servers
Details of the machine functions can be found in Table 3. ObjectScale all-flash configuration.

Table 3. ObjectScale all-flash configuration

Platform: PowerEdge R760 server
Nodes per rack: 16
Chassis: 2.5" chassis with up to 24 NVMe direct drives, two CPUs
Chassis configuration: Riser configuration 3, half-length; two 2-channel full-height slots (Gen4), two 16-channel full-height slots (Gen5), and two 16-channel low-profile slots (Gen4)
Power supply: Dual, hot-plug, fully redundant (1+1) 1100 W power supplies
Processor: Intel Xeon Gold 6426Y 2.5 G, 16 C/32 T, 16 GT/s, 38 M
Memory capacity: 512 GB (sixteen 32 GB RDIMM, 4800 MT/s, dual rank)
Internal RAID storage controllers: C30, no RAID for NVMe chassis
Disk NVMe: Twenty-four 6.4 TB enterprise NVMe mixed-use agnostic drives, U.2
Boot-optimized storage cards: BOSS-N1 controller card with two M.2 960 GB SSDs (RAID 1)
Network interface controllers: NVIDIA ConnectX-6 Lx dual-port 10/25 GbE SFP28 adapter, PCIe low profile
Node storage: 153.6 TB (twenty-four 6.4 TB NVMe drives)
Front-end networking: Two 25 GbE (SFP28)
The ObjectScale configuration is an all-flash configuration that provides the highest performance for lakehouse usage. Two Ethernet network ports per node are included for connection to the Cluster data network or an ECS storage network.
PowerScale infrastructure
Dell Technologies recommends the configuration shown in Table 4. PowerScale configuration for clusters that use PowerScale for their primary lakehouse storage using both S3 and HDFS protocols.
Details of the storage cluster configuration can be found in Table 4. PowerScale configuration.

Table 4. PowerScale configuration

Model: PowerScale H7000 (hybrid)
Chassis: 4U chassis
Nodes per chassis: Four
Node storage: Twenty 12 TB 3.5-inch SATA hard drives (4Kn native sector size)
Node cache: Two 3.2 TB SSDs
Usable capacity per chassis: 600 TB
Front-end networking: Two 25 GbE (SFP28)
Infrastructure (back-end) networking: Two InfiniBand QDR or two 40 GbE (QSFP+)
Operating system: OneFS 9.5.0.2
The recommended configuration is sized for typical usage as lakehouse storage.
Two Ethernet network ports per node are included for connection to the Cluster data network or a PowerScale storage network. Two additional network ports are included for connection to the PowerScale back-end network. These additional ports can be either InfiniBand QDR or 40 GbE, depending on on-site preferences.
One PowerScale H7000 chassis supports four PowerScale H7000 nodes. This configuration provides approximately 720 TB of usable storage. At 85% utilization, 600 TB of storage is a good guideline for available storage per chassis.
This configuration assumes that the PowerScale nodes are primarily used for lakehouse storage. If the PowerScale nodes are used for other storage applications or clusters, those applications must be taken into account in the overall cluster sizing. Other PowerScale H7000 drive configurations may also be used.
Network infrastructure
The network is designed to meet the needs of a high-performance, scalable cluster while providing redundancy and access to management capabilities. The architecture is a leaf-spine model based on Ethernet networking technologies. It uses PowerSwitch S5248F-ON switches for the leaves and PowerSwitch Z9432F-ON switches for the spine. Physical networking in this architecture is straightforward because most of the advanced capabilities of the system are implemented using software-defined networking. The logical network is described in the Container platform implementation. This architecture has three physical networks, as shown in Physical network infrastructure.
iDRAC (or BMC) network
The iDRAC (or BMC) network is isolated. For utmost security, there is no external connectivity to the iDRAC interfaces. All BMC activity is secured within the Dell Data Lakehouse software stack and happens through the Dell Data Lakehouse OS directly accessing the iDRAC via a passthrough mechanism.
Cluster data network
The Cluster data network is the primary network for internode communication between all server and storage nodes. Each server node is assigned a single IP address on this network.
Core data center network
The Core data center network is the existing enterprise network. The Cluster data network interfaces with this network through switching and routing, allowing cluster services to be exposed to system users.
Figure 9. Physical network infrastructure
Management network fabric
The management traffic and the data traffic share a unified network using two 25 Gbps connections per node, configured into a link aggregation group.
Cluster network fabric
The Cluster network uses a scalable, resilient, nonblocking fabric with a leaf-spine design, as shown in Cluster data network connections. Each node on this network is connected to two S5248F-ON leaf switches with 25 GbE network interfaces. The switches run Dell SmartFabric OS10, which enables multilayered disaggregation of network functions layered on an open-source Linux-based operating system.

On the server side, the two network connections are bonded and assigned a single IP address. On the switch side, the network design employs a Virtual Link Trunking (VLT) connection between the two leaf switches. VLT technology enables a server to uplink multiple physical trunks into more than one S5248F-ON switch by treating the uplinks as one logical trunk. In a VLT environment, a connected pair of switches acts as a single switch to a connecting server while all paths remain active, making it possible to achieve high throughput while still providing resiliency against hardware failures. VLT replaces Spanning Tree Protocol (STP)-based networks, providing both redundancy and full bandwidth utilization through multiple active paths.

The VLT configuration in this design uses four 100 GbE ports between each Top of Rack (ToR) switch. The remaining 100 GbE ports can be used for high-speed connectivity to spine switches, or directly to the data center core network infrastructure.

The illustration below is Dell-specific. The Dell Data Lakehouse solution also allows a bring-your-own network; in that case, the network architecture must be comparable to the Dell network architecture illustrated below.
Figure 10. Cluster data network connections
Backup and Restore
Overview
The backup and restore process is essential for preserving data integrity during upgrades, ensuring business continuity, and mitigating risks associated with data loss or corruption. The Dell Data Lakehouse offers robust backup capabilities through Dell ProSupport, enabling customers to protect their valuable data assets effectively and maintain operational resilience.
Backup
During the initial Dell Data Lakehouse installation, with the help of Dell ProDeploy personnel, customers can trigger a backup instantly or customize the backup frequency to align with their organizational policies, ensuring regular backups that mitigate the risk of data loss. Backups are currently supported on NFS export and CIFS share locations, and customers must provide a reliable backup location during the installation process. Backups can optionally be encrypted for even greater security.

The backup functionality comprehensively captures critical data elements, including database schemas, content repositories, configuration settings, and user profiles. Because this operation can run on any node in the lakehouse cluster, the IP addresses of all nodes must be allowed access to the NFS export or CIFS share, and no firewall may block access to the backups on the NFS server. Finally, these backups are timestamped, so sufficient capacity must be available in the NFS or CIFS location to enable flexible data retention.
Restoration
Restoring the Dell Data Lakehouse is streamlined and efficient, allowing customers to select the desired backup point and initiate the process with the help of ProSupport. A successful restore depends on the availability of the latest backup files, system compatibility, and data integrity. By adhering to these dependencies and using the Dell ProSupport service, customers can efficiently recover their data and maintain operational continuity.
Dell Data Lakehouse Performance
Dell Data Lakehouse performance was benchmarked with the cluster configuration described below. The specification of the nodes is described in the node configuration details table (Table 1).
Compute cluster:
1. Three Control Plane Nodes
2. One Coordinator Node
3. Six Worker Nodes

Storage cluster:
1. ECS EX500, eight nodes
Test Setup
In this section, we describe the test setup, including the queries used, how the data was generated, and a summary of the methodology followed.
Test Data and Queries
Performance validation uses queries derived from the TPC-DS benchmark. The TPC-DS benchmark data is modeled on the decision support functions of a retail product supplier and consists of seven (7) fact tables and seventeen (17) dimension tables. We use a scale factor of 1000, which corresponds to 1 TB of data. The 99 queries are divided into four broad classes:
· Reporting queries
· Ad-hoc queries
· Iterative OLAP queries
· Data mining queries
Test Data Generation
The TPC-DS dataset is generated in CSV format using the TPC-DS dsdgen utility. The dataset is stored in an ECS bucket; during dataset creation, the bucket is mounted as an NFS mount on the host where the dsdgen utility runs. A SQL Create Table As Select (CTAS) statement is then used to convert the generated CSV data and write it to ECS in Iceberg table format and Parquet file format using the Iceberg connector. The data is generated in both partitioned and non-partitioned layouts, partitioned by date on all seven (7) fact tables. The partitioned data is further compacted for higher read performance.
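As a hedged sketch of the CTAS conversion described above, the statement below converts a raw CSV fact table into a date-partitioned Iceberg/Parquet table. The catalog and schema names (hive.tpcds_csv, iceberg.tpcds_sf1000) are placeholders, not the names used in the actual test environment; ss_sold_date_sk is the standard TPC-DS date key on store_sales.

```sql
-- Convert a raw CSV fact table (exposed through a Hive-style catalog)
-- into a date-partitioned Iceberg table stored as Parquet on ECS.
CREATE TABLE iceberg.tpcds_sf1000.store_sales
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['ss_sold_date_sk']
)
AS
SELECT *
FROM hive.tpcds_csv.store_sales;
```

Trino also ships a built-in tpcds connector that can generate the same schema directly, which can be a convenient alternative to dsdgen for engine-level experiments.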
Test Methodology
The Apache JMeter test suite is used to run the benchmarking tests. The Starburst Iceberg connector is used to connect to the TPC-DS data stored in Dell ECS.
Tests executed:
· Execute 3 iterations of the 99 queries for each of the test conditions mentioned below.

Metrics collected:
· Total execution time = total time taken to execute the 99 TPC-DS queries in each of the 3 runs
· Per-query execution time = time taken to execute each of the 99 TPC-DS queries
· Number of queries in 10 minutes = number of queries executed in 10 minutes

Metrics computed:
· Average execution time for 99 queries = average of the execution times to run all 99 queries across the 3 runs
· Average execution time = average execution time for each query across the 3 runs
Observations Made
· Variation in the number of queries executed in 10 minutes for the partitioned and non-partitioned datasets.
· Variation in the average execution time for 99 queries, and the average execution time per query, with an increasing number of concurrent users (1, 5, 10).

Each concurrent thread corresponds to a user and executes the 99 TPC-DS queries in their entirety, independently of other test threads. For example, 10 concurrent users run 10 parallel sessions of 99 queries each. Multiple threads are used to simulate concurrent users of the Dell Data Analytics Engine. Apache JMeter thread groups are used to implement these concurrent users.
The results below illustrate three key findings:
a. The Data Analytics Engine did not return any failures and did not require rewrites to any SQL query, indicating strong compliance with industry-standard ANSI SQL.
b. Partitioned tables are more performant than non-partitioned tables; partitioning tables on the Dell Data Lakehouse is recommended for best performance.
c. As user concurrency increases, the performance of the cluster decreases (fewer queries executed per 10 minutes). This means the worker nodes need to be scaled to support higher concurrency, as shown in the next section.
Figure 11. Number of TPC-DS queries executed every 10 minutes against Partitioned and non-partitioned 1TB (SF1K) Dataset.
NOTE: This document will soon be updated to provide performance results across a range of worker nodes in the cluster aligned to the sizing recommendations in this document. Similar results were published earlier in a Reference Architecture available at Dell Infohub. Additionally, public facing benchmarks for Starburst Enterprise, the technology powering the Dell Data Analytics Engine, can be found with a quick web search.
Dell Data Lakehouse sizing
Cluster sizing and scaling are important to understand when architecting the Dell Data Lakehouse solution. Sizing is concerned with ensuring the cluster meets the workload requirements for storage and processing throughput. Scaling is concerned with cluster growth over time as capacity needs increase.
Sizing and scaling overview
The Dell Data Lakehouse architecture is a parallel scale-out system with decoupled compute and storage. Some sizing requirements can be addressed through scaling, while others must be addressed through node-level sizing. Sizing and scaling of a cluster are complex topics that require knowledge of the workloads. This section highlights the main considerations involved but does not provide detailed recommendations for workload sizing. Design guides for specific workloads running on the platform include workload-specific sizing guidance. A Dell Technologies or authorized partner sales representative can help with detailed sizing calculations. Many parameters are involved in cluster sizing. The primary parameters are:
Data volumes and growth rates
Data volume and its growth rate have a significant impact on cluster sizing. The cluster must be sized with sufficient memory to process any given query at a given user concurrency. This memory requirement may grow as queries become larger or as underlying datasets grow (for example, as additional months or years of data are added). Data ingestion also impacts network utilization: because the lakehouse storage is external to the cluster nodes, network bandwidth is required to access it. The processing throughput requirements must be considered as well as the data size.
Memory and processor capacity
Memory and processor requirements for jobs running on the cluster must be considered when sizing. Memory and processor capacity increases as nodes are added to the cluster.
Service-level agreements
Production cluster sizing must meet any performance requirements that SLAs specify. Critical-path jobs that must meet a specific execution time or throughput may require adjusting the cluster sizing and the balance between compute and storage accordingly. Overall cluster throughput is as important as storage capacity, and often influences the number of nodes independently of the required storage capacity.
Storage capacity
While the Dell Data Lakehouse compute cluster does not include object storage, it relies on external data lakehouse storage that must be sized in step with the compute cluster. Sizing the storage capacity relies on understanding how much raw or fresh data is expected to be ingested into the data lakehouse storage, as well as how much data will be materialized from other data sources. In addition to raw capacity, it is also important to understand the storage performance needed to meet SLAs. The available network bandwidth between the compute and storage clusters must also be considered. Bandwidth on the storage and compute clusters scales in direct proportion to the number of nodes. However, dense storage configurations are possible with ECS, ObjectScale, and PowerScale, and such density can result in a large storage capacity without enough bandwidth to support the lakehouse data transfer requirements. An analysis of workload data transfer requirements is necessary to correctly size the storage for both capacity and bandwidth. The network architecture allows both compute and storage clusters to use the same fabric. This configuration enables the network bandwidth to scale as either storage or worker nodes are added. The bandwidth available to external storage systems should also be considered when referencing external storage that is not connected to the core cluster data network.
The good news is that the processing capacity of the lakehouse is decoupled from storage and can be adjusted at any time during the life of the cluster.
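As a simple illustration of the capacity-versus-bandwidth point above, the following sketch converts an assumed hourly scan volume into the sustained network bandwidth it implies; all of the numbers are illustrative assumptions.

def required_bandwidth_gbps(scanned_tb_per_hour: float) -> float:
    """Convert an hourly scan volume into sustained bandwidth in Gbit/s."""
    bytes_per_second = scanned_tb_per_hour * 1e12 / 3600
    return bytes_per_second * 8 / 1e9

scan_rate_tb_per_hour = 40.0  # assumed scan-heavy workload
needed = required_bandwidth_gbps(scan_rate_tb_per_hour)
available = 4 * 2 * 25        # e.g., 4 storage nodes x dual 25 GbE ports
print(f"need {needed:.0f} Gbit/s, have {available} Gbit/s")

In this example the capacity could fit on far fewer dense nodes, but the roughly 89 Gbit/s of sustained demand shows why the storage node count is often set by bandwidth rather than by raw capacity.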
Sizing guidelines
Cluster sizing
The following table lists cluster-level starting points for possible deployments.

Table 4. Example cluster configurations

Configuration Size      Control Plane   Coordinator   Worker
PoC/Demo                3 nodes         1 node        2 nodes
Entry                   3 nodes         1 node        3 nodes
Growth                  3 nodes         1 node        6 nodes
Enterprise              3 nodes         1 node        11 nodes
Advanced Enterprise     3 nodes         1 node        16 nodes
Cluster Size PoC/Demo
The proof of concept (POC) configuration is a minimal configuration for basic evaluation. In this scenario, three control plane nodes host the Dell Data Lakehouse System Software, and one coordinator node and two worker nodes run the analytics engine. This configuration provides limited resources for workloads but is adequate for basic functionality evaluation. Engage the Dell Technologies Customer Solution Centers for a demo of the solution, or work with your Dell account representative for a POC.
Cluster Size Entry or Growth
The Entry configuration is the smallest production-grade configuration. In the Entry and Growth clusters, three dedicated control plane nodes host the Dell Data Lakehouse System Software, and 3 to 6 worker nodes (plus one coordinator node) run the Data Analytics Engine. Dell Technologies recommends these configurations for preproduction or development and test usage. They provide enough resources to support one or two teams running analytics workloads. A PoC/Demo cluster can be scaled to an Entry or Growth cluster simply by adding worker nodes.
Cluster Size Enterprise and Advanced Enterprise
The Enterprise and Advanced Enterprise configurations are medium to large production-grade configurations. As with Entry and Growth, three control plane nodes host the Dell Data Lakehouse System Software, and 11 to 16 worker nodes (plus one coordinator node) run the Data Analytics Engine. These configurations provide substantial resources for running analytics workloads supporting multiple teams. The cluster can scale even further with more worker nodes as needed to support petabyte-scale workloads.
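For capacity-planning scripts, the tiers in Table 4 can be restated as data. The following convenience sketch simply mirrors the table above; the tier keys are arbitrary names chosen here.

# The Table 4 tiers restated as data, for use in simple planning scripts.
CLUSTER_TIERS = {
    "poc":                 {"control_plane": 3, "coordinator": 1, "workers": 2},
    "entry":               {"control_plane": 3, "coordinator": 1, "workers": 3},
    "growth":              {"control_plane": 3, "coordinator": 1, "workers": 6},
    "enterprise":          {"control_plane": 3, "coordinator": 1, "workers": 11},
    "advanced_enterprise": {"control_plane": 3, "coordinator": 1, "workers": 16},
}

def total_nodes(tier: str) -> int:
    """Total server count for a named tier."""
    return sum(CLUSTER_TIERS[tier].values())

print(total_nodes("growth"))  # -> 10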
Scaling guidelines
Scaling overview
During the lifetime of the system, it is typical to scale the platform to support larger workloads or increase compute and storage capacity. The Dell Data Lakehouse architecture is designed to scale at the compute, storage, and workload levels. The design incorporates network scaling as part of the infrastructure scaling. Compute and storage can be scaled independently.
Compute scaling
Compute scaling is accomplished by adding DDAE nodes to the cluster. After DDAE node installation and provisioning, the new nodes become part of the cluster.
Storage scaling
Storage scaling is accomplished by adding or upgrading nodes in the ECS, ObjectScale, or PowerScale storage cluster, using the storage cluster management tools.
Network scaling
The Dell Data Lakehouse architecture scales network bandwidth as compute or storage nodes, such as ObjectScale nodes, are added. When scaling either compute or storage, the impact on network performance must be considered to maintain consistent SLAs and avoid bottlenecks. Substantial changes in expected data transfer volumes should be evaluated to ensure that the available bandwidth on the compute and storage clusters remains aligned.
Conclusion
The Dell Data Lakehouse has been developed to address the needs of organizations deploying advanced analytics and AI/ML workloads. This technical solution guide offers detailed product information and design guidance for the Dell Data Lakehouse and targets data analytics infrastructure managers and architects. It describes a predesigned, validated, and scalable architecture for advanced analytics and machine learning on Dell hardware infrastructure. Topics that were discussed include:
· The Dell Data Lakehouse cluster architecture, including the cluster server and storage infrastructure and its role in the system
· The cluster physical and logical network designs
· Details of the compute nodes; PowerScale, ECS, and ObjectScale storage; and PowerSwitch networking configurations
· The recommended software infrastructure components used in the architecture: Starburst Enterprise and the open table formats Delta Lake and Apache Iceberg
· Cluster sizing and scaling guidance
References
Dell Technologies documentation
The following Dell Technologies documentation provides other information related to this document. Access to these documents depends on your login credentials. If you do not have access to a document, contact your Dell Technologies representative.
Additional information can be obtained at the Dell Technologies Info Hub for Data Analytics. If you need additional services or implementation help, contact your Dell Technologies sales representative.
Document Type                    Location
Dell Data Lakehouse              Dell Data Lakehouse Sizing and Configuration Guide
                                 Dell Data Lakehouse Validation Guide
                                 Dell Data Lakehouse Spec Sheet
Server specification sheets      PowerEdge R660 Spec Sheet
Storage specification sheets     ECS EX500 Spec Sheet
                                 ECS EXF900 Spec Sheet
                                 ObjectScale Solution Overview
                                 PowerScale H7000 Spec Sheet
Switch specification sheets      PowerSwitch S3100 Series Spec Sheet
                                 PowerSwitch S5200-ON Series Spec Sheet
                                 PowerSwitch Z9264F-ON Spec Sheet
Server manuals                   PowerEdge R660 Manuals and Documents
Storage manuals                  ECS EX500 Manuals and Documents
                                 ECS EXF900 Manuals and Documents
                                 ObjectScale Overview and Architecture
                                 PowerScale H7000 Manuals and Documents
Switch manuals                   PowerSwitch S3100 Manuals and Documents
                                 PowerSwitch S5200-ON Series Manuals and Documents
                                 PowerSwitch Z9264F-ON Manuals and Documents
Delta Lake documentation
The following documentation on the Delta Lake documentation website provides additional and relevant information.

Table 5. Delta Lake documentation

Document Type                               Location
Lakehouse architecture introductory paper   Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Delta Lake project                          Delta Lake Project website
Delta Lake documentation                    Delta Lake documentation website
Apache Iceberg documentation
The following documentation on the Apache Iceberg documentation website provides additional and relevant information.

Table 6. Apache Iceberg documentation

Document Type          Location
Iceberg table format   Apache Iceberg
Prior Reference Architecture for Starburst with Dell Infrastructure
The following reference architecture, posted in 2023, highlights the integration between Starburst and Dell infrastructure and includes performance validation results.

Table 7. Prior Reference Architecture for Starburst with Dell Infrastructure

Document Type            Location
Reference Architecture   Reference Architecture--Multicloud Data Analytics with Dell Technologies Powered by Starburst
Dell Technologies Info Hub
The Dell Technologies Info Hub is your one-stop destination for the latest information about Dell solutions and products. New material is added frequently, so browse often to stay up to date on the expanding Dell portfolio of cutting-edge products and solutions.
More information
For more information, including sizing guidance, technical questions, or sales assistance, email Analytics.Assist@dell.com, or contact your Dell Technologies or authorized partner sales representative.