Which distributions include the Kudu product

While the Apache Kudu project provides client bindings that allow users to mutate and fetch data, more complex access patterns are often written via SQL and compute engines. This is a non-exhaustive list of projects that integrate with Kudu to enhance ingest, querying capabilities, and orchestration.

Frequently used

The following integrations are among the most commonly used with Apache Kudu (sorted alphabetically).

Apache Drill

Apache Drill provides a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. See the Drill Kudu API documentation for more details.

Apache Hive

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. See the Hive Kudu integration documentation for more details.

Apache Impala

Apache Impala is the open source, native analytic database for Apache Hadoop. See the Kudu Impala integration documentation for more details.

Apache Spark SQL

Spark SQL is a Spark module for structured data processing. See the Kudu Spark integration documentation for more details.

Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. See the Presto Kudu connector documentation for more details.

Computation

Apache Beam

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends. See the Beam Kudu source and sink documentation for more details.

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. See the Kudu Spark integration documentation for more details.

Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Kudu Python scanners can be converted to Pandas DataFrames. See Kudu’s Python tests for example usage.

Talend Big Data

Talend simplifies and automates big data integration projects with on-demand serverless Spark and machine learning. See Talend’s Kudu component documentation for more details.

Ingest

Akka

Akka facilitates building highly concurrent, distributed, and resilient message-driven applications on the JVM. See the Alpakka Kudu connector documentation for more details.

Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. See the Flink Kudu connector documentation for more details.

Apache NiFi

Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. See the PutKudu processor documentation for more details.

Apache Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. See Kudu’s Spark Streaming tests for example usage.

Confluent Platform Kafka

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. See the Kafka Kudu connector documentation for more details.

StreamSets Data Collector

StreamSets Data Collector is a lightweight, powerful engine that streams data in real time. See the StreamSets Data Collector Kudu destination documentation.

Striim

Striim is real-time data integration software that enables continuous data ingestion, in-flight stream processing, and delivery. See the Striim Kudu Writer documentation for more details.

TIBCO StreamBase

TIBCO StreamBase® is an event processing platform for applying mathematical and relational processing to real-time data streams. See the StreamBase Kudu operator documentation for more details.

Informatica PowerExchange

Informatica® PowerExchange® is a family of products that enables retrieval of a variety of data sources without having to develop custom data-access programs. See the PowerExchange for Kudu documentation for more details.

Deployment and Orchestration

Apache Camel

Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data. See the Camel Kudu component documentation for more details.

Cloudera Manager

Cloudera Manager is an end-to-end application for managing CDH clusters. See the Cloudera Manager documentation for Kudu for more details.

Docker

Docker facilitates packaging software into standardized units for development, shipment, and deployment. See the official Apache Kudu Dockerhub and the Apache Kudu Docker Quickstart for more details.

Wavefront

Wavefront is a high-performance streaming analytics platform that supports 3D observability. See the Wavefront Kudu integration documentation for more details.

Visualization

Zoomdata

Zoomdata provides a high-performance BI engine and visually engaging, interactive dashboards. See Zoomdata’s Kudu page for more details.

Distribution and Support

While Kudu is an Apache-licensed open source project, software vendors may package and license it with other components to facilitate consumption. These offerings are typically bundled with support to tune and facilitate administration.

Introducing Apache Kudu

Kudu is a distributed columnar storage engine optimized for OLAP workloads. Kudu runs on commodity hardware, is horizontally scalable, and supports highly available operation.

Kudu’s design sets it apart. Some of Kudu’s benefits include:

  • Fast processing of OLAP workloads.
  • Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict-serializable consistency.
  • Structured data model.
  • Strong performance for running sequential and random workloads simultaneously.
  • Tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet.
  • Integration with Apache NiFi and Apache Spark.
  • Integration with the Hive Metastore (HMS) and Apache Ranger to provide fine-grained authorization and access control.
  • Authenticated and encrypted RPC communication.
  • High availability: Tablet Servers and Masters use the Raft Consensus Algorithm, which ensures that as long as more than half the total number of tablet replicas is available, the tablet is available for reads and writes. For instance, if 2 out of 3 replicas (or 3 out of 5 replicas, etc.) are available, the tablet is available. Reads can be serviced by read-only follower tablet replicas, even in the event of a leader replica’s failure.
  • Automatic fault detection and self-healing: to keep data highly available, the system detects failed tablet replicas and re-replicates data from available ones, so failed replicas are automatically replaced when enough Tablet Servers are available in the cluster.
  • Location awareness (a.k.a. rack awareness) to keep the system available in case of correlated failures and to allow Kudu clusters to span multiple availability zones.
  • Logical backup (full and incremental) and restore.
  • Multi-row transactions (only for INSERT/INSERT_IGNORE operations as of the Kudu 1.15 release).
  • Easy to administer and manage.

By combining all of these properties, Kudu targets support for families of applications that are difficult or impossible to implement using Hadoop storage technologies, while it is compatible with most of the data processing frameworks in the Hadoop ecosystem.

A few examples of applications for which Kudu is a great solution are:

  • Reporting applications where newly-arrived data needs to be immediately available for end users
  • Time-series applications that must simultaneously support:
      • queries across large amounts of historic data
      • granular queries about an individual entity that must return very quickly
  • Applications that use predictive models to make real-time decisions with periodic refreshes of the predictive model based on all historic data

For more information about these and other scenarios, see Example Use Cases.

Kudu-Impala Integration Features

Impala supports creating, altering, and dropping tables using Kudu as the persistence layer. The tables follow the same internal / external approach as other tables in Impala, allowing for flexible data ingestion and querying.

Data can be inserted into Kudu tables in Impala using the same syntax as any other Impala table like those using HDFS or HBase for persistence.

Impala supports the UPDATE and DELETE SQL commands to modify existing data in a Kudu table row-by-row or as a batch. The syntax of the SQL commands is chosen to be as compatible as possible with existing standards. In addition to simple DELETE or UPDATE commands, you can specify complex joins with a FROM clause in a subquery.
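
For illustration, the statements below sketch what this looks like against a hypothetical Kudu-backed table; the table users_kudu, its columns, and banned_accounts are invented for this example and are not part of any real schema:

-- insert uses the same syntax as any other Impala table
INSERT INTO users_kudu VALUES (1, 'alice', 'active');

-- row-by-row or batch modification by predicate
UPDATE users_kudu SET status = 'inactive' WHERE id = 1;
DELETE FROM users_kudu WHERE status = 'inactive';

-- a more complex delete driven by a join in a FROM clause
DELETE u FROM users_kudu u JOIN banned_accounts b ON u.id = b.user_id;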

Similar to partitioning of tables in Hive, Kudu allows you to dynamically pre-split tables by hash or range into a predefined number of tablets, in order to distribute writes and queries evenly across your cluster. You can partition by any number of primary key columns, by any number of hashes, and an optional list of split rows. See Schema Design.
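
As a hedged sketch only (the metrics table, its columns, and the partition boundaries are invented for illustration), such a pre-split table could be declared from Impala as follows:

CREATE TABLE metrics (
  host   STRING,
  metric STRING,
  ts     BIGINT,
  value  DOUBLE,
  PRIMARY KEY (host, metric, ts)
)
PARTITION BY HASH (host, metric) PARTITIONS 16,
             RANGE (ts) (
  PARTITION VALUES < 1577836800,
  PARTITION 1577836800 <= VALUES
)
STORED AS KUDU;

Here the hash component spreads writes evenly across tablets, while the range component lets scans with a timestamp predicate prune whole partitions.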

To achieve the highest possible performance on modern hardware, the Kudu client used by Impala parallelizes scans across multiple tablets.

Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates are evaluated as close as possible to the data. Query performance is comparable to Parquet in many workloads.

For more details regarding querying data stored in Kudu using Impala, please refer to the Impala documentation.

Concepts and Terms

Kudu is a columnar data store. A columnar data store stores data in strongly-typed columns. With a proper design, it is superior for analytical or data warehousing workloads for several reasons.

For analytical queries, you can read a single column, or a portion of that column, while ignoring other columns. This means you can fulfill your query while reading a minimal number of blocks on disk. With a row-based store, you need to read the entire row, even if you only return values from a few columns.

Because a given column contains only one type of data, pattern-based compression can be orders of magnitude more efficient than compressing mixed data types, which are used in row-based solutions. Combined with the efficiencies of reading data from columns, compression allows you to fulfill your query while reading even fewer blocks from disk. See Data Compression.

A table is where your data is stored in Kudu. A table has a schema and a totally ordered primary key. A table is split into segments called tablets.

A tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any given point in time, one of these replicas is considered the leader tablet. Any replica can service reads, and writes require consensus among the set of tablet servers serving the tablet.

A tablet server stores and serves tablets to clients. For a given tablet, one tablet server acts as a leader, and the others act as follower replicas of that tablet. Only leaders service write requests, while leaders or followers each service read requests. Leaders are elected using Raft Consensus Algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers.

The master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster. At a given point in time, there can only be one acting master (the leader). If the current leader disappears, a new master is elected using Raft Consensus Algorithm.

The master also coordinates metadata operations for clients. For example, when creating a new table, the client internally sends the request to the master. The master writes the metadata for the new table into the catalog table, and coordinates the process of creating tablets on the tablet servers.

All the master’s data is stored in a tablet, which can be replicated to all the other candidate masters.

Tablet servers heartbeat to the master at a set interval (the default is once per second).

Kudu uses the Raft consensus algorithm as a means to guarantee fault-tolerance and consistency, both for regular tablets and for master data. Through Raft, multiple replicas of a tablet elect a leader, which is responsible for accepting and replicating writes to follower replicas. Once a write is persisted in a majority of replicas it is acknowledged to the client. A given group of N replicas (usually 3 or 5) is able to accept writes with at most (N - 1)/2 faulty replicas.

The catalog table is the central location for metadata of Kudu. It stores information about tables and tablets. The catalog table may not be read or written directly. Instead, it is accessible only via metadata operations exposed in the client API.

The catalog table stores two categories of metadata:

  • table schemas, locations, and states
  • the list of existing tablets, which tablet servers have replicas of each tablet, the tablet’s current state, and start and end keys.

Kudu replicates operations, not on-disk data. This is referred to as logical replication, as opposed to physical replication. This has several advantages:

  • Although inserts and updates do transmit data over the network, deletes do not need to move any data. The delete operation is sent to each tablet server, which performs the delete locally.
  • Physical operations, such as compaction, do not need to transmit the data over the network in Kudu. This is different from storage systems that use HDFS, where the blocks need to be transmitted over the network to fulfill the required number of replicas.
  • Tablets do not need to perform compactions at the same time or on the same schedule, or otherwise remain in sync on the physical storage layer. This decreases the chances of all tablet servers experiencing high latency at the same time, due to compactions or heavy write loads.

Architectural Overview

The following diagram shows a Kudu cluster with three masters and multiple tablet servers, each serving multiple tablets. It illustrates how Raft consensus is used to allow for both leaders and followers for both the masters and tablet servers. In addition, a tablet server can be a leader for some tablets, and a follower for others. Leaders are shown in gold, while followers are shown in blue.

Kudu Architecture

Example Use Cases

A common challenge in data analysis is one where new data arrives rapidly and constantly, and the same data needs to be available in near real time for reads, scans, and updates. Kudu offers the powerful combination of fast inserts and updates with efficient columnar scans to enable real-time analytics use cases on a single storage layer.

A time-series schema is one in which data points are organized and keyed according to the time at which they occurred. This can be useful for investigating the performance of metrics over time or attempting to predict future behavior based on past data. For instance, time-series customer data might be used both to store purchase click-stream history and to predict future purchases, or for use by a customer support representative. While these different types of analysis are occurring, inserts and mutations may also be occurring individually and in bulk, and become available immediately to read workloads. Kudu can handle all of these access patterns simultaneously in a scalable and efficient manner.

Kudu is a good fit for time-series workloads for several reasons. With Kudu’s support for hash-based partitioning, combined with its native support for compound row keys, it is simple to set up a table spread across many servers without the risk of "hotspotting" that is commonly observed when range partitioning is used. Kudu’s columnar storage engine is also beneficial in this context, because many time-series workloads read only a few columns, as opposed to the whole row.

In the past, you might have needed to use multiple data stores to handle different data access patterns. This practice adds complexity to your application and operations, and duplicates your data, doubling (or worse) the amount of storage required. Kudu can handle all of these access patterns natively and efficiently, without the need to off-load work to other data stores.

Data scientists often develop predictive learning models from large sets of data. The model and the data may need to be updated or modified often as the learning takes place or as the situation being modeled changes. In addition, the scientist may want to change one or more factors in the model to see what happens over time. Updating a large set of data stored in files in HDFS is resource-intensive, as each file needs to be completely rewritten. In Kudu, updates happen in near real time. The scientist can tweak the value, re-run the query, and refresh the graph in seconds or minutes, rather than hours or days. In addition, batch or incremental algorithms can be run across the data at any time, with near-real-time results.

Companies generate data from multiple sources and store it in a variety of systems and formats. For instance, some of your data may be stored in Kudu, some in a traditional RDBMS, and some in files in HDFS. You can access and query all of these sources and formats using Impala, without the need to change your legacy systems.

Kudu: a new data storage engine in the Hadoop ecosystem

Kudu was one of the new products presented by Cloudera at the Strata + Hadoop World 2015 conference. It is a new big-data storage engine, created to fill the niche between two already existing engines: the HDFS distributed file system and the column-oriented HBase database.

The existing engines are not without drawbacks. HDFS, which copes very well with scans over large volumes of data, performs poorly on lookup operations. With HBase it is exactly the other way around. In addition, HDFS has one more limitation: it does not allow already written data to be modified. According to its developers, the new engine combines the advantages of both existing systems:
— low-latency lookup operations
— the ability to modify data
— high performance when scanning large volumes of data

Typical use cases for Kudu include time-series analysis, log analytics, and the analysis of sensor data. Today, systems that use Hadoop for such tasks have rather complex architectures. As a rule, the data lives in several stores at once (the so-called "Lambda architecture"). A number of problems around synchronizing data between the stores have to be solved (a lag inevitably appears, which people usually just accept and live with). Data-access security policies also have to be configured separately for each store. And the rule "the simpler, the more reliable" still applies. Using Kudu instead of several parallel stores can significantly simplify the architecture of such systems.

Kudu characteristics:
— High performance when scanning large volumes of data
— Fast response times for lookup operations
— A columnar database, of the CP type in terms of the CAP theorem, supporting several levels of data consistency
— Support for updates
— Transactions at the level of individual records
— Fault tolerance
— A configurable data redundancy level (to keep data safe when one of the nodes fails)
— APIs for C++, Java, and Python; access from Impala, MapReduce, and Spark is supported
— Open source, under the Apache License

SOME NOTES ON THE ARCHITECTURE

A Kudu cluster consists of two types of services: the master, a service responsible for managing metadata and for coordination between nodes, and the tablet server, a service installed on every node intended to store data. A cluster can have only one active master at a time. For fault tolerance, several additional master services can be run in standby mode. Tablet servers split the data into logical partitions (called "tablets").

From the user's point of view, data in Kudu is stored in tables. A schema has to be defined for every table (a rather atypical approach for NoSQL databases). In addition to the columns and their types, the user must define a primary key and a partitioning policy.

Unlike other components of the Hadoop ecosystem, Kudu does not use HDFS to store its data. It uses the operating system's file system instead (ext4 or XFS is recommended). To guarantee data safety when individual nodes fail, Kudu replicates data between servers. As a rule, each tablet is stored on three servers (however, only one of the three servers accepts write operations; the others accept only reads). Synchronization between the replicas of a tablet is implemented with the Raft protocol.

FIRST STEPS

Let's try working with Kudu from the user's point of view. We will create a table and access it using SQL and the Java API.

To populate the table with data, we will use this open data set:

At the moment Kudu has no client shell of its own. To create the table we will use the Impala shell (impala-shell).

First of all, let's create an "employees" table with its data stored in HDFS:
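
The exact statement is not preserved in this copy; a minimal sketch of what it could look like is shown below (the column names are assumptions about the data set, not taken from the original article):

CREATE TABLE employees (
  emp_no     INT,
  birth_date STRING,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;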

We download the data set to the machine with the impala-shell client and import the data into the table:
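
The exact commands are not preserved either. Assuming the data set is a single CSV file, the import could look roughly like this (all paths are hypothetical):

-- first copy the file into HDFS from a shell, e.g.
--   hdfs dfs -put employees.csv /user/demo/employees.csv
-- then, in impala-shell, move it into the table's storage directory:
LOAD DATA INPATH '/user/demo/employees.csv' INTO TABLE employees;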

Once the command finishes, we start impala-shell again and run the following query:
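
The original query is likewise not preserved. With a reasonably recent Impala/Kudu release, a rough equivalent might be the following (the table name employees_kudu, the choice of emp_no as the primary key, and the hash-partitioning parameters are all assumptions):

CREATE TABLE employees_kudu
PRIMARY KEY (emp_no)
PARTITION BY HASH (emp_no) PARTITIONS 8
STORED AS KUDU
AS SELECT * FROM employees;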

This query creates a table with the same columns, but with Kudu as the storage layer. The "AS SELECT" in the last line copies the data from HDFS into Kudu.

Without leaving impala-shell, let's run a few SQL queries against the newly created table:
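
The original queries are not preserved; here are two hedged examples against the hypothetical schema above:

-- number of employees by gender (the query reproduced later with the Java API)
SELECT gender, COUNT(*) AS cnt FROM employees_kudu GROUP BY gender;

-- a low-latency point lookup by primary key
SELECT first_name, last_name FROM employees_kudu WHERE emp_no = 10001;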

Queries can also address both stores (Kudu and HDFS) at the same time:
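
For example (again using the hypothetical table names from above, with employees stored in HDFS and employees_kudu stored in Kudu):

SELECT h.emp_no, h.hire_date, k.first_name, k.last_name
FROM employees h
JOIN employees_kudu k ON h.emp_no = k.emp_no
LIMIT 10;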

Now let's try to reproduce the result of the first query (counting male and female employees) using the Java API. Here is the code:
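
The original listing is not preserved. Below is a minimal sketch using the Apache Kudu Java client; the master address, the table name, and the assumption that gender is stored as the strings 'M' and 'F' are all hypothetical and would need to be adjusted for a real cluster.

import java.util.Collections;

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public class GenderCount {
  public static void main(String[] args) throws Exception {
    // Connect to the Kudu master (adjust the address for your cluster).
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      // Depending on how the table was created from Impala, the name registered
      // in Kudu may carry a prefix, e.g. "impala::default.employees_kudu".
      KuduTable table = client.openTable("employees_kudu");

      // Project only the "gender" column so the scan benefits from columnar storage.
      KuduScanner scanner = client.newScannerBuilder(table)
          .setProjectedColumnNames(Collections.singletonList("gender"))
          .build();

      long male = 0;
      long female = 0;
      while (scanner.hasMoreRows()) {
        RowResultIterator rows = scanner.nextRows();
        while (rows.hasNext()) {
          RowResult row = rows.next();
          if ("M".equals(row.getString("gender"))) {
            male++;
          } else {
            female++;
          }
        }
      }
      System.out.println("male = " + male + ", female = " + female);
    } finally {
      client.shutdown();
    }
  }
}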

Which distributions include the Kudu product

Kudu is an open-source data storage engine developed under the Apache Software Foundation. It is designed for working with large volumes of data and provides fast, efficient access to it. Kudu runs on various Linux distributions; there is no native Windows port, although it can be used on Windows indirectly (for example, via Docker).

Linux distributions

Kudu is part of the broader Apache Hadoop ecosystem and can be installed on various Linux distributions, including:

  1. Ubuntu: Kudu can be installed on Ubuntu from vendor package repositories (for example, Cloudera's) or built from the source releases published by the Apache project. Installation usually takes only a few commands in a terminal.
  2. CentOS: On CentOS, Kudu can be installed from the Cloudera repository. Cloudera provides Kudu packages that can be installed with the yum package manager.
  3. Red Hat Enterprise Linux (RHEL): Kudu is also supported on RHEL and can be installed from the Cloudera repository.
  4. SUSE Linux: On SUSE Linux, Kudu can be installed from the Cloudera repository or built manually from source.

Windows

Kudu does not provide a native Windows build, and there is no official Windows installer. The practical way to try Kudu on a Windows machine is to run it indirectly:

  1. Install Docker (or a Linux virtual machine) on Windows.
  2. Pull the official Apache Kudu images from Docker Hub and follow the Apache Kudu Docker Quickstart.
  3. Once the containers are running, Kudu can be used from Windows through its client APIs and web UI.

Conclusion

Kudu is a powerful tool for working with large volumes of data that is supported on various Linux distributions. On Linux, Kudu is usually installed from vendor repositories such as Cloudera's or built from source; on Windows it can only be run indirectly, for example via Docker, since there is no official Windows installer.
