Apache Avro

Developer(s): Apache Software Foundation
Stable release: 1.8.1 / May 19, 2016
Repository: git-wip-us.apache.org/repos/asf/avro.git
Development status: Active
Type: Remote procedure call framework
License: Apache License 2.0
Website: http://avro.apache.org/

Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.

It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically typed languages).

Apache Spark SQL can access Avro as a data source.

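An Avro Object Container File consists of a file header followed by one or more file data blocks.

A file header consists of four bytes (ASCII 'O', 'b', 'j', followed by the byte 1), file metadata that includes the schema definition, and the 16-byte, randomly generated sync marker for this file.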
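As a rough illustration of the header layout in the Avro specification (a four-byte magic, a metadata map of string keys to byte values, and a 16-byte sync marker), the following sketch parses a hand-built header in pure Python. The function names (read_long, read_bytes, read_header) and the sample bytes are assumptions made for this example; they are not part of the official Avro library.

```python
import io

MAGIC = b"Obj\x01"   # ASCII 'O', 'b', 'j', then the byte 1
SYNC_SIZE = 16       # length of the sync marker in bytes


def read_long(buf):
    """Read one Avro long: base-128 varint, then undo the zigzag mapping."""
    shift = 0
    result = 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("unexpected end of data")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_bytes(buf):
    """Read a length-prefixed byte sequence."""
    return buf.read(read_long(buf))


def read_header(buf):
    """Return (metadata, sync_marker) from the start of a container file."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:          # a zero count ends the map
            break
        if count < 0:           # negative count: a block byte size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = read_bytes(buf).decode("utf-8")
            meta[key] = read_bytes(buf)
    sync = buf.read(SYNC_SIZE)
    if len(sync) != SYNC_SIZE:
        raise ValueError("truncated header")
    return meta, sync


# Hand-built sample header: two metadata entries and a dummy sync marker.
sample = (
    MAGIC
    + b"\x04"                                 # map block with 2 entries
    + b"\x16avro.schema" + b"\x10" + b'"string"'
    + b"\x14avro.codec" + b"\x08null"
    + b"\x00"                                 # end of map
    + bytes(range(16))                        # stand-in for a random sync marker
)
meta, sync = read_header(io.BytesIO(sample))
```

A real file would carry the full JSON schema under the avro.schema key, and the data blocks would follow the sync marker; this sketch stops at the end of the header.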

For data blocks, Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
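To give a feel for why the binary encoding tends to be smaller, here is a minimal sketch of how the Avro specification encodes long and string values: integers are zigzag-mapped and written as a variable-length base-128 number, and strings are written as a length prefix followed by UTF-8 bytes. The names encode_long and encode_string are illustrative only and do not come from any Avro library.

```python
import json


def encode_long(n):
    """Encode a signed 64-bit integer in Avro binary form."""
    z = (n << 1) ^ (n >> 63)      # zigzag: small magnitudes become small values
    out = bytearray()
    while z & ~0x7F:              # more than 7 bits left to write
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as a long byte count followed by UTF-8 data."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Compare the two encodings of the same values.
binary_number = encode_long(256)                 # 2 bytes
json_number = json.dumps(256).encode("utf-8")    # 3 bytes: b"256"
binary_text = encode_string("Alyssa")            # 7 bytes
json_text = json.dumps("Alyssa").encode("utf-8") # 8 bytes, including quotes
```

The savings grow for whole records, because the binary form carries no field names or punctuation; those live in the schema instead.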

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

Simple schema example:
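As an illustration, the sketch below holds a small record schema as JSON text and inspects it with Python's standard json module. The User record and its field names are assumptions chosen for this example, built only from the primitive and complex types listed above.

```python
import json

# Hypothetical schema: a record with one required string field and two
# optional fields, each expressed as a union with "null".
SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"]},
    {"name": "favorite_color", "type": ["null", "string"]}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)

# Map each field name to its declared type for a quick overview.
field_types = {field["name"]: field["type"] for field in schema["fields"]}
for name, avro_type in field_types.items():
    print(name, "->", avro_type)
```

Because the schema is plain JSON, any JSON parser can read it; an Avro library would additionally validate it and resolve named types.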

Data in Avro might be stored with its corresponding schema, meaning a serialized item can be read without knowing the schema ahead of time.

Serialization:

File "users.avro" will contain the schema in JSON and a compact binary representation of the data:

Deserialization:
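A matching sketch for the reading side is shown here, again in pure Python and under the same assumptions: a hypothetical User schema and a hard-coded byte string standing in for the record data that a container file would hold. The helper names (read_long, read_datum) are illustrative and not taken from an Avro library.

```python
import io
import json

# Hypothetical schema; in a container file it would come from the header.
SCHEMA = json.loads("""
{"namespace": "example.avro", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["null", "int"]},
   {"name": "favorite_color", "type": ["null", "string"]}
 ]}
""")


def read_long(buf):
    """Read a base-128 varint and undo the zigzag mapping."""
    shift = 0
    result = 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("unexpected end of data")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_datum(schema, buf):
    """Decode a value according to a (small subset of) Avro schema."""
    if isinstance(schema, list):                     # union: index picks branch
        return read_datum(schema[read_long(buf)], buf)
    if isinstance(schema, dict) and schema["type"] == "record":
        return {
            field["name"]: read_datum(field["type"], buf)
            for field in schema["fields"]
        }
    if schema == "null":
        return None
    if schema in ("int", "long"):
        return read_long(buf)
    if schema == "string":
        return buf.read(read_long(buf)).decode("utf-8")
    raise NotImplementedError("unsupported schema: %r" % (schema,))


# Two binary-encoded User records, back to back.
payload = b"\x0cAlyssa\x02\x80\x04\x00" + b"\x06Ben\x02\x0e\x02\x06red"

buf = io.BytesIO(payload)
records = []
while buf.tell() < len(payload):
    records.append(read_datum(SCHEMA, buf))

for record in records:
    print(record)
```

Note that the bytes carry no field names or type tags, so decoding is only possible with the schema in hand; this is why container files store the schema alongside the data.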

This outputs:

Though theoretically any language could use Avro, the following languages have APIs written for them:

In addition to supporting JSON for type and protocol definitions, Avro includes experimental support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++ and others.


