The need for data archival continues to grow, but how do we safely preserve that data in the process? IT is continually asked to reduce cost, do more with less, and deliver innovation to the business, a task complicated by years of technology debt, resource constraints, and compliance and governance requirements. A secure, flexible, and highly scalable data preservation platform is key to meeting these demands.

Ideal Attributes for Data Preservation

Below are the ideal attributes of a platform for preserving data during the archival process.

  • File-based storage
  • Allows for the management of any type of data (structured, unstructured, hybrid)
  • Facilitates the preservation of the legacy data’s original structure
  • Maintains referential integrity
  • Highly compressible
  • Light infrastructure footprint
  • Flexible storage options
  • Metadata-driven
  • Robust compliance and governance
  • Flexible and responsive access to data

Current Archival Standards

Current archival standards use XML or JSON formatting to store data. There are several distinct challenges with using these formats for data preservation:

  • XML is an open format that does not enforce metadata and offers little control over the relationships among data.
  • XML does not support all character sets required in an archive.
  • XML is not compression friendly and bloats the data, sometimes heavily.
  • XML is not the most secure format for storing data.

Some archival systems use an RDBMS (relational database management system) such as a SQL database. This categorizes the archive as an application rather than a true archive, so we will not compare those systems in this brief.

Apache Parquet as a Solution

What is Apache Parquet?

Apache Parquet is an open-source columnar storage format for reading and writing data in the Hadoop ecosystem, supporting virtually all the data types used in Hive, MapReduce, Pig, and Spark SQL. Parquet is most commonly used for big data processing because it is a binary format that carries metadata about the content it stores. It uses Google's record-shredding and assembly algorithm and is optimized for working with complex data in bulk, as in decommissioning and archival applications.
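
As a quick illustration, the sketch below writes and reads a small Parquet file with the Python pyarrow library; the column names and file name are made up for the example.

```python
# A minimal sketch using the Python pyarrow library; the column names
# and file name below are illustrative, not part of any real archive.
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory table from Python data.
table = pa.table({
    "record_id": [1, 2, 3],
    "legacy_system": ["ERP", "ERP", "CRM"],
    "archived_on": ["2021-01-05", "2021-01-05", "2021-02-10"],
})

# Write it out as a Parquet file...
pq.write_table(table, "archive.parquet")

# ...and read it back, schema and all.
restored = pq.read_table("archive.parquet")
print(restored.schema)
```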

Parquet File Structure

Parquet files store data in a columnar structure rather than row by row, as outlined below.

The file structure contains:

  • Header – contains a 4-byte magic number, “PAR1”, that identifies the file as Parquet format.
  • Content/Body – contains the data, organized as row groups of column chunks.
  • Footer – contains the file metadata, the length of the file metadata, and the same magic number.

[Figure: Parquet File Structure]
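
To make this layout concrete, the following sketch (reusing the illustrative archive.parquet file from above) checks the 4-byte magic number at both ends of the file and reads the footer metadata through pyarrow.

```python
# A small sketch that checks the 4-byte "PAR1" magic at both ends of
# the illustrative archive.parquet file and reads its footer metadata.
import pyarrow.parquet as pq

with open("archive.parquet", "rb") as f:
    header_magic = f.read(4)   # first 4 bytes of the file
    f.seek(-4, 2)              # jump to the last 4 bytes
    footer_magic = f.read(4)   # the footer also ends with the magic
print(header_magic, footer_magic)  # both print b'PAR1'

# The footer holds the file metadata: schema, row groups, column chunks.
meta = pq.ParquetFile("archive.parquet").metadata
print(meta.num_rows, meta.num_row_groups, meta.num_columns)
```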

Parquet File Features

Parquet's main features are its compression mechanism, its columnar (rather than row-based) analytical approach, schema evolution, and performance.

Compression Mechanism

Organizing data by column allows better and more flexible compression, since each column stores values of the same type. It also allows a different encoding method, such as bit packing, run-length encoding, or dictionary encoding, to be chosen for each column, as the sketch below illustrates.

[Figure: Parquet Compression Mechanism]
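
A rough sketch of the idea, assuming a pyarrow build with the listed codecs compiled in; the data and file names are illustrative, and the exact sizes printed will vary.

```python
# A sketch comparing codecs on the same table with pyarrow; codec
# availability depends on the pyarrow build, and the printed sizes
# will vary with the data.
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Repetitive, same-typed columns compress and dictionary-encode well.
table = pa.table({
    "status": ["ARCHIVED"] * 100_000,
    "year":   [2021] * 100_000,
})

for codec in ("NONE", "SNAPPY", "GZIP", "ZSTD"):
    path = f"data_{codec.lower()}.parquet"
    # Dictionary encoding is applied per column; compression per codec.
    pq.write_table(table, path, compression=codec, use_dictionary=True)
    print(codec, os.path.getsize(path), "bytes")
```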

Column or Row Based Data Analytical Approach

Parquet files store data in columnar structures rather than in row-based structures like CSV and JSON, which increases the speed and performance of data processing. Because data is stored by column, a query can skip irrelevant fields entirely, drastically reducing data retrieval time.

[Figure: Parquet Columnar Storage]
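
The sketch below shows this column pruning with pyarrow, again reusing the illustrative archive.parquet file: only the requested column is read, and the other columns' chunks are skipped.

```python
# A sketch of column pruning with pyarrow: only the requested column's
# chunks are read from the illustrative archive.parquet file.
import pyarrow.parquet as pq

names_only = pq.read_table("archive.parquet", columns=["legacy_system"])
print(names_only.column_names)  # ['legacy_system']
```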

Schema Evolution

Parquet supports schema evolution: the schema can change as the data changes. An archive can start with a simple schema and gradually add columns over time, and files written with different versions of the schema can then be read together under a single merged schema, as sketched below.

[Figure: Parquet Schema Evolution]
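
A minimal sketch of this idea with pyarrow datasets, assuming the file names shown: a newer file adds an email column, and both files are read under one merged schema, with the older file surfacing nulls for the new column.

```python
# A sketch of schema evolution with pyarrow datasets; file names and
# columns are illustrative. A newer file adds an "email" column.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

v1 = pa.table({"id": [1, 2], "name": ["a", "b"]})
v2 = pa.table({"id": [3], "name": ["c"], "email": ["c@example.com"]})
pq.write_table(v1, "part_v1.parquet")
pq.write_table(v2, "part_v2.parquet")

# Merge the two schemas and read both files against the merged schema;
# rows from the older file show null in the new email column.
merged = pa.unify_schemas([v1.schema, v2.schema])
dataset = ds.dataset(["part_v1.parquet", "part_v2.parquet"], schema=merged)
print(dataset.to_table())
```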

Performance

In a Parquet-based file system, processing touches only the relevant data. The amount of data scanned is smaller, which means less I/O, lower storage cost, and shorter overall processing time.

[Figure: Parquet Performance Partition Pruning]
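
A hedged sketch of this filter pushdown with pyarrow, reusing the illustrative archive.parquet file; row groups whose column statistics cannot match the predicate are skipped, so less data is scanned.

```python
# A sketch of filter pushdown with pyarrow on the illustrative
# archive.parquet file: row groups whose column statistics cannot
# match the predicate are skipped, so less data is scanned.
import pyarrow.parquet as pq

erp_rows = pq.read_table(
    "archive.parquet",
    columns=["record_id", "legacy_system"],
    filters=[("legacy_system", "=", "ERP")],
)
print(erp_rows.num_rows)
```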

Parquet files are best suited for handling the following:

  • Any application that needs faster data processing with reduced footprints (archival or similar)
  • Big data analytics
  • Business intelligence

The table below shows the primary advantages of Parquet over other traditional file storage formats.

Storage Technique
  • Parquet: stores data in a column format, which greatly reduces I/O.
  • Other formats: CSV, JSON, Avro, and other common formats use row-based storage, which increases scanning and retrieval time.

Scanning Mechanism
  • Parquet: first locates only the relevant data blocks based on the filter criteria, then aggregates just the rows and columns that match.
  • Other formats: read entire rows of records to identify the required column.

Text vs. Binary
  • Parquet: stores data in binary format.
  • Other formats: CSV, JSON, and most other common formats store data as plain text.

Splittable
  • Parquet: files can be split or merged and processed in parallel.
  • Other formats: JSON files, for example, can only be split and processed individually when stored as JSON Lines.

Metadata Information
  • Parquet: stores metadata in the footer. A reader first seeks to the end of the file for the footer metadata length, then seeks backward to read the footer metadata itself, so blocks can be located easily and processed in parallel.
  • Other formats: CSV and JSON carry little metadata (at most a header row), which makes locating and separating blocks tedious.

Self-Describing
  • Parquet: each file stores its own metadata, including schema, version, and structure.
  • Other formats: CSV and JSON store no self-describing metadata, making it difficult to understand the type of content present.

Storage Cost
  • Parquet: platforms such as AWS charge by the amount of data scanned per query; Parquet has reduced storage requirements by more than 30% on large datasets, lowering overall cost.
  • Other formats: row-based storage, full scans, and weak compression make data processing more expensive.
Table: Primary Advantages of Parquet vs. Other File Storage Formats

To summarize, based on the features above, Parquet files reduce the data-scanning footprint compared with other file formats, making Parquet the best option for decommissioning and archival systems. The table below extends this comparison across common file formats.

Format              Text vs. binary                     Typed data   Schema enforcement          Schema evolution   Storage type   OLAP/OLTP   Splittable
CSV                 text                                no           no (minimal with header)    no                 row            OLTP        yes, in its simplest form
JSON                text                                yes          external, for validation    yes                row            OLTP        yes, with JSON Lines
XML                 text                                no           external, for validation    yes                row            OLTP        no
Avro                metadata in JSON, data in binary    yes          yes                         yes                row            OLTP        yes
ORC                 binary                              yes          yes                         no                 column         OLAP        yes
Protocol Buffers    binary                              yes          yes                         no                 row            OLTP        no
Parquet             binary                              yes          yes                         yes                column         OLAP        yes

Conclusion

In consideration of the above factors, a data preservation platform built on the Parquet file format is superior to the other formats discussed. It allows you to manage data effectively and efficiently, in a secure and optimized fashion, with the following benefits:

  • Storage optimization and reduced cost
  • Superior processing and query speed
  • Analytical efficiencies
  • Better human experience through efficient metadata
  • Unlimited scale
Written by

Karkavel Jegadeesan

Board Director, Chief Architect/Product Owner. Karkavel is passionate about automating and solving complex problems with data for Fortune 500 clients. He has developed leading-edge technology solutions for archiving, decommissioning, and migrating legacy systems, and has led the development of the Archon line of products and solutions that help clients in their open-source and digital transformation journeys.