by Gilad David Maayan on 15 Nov. 2019

The concept of data compression has been around for nearly 200 years, and it has been refined over time to meet an increasing drive for efficiency. That is fortunate, since individuals and organizations now collectively create more digital data than ever before. Data compression can have a significant impact not only on the cost of storing this data but also on the efficiency of processing it.

In this article, you’ll learn some tips for compressing your big data. These tips can help you reduce costs, increase efficiency, and hopefully gain greater insights.

5 Tips for Compressing Big Data

Big data can be challenging to compress due to the volume of data, the limitations of tools, and the need to retain fine detail. The following tips can help you overcome some of these hurdles.

1. Choose File Formats Carefully

A large portion of big data is collected and stored in JavaScript Object Notation (JSON) format. This is particularly true of data collected from web applications, since JSON is the format commonly used to serialize and transfer that data. Unfortunately, JSON is neither schema-based nor strongly typed, which makes it slower to work with in big data tools like Hadoop. To improve performance, you should consider using either the Avro or Parquet format instead.

Avro files are composed of binary-format data and a JSON-format schema. This composition improves efficiency and reduces file size. Avro is a row-based format, and it is most useful when you need to access all fields in a dataset.

Avro is both splittable and compressible; splittable formats can be processed in parallel for greater efficiency. You can also use the Avro format with streaming data. Consider Avro when you have write-heavy workloads, as it makes it easy to add new data rows.
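
To make this concrete, here is a minimal sketch of writing records to a compressed Avro file in Python. It assumes the third-party fastavro package and uses made-up field names, so treat it as an illustration rather than a production pipeline:

```python
# Minimal sketch: writing JSON-like records to a compressed Avro file.
# Assumes the third-party fastavro package is installed (pip install fastavro).
from fastavro import parse_schema, writer

# The schema is declared in JSON, while the data itself is stored as binary.
schema = parse_schema({
    "name": "SensorReading",
    "type": "record",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "temperature", "type": "float"},
        {"name": "timestamp", "type": "long"},
    ],
})

records = [
    {"device_id": "sensor-1", "temperature": 21.5, "timestamp": 1573776000},
    {"device_id": "sensor-2", "temperature": 19.8, "timestamp": 1573776060},
]

# codec="deflate" compresses each data block; appending more rows later is
# cheap, which is why Avro suits write-heavy workloads.
with open("readings.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")
```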

Parquet files are composed of binary data with attached metadata. Parquet is a column-based format that is both splittable and compressible. Its composition enables your tools to read column names, compression type, and data type without parsing the file. With Parquet, you can process data significantly faster since it enables one-pass writing.

You should consider using Parquet when you need to access specific fields rather than all fields in a dataset. You cannot use this format with streaming data, but it works well for complex analysis and read-heavy workloads.
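
A similar sketch for Parquet, assuming a recent version of the pyarrow library and the same hypothetical records, shows how you can write a compressed file and then read back only the fields you need:

```python
# Minimal sketch: converting in-memory records to a compressed Parquet file
# and reading back only the columns you need. Assumes pyarrow is installed.
import pyarrow as pa
import pyarrow.parquet as pq

records = [
    {"device_id": "sensor-1", "temperature": 21.5, "timestamp": 1573776000},
    {"device_id": "sensor-2", "temperature": 19.8, "timestamp": 1573776060},
]

table = pa.Table.from_pylist(records)

# Column metadata and compression type are stored in the file footer,
# so readers do not need to parse the whole file to discover them.
pq.write_table(table, "readings.parquet", compression="snappy")

# Read-heavy, field-specific access: only the requested column is scanned.
temperatures = pq.read_table("readings.parquet", columns=["temperature"])
print(temperatures.to_pydict())
```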

2. Compress Data From the Start

A large part of the cost of big data comes from the initial transfer of data into storage. It takes a significant amount of bandwidth and time to transfer large numbers of files, and a significant amount of storage to hold them afterward. All three of these costs (bandwidth, storage, and time) can be reduced if you compress files before or during transfer.

This type of compression can be done as part of the Extract, Transform, Load (ETL) process. ETL is used when transferring data from a database or other data sources, such as streaming data from sensors. The process extracts data, transforms it for use in the target system, and then loads the transformed data. Often, this is done with automated pipelines, making the process faster and easier.
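
As a simple illustration, the following sketch compresses an extracted file before transfer using only Python's standard library; the file names are placeholders:

```python
# Minimal sketch: compressing an extracted file as part of an ETL step,
# before it is transferred to storage. File names are placeholders.
import gzip
import shutil

def compress_for_transfer(src_path: str, dest_path: str) -> None:
    """Gzip-compress src_path into dest_path without loading it all into memory."""
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)

# Example: compress the extracted batch before uploading it.
compress_for_transfer("extracted_batch.json", "extracted_batch.json.gz")
```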

If you are already using data or file management solutions, you might be able to rely on built-in features to ease this process. For example, you can use digital asset management systems to optimize images or compress videos during upload. Many of these tools also let you change file formats dynamically, meaning you only need to store one version.

3. Use Co-Processing

Consider using co-processors to optimize your compression workflow. Co-processors can enable you to redirect time and processing power from your main CPU to secondary ones. This lets you retain primary processors for analytics and data processing while still compressing data.

To accomplish this, you can use Field-Programmable Gate Arrays (FPGAs). FPGAs are microchips that you can custom configure; in this case, you configure them to work as additional processors. You can also use these chips for hardware acceleration or to share computational loads.

If you dedicate FPGAs to compression, you can avoid tying up your primary processors with less time-sensitive tasks. By queuing your workloads, you can compress many datasets with minimal monitoring and perform compression during off-hours.
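
Actual FPGA offload depends on vendor-specific tooling, but the queuing idea can be sketched in plain Python by handing compression jobs to a background worker pool so the main process stays free for analytics; the file names here are placeholders:

```python
# Illustrative sketch only: real FPGA offload requires vendor tooling, but the
# same queuing idea can be mimicked in software by sending compression jobs
# to a background worker pool so the main process stays free for analytics.
import bz2
from concurrent.futures import ProcessPoolExecutor

def compress_file(path: str) -> str:
    """Compress one file with bz2 and return the output path."""
    out_path = path + ".bz2"
    with open(path, "rb") as src, bz2.open(out_path, "wb") as dest:
        dest.write(src.read())
    return out_path

if __name__ == "__main__":
    queued_files = ["batch_01.csv", "batch_02.csv", "batch_03.csv"]  # placeholders
    # Compression runs in separate worker processes; the main process can
    # continue with data processing and simply collect results later.
    with ProcessPoolExecutor(max_workers=2) as pool:
        for result in pool.map(compress_file, queued_files):
            print("compressed:", result)
```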

4. Match Compression Type to Data

Varying the type of compression you use can make a large difference. There are two types of compression to select from: lossy and lossless. Lossy compression reduces file size by eliminating data to create an approximation of the original file. It is often used for images, video, or audio, since humans are less likely to perceive missing data in media. Lossy compression can also be useful for data streams from Internet of Things (IoT) devices.

Lossless compression reduces file size by identifying repeated patterns in data and encoding them with shorter references. This enables all of the data to be retained while removing duplicated bits. Lossless compression is typically used for databases, text files, and discrete data. You should use this type of compression if your data needs to be processed multiple times.

The specific codec you use to perform compression is also important. A codec is a program or device that is used to encode and decode data according to a compression algorithm. The types of codecs you can use depend on the type of data, the speed of encoding/decoding, and the tools you’re using. Your codec options are also affected by whether you need your files to be splittable or not.
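
As a rough illustration, the following sketch compares three lossless codecs from Python's standard library on the same sample payload; real-world ratios and speeds will depend on your data and your settings:

```python
# Minimal sketch: comparing lossless codecs from the standard library on the
# same payload. Real ratios depend heavily on your data and chosen settings.
import bz2
import gzip
import lzma

payload = b'{"device_id": "sensor-1", "temperature": 21.5}\n' * 10_000

results = {
    "gzip": len(gzip.compress(payload)),
    "bz2": len(bz2.compress(payload)),
    "lzma": len(lzma.compress(payload)),
}

print("original:", len(payload), "bytes")
for codec, size in results.items():
    print(f"{codec}: {size} bytes ({size / len(payload):.1%} of original)")
```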

5. Combine With Data Deduplication

Although it is not required for compression, data deduplication is a useful process for further reducing your data. Data deduplication compares data to be stored with data already in storage and eliminates duplicates. It differs from compression in that it does not shrink the contents of a single file.

Rather, deduplication eliminates redundant files in storage and uses references to point to a single stored copy. This enables you to use one file across multiple datasets. In this way, the process that data deduplication uses is similar to some lossless compression algorithms.

You can use deduplication on whole files or at the block level. Block-level deduplication works by creating an index of your blocks; based on that index, only new or changed blocks are saved rather than entire new copies. Data deduplication is particularly effective for reducing the storage needed for backups and archived data.
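
The following sketch illustrates the block-level idea in plain Python: data is split into fixed-size blocks, each block is indexed by its hash, and only previously unseen blocks are stored. The block size and sample data are arbitrary examples, not recommendations:

```python
# Minimal sketch of block-level deduplication: split data into fixed-size
# blocks, index each block by its hash, and store each unique block once.
import hashlib

BLOCK_SIZE = 4096  # bytes; an assumption, real systems tune this

def deduplicate(data: bytes, store: dict) -> list:
    """Return a list of block hashes; unseen blocks are added to the store."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:          # only new or changed blocks are saved
            store[digest] = block
        recipe.append(digest)            # the file becomes a list of references
    return recipe

store = {}
original = b"A" * 8192 + b"B" * 4096
backup = b"A" * 8192 + b"C" * 4096      # only the last block changed

recipe_1 = deduplicate(original, store)
recipe_2 = deduplicate(backup, store)
print("unique blocks stored:", len(store))  # 3 unique blocks for 6 total
```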

Conclusion

Big data is valuable because of the amount of information it provides and the depth of analyses that can be performed. It can provide insights that were previously inaccessible. Unfortunately, these insights come with significant processing and storage costs. To prevent these costs from interfering with your ability to learn and benefit from big data, you can compress your data. Hopefully, the tips covered here can help you with this process.
