Row vs Columnar Storage: The Ultimate Showdown for Data Efficiency?

 · 5 min read
 · Konstantinos Tsoumas

Row-Based Formats: A Nostalgic Tale

Let's face it. Record-oriented formats are the files that we grew up with. Of course, we didn't grow up with TSVs and CSVs, but what about text files? The mystical software called Excel?

Evolution of Data Formats: Enter AVRO

However, technologies change or, better put, they evolve! Back in the not-so-distant 2009, a new and improved row-based format was introduced: AVRO. Data is serialized in a compact binary form, while the schema is defined in JSON and can evolve over time (e.g., add/remove a column from a record). Today, row-based formats are commonly used for OLTP workloads for various purposes, depending on the nature of the data and the requirements of the application.

The difficulty with row-based formats in production (not always!) mainly lies in their inefficiency for analytical queries, their larger storage footprint, slow data processing, and the lack of built-in handling for specific data types.

Lack of Data Type Support

To elaborate on the last point, many row-based file formats do not explicitly store the data type of each column, which reduces the overhead of reading and writing the data. That may sound amazing at first, but it means we literally leave the application to infer each data type from the value.

To be more specific, if we have two columns that contain the values "123" and "hello", the application could infer that the first column is an integer and the second is a string. But what if the first column actually contains a string such as "123.45"? We've got a problem.

Also, what if the data types are unknown? The application must read the entire row into memory and then perform type checking on each column. Inefficiency kicks in.
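To see how this plays out in code, here is a minimal sketch (plain Python, with a hypothetical `infer_type` helper) of the kind of naive inference an application ends up doing when the format carries no type information:

```python
import csv
import io

# A CSV carries no type information: every value arrives as a string.
raw = "id,comment\n123,hello\n123.45,world\n"

def infer_type(value: str) -> str:
    """Naive inference: try int, then float, else fall back to string."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "string"

rows = list(csv.DictReader(io.StringIO(raw)))
# The first row suggests "id" is an integer column...
print(infer_type(rows[0]["id"]))   # prints: int
# ...but the second row breaks that assumption.
print(infer_type(rows[1]["id"]))   # prints: float
```

Every row can contradict the guess made from the previous one, which is exactly why the application must type-check each column of each row.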

When AVRO Makes Sense

AVRO handles many of the usual disadvantages that traditional row-based file formats suffer from, but at the cost of more complex schema definitions and some performance overhead (AVRO can sometimes be slower to work with due to schema validation, binary encoding, and compression).
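To make the schema-evolution idea concrete, here is a conceptual stdlib-only sketch (not the real AVRO binary format or library API): a reader schema adds a hypothetical `region` field with a default, so records written under the old schema stay readable:

```python
# Conceptual sketch of AVRO-style schema evolution. The field names and
# the "region" default are made up for illustration.
writer_schema = {"fields": [{"name": "player_id"}, {"name": "player_name"}]}
reader_schema = {"fields": [
    {"name": "player_id"},
    {"name": "player_name"},
    {"name": "region", "default": "unknown"},  # new column, with a default
]}

def resolve(record: dict, schema: dict) -> dict:
    """Fill in defaults for fields the old record does not carry."""
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

old_record = {"player_id": "12345", "player_name": "GamerX"}
print(resolve(old_record, reader_schema))
# prints: {'player_id': '12345', 'player_name': 'GamerX', 'region': 'unknown'}
```

The real AVRO spec does this resolution between writer and reader schemas at decode time; the default value is what makes adding a column backward-compatible.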

The Columnar Formats: A Different Beast

On a different page, we've got column-based formats (also known as c-stores). Quite a different design, and not very human-friendly.

Building on the fact that this is a columnar format, one might expect hundreds of columns. Not an everyday thing in an RDBMS (I hope).

But why? Oftentimes, engineers use Hadoop to denormalize data from relational formats to improve query performance. Hadoop is a distributed file system that is well-suited for processing large datasets, so reducing the number of joins, shrinking the amount of data that has to move, and improving data locality can all improve query performance.

That's all? Guess not. Columnar formats also offer better compression and more efficient schema evolution which are both important topics when it comes to storing big data. Is this the holy grail though? Probably not.

Moving to a Real-World Scenario

The Gaming Platform Dataset

Picture this. You own an advanced online gaming platform in 2023, hosting thousands of gamers who enjoy a wide range of games. Your dataset includes detailed player logs with 132 columns, capturing various aspects of player activity, such as game sessions, achievements, and purchases.

Let's use JSON and AVRO as row-based formats, since they are commonly used in modern data processing scenarios.

Row-Based Example: JSON and AVRO

A JSON file would look like:

    "player_id": "12345",
    "player_name": "GamerX",
    "game_title": "Space Odyssey",
    "session_duration": "120",
    "achievements": [
      "Level 10 reached",
      "Boss defeated",
      "High score achieved"
    "purchase_history": [
        "item_name": "Super Laser",
        "price": "9.99"
        "item_name": "Power-Up Pack",
        "price": "4.99"

and the corresponding AVRO schema would look like:

  "type": "record",
  "name": "PlayerRecord",
  "fields": [
    {"name": "player_id", "type": "string"},
    {"name": "player_name", "type": "string"},
    {"name": "game_title", "type": "string"},
    {"name": "session_duration", "type": "string"},
    {"name": "achievements", "type": {"type": "array", "items": "string"}},
    {"name": "purchase_history", "type": {"type": "array", "items": {
      "type": "record",
      "name": "Purchase",
      "fields": [
        {"name": "item_name", "type": "string"},
        {"name": "price", "type": "string"}

The player data is stored as an individual row with hundreds of columns following one another. While this can be considered human-readable (with some experience), it comes with huge challenges.

The query needs to scan every single record of the dataset, including all irrelevant columns. The entire row needs to be read, parsed into columns, and loaded into memory before filtering out the unnecessary information. This is particularly challenging when dealing with distributed databases or large historical datasets.

From a SQL viewpoint this is quite easy to conceptualize: fetching a range of records based on only a subset of those hundred-plus columns. For instance, analyzing specific player information, such as calculating total in-game purchases, is painfully slow and inefficient with this file format.

The reason is that the query needs to scan every single record of this dataset, including all 132 columns (even the irrelevant ones).

So, the query:

1. Reads the first row.
2. Parses the record into columns (so it parses and loads all columns into memory).
3. Picks out the sales and player status (online/offline) columns.
4. Includes these columns in your result if they satisfy your conditions.

Now, that was only your first row. Guess what?! Repeat for every record in the dataset.
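The scan-parse-filter loop above can be sketched in plain Python over newline-delimited JSON; the field names here (`status`, `purchase_total`) are hypothetical stand-ins for the real 132 columns:

```python
import json

# Two toy records standing in for millions of 132-column rows.
lines = [
    json.dumps({"player_id": "12345", "status": "online",
                "session_duration": "120", "purchase_total": "14.98"}),
    json.dumps({"player_id": "67890", "status": "offline",
                "session_duration": "90", "purchase_total": "4.99"}),
]

result = []
for line in lines:                            # 1. read each row
    record = json.loads(line)                 # 2. parse ALL columns into memory
    total = float(record["purchase_total"])   # 3. pick out the columns we need
    if record["status"] == "online":          # 4. keep rows matching the condition
        result.append(total)

print(sum(result))  # prints: 14.98
```

Note that step 2 is unavoidable here: even though the query touches two fields, every byte of every row gets parsed.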

If the dataset is distributed across multiple storage locations or involves network transfer (e.g., in a distributed database), retrieving and transmitting the entire row increases the amount of data transferred, impacting network efficiency. Now, think about having historical data going back five years. This is going to be a very expensive computation.

Partitioning by "sales" may help, but again you are reading, parsing, and loading probably millions of rows of data just to answer a simple question. You also need to account for the selection of the partition key (the key needs to align well with your query patterns) and for data quality (if errors or inconsistencies exist in the "sales" column, they may impact the effectiveness of the partitioning). Row-based formats may also result in significant storage space inefficiency: long text fields and a multitude of columns can lead to large file sizes, especially when the data is mostly composed of unstructured text.

This is an additional burden when trying to optimize your query.
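A toy sketch of that trade-off, with a hypothetical `region` key standing in for "sales": partitioning narrows which rows are opened, but every surviving row is still parsed whole:

```python
import json
from collections import defaultdict

records = [
    {"player_id": "12345", "region": "EU", "purchase_total": "14.98"},
    {"player_id": "67890", "region": "US", "purchase_total": "4.99"},
    {"player_id": "24680", "region": "EU", "purchase_total": "9.99"},
]

# Partitioning: group serialized rows into one "file" per partition key.
partitions = defaultdict(list)
for rec in records:
    partitions[rec["region"]].append(json.dumps(rec))

# A query filtered on region only opens the matching partition...
eu_total = 0.0
for line in partitions["EU"]:
    rec = json.loads(line)      # ...but still parses entire rows within it
    eu_total += float(rec["purchase_total"])

print(round(eu_total, 2))  # prints: 24.97
```

If the query pattern shifts (say, filtering by game title instead), this partition key stops helping and the full scan is back.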

Parquet Example

Apache Parquet is a column-based storage file format optimized for use with Hadoop. It provides efficient data compression and encoding schemes.

| player_id | player_name | game_title    | session_duration | achievements                 | purchase_history                    |
|-----------|-------------|---------------|------------------|------------------------------|-------------------------------------|
| 12345     | GamerX      | Space Odyssey | 120              | ["Level 10 reached", ...]    | [{"item_name": "Super Laser", ...   |
| 67890     | ProGamer    | Fantasy Quest | 90               | ["High score achieved", ...] | [{"item_name": "Power-Up Pack", ... |
| ...       | ...         | ...           | ...              | ...                          | ...                                 |

In Parquet, each column is stored separately, allowing for more efficient I/O operations, better compression, and faster query responses.
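A toy illustration of the column-wise layout (plain Python lists, not actual Parquet): a query that touches one column only reads that column's array:

```python
# The same two toy players, stored row-wise and column-wise.
row_store = [
    {"player_id": "12345", "game_title": "Space Odyssey", "session_duration": 120},
    {"player_id": "67890", "game_title": "Fantasy Quest", "session_duration": 90},
]

# Column-wise: one array per column.
column_store = {
    "player_id": ["12345", "67890"],
    "game_title": ["Space Odyssey", "Fantasy Quest"],
    "session_duration": [120, 90],
}

# Average session duration: only one of the arrays is ever touched,
# while the row store would force us to parse every full record.
durations = column_store["session_duration"]
print(sum(durations) / len(durations))  # prints: 105.0
```

Compression also benefits: values within one column share a type and often repeat, which is exactly what encoders like run-length or dictionary encoding exploit.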

Parquet files are designed to optimize disk input/output and storage space, so the data is divided into row groups and column chunks rather than being spread over many different blocks on the disk.

Data within each column is typically stored in a binary format, often as a sequence of bytes, which allows for efficient encoding and compression. This lets query engines read only the relevant columns, and only the specific bytes within those columns needed to satisfy the query conditions. Cool stuff. For the record, this is also known as "column pruning".
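Here is a stdlib-only toy of that idea (in the spirit of Parquet's footer metadata, not its real encoding): column chunks are written back to back, and an index of byte offsets lets a reader jump straight to one column:

```python
import json

# Toy "file" layout: column chunks written back to back, plus a footer-style
# index of (offset, length) per column. Not the real Parquet format.
columns = {
    "player_id": ["12345", "67890"],
    "session_duration": [120, 90],
}

buf = bytearray()
index = {}
for name, values in columns.items():
    chunk = json.dumps(values).encode()
    index[name] = (len(buf), len(chunk))  # where this column's bytes live
    buf += chunk

# "Column pruning": seek straight to one column's bytes, skip everything else.
offset, length = index["session_duration"]
durations = json.loads(bytes(buf[offset:offset + length]))
print(durations)  # prints: [120, 90]
```

Real Parquet readers do the same dance against the file footer: read the metadata, look up the chunk locations for the requested columns, and issue reads only for those byte ranges.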