Columnar databases have been called the future of business intelligence (BI). They're often used in data warehouses, the structured data repositories that businesses use to support corporate decision-making. Businesses extract data from multiple sources, including cloud-based applications and in-house repositories, and pipe it in batches to these data warehouses, where it serves as the basis for BI tools. Data warehouses benefit from the higher performance they can gain from a database that stores data by column rather than by row.

Why are columnar databases faster for data warehouses? Storage systems have to pull data from physical disk drives, which store information magnetically on spinning platters using read/write heads that move around to find the data that users request. The less the heads have to move, the faster the drive performs. If data is kept closer together, minimizing seek time, systems can deliver that data faster.

A multi-platter hard drive, with the read-write head poised over the top platter. Source: Eric Gaba

A multi-platter hard drive, with the read-write head poised over the top platter.
Source: Eric Gaba

What’s “faster”? Nowadays a typical hard drive seek operation may take only 4 milliseconds (ms) – but with the amount of big data stored in today’s enterprises, seek times can add up quickly. Solid state disk drives (SSD) offer seek times of less than 0.1 ms, but they cost several times as much as hard drives per gigabyte. In-memory databases offer seek times of just tens of nanoseconds, but they’re several hundred times more expensive than hard drives per unit of storage. Unless you have unlimited budget to throw at the problem, arranging data on the physical disk efficiently will pay off every time you need to access the data. And you can gain further performance benefits by employing compression on the columnar data, as we'll see in a moment.

Storing data efficiently

Row-oriented databases store each record in one or more contiguous blocks on disk. Column-oriented databases store each column in one or more contiguous blocks. Each scheme is better-suited to different use cases, as the following example illustrates.

Suppose you're a retailer maintaining a web-based storefront. An ecommerce site generates a lot of data. Consider product purchase transactions:

Screen Shot 2017-12-21 at 9.04.10 AM.png

Businesses handle transactions using online transaction-processing (OLTP) software. All the fields in each row are important, so for OLTP it makes sense to store items on disk by row, with each field adjacent to the next in the same block on the hard drive:


Want to learn about setting the data strategy for your organization?

Signup for a free 30-day course to learn what you need to succeed with data. We've worked with more than 500 companies of all sizes and helped them build their data infrastructure, run analytics, and make data-driven decisions. Learn how the data landscape has changed and what that means for your company.

We will never share your email

Choosing a columnar database as your data warehouse

You can buy, install, and host a column-oriented database in your own data center, using software such as HP Vertica, Apache Cassandra, and Apache HBase. If you have high-end hardware, you can expect good performance from on-premises databases, as long as the load is relatively constant. If you have variation in your workloads, you could see performance impacts. You'll also need more people in your IT department to help manage the hardware and software.

Many organizations prefer to host their data warehouses in the cloud, using services such as Amazon Redshift, Google BigQuery, and Snowflake. Cloud applications offer several benefits:

  • No capital requirements for hardware
  • Ability to architect for high availability with built-in fault tolerance
  • Flexible capacity and near-infinite scalability to deal with elastic demands
  • Always the latest technology

Once you’ve settled on using data warehouse, you’re going to have to populate it with data. You may be tempted to write code that extracts the data from your applications and loads it into your columnar database. There’s a better way to go. Stitch is a simple, powerful ETL services for businesses of all sizes, up to and including the enterprise.

Simple data integration with Stitch

Stitch is a cloud data integration service. Stitch connects to today’s most popular business tools – including Salesforce, Facebook Ads, and more than 60 others – and automatically replicates the raw data to a data warehouse. There's no code to write, and it automatically keeps your data up to date.  

Stitch was built to solve data integration. With just a few clicks, Stitch will extract your data from wherever it lives and get it ready to be analyzed, understood, and acted upon.  

14-day free trial  |  setup in minutes  |  no ETL scripts necessary