By definition, a columnar database stores data
by columns rather than by rows, which makes it suitable for analytical query processing, and thus for data warehouses.
Columnar databases have been called the future of business intelligence (BI). They're often used in data warehouses, the structured data repositories that businesses use to support corporate decision-making. Businesses extract data from multiple sources, including cloud-based applications and in-house repositories, and pipe it in batches to these data warehouses, where it serves as the basis for BI tools. Data warehouses benefit from the higher performance they can gain from a database that stores data by column rather than by row.
Why are columnar databases faster for data warehouses? Storage systems have to pull data from physical disk drives, which store information magnetically on spinning platters using read/write heads that move around to find the data that users request. The less the heads have to move, the faster the drive performs. If data is kept closer together, minimizing seek time, systems can deliver that data faster.
What’s “faster”? Nowadays a typical hard drive seek operation may take only 4 milliseconds (ms) – but with the amount of big data stored in today’s enterprises, seek times can add up quickly. Solid state disk drives (SSD) offer seek times of less than 0.1 ms, but they cost several times as much as hard drives per gigabyte. In-memory databases offer seek times of just tens of nanoseconds, but they’re several hundred times more expensive than hard drives per unit of storage. Unless you have unlimited budget to throw at the problem, arranging data on the physical disk efficiently will pay off every time you need to access the data. And you can gain further performance benefits by employing compression on the columnar data, as we'll see in a moment.
Storing data efficiently
Row-oriented databases store each record in one or more contiguous blocks on disk. Column-oriented databases store each column in one or more contiguous blocks. Each scheme is better-suited to different use cases, as the following example illustrates.
Suppose you're a retailer maintaining a web-based storefront. An ecommerce site generates a lot of data. Consider product purchase transactions:
Businesses handle transactions using online transaction-processing (OLTP) software. All the fields in each row are important, so for OLTP it makes sense to store items on disk by row, with each field adjacent to the next in the same block on the hard drive:
512,Seabiscuit,Book,10.95,201712241200,goodreads.com 513,Bowler,Apparel,59.95,201712241200,google.com 514.Cuphead,Game,20.00,201712241201,gamerassaultweekly.com
Want to learn about setting the data strategy for your organization?
Sign up for a free 30-day course to learn what you need to succeed with data. We've worked with more than 500 companies of all sizes and helped them build their data infrastructure, run analytics, and make data-driven decisions. Learn how the data landscape has changed and what that means for your company.
We will never share your email
Choosing a columnar database as your data warehouse
You can buy, install, and host a column-oriented database in your own data center, using software such as HP Vertica, Apache Cassandra, and Apache HBase. If you have high-end hardware, you can expect good performance from on-premises databases, as long as the load is relatively constant. If you have variation in your workloads, you could see performance impacts. You'll also need more people in your IT department to help manage the hardware and software.
- No capital requirements for hardware
- Ability to architect for high availability with built-in fault tolerance
- Flexible capacity and near-infinite scalability to deal with elastic demands
- Always the latest technology
Once you’ve settled on using data warehouse, you’re going to have to populate it with data. You may be tempted to write code that extracts the data from your applications and loads it into your columnar database. There’s a better way to go. Stitch is a simple, powerful ETL services for businesses of all sizes, up to and including the enterprise.
Simple data integration with Stitch
Stitch is a cloud data integration service. Stitch connects to today’s most popular business tools – including Salesforce, Facebook Ads, and more than 60 others – and automatically replicates the raw data to a data warehouse. There's no code to write, and it automatically keeps your data up to date.
Stitch was built to solve data integration. With just a few clicks, Stitch will extract your data from wherever it lives and get it ready to be analyzed, understood, and acted upon.
14-day free trial | setup in minutes | no ETL scripts necessary