{"id":500,"date":"2011-01-31T23:45:06","date_gmt":"2011-02-01T06:45:06","guid":{"rendered":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/?p=500"},"modified":"2011-01-31T23:45:06","modified_gmt":"2011-02-01T06:45:06","slug":"column-store-101","status":"publish","type":"post","link":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/2011\/01\/31\/column-store-101\/","title":{"rendered":"Column Store 101"},"content":{"rendered":"<p>I&#8217;m often asked, as an initial question of why LucidDB can perform so much better than traditional row store databases like Oracle, SQLServer, DB2, MySQL is <strong>HOW<\/strong>?<\/p>\n<p>There&#8217;s a bunch of reasons and we have entire sections of our Wiki dedicated to the topic, but first and foremost is corner stone of almost every analytic database.\u00a0 <em><strong>We&#8217;ve changed the way we orient the data on disk<\/strong><\/em>.\u00a0 I&#8217;m not going to go into too much detail here, but I think as a start, a very simple example can help gain a little more understanding about how we can deliver an order of magnitude better performance over a row store.\u00a0 More to come, as well.<\/p>\n<p>Take the following query that is typical of BI workloads (we&#8217;re using commercial flight data available from the BTS):<\/p>\n<blockquote><p>select count(Carrier), Carrier, DayOfWeek, CancellationCode<br \/>\ngroup by Carrier, DayOfWeek, Cancellation<\/p><\/blockquote>\n<p>This query is typical &#8211; we want some aggregation (count of flights) grouped by several qualifying attributes (Carrier, DayOfWeek, CancellationCode).\u00a0 We need to examine (in our example dataset) approximately 63 million records to answer this question.\u00a0 In other words, we&#8217;re looking at all the data in a table to answer the query.<\/p>\n<p>In a row store, even with indexes, a query that needs all the data from the database needs to touch all the data in the table.\u00a0 This means THE ENTIRE TABLE is accessed from an I\/O perspective.\u00a0 Even though only a few columns or bits of data might be used, the ENTIRE row (that contains all the column data) is accessed off disk.<\/p>\n<p>Let&#8217;s take our 63 million record table, that has approximately 99 columns.\u00a0 Assuming (and this is bad assumption) that the row store takes the exact same amount of storage as a column store, the total table size is <strong>3,025 MB<\/strong>.\u00a0 In the row store, the data stored in rows is stored in blocks and is relatively uniform (ie, approximately 1000 rows \/ blocks and stored in 63,000 blocks).\u00a0 In a column store, we store the columns separately.\u00a0 Our storage for the same 3,025 MB breaks down like this (total is still 3,025 MB).<\/p>\n<p><a href=\"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-content\/uploads\/2011\/01\/image.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-502\" title=\"image\" src=\"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-content\/uploads\/2011\/01\/image-214x300.png\" alt=\"\" width=\"214\" height=\"300\" \/><\/a><\/p>\n<p><em>NOTE: Sorry for the graph labels! Sort of jumbled!<\/em><\/p>\n<p>As you can see, some columns still continue to take up a fair amount of space (120MB +) but other columns are much much smaller.\u00a0 The smaller columns are ones where values are repeated often (Year, Carrier, etc).<\/p>\n<p>Here&#8217;s the gist.\u00a0 Remember our SQL query typical in BI systems?<\/p>\n<blockquote><p>select count(Carrier), Carrier, DayOfWeek, CancellationCode group by Carrier, DayOfWeek, Cancellation<\/p><\/blockquote>\n<p>The SQL statement only access 3 columns (Carrier, DayOfWeek, CancellationCode).<\/p>\n<p>In a Row Store, the database has to read the ENTIRE table from disk (3025 MB)<\/p>\n<p><a href=\"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-content\/uploads\/2011\/01\/row_store_io.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-503\" title=\"row_store_io\" src=\"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-content\/uploads\/2011\/01\/row_store_io-300x229.png\" alt=\"\" width=\"300\" height=\"229\" \/><\/a><\/p>\n<p>In a Column Store, the database only reads the columns it needs from disk (44 MB)<\/p>\n<p><a href=\"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-content\/uploads\/2011\/01\/column_store_io.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-504\" title=\"column_store_io\" src=\"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-content\/uploads\/2011\/01\/column_store_io-300x229.png\" alt=\"\" width=\"300\" height=\"229\" \/><\/a><\/p>\n<p>The column store is doing almost <strong>1\/100th of the work of the row store<\/strong>!\u00a0 44 MB vs 3025 MB!<\/p>\n<p>It isn&#8217;t magic.\u00a0 It isn&#8217;t some magical breakthrough in CPU technology.<\/p>\n<blockquote><p><strong>We&#8217;ve simply changed how we&#8217;re storing the data on disk so that asking the same question (SQL) on the same data (63 million rows) does far less work (44 MB vs 3025 MB) to give the same answer.<\/strong><\/p><\/blockquote>\n<p>I&#8217;ll follow up with more on the this and other topics, but I hope this helps explain a very very basic reason of how LucidDB can deliver such fantastic improvements over otherwise very well performing OLTP databases like Oracle.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m often asked, as an initial question of why LucidDB can perform so much better than traditional row store databases like Oracle, SQLServer, DB2, MySQL is HOW? There&#8217;s a bunch of reasons and we have entire sections of our Wiki dedicated to the topic, but first and foremost is corner stone of almost every analytic [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[5],"tags":[],"_links":{"self":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts\/500"}],"collection":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/comments?post=500"}],"version-history":[{"count":0,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts\/500\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/media?parent=500"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/categories?post=500"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/tags?post=500"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}