{"id":234,"date":"2007-04-18T09:01:27","date_gmt":"2007-04-18T16:01:27","guid":{"rendered":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/2007\/04\/18\/mysql-archive-tablespace-for-facts\/"},"modified":"2007-04-18T09:01:27","modified_gmt":"2007-04-18T16:01:27","slug":"mysql-archive-tablespace-for-facts","status":"publish","type":"post","link":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/2007\/04\/18\/mysql-archive-tablespace-for-facts\/","title":{"rendered":"MySQL Archive Tablespace for FACTs"},"content":{"rendered":"<p>I&#8217;m visiting a <a href=\"http:\/\/pentaho.com\">Pentaho<\/a> customer right now whose current &#8220;transaction&#8221; volume is 200 million rows per day.&nbsp; Relatively speaking, this puts their planned warehouse in the top quintile of size.&nbsp; They will face significant issues with load times, data storage, processing reliability, etc.&nbsp; <a href=\"http:\/\/kettle.pentaho.org\">Kettle<\/a> is the tool they selected and it is working really well.&nbsp; Distributed record processing using Kettle and a FOSS database is a classic case study for M&#229;rten Mickos&#8217; <a href=\"http:\/\/www.mysql.com\/why-mysql\/scaleout.html\">scale out manifesto<\/a>.&nbsp; <\/p>\n<p>This organization doesn&#8217;t have an unlimited budget.&nbsp; Specifically, they don&#8217;t have a telecom-type budget for their telecom-like volume of data.&nbsp; One of the issues that has come up with their implementation has been the tradeoff between storage space and keeping the base-level fact records.&nbsp; For example, at 200 million rows \/ day and X bytes per fact, you get into terabytes of storage quickly.&nbsp; <b>It was assumed, from the start of the project, that only summary-level data could be stored for any window of time exceeding 10 days or so.&nbsp; <\/b><\/p>\n<p>The overall math is sound.&nbsp; <\/p>\n<blockquote><p>Size per record (S) x number of records per day (N) = size per day of data growth (D)<\/p><\/blockquote>\n<p>In this equation, there&#8217;s really not much we can 
do about the number of records per day, if we want to keep the actual base-level transactions.&nbsp; N becomes a fixed parameter in this equation.&nbsp; We do have some control over the S value, which is what this post is about.<\/p>\n<p><b>Can we reduce the size of S by such an amount that D becomes more realistic?<\/b>&nbsp; The answer is the ARCHIVE engine in MySQL.<\/p>\n<p>I created the DDL for a typical fact record: a few dimension IDs and a few numeric values.<br \/><code><br \/>CREATE TABLE test_archive<br \/>(<br \/>&nbsp; DIM1_ID INT<br \/>, DIM2_ID INT<br \/>, DIM3_ID INT<br \/>, DIM4_ID INT<br \/>, DIM5_ID INT<br \/>, NUMBER1 FLOAT<br \/>, NUMBER2 FLOAT<br \/>, NUMBER3 FLOAT<br \/>, NUMBER4 FLOAT<br \/>) engine=ARCHIVE<br \/>;<br \/><\/code><\/p>\n<p>I did this in MyISAM as well:<br \/><code><br \/>CREATE TABLE test_myisam<br \/>(<br \/>&nbsp; DIM1_ID INT<br \/>, DIM2_ID INT<br \/>, DIM3_ID INT<br \/>, DIM4_ID INT<br \/>, DIM5_ID INT<br \/>, NUMBER1 FLOAT<br \/>, NUMBER2 FLOAT<br \/>, NUMBER3 FLOAT<br \/>, NUMBER4 FLOAT<br \/>) engine=MyISAM<br \/>;<br \/><\/code><br \/>I used a simple Kettle transformation to populate both of these tables with 1 million rows of data.  
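<\/p>\n<p>For anyone without Kettle handy, here&#8217;s a rough sketch of generating a similar million-row load in pure SQL.&nbsp; This is an illustration only, not the transformation used here: it assumes a much newer MySQL (8.0+, for recursive CTEs) and made-up value ranges for the dimension IDs.<br \/><code><br \/>-- let the row-generating CTE recurse past the default limit of 1000<br \/>SET SESSION cte_max_recursion_depth = 1000000;<br \/>INSERT INTO test_archive<br \/>WITH RECURSIVE seq (n) AS<br \/>(<br \/>&nbsp; SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n &lt; 1000000<br \/>)<br \/>SELECT FLOOR(RAND() * 1000), FLOOR(RAND() * 1000), FLOOR(RAND() * 100)<br \/>&nbsp;&nbsp;&nbsp;&nbsp; , FLOOR(RAND() * 100), FLOOR(RAND() * 10)<br \/>&nbsp;&nbsp;&nbsp;&nbsp; , RAND() * 100, RAND() * 100, RAND() * 100, RAND() * 100<br \/>FROM seq<br \/>;<br \/><\/code><\/p>\n<p>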
It wasn&#8217;t scientific, but the INSERT performance of the Archive and MyISAM tables was very similar (within 10% in row throughput).<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/www.nicholasgoodman.com\/entry_images\/kettle_archive_engine.jpg\" \/><\/p>\n<p>So now we have a million rows, with a reasonable FACT record format.&nbsp; How much space do these 1 million rows require?<br \/><code><br \/>+--------------+---------+---------------+------------+<br \/>| table_name   | engine  | total_size_mb | table_rows |<br \/>+--------------+---------+---------------+------------+<br \/>| test_archive | ARCHIVE |          2.63 |    1000000 |<br \/>| test_myisam  | MyISAM  |         36.24 |    1000000 |<br \/>+--------------+---------+---------------+------------+<br \/><\/code><br \/><b>The Archive table uses an order of magnitude LESS space to store the same set of FACTS.<\/b>&nbsp; What about query performance?&nbsp; I&#8217;ll have to do another blog on a more scientific approach, but the anecdotal query performance on typical OLAP queries (select sum() from fact group by dim1_id) seemed comparable (less than a 15% difference).&nbsp; <\/p>\n<p>Let&#8217;s be clear here: one MUST have a good aggregate table strategy so the fact records are RARELY accessed, because the performance will not rock your world.&nbsp; However, this is the case at these volumes anyhow.&nbsp; Summaries and aggregates have to be built for the end user to have a good experience on the front end.<\/p>\n<p>The Archive engine is strange. 
It supports INSERT and SELECT only.&nbsp; You can&#8217;t do ANYTHING else with it, except drop and recreate it.&nbsp; For historical fact loads on &#8220;settled&#8221; DW data segmented on a daily basis, this is usually not an issue.&nbsp; No transactions needed, no DML needed.&nbsp; I&#8217;d also like to see how this compares to &#8220;packed&#8221; MyISAM tables at some point.&nbsp; I wouldn&#8217;t expect packed MyISAM to beat the Archive engine&#8217;s compression in terms of storage space, but there are some nice things you can do with MyISAM tables (for instance, MERGE).<\/p>\n<p>Back-of-the-napkin calculation to handle this BIG data volume:<\/p>\n<blockquote><p>2.63 MB \/ million rows * 200 million rows \/ day = <b>526 MB \/ day<\/b><br \/>526 MB \/ day * 365 days = <b>187.5 GB \/ year<\/b><\/p><\/blockquote>\n<p>Anyone else have fun experiences to share with the Archive engine?<br \/><i><br \/>Side Note:<\/i>&nbsp; I did all this testing on a VNC-Server-enabled EC2 instance I use for testing.&nbsp; Fire it up, VNC in, do my testing, shut it down.&nbsp; I LOVE EC2! 
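<\/p>\n<p>For reference, the size comparison above came out of information_schema; a query along these lines reproduces it (the exact expression is a reconstruction, and note that table_rows is only an estimate for some engines):<br \/><code><br \/>SELECT table_name<br \/>&nbsp;&nbsp;&nbsp;&nbsp; , engine<br \/>&nbsp;&nbsp;&nbsp;&nbsp; , ROUND((data_length + index_length) \/ 1024 \/ 1024, 2) AS total_size_mb<br \/>&nbsp;&nbsp;&nbsp;&nbsp; , table_rows<br \/>FROM information_schema.tables<br \/>WHERE table_schema = DATABASE()<br \/>&nbsp; AND table_name IN ('test_archive', 'test_myisam')<br \/>;<br \/><\/code>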
<\/p>\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m visiting a Pentaho customer right now whose current &#8220;transaction&#8221; volume is 200 million rows per day.&nbsp; Relatively speaking, this puts their planned warehouse in the top quintile of size.&nbsp; They will face significant issues with load times, data storage, processing reliability, etc.&nbsp; Kettle is the tool they selected and it is working really well.&nbsp; [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[6,11],"tags":[],"_links":{"self":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts\/234"}],"collection":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/comments?post=234"}],"version-history":[{"count":0,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts\/234\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/media?parent=234"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/categories?post=234"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/tags?post=234"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}