At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.
See part 1 for the introduction and talking about power and hardware. This part will go over the 2nd “P”, partitioning. Learning about Oracle’s partitioning has gotten me more interested in how MySQL’s partitioning works, and I do hope that MySQL partitioning will develop to the level that Oracle partitioning does, because Oracle’s partitioning looks very nice (then again, that’s why it costs so much I guess).
Partition – Larger tables or fact tables can benefit from partitioning because it makes data load easier and can increase join performance and use data elimination. Parallel execution can be done with partitioning due to partition pruning. The degree of parallelism should be a power of 2, because of hash-based algorithm in hash partitioning. To translate this to the MySQL world, if you are using LINEAR HASH partitioning, then you should use a degree of parallelism that is a power of 2 (I checked, and indeed. Otherwise, use a degree of parallelism that makes sense given the number of partitions you have.
One important note that during Pythian’s testing of MySQL partitioning, we found that all partitions were locked when an INSERT occurs, for the duration of the INSERT. Bulk-loading with MySQL partitioning is not as fast as it would be if MySQL allowed partition pruning for INSERTs.
So, what should be partitioned? For the first level of partitioning, the goal is to enable partitioning pruning and simplify data management. The most typical partitioning is range or interval partitioning on a date column. Interval partitioning is you say what the partition is (date, month) and partition is automatically created. MySQL does not have interval partitioning, and I have seen typical first-level partitioning be range or list based on a date or timestamp column. Note that if you use a timestamp field, the partitioning expression is optimized if you use TO_DAYS(timestamp_field) or YEAR(timestamp_field). In my experience, using anything else (such as DATE(timestamp_field)) actually makes partitioning slower than not using partitioning at all. Note that this is based on tests I did a few months ago, and your mileage may vary.
So — how do you decide partitioning strategy? Ask yourself:
The answers to the above questions will tell you about how big your interval needs to be. The best scenario is that all answers are the same, “we load every day, and people query by day.” If the answers are different weight access a higher priority than loading, because most people care more about query performance than performance of ETL.
This is true even if your intervals have different sizes — ie sales per day are much bigger in Dec but that’s OK. However, Maria recommends that the subpartition be as evenly divided as possible.
Easier to look at more partitions than to look at a partition that’s too big. But you don’t want too many partitions, max Oracle allows partitions is 1 million partitions, prior to 11g it was 64,000. “Stick closer to 64,000 than 1 million”. MySQL’s limitation is 1024 per table.
For the second level of partitioning, also called subpartitioning, the goal is to allow for multi-level pruning and improve join performance. In Oracle, the most typical subpartition is hash or list – in MySQL, you can only subpartition by hash or key.
How do you decide subpartitioning strategy?
For example, if you want to look at sales per day, per store, you would choose “per day” as the partition and “per store” as the subpartition.
If you do not have a good partition on logical elements (like grouping), then you can subpartition using hash partitioning on common joins — perhaps surrogate keys, or using join key of the largest table involved in the join.
For example, if the sales table is partitioned and another big table is product, you can hash subpartition product_id.
Because there’s overhead in partitions (loading metadata, reading metadata), make sure size of partitions and subpartitions is >20 Mb. So better to have a 30 Mb subpartition than a 15 Mb subpartition. [I have no idea if this is true in MySQL or not -- I think the general concept is true, because there is some overhead, but I have no idea about the 20 Mb figure and why that's true for Oracle, nor do I know what is true in MySQL.]
One easy calculation is double the # of CPUs, round up to nearest power of 2. If you’re executing in parallel, Oracle will use 2x CPUs. (all this advice, by the way, follows 80/20 rule, this is probably good for about 80% of the environments out there). Of course, MySQL does not do parallel execution very well, so this probably does not apply.
Oracle knows it can get partition elimination while it does a join.
If 2 tables have the same degree of parallelism (same # of buckets) and are partitioned in the same way on the join column (say, customer_id in a subpartition of sales and a partition of customer), Oracle will match the partitions when joining:
sales table joined with customer table can change into 4 small joins:
sales sub part 1 joins with customer part 1
sales sub part 2 joins with customer part 2
sales sub part 3 joins with customer part 3
sales sub part 4 joins with customer part 4
And with parallelism, the total time is now reduced to the time it takes to do one of those smaller joins.
This is also why you want to have a power of 2 for buckets – because cores/processors come in powers of 2. Partition-wise joins like this can also be done with range or list, assuming both tables in the join have the same buckets.
I have no idea if MySQL partitioning works this way, but it’s certainly a functionality that makes sense to me.
At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.
These are my notes from the session, which include comparisons of how Oracle works (which Maria gave) and how MySQL works (which I researched to figure out the difference, which is why this blog post took a month after the conference to write). Note that I am not an expert on data warehousing in either Oracle or MySQL, so these are more concepts to think about than hard-and-fast advice. In some places, I still have questions, and I am happy to have folks comment and contribute what they know.
One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it) from Forrester saying, “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”
I thought that was an interesting way of pointing that out — most people do not understand why 3NF is not good enough for data warehousing, and I have had a hard time explaining why a star or snowflake schema should be used. Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: compared to a star schema, in a snowflake schema, you have more than one fact table and maybe some dimensions that are not used often.
From Maria and the slides:
“Oracle says model what will suit your business best. Don’t get lost in academia. Most schemas are not 100% according to the theoretical
models. Some examples: 3NF schema with denormalized attributes to avoid costly joins, Star schema with multiple hierarchies in same fact table.”
Data warehousing has a 3-step approach —
1) data sources -> staging layer (temp loading layer)
2) staging layer (temp loading layer)-> foundation (logical, data store) layer
3) foundation (logical, data store) layer -> access and performance layer
The foundation layer is usually 3NF the access layer is usually a star or snowflake schema. As for the data sources, they can be varied, you would hope that they are in 3NF (and if they are you can skip the first 2 steps) but they are not always that way.
The 3 P’s of best practice for data warehousing (on Oracle) are power, partitioning, parallelism. The goal of the data warehousing environment is to minimize the amount of data accessed and use the most efficient joins – so it is not so index focused. This may be based on Oracle’s way of doing joins, I am not so sure if it applies to MySQL as well.
Power The weakest link in the chain (the 3 steps above) will define the throughput, so make sure your hardware configuration is balanced. Maria mentioned that as DBAs, “most of the time we don’t have control over this, but we’re still bound to the SLAs.”
This includes hardware that immediately comes to mind such as # of CPUs/cores, speed of CPU, amount of RAM, speed of disk as well as what we may not think of immediately: speed of network switches, speed of disk controllers, number and speed of host BUS adapters. Notes on host BUS adapters (HBAs): Know the # of HBA ports you have. 4 Gb HBA does 400 Mb/sec. 2 Gb HBA does 200 Mb/sec. Make sure there’s enough HBA capacity to sustain the CPU throughput (ie, make sure HBA isn’t the bottleneck). Also the speed at which it all talks. If you have a 4 Gb machine but a 2 Gb switch, you end up having 2 Gb throughput. Upgrade the network at the same time you upgrade machines.
Because we are talking about data warehousing, it is often not possible to eliminate disk I/O, so the goal is to have the fastest I/O throughput possible. Data warehouses need to be sized on I/O throughput not number of I/O’s.
I made a post earlier about how to determine I/O throughput for a system, which used information from this session. Justin Swanhart already pointed out that this is based on the fact that Oracle can do hash joins and MySQL can only do nested loop joins. I wonder, though, if there is indeed no case when using MySQL for which I/O throughput is a more useful metric than iops.
Disk arrays that are expensive are usually sized for iops, not throughput, and because they’re expensive the disk array is shared throughout the company. A DBA needs to ask ‘how many connections into the storage array do I have? How many disk controllers do I have? Where are my physical disks, and which controllers are they hanging off of?’
Typical 15k rpm disk can do about 25-35 Mb/sec (per disk) random i/o’s. Disk manufacturers will throw out numbers like 200-300 Mb/sec but that’s sequential I/O and leading edge of the drive. Make sure all your LUNs are not coming off the same set of disks, so that you’re not conflicting on disk seeks.
Continue to part 2, partitioning.
At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. In that session, there was a section on how to determine I/O throughput for a system, because in data warehousing I/O per second (iops) is less important than I/O throughput (how much actual data goes through, not just how many reads/writes).
The section contained an Oracle-specific in-database tool, and a standalone tool that can be used on many operating systems, regardless of whether or not a database exists:
If Oracle is installed, run DBMS_RESOURCE_MANAGER.CALIBRATE_IO:
For us MySQL folks, or even the Drizzle or NoSQL folks, Oracle offers a free standalone tool called Orion. The example given in the slides was:
./orion –run advanced –testname mytest –num_small 0 –size_large 1024 –type rand –simulate contact –write 0 –duration 60 –matrix column
-num_small is 0 because you don’t usually do small transactions in a dw.
-type rand for random I/O’s because data warehouse queries usually don’t do sequential reads
-write 0 – no writes, because you do not write often to the dw, that is what the ETL is for.
-duration is in seconds
-matrix column shows you how much you can sustain
I would be interested to see how other folks measure I/O throughput, and maybe even do a side-by-side comparison of different tools. Orion is available for:
Linux (x86, x86-64, Itanium, Power)
Solaris (SPARC64)
AIX (PPC64)
zLinux
HPUX (PA RISC, Itanium)
Windows
I am working on a larger write-up of the session itself, which had many concise descriptions of data warehousing issues, but I thought that this merited its own post.
MySQL Cluster is mainly an in-memory database. Nevertheless it requires a good I/O system for writing various different information to disk.
The information MySQL Cluster writes to disk are the:
In the following schema you can see what is related to each other:
Please find here the meaning of each parameter:
I get asked this question often. It was mentioned again recently in a NYTECH executive breakfast with RedHat CIO Lee Congdon.
The short answer is No.
There is clear evidence that in the short to medium term Oracle will continue to promote and enhance MySQL. Some of these indicators include:
It is clear from these sources that Oracle intends to incorporate MySQL into Oracle Backup and Security Vault products. Both a practical and necessary step. There is also a clear mention of focusing on the Microsoft platform, a clear indicator that SQL Server is in their sights without actually saying it.
What is unknown is exact how and when features will be implemented. Also important is how much these may cost the end user. Oracle is in the business of selling, now an entire H/W and S/W stack. They also have a complicated pricing model of different components with product offerings. I assume this will continue. There are already two indications, InnoDBbackup included for Enterprise Backup (from April Keynote) and 5.1 enterprise split. (Note: while this split may have existed prior to Oracle, it is now more clearly obvious).
MySQL can never be seen as drawing away from any Oracle sales of the core entry level database product. It is likely Oracle will provide a SQL Syntax compatibility layer for MySQL within 2 years, however it will I’m sure be a commercial add-on. Likewise, I would suspect a PL/SQL lite layer within 5 years, but again at a significant cost to offset the potential loss of sales in the low end of the server market. There continues to be active development in the MySQL Enterprise Monitor, MySQL Workbench and MySQL Connectors which is all excellent news for users.
Moving forward, how long will this ancillary development of free tools continue? What will happen to the commercial storage engine, OEM and licensing model after the 5 year commitment? How will the MySQL ecosystem survive.? There is active development in Percona, MariaDB and Drizzle forks, however unless all players that want to provide a close MySQL compatible solution work together, progress will continue to be a disappointing disjointed approach. The 2011 conference season will also see a clear line with competing MySQL conferences in April scheduled at the same time, the O’Reilly MySQL conference in Santa Clara California and the Oracle supported(*) Collaborate 2011 in Orlando, Florida.
I have a number of predictions on what Oracle ME MySQL may look like in 5 years however this is a topic for a personal discussion.
First this week we have John Anderson filling us up on the Perl high drama of OSCON of earlier this week. In a nutshell the organizers provided, as it’s the tradition, ribbons to the attendees, and the Perl Mongers in the crowd got one reading Desperate Perl Hacker. The epithet, coined in an XML article written in 1997, was meant in good fun, but was received with a distinct lack of glee by the Perl hackers. Which is no surprise, considering how mongers already have to fight tooth and nail to dispel the heavy baggage of preconceptions that our language accumulated throughout the years. Quite a few blog entries sprouted to discuss the whole hooplah, with Piers Cawley’s being one of the most eloquent, explaining that, yes Virginia, Perl can, and is, often used at the 11th hour to save someone’s bacon and, consequently, has instances of code that are less than perfectly pretty. But, it’s also much more than an emergency fire extinguisher and and has flip side that consists of a strong, solid and modern ecosystem that can, and is, definitively used for bigger projects.
Buuut enough about that. Let’s return to our regular parade of technological goodies, shall we?
Perl 5.12 will issue some deprecation warnings, even if the warnings pragma is not enabled. brian d foy walks us through them and dares us to turn them off. But before we do, he asks us to ask ourselves a very simple question. Namely, as punks, do we feel lucky? Do we?
Sam Graham joins the angst-filled crowd and, without doubt starring at the empty sockets of a grinning skull, asks himself To Dist::Zilla, or not to Dist::Zilla.
DBIx::Class users, prepare to squeal in delight. fREW opened a can of delicious frosting and came up with DBIx::Class::Candy.
Adam Kennedy gives us a glimpse of how his team deals with the deployment of massive Perl applications. Massive, really, you ask? 250,000 lines of Perl, spread over 750 modules, managing 100,000 physical users, with a turn over of about a billion dollars. So, yeah, massive, I say.
Hanemoku presents a few modern Perl administration tools. Some of them you might have seen in previous Shuck and Awe issues, some are new, all are good to know and have in your toolbox.
Kartak is working on SDL gaming goodness. As a sneak preview of things to come, he shows us a Zelda map walker. (Oooh, the memories…)
Our very own Dean is putting some popular web frameworks to the test and see how easy/fast it is to write a blog application using them. His first contestant? Catalyst.
Using DBIx::Class, and your table has some horrendously huge columns that you don’t necessarily want to retrieve each time you query a row? Ovid had that problem too, and shows us how he deals with it.
Roles in Perl? They’re awesome. Chris Prather tells you why.
Git is a developer’s best friend — maybe rough at the edges, maybe grumpy, but always there to make your life easier. However, a lot of projects / companies still use Subversion for a variety of reasons (politics, legacy, not having realized we have stepped into the 21st century yet, etc.). Not to fear, as fREW discovered recently git-svn is there to seamlessly bridge the svn mothership with your very own private satellite repository.
[yanick@enkidu shuck]$ perl -E'sleep 2 * 7 * 24 * 60 * 60 # see y'all in 2 weeks!'I will be joining a great list of quality speakers including John Allspaw, Theo Schlossnagle, Rasmus Lerdorf and Tom Cook at Surge 2010 in Baltimore, Maryland on Thu 30 Sep, and Fri Oct 1st 2010.
My presentation on “The most common MySQL scalability mistakes, and how to avoid them.” will include discussing various experiences observed in the field as a MySQL Consultant and MySQL Performance Tuning expert.
Abstract:The most common mistakes are easy to avoid however many startups continue to fall prey, with the impact including large re-design costs, delays in new feature releases, lower staff productivity and less then ideal ROI. All growing and successful sites need to achieve higher Availability, seamless Scalability and proven Resilience. Know the right MySQL environments to provide a suitable architecture and application design to support these essential needs.
Overview:Some details of the presentation would include:
To my shame I must admit, I missed it. It happened on June 29, 2005 when db4free.net was first available to the public. At that time it was running MySQL 5.0.7 beta. Quite a lot has happened since then, MySQL 5.0 made its way up to 5.0.91 and the current MySQL GA version is 5.1.49, which is also the version db4free.net is running as of today. The first phpMyAdmin version that db4free.net was offering to provide easy access to the user’s databases was 2.6.3. Today I updated phpMyAdmin to 3.3.5.
Statistics are not necessarily 100 % accurate, but here is the best I can come up with. Since its launch, db4free.net had 528,900 visits. The ratio registrations per visits is at about 22 %, so more than every fifth visit ends in signing up for a new database. Which means, that about between 110,000 and 120,000 database accounts (meaning database and user) have ever been created. There have been some cleanups since then to make resources available to people who actively use their databases. Which is why the current number of databases is much lower, at slightly above 13,400.
Today I must (also) admit that the code behind db4free.net’s web application was initially quite poor and it stayed poor for quite long. Why it was poor from my today’s point of view is probably due to my learning process in these 5 years. This year I did a lot of cleanup on the code behind db4free.net’s web application and removed a few bottlenecks which often made the website painfully slow. Now it’s in a fairly reasonable state again, and I have some ideas in the back of my head how to further improve it. So db4free.net is far from its end of life. Quite the opposite is the case, it’s time to give it a new boost so that the balance after its 10th anniversary is even more impressive than today’s 5 year balance.
One of the issues that has arisen from the ongoing debate about the open core licensing strategy is the continuing confusion about open core compared to the use of open source components in a larger proprietary product – such as IBM’s use of Apache within WebSphere.
To some people there is no difference between the two (since they both result in products that make use of open source but are not open source), however it is clear to me that while the end result might be the same these are very different strategies that involve different approaches to engaging with open source communities/projects.
While open core has a clear definition there is no agreed term or definition for the latter category.
Over the years we have used a variety of terms to describe it, including “open and closed”, “embedded open source”, “open inside” and “open complement”, while Jack Repenning has referred to it as “open infrastructure”.
Our next categorization of open source-related business strategies is still a work in progress but the current thinking is as follows:
(This categorization is a work in progress, we welcome and encourage any feedback)
Open core and open foundation have different evolutionary lineages: open core is a variation on dual licensing as practiced by the likes of MySQL and Sleepycat that also borrows heavily on the value-added subscription model as practiced by Red Hat and JBoss. Meanwhile open foundation has its roots in the commercialization of BSD, which pre-dates the concepts of open source and free software, as well as Apache.
From a practical perspective, the easiest way to think of the distinction between open core and open foundation is via an example:
PostgreSQL is an independent, community-developed open source project. EnterpriseDB offers extensions to the PostgreSQL core, such as Oracle-compatibility, in the form of Postgres Plus Advanced Server.
PostgreSQL has also been used by many other vendors to create commercial products. For example Greenplum used PostgreSQL as the foundation of its Greenplum Database (for other examples see this post). This allowed the company to build on proven database technology and avoid reinventing the wheel, but it also involved the creation of an entirely new product, rather than extensions to an open source project (the company initially actually started a new project, Bizgres, and created extensions to that but Bizgres was last seen in August 2008).
So while open core involves offering proprietary extensions targeted at a segment of the open source project user base, open foundation involves using open source software to create entirely new products, targeted at a different user base.
The example used above highlights three important points to consider when comparing open core and open foundation strategies:
1/ While open core is most readily associated with vendor-controlled projects it can also be used as a strategy to monetize community-controlled projects.
2/ Open core strategies can be used in conjunction with complementary strategies. In the Greenplum example the company’s relationship with Bizgres was open core, while the relationship with PostgreSQL was open foundation. Similarly there is an open core relationship between Actuate’s BIRT products and the Eclipse BIRT project, and an open complement relationship between Actuate 10 and the Eclipse BIRT project. Meanwhile there is an open core relationship between Day Software’s CRX content repository and the Apache Jackrabbit and Sling projects, and a open foundation relationship between CQ5 and Jackrabbit, Felix and Sling – as well as the numerous other Apache projects that Day contributes to.
3/ Open core and open foundation are licensing strategies used as part of a larger business strategy for engaging with and commercializing open source software, which highlights the futility in trying to pigeon-hole companies as “open core vendors” or “open source vendors”.
Finally it is worth thinking about the different tensions that the open core and open foundation strategies create with their respective communities.
As Jorg Janke notes, “looking for an income stream as an open source vendor always results in some sort of conflict with the community. So, you have to pick the community you want to ‘offend’.”
With a vendor-controlled open core strategy the community is a user community, and as we have previously discussed the conflict is in deciding what features belong in the core and what features don’t.
With an open foundation strategy the community is the open source project developer community, and the conflict lies in deciding what features and resources to contribute to that project.
A community-controlled open core strategy arguable results in conflict with both the user and developer communities, although since the vendor does not own or control the project the relationship is much more comparable to the open foundation strategy.
We will be writing more about other strategies for generating revenue from open source software, in a follow-up to our Open Source is Not a Business Model report, which is due to be published latter this year. It will provide more context for the economic motivators and issues involved in the various models, as well as updated research on which vendors are following which strategies, and why, as well as a survey to uncover what software users make of it all. The report will be freely available to CAOS subscribers. For more details of the CAOS research practice, and to apply for trial access, click here.
I’ll be giving a webinar about Exadata implementation, where I’ll be talking about Exadata features and how best to use them. I’ll also be sharing some lessons learned from my own implementation experience.
The webinar will be on Wednesday August 11 at high noon eastern time. Note that this is a change from the previous date.
To register, visit https:// www2.gotomeeting.com/register/171889707. For more Pythian webinars, visit http://www.pythian.com/library/webinars/.
This is my first mobile post so please excuse any typos. I must admit though that T-Mobile 3G is _fast_.
I recently gave a webinar to the LAOUC and NZOUG user groups on MySQL Idiosyncrasies that BITE.
For the benefit of many viewers that do not use English as a first language my slides include some additional information from my ODTUG Kaleidoscope presentation in June.
Thanks to Francisco Munoz Alvarez for organizing.
MySQL Idiosyncrasies That Bite 2010.07 View more presentations from Ronald Bradford.If you own a Nokia N900 cellular device you might be interested in the ability to control all of your IM accounts from the command line. For those that do not know, the N900 runs Maemo Linux and is capable of running MySQL embedded if you so choose. Here’s a quick script I wrote to provide that functionality for IM accounts. It’s at the bottom of the page, called “im-connections”.
wiki: http://wiki.maemo.org/N900_Mission_Control#Set_all_SIP_accounts_to_online_or_offline
pastebin: http://pastebin.com/qAC57E1N
MySQL Connector/Net 6.3.3, a new version of the all-managed .NET driver for MySQL has been released. This is a beta release and is intended to introduce you to the new features and enhancements we are planning. This release should not be used in a production environment. It is now available in source and binary form from http://dev.mysql.com/downloads/connector/net/6.3.html] and mirror sites (note that not all mirror sites may be up to date at this point of time – if you can’t find this version on some mirror, please try again later or choose another download site.)
The new features or changes in this release are:
What we know may be broken
Documentation is not updated yet and is not integrated into VS 2010.
Please let us know what else we broke and how we can make it better!
“Dynamo: Amazon’s Highly Available Key-value Store” is a high level description of a data store, written by Amazon to solve the problem of a system where updates must never ever fail and must take less than a specific about of time in 99.99% of the cases. No matter what happens to the servers or the network, updates to the system must continue as usual, and they emphasize that they deal with hardware and network failures nearly constantly.
The paper has one of the best descriptions on the trade-offs involved in eventual consistency, and when it makes sense. But even more interesting is the implicit decision that disks, commits and synchronous writes to redo logs are not really needed for durability.
To allow for simple design, robustness and high performance, Amazon limited the reporting capabilities – data is accessed by primary key only. To implement the high availability requirement, Dynamo replicates each key-value pair between multiple machines in different data stores. And to make sure that replication doesn’t impact performance, Amazon decided they can live without consistency – requests from the system can sometimes return old data or several contradicting versions of data – for Amazon’s requirements, this is much better than failure or delays.
The consistency trade-off is very explicitly dealt with in the paper – there will be multiple versions of the data, a request may get any number of replies and may not get the most recent updates. The application developers have to build applications that can handle this situation and merge several versions of the data into the information they want.
The trade-off of durability is done implicitly. The paper simply does not mention writing anything to any disk. Nothing like redo logs is mentioned either (except in the context of tracking changes for a missing node). This is obviously very fast, but can we really trust this system not to lose any changes?
To avoid losing data, the system uses a quorum-like system. There are 3 important parameters to tune for the replication: How many times each key-value pair should be replicated (N), how many nodes should participate in a successful write (W) and how many nodes should participate in successful read (R).
When W=3, at least 3 nodes should signal that the write was successful before the application gets an “ack” from the data store. A coordinator node will attempt to write to N nodes, but once W nodes approve the write, it is considered permanent. In order to lose an update we need at least W nodes to crash simultaneously after an update and probably more because W is just the lower bound. As long as at least one of those nodes is up, the data is not lost.
The idea is to set the replication level to a number high enough to satisfy durability requirements, but low enough to prevent too much latency. You also want to split those nodes between at least two data centers to reduce the chance of losing all nodes at once. Considering that this system doesn’t allow for backups, I would set the number of replicated nodes really high. I would expect about 50 nodes covering each piece of data before I’d consider it as durable as disk bases system (considering how often servers crash vs. how often disks crash). Amazon, on the other hand, mentioned in the paper that they use replication level of 3, and they keep stressing that not losing a single update is vital to their business (because each update can be a sell of an item), so maybe I’m under-estimating the durability of this system.
From the description, it also appears that it is impossible to dump the contents of the system. This means that backups are impossible, no point-in-time-recovery, and ETL into a DW system also sounds impossible.
I like the idea of databases without disks – disks are notorious for just slowing everything down. I’m just not sure I would want my bank to use a replication-only system, even if it does seem durable in theory. So, if we can’t use Dynamo for banks, what is it good for?
Amazon’s paper gives Amazon shopping cart as an example – The data is temporary anyway, there are many more writes to the system than reads (you keep puting items in the cart, but only look at the cart once before checking out) and there is a clear rule on how to merge conflicting versions (union). Social Networks is another obvious use-case, with tons of updates and trivial way to merge different version, I’m not sure if access is limited to just primary key though – you want to see your own data and that of your friends.
I think that an interesting possible use case is for system monitoring data – it is also update heavy, can be mostly primary-key access (metric, such as CPU on specific machine, being the key), and merging versions is trivial (union). Monitoring data does need to be reported, so I would expect my data store to allow dumping the data into a more traditional store in specific intervals.
The Dynamo paper was written four years ago, I’m sure the system improved since. I’m also sure that the paper gives only a partial description of the features of the system (and I suspect that they do write data to disks after all). Still, it paved way for a future of special purpose databases , ones that solve a specific use-case better than a general purpose RDBMS.
If you ask yourself “What are my requirements?” and “What is the best data store for my requirements?” instead of saying “We have Oracle, lets throw everything there and it will take care of everything magically”, your application will be better for it. As Cary Millsap recently wrote – it is better to think clearly than to be correct.
Josh Berkus gave a great talk at linux.conf.au 2010 (the CFP for linux.conf.au 2011 is open until August 7th) entitled “How to destroy your community” (lwn coverage). It was a simple, patented, 10 step program, finely homed over time to have maximum effect. Each step is simple and we can all name a dozen companies that have done at least three of them.
Simon Phipps this past week at OSCON talked about Open Source Continuity in practice – specifically mentioning some open source software projects that were at Sun but have since been abandoned by Oracle and different strategies you can put in place to ensure your software survives, and check lists for software you use to see if it will survive.
So what can you do to not destroy your community, but ensure you never get one to begin with?
Similar to destroying your community, you can just make it hard: “#1 is to make the project depend as much as possible on difficult tools.”
#1 A Contributor License Agreement and Copyright Assignment.
If you happen to be in the unfortunate situation of being employed, this means you get to talk to lawyers. While your employer may well have an excellent Open Source Contribution Policy that lets you hack on GPL software on nights and weekends without a problem – if you’re handing over all the rights to another company – there gets to be lawyer time.
Your 1hr of contribution has now just ballooned. You’re going to use up resources of your employer (hey, lawyers are not cheap), it’s going to suck up your work time talking to them, and if you can get away from this in under several hours over a few weeks, you’re doing amazingly well – especially if you work for a large company.
If you are the kind of person with strong moral convictions, this is a non-starter. It is completely valid to not want to waste your employers’ time and money for a weekend project.
People scratching their own itch, however small is how free software gets to be so awesome.
I think we got this almost right with OpenStack. If you compare the agreement to the Apache License, there’s so much common wording it ends up pretty much saying that you agree you are able to submit things to the project under the Apache license. This (of course) makes the entire thing pretty redundant as if people are going to be dishonest about submitting things under the Apache licnese there’s no reason they’re not going to be dishonest and sign this too.
You could also never make it about people – just make it about your company.
#2 Make it all about the company, and never about the project
People are not going to show up, do free work for you to make your company big, huge and yourself rich.
People are self serving. They see software they want only a few patches away, they see software that serves their company only a few patches away. They see software that is an excellent starting point for something totally different.
I’m not sure why this is down at number three… it’s possibly the biggest one for danger signs that you’re going to destroy something that doesn’t even yet exist…
#3 Open Core
This pretty much automatically means that you’re not going to accept certain patches for reasons of increasing your own company’s short term profit. i.e. software is no longer judged on technical merits, but rather political ones.
There is enough politics in free software as it is, creating more is not a feature.
So when people ask me about how I think the OpenStack launch went, I really want people to know how amazing it can be to just not fuck it up to begin with. Initial damage is very, very hard to ever undo. The number of Open Source software projects originally coming out of a company that are long running, have a wide variety of contributors and survive the original company are much smaller than you think.
PostgreSQL has survived many companies coming and going around it, and is stronger than ever. MySQL only has a developer community around it almost in spite of the companies that have shepherded the project. With Drizzle I think we’ve been doing okay – I think we need to work on some things, but they’re more generic to teams of people working on software in general rather than anything to do with a company.
MySQL seems to be happy to convert types for you. Developers are rushed to complete their project and if the function works they just move on. But what is the costs of mixing your types? Does it matter if your are running across a million rows or more? Lets find out.
Here is what the programmers see.
mysql> select 1+1; +-----+ | 1+1 | +-----+ | 2 | +-----+ 1 row in set (0.00 sec) mysql> select "1"+"1"; +---------+ | "1"+"1" | +---------+ | 2 | +---------+ 1 row in set (0.00 sec)Benchmark
What if we do a thousand simple loops? How long does the looping itself take?
The BENCHMARK() function executes the expression expr repeatedly count times. It may be used to time how quickly MySQL processes the expression. The result value is always 0.
mysql> select benchmark(1000000000, 1); +--------------------------+ | benchmark(1000000000, 1) | +--------------------------+ | 0 | +--------------------------+ 1 row in set (5.42 sec) mysql> select benchmark(1000000000, "1" ); +-----------------------------+ | benchmark(1000000000, "1" ) | +-----------------------------+ | 0 | +-----------------------------+ 1 row in set (5.40 sec)So maybe type doesn’t matter? About five seconds just to loop but the type didn’t change it. What if we add 1+”1″?
mysql> select benchmark(1000000000, 1+1); +----------------------------+ | benchmark(1000000000, 1+1) | +----------------------------+ | 0 | +----------------------------+ 1 row in set (12.65 sec) mysql> select benchmark(1000000000, 1+"1"); +------------------------------+ | benchmark(1000000000, 1+"1") | +------------------------------+ | 0 | +------------------------------+ 1 row in set (35.58 sec) mysql> select benchmark(1000000000, "1"+"1"); +--------------------------------+ | benchmark(1000000000, "1"+"1") | +--------------------------------+ | 0 | +--------------------------------+ 1 row in set (51.59 sec)It looks like type does matter. But does it always matter?
mysql> select benchmark(1000000000, sum(1+1)); +---------------------------------+ | benchmark(1000000000, sum(1+1)) | +---------------------------------+ | 0 | +---------------------------------+ 1 row in set (9.69 sec) mysql> select benchmark(1000000000, sum("1"+"1")); +-------------------------------------+ | benchmark(1000000000, sum("1"+"1")) | +-------------------------------------+ | 0 | +-------------------------------------+ 1 row in set (9.94 sec) mysql> select benchmark(1000000000, sum("1.23456789"+"1.23456789")); +-------------------------------------------------------+ | benchmark(1000000000, sum("1.23456789"+"1.23456789")) | +-------------------------------------------------------+ | 0 | +-------------------------------------------------------+ 1 row in set (10.32 sec)So, not all functions are the same. But it looks like size might matter!
mysql> select benchmark(1000000000, 1.1+1.1); +--------------------------------+ | benchmark(1000000000, 1.1+1.1) | +--------------------------------+ | 0 | +--------------------------------+ 1 row in set (34.90 sec) mysql> select benchmark(1000000000, "1.1"+"1.1"); +------------------------------------+ | benchmark(1000000000, "1.1"+"1.1") | +------------------------------------+ | 0 | +------------------------------------+ 1 row in set (1 min 15.32 sec) mysql> select benchmark(1000000000, "1.123456789"+"1.123456789"); +----------------------------------------------------+ | benchmark(1000000000, "1.123456789"+"1.123456789") | +----------------------------------------------------+ | 0 | +----------------------------------------------------+ 1 row in set (1 min 53.32 sec)Sorry. Looks like size does matter.
This doesn't seem logical.
Maybe we should CAST our work?
mysql> select benchmark(1000000000, cast("1" as unsigned)); +----------------------------------------------+ | benchmark(1000000000, cast("1" as unsigned)) | +----------------------------------------------+ | 0 | +----------------------------------------------+ 1 row in set (32.27 sec) mysql> select benchmark(1000000000, cast("1" as unsigned) + cast("1" as unsigned)); +----------------------------------------------------------------------+ | benchmark(1000000000, cast("1" as unsigned) + cast("1" as unsigned)) | +----------------------------------------------------------------------+ | 0 | +----------------------------------------------------------------------+ 1 row in set (1 min 7.24 sec)Maybe not!
Conclusion: Be careful with your data types. If you are taking user input, do the type conversion ONCE in your program. Don’t let MySQL do the type conversions for you.
query = “SELECT * FROM table where $INPUT = 1″; could be doing your wrong.
最近评论
1 周 2 天 前
1 周 4 天 前
1 周 4 天 前
1 周 5 天 前
1 周 5 天 前
1 周 5 天 前
1 周 6 天 前