


Databases

 


acmqueue

Originally published in Queue vol. 9, no. 4
see this item in the ACM Digital Library





Related:

Graham Cormode - Data Sketching
The approximate approach is often faster and more efficient.


Heinrich Hartmann - Statistics for Engineers
Applying statistical techniques to operations data


Pat Helland - Immutability Changes Everything
We need it, we can afford it, and the time is now.


R. V. Guha, Dan Brickley, Steve MacBeth - Schema.org: Evolution of Structured Data on the Web
Big data makes common schemas even more necessary.



Comments

(newest first)

Benjamin Black | Fri, 22 Apr 2011 21:26:37 UTC

I covered similar ground in a rather similar way in this GigaOm article from last year: http://gigaom.com/cloud/nosql-is-for-the-birds/

b


Michael Schuerig | Wed, 20 Apr 2011 20:01:06 UTC

Michael, I think what you are describing could be called "scaling SQL by non-relational means". The approach is more or less the same as what NoSQL people would do, with the only difference being how data is stored at the bottom: relationally or some other way.

When scaling up, "relationality" gets lost. I don't know, much less claim, that it has to be that way. Scaling up in practice means distributing data and processing over many nodes with fallible connections. Arbitrary relational operations, joins in particular, are no longer practical in such a setting.

Queries are no longer independent of the physical organization of data. To the contrary, the physical organization must be specifically designed to optimize for expected queries.

This comment is by no means intended as a criticism of your approach or anyone else's. It's meant as a reminder of the unavoidable(?) price we pay for scaling, namely that we no longer have a purely logical model of data.


Andrew Wolfe | Wed, 20 Apr 2011 15:18:41 UTC

I am a proud employee of Oracle Corporation, but this is my personal opinion, not written on behalf of Oracle.

The claim that relational databases cannot scale and that NoSQL databases are somehow magically able to perform better on huge datasets is uninformed or outdated. I remember the stunned look on a NoSQL proponent's face around 2008 when I told him I could run multiple simultaneous 10k-row/second imports into a departmental-sized Oracle server. The same organization considered 250+ midtier application servers to be a "scalable" solution for a mid-sized e-commerce site, much better than the single 4-core database server that was envisioned.

Proponents of NoSQL and critics of SQL really have to show due diligence in SQL tuning, not only along the lines of this article but also in leveraging vendor capabilities in hardware and software. A NoSQL solution may seem to get implemented fast, but as you debug the multithreading, concurrent data loads, interleaving of multiple data streams across multiple disks, multiple servers - are you really saving implementation time? Haven't you just shifted time from well-known relational DBA practices to a similar amount of improvisational NoSQL design and development?

In relational databases, scalability is "compromised" by the maintenance of transactional integrity - the so-called "ACID" properties. NoSQL proponents often correctly identify this issue. As someone who has worked with large data sets, I'm not on board with relaxing integrity support. In fact, the very size of the datasets requires vastly MORE attention to correctness, not a mindless obsession with performance numbers. The only thing worth "scaling" is not data throughput, but DURABLE TRANSACTIONS. The single-percentage data error one can tolerate with 500 records becomes intolerable at 500k. At 500M records, one percent is absolutely catastrophic. At that scale you can hardly even find out if you have a data problem without a relational database to sift it for you. It's unaccountable. Frankly, I believe many people choose non-relational data storage not in spite of the fact that it limits their accountability for administering a large data store, but because of it.

So I'll go one step further than Michael Rys. SQL is not just a good thing to scale into millions and billions - it's the ONLY thing that does.

Respectfully

Andrew D Wolfe Jr









© 2018 ACM, Inc. All Rights Reserved.