DBMS2

Choices in data management and analysis

Imanis Data

Tue, 2017-08-22 07:46

I talked recently with the folks at Imanis Data. For starters:

  • The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
  • The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
  • As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
  • “Imanis” is a new name; the previous name was “Talena”.

Also:

  • Imanis has ~35 subscription customers, a significant majority of which are in the Fortune 1000.
  • Customer industries, in roughly declining order, include:
    • Financial services other than insurance.
    • Insurance.
    • Retail.
    • “Technology”.
  • ~40% of Imanis customers are in the public cloud.
  • Imanis is focused on the North American market at this time.
  • Imanis has ~45 employees.
  • The Imanis product just hit Version 3.

Imanis correctly observes that there are multiple reasons you might want to recover from backup, including:

  • General disaster/system failure.
  • Bug in an application that writes data.
  • Malicious acts, including encryption-by-ransomware.

Imanis uses the phrase “point-in-time backup” to emphasize its flexibility in letting you choose your favorite time-version of your rolling backup.

Imanis also correctly draws the inference that the right backup strategy is some version of:

  • Make backups very frequently. This boils down to “Do a great job of making incremental backups (and restoring from them when necessary).” This is where Imanis has spent the bulk of its technical effort to date.
  • In case recovery is needed, identify the last clean (or provably/confidently clean) version of the database and restore from that. The identification part boils down to letting the backup databases be queried directly. That’s largely a roadmap item.
    • Imanis has recently added the capability to query the backup database through its own functionality.
    • General access via JDBC or the like is still in the future.

Note: When Imanis backups offer direct query access, it will of course be possible to use the backup data for general query processing. But while that kind of capability sounds great in theory, I’m not aware of it being a big deal (on technology stacks that already offer it) in practice.

The most technically notable other use cases Imanis mentioned are probably:

  • Data science dataset generation. Imanis lets you generate a partial copy of the database for analytic or test purposes.
    • You can project, select or sample your data, which suggests use of the current query capabilities.
    • There’s an API to let you mask Personally Identifiable Information by writing your own data transformations.
  • Archiving/tiering/ILM (Information Lifecycle Management). Imanis lets you divide data according to its hotness.

Imanis views its competition as:

  • Native utilities of the data stores.
  • Hand-coded scripts.
  • Datos.io, principally in the Cassandra market (so far).

Beyond those, the obvious comparison to Imanis is Delphix. I haven’t spoken with Delphix for a few years, but I believe that key differences between Delphix and Imanis start:

  • Delphix is focused on widely-installed RDBMS such as Oracle.
  • Delphix actually tries to have different production logical copies of your database run off of the same physical copy. Imanis, in contrast, offers technology to help you copy your databases quickly and effectively, but the copies you actually use will indeed be separate from each other.

Imanis software runs on its own cluster, based on hacked Hadoop. A lot of the hacking seems to relate to a metadata store, which supports things like:

  • Understanding which (incrementally backed up) blocks need to be pulled together to make a specific copy of the database.
  • Putting data in different places for ILM/tiering.
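
To make the first of those concrete, here is a minimal sketch of assembling a point-in-time image from a full backup plus incrementals. All names and structures are hypothetical, since Imanis hasn't published its internals:

    # Hypothetical sketch: each backup version maps block IDs to the
    # blocks it actually captured (changed blocks only, except for the
    # initial full backup).
    def assemble_restore(backups, target_time):
        """backups: list of (timestamp, {block_id: data}) tuples,
        oldest first; the first entry is the full backup."""
        image = {}
        for ts, blocks in backups:
            if ts > target_time:
                break                # ignore backups after the chosen point in time
            image.update(blocks)     # newer block versions overwrite older ones
        return image                 # block_id -> most recent data as of target_time

    # Example: full backup at t=0, incrementals at t=1 and t=2.
    backups = [
        (0, {"b1": "A0", "b2": "B0", "b3": "C0"}),
        (1, {"b2": "B1"}),
        (2, {"b1": "A2"}),
    ]
    assert assemble_restore(backups, 1) == {"b1": "A0", "b2": "B1", "b3": "C0"}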

Another piece of Imanis tech is machine-learning-based anomaly detection.

  • As incrementally backed-up blocks arrive, Imanis flags anomalous ones, and each flag comes with a stated reason.
  • You can dismiss a flag as a false alert, and hopefully similar flags won’t be raised in the future.

The technology for this seems rather basic:

  • Random forests for the flagging.
  • No drilldown w/in the Imanis system for follow-up.

But in general concept this is something a lot more systems should be doing.
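
The general concept is easy to illustrate. Below is a sketch using scikit-learn's IsolationForest, a tree-ensemble anomaly detector standing in for whatever random-forest formulation Imanis actually uses; the per-block features are invented:

    # Sketch: flag anomalous incremental-backup blocks by their summary
    # statistics. IsolationForest is a stand-in; Imanis' actual
    # random-forest approach is not public.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    # Per-block features (invented): size in MB, fraction of bytes
    # changed, compression ratio.
    normal_blocks = rng.normal(loc=[64, 0.05, 3.0], scale=[8, 0.02, 0.5],
                               size=(1000, 3))
    model = IsolationForest(random_state=0).fit(normal_blocks)

    new_blocks = np.array([[66, 0.06, 2.9],   # looks ordinary
                           [64, 0.98, 1.0]])  # nearly every byte changed and
                                              # incompressible: ransomware-like
    print(model.predict(new_blocks))          # +1 = normal, -1 = anomalous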

Most of the rest of Imanis’ tech story is straightforward — support various alternatives for computing platforms, offer the usual security choices, etc. One exception that was new to me was the use of erasure codes, which seem to be a generalization of the concept of parity bits. Allegedly, when used in a storage context these have the near-magical property of offering 4X replication safety with only a 1.5X expansion of data volume. I won’t claim to have understood the subject well enough to see how that could make sense, or what tradeoffs it would entail.
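
For what it's worth, the standard arithmetic behind such claims, assuming Reed-Solomon-style codes (my guess as to what's meant), goes as follows: split each object into k data fragments, add m computed parity fragments, and any m lost fragments can be rebuilt, at a storage expansion of only (k+m)/k. With k=8 and m=4, that is 1.5X expansion while tolerating 4 simultaneous losses, one more than the 3 losses 4X replication survives. A single-parity toy version:

    # Toy erasure code: k data fragments plus one XOR parity fragment.
    # Any single lost fragment can be rebuilt by XORing the survivors.
    # Real systems use Reed-Solomon-style codes with m > 1 parity fragments.
    from functools import reduce

    def xor_bytes(frags):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), frags)

    data = [b"DATA", b"MORE", b"BITS"]   # k = 3 data fragments
    parity = xor_bytes(data)             # 1 parity fragment: 4/3X expansion

    # Lose any one data fragment; rebuild it from the rest plus parity.
    lost = 1
    survivors = [f for i, f in enumerate(data) if i != lost] + [parity]
    assert xor_bytes(survivors) == data[lost]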

Categories: Other

More notes on the transition to the cloud

Thu, 2017-08-17 04:11

Last year I posted observations about the transition to the cloud. Here are some further thoughts.

0. In case any doubt remained, the big questions about transitioning to the cloud are “When?” and “How?”. “Whether”, by way of contrast, is pretty much settled.

1. The answer to “When?” is generally “Over many years”. In particular, at most enterprises the cloud transition will span the tenures of multiple CIOs.

Few enterprises will ever execute on simple, consistent, unchanging “cloud strategies”.

2. The SaaS (Software as a Service) vs. on-premises tradeoffs are being reargued, except that proponents now spell SaaS C-L-O-U-D. (Ali Ghodsi of Databricks made a particularly energetic version of that case in a recent meeting.)

3. In most countries (at least in the US and the rest of the West), the cloud vendors deemed to matter are Amazon, followed by Microsoft, followed by Google. Even so, when it comes to the public cloud, Microsoft is much, much more enterprise-savvy than its key competitors.

4. Another non-technical competitive factor: Wal-Mart isn’t the only huge company that is hostile to the Amazon cloud because of competition with other Amazon businesses.

5. It was once thought that in many small countries around the world, there would be OpenStack-based “national champion” cloud winners, perhaps as subsidiaries of the leading telecom vendors. This doesn’t seem to be happening.

Even so, some of the larger managed-economy and/or generally authoritarian countries will have one or more “national champion” cloud winners each — surely China, presumably Russia, obviously Iran, and probably some others as well.

6. While OpenStack in general seems to have fizzled, S3 compatibility has momentum.

7. Finally, let’s return to our opening points: The cloud transition will happen, but it will take considerable time. A principal reason for slowness is that, as a general rule, apps aren’t migrated to platforms directly; rather, they get replaced by new apps on new platforms when the time is right for them to be phased out anyway.

However, there’s a codicil to those generalities — in some cases it’s easier to migrate to the new platform than in others. The hardest migration was probably when the rise of RDBMS, the shift from mainframes to UNIX and the switch to client/server all happened at once; just about nothing got ported from the old platforms to the new. Easier migrations included:

  • The switch from Unix to Linux. They were very similar.
  • The adoption of virtualization. A major purpose of the technology was to make migration easy.
  • The initial adoption of DBMS. Then-legacy apps relied on flat file systems, which DBMS often found easy to emulate.

The cloud transition is somewhere in the middle between those extremes. On the “easy” side:

  • Popular database management technologies and so on are available in the cloud just as they are on-premises.
  • Major app vendors are doing the hard work of cloud ports themselves.

Nonetheless, the public cloud is in many ways a whole new computing environment — and so for the most part, customer-built apps will prove too difficult to migrate. Hence my belief that overall migration to the cloud will be very incremental.

Categories: Other

Notes on data security

Thu, 2017-08-10 04:15

1. In June I wrote about burgeoning interest in data security. I’d now like to add:

  • Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
  • In an exception to that general rule, many enterprises have vague mandates for data encryption.
  • In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.

We can reconcile these anecdata pretty well if we postulate that:

  • Enterprises generally agree that data security is an important need.
  • Exactly how they meet this need depends upon what regulators choose to require.

2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically:

  • The freer non-English-speaking countries are more concerned about ensuring data privacy. In particular, the European Union’s upcoming GDPR (General Data Protection Regulation) seems like a massive addition to the compliance challenge.
  • The “Five Eyes” (US, UK, Canada, Australia, New Zealand) are more concerned about maintaining the efficacy of surveillance.
  • Authoritarian countries, of course, emphasize surveillance as well.

3. Multiple people have told me that security concerns include (data) lineage and (data) governance as well. I’m fairly OK with that conflation.

  • By citing “lineage” I think they’re referring to the point that if you don’t know where data came from, you don’t know if it’s trustworthy. This fits well with standard uses of the “data lineage” term.
  • By “data governance” they seem to mean policies and procedures to limit the chance of unauthorized or uncontrolled data change, or technology to support those policies. Calling that “data governance” is a bit of a stretch, but it’s not so ridiculous that we need to make a big fuss about it.

In other words: If your data transformation pipelines aren’t locked down, then your data isn’t locked down either.

4. But how seriously does that last point need to be taken? For starters, the possibility of erroneous calculations:

  • Is a strong threat to analytic accuracy, as has been recognized at least for the decades that “one version of the truth” has been a catchphrase.
  • Has some regulatory risk, e.g. in the United States around Sarbanes-Oxley.
  • Is not as big a deal for the core security threat of data theft/exfiltration.

Further, it’s not too hard architecturally to have a divide between:

  • Data transformation for operational use cases, which may need to be locked down.
  • Data transformation for purely investigative analytics, which can be very fluid, for transformation technologies such as Hadoop, Spark and Excel alike.

Bottom line: Data transformation security is an accessible must-have in some use cases, but an impractical nice-to-have in others.

Categories: Other

Analytics on the edge?

Fri, 2017-06-30 03:27

There’s a theory going around to the effect that:

  • Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
  • Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
  • Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration. 

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

  • Machine vision or other “recognition”-oriented areas of AI.
  • Detection or prediction of malfunctions.
  • Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

  • Huge amounts of data are collected and are used to make real-time decisions.
  • The models are trained centrally, and updated remotely over time as they are improved.
  • The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.
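
To make that concrete, here is a hypothetical skeleton of such a system; every function name is a placeholder rather than any real product's API:

    # Hypothetical edge-node loop: decide locally in real time, ship back
    # only aggregates, and accept centrally retrained models as they arrive.
    import time

    def run_edge_node(load_model, read_sensors, ship_aggregates, check_for_update):
        model = load_model()
        window = []
        while True:
            decision = model(read_sensors())   # real-time local decision
            window.append(decision)
            if len(window) >= 1000:            # ship aggregates, not raw data
                ship_aggregates({"n": len(window),
                                 "positive_rate": sum(window) / len(window)})
                window.clear()
            new_model = check_for_update()     # central retraining, pushed out
            if new_model is not None:
                model = new_model
            time.sleep(0.01)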

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

  • A model is widely deployed.
  • The model does a decent job but not a perfect one.
  • Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

  • Differ for different (categories of) applications.
  • Rely in most cases on simple patterns of data movement, such as:
    • Stream everything to central servers and sort it out there, or if that’s not workable …
    • … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
    • Update models only on the same timeframes as full app updates/refreshes.

And with that, much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may cover only what the edge nodes regard as significant events. But something is getting shipped home.

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

  • Is designed for exactly that use case.
  • Went GA early this year.

As always, technology is in flux.

Categories: Other

Generally available Kudu

Fri, 2017-06-16 10:52

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

  • Security is an ever bigger deal.
  • There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
    • Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
    • Flash is often — but not yet always — preferred over disk for that kind of use.
    • Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
  • Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

  • A data storage system introduced by Cloudera (and subsequently open-sourced).
  • Columnar.
  • Updatable in human real-time.
  • Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

  • Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
  • A subsequent release with some basic security features spawned another uptick.
  • I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
  • But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

  • Solid-state storage is recommended, with a few terabytes per node.
  • You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
  • Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
  • There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
  • As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

  • Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
  • You can configure the uniqueness to be guaranteed either through an upsert mechanism or simply by rejecting duplicates.
  • Alternatively, you can write code to handle duplication errors, e.g. via Spark.
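
As a sketch of the upsert route, using the kudu-python client (host, table and columns are made up), duplicate delivery simply overwrites the same row:

    # Sketch: idempotent ingest into Kudu via upsert, so stream retries
    # that redeliver an event just rewrite the same row.
    import kudu

    client = kudu.connect(host='kudu-master.example.com', port=7051)
    table = client.table('events')               # hypothetical table
    session = client.new_session()

    for event in [{'event_id': 42, 'payload': 'hello'},
                  {'event_id': 42, 'payload': 'hello'}]:  # duplicate delivery
        session.apply(table.new_upsert(event))   # second apply overwrites
    session.flush()
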
Categories: Other

The data security mess

Wed, 2017-06-14 08:21

A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:

  • Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
  • A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
  • The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
  • Traditionally paranoid industries are still paranoid.
  • Other industries are newly (and rightfully) terrified of exposing customer data.
  • My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.

*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.

Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):

  • Easy, comprehensive access control. More on this below.
  • Encryption. If other forms of security were perfect, encryption would never be needed. But they’re not.
  • Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do prompt damage control in the face of a breach.
  • Whatever regulators mandate.
  • Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
  • Whatever the government is known to use. This is a common proxy for “best practices”.

More specific or extreme requirements also come up from time to time; I don’t know how widely those kinds of requirements will spread.

The most confusing part of all this may be access control.

  • Security has a concept called AAA, standing for Authentication, Authorization and Accounting/Auditing/Other things that start with “A”. Yes — even the core acronym in this area is ill-defined.
  • The new standard for authentication is Kerberos. Or maybe it’s SAML (Security Assertion Markup Language). But SAML is actually an old, now-fragmented standard. But it’s also particularly popular in new, cloud use cases. And Kerberos is actually even older than SAML.
  • Suppose we want to deny somebody authorization to access certain raw data, but let them see certain aggregated or derived information. How can we be sure they can’t really see the forbidden underlying data, except through a case-by-case analysis? And if that case-by-case analysis is needed, how can the authorization rules ever be simple?

Further confusing matters, it is an extremely common analytic practice to extract data from somewhere and put it somewhere else to be analyzed. Such extracts are an obvious vector for data breaches, especially when the target system is managed by an individual or IT-weak department. Excel-on-laptops is probably the worst case, but even fat-client BI — both QlikView and Tableau are commonly used with local in-memory data staging — can present substantial security risks. To limit such risks, IT departments are trying to impose new standards and controls on departmental analytics. But IT has been fighting that war for many decades, and it hasn’t won yet.

And that’s all when data is controlled by a single enterprise. Inter-enterprise data sharing confuses things even more. For example, national security breaches in the US tend to come from government contractors more than government employees. (Ed Snowden is the most famous example. Chelsea Manning is the most famous exception.) And as was already acknowledged above, even putting your data under control of a SaaS vendor opens hard-to-plug security holes.

Data security is a real mess.

Categories: Other

Light-touch managed services

Wed, 2017-06-14 08:14

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

  • Altus manages jobs for you.
  • But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are: 

  • The user has some kind of environment that manages data and executes programs.
  • The light-touch service, running outside this environment, spawns one or more app processes inside it.
  • Useful work ensues …
  • … with acceptable reliability and performance.
  • The environment’s security guarantees ensure that data doesn’t leak out.

Cases where that doesn’t even make sense include but are not limited to:

  • Transaction-processing applications that are carefully tuned for efficient database access.
  • Applications that need to be carefully installed on or in connection with a particular server, DBMS, app server or whatever.

On the other hand:

  • A light-touch service is at least somewhat reasonable in connection with analytics-oriented data-management-plus-processing environments such as Hadoop/Spark clusters.
  • There are many workloads over Hadoop clusters that don’t need efficient database access. (Otherwise Hive use would not be so prevalent.)
  • Light-touch efforts seem more likely to be helped than hurt by abstraction environments such as the public cloud.

So we can imagine some kind of outside service that spawns analytic jobs to be run on your preferred — perhaps cloudy — Hadoop/Spark cluster. That could be a safe way to get analytics done over data that really, really, really shouldn’t be allowed to leak.

But before we anoint light-touch managed services as the NBT (Next Big Thing/Newest Bright Thought), there’s one more hurdle for it to overcome — why bother at all? What would a light-touch managed service provide that you wouldn’t also get from installing packaged software onto your cluster and running it in the usual way? The simplest answer is “The benefits of SaaS (Software as a Service)”, and so we can rephrase the challenge as “Which benefits of SaaS still apply in the light-touch managed service scenario?”

The vendor perspective might start, with special cases such as Cloudera Altus excepted:

  • The cost-saving benefits of multi-tenancy mostly don’t apply. Each instance winds up running on a separate cluster, namely the customer’s own. (But that’s likely to be SaaS/cloud itself.)
  • The benefits of controlling your execution environment apply at best in part. You may be able to assume the customer’s core cluster is through some cloud service, but you don’t get to run the operation yourself.
  • The benefits of a SaaS-like product release cycle do mainly apply.
    • Only having to support the current version(s) of the product is a little limited when you don’t wholly control your execution environment.
    • Light-touch doesn’t seem to interfere with the traditional SaaS approach of a rapid, incremental product release cycle.

When we flip to the user perspective, however, the idea looks a little better.

Bottom line: Light-touch managed services are well worth thinking about. But they’re not likely to be a big deal soon.

Categories: Other

Cloudera Altus

Wed, 2017-06-14 08:12

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

  • Provide all the important advantages of on-premises Cloudera.
  • Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce), or at least come sufficiently close to that goal.
  • Benefit from customers’ desire to have on-premises and cloud deployments that work:
    • Alike in any case.
    • Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Since so much of the Hadoop and Spark stacks is open source, competition often isn’t based on core product architecture or features, but rather on factors such as:

  • Ease of management. This one is nuanced in the case of cloud/Altus. For starters:
    • One of Cloudera’s main areas of differentiation has always been Cloudera Manager.
    • Cloudera Director was Cloudera’s first foray into cloud-specific management.
    • Cloudera Altus features easier/simpler management than Cloudera Director, meant to be analogous to native Amazon management tools, and good-enough for use cases that don’t require strenuous optimization.
    • Cloudera Altus also includes an optional workload analyzer, in slight conflict with other parts of the Altus story. More on that below.
  • Ease of development. Frankly, this rarely seems to come up as a differentiator in the Hadoop/Spark world, various “notebook” offerings such as Databricks’ or Cloudera’s notwithstanding.
  • Price. When price is the major determinant, Cloudera is sad.
  • Open source purity. Ditto. But at most enterprises — at least those with hefty IT budgets — emphasis on open source purity either is a proxy for price shopping, or else boils down to largely bogus concerns about vendor lock-in.

Of course, “core” kinds of considerations are present to some extent too, including:

  • Performance, concurrency, etc. I no longer hear many allegations of differences in across-the-board Hadoop performance. But the subject does arise in specific areas, most obviously in analytic SQL processing. It arises in the case of Altus as well, in that Cloudera improved in a couple of areas that it concedes were previously Amazon EMR advantages, namely:
    • Interacting with S3 data stores.
    • Spinning instances up and down.
  • Reliability and data safety. Cloudera mentioned that it did some work so as to be comfortable with S3’s eventual consistency model.

Recently, Cloudera has succeeded at blowing security up into a major competitive consideration. Of course, they’re trying that with Altus as well. Much of the Cloudera Altus story is the usual — rah-rah Cloudera security, Sentry, Kerberos everywhere, etc. But there’s one aspect that I find to be simple yet really interesting:

  • Cloudera Altus doesn’t manage data for you.
  • Rather, it launches and manages jobs on a separate Hadoop cluster.

Thus, there are very few new security risks to running Cloudera Altus, beyond whatever risks are inherent to running any version of Hadoop in the public cloud.

Where things get a bit more complicated is some features for workload analysis.

  • Cloudera recently introduced some capabilities for on-the-fly trouble-shooting. That’s fine.
  • Cloudera has also now announced an offline workload analyzer, which compares actual metrics computed from your log files to “normal” ones from well-running jobs. For that, you really do have to ship information to a separate cluster managed by Cloudera.

The information shipped is logs rather than actual query results or raw data. In theory, an attacker who had all those logs could conceivably make inferences about the data itself; but in practice, that doesn’t seem like an important security risk at all.

So is this an odd situation where that strategy works, or could what we might call light-touch managed services turn out to be widespread and important? That’s a good question to address in a separate post.

Categories: Other

Interana

Mon, 2017-04-17 05:10

Interana has an interesting story, in technology and business model alike. For starters:

  • Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
  • Interana has a full-stack analytic offering, including:
    • Its own columnar DBMS …
    • … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
    • … there also are BI-like visual analytics tools that support plenty of drilldown.
  • Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
  • Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 figures.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

  • For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
  • However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

  • Interana’s DML is focused on path analytics …
    • … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
    • Interana may be the first company that’s ever told me it’s focused on providing a better nPath. :)
  • Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
  • As typical example questions or analytic subjects, Interana offered:
    • “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
    • “Exactly which steps in the onboarding process result in the greatest user frustration?”
  • The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
  • The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
  • To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).

Notes on Interana speeds & feeds include:

  • Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most network monitoring and devops use cases.)
  • Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
  • Interana installations and workloads to date have gotten as large as:
    • 1-200 nodes.
    • Trillions of rows, equating to 100s of TBs of data after compression, or >1 PB uncompressed.
    • Billions of rows/events received per day.
    • 100s of 1000s of (very sparse) columns.
    • 1000s of named users.

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

  • They’re serious about micro-batching.
    • If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
    • Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
  • They’re casual about schemas.
    • Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
      • Interana observes, correctly, that log data often is decently structured.
        • For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
        • Interana calls this “logging with intent”.
      • Interana is fine with a certain amount of JSON (for example) schema change over time.
      • If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
    • JSON hierarchies turn into multi-part column names in the usual way.
    • Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be lists themselves.
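
A sketch of what those two conventions imply; the separator character and error behavior are my guesses:

    # Sketch of Interana-style flattening: JSON hierarchies become
    # multi-part column names; one level of lists is allowed, but not
    # lists of lists.
    def flatten(doc, prefix=""):
        cols = {}
        for key, val in doc.items():
            name = prefix + key
            if isinstance(val, dict):
                cols.update(flatten(val, prefix=name + "."))
            elif isinstance(val, list):
                if any(isinstance(v, list) for v in val):
                    raise ValueError("lists of lists are not supported")
                cols[name] = val       # the one permitted level of nesting
            else:
                cols[name] = val
        return cols

    doc = {"user": {"id": 7, "geo": {"country": "US"}}, "tags": ["a", "b"]}
    assert flatten(doc) == {"user.id": 7, "user.geo.country": "US",
                            "tags": ["a", "b"]}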

Finally, other Interana tech notes include:

  • Compression is a central design consideration …
    • … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE). (See the sketch after this list.)
    • Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
    • Column data is sorted. A big part of the reason is of course to aid compression.
    • Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
  • As you would think, Interana’s technology includes multiple data stores.
    • Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
    • Asynchronously, the data is broken into columns, and banged to “disk”.
    • Asynchronously again, the data is sorted.
    • Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
  • Interana lets you shard different replicas of the data according to different shard keys.
  • Interana is proud of the random sampling it does when serving approximate query results.
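
Returning to compression, as promised above: a generic sketch of why run-length encoding suits sorted, sparse columns. This is the textbook encoding, not Interana's actual code:

    # Run-length encoding: sorted, sparse column data collapses into a
    # handful of (value, count) pairs.
    from itertools import groupby

    def rle_encode(column):
        return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

    def rle_decode(pairs):
        return [value for value, count in pairs for _ in range(count)]

    column = [None] * 6 + ["CA"] + ["US"] * 3   # sorted: NULLs first
    encoded = rle_encode(column)                # [(None, 6), ('CA', 1), ('US', 3)]
    assert rle_decode(encoded) == column
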
Categories: Other

Analyzing the right data

Thu, 2017-04-13 07:05

0. A huge fraction of what’s important in analytics amounts to making sure that you are analyzing the right data. To a large extent, “the right data” means “the right subset of your data”.

1. In line with that theme:

  • Relational query languages, at their core, subset data. Yes, they all also do arithmetic, and many do more math or other processing than just that. But it all starts with the set theory.
  • Underscoring the power of this approach, other data architectures over which analytics is done usually wind up with SQL or “SQL-like” language access as well.

2. Business intelligence interfaces today don’t look that different from what we had in the 1980s or 1990s. The biggest visible* changes, in my opinion, have been in the realm of better drilldown, ala QlikView and then Tableau. Drilldown, of course, is the main UI for business analysts and end users to subset data themselves.

*I used the word “visible” on purpose. The advances at the back end have been enormous, and much of that redounds to the benefit of BI.

3. I wrote 2 1/2 years ago that sophisticated predictive modeling commonly fit the template:

  • Divide your data into clusters.
  • Model each cluster separately.

That continues to be tough work. Attempts to productize shortcuts have not caught fire.

4. In an example of the previous point, anomaly management technology can, in theory, help shortcut any type of analytics, in that it tries to identify what parts of your data to focus on (and why). But it’s in its early days; none of the approaches to general anomaly management has gained much traction.

5. Marketers have vast amounts of information about us. It starts with every credit card transaction line item and a whole lot of web clicks. But it’s not clear how many of those (10s of) thousands of columns of data they actually use.

6. In some cases, the “right” amount of data to use may actually be tiny. Indeed, some statisticians claim that fewer than 10 data points may be enough to get a good model. I’m skeptical, at least as to the practical significance of such extreme figures. But on the more plausible side — if you’re hunting bad guys, it may not take very many separate facts before you have good evidence of collusion or fraud.

Internet fraud excepted, of course. Identifying that usually involves sifting through a lot of log entries.

7. All the needle-hunting in the world won’t help you unless what you seek is in the haystack somewhere.

  • Often, enterprises explicitly invest in getting more data.
  • Keeping everything you already generate is the obvious choice for most categories of data, but some of the lowest-value-per-bit logs may forever be thrown away.

8. Google is famously in the camp that there’s no such thing as too much data to analyze. For example, it famously uses >500 “signals” in judging the quality of potential search results. I don’t know how many separate data sources those signals are informed by, but surely there are a lot.

9. Few predictive modeling users demonstrate a need for vast data scaling. My support for that claim is a lot of anecdata. In particular:

  • Some predictive modeling techniques scale well. Some scale poorly. The level of pain around the “scale poorly” aspects of that seems to be fairly light (or “moderate” at worst). For example:
    • In the previous technology generation, analytic DBMS and data warehouse appliance vendors tried hard to make statistical packages scale across their systems. Success was limited. Nobody seemed terribly upset.
    • Cloudera’s Data Science Workbench messaging isn’t really scaling-centric.
  • Spark’s success in machine learning is rather rarely portrayed as centering on scaling. And even when it is, Spark basically runs in memory, so each Spark node isn’t processing all that much data.

10. Somewhere in this post — i.e. right here :) — let’s acknowledge that the right data to analyze may not be exactly what was initially stored. Data munging/wrangling/cleaning/preparation is often a big deal. Complicated forms of derived data can be important too.

11. Let’s also mention data marts. Basically, data marts subset and copy data, either because the data will be easier to analyze in its copied form, or to separate workloads between the original and copied data stores.

  • If we assume the data is on spinning disks or even flash, then the need for that strategy declined long ago.
  • Suppose you want to keep data entirely in memory? Then you might indeed want to subset-and-copy it. But with so many memory-centric systems doing decent jobs of persistent storage too, there’s often a viable whole-dataset management alternative.

But notwithstanding the foregoing:

  • Security/access control can be a good reason for subset-and-copy.
  • So can other kinds of administrative simplification.

12. So what does this all suggest going forward? I believe:

  • Drilldown is and will remain central to BI. If your BI doesn’t support robust drilldown, you’re doing it wrong. “Real-time” use cases are not exceptions to this rule.
  • In a strong overlap with the previous point, drilldown is and will remain central to monitoring. Whatever monitoring means to you, the ability to pinpoint the specific source of interesting signals is crucial.
  • The previous point can be recast as saying that it’s crucial to identify, isolate and explain anomalies. Some version(s) of anomaly management will become a big deal.
  • SQL and “SQL-like” languages will remain integral to analytic processing for a long time.
  • Memory-centric analytic frameworks such as Spark will continue to win. The data size constraints imposed by memory-centric processing will rarely cause difficulties.

Categories: Other

Monitoring

Sun, 2017-03-26 06:16

A huge fraction of analytics is about monitoring. People rarely want to frame things in those terms; evidently they think “monitoring” sounds boring or uncool. One cost of that silence is that it’s hard to get good discussions going about how monitoring should be done. But I’m going to try anyway, yet again. :)

Business intelligence is largely about monitoring, and the same was true of predecessor technologies such as green paper reports or even pre-computer techniques. Two of the top uses of reporting technology can be squarely described as monitoring, namely:

  • Watching whether trends are continuing or not.
  • Seeing if there are any events — actual or impending as the case may be — that call for response, in areas such as:
    • Machine breakages (computer or general metal alike).
    • Resource shortfalls (e.g. various senses of “inventory”).

Yes, monitoring-oriented BI needs investigative drilldown, or else it can be rather lame. Yes, purely investigative BI is very important too. But monitoring is still the heart of most BI desktop installations.

Predictive modeling is often about monitoring too. It is common to use statistics or machine learning to help you detect and diagnose problems, and many such applications have a strong monitoring element.

I.e., you’re predicting trouble before it happens, when there’s still time to head it off.

As for incident response, in areas such as security — any incident you respond to has to be noticed first. Often, it’s noticed through analytic monitoring.

Hopefully, that’s enough of a reminder to establish the great importance of analytics-based monitoring. So how can the practice be improved? At least three ways come to mind, and only one of those three is getting enough current attention.

The one that’s trendy, of course, is the bringing of analytics into “real-time”. There are many use cases that genuinely need low-latency dashboards, in areas such as remote/phone-home IoT (Internet of Things), monitoring of an enterprise’s own networks, online marketing, financial trading and so on. “One minute” is a common figure for latency, but sometimes a couple of seconds are all that can be tolerated.

I’ve posted a lot about all this in the past.

One particular feature that could help with high-speed monitoring is to meet latency constraints via approximate query results. This can be done entirely via your BI tool (e.g. Zoomdata’s “query sharpening”) or more so by your DBMS/platform software (the SnappyData folks pitched me on that approach this week).
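
The underlying statistics are simple, whatever the vendor-specific machinery looks like. A generic sketch, not Zoomdata's or SnappyData's actual implementation:

    # Sketch of approximate query answering: estimate an aggregate from a
    # random sample, with a rough confidence interval, instead of
    # scanning everything.
    import random, statistics

    population = [random.gauss(100, 15) for _ in range(100_000)]

    sample = random.sample(population, 1_000)   # ~1% of the data
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / len(sample) ** 0.5
    print(f"avg ~ {mean:.2f} +/- {1.96 * stderr:.2f} (95% CI)")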

Perennially neglected, on the other hand, are opportunities for flexible, personalized analytics. (Note: There’s a lot of discussion in that link.) The best-acknowledged example may be better filters for alerting. False negatives are obviously bad, but false positives are dangerous too. At best, false positives are annoyances; but too often, alert fatigue causes your employees to disregard crucial warning signals altogether. The Gulf of Mexico oil spill disaster has been blamed on that problem. So was a fire in my own house. But acknowledgment != action; improvement in alerting is way too slow. And some other opportunities described in the link above aren’t even well-acknowledged, especially in the area of metrics customization.

Finally, there’s what could be called data anomaly monitoring. The idea is to check data for surprises as soon as it streams in, using your favorite techniques in anomaly management. Perhaps an anomaly will herald a problem in the data pipeline. Perhaps it will highlight genuinely new business information. Either way, you probably want to know about it.

David Gruzman of Nestlogic suggests numerous categories of anomaly to monitor for. (Not coincidentally, he believes that Nestlogic’s technology is a great choice for finding each of them.) Some of his examples — and I’m summarizing here — are:

  • Changes in data format, schema, or availability. For example:
    • Data can completely stop coming in from a particular source, and the receiving system might not immediately realize that. (My favorite example is the ad tech firm that accidentally stopped doing business in the whole country of Australia.)
    • A data format change might make data so unreadable it might as well not arrive.
    • A decrease in the number of approval fields might highlight a questionable change in workflow.
  • Data quality. NULLs or malformed values might increase suddenly, in particular fields and data segments.
  • Data value distribution. This category covers a lot of cases. A few of them are:
    • A particular value is repeated implausibly often. A bug is the likely explanation.
    • E-commerce results suddenly decrease, but only from certain client technology configurations. Probably there is a bug affecting only those particular clients.
    • Clicks suddenly increase from certain client technologies. A botnet might be at work.
    • Sales suddenly increase from a particular city. Again this might be fraud — or more benignly, perhaps some local influencers have praised your offering.
    • A particular medical diagnosis becomes much more common in a particular city. Reasons can range from fraud, to a new facility for certain kinds of tests, to a genuine outbreak of disease.

David offered yet more examples of significant anomalies, including ones that could probably only be detected via Nestlogic’s tools. But the ones I cited above can probably be found via any number of techniques — and should be, more promptly and accurately than they currently are.
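
Many of those checks need nothing fancy. For instance, the "repeated implausibly often" case can be caught with a per-batch frequency check; the threshold here is arbitrary:

    # Sketch: flag a value repeated implausibly often in an arriving
    # batch, relative to its historical share.
    from collections import Counter

    def repeated_value_anomalies(batch, history_freq, factor=3.0):
        n = len(batch)
        flags = []
        for value, count in Counter(batch).items():
            share = count / n
            expected = history_freq.get(value, 1.0 / n)
            if share > factor * expected:
                flags.append((value, share, expected))
        return flags

    history = {"US": 0.5, "CA": 0.3, "MX": 0.2}
    batch = ["US"] * 5 + ["CA"] * 3 + ["MX"] * 92      # MX suddenly dominates
    print(repeated_value_anomalies(batch, history))    # [('MX', 0.92, 0.2)]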

Categories: Other

Cloudera’s Data Science Workbench

Sun, 2017-03-19 19:41

0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:

  • One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
  • Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
  • A third way is needed.

Cloudera’s idea for a third way is:

  • You don’t run anything on your desktop/laptop machine except a browser.
  • The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
  • The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.

In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.

1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:

  • Hadoop worker nodes.
  • Hadoop master nodes.
  • Nodes that run Cloudera Manager.

The fourth kind are edge/gateway nodes. Those handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.

2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.

And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place to bring in and use other tools you personally like.

Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.

3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.

  • As noted above, a lot of the hassle of Hadoop-based data science relates to security.
  • As also noted above, evading the hassle by extracting data is a huge security risk. (If you lose customer data, you’re going to have a very, very bad day.)
  • According to Matt, standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is. Cloudera’s otherwise similar notebook-style interface evidently avoids that flaw. (Presumably, if you want to see the output, you rerun the script against the data store yourself.)

4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:

  • Sufficiently good quantitative skills to do the analysis.
  • Sufficiently good computer skills to do SQL queries and so on, but not a lot more than that.

Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.

5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.

6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.

Categories: Other

Introduction to SequoiaDB and SequoiaCM

Sun, 2017-03-12 13:19

For starters, let me say:

  • SequoiaDB, the company, is my client.
  • SequoiaDB, the product, is the main product of SequoiaDB, the company.
  • SequoiaDB, the company, has another product line, SequoiaCM, which subsumes SequoiaDB in content management use cases.
  • SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
  • … and is usually sold for RDBMS-like use cases …
  • … except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
  • SequoiaDB’s products are open source.
  • SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
  • Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

  • SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
  • Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
    • SequoiaDB’s founders were Chinese nationals.
    • Most of them went back to China.
    • Other employees to date have been entirely Chinese.
    • Sales to date have been entirely in China, but SequoiaDB has international aspirations.
  • SequoiaDB has >100 employees, a large majority of whom are split fairly evenly between “engineering” and “implementation and technical support”.
  • SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
  • SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

SequoiaDB’s technology story starts:

  • SequoiaDB is a layered DBMS.
  • It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
  • Indexes are B-tree.
  • Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
    • There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
    • If the number of physical partitions changes, logical partitions are reassigned accordingly.
  • Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
  • Relational batch processing is done via SparkSQL.
  • There also is a block/LOB (Large OBject) storage engine meant for content management applications.
  • SequoiaCM boils down technically to:
    • SequoiaDB, which is used to store JSON metadata about the LOBs …
    • … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
    • A Java library focused on content management.

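To make that scheme concrete, here is a minimal Python sketch of the standard logical-partitioning approach. The partition count matches the figure above, but all names and code are illustrative rather than anything from SequoiaDB itself.

    import hashlib

    LOGICAL_PARTITIONS = 4096  # the figure SequoiaDB reportedly uses

    def logical_partition(shard_key: str) -> int:
        # Every record hashes deterministically to one logical partition.
        digest = hashlib.md5(shard_key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % LOGICAL_PARTITIONS

    def assign_to_nodes(num_nodes: int) -> dict:
        # Logical partitions are divided roughly evenly among physical nodes.
        return {lp: lp % num_nodes for lp in range(LOGICAL_PARTITIONS)}

    # Elasticity: growing from 5 to 6 nodes just reassigns logical partitions;
    # records move with their partitions and never need re-hashing.
    before, after = assign_to_nodes(5), assign_to_nodes(6)
    moved = sum(1 for lp in before if before[lp] != after[lp])
    print(f"{moved} of {LOGICAL_PARTITIONS} logical partitions change nodes")
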
SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

  • SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
  • Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers (see the sketch after this list). Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” :) — are handled by the JSON store.
  • PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

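A hedged sketch of what that Foreign Data Wrapper arrangement might look like follows. The wrapper and option names (sdb_fdw and friends) are hypothetical, invented for illustration; only the general FDW pattern is PostgreSQL-standard.

    # Hypothetical illustration of the PostgreSQL FDW pattern; the wrapper
    # name and its options are invented, not SequoiaDB's documented ones.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    cur = conn.cursor()

    # A foreign server points PostgreSQL at the external JSON store ...
    cur.execute("CREATE SERVER sdb_server FOREIGN DATA WRAPPER sdb_fdw "
                "OPTIONS (address 'localhost', service '11810')")

    # ... and a foreign table maps a relational schema onto a collection,
    # with each row stored as a separate JSON document underneath.
    cur.execute("""
        CREATE FOREIGN TABLE accounts (id int, balance numeric)
        SERVER sdb_server OPTIONS (collection 'accounts')
    """)

    # From here on, SQL parsing and optimization are PostgreSQL's job,
    # while actual reads, writes, locks and commits are delegated.
    cur.execute("SELECT id, balance FROM accounts WHERE balance > 1000")
    conn.commit()
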
PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

  • Content management via SequoiaCM.
  • “Operational data lakes”.
  • Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

  • 2-3 years of data, and not all the data even from that time period.
  • Only enough processing power to support structured business intelligence …
  • … and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

  • They hold as much relational data as customers choose to dump there.
  • That data can be simply copied from operational stores, with no transformation.
  • Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
  • Queries can be run straight against this data soup.
  • Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be established to allow for that, as sketched below.

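Here is a minimal sketch of such a view, again with invented table and column names: a UNION ALL over the pre- and post-change tables exposes just the columns that stayed stable.

    # Hedged sketch: a view papering over a schema change. All names invented.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    cur = conn.cursor()

    # orders_v1(id, amount) predates the schema change;
    # orders_v2(id, amount, channel) receives the data arriving afterward.
    cur.execute("""
        CREATE VIEW orders_all AS
            SELECT id, amount FROM orders_v1
            UNION ALL
            SELECT id, amount FROM orders_v2  -- new column simply omitted
    """)

    # Queries against the unchanged part of the structure need not know
    # that the underlying tables differ.
    cur.execute("SELECT count(*), sum(amount) FROM orders_all")
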
Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such as:

  • Photographs as part of an authentication process.
  • Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
  • Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Categories: Other

One bit of news in Trump’s speech

Tue, 2017-02-28 23:26

Donald Trump addressed Congress tonight. As may be seen from the transcript, his speech — while uncharacteristically sober — was largely vacuous.

That said, while Steve Bannon is firmly established as Trump’s puppet master, they don’t agree on quite everything, and one of their documented disagreements concerns skilled, entrepreneurial founder-type immigrants: Bannon opposes them; Trump doesn’t. As per the speech, Trump seems to be holding to that position.

At least, that seems implied by his call for “a merit-based immigration system.”

And by the way — Trump managed to give a whole speech without saying anything overtly racist. Indeed, he specifically decried the murder of an Indian-immigrant engineer. By Trump standards, that counts as a kind of progress.

Categories: Other

Coordination, the underused “C” word

Tue, 2017-02-28 22:34

I’d like to argue that a single frame can be used to view a lot of the issues that we think about. Specifically, I’m referring to coordination, which I think is a clearer way of characterizing much of what we commonly call communication or collaboration.

It’s easy to argue that computing, to an overwhelming extent, is really about communication. Most obviously:

  • Data is constantly moving around — across wide area networks, across local networks, within individual boxes, or even within particular chips.
  • Many major developments are almost purely about communication. The most important computing device today may be a telephone. The World Wide Web is essentially a publishing platform. Social media are huge. Etc.

Indeed, it’s reasonable to claim:

  • When technology creates new information, it’s either analytics or just raw measurement.
  • Everything else is just moving information around, and that’s communication.

A little less obvious is that much of this communication could alternatively be described as coordination. Some communication has pure consumer value, such as when we talk/email/Facebook/Snapchat/FaceTime with loved ones. But much of the rest is for the purpose of coordinating business or technical processes.

Among the technical categories that boil down to coordination are:

  • Operating systems.
  • Anything to do with distributed computing.
  • Anything to do with system or cluster management.
  • Anything that’s called “collaboration”.

That’s a lot of the value in “platform” IT right there. 

Meanwhile, in pre-internet apps:

  • Some of the early IT wins were in pure accounting and information management. But a lot of the rest were in various forms of coordination, such as logistics and inventory management.
  • The glory days of enterprise apps really started with SAP’s emphasis on “business processes”. (“Business process reengineering” was also a major buzzword back in the day.)

This also all fits with the “route” part of my claim that “historically, application software has existed mainly to record and route information.”

And in the internet era:

  • “Sharing economy” companies, led by Uber and Airbnb, have created a lot more shareholder value than the most successful pure IT startups of the era.
  • Amazon, in e-commerce and cloud computing alike, has run some of the biggest coordination projects of all.

This all ties into one of the key underlying subjects to modern politics and economics, namely the future of work.

  • Globalization is enabled by IT’s ability to coordinate far-flung enterprises.
  • Large enterprises need fewer full-time employees when individual or smaller-enterprise contractors are easier to coordinate. (It’s been 30 years since I drew a paycheck from a company I didn’t own.)
  • And of course, many white collar jobs are being entirely automated away, especially those that can be stereotyped as “paper shuffling”.

By now, I hope it’s clear that “coordination” covers a whole lot of IT. So why do I think using a term with such broad application adds any clarity? I’ve already given some examples above, in that:

  • “Coordination” seems clearer than “communication” when characterizing the essence of distributed computing.
  • “Coordination” seems clearer than “communication” if we’re discussing the functioning of large enterprises or of large-enterprise-substitutes.

Further — even when we focus on the analytic realm, the emphasis on “coordination” has value. A big part of analytic value comes in determining when to do something. Specifically that arises when:

  • Analytics identifies a problem that just occurred, or is about to happen, allowing a timely fix.
  • Business intelligence is used for monitoring, of impending problems or otherwise, as a guide to when action is needed.
  • Logistics of any kind get optimized.

I’d also say that most recommendation/personalization fits into the “coordination” area, but that’s a bit more of a stretch; you’re welcome to disagree.

I do not claim that analytics’ value can be wholly captured by the “coordination” theme. Decisions about whether to do something major — or about what to do — are typically made by small numbers of people; they turn into major coordination exercises only after a project gets its green light. But such cases, while important, are pretty rare. For the most part, analytic results serve as inputs to business processes. And business processes, on the whole, typically have a lot to do with coordination.

Bottom line: Most of what’s valuable in IT relates to communication or coordination. Apparent counterexamples should be viewed with caution.

Categories: Other

There’s no escape from politics now

Wed, 2017-02-01 23:31

The United States and consequently much of the world are in political uproar. Much of that is about very general and vital issues such as war, peace or the treatment of women. But quite a lot of it is to some extent tech-industry-specific. The purpose of this post is to outline how and why that is.

For example:

  • There’s a worldwide backlash against “elites” — and tech industry folks are perceived as members of those elites.
  • That perception contains a lot of truth, and not just in terms of culture/education/geography. Indeed, it may even be a bit understated, because trends commonly blamed on “trade” or “globalization” often have their roots in technological advances.
  • There’s a worldwide trend towards authoritarianism. Surveillance/privacy and censorship issues are strongly relevant to that trend.
  • Social media companies are up to their neck in political considerations.

Because they involve grave threats to liberty, I see surveillance/privacy as the biggest technology-specific policy issues in the United States. (In other countries, technology-driven censorship might loom larger yet.) My views on privacy and surveillance have long been:

  • Fixing the legal frameworks around information use is a difficult and necessary job. The tech community should be helping more than it is.
  • Until those legal frameworks are indeed cleaned up, the only responsible alternative is to foot-drag on data collection, on data retention, and on the provision of data to governmental agencies.

Given the recent election of a US president with strong authoritarian tendencies, that foot-dragging is much more important than it was before.

Other important areas of technology/policy overlap include:

  • The new head of the Federal Communications Commission is hostile to network neutrality. (Perhaps my compromise proposal for partial, market-based network neutrality should get another look some day.)
  • There’s a small silver lining in Trump’s attacks on free trade; the now-abandoned (at least by the US) Trans-Pacific Partnership had gone too far on “intellectual property” rights.
  • I’m a skeptic about software patents.
  • Government technology procurement processes have long been broken.
  • “Sharing economy” companies such as Uber and Airbnb face a ton of challenges in politics and regulation, often on a very local basis.

And just over the past few days, the technology industry has united in opposing the Trump/Bannon restrictions on valuable foreign visitors.

Tech in the wider world

Technology generally has a huge impact on the world. One political/economic way of viewing that is:

  • For a couple of centuries, technological advancement has:
    • Destroyed certain jobs.
    • Replaced them directly with a smaller number of better jobs.
    • Increased overall wealth, which hopefully leads to more, better jobs in total.
  • Over a similar period, improvements in transportation technology have moved work opportunities from richer countries to poorer areas (countries or colonies as the case may be). This started in farming and extraction, later expanded to manufacturing, and now includes “knowledge workers” as well.
  • Both of these trends are very strong in the current computer/internet era.
  • Many working- and middle-class people in richer countries now feel that these trends are leaving them worse off.
    • To some extent, they’re confusing correlation and causality. (The post-WW2 economic boom would have slowed no matter what.)
    • To some extent, they’re ignoring the benefits of technology in their day to day lives. (I groan when people get on the internet to proclaim that technology is something bad.)
    • To some extent, however, they are correct.

Further, technology is affecting how people relate to each other, in multiple ways.

  • This is obviously the case with respect to cell phones and social media.
  • Also, changes to the nature of work naturally lead to changes in the communities where the workers live.

For those of us with hermit-like tendencies or niche interests, that may all be a net positive. But others view these changes less favorably.

Summing up: Technology induces societal changes of such magnitudes as to naturally cause (negative) political reactions.

And in case you thought I was exaggerating the political threat to the tech industry …

… please consider the following quotes from Trump’s most powerful advisor, Steve Bannon:

The “progressive plutocrats in Silicon Valley,” Bannon said, want unlimited ability to go around the world and bring people back to the United States. “Engineering schools,” Bannon said, “are all full of people from South Asia, and East Asia. . . . They’ve come in here to take these jobs.” …

“Don’t we have a problem with legal immigration?” asked Bannon repeatedly.

“Twenty percent of this country is immigrants. Is that not the beating heart of this problem?”

Related links

I plan to keep updating the list of links at the bottom of my post Politics and policy in the age of Trump.

Categories: Other

Politics and policy in the age of Trump

Wed, 2017-02-01 23:28

The United States presidency was recently assumed by an Orwellian lunatic.* Sadly, this is not an exaggeration. The dangers — both of authoritarianism and of general mis-governance — are massive. Everybody needs in some way to respond.

*”Orwellian lunatic” is by no means an oxymoron. Indeed, many of the most successful tyrants in modern history have been delusional; notable examples include Hitler, Stalin, Mao and, more recently, Erdogan. (By way of contrast, I view most other Soviet/Russian leaders and most jumped-up-colonel coup leaders as having been basically sane.)

There are many candidates for what to focus on, including:

  • Technology-specific issues — e.g. privacy/surveillance, network neutrality, etc.
  • Issues in which technology plays a large role — e.g. economic changes that affect many people’s employment possibilities.
  • Subjects that may not be tech-specific, but are certainly of great importance. The list of candidates here is almost endless, such as health care, denigration of women, maltreatment of immigrants, or the possible breakdown of the whole international order.

But please don’t just go on with your life and leave the politics to others. Those “others” you’d like to rely on haven’t been doing a very good job.

What I’ve chosen to do personally includes:

  • Get and stay current in my own knowledge. That’s of course a prerequisite for everything else.
  • Raise consciousness among my traditional audience. This post is an example. :)
  • Educate my traditional audience. Some of you are American, well-versed in history and traditional civics. Some of you are American, but not so well-versed. Some of you are from a broad variety of other countries. The sweet spot of my target is the smart, rational, not-so-well-versed Americans. But I hope others are interested as well.
  • Prepare for such time as nuanced policy analysis is again appropriate. In the past, I’ve tried to make thoughtful, balanced, compromise suggestions for handling thorny issues such as privacy/surveillance or network neutrality. In this time of crisis, people don’t care, and I don’t blame them at all. But hopefully this ill wind will pass, and serious policy-making will restart. When it does, we should be ready for it.
  • Support my family in whatever they choose to do. It’s a small family, but it includes some stars, more articulate and/or politically experienced than I am.

Your choices will surely differ (and later on I will offer suggestions as to what those choices might be). But if you take only one thing from this post and its hopefully many sequels, please take this: Ignoring politics is no longer a rational choice.

Related links

This is my first politics/policy-related post since the start of the Trump (or Trump/Bannon) Administration. I’ll keep a running guide to others here, and in the comments below.

  • The technology industry in particular is now up to its neck in politics. I gave quite a few examples to show why for tech folks there’s no escaping politics now.
  • Some former congressional staffers put out a great guide to influencing your legislators. It’s focused on social justice and anti-discrimination kinds of issues, but can probably be applied more broadly, e.g. to Senator Feinstein’s (D-Cal) involvement in overseeing the intelligence community.

Categories: Other

Introduction to Crate.io and CrateDB

Sat, 2016-12-17 23:27

Crate.io and CrateDB basics include:

  • Crate.io makes CrateDB.
  • CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
  • CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
  • Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
  • Crate.io says it has 22 employees and 5 paying customers.
  • Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.

In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.

CrateDB’s not-just-relational story starts:

  • A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
  • … where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
  • … except when they’re just BLOBs (Binary Large OBjects).
  • There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
  • There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.

Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the WHERE clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.

One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:

  • Writes and single-row look-ups.
  • Aggregates and joins.

And so it makes sense that:

  • Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW (Read Your Writes) consistency.
  • The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.

CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.

CrateDB technical highlights include:

  • CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
    • In the purely relational case, the documents may be regarded as glorified text strings.
    • I got the impression that BLOB storage was somewhat separate from the rest.
  • CrateDB’s sharding story starts with consistent hashing.
    • Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
    • However, you can change your shard count, and any future inserts will go into the new set of shards.
  • In line with its two consistency models, CrateDB also has two indexing strategies.
    • Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
    • Tables also have a columnar index.
      • More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
      • CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing (see the sketch after this list).
      • Specific datatypes — e.g. geospatial — can be indexed in different ways.
    • The columnar index is shard-specific, and located at the same node as the shard.
    • At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
  • While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
    • Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
    • Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
    • Data can be read from all replicas, for obvious reasons of performance.
  • Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
  • The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.

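Since inverted-list indexing may be unfamiliar outside of text search, here is a tiny Python sketch of the idea: each distinct value in a column maps to the list of row ids containing it, just as a text index maps terms to documents. The data and names are invented for illustration.

    from collections import defaultdict

    rows = [
        {"sensor": "temp", "value": 21},
        {"sensor": "humidity", "value": 55},
        {"sensor": "temp", "value": 23},
    ]

    # One inverted list per column: distinct value -> ids of rows holding it.
    index = defaultdict(lambda: defaultdict(list))
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            index[column][value].append(row_id)

    # An aggregation such as count-by-sensor can be answered straight from
    # the index, without touching the underlying rows.
    counts = {v: len(ids) for v, ids in index["sensor"].items()}
    print(counts)  # {'temp': 2, 'humidity': 1}
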
Crate.io is proud of its distributed/parallel story.

  • Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
  • Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
  • Crate.io encourages you to run Spark and CrateDB on the same nodes.
    • This is supported by parallel Spark-CrateDB integration of the obvious kind.
    • Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.

The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.

Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:

  • A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
  • The only join strategy currently implemented is nested loop. Others are in the future.
  • CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
  • Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
  • I imagine CrateDB administrative tools are still rather primitive.

In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.

Categories: Other

DBAs of the future

Wed, 2016-11-23 06:02

After a July visit to DataStax, I wrote:

The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

  • Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
  • Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.

  • First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
  • Second — and more narrowly :) — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.

My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database (see the sketch after this list), where views:

  • Are logical rather than materialized. (At least at this time.)
  • Have their permissions and so on set by the DBA.
  • Are the sole thing the programmer writes against.

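A minimal sketch of that workflow, using the pymongo driver; the collection, field and view names are invented, and views of this kind require MongoDB 3.4 or later.

    from pymongo import MongoClient

    db = MongoClient()["hr"]

    # The DBA defines a logical (non-materialized) view that hides salary
    # data and reshapes the raw collection.
    db.create_collection(
        "employees_public",
        viewOn="employees",
        pipeline=[{"$project": {"name": 1, "department": 1, "_id": 0}}],
    )

    # Programmers then query the view like any read-only collection.
    for doc in db["employees_public"].find({"department": "engineering"}):
        print(doc)
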
Besides the obvious benefits in development ease and security, MongoDB says that performance can be better as well.* This is of course a new feature, without a lot of adoption at this time. Even so, it seems likely that NoSQL doesn’t obsolete any part of the traditional DBA role.

*I didn’t actually ask what a naive programmer can do to trash performance that views can forestall, but … well, I was once a naive programmer myself. :)

Two trends that I think could make DBAs’ lives even more interesting and challenging in the future are:

  • The integration of quick data management into complex analytic processes. Here by “quick data management” I mean, for example, what you do in connection with a complex Hadoop or Spark (set of) job(s). Leaving the data management to a combination of magic and Python scripts doesn’t seem to respect how central data operations are to analytic tasks.
  • The integration of data management and streaming. I should probably write about this point separately, but in any case — it seems that streaming stacks will increasingly look like over-caffeinated DBMS.

Bottom line: Database administration skills will be needed for a long time to come.

Categories: Other

MongoDB 3.4 and “multimodel” query

Wed, 2016-11-23 06:01

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:

  • A query layer with multiple ways to query and analyze data.
  • A separate data storage layer in which you have a choice of data storage engines …
  • … each of which has the same logical (JSON-based) data structure.

When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.

To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:

  • In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
  • When writing, the DML paradigm is unmixed — it’s just JSON.

Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.

The main ways to query data in MongoDB, to my knowledge, are:

  • Native/JSON. Duh.
  • SQL.
    • MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
    • More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
    • I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
  • Search.
    • MongoDB has been adding text search features for a few releases.
    • MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes it as a kind of text-oriented GroupBy.
  • Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child); see the sketch after this list. MongoDB declares that the “graph” box is thereby checked. :)

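The operator behind that capability is $graphLookup, which walks a self-referential edge recursively. A hedged example via pymongo, with invented collection and field names:

    from pymongo import MongoClient

    db = MongoClient()["org"]

    # Each employee document names its direct manager; $graphLookup follows
    # that parent/child edge recursively to collect the whole chain.
    pipeline = [
        {"$graphLookup": {
            "from": "employees",
            "startWith": "$manager",       # seed of the traversal
            "connectFromField": "manager",
            "connectToField": "name",
            "as": "reporting_chain",       # all ancestors, not just the parent
        }}
    ]
    for doc in db["employees"].aggregate(pipeline):
        print(doc["name"], [m["name"] for m in doc["reporting_chain"]])
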
Three years ago, in an overview of layered and multi-DML architectures, I suggested:

  • Layered DBMS and multimodel functionality fit well together.
  • Both carried performance costs.
  • In most cases, the costs could be affordable.

MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.

Categories: Other
