In 2014, I completed a big data security architecture project for a global payments processor. The task was to analyse the security posture of a wide variety of technologies vying to be a part of the "big data" revolution: Hadoop, Talend, Spark, Hue, RabbitMQ, Ambari and Cassandra, to name but a few.
I am sure much has changed since my initial work on the project, but I can share some of the insights from my analysis.
The default security of most products was focused on getting administrators up and running with "big data" as quickly as possible, which often equated to an insecure initial configuration. Getting these products running securely was an arduous task; it usually meant getting your hands dirty with numerous XML and application configuration files.
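As an illustration of the kind of hand-editing involved, hardening often came down to flipping individual properties in Hadoop's XML files. A representative fragment (property names as they existed in Hadoop's 2.x-era documentation; treat this as a sketch rather than a complete hardening guide):

```xml
<!-- hdfs-site.xml: wire encryption for DataNode transfers is off by default -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>

<!-- core-site.xml: protect RPC traffic (authentication only, unless raised) -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
```

Each service (HDFS, YARN, Hive, and so on) had its own files and its own property names, so the work multiplied quickly across a cluster.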
This probably makes sense operationally: most admins were new to big data concepts, and over-complicated build tasks would have created confusion. From a security perspective, however, it is a data owner's nightmare.
It was also a tell-tale sign that many products were rushing to market without adopting standards that are ubiquitous in secure enterprise applications.
A number of pretty basic security steps were omitted from many applications: only outdated authentication methods on offer, SSL/TLS disabled by default, insecure cipher suites in use, sparse support for Kerberos/SPNEGO, and some products didn't even have simple role-based access control or authentication.
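To make the TLS point concrete, here is a minimal sketch of the kind of hardened client context you would want these products to use out of the box, written with Python's standard `ssl` module (the function name is my own; the cipher string assumes a modern OpenSSL build):

```python
import ssl

def hardened_client_context() -> ssl.SSLContext:
    """Build a TLS client context that rejects the weak defaults
    described above: old protocol versions and weak cipher suites."""
    ctx = ssl.create_default_context()
    # Refuse SSLv3 / TLS 1.0 / TLS 1.1 outright.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Restrict to forward-secret AEAD suites; this excludes RC4, 3DES,
    # NULL and export-grade ciphers that some products still enabled.
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    # Verify the server certificate and hostname, not just the handshake.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx
```

The insecure defaults I encountered were, in effect, the opposite of every line in this function.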
Many of the "in-memory" products had no encryption when enabling persistent data storage features. A fundamental problem was that most encryption products on the market at the time could only secure relational databases; key-value stores and NoSQL databases had few encryption options. I know that Hadoop, for example, now offers at-rest encryption, so things have clearly improved in that area.
Hadoop had a secure mode but, with many features in active development and no standardisation, using some of them required you to disable secure mode altogether. Again, I'm sure much has changed since 2014 in this regard.
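For context, enabling Hadoop's secure mode itself came down to more XML editing. The core switch looked roughly like this (the property names are from Hadoop's own security documentation; the surrounding Kerberos/keytab setup it requires is omitted here):

```xml
<!-- core-site.xml: the out-of-the-box value is "simple", i.e. no real
     authentication; secure mode means switching to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

Flipping these properties was the easy part; provisioning principals and keytabs for every daemon, and keeping ecosystem tools working afterwards, was where the pain lay.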
In summary, I would advise owners of sensitive data to be extremely careful in this domain. It is still maturing, and that process always takes a great deal of time. If you do operate such an environment with sensitive data, be ready to completely overhaul your risk register.