11 open source tools for making the most of machine learning
Spam filtering, face recognition, recommendation
engines -- when you have a large data set on which you’d like to perform
predictive analysis or pattern recognition, machine learning
is the way to go. This science, in which computers are trained to learn
from, analyze, and act on data without being explicitly programmed, has
surged in interest of late outside of its original cloister of academic
and high-end programming circles.
This rise in popularity is due
not only to hardware growing cheaper and more powerful, but also to the
proliferation of free software that makes machine learning easier to
implement both on single machines and at scale. The diversity of machine
learning libraries means there’s likely to be an option available
regardless of what language or environment you prefer.
These 11
machine learning tools provide functionality for individual apps or
whole frameworks, such as Hadoop. Some are more polyglot than others:
Scikit-learn, for instance, is exclusively for Python, while Shogun sports
interfaces to many languages, from general-purpose to domain-specific.
Scikit-learn
Python has become a go-to
programming language for math, science, and statistics due to its ease
of adoption and the breadth of libraries available for nearly any
application. Scikit-learn leverages this breadth by building on top of
several existing Python packages -- NumPy, SciPy, and matplotlib -- for
math and science work. The resulting libraries can be used either for
interactive “workbench” applications or be embedded into other software
and reused. The kit is available under a BSD license, so it’s fully open
and reusable.
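As a rough illustration of the "workbench" style of use, here is a minimal sketch that trains and scores a classifier on a data set bundled with the library. It assumes a reasonably recent scikit-learn release is installed; the choice of the iris data, a k-nearest-neighbors model, and the specific parameter values are ours for illustration, not something the article prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small bundled data set: 150 iris flowers, 4 measurements each.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a k-nearest-neighbors classifier (k=5 is an arbitrary choice).
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# score() returns the mean accuracy on the held-out samples.
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The same fit-then-score pattern applies to most scikit-learn estimators, which is much of what makes the library easy to embed in other software.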
Shogun
Among the oldest, most venerable of
machine learning libraries, Shogun was created in 1999 and written in
C++, but isn’t limited to working in C++. Thanks to the SWIG library, Shogun can be used transparently in such languages and environments as Java, Python, C#, Ruby, R, Lua, Octave, and Matlab.
Though venerable, Shogun has competition. Another C++-based machine learning library, Mlpack, has been around only since 2011, although it professes to be faster and easier to work with (by way of a more integral API set) than competing libraries.
Accord Framework/AForge.net
Accord, a machine learning and signal processing framework for .Net, is an extension of a previous project in the same vein, AForge.net.
“Signal processing,” by the way, refers here to a range of machine
learning algorithms for images and audio, such as for seamlessly
stitching together images or performing face detection. A set of
algorithms for vision processing is included; it operates on image
streams (such as video) and can be used to implement such functions as
the tracking of moving objects. Accord also includes libraries that
provide a more conventional gamut of machine learning functions, from
neural networks to decision-tree systems.
Mahout
The Mahout framework has long been
tied to Hadoop, but many of the algorithms under its umbrella can also
run as-is outside Hadoop. They're useful for stand-alone applications
that might eventually be migrated into Hadoop or for Hadoop projects
that could be spun off into their own stand-alone applications.
One downside of Mahout: Few of its algorithms currently support the high-performance Spark
framework for Hadoop, and instead use the legacy (and increasingly
obsolete) MapReduce framework. The project no longer accepts
MapReduce-based algorithms, but those looking for a more performant and
future-proof library will want to look into MLlib instead.
MLlib
Apache’s own machine learning library for Spark and Hadoop, MLlib boasts a gamut of common algorithms and useful data types,
designed to run at speed and scale. As you’d expect with any Hadoop
project, Java is the primary language for working in MLlib, but Python
users can connect MLlib with the NumPy library (also used in
scikit-learn), and Scala users can write code against MLlib. If setting
up a Hadoop cluster is impractical, MLlib can be deployed on top of
Spark without Hadoop -- and in EC2 or on Mesos.
Another project, MLbase, builds on top of MLlib to make it easier to derive results. Rather than write code, users make queries by way of a declarative language à la SQL.
H2O
The algorithms in 0xdata’s H2O are geared
for business processes -- fraud or trend predictions, for instance --
rather than, say, image analysis. H2O can interact in a stand-alone
fashion with HDFS stores, on top of YARN, in MapReduce, or directly in
an Amazon EC2 instance. Hadoop mavens can use Java to interact with H2O,
but the framework also provides bindings for Python, R, and Scala,
providing cross-interaction with all the libraries available on those
platforms as well.
Cloudera Oryx
Yet another machine learning project designed for Hadoop, Oryx comes courtesy of the creators of the Cloudera Hadoop distribution. The name on the label isn’t the only detail that sets Oryx apart: Per Cloudera’s emphasis on analyzing live streaming data
by way of the Spark project, Oryx is designed to allow machine learning
models to be deployed on real-time streamed data, enabling projects
like real-time spam filters or recommendation engines.
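To make the idea of a model that keeps learning from a stream more concrete, here is a toy sketch in plain Python. It is not Oryx's API and makes no claim about how Oryx works internally; the class name, the tiny hand-written message stream, and the use of naive Bayes with add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class StreamingSpamFilter:
    """Toy naive Bayes classifier that is updated one message at a time."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()

    def update(self, text, label):
        # Fold one newly labeled message into the running counts.
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        # Score each label with log-probabilities and add-one smoothing.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        best_label, best_score = None, float("-inf")
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Smoothed prior so a label with no messages never yields log(0).
            score = math.log(
                (self.message_counts[label] + 1) / (total_messages + 2)
            )
            for word in text.lower().split():
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocab) + 1)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical stream of labeled messages arriving one at a time.
stream = [
    ("win cash prize now", "spam"),
    ("meeting moved to friday", "ham"),
    ("claim your free prize", "spam"),
    ("lunch on friday", "ham"),
]

spam_filter = StreamingSpamFilter()
for text, label in stream:
    spam_filter.update(text, label)

print(spam_filter.predict("free cash prize"))
print(spam_filter.predict("friday meeting"))
```

Because each update only increments counters, the model can answer queries at any point in the stream without being retrained from scratch, which is the property a real-time spam filter needs.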
An all-new
version of the project, tentatively titled Oryx 2, is in the works. It
uses Apache projects like Spark and Kafka for better performance, and
its components are built along more loosely coupled lines for further
future-proofing.
GoLearn
Google’s Go language has been in the wild for only five years, but has started to enjoy wider use,
due to a growing collection of libraries. GoLearn was created to
address the lack of an all-in-one machine learning library for Go; the
goal is “simplicity paired with customizability,” according to developer
Stephen Whitworth. The simplicity comes from the way data is loaded and
handled in the library, since it’s patterned after SciPy and R. The
customizability lies in both the library’s open source nature (it’s
MIT-licensed) and in how some of the data structures can be easily
extended in an application. Whitworth has also created a Go wrapper for the Vowpal Wabbit library, one of the libraries found in the Shogun toolbox.
Weka
Weka, a product of the University of
Waikato, New Zealand, collects a set of Java machine learning
algorithms engineered specifically for data mining. This GNU
GPLv3-licensed collection has a package system to extend its
functionality, with both official and unofficial packages available. Weka even comes with a book
to explain both the software and the techniques used, so those looking
to get a leg up on both the concepts and the software may want to start
there.
While Weka isn’t aimed specifically at Hadoop users, it can
be used with Hadoop thanks to a set of wrappers produced for the most
recent versions of Weka. Note that it doesn’t yet support Spark, only
MapReduce. Clojure users can also leverage Weka, thanks to the Clj-ml library.
CUDA-Convnet
By now most everyone knows how GPUs
can crunch certain problems faster than CPUs. But applications don’t
automatically take advantage of GPU acceleration; they have to be
specifically written to do so. CUDA-Convnet is a machine learning
library for neural-network applications, written in C++ to exploit
Nvidia’s CUDA GPU processing technology (CUDA boards of at least the
Fermi generation are required). For those using Python rather than C++,
the resulting neural nets can be saved as Python pickled objects and
thus accessed from Python.
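The pickle round trip itself uses only Python's standard library. The sketch below shows the general mechanism with a stand-in object; the dictionary layout, layer names, and file name are hypothetical, since the actual structure of a CUDA-Convnet checkpoint is not described here.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a saved network: layer names mapped to weights.
# The real layout of a CUDA-Convnet checkpoint is not shown here.
saved_net = {
    "layers": ["conv1", "pool1", "fc10"],
    "weights": {"conv1": [0.12, -0.40], "fc10": [0.05, 0.91]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "net.pkl"  # hypothetical file name

    # Write the object to disk, as a training run might do.
    with checkpoint.open("wb") as handle:
        pickle.dump(saved_net, handle)

    # Read it back in a Python session for inspection.
    with checkpoint.open("rb") as handle:
        restored = pickle.load(handle)

print(restored["layers"])
```

Unpickling can execute arbitrary code, so only load pickled files that come from a source you trust.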
Note that the original version of the
project is no longer being developed, but has since been reworked into a
successor, CUDA-Convnet2, with support for multiple GPUs and
Kepler-generation GPUs. A similar project, Vulpes, has been written in F# and works with the .Net framework generally.
ConvNetJS
As the name implies, ConvNetJS
provides neural network machine learning libraries for use in
JavaScript, facilitating use of the browser as a data workbench. An NPM
version is also available for those using Node.js, and the library is
designed to make proper use of JavaScript’s asynchronicity -- for
example, training operations can be given a callback to execute once
they complete. Plenty of demo examples are included, too.
Source: http://www.infoworld.com