Automated Machine Learning and Data Science [AMLDS] - overview

Towards Democratizing Data Science and Machine Learning

From automated data discovery, composition, and preparation to rapid automated model construction, deployment, and maintenance.

We conduct research with a view to helping data scientists and machine learning practitioners. We apply techniques from Artificial Intelligence (AI), Machine Learning (ML), and data management to accelerate and optimize the creation of machine learning and data science workflows.

Practitioners typically perform a number of tedious and time-consuming steps to derive insight from raw data. The process usually starts with data ingestion, cleaning, and transformation (e.g., outlier removal, missing value imputation), continues with model building, and ends with a presentation of predictions that align with the end users' objectives and preferences. It is a long and complex process that requires substantial time, skill, and effort, especially because of the combinatorial explosion in the choices of algorithms and platforms, their parameters, and their compositions. Our tools automate various steps in this process. They accelerate time-to-delivery of data products and machine learning insights, expand the reach of data science to non-experts, and offer a more systematic exploration of the available options.
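To give a rough sense of the combinatorial explosion described above, the following minimal sketch counts the pipelines obtained by picking one option per stage. The stage names and option lists are hypothetical illustrations, not an inventory of what our tools actually search over.

```python
from itertools import product

# Hypothetical options per pipeline stage (illustrative assumptions only).
SEARCH_SPACE = {
    "imputation": ["mean", "median", "most_frequent"],
    "outlier_removal": ["none", "iqr", "z_score"],
    "scaling": ["none", "standard", "min_max"],
    "model": ["logistic_regression", "random_forest", "gradient_boosting", "svm"],
    "regularization": ["weak", "medium", "strong"],
}


def count_pipelines(space):
    """Return the number of distinct pipelines: the product of options per stage."""
    total = 1
    for options in space.values():
        total *= len(options)
    return total


def enumerate_pipelines(space):
    """Yield every pipeline as a dict mapping stage name to the chosen option."""
    stages = list(space)
    for combo in product(*(space[stage] for stage in stages)):
        yield dict(zip(stages, combo))


if __name__ == "__main__":
    print(count_pipelines(SEARCH_SPACE))  # 3 * 3 * 3 * 4 * 3 = 324
```

Even this toy space, with five stages and at most four options each, contains 324 candidate pipelines. Adding one more stage or a continuous hyperparameter multiplies the count again, which is why exhaustive manual exploration quickly becomes impractical.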

Our research leverages approaches from various areas of AI and ML and applies the philosophy of 'Learning to Learn'. Here are some examples:

  • Reinforcement learning to learn a policy, across data sets, for applying sequences of data transformations that improve model performance (see here);

  • One Button Machine: automation of feature engineering for relational databases (see here);

  • A bandit-based approach to analytic pipeline selection (see here);

  • Automated interactive outlier detection using mixed integer programming and matrix factorization;

  • Visual and interactive recommendation of interesting columns and column relationships in the data;

  • An extension of random forests to work with arbitrary sets of data objects (see here);

  • Black-box optimization to build neural network structures (see here).
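To make the bandit idea from the list above more concrete, here is a minimal sketch that treats each candidate pipeline as an arm and spends a fixed evaluation budget using the textbook UCB1 rule. This is a generic illustration, not the specific method from our work: the pipeline names, the success rates in TRUE_ACCURACY, and the simulated_evaluation function are all hypothetical stand-ins for actually training and validating a pipeline on a data sample.

```python
import math
import random

# Hypothetical pipelines with made-up success rates, used only to simulate
# noisy evaluations; a real system would train and validate each pipeline.
TRUE_ACCURACY = {"pipeline_a": 0.60, "pipeline_b": 0.75, "pipeline_c": 0.90}


def simulated_evaluation(name, rng):
    """Return a noisy reward (0.0 or 1.0) for one evaluation of a pipeline."""
    return 1.0 if rng.random() < TRUE_ACCURACY[name] else 0.0


def ucb1_select(pipelines, evaluate, budget, seed=0):
    """Spend `budget` evaluations with UCB1.

    Returns the pipeline with the best empirical mean reward and a dict of
    how many evaluations each pipeline received.
    """
    pipelines = list(pipelines)
    if not pipelines or budget < len(pipelines):
        raise ValueError("budget must allow at least one evaluation per pipeline")

    rng = random.Random(seed)
    counts = {p: 0 for p in pipelines}
    totals = {p: 0.0 for p in pipelines}

    def ucb_index(p, t):
        # Empirical mean plus an exploration bonus that shrinks with more pulls.
        return totals[p] / counts[p] + math.sqrt(2.0 * math.log(t) / counts[p])

    for t in range(1, budget + 1):
        untried = [p for p in pipelines if counts[p] == 0]
        if untried:
            choice = untried[0]  # evaluate every pipeline once first
        else:
            choice = max(pipelines, key=lambda p: ucb_index(p, t))
        totals[choice] += evaluate(choice, rng)
        counts[choice] += 1

    best = max(pipelines, key=lambda p: totals[p] / counts[p])
    return best, counts


if __name__ == "__main__":
    best, counts = ucb1_select(TRUE_ACCURACY, simulated_evaluation, budget=300)
    print(best, counts)
```

The point of the design is that evaluations are not spread evenly: pipelines that look promising receive more of the budget, while weak ones are still revisited occasionally in case their early results were unlucky.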


As part of our research efforts, we are building a system that integrates the various tools seamlessly and is used by actual practitioners. The cognitive assistant for data scientists helps practitioners with a variety of tasks in their workflow, such as automated model selection and feature engineering, and supports analytics across various platforms (e.g., R, Spark, Python scikit-learn, TensorFlow, XGBoost).

The video below illustrates the operations of the current system (add link to video):

Our technology is part of the IBM Data Science Experience and IBM Watson Machine Learning.

As an example, consider this Jupyter notebook.

Finally, here is a recent blog post on one of our novel instant feature engineering approaches.

