LabBook: Accelerating Data Science Through Metadata - overview
LabBook is an open, social, and collaborative data analytics platform designed explicitly to support collaboration and accelerate discovery in data science. Its goal is to help data scientists, domain experts, and decision makers leverage 'tribal knowledge' to quickly find relevant data, collaborate with others, and reuse analytic work within a community, while providing a seamless and transparent provenance mechanism across users and systems.
A core element of LabBook is a metadata graph, which captures the interactions between data, people, and analytics. Simply put, metadata about how people work together and use data and tools can significantly help bring together the right data, people, and tools to solve complex analytic problems. LabBook collects and utilizes (a) schematic metadata (i.e., how data is structured in files, tables, and columns), (b) semantic metadata (e.g., the meaning of data attributes and values, and relationships between datasets), (c) social metadata (e.g., who collaborates with whom on which data, using which tools, and the thoughts, hypotheses, and decisions people record as they work), and (d) usage metadata (e.g., how data is used and which applications are executed). When combined, such metadata has the potential to accurately and holistically represent all data science work, across people, tools, and systems.
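As a rough illustration of how the four kinds of metadata could live in one graph, the following minimal Python sketch tags every edge with its metadata kind. The class names (MetadataKind, Edge, MetadataGraph), the edge labels, and the sample column and concept names are hypothetical and chosen only for this example; they are not LabBook's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class MetadataKind(Enum):
    """The four kinds of metadata named in the text."""
    SCHEMATIC = "schematic"
    SEMANTIC = "semantic"
    SOCIAL = "social"
    USAGE = "usage"


@dataclass(frozen=True)
class Edge:
    """A labeled, directed relationship between two entities (hypothetical model)."""
    source: str
    label: str
    target: str
    kind: MetadataKind


class MetadataGraph:
    """Stores outgoing edges per entity; every edge carries its metadata kind."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, source, label, target, kind):
        edge = Edge(source, label, target, kind)
        self._out[source].append(edge)
        return edge

    def edges_from(self, source, kind=None):
        """Return edges leaving `source`, optionally filtered by metadata kind."""
        return [e for e in self._out.get(source, [])
                if kind is None or e.kind == kind]


# Illustrative data only; the column, concept, and usage edge are invented.
graph = MetadataGraph()
graph.add("dataset:CA-Ozone", "has_column", "column:ozone_ppm", MetadataKind.SCHEMATIC)
graph.add("column:ozone_ppm", "means", "concept:ozone concentration", MetadataKind.SEMANTIC)
graph.add("user:Mary", "collaborates_with", "user:Peter", MetadataKind.SOCIAL)
graph.add("user:Peter", "used", "dataset:CA-Ozone", MetadataKind.USAGE)

for edge in graph.edges_from("user:Mary", MetadataKind.SOCIAL):
    print(edge.source, edge.label, edge.target)
```

Keeping all four kinds in a single structure is what would allow one query to cross from people to data to tools; a production system would presumably use a graph database rather than in-memory lists.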
The LabBook user interface design delivers a social, conversational user experience, where each user has a homepage, can create communities, can follow entities such as people and datasets, and can explore data in an ad hoc and agile manner. A user’s homepage lists her communities, recent notebooks, frequently used apps, datasets, and documents, as well as recommendations of people, data, notebooks, and applications based on her activity. At the center of the homepage, other activities relevant to the user are summarized, including notes mentioning the user, updates to notebooks of which she is an author, new notebooks added to her communities, and new releases of datasets she is following. The activity stream gives a quick overview of recent activity, but also provides an opportunity to make quick responses to urgent notes.
Notebooks act as a digital version of a scientist’s lab notebook. They contain free-form notes about work being conducted, such as thoughts regarding analysis of a particular dataset, and artifacts that would facilitate such analysis, such as visualizations and models. Notebooks can be public, private, or shared, allowing information to be compartmentalized if necessary. By capturing the exchange of ideas, knowledge, and expertise in notebooks, LabBook facilitates collaboration among a community of individuals. Apps can also be invoked from notes in the context of a notebook, capturing the input, output, and status of execution.
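To make the idea of capturing the input, output, and status of an app run concrete, here is a small Python sketch of a note that records every invocation made from it. The names Note, AppRun, RunStatus, and invoke are assumptions for illustration; the text does not describe LabBook's real data model at this level.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, List, Optional


class RunStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class AppRun:
    """Record of one app invocation: what went in, what came out, how it ended."""
    app_name: str
    inputs: Dict[str, Any]
    status: RunStatus
    output: Any = None
    error: Optional[str] = None


@dataclass
class Note:
    """A free-form note that also keeps the runs invoked from it (hypothetical)."""
    author: str
    text: str
    runs: List[AppRun] = field(default_factory=list)

    def invoke(self, app_name: str, app: Callable[..., Any], **inputs: Any) -> AppRun:
        # Run the app and record the outcome either way, so the note keeps
        # a provenance trail even for failed executions.
        try:
            run = AppRun(app_name, inputs, RunStatus.SUCCEEDED, output=app(**inputs))
        except Exception as exc:
            run = AppRun(app_name, inputs, RunStatus.FAILED, error=str(exc))
        self.runs.append(run)
        return run


def mean(values):
    return sum(values) / len(values)


note = Note(author="Mary", text="Check the mean ozone reading")
ok = note.invoke("mean", mean, values=[2.0, 4.0])
bad = note.invoke("mean", mean, values=[])  # fails: division by zero

for run in note.runs:
    print(run.app_name, run.inputs, run.status.value)
```

Recording failed runs alongside successful ones is a design choice of this sketch; it keeps the notebook an honest log of what was tried.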
An extensible apps architecture is critical to achieving an open analytics platform. LabBook’s web-based apps architecture allows users to dynamically upload and register apps, and its app framework provides APIs that allow anyone to develop apps and integrate them. LabBook comes with several default apps for visualizing data, browsing catalogs, recommending data and people, etc. A contributed app is packaged as a set of source files (e.g., JavaScript) and support files (e.g., images, UI templates, or internal data), along with a descriptor that lists package contents and app metadata.
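The text does not specify the descriptor format, so the following Python sketch assumes a JSON descriptor with hypothetical fields (name, version, entry_point, files) and shows the kind of check a registration step might perform: that the descriptor's file list matches what is actually in the package.

```python
import json

# Field names are assumptions for this sketch, not LabBook's real descriptor schema.
REQUIRED_FIELDS = ("name", "version", "entry_point", "files")


def validate_descriptor(descriptor, packaged_files):
    """Return a list of human-readable problems; an empty list means the package is consistent."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in descriptor]

    listed = set(descriptor.get("files", []))
    packaged = set(packaged_files)
    problems += [f"listed but not packaged: {path}" for path in sorted(listed - packaged)]
    problems += [f"packaged but not listed: {path}" for path in sorted(packaged - listed)]

    entry = descriptor.get("entry_point")
    if entry is not None and entry not in listed:
        problems.append(f"entry point not in files: {entry}")
    return problems


# A made-up descriptor for a made-up app.
descriptor = json.loads("""
{
  "name": "histogram-viewer",
  "version": "0.1.0",
  "entry_point": "src/main.js",
  "files": ["src/main.js", "templates/panel.html", "images/icon.png"]
}
""")

print(validate_descriptor(descriptor, descriptor["files"]))
print(validate_descriptor(descriptor, ["src/main.js"]))
```

Checking in both directions (listed but missing, present but unlisted) means the descriptor can serve as a trustworthy table of contents for the package.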
LabBook generates several kinds of context-aware search and recommendations, from general-purpose recommendations on a user’s homepage to personalized content on specific asset pages and personalized ranking of search results. For example, on the user’s homepage, recommendations target the user’s complete profile, using all related entities from the property graph as context (including the user’s social network, datasets she has used, notebooks she has authored, etc.). However, when performing a search in a notebook, browsing another user’s profile, or visualizing a dataset, that particular context, as well as the identity of the user, is used to generate recommendations. A search such as “sensors in CA”, as illustrated below, can leverage relationships between potential dataset candidates and entities in the context (e.g., the issuer, Mary; the search note, “sensors in CA”; and the notebook containing it, “Ozone”). For example, the CA-Ozone dataset might be preferred over the CA-ARB dataset because Peter is a colleague who has collaborated with Mary while John has not, making Peter’s relationship to Mary weigh more than John’s.
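One simple way to picture this kind of context-aware ranking is to score each candidate dataset by how strongly it connects to the entities in the current context. The Python sketch below is an illustration under stated assumptions, not LabBook's actual algorithm: it assumes Peter is linked to CA-Ozone and John to CA-ARB (the text implies but does not spell out these links), uses unit edge weights, and only looks at direct links and two-hop paths. PropertyGraph, context_score, and rank are hypothetical names.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny undirected weighted graph; a stand-in for the real property graph."""

    def __init__(self):
        self._adj = defaultdict(dict)

    def link(self, a, b, weight=1.0):
        self._adj[a][b] = weight
        self._adj[b][a] = weight

    def weight(self, a, b):
        return self._adj.get(a, {}).get(b, 0.0)

    def neighbors(self, a):
        return self._adj.get(a, {})


def context_score(graph, candidate, context):
    """Sum direct links and two-hop paths from the candidate to each context entity."""
    score = 0.0
    for entity in context:
        score += graph.weight(candidate, entity)
        for middle, first_hop in graph.neighbors(candidate).items():
            score += first_hop * graph.weight(middle, entity)
    return score


def rank(graph, candidates, context):
    """Order candidates from most to least connected to the context."""
    return sorted(candidates,
                  key=lambda c: context_score(graph, c, context),
                  reverse=True)


g = PropertyGraph()
g.link("dataset:CA-Ozone", "user:Peter")  # assumed link between Peter and CA-Ozone
g.link("dataset:CA-ARB", "user:John")     # assumed link between John and CA-ARB
g.link("user:Peter", "user:Mary")         # Peter has collaborated with Mary

context = ["user:Mary", "notebook:Ozone"]  # search issuer and containing notebook
print(rank(g, ["dataset:CA-ARB", "dataset:CA-Ozone"], context))
```

In this toy graph only CA-Ozone has a path to Mary (through Peter), so it ranks first. A real system would likely combine such graph signals with text relevance for the query itself.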