OCS 2020 Breakout: Mark Grover

#ocs2020 #data #metadata #ecosystem

Mark Grover is the creator of the open-source data catalog and metadata engine, Amundsen. Amundsen is used by data scientists and analysts to discover, understand, and trust the data they use. At Lyft, Amundsen has 700+ active users every week, and outside of Lyft, Amundsen is used by 27 companies, including Instacart, ING, and Square.

Relevant Links
LinkedIn - Twitter

Mark Grover is the founder of Stemma, the COSS company behind the Amundsen project. He shares the vision and progress around Stemma.

Introduction and topic: From Discovering to Trusting Data - 0:00

Mark’s journey, from Cloudera to Lyft to Amundsen to Stemma - 0:36

Problem: Technology and business users waste a lot of time on data discovery and exploration - 1:02

This lack of productivity had many side effects (lots of unknowns, increased database load, an interrupt-heavy data culture) - 2:00

Walking through an example problem from Lyft: trying to optimize ETA (the time it takes a driver to reach the rider) - 2:50

How did this become a problem? On the production side, the barrier to creating data has dropped, so it is really easy to add data to your data lake. On the consumption side, you have analytical and ML tools. This leaves a wide array of data in the data lake and, in the middle, a chasm of trust: What data do I use? Who else is using this data? Can I trust it? - 4:21

Existing solutions and goals for evaluation: A solution should (1) automatically capture everything related to data endeavors (tables, dashboards, ETL DAGs, HR systems, and relationships), (2) expose it in user-friendly ways (search, lineage, and API), and (3) be easy to extend to new sources and new classes of sources. It should be a source of truth for where, what, and how data is being stored and used - 5:38

Goal: Reduce the time to find trusted data with a versatile graph. Showing the graph. - 6:57

Holy grail of solving for productivity: metadata. What kind of metadata do we want to capture? ABC. Application Context (metadata needed by humans or applications to operate), Behavior (how data is created and used over time), Change (change in data over time). Terminology borrowed from the [Ground: Data Context as a Service](http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf) paper. - 7:05

We also want to capture metadata from a variety of source types (anything within the organization, such as Data stores, People, Dashboards/Reports, Notebooks, Events/Schemas, and Streams) - 8:13

Evaluating the solution space - 8:42

Meet Amundsen. Walking through product. - 8:57

See detailed descriptions and profiles of columns, see dashboards built on a dataset, search for data owned and frequently used by people - 14:27

Amundsen Architecture - 16:10

Exploring the pull data model vs. push data model - 18:51

Relevance vs. Popularity in the context of search ranking. How Amundsen strikes the balance - 19:46

Why Amundsen? What makes Amundsen different? First, it’s a catalog for next generation data infrastructure (e.g. Airflow for orchestration, Hive/Spark for ETL, Presto/BQ/Snowflake for Data Lake). Second, lower time to value. - 20:30

Amundsen’s Impact - 22:05

Amundsen’s open-source community and ecosystem, including prominent users and active company community - 23:14

What Amundsen’s future looks like (focus on lineage, ACL integration, showing search context of what’s matched) - 23:42

Developing breadth of applications from metadata in Amundsen - 24:35

Summary and concluding remarks (see also: The data production-consumption gap and Using Amundsen to Support User Privacy via Metadata Collection at Square) - 26:00
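
For readers who want something concrete, the ABC metadata categories from the 7:05 segment (Application Context, Behavior, Change) can be sketched as a small data model. This is only an illustration under stated assumptions, not Amundsen's actual schema or API: the class names `MetadataKind`, `MetadataRecord`, and `TableMetadata`, the table name, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class MetadataKind(Enum):
    """The three ABC categories named in the talk."""
    APPLICATION_CONTEXT = "application_context"  # needed by humans or applications to operate
    BEHAVIOR = "behavior"                        # how data is created and used over time
    CHANGE = "change"                            # how the data changes over time


@dataclass
class MetadataRecord:
    """One piece of metadata, tagged with its ABC category (hypothetical shape)."""
    kind: MetadataKind
    key: str
    value: str


@dataclass
class TableMetadata:
    """All metadata records attached to one table (hypothetical shape)."""
    name: str
    records: List[MetadataRecord] = field(default_factory=list)

    def add(self, kind: MetadataKind, key: str, value: str) -> None:
        self.records.append(MetadataRecord(kind, key, value))

    def by_kind(self, kind: MetadataKind) -> List[MetadataRecord]:
        # Return only the records in the requested ABC category.
        return [r for r in self.records if r.kind is kind]


# Made-up example data for an imaginary ETA table.
table = TableMetadata("rides.eta_estimates")
table.add(MetadataKind.APPLICATION_CONTEXT, "description", "Estimated pickup time per ride request")
table.add(MetadataKind.BEHAVIOR, "frequent_user", "analyst@example.com")
table.add(MetadataKind.CHANGE, "schema_change", "added column eta_seconds")

for record in table.by_kind(MetadataKind.BEHAVIOR):
    print(record.key, "=", record.value)
```

Tagging each record with its category keeps the three kinds of metadata queryable separately, which is one simple way to picture a catalog answering "who uses this table?" (Behavior) apart from "what is this table?" (Application Context).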

Share your questions and comments below!
