You are viewing documentation for Kubeflow 1.0

This is a static snapshot from the time of the Kubeflow 1.0 release.
For up-to-date information, see the latest version.

Component

Conceptual overview of components in Kubeflow Pipelines

A pipeline component is self-contained set of code that performs one step in the ML workflow (pipeline), such as data preprocessing, data transformation, model training, and so on. A component is analogous to a function, in that it has a name, parameters, return values, and a body.

Component code

The code for each component includes the following:

  • Client code: The code that talks to endpoints to submit jobs. For example, code to talk to the Google Dataproc API to submit a Spark job.

  • Runtime code: The code that does the actual job and usually runs in the cluster. For example, Spark code that transforms raw data into preprocessed data.

Note the naming convention for client code and runtime code—for a task named “mytask”:

  • The mytask.py program contains the client code.
  • The mytask directory contains all the runtime code.

Component definition

A component specification in YAML format describes the component for the Kubeflow Pipelines system. A component definition has the following parts:

  • Metadata: name, description, etc.
  • Interface: input/output specifications (name, type, description, default value, etc).
  • Implementation: A specification of how to run the component given a set of argument values for the component’s inputs. The implementation section also describes how to get the output values from the component once the component has finished running.

For the complete definition of a component, see the component specification.

Containerizing components

You must package your component as a Docker image. Components represent a specific program or entry point inside a container.

Each component in a pipeline executes independently. The components do not run in the same process and cannot directly share in-memory data. You must serialize (to strings or files) all the data pieces that you pass between the components so that the data can travel over the distributed network. You must then deserialize the data for use in the downstream component.

Next steps