Lifelong Learning¶
Motivation¶
Edge-cloud synergy machine learning currently faces two challenges: heterogeneous data distributions in complex scenarios, and small sample sizes on the edge. Edge-cloud synergy lifelong learning is proposed to address them: 1) To learn with knowledge shared between historical scenarios, the scheme combines two other learning schemes, multi-task learning and incremental learning; 2) The cloud knowledge base gives the scheme a memory, which helps adapt historical knowledge to new and unseen situations on the edge. By combining multi-task learning, incremental learning and the knowledge base, the lifelong learning scheme seeks to overcome both challenges.
Goals¶
Edge-cloud collaborative continuous learning.
Knowledge sharing between the edge and the cloud.
Automatic discovery and transfer learning of new knowledge.
Proposal¶
We propose using Kubernetes Custom Resource Definitions (CRDs) to describe the lifelong learning specification/status and a controller to synchronize these updates between edge and cloud.
Use Cases¶
Users can create lifelong learning jobs by providing training scripts, configuring training hyperparameters, providing training datasets, and configuring training and deployment triggers.
Design Details¶
There are three stages in a lifelong learning job: train, eval and deploy.
Each stage is in one of the following states:
Waiting: waiting for the trigger to be satisfied, i.e. waiting to train/eval/deploy
Ready: the corresponding trigger is satisfied, and the stage is ready to train/eval/deploy
Starting: the corresponding stage is starting
Running: the corresponding stage is running
Failed: the corresponding stage failed
Completed: the corresponding stage completed
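The stages and states above can be sketched as a small state machine. This is only an illustration: the transition order, the looping from deploy back to train, and the handling of a failed stage are assumptions made for the example, and the names `Stage`, `State` and `advance` are hypothetical rather than taken from the Sedna source.

```python
from enum import Enum


class Stage(Enum):
    TRAIN = "train"
    EVAL = "eval"
    DEPLOY = "deploy"


class State(Enum):
    WAITING = "Waiting"
    READY = "Ready"
    STARTING = "Starting"
    RUNNING = "Running"
    FAILED = "Failed"
    COMPLETED = "Completed"


# Assumed transition order within one stage (not taken from the Sedna source).
NEXT_STATE = {
    State.WAITING: State.READY,      # the trigger is satisfied
    State.READY: State.STARTING,     # the worker is being created
    State.STARTING: State.RUNNING,   # the worker pod is running
    State.RUNNING: State.COMPLETED,  # the worker pod finished successfully
}

# Assumed stage order: train, then eval, then deploy, then train again.
NEXT_STAGE = {
    Stage.TRAIN: Stage.EVAL,
    Stage.EVAL: Stage.DEPLOY,
    Stage.DEPLOY: Stage.TRAIN,
}


def advance(stage, state, failed=False):
    """Return the next (stage, state) pair for a job."""
    if failed and state in (State.STARTING, State.RUNNING):
        return stage, State.FAILED
    if state is State.COMPLETED:
        return NEXT_STAGE[stage], State.WAITING
    if state is State.FAILED:
        # Assumption: a failed stage goes back to waiting for its trigger.
        return stage, State.WAITING
    return stage, NEXT_STATE[state]
```

For example, `advance(Stage.TRAIN, State.COMPLETED)` moves the job to the eval stage in the Waiting state under these assumptions.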
CRD API Group and Version¶
The LifelongLearningJob CRD will be namespace-scoped.
The table below summarizes the group, kind and API version details for the CRD.
LifelongLearningJob
| Field | Description |
|---|---|
| Group | sedna.io |
| APIVersion | v1alpha1 |
| Kind | LifelongLearningJob |
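As a rough illustration of how the group, version and kind combine, the sketch below builds only the identifying part of a manifest. The helper name `manifest_skeleton` is hypothetical, and the `spec` is deliberately left empty because its fields are defined in the CRD source, not here.

```python
GROUP = "sedna.io"
VERSION = "v1alpha1"
KIND = "LifelongLearningJob"


def manifest_skeleton(name, namespace="default"):
    """Build the identifying part of a LifelongLearningJob manifest."""
    return {
        # Kubernetes writes group and version together as "<group>/<version>".
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": KIND,
        # The CRD is namespace-scoped, so the metadata carries a namespace.
        "metadata": {"name": name, "namespace": namespace},
        # The spec fields are defined in the CRD source and omitted here.
        "spec": {},
    }
```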
Lifelong learning CRD¶
See the CRD source for details.
Lifelong learning job type definition¶
See the golang source for details.
Validation¶
OpenAPI v3 schema-based validation can be used to guard against bad requests. It can reject invalid field values, for example a string value for a boolean field.
Here is a list of validations we need to support:
The dataset specified in the CRD should exist in k8s.
The edge node name specified in the CRD should exist in k8s.
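The two existence checks above could look roughly like the following. This is a minimal sketch, not the controller's implementation: `validate_job` is a hypothetical helper, and the sets of existing datasets and nodes are passed in directly, whereas a real controller would look them up through the K8S API server.

```python
def validate_job(dataset_names, node_name, existing_datasets, existing_nodes):
    """Return a list of validation errors; an empty list means the job is valid."""
    errors = []
    for name in dataset_names:
        if name not in existing_datasets:
            errors.append(f"dataset {name!r} not found")
    if node_name not in existing_nodes:
        errors.append(f"edge node {node_name!r} not found")
    return errors
```

Collecting every error before returning, rather than stopping at the first one, lets a user fix all problems in a rejected job in one pass.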
Lifelong learning job sample¶
See the source for an example.
Controller Design¶
The lifelong learning controller starts three separate goroutines called upstream, downstream and LifelongLearningJob controller.
These are not separate controllers as such but named here for clarity.
LifelongLearningJob controller: watches the updates of lifelong-learning job CRDs, and creates the workers according to the state machine.
downstream: synchronizes the lifelong-learning-job updates from the cloud to the edge node.
upstream: synchronizes the lifelong-learning-job updates from the edge node to the cloud.
Lifelong Learning Controller¶
The lifelong-learning controller watches for the updates of lifelong-learning jobs and the corresponding pods against the K8S API server.
Updates are categorized below along with the possible actions:
| Update Type | Action |
|---|---|
| New lifelong-learning-job created | Wait for the train trigger to be satisfied. |
| Lifelong-learning-job deleted | NA. The workers will be deleted by k8s gc. |
| The status of the lifelong-learning-job updated | Create the train/eval worker if the stage is ready. |
| The corresponding pod created/running/completed/failed | Update the status of the lifelong-learning job. |
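The update types and actions above can be read as a dispatch function. The sketch below assumes a job is a plain dict and that pod events map onto stage states as shown; the update type strings, `POD_EVENT_TO_STATE` and `handle_update` are hypothetical names for illustration, not part of the Sedna source.

```python
# Assumed mapping from pod events to stage states (not taken from the Sedna source).
POD_EVENT_TO_STATE = {
    "created": "Starting",
    "running": "Running",
    "completed": "Completed",
    "failed": "Failed",
}


def handle_update(update_type, job, pod_event=None):
    """Apply one update to a job dict and return a description of the action."""
    if update_type == "job_created":
        # A new job starts in the train stage, waiting for its trigger.
        job["stage"], job["state"] = "train", "Waiting"
        return "wait for the train trigger"
    if update_type == "job_deleted":
        return "nothing to do, k8s gc deletes the workers"
    if update_type == "status_updated":
        if job["state"] == "Ready" and job["stage"] in ("train", "eval"):
            return f"create {job['stage']} worker"
        return "nothing to do"
    if update_type == "pod_changed":
        job["state"] = POD_EVENT_TO_STATE[pod_event]
        return "update job status"
    raise ValueError(f"unknown update type: {update_type}")
```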
Downstream Controller¶
The downstream controller watches for the lifelong-learning job updates against the K8S API server.
Updates are categorized below along with the possible actions that the downstream controller can take:
| Update Type | Action |
|---|---|
| New lifelong-learning-job created | The controller sends the job information to the LCs. |
| Lifelong-learning-job deleted | The controller sends the delete event to the LCs. |
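A minimal sketch of what the downstream controller might send for the two update types above. The message layout (the `operation`, `kind`, `namespace`, `name` and `job` keys) and the helper `downstream_message` are assumptions made for illustration; the actual format exchanged between GM and LC is not specified in this section.

```python
import json


def downstream_message(event, job):
    """Build a JSON message describing a job event for an LC (hypothetical format)."""
    if event not in ("created", "deleted"):
        raise ValueError(f"unsupported event: {event}")
    meta = job["metadata"]
    payload = {
        "operation": event,
        "kind": "LifelongLearningJob",
        "namespace": meta["namespace"],
        "name": meta["name"],
    }
    if event == "created":
        # On creation the LC needs the full job information.
        payload["job"] = job
    return json.dumps(payload, sort_keys=True)
```

A delete message carries only the identifying fields, since the LC needs nothing more to drop its local copy of the job.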
Upstream Controller¶
The upstream controller watches for the lifelong-learning job updates from the edge node and applies these updates against the API server in the cloud.
Updates are categorized below along with the possible actions that the upstream controller can take:
| Update Type | Action |
|---|---|
| Lifelong-learning-job reported state updated | The controller appends the status reported by the LC to the job's status in the cloud. |
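The append behaviour of the upstream controller can be sketched as follows. The job is modelled as a plain dict, and the `status.conditions` list is an assumed place to keep the history of reported statuses; `apply_reported_status` is a hypothetical helper, not the Sedna implementation.

```python
def apply_reported_status(job, reported):
    """Append a status reported by an LC to the job's status history."""
    conditions = job.setdefault("status", {}).setdefault("conditions", [])
    # Copy the report so later changes by the caller do not alter the history.
    conditions.append(dict(reported))
    return job
```

Appending instead of overwriting keeps earlier reports available, so the stage history of a job can still be inspected in the cloud.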
Details of the API between GM (cloud) and LC (edge)¶
The flows of lifelong learning job¶
Flow of the job creation:
Flow of the train stage:
Flow of the eval stage:
Flow of the deploy stage:
Workers Communication¶
Workers do not need to communicate with each other.