Dataset Development Guide
Introduction
Sedna provides interfaces and public methods for data conversion and sampling in the Dataset class. A user-defined data processing class can inherit from the Dataset class and reuse these public capabilities.
1. Example
The following describes how to use the Dataset class, taking a txt-format file that lists a set of images as an example. The procedure is as follows:
1.1. All dataset classes in Sedna inherit from the base class sedna.datasources.BaseDataSource. The base class defines the interfaces required by a dataset, provides members such as data_parse_func, save, and concat, and supplies default implementations. A derived class can override these defaults as required.
from abc import ABC

import numpy as np

from sedna.common.class_factory import ClassFactory, ClassType
from sedna.common.file_ops import FileOps


class BaseDataSource:
    """
    An abstract class representing a :class:`BaseDataSource`.

    All datasets that represent a map from keys to data samples should subclass
    it. All subclasses should override `parse`, which loads train/eval/infer
    data through a function. Subclasses could also optionally override
    `__len__`, which is expected to return the size of the dataset. Set `x` for
    the feature-embedding and `y` for the target label.

    Parameters
    ----------
    data_type : str
        whether the data source is train/eval/test
    func: function
        function used to parse an iterable object batch by batch
    """

    def __init__(self, data_type="train", func=None):
        self.data_type = data_type  # sample type: train/eval/test
        self.process_func = None
        if callable(func):
            self.process_func = func
        elif func:
            # A non-callable func is treated as the name of a registered callback.
            self.process_func = ClassFactory.get_cls(
                ClassType.CALLBACK, func)()
        self.x = None  # sample feature
        self.y = None  # sample label
        self.meta_attr = None  # special in lifelong learning

    def num_examples(self) -> int:
        return len(self.x)

    def __len__(self):
        return self.num_examples()

    def parse(self, *args, **kwargs):
        raise NotImplementedError

    @property
    def is_test_data(self):
        return self.data_type == "test"

    def save(self, output=""):
        return FileOps.dump(self, output)


class TxtDataParse(BaseDataSource, ABC):
    """
    Parser for a txt file that contains an image list
    """

    def __init__(self, data_type, func=None):
        super(TxtDataParse, self).__init__(data_type=data_type, func=func)

    def parse(self, *args, **kwargs):
        pass
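As a rough illustration of the subclassing contract described in 1.1, the sketch below uses a simplified stand-in for the base class. `MiniDataSource` and `PairListParse` are hypothetical names that are not part of Sedna, and the stand-in accepts only a callable `func` (it omits the `ClassFactory` lookup and `FileOps.dump`). The subclass overrides only `parse` and fills `x` and `y`; `__len__` and `is_test_data` come from the base class.

```python
class MiniDataSource:
    """Simplified stand-in for BaseDataSource (assumption: callable func only)."""

    def __init__(self, data_type="train", func=None):
        self.data_type = data_type
        self.process_func = func if callable(func) else None
        self.x = None  # sample features, set by parse()
        self.y = None  # sample labels, set by parse()

    def num_examples(self) -> int:
        return len(self.x)

    def __len__(self):
        return self.num_examples()

    def parse(self, *args, **kwargs):
        raise NotImplementedError

    @property
    def is_test_data(self):
        return self.data_type == "test"


class PairListParse(MiniDataSource):
    """Hypothetical subclass that parses in-memory (feature, label) pairs."""

    def parse(self, *args, **kwargs):
        x_data, y_data = [], []
        for feature, label in args:
            if self.process_func:
                feature = self.process_func(feature)
            x_data.append(feature)
            # Labels are collected only for train/eval data.
            if not self.is_test_data:
                y_data.append(label)
        self.x = x_data
        self.y = y_data


train = PairListParse(data_type="train", func=str.upper)
train.parse(("a.jpg", 0), ("b.jpg", 1))

test = PairListParse(data_type="test")
test.parse(("c.jpg", None))
```

Calling `len()` on a data source before `parse` has run raises a `TypeError`, because `x` is still `None` at that point; the same holds for the base class shown above.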
1.2. Define the parse function of the dataset class
# Method of TxtDataParse; it replaces the placeholder parse() shown in 1.1.
def parse(self, *args, **kwargs):
    x_data = []
    y_data = []
    use_raw = kwargs.get("use_raw")
    for f in args:
        with open(f) as fin:
            if self.process_func:
                # Apply the user-supplied callback to every stripped line.
                res = list(map(self.process_func, [
                    line.strip() for line in fin.readlines()]))
            else:
                # Default: split each line on whitespace.
                res = [line.strip().split() for line in fin.readlines()]
        for tup in res:
            if not len(tup):
                continue
            if use_raw:
                # Keep the whole parsed record as the feature.
                x_data.append(tup)
            else:
                x_data.append(tup[0])
                if not self.is_test_data:
                    # A missing label defaults to 0.
                    if len(tup) > 1:
                        y_data.append(tup[1])
                    else:
                        y_data.append(0)
    self.x = np.array(x_data)
    self.y = np.array(y_data)
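To make the per-line rules concrete, here is a minimal sketch that applies the same logic to a temporary file, using plain Python lists instead of NumPy arrays. `parse_txt_file` is a hypothetical helper written only for this illustration, and the file contents are made up.

```python
import os
import tempfile


def parse_txt_file(path, is_test_data=False, use_raw=False, process_func=None):
    """Mirror the per-line rules of the parse method above (lists, not arrays)."""
    x_data, y_data = [], []
    with open(path) as fin:
        lines = [line.strip() for line in fin]
    if process_func:
        res = [process_func(line) for line in lines]
    else:
        res = [line.split() for line in lines]
    for tup in res:
        if not len(tup):
            continue  # skip records that parsed to nothing, e.g. blank lines
        if use_raw:
            x_data.append(tup)
        else:
            x_data.append(tup[0])
            if not is_test_data:
                y_data.append(tup[1] if len(tup) > 1 else 0)
    return x_data, y_data


# Made-up index file: "<image path> <label>" per line, plus one blank line
# and one line without a label.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("images/cat.jpg 0\n\nimages/dog.jpg 1\nimages/unlabeled.jpg\n")
    index_path = tmp.name

try:
    x, y = parse_txt_file(index_path)
    raw_x, raw_y = parse_txt_file(index_path, use_raw=True)
    test_x, test_y = parse_txt_file(index_path, is_test_data=True)
finally:
    os.remove(index_path)
```

Two details follow from the logic: labels read from the file stay strings (`"0"`, `"1"`) while the default for a missing label is the integer `0`, and with `use_raw=True` no labels are collected at all because the whole record is stored as the feature.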
2. Commissioning
The preceding implementation can be used directly in a PipeStep in Sedna or invoked independently. The following code invokes it independently:
import os
import unittest

from sedna.datasources import TxtDataParse

# Placeholder: replace with the path to your txt file that lists the images.
train_dataset_url = "/path/to/train_data.txt"


def _load_txt_dataset(dataset_url):
    # use original dataset url,
    # see https://github.com/kubeedge/sedna/issues/35
    return os.path.abspath(dataset_url)


class TestDataset(unittest.TestCase):
    def test_txtdata(self):
        train_data = TxtDataParse(data_type="train", func=_load_txt_dataset)
        train_data.parse(train_dataset_url, use_raw=True)
        # Holds when the file contains exactly one line.
        self.assertEqual(len(train_data), 1)


if __name__ == "__main__":
    unittest.main()