DataReader

class dragon.io.DataReader(**kwargs)

Read examples from a dataset.

The dataset class and data source are required to create a reader:

# Here we use ``dragon.io.KPLRecordDataset``
dataset = dragon.io.KPLRecordDataset
simple_reader = DataReader(dataset=dataset, source=path)

Partitions are available across distributed nodes:

distributed_reader = DataReader(
    dataset=dataset,
    source=path,
    part_idx=rank,
    num_parts=num_ranks,
)
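As a rough illustration of what ``part_idx`` and ``num_parts`` express, the sketch below splits a dataset into contiguous, near-equal index ranges. The helper ``partition_bounds`` is hypothetical and is not part of ``dragon.io``; the contiguous ceil-division split is an assumption, and the reader's actual partitioning may differ.

```python
def partition_bounds(num_examples, num_parts, part_idx):
    """Return the [start, end) example range for one partition.

    Hypothetical helper: assumes a contiguous split, which may not
    match what DataReader does internally.
    """
    if not 0 <= part_idx < num_parts:
        raise ValueError("part_idx must be in [0, num_parts)")
    # Ceil division so that every example belongs to some partition.
    part_size = (num_examples + num_parts - 1) // num_parts
    start = min(part_idx * part_size, num_examples)
    end = min(start + part_size, num_examples)
    return start, end


# Ranges seen by each of 4 ranks over a 10-example dataset.
bounds = [partition_bounds(10, 4, rank) for rank in range(4)]
```

Under this assumption the last rank may receive fewer examples than the others when the dataset size is not divisible by ``num_parts``.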

There are two shuffle schemes:

# Recommended when the data is on an SSD or the dataset is tiny
example_wise_shuffle_reader = DataReader(
    dataset=dataset,
    source=path,
    shuffle=True,
    num_chunks=0,  # Set to the number of examples
)

# Recommended when the data is on an HDD or the dataset is huge
chunk_wise_shuffle_reader = DataReader(
    dataset=dataset,
    source=path,
    shuffle=True,
    num_chunks=2048,
)
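The difference between the two schemes can be sketched in plain Python. The function ``shuffled_order`` below is hypothetical and does not use ``dragon.io``; it assumes that chunk-wise shuffling permutes the chunk order while reading each chunk sequentially, and that ``num_chunks=0`` means one chunk per example. The reader's actual behavior may differ.

```python
import random


def shuffled_order(num_examples, num_chunks=0, seed=None):
    """Return a read order over example indices (illustrative only).

    Assumption: chunks are shuffled as whole units and the examples
    inside a chunk are read in their stored order.
    """
    if num_examples <= 0:
        return []
    if num_chunks <= 0:
        # Treat 0 as "one chunk per example", i.e. example-wise shuffle.
        num_chunks = num_examples
    rng = random.Random(seed)
    # Ceil division: the last chunk may be shorter than the others.
    chunk_size = (num_examples + num_chunks - 1) // num_chunks
    chunk_starts = list(range(0, num_examples, chunk_size))
    rng.shuffle(chunk_starts)
    order = []
    for start in chunk_starts:
        order.extend(range(start, min(start + chunk_size, num_examples)))
    return order


example_wise = shuffled_order(10, num_chunks=0, seed=1)
chunk_wise = shuffled_order(10, num_chunks=3, seed=1)
```

Chunk-wise shuffling keeps reads mostly sequential, which is why it suits slow-seek storage, at the cost of weaker randomness within each chunk.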

__init__

DataReader.__init__(**kwargs)

Create a DataReader.

Parameters:
  • dataset (class) – The dataset class used to load examples.
  • source (str) – The path to the data source.
  • shuffle (bool, optional, default=False) – Whether to shuffle the data.
  • num_chunks (int, optional, default=0) – The number of chunks to split the data into.
  • num_parts (int, optional, default=1) – The number of partitions over the dataset.
  • part_idx (int, optional, default=0) – The index of the current partition.
  • seed (int, optional) – The random seed to use instead.

Methods

before_first

DataReader.before_first()

Move the cursor to before the beginning.

next_chunk

DataReader.next_chunk()

Select the next chunk.

next_example

DataReader.next_example()

Return the next example.

reset

DataReader.reset()

Reset the environment of the dataset.

run

DataReader.run()

Start the process.