Skip to content

Create Your First Dataset

In labelit datasets contain documents. Each document has some optional metadata, an optional text file, and an optional audio recording. A document may contain both text and audio, and must contain at least one audio file or one text file.

Importing the provided example dataset

To help you get started, we've created an example dataset. It's a zip file located at backend/src/labelit/tests/data/example_dataset.zip relative to the project root folder.

If you inspect this dataset, you will have an idea of the required format for import.

example_dataset/
    ├── documents
       ├── file1.meta.json
       ├── file1.txt
       ├── file1.wav
       ├── file2.meta.json
       ├── file2.txt
       └── file2.wav
    └── meta.json

Notice the meta.json file at the root of the dataset. This file must contain a name and may contain any other metadata you'd like to associate with this dataset.

To import this dataset into labelit, login to labelit as a QA user and navigate to the "Datasets" page. There, click on "Import dataset" and follow the steps to upload example_dataset.zip. In your list of datasets, "Dummy dataset" should now appear.

Importing your own dataset

To import your own dataset, follow closely the format of example_dataset :

  • At the root, there must be a file named meta.json, containing at the mimimum the name of the dataset

    {
        "name": "the name of my own dataset"
    }
    
  • At the root, there must be a documents folder :

    • For each document, this folder may contain a metadata file named <document_identifier>.meta.json

    • For each document, there may be an audio (wave) file named <document_identifier>.wav

    • For each document, there may be an text file named <document_identifier>.txt containing the associated text