Ideas for annotation utilities

I'm currently using the dummy Python script below to gather annotation data, but I can see a path for converting it from a "dummy script" into a more standardized function.

I want to address "caching the raw dataset", as well as the progression from "an initial pull of data" to "some data is annotated" and finally to "need more data to train my model, so keep pulling more, but skip existing data".
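For the last stage ("keep pulling more, but skip existing data"), one option might be a small manifest of the date ranges that have already been pulled. This is only a minimal sketch: the JSON manifest file and the helper names `filter_new_ranges` and `record_ranges` are hypothetical and are not part of `cdp_data` or `speakerbox`.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set, Tuple

DateRange = Tuple[str, str]


def _load_pulled(manifest_path: Path) -> Set[DateRange]:
    """Read the (hypothetical) manifest of date ranges already pulled."""
    if not manifest_path.exists():
        return set()
    return {tuple(r) for r in json.loads(manifest_path.read_text())}


def filter_new_ranges(
    requested: Iterable[DateRange],
    manifest_path: Path,
) -> List[DateRange]:
    """Return the requested date ranges that are not in the manifest yet."""
    pulled = _load_pulled(manifest_path)
    return [r for r in requested if tuple(r) not in pulled]


def record_ranges(ranges: Iterable[DateRange], manifest_path: Path) -> None:
    """Add date ranges to the manifest after they were pulled successfully."""
    pulled = _load_pulled(manifest_path)
    pulled.update(tuple(r) for r in ranges)
    # JSON has no tuples, so each range is stored as a two-item list.
    manifest_path.write_text(json.dumps(sorted(pulled), indent=2))
```

The idea would be to pass the list of `(start_date, end_date)` pairs through `filter_new_ranges` before the `datasets.get_session_dataset` loop, and to call `record_ranges` once a pull finishes. The script I'm using today, for reference: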

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
from pathlib import Path

import torch
from cdp_data import CDPInstances, datasets

from speakerbox.preprocessing import diarize_and_split_audio

###############################################################################

logging.basicConfig(
    level=logging.INFO,
    format="[%(levelname)4s: %(module)s:%(lineno)4s %(asctime)s] %(message)s",
)
log = logging.getLogger(__name__)

###############################################################################

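# NOTE: this only constructs a device object and discards it; it does not
# change which device later calls run on.
torch.device("cpu")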

# Pull specific meetings
for start_date, end_date in [
    ("2021-05-24", "2021-05-25"),
    ("2021-06-07", "2021-06-08"),
    ("2021-09-20", "2021-09-21"),
    ("2021-06-28", "2021-06-29"),
    ("2021-07-12", "2021-07-13"),
]:
    datasets.get_session_dataset(
        CDPInstances.Seattle,
        start_datetime=start_date,
        end_datetime=end_date,
        store_audio=True,
    )


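# Walk the locally stored dataset and diarize every session audio file
# that has not been processed (or annotated) yet.
dataset_dir = Path(f"cdp-datasets/{CDPInstances.Seattle}")
for audio_file in dataset_dir.glob("event-*/session-*/audio.wav"):
    # Output directory is named after the event directory (event-*)
    storage = audio_file.parent.parent.name
    # Skip events whose output already exists, either as raw diarized chunks
    # or as a directory renamed with the "ANNOTATED-" prefix
    if not (Path(storage).exists() or Path(f"ANNOTATED-{storage}").exists()):
        print("working on file:", audio_file)
        print("storing:", storage)
        torch.cuda.empty_cache()
        diarize_and_split_audio(audio_file, storage_dir=storage)
    else:
        print("skipping", storage)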

```
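As a first step toward a standardized function, the skip check in the script above could be pulled out into a generator that only depends on `pathlib`. This is a rough sketch, not a proposed final API: the name `iter_unprocessed_audio` and the `output_root` and `annotated_prefix` parameters are hypothetical, and the directory layout is assumed to be the same `event-*/session-*/audio.wav` pattern the script globs for.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_unprocessed_audio(
    dataset_dir: Path,
    output_root: Path = Path("."),
    annotated_prefix: str = "ANNOTATED-",
) -> Iterator[Tuple[Path, Path]]:
    """Yield (audio_file, storage_dir) pairs that still need diarization."""
    for audio_file in sorted(dataset_dir.glob("event-*/session-*/audio.wav")):
        # Same naming as the script: output is keyed by the event directory.
        name = audio_file.parent.parent.name
        storage = output_root / name
        annotated = output_root / f"{annotated_prefix}{name}"
        # Skip anything that already has raw or annotated output on disk.
        if storage.exists() or annotated.exists():
            continue
        yield audio_file, storage
```

The caller would loop over the pairs and run `diarize_and_split_audio(audio_file, storage_dir=storage)` for each one, which keeps the filesystem logic testable without loading any audio or models.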
