Open
Labels
enhancement: New feature or request
Description
I currently use this dummy Python script to gather annotation data, but I can see a way to convert it from a "dummy script" into a more standardized function.
I want to address the idea of "caching the raw dataset", and also the progression from "an initial pull of data" to "some data is annotated" and finally to "need more data to train my model, keep pulling more, but skip existing data".
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
from pathlib import Path

import torch
from cdp_data import CDPInstances, datasets
from speakerbox.preprocessing import diarize_and_split_audio

###############################################################################

logging.basicConfig(
    level=logging.INFO,
    format="[%(levelname)4s: %(module)s:%(lineno)4s %(asctime)s] %(message)s",
)
log = logging.getLogger(__name__)

###############################################################################

# NOTE: this statement only constructs a device object and discards it;
# it does not select a device on its own.
torch.device("cpu")

# Pull specific meetings
for start_date, end_date in [
    ("2021-05-24", "2021-05-25"),
    ("2021-06-07", "2021-06-08"),
    ("2021-09-20", "2021-09-21"),
    ("2021-06-28", "2021-06-29"),
    ("2021-07-12", "2021-07-13"),
]:
    datasets.get_session_dataset(
        CDPInstances.Seattle,
        start_datetime=start_date,
        end_datetime=end_date,
        store_audio=True,
    )

# Diarize and split every pulled audio file that has no output directory yet
dataset_dir = Path(f"cdp-datasets/{CDPInstances.Seattle}")
for audio_file in dataset_dir.glob("event-*/session-*/audio.wav"):
    # The output directory is named after the event directory (event-*)
    storage = audio_file.parent.parent.name
    if not (Path(storage).exists() or Path(f"ANNOTATED-{storage}").exists()):
        print("working on file:", audio_file)
        print("storing:", storage)
        torch.cuda.empty_cache()
        diarize_and_split_audio(audio_file, storage_dir=storage)
    else:
        print("skipping", storage)
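
As a starting point for discussion, here is a minimal sketch of how the script above might be turned into a reusable function. It is not an existing cdp-data or speakerbox API: the names `prepare_new_sessions`, `is_already_processed`, `fetch`, `process`, and `output_root` are all hypothetical. The fetching and diarization steps are passed in as callables so the skip logic can be exercised without downloading anything, and the skip rule is the same one the script uses (an output directory named after the event, with or without the `ANNOTATED-` prefix).

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def is_already_processed(name: str, output_root: Path) -> bool:
    """Return True if a raw or ANNOTATED- output directory exists for `name`."""
    return (output_root / name).exists() or (
        output_root / f"ANNOTATED-{name}"
    ).exists()


def prepare_new_sessions(
    date_ranges: Iterable[Tuple[str, str]],
    dataset_dir: Path,
    output_root: Path,
    fetch: Callable[[str, str], None],
    process: Callable[[Path, Path], None],
) -> List[Path]:
    """Fetch each date range, then process only audio without existing output.

    `fetch(start, end)` is assumed to store audio under `dataset_dir`, and
    `process(audio_file, out_dir)` is assumed to create `out_dir`.
    Returns the output directories handed to `process` during this call.
    """
    # Hypothetical wiring: `fetch` would wrap the dataset pull from the script.
    for start, end in date_ranges:
        fetch(start, end)

    created: List[Path] = []
    for audio_file in sorted(dataset_dir.glob("event-*/session-*/audio.wav")):
        # Same key as the script: the event directory name (event-*).
        name = audio_file.parent.parent.name
        if is_already_processed(name, output_root):
            continue
        out_dir = output_root / name
        process(audio_file, out_dir)
        created.append(out_dir)
    return created
```

To reproduce the script, `fetch` could wrap `datasets.get_session_dataset(CDPInstances.Seattle, start_datetime=start, end_datetime=end, store_audio=True)` and `process` could wrap `diarize_and_split_audio(audio_file, storage_dir=out_dir)`. Calling the function again with additional date ranges would then only process events that have neither a raw nor an `ANNOTATED-` directory. Because the key is the event name, as in the script, an event with several sessions would only have its first session processed, assuming `process` creates the output directory.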