Skip to content

Conversation

@joshfactorial
Copy link
Collaborator

No description provided.

@joshfactorial joshfactorial self-assigned this Oct 16, 2025
run_neat(new_config_file)



Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are great to have, but let's wrap them in a unit test. Most of our tests live in "NEAT/tests", but it's easier just to add them to your script. That way Github will run the tests automatically for us. Look up Python's unit-test package (https://realpython.com/python-unittest/, is one description) and let me know if you have any questions.


f.close()

full_orig_seq = orig_seq.replace("\n", "")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but requires looping over the entire sequence again ("replace" has to check every character for '\n') which could become an issue with larger datasets or someone running multiple bacteria at once. What you can do instead is when your read in the line on line 73, use python's "strip" (line.strip()) method, which automatically removes line breaks and white space from the end of the string.

# Runs the NEAT read simulator using the given config file

def run_neat(config_file):
subprocess.run(["neat", "read-simulator", "-c", config_file, "-o", config_file])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "-o" variable should be the name of the output directory (small change since last version). Let's try this: take "output_dir" and "prefix". Ultimately, those will just be user inputs, but we can set output dir equal to the parent dir of the config file. One was is to use Pathlib -> str(Path(config_file).parent.absolute()) or even just default to the current working directory (os.getcwd()). But we may want to use a temporary directory as well, for the intermediate outputs.

# if not line.startswith(b"#"):
# out_f.write(line)

def stitch_all_outputs(ofw: OutputFileWriter, output_files: list[tuple[int, dict[str, Path]]],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see you were updating from my latest work. I think the original version will work better here, though, because in the main code I decided to do VCFs a slightly different way. Start with the commented out code that includes all 4, and don't worry so much about the ofw code and all that, we won't need it yet.


# Stitching all outputs together - Keshav's script

def concat_fq(input_files: List[Path], dest: BgzfWriter) -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be correct, because NEAT will output bgzipped fastq files, so we'll need bgzf to read and write them. For fq, this should work, so we can go with this version over the commented out version.

# with input_file.open("rb") as in_f:
# shutil.copyfileobj(in_f, out_f)

def merge_bam(bams: List[Path], ofw: OutputFileWriter, threads: int) -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this pysam method should work for stitching the bams, so we can use this But instead of OutputFileWriter, you can just pass in the file names directly.

concat_fq(fq2_list, ofw.files_to_write[ofw.fq2])
merge_bam(bam_list, ofw, threads)

# def stitch_all_outputs(options: Options, thread_options: list[Options]) -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options has a lot of things, but the most important for this piece is just the file names. So you can pass the filenames in directly here (maybe have them default to "None" since we want to account for any and all files).

# vcf_list.append(local_ops.vcf)
# if local_ops.bam:
# bam_list.append(local_ops.bam)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably won't need the loop, since we know ahead of time this will only be two files (regular and wrapped)

# options = Options(reference)
# bam_header = {}

# ofw = OutputFileWriter(options, bam_header)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do it this way, if you get all the parts to work, or just directly import Bio.bgzf and open the files with bgzfReader and bgzfWriter. Whichever way is easiest for now is perfect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants