-
Notifications
You must be signed in to change notification settings - Fork 21
Feature/karen bacterial wrapper #183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| run_neat(new_config_file) | ||
|
|
||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are great to have, but let's wrap them in a unit test. Most of our tests live in "NEAT/tests", but it's easier just to add them to your script. That way Github will run the tests automatically for us. Look up Python's unit-test package (https://realpython.com/python-unittest/, is one description) and let me know if you have any questions.
|
|
||
| f.close() | ||
|
|
||
| full_orig_seq = orig_seq.replace("\n", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works, but requires looping over the entire sequence again ("replace" has to check every character for '\n') which could become an issue with larger datasets or someone running multiple bacteria at once. What you can do instead is when your read in the line on line 73, use python's "strip" (line.strip()) method, which automatically removes line breaks and white space from the end of the string.
| # Runs the NEAT read simulator using the given config file | ||
|
|
||
| def run_neat(config_file): | ||
| subprocess.run(["neat", "read-simulator", "-c", config_file, "-o", config_file]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the "-o" variable should be the name of the output directory (small change since last version). Let's try this: take "output_dir" and "prefix". Ultimately, those will just be user inputs, but we can set output dir equal to the parent dir of the config file. One was is to use Pathlib -> str(Path(config_file).parent.absolute()) or even just default to the current working directory (os.getcwd()). But we may want to use a temporary directory as well, for the intermediate outputs.
| # if not line.startswith(b"#"): | ||
| # out_f.write(line) | ||
|
|
||
| def stitch_all_outputs(ofw: OutputFileWriter, output_files: list[tuple[int, dict[str, Path]]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see you were updating from my latest work. I think the original version will work better here, though, because in the main code I decided to do VCFs a slightly different way. Start with the commented out code that includes all 4, and don't worry so much about the ofw code and all that, we won't need it yet.
|
|
||
| # Stitching all outputs together - Keshav's script | ||
|
|
||
| def concat_fq(input_files: List[Path], dest: BgzfWriter) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be correct, because NEAT will output bgzipped fastq files, so we'll need bgzf to read and write them. For fq, this should work, so we can go with this version over the commented out version.
| # with input_file.open("rb") as in_f: | ||
| # shutil.copyfileobj(in_f, out_f) | ||
|
|
||
| def merge_bam(bams: List[Path], ofw: OutputFileWriter, threads: int) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this pysam method should work for stitching the bams, so we can use this But instead of OutputFileWriter, you can just pass in the file names directly.
| concat_fq(fq2_list, ofw.files_to_write[ofw.fq2]) | ||
| merge_bam(bam_list, ofw, threads) | ||
|
|
||
| # def stitch_all_outputs(options: Options, thread_options: list[Options]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options has a lot of things, but the most important for this piece is just the file names. So you can pass the filenames in directly here (maybe have them default to "None" since we want to account for any and all files).
| # vcf_list.append(local_ops.vcf) | ||
| # if local_ops.bam: | ||
| # bam_list.append(local_ops.bam) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably won't need the loop, since we know ahead of time this will only be two files (regular and wrapped)
| # options = Options(reference) | ||
| # bam_header = {} | ||
|
|
||
| # ofw = OutputFileWriter(options, bam_header) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do it this way, if you get all the parts to work, or just directly import Bio.bgzf and open the files with bgzfReader and bgzfWriter. Whichever way is easiest for now is perfect.
No description provided.