Refactoring and updating data generating methods. #34
linlin-s merged 15 commits into amazon-ion:master
Conversation
tgregg
left a comment
Encapsulating the schema-related logic is an improvement. A few comments below.
     * Initializing the ByteLength object.
     * @param value represents constraint field 'byte_length'.
     */
    public ByteLength(IonValue value) {
Since there's a public factory method, the constructor should be private. This goes for all of these constraint classes.
Thanks for catching this, will update them in the next commit.
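The factory-plus-private-constructor pattern being discussed could be sketched like this (the class body is a simplified stand-in, not the PR's actual ByteLength, which reparses an IonValue):

```java
public final class ByteLengthSketch {
    // Simplified stand-in field; the real class reparses an IonValue constraint.
    private final int length;

    // Private constructor: callers cannot bypass the factory method.
    private ByteLengthSketch(int length) {
        this.length = length;
    }

    // Public factory method, the single way to obtain an instance.
    public static ByteLengthSketch of(int length) {
        return new ByteLengthSketch(length);
    }

    public int getLength() {
        return length;
    }
}
```

With the constructor private, every constraint class has exactly one construction path, which keeps validation and parsing logic in one place.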
// Processing the constraint value which contains 'range' annotation.
public class Range {
    private static final String KEYWORD_RANGE = "range";
    IonSequence sequence;
Looks like this should be final. Should it also be private?
For the attribute sequence, it should be public final, as it will be used in the class TimestampPrecision.
import com.amazon.ion.IonValue;

public class Regex extends ReparsedConstraint {
    String pattern;
@@ -0,0 +1,5 @@
package com.amazon.ion.benchmark.schema.constraints;

public abstract class ReparsedConstraint {
Should this be an interface?
For now, it would be better to define this as an interface; will update it in the next commit.
private static final String KEYWORD_VALID_VALUES = "valid_values";
private static final String KEYWORD_NAME = "name";
// Using map to avoid processing the multiple repeat constraints situation.
Map<String, ReparsedConstraint> constraintMap;

Map<String, ReparsedConstraint> constraintMap = parsedTypeDefinition.getConstraintMap();
Map<String, ReparsedConstraint> constraintMapClone = new HashMap<>();
constraintMapClone.putAll(constraintMap);
Why is it necessary to clone the constraint map?
} else if (regexPattern != null && codePointsLengthBound == null) {
    RgxGen rgxGen = new RgxGen(regexPattern);
    constructedString = rgxGen.generate();
    Regex regex = (Regex) constraintMapClone.remove("regex");
This is the direct answer to my question about why the constraint map needed to be cloned, but now my question is: why do values have to be removed from the constraint map?
By removing the constraints that the Ion data generator is able to process, we can check what is left in the provided constraint map. If any constraints remain after removal, the generator will throw a message saying that these constraints are not supported for now.
e.g.

if (!constraintMapClone.isEmpty()) {
    throw new IllegalStateException("Found unhandled constraints: " + constraintMapClone.values());
}
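The clone-and-remove pattern described in this thread can be sketched with plain maps (class and method names here are illustrative, not the PR's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class UnhandledConstraintCheck {
    // Sketch of the clone-and-remove pattern: copy the constraint map,
    // remove each constraint as the generator handles it, and fail loudly
    // if anything the generator doesn't understand is left over.
    static void checkAllHandled(Map<String, String> constraintMap, String... handled) {
        Map<String, String> clone = new HashMap<>(constraintMap);
        for (String name : handled) {
            clone.remove(name); // this constraint was processed by the generator
        }
        if (!clone.isEmpty()) {
            throw new IllegalStateException("Found unhandled constraints: " + clone.values());
        }
    }
}
```

Cloning first means the original constraint map stays intact for any later consumer; only the working copy is consumed.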
IonSequence sequence = SYSTEM.newList(SYSTEM.newDecimal(MINIMUM_TIMESTAMP_IN_MILLIS_DECIMAL), SYSTEM.newDecimal(MAXIMUM_TIMESTAMP_IN_MILLIS_DECIMAL));
Range range = new Range(sequence);
Can this range be a constant? Is there a good reason to reconstruct it on every call?
Yes, it would be better to define it as constant instead of reconstructing it on every call.
// Build ion schema system from input ISL file.
IonSchemaSystem ISS = buildIonSchemaSystem(inputFile);
// Get the name of ISL file as schema ID.
String schemaID = inputFile.substring(inputFile.lastIndexOf('/') + 1);
Use a more robust way of extracting the filename in order to account for platform differences (/ isn't the path separator on all platforms). e.g. Paths.get(inputFile).toFile().getName()
Will update it in the next commit :)
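The suggested platform-independent filename extraction could be wrapped in a small helper (the helper name is illustrative):

```java
import java.nio.file.Paths;

public class SchemaIdHelper {
    // Extract the schema ID (the file name) from a path without assuming
    // '/' is the separator; java.nio resolves platform differences.
    static String schemaIdFrom(String inputFile) {
        return Paths.get(inputFile).toFile().getName();
    }
}
```

Unlike `substring(lastIndexOf('/') + 1)`, this also handles Windows-style `\` separators correctly.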
IonSchemaUtilities.checkValidationOfSchema(inputFilePath);
ReadGeneralConstraints.readIonSchemaAndGenerate(size, inputFilePath, format, path);
// Check whether the input schema file is valid and get the loaded schema.
Schema schema = IonSchemaUtilities.checkValidationOfSchema(inputFilePath);
checkValidationOfSchema doesn't really validate the schema, rather it just loads the schema file and returns Schema.
Maybe rename this method to loadSchemaDefinition or getSchemaDefinition.
Loading the schema into ion-schema-kotlin does validate it, but yes :)
What I meant to suggest here is that it becomes unclear whether the name refers to validating a value using the schema or to parsing the schema file for validation.
Changing the method name to loadSchemaDefinition or getSchemaDefinition would be more straightforward :). I'll also add a comment to indicate that this method is used for loading the schema and checking whether the input schema is valid.
    }
}

public static void readIonSchemaAndGenerate(int size, Schema schema, String format, String outputFile) throws Exception {
    // Assume there's only one constraint between schema_header and schema_footer.
Suggested change:
- // Assume there's only one constraint between schema_header and schema_footer.
+ // Assume there's only one type definition between schema_header and schema_footer.
One type definition can have multiple constraints. I guess you are expecting a single type definition here.
value.writeTo(writer);

private static void writeDataToFile(IonWriter writer, ReparsedType parsedTypeDefinition) throws Exception {
    // The first step is to check whether parsedTypeDefinition contains 'valid_values'. The reason we prioritize checking
    // 'valid_values' is that the constraint 'type' might not be contained in the type definition, in that case we cannot trigger
If a type definition doesn't contain a type constraint, it is assumed to have the default type: any constraint. So this could apply to logical constraints as well.
e.g.

type::{
  name: "my_type",
  one_of: [string, int],
  type: any  // <--- default type `any`
}

type::{
  name: "my_type",
  valid_values: ["a", 5, 4.5],
  type: any  // <--- default type `any`
}

Definition for type any: represents any of the types listed above as well as types derived from those types.
Thanks for the correction. The assumption that 'constraint type is required when constraint valid_values is not provided' is not aligned with the Ion Schema specification grammar. The logic for getting data type information will refer to this comment. This change will be included in the next PR.
/**
 * Get the value of constraint 'type' in IonType format.
 * @return the value of 'type' in IonType format.
 */
public IonType getIonType() {
    return IonType.valueOf(getIsl().get(KEYWORD_TYPE).toString().toUpperCase());
}
This is working for now because we're only supporting basic Ion types for generation. Since later on we'll want to support user-defined types for the type constraint, we'll ultimately have to change this: yield a string instead and then look that up by Schema#getType(String) so that we can recursively resolve constraints and generate.
This logic sounds better. Only one question here: looking up the type definition by Schema#getType(String) means the provided type definition should be identified by name (it should contain the name constraint), because the parameter required by getType(String name) is the name of the type definition. If an unnamed type definition is provided, we are not able to get it.
public static void writeRequestedSizeFile(int size, IonWriter writer, File file, ReparsedType parsedTypeDefinition) throws Exception {
    int currentSize = 0;
    int count = 0;
    // Determine how many values should be written before the writer.flush(), and this process aims to reduce the execution time of writer.flush().
    while (currentSize <= 0.05 * size) {
-       WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
+       WriteRandomIonValues.writeDataToFile(writer, parsedTypeDefinition);
        count += 1;
        writer.flush();
        currentSize = (int) file.length();
    }
    while (currentSize <= size) {
        for (int i = 0; i < count; i++) {
-           WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
+           WriteRandomIonValues.writeDataToFile(writer, parsedTypeDefinition);
        }
        writer.flush();
        currentSize = (int) file.length();
    }
}
We should use a CountingOutputStream or the like to track size written after each generated value. Then we don't have to check the file repeatedly or use this structure.
You'll want to put the counter between the BufferedOutputStream and the FileOutputStream in formatWriter.
This is the update based on this comment: Refactoring and updating data generating methods.
In this commit, the outputStream instance constructed from CountingOutputStream is passed as the parameter of IonTextWriterBuilder.standard().build(outputStream); getCount() is then used to monitor the size of the written data.
… method of counting the size of written data.
The updates from these comments are included in this commit: Updates PR based on the suggestions from comments.
// 10000T in millis, upper bound exclusive.
final static private BigDecimal MAXIMUM_TIMESTAMP_IN_MILLIS_DECIMAL = new BigDecimal(253402300800000L);
// Create a range which contains the default lower bound and upper bound values.
final static private Range RANGE = new Range(SYSTEM.newList(SYSTEM.newDecimal(62135769600000L), SYSTEM.newDecimal(253402300800000L)));
Should this be DEFAULT_RANGE?
Yes, the constant name should be replaced. Since the name DEFAULT_RANGE has been used before, in the next commit it will be replaced with DEFAULT_TIMESTAMP_IN_MILLIS_DECIMAL_RANGE.
public class Range {
    private static final String KEYWORD_RANGE = "range";
-   IonSequence sequence;
+   public final IonSequence sequence;
I'd recommend making this private and adding an accessor method with the appropriate visibility. Or, if it's only used by subclasses, consider protected.
Sounds good, in the next commit the accessor method will be added.
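The private-field-plus-accessor change could look roughly like this (using a plain List as a stand-in for the IonSequence field):

```java
import java.util.Collections;
import java.util.List;

public class RangeSketch {
    // Field is private and final; callers go through the accessor instead
    // of touching the sequence directly.
    private final List<Integer> sequence;

    public RangeSketch(List<Integer> sequence) {
        this.sequence = sequence;
    }

    // Accessor with the appropriate visibility; returning an unmodifiable
    // view keeps the underlying sequence encapsulated.
    public List<Integer> getSequence() {
        return Collections.unmodifiableList(sequence);
    }
}
```

Whether to return a defensive/unmodifiable view is a separate design choice; the key point is that the field itself is no longer public.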
Adds accessor method to Range. Removes subclasses of QuantifiableConstraints which don't include additional methods or attributes. Updates variable name from RANGE to DEFAULT_TIMESTAMP_IN_MILLIS_DECIMAL_RANGE.
import com.amazon.ion.system.IonTextWriterBuilder;
import com.amazon.ionschema.Schema;
import com.amazon.ionschema.Type;
import com.google.common.io.CountingOutputStream;
If this is the only thing we're using Guava for, I'd rather just include our own CountingOutputStream in this package and remove the dependency, which is pretty bulky. As you can see from the implementation, it's trivial: https://github.com/google/guava/blob/master/guava/src/com/google/common/io/CountingOutputStream.java
Indeed, it is not necessary to fetch the whole dependency to use the CountingOutputStream. Our own CountingOutputStream will be included in the following commit.
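A minimal self-contained CountingOutputStream along the lines of the Guava class linked above might look like this (note that the single-byte write must be overridden as well, since FilterOutputStream would otherwise route bytes around the counter):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Minimal replacement for Guava's CountingOutputStream: wraps another
// OutputStream and counts every byte written through it.
public class CountingOutputStream extends FilterOutputStream {
    private long count;

    public CountingOutputStream(OutputStream out) {
        super(out);
    }

    public long getCount() {
        return count;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        count += len;
    }

    // The single-byte write must also be overridden; FilterOutputStream's
    // default would not be counted by the array overload above.
    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }
}
```

Placed between the BufferedOutputStream and the FileOutputStream, this counts exactly the bytes that have actually left the buffer.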
while (count <= size) {
    IonValue constructedData = DataConstructor.constructIonData(parsedTypeDefinition);
    constructedData.writeTo(writer);
    count = outputStreamCounter.getCount();
In order for the count to update, the writer needs to be flushed. We may want to do this after every X values are written.
Thanks for pointing this out. Yes, we need to flush the buffered stream in writer to make sure the size of written data is up to date.
And here is another thought I have:
The motivation for using a counter on the OutputStream is to avoid the original while-loop logic by checking the OutputStream size instead of checking the output file size.
Original logic:

public static void writeRequestedSizeFile(int size, IonWriter writer, File file, IonStruct constraintStruct) throws Exception {
    int currentSize = 0;
    int count = 0;
    // Determine how many values should be written before the writer.flush(), and this process aims to reduce the execution time of writer.flush().
    while (currentSize <= 0.05 * size) {
        WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
        count += 1;
        writer.flush();
        currentSize = (int) file.length();
    }
    while (currentSize <= size) {
        for (int i = 0; i < count; i++) {
            WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
        }
        writer.flush();
        currentSize = (int) file.length();
    }
}
We do need to flush the writer, but flushing after every X written values would take the method back to the logic we were trying to avoid. Further consideration/discussion is probably needed here to come up with a more straightforward way to monitor the size of the written data.
The main thing that is bad about the old logic was repeatedly calling File.length(), which has to go all the way to the file system on each call. CountingOutputStream is at least an improvement over that.
Yes, for now it would be fine to keep the logic of flushing after every X written values, then using CountingOutputStream as the counter to monitor the size. The updates will be included in the following commit.
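The flush-every-X-values compromise could look roughly like this sketch (Counter stands in for the CountingOutputStream, the string write stands in for one generated Ion value, and all names are illustrative):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class FlushCadence {
    // Simple counting wrapper, standing in for a CountingOutputStream.
    static class Counter extends FilterOutputStream {
        long count;
        Counter(OutputStream out) { super(out); }
        @Override public void write(int b) throws IOException { out.write(b); count++; }
        @Override public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }
    }

    // Write until at least targetSize bytes have reached the counter,
    // flushing the buffer every flushEvery values so the count stays current.
    static long writeUntil(long targetSize, int flushEvery) throws IOException {
        Counter counter = new Counter(new ByteArrayOutputStream());
        BufferedOutputStream buffered = new BufferedOutputStream(counter);
        int written = 0;
        while (counter.count < targetSize) {
            // Stand-in for writing one generated Ion value.
            buffered.write("value\n".getBytes(StandardCharsets.UTF_8));
            if (++written % flushEvery == 0) {
                buffered.flush(); // the count only advances once the buffer is flushed
            }
        }
        buffered.flush();
        return counter.count;
    }
}
```

The trade-off is the same one discussed above: a larger flushEvery amortizes flush cost but makes the observed size lag further behind what was actually written.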
@@ -0,0 +1,5 @@
package com.amazon.ion.benchmark.schema.constraints;

public interface ReparsedConstraint {
This could use a JavaDoc to explain to a reader why it exists.
The explanation is added in the following commit.
0fc749e to 2a8ca42
    }
}

@Override
public void write(byte[] b, int off, int len) throws IOException {
You'll also need to override public void write(int b) throws IOException.
Right! My bad :( thanks for the reminder. It will be resolved in the next commit.
Adds writer.flush() while counting the written data size. Override write(int b) method in CountingOutputStream.
case KEYWORD_NAME:
case KEYWORD_TYPE:
    return;
default:
This could also include fields that are open content and which would not be part of ISL constraints. Adding a comment to the toConstraint method for this would work.
[EDIT]: As per offline discussion, there is still a recursive type definition change to be addressed along with this one before merging this.
The comment will be added in the next commit. For the recursive type definition change, there might be more potential updates along with it; I will file an issue for this comment and include the change in a future PR.
Description of changes:
Motivation:
This PR extracts the process of understanding Ion schema out of the data generating and data writing processes. Before refactoring, the way to extract information from an Ion schema file was to load the file into an IonReader and iterate through the elements until the target constraints were found; data generation followed immediately, which made it complicated to add new constraint-processing features. After refactoring, supporting a new feature will only take two steps:
Multiple files are changed in this PR, which has been separated into different commits based on the different functionalities.
Commits:
- ReparsedType provides the methods getName(), getISL(), getIonType(), getConstraintMap(). After passing the parameter type, which represents an Ion schema type definition, to the constructor, the created object allows extracting constraint information directly using the provided methods.
- QuantifiableConstraints, Regex and ValidValues: three different types of constraints.
- Scale, Precision, TimestampPrecision, ByteLength, CodepointLength, and ContainerLength: all of these constraint objects' values have the same pattern from the set ([, <RANGE>]) or can be processed into the format provided in the set. The value of these objects is extracted as a Range by using the method getRange().
- getValidValues(): if the provided 'valid_values' contains the annotation range, the range value is extracted by using the method getRange().
- upperBound() and lowerBound() allow access to the bound values. In the data generating process, when a range for a quantifiable constraint value is provided, the Ion data generator randomly chooses the constraint value within the provided range. This class allows getting a random value within the range directly by using the method getRandomQuantifiableValueFromRange().
- Regex inherits the class ReparsedConstraint. This class implements the method getPattern(), which allows getting the regular expression pattern in String format.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.