Refactoring and updating data generating methods. by linlin-s · Pull Request #34 · amazon-ion/ion-java-benchmark-cli

linlin-s · 2022-05-31T15:51:26Z

Description of changes:

Motivation:
This PR extracted the process of understanding ion schema out of data generating and data writing processes. Before refactoring, the method of extracting information out of ion schema file is to load the ion schema file to IonReader and then iterate the element until we find the target constraints. And this process also followed by data generating which makes adding new features of processing constraints complicated. After refactoring, supporting new features process will only contain two steps:

Adding the class which is used for implementing the functionalities of the new constraint.
Adding more steps in data generating process to extract the constraint value then generating data based on this constraint value.
There are multiple files changed in this PR and this PR has been separated into different commits based on the different functionalities.

Commits:

Adds package to process ion schema file:

ReparsedType.java: ReparsedType aims to simplify the process of understanding ion schema and the implemented methods including getName(), getISL(), getIonType(), getConstraintMap(). After passing parameter type which represents ion schema type definition to constructor, the created object allows to extract constraints information directly by using provided methods.
ReparsedConstraint.java: All constraints included in ion schema type definition after converting ion schema type definition from Type to ReparsedType will be converted to ReparsedConstraint. The ReparsedConstraint class is inherited by QuantifiableConstraints, Regex and ValidValues three different types of constraints.
QuantifiableConstraint.java: This class is inherited by Scale, Precision, TimestampPrecision, ByteLength, CodepointLength, and ContainerLength. All these constraint objects’ values have the same pattern from set ([, <RANGE>]) or could be processed into the format provided in the set. The value of these objects will be extracted as Range type by using method getRange().
ValidValues.java: After creating object from class ValidValues, the constraint value can be extracted by method getValidValues(). If the provided ‘valid_values’ contains annotation range, the range value will be extracted by using method getRange().
Range.java: This class implements methods of processing constraints values which contain annotation ‘range’. It provides methods uppBound() and lowerBound() which allow to access the bound values. In the data generating process, when a range of quantifiable constraint value provided, the ion data generator will randomly choose the constraint value randomly within the provided range. This class allows to get the random value within the range directly by using method getRandomQuantifiableValueFromRange().
Regex.java: Regex inherits class ReparsedConstraint. This class implements method getPattern() which allows to get the regular expression pattern in String format.

Commits which updated the process of generating different types of ion data:

In these commits, the process of extracting constraints from schema has been replaced by using the new ion schema understanding classes. For now, only generating scalar types of ion data processes have been updated, and some container data generating processes have been temporarily comment to avoid build failure. However, this PR will keep updating until all refactoring finished.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…f data.

tgregg

Encapsulating the schema-related logic is an improvement. A few comments below.

tgregg · 2022-06-01T00:29:11Z

src/com/amazon/ion/benchmark/schema/constraints/ByteLength.java

+     * Initializing the ByteLength object.
+     * @param value represents constraint field 'byte_length'.
+     */
+    public ByteLength(IonValue value) {


Since there's a public factory method, the constructor should be private. This goes for all of these constraint classes.

Since there's a public factory method, the constructor should be private. This goes for all of these constraint classes.

Thanks for catching this, will update them in the next commit.

tgregg · 2022-06-01T00:32:00Z

src/com/amazon/ion/benchmark/schema/constraints/Range.java

+// Processing the constraint value which contains 'range' annotation.
+public class Range {
+    private static final String KEYWORD_RANGE = "range";
+    IonSequence sequence;


Looks like this should be final. Should it also be private?

For the attribute sequence, it should be public final, as it will be used in the class TimestampPrecision.

tgregg · 2022-06-01T00:35:26Z

src/com/amazon/ion/benchmark/schema/constraints/Regex.java

+import com.amazon.ion.IonValue;
+
+public class Regex extends ReparsedConstraint{
+    String pattern;


private final

tgregg · 2022-06-01T00:36:00Z

src/com/amazon/ion/benchmark/schema/constraints/ReparsedConstraint.java

@@ -0,0 +1,5 @@
+package com.amazon.ion.benchmark.schema.constraints;
+
+public abstract class ReparsedConstraint {


Should this be an interface?

Should this be an interface?

For now, it would be better to assign this as interface, will update it in the next commit.

src/com/amazon/ion/benchmark/schema/constraints/Range.java

tgregg · 2022-06-01T00:44:27Z

src/com/amazon/ion/benchmark/schema/ReparsedType.java

+    private static final String KEYWORD_VALID_VALUES = "valid_values";
+    private static final String KEYWORD_NAME = "name";
+    // Using map to avoid processing the multiple repeat constraints situation.
+    Map<String, ReparsedConstraint> constraintMap;


private final

tgregg · 2022-06-01T00:52:02Z

src/com/amazon/ion/benchmark/WriteRandomIonValues.java

+        Map<String, ReparsedConstraint> constraintMap = parsedTypeDefinition.getConstraintMap();
+        Map<String, ReparsedConstraint> constraintMapClone = new HashMap<>();
+        constraintMapClone.putAll(constraintMap);


Why is it necessary to clone the constraint map?

tgregg · 2022-06-01T00:53:12Z

src/com/amazon/ion/benchmark/WriteRandomIonValues.java

-        } else if (regexPattern != null && codePointsLengthBound == null) {
-            RgxGen rgxGen = new RgxGen(regexPattern);
-            constructedString = rgxGen.generate();
+        Regex regex = (Regex) constraintMapClone.remove("regex");


This is the direct answer to my question about why the constraint map needed to be cloned, but now my question is: why do values have to be removed from the constraint map?

This is the direct answer to my question about why the constraint map needed to be cloned, but now my question is: why do values have to be removed from the constraint map?

By removing values which we are able to process in ion data generator, we can check what has been left in the provided constraint map. If there are still values left after removing, the generator will throw the message that these constraints are not supported for now.
e.g.

if (!constraintMapClone.isEmpty()) { throw new IllegalStateException ("Found unhandled constraints : " + constraintMapClone.values()); }

tgregg · 2022-06-01T00:56:07Z

src/com/amazon/ion/benchmark/WriteRandomIonValues.java

+        IonSequence sequence = SYSTEM.newList( SYSTEM.newDecimal(MINIMUM_TIMESTAMP_IN_MILLIS_DECIMAL), SYSTEM.newDecimal(MAXIMUM_TIMESTAMP_IN_MILLIS_DECIMAL));
+        Range range = new Range(sequence);


Can this range be a constant? Is there a good reason to reconstruct it on every call?

Can this range be a constant? Is there a good reason to reconstruct it on every call?

Yes, it would be better to define it as constant instead of reconstructing it on every call.

tgregg · 2022-06-01T01:02:12Z

src/com/amazon/ion/benchmark/IonSchemaUtilities.java

+        // Build ion schema system from input ISL file.
        IonSchemaSystem ISS = buildIonSchemaSystem(inputFile);
+        // Get the name of ISL file as schema ID.
        String schemaID = inputFile.substring(inputFile.lastIndexOf('/') + 1);


Use a more robust way of extracting the filename in order to account for platform differences (/ isn't the path separator on all platforms). e.g. Paths.get(inputFile).toFile().getName()

Use a more robust way of extracting the filename in order to account for platform differences (/ isn't the path separator on all platforms). e.g. Paths.get(inputFile).toFile().getName()

Will update it in the next commit :)

desaikd · 2022-06-01T03:54:21Z

src/com/amazon/ion/benchmark/GeneratorOptions.java

-        IonSchemaUtilities.checkValidationOfSchema(inputFilePath);
-        ReadGeneralConstraints.readIonSchemaAndGenerate(size, inputFilePath, format, path);
+        // Check whether the input schema file is valid and get the loaded schema.
+        Schema schema = IonSchemaUtilities.checkValidationOfSchema(inputFilePath);


checkValidationOfSchema doesn't really validate the schema, rather it just loads the schema file and returns Schema.
Maybe rename this method to loadSchemaDefinition or getSchemaDefinition.

Loading the schema into ion-schema-kotlin does validate it, but yes :)

What I meant to suggest here is it becomes unclear as to whether it points to validating a value using schema or parsing the schema file for validation.

Changing the method name to loadSchemaDefinition or getSchemaDefinition would be more straightforward :), and I'll also add a comment to indicate that this method is used for loading schema and checking whether the input schema is valid or not.

desaikd · 2022-06-01T03:58:10Z

src/com/amazon/ion/benchmark/ReadGeneralConstraints.java

-                }
-            }
+    public static void readIonSchemaAndGenerate(int size, Schema schema, String format, String outputFile) throws Exception {
+        // Assume there's only one constraint between schema_header and schema_footer.


Suggested change

// Assume there's only one constraint between schema_header and schema_footer.

// Assume there's only one type definition between schema_header and schema_footer.

One type definition can have multiple constraints. I guess you are expecting a single type definition here.

desaikd · 2022-06-01T04:06:09Z

src/com/amazon/ion/benchmark/WriteRandomIonValues.java

-            value.writeTo(writer);
+    private static void writeDataToFile(IonWriter writer, ReparsedType parsedTypeDefinition) throws Exception {
+        // The first step is to check whether parsedTypeDefinition contains 'valid_values'. The reason we prioritize checking
+        // 'valid_values' is that the constraint 'type' might not be contained in the type definition, in that case we cannot trigger


If any type definition doesn't contain a type constraint its assumed to be default type:any constraint. So this could be applied to logical constraints as well.
e.g.

type::{ name: "my_type", one_of: [string, int], type: any <--- default type `any` }

type::{ name: "my_type", valid_values: ["a", 5, 4.5], type: any <--- default type `any` }

Definition for type any: represents any of the types listed above as well as types derived from those types

If any type definition doesn't contain a type constraint its assumed to be default type:any constraint. So this could be applied to logical constraints as well. e.g.

type::{ name: "my_type", one_of: [string, int], type: any <--- default type `any` }

type::{ name: "my_type", valid_values: ["a", 5, 4.5], type: any <--- default type `any` }

Definition for type any: represents any of the types listed above as well as types derived from those types

Thanks for correction. The assumption of 'constraint type is required when constraint valid_values is not provided' is not aligned with ion schema specification grammar. The logic of getting data type information will refer to this comment. This change will be updated in the next PR.

jobarr-amzn · 2022-06-03T17:53:27Z

src/com/amazon/ion/benchmark/schema/ReparsedType.java

+    /**
+     * Get the value of constraint 'type' in IonType format.
+     * @return the value of 'type' in IonType format.
+     */
+    public IonType getIonType() {
+        return IonType.valueOf(getIsl().get(KEYWORD_TYPE).toString().toUpperCase());
+    }


This is working for now because we're only supporting basic Ion types for generation. Since later on we'll want to support user-defined types for the type constraint we'll ultimately have to change this- yield a string instead and then look that up by Schema#getType(String) so that we can recursively resolve constraints and generate.

This is working for now because we're only supporting basic Ion types for generation. Since later on we'll want to support user-defined types for the type constraint we'll ultimately have to change this- yield a string instead and then look that up by Schema#getType(String) so that we can recursively resolve constraints and generate.

This logic sounds better. Only one question here: Looking up type definition by Schema#getType(String), does this mean the provided type definition should be identified by name(should contain name constraint)? Cause the parameter required by getType (String name) is the name constraint of the type definition. And if unnamed type definition provided, we are not able to get that.

jobarr-amzn · 2022-06-03T17:56:55Z

src/com/amazon/ion/benchmark/WriteRandomIonValues.java

+    public static void writeRequestedSizeFile(int size, IonWriter writer, File file, ReparsedType parsedTypeDefinition) throws Exception {
        int currentSize = 0;
        int count = 0;
        // Determine how many values should be written before the writer.flush(), and this process aims to reduce the execution time of writer.flush().
        while (currentSize <= 0.05 * size) {
-            WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
+            WriteRandomIonValues.writeDataToFile(writer, parsedTypeDefinition);
            count += 1;
            writer.flush();
            currentSize = (int) file.length();
        }
        while (currentSize <= size) {
            for (int i = 0; i < count; i++) {
-                WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
+                WriteRandomIonValues.writeDataToFile(writer, parsedTypeDefinition);
            }
            writer.flush();
            currentSize = (int) file.length();
        }
    }


We should use a CountingOutputStream or the like to track size written after each generated value. Then we don't have to check the file repeatedly or use this structure.

You'll want to put the counter between the BufferedOutputStream and the FileOutputStream in formatWriter.

We should use a CountingOutputStream or the like to track size written after each generated value. Then we don't have to check the file repeatedly or use this structure.

You'll want to put the counter between the BufferedOutputStream and the FileOutputStream in formatWriter.

This is the update based on this comment: Refactoring and updating data generating methods.
In this commit, instance outputStream constructed from CountingOutputStream will be the parameter of IonTextWriterBuilder.standard().build(outputStream), then using getCount() to monitor the size of written data.

… method of counting the size of written data.

linlin-s · 2022-06-06T22:28:45Z

Encapsulating the schema-related logic is an improvement. A few comments below.

The updates from these comments are included in this commit: Updates PR based on the suggestions from comments.

tgregg · 2022-06-07T19:09:55Z

src/com/amazon/ion/benchmark/WriteRandomIonValues.java

-    // 10000T in millis, upper bound exclusive.
-    final static private BigDecimal MAXIMUM_TIMESTAMP_IN_MILLIS_DECIMAL = new BigDecimal(253402300800000L);
+    // Create a range which contains the default lower bound and upper bound values.
+    final static private Range RANGE = new Range(SYSTEM.newList( SYSTEM.newDecimal(62135769600000L), SYSTEM.newDecimal(253402300800000L)));


Should this be DEFAULT_RANGE?

Should this be DEFAULT_RANGE?

Yes, the constant name should be replaced. Cause DEFAULT_RANGE has been used before, in the next commit it will be replaced with DEFAULT_TIMESTAMP_IN_MILLIS_DECIMAL_RANGE.

tgregg · 2022-06-07T19:11:19Z

src/com/amazon/ion/benchmark/schema/constraints/Range.java

 public class Range {
    private static final String KEYWORD_RANGE = "range";
-    IonSequence sequence;
+    public final IonSequence sequence;


I'd recommend making this private and adding an accessor method with the appropriate visibility. Or, if it's only used by subclasses, consider protected.

I'd recommend making this private and adding an accessor method with the appropriate visibility. Or, if it's only used by subclasses, consider protected.

Sounds good, in the next commit the accessor method will be added.

Adds accessor method to Range. Removes subclasses of QuantifiableConstraints which don't include addtional methods or attributes. Updates varaible name from RANGE to DEFAULT_TIMESTAMP_IN_MILLIS_DECIMAL_RANGE.

tgregg · 2022-06-08T18:43:05Z

src/com/amazon/ion/benchmark/ReadGeneralConstraints.java

+import com.amazon.ion.system.IonTextWriterBuilder;
+import com.amazon.ionschema.Schema;
+import com.amazon.ionschema.Type;
+import com.google.common.io.CountingOutputStream;


If this is the only thing we're using Guava for, I'd rather just include our own CountingOutputStream in this package and remove the dependency, which is pretty bulky. As you can see from the implementation, it's trivial: https://github.com/google/guava/blob/master/guava/src/com/google/common/io/CountingOutputStream.java

If this is the only thing we're using Guava for, I'd rather just include our own CountingOutputStream in this package and remove the dependency, which is pretty bulky. As you can see from the implementation, it's trivial: https://github.com/google/guava/blob/master/guava/src/com/google/common/io/CountingOutputStream.java

Indeed, it is not necessary to fetch the whole dependency to use the CountingOutputStream. Our own CountingOutputStream will be included in the following commit.

tgregg · 2022-06-08T18:56:37Z

src/com/amazon/ion/benchmark/ReadGeneralConstraints.java

+            while (count <= size) {
+                IonValue constructedData = DataConstructor.constructIonData(parsedTypeDefinition);
+                constructedData.writeTo(writer);
+                count = outputStreamCounter.getCount();


In order for the count to update, the writer needs to be flushed. We may want to do this after every X values are written.

In order for the count to update, the writer needs to be flushed. We may want to do this after every X values are written.

Thanks for pointing this out. Yes, we need to flush the buffered stream in writer to make sure the size of written data is up to date.
And here is another thought I have:
The motivation of using counter to count OutputStream is to avoid the original while loop logic and checking the OutputStream size instead of checking the output file size.

Original logic:

public static void writeRequestedSizeFile(int size, IonWriter writer, File file, IonStruct constraintStruct) throws Exception { int currentSize = 0; int count = 0; // Determine how many values should be written before the writer.flush(), and this process aims to reduce the execution time of writer.flush(). while (currentSize <= 0.05 * size) { WriteRandomIonValues.writeDataToFile(writer, constraintStruct); count += 1; writer.flush(); currentSize = (int) file.length(); } while (currentSize <= size) { for (int i = 0; i < count; i++) { WriteRandomIonValues.writeDataToFile(writer, constraintStruct); } writer.flush(); currentSize = (int) file.length(); } }

We do need to flush the writer, but flush after X written data would take the method back to the logic we were trying to avoid. Probably the further consideration/discussion is needed here to come up with a more straightforward way to monitor the size of written data.

The main thing that is bad about the old logic was repeatedly calling File.length(), which has to go all the way to the file system each call. CountingOutputStream is at least an improvement over that.

The main thing that is bad about the old logic was repeatedly calling File.length(), which has to go all the way to the file system each call. CountingOutputStream is at least an improvement over that.

Yes, for now it would be fine to keep the logic of flushing after X written data then using CoutingOutputStream as counter to monitor the size. The updates will be included in the following commit.

tgregg · 2022-06-08T19:16:35Z

src/com/amazon/ion/benchmark/schema/constraints/ReparsedConstraint.java

@@ -0,0 +1,5 @@
+package com.amazon.ion.benchmark.schema.constraints;
+
+public interface ReparsedConstraint {


This could use a JavaDoc to explain to a reader why it exists.

This could use a JavaDoc to explain to a reader why it exists.

The explanation is added in the following commit.

tgregg · 2022-06-13T23:04:46Z

src/com/amazon/ion/benchmark/CountingOutputStream.java

+    }
+
+    @Override
+    public void write(byte[] b, int off, int len) throws IOException {


You'll also need to override public void write(int b) throws IOException.

You'll also need to override public void write(int b) throws IOException.

Right! My bad :( thanks for reminding. It will be resolved in the next commit.

Adds writer.flush() while counting the written data size. Override write(int b) method in CountingOutputStream.

desaikd · 2022-06-15T23:44:54Z

src/com/amazon/ion/benchmark/schema/ReparsedType.java

+            case KEYWORD_NAME:
+            case KEYWORD_TYPE:
+                return;
+            default:


This could also include fields that are open content and which would not be part of ISL constraints.
Adding a comment toConstraint method for this would work.
[EDIT]: As per offline discussion, there is still a recursive type definition change to be addressed along with this one before merging this.

This could also include fields that are open content and which would not be part of ISL constraints. Adding a comment toConstraint method for this would work. [EDIT]: As per offline discussion, there is still a recursive type definition change to be addressed along with this one before merging this.

The comment will be added in the next commit. For the recursive type definition change, there might be more potential updates along with this change. I will throw an issue for this comment and contain this change in the future PR.

linlin-s added 9 commits May 24, 2022 23:18

Adds package to process ion schema file.

2a61c7f

Changes the way of processing ion schema file.

38334bf

Updates float generating process.

1d1ff41

Updates symbol and string generating process.

de35e99

Updates blob and clob generating process.

3aeabc8

Updates int generating process.

1eb60ff

Updates decimal generating process.

5e0619c

Updates timestamp generating process.

8ee7330

Temporarily comment some unit tests for generating constainer types o…

5be4df5

…f data.

linlin-s changed the title ~~Refactoring~~ Refactoring and updating data generating methods. May 31, 2022

linlin-s requested review from desaikd, jobarr-amzn and tgregg May 31, 2022 22:15

tgregg reviewed Jun 1, 2022

View reviewed changes

desaikd reviewed Jun 1, 2022

View reviewed changes

jobarr-amzn reviewed Jun 3, 2022

View reviewed changes

linlin-s added 2 commits June 6, 2022 09:24

Updates PR based on the suggestions from comments.

89842cd

Separates data constructing from data writing process and updates the…

b984845

… method of counting the size of written data.

linlin-s force-pushed the refactoring branch from 87c8dcc to b984845 Compare June 6, 2022 22:19

tgregg reviewed Jun 7, 2022

View reviewed changes

Updates based on the most recent comments:

870c7a9

Adds accessor method to Range. Removes subclasses of QuantifiableConstraints which don't include addtional methods or attributes. Updates varaible name from RANGE to DEFAULT_TIMESTAMP_IN_MILLIS_DECIMAL_RANGE.

tgregg reviewed Jun 8, 2022

View reviewed changes

linlin-s force-pushed the refactoring branch 2 times, most recently from 0fc749e to 2a8ca42 Compare June 13, 2022 22:49

Remove the guava dependency and add comment for ReparsedConstraint.

bca28f9

linlin-s force-pushed the refactoring branch from 2a8ca42 to bca28f9 Compare June 13, 2022 22:54

tgregg reviewed Jun 13, 2022

View reviewed changes

Updates based on comments:

5b59119

Adds writer.flush() while counting the written data size. Override write(int b) method in CountingOutputStream.

desaikd approved these changes Jun 16, 2022

View reviewed changes

Adds comment to explain the case of 'open content'.

b9e259e

linlin-s mentioned this pull request Jun 20, 2022

Support user-defined types for type constraint. #35

Open

linlin-s merged commit f6f7fb8 into amazon-ion:master Jun 20, 2022

		@@ -0,0 +1,5 @@
		package com.amazon.ion.benchmark.schema.constraints;

		public abstract class ReparsedConstraint {

		IonSequence sequence = SYSTEM.newList( SYSTEM.newDecimal(MINIMUM_TIMESTAMP_IN_MILLIS_DECIMAL), SYSTEM.newDecimal(MAXIMUM_TIMESTAMP_IN_MILLIS_DECIMAL));
		Range range = new Range(sequence);

	// Assume there's only one constraint between schema_header and schema_footer.
	// Assume there's only one type definition between schema_header and schema_footer.

		@@ -0,0 +1,5 @@
		package com.amazon.ion.benchmark.schema.constraints;

		public interface ReparsedConstraint {

Conversation

linlin-s commented May 31, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tgregg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

linlin-s commented Jun 6, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

linlin-s Jun 13, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

linlin-s commented May 31, 2022 •

edited

Loading

linlin-s Jun 13, 2022 •

edited

Loading

desaikd Jun 15, 2022 •

edited

Loading

linlin-s Jun 20, 2022 •

edited

Loading