Skip to content

Refactoring and updating data generating methods.#34

Merged
linlin-s merged 15 commits intoamazon-ion:masterfrom
linlin-s:refactoring
Jun 20, 2022
Merged

Refactoring and updating data generating methods.#34
linlin-s merged 15 commits intoamazon-ion:masterfrom
linlin-s:refactoring

Conversation

@linlin-s
Copy link
Contributor

@linlin-s linlin-s commented May 31, 2022

Description of changes:

Motivation:
This PR extracted the process of understanding ion schema out of data generating and data writing processes. Before refactoring, the method of extracting information out of ion schema file is to load the ion schema file to IonReader and then iterate the element until we find the target constraints. And this process also followed by data generating which makes adding new features of processing constraints complicated. After refactoring, supporting new features process will only contain two steps:

  • Adding the class which is used for implementing the functionalities of the new constraint.
  • Adding more steps in data generating process to extract the constraint value then generating data based on this constraint value.
    There are multiple files changed in this PR and this PR has been separated into different commits based on the different functionalities.

Commits:

  1. Adds package to process ion schema file:
  • ReparsedType.java: ReparsedType aims to simplify the process of understanding ion schema and the implemented methods including getName(), getISL(), getIonType(), getConstraintMap(). After passing parameter type which represents ion schema type definition to constructor, the created object allows to extract constraints information directly by using provided methods.
  • ReparsedConstraint.java: All constraints included in ion schema type definition after converting ion schema type definition from Type to ReparsedType will be converted to ReparsedConstraint. The ReparsedConstraint class is inherited by QuantifiableConstraints, Regex and ValidValues three different types of constraints.
  • QuantifiableConstraint.java: This class is inherited by Scale, Precision, TimestampPrecision, ByteLength, CodepointLength, and ContainerLength. All these constraint objects’ values have the same pattern from set ([, <RANGE>]) or could be processed into the format provided in the set. The value of these objects will be extracted as Range type by using method getRange().
  • ValidValues.java: After creating object from class ValidValues, the constraint value can be extracted by method getValidValues(). If the provided ‘valid_values’ contains annotation range, the range value will be extracted by using method getRange().
  • Range.java: This class implements methods of processing constraints values which contain annotation ‘range’. It provides methods uppBound() and lowerBound() which allow to access the bound values. In the data generating process, when a range of quantifiable constraint value provided, the ion data generator will randomly choose the constraint value randomly within the provided range. This class allows to get the random value within the range directly by using method getRandomQuantifiableValueFromRange().
  • Regex.java: Regex inherits class ReparsedConstraint. This class implements method getPattern() which allows to get the regular expression pattern in String format.
  1. Commits which updated the process of generating different types of ion data:
  • In these commits, the process of extracting constraints from schema has been replaced by using the new ion schema understanding classes. For now, only generating scalar types of ion data processes have been updated, and some container data generating processes have been temporarily comment to avoid build failure. However, this PR will keep updating until all refactoring finished.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@linlin-s linlin-s changed the title Refactoring Refactoring and updating data generating methods. May 31, 2022
Copy link
Contributor

@tgregg tgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encapsulating the schema-related logic is an improvement. A few comments below.

* Initializing the ByteLength object.
* @param value represents constraint field 'byte_length'.
*/
public ByteLength(IonValue value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's a public factory method, the constructor should be private. This goes for all of these constraint classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's a public factory method, the constructor should be private. This goes for all of these constraint classes.

Thanks for catching this, will update them in the next commit.

// Processing the constraint value which contains 'range' annotation.
public class Range {
private static final String KEYWORD_RANGE = "range";
IonSequence sequence;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should be final. Should it also be private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the attribute sequence, it should be public final, as it will be used in the class TimestampPrecision.

import com.amazon.ion.IonValue;

public class Regex extends ReparsedConstraint{
String pattern;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private final

@@ -0,0 +1,5 @@
package com.amazon.ion.benchmark.schema.constraints;

public abstract class ReparsedConstraint {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an interface?

For now, it would be better to assign this as interface, will update it in the next commit.

private static final String KEYWORD_VALID_VALUES = "valid_values";
private static final String KEYWORD_NAME = "name";
// Using map to avoid processing the multiple repeat constraints situation.
Map<String, ReparsedConstraint> constraintMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private final

Comment on lines 330 to 332
Map<String, ReparsedConstraint> constraintMap = parsedTypeDefinition.getConstraintMap();
Map<String, ReparsedConstraint> constraintMapClone = new HashMap<>();
constraintMapClone.putAll(constraintMap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to clone the constraint map?

} else if (regexPattern != null && codePointsLengthBound == null) {
RgxGen rgxGen = new RgxGen(regexPattern);
constructedString = rgxGen.generate();
Regex regex = (Regex) constraintMapClone.remove("regex");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the direct answer to my question about why the constraint map needed to be cloned, but now my question is: why do values have to be removed from the constraint map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the direct answer to my question about why the constraint map needed to be cloned, but now my question is: why do values have to be removed from the constraint map?

By removing values which we are able to process in ion data generator, we can check what has been left in the provided constraint map. If there are still values left after removing, the generator will throw the message that these constraints are not supported for now.
e.g.

   if (!constraintMapClone.isEmpty()) {
            throw new IllegalStateException ("Found unhandled constraints : " + constraintMapClone.values());
        }

Comment on lines 543 to 544
IonSequence sequence = SYSTEM.newList( SYSTEM.newDecimal(MINIMUM_TIMESTAMP_IN_MILLIS_DECIMAL), SYSTEM.newDecimal(MAXIMUM_TIMESTAMP_IN_MILLIS_DECIMAL));
Range range = new Range(sequence);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this range be a constant? Is there a good reason to reconstruct it on every call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this range be a constant? Is there a good reason to reconstruct it on every call?

Yes, it would be better to define it as constant instead of reconstructing it on every call.

// Build ion schema system from input ISL file.
IonSchemaSystem ISS = buildIonSchemaSystem(inputFile);
// Get the name of ISL file as schema ID.
String schemaID = inputFile.substring(inputFile.lastIndexOf('/') + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a more robust way of extracting the filename in order to account for platform differences (/ isn't the path separator on all platforms). e.g. Paths.get(inputFile).toFile().getName()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a more robust way of extracting the filename in order to account for platform differences (/ isn't the path separator on all platforms). e.g. Paths.get(inputFile).toFile().getName()

Will update it in the next commit :)

IonSchemaUtilities.checkValidationOfSchema(inputFilePath);
ReadGeneralConstraints.readIonSchemaAndGenerate(size, inputFilePath, format, path);
// Check whether the input schema file is valid and get the loaded schema.
Schema schema = IonSchemaUtilities.checkValidationOfSchema(inputFilePath);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkValidationOfSchema doesn't really validate the schema, rather it just loads the schema file and returns Schema.
Maybe rename this method to loadSchemaDefinition or getSchemaDefinition.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loading the schema into ion-schema-kotlin does validate it, but yes :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant to suggest here is it becomes unclear as to whether it points to validating a value using schema or parsing the schema file for validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the method name to loadSchemaDefinition or getSchemaDefinition would be more straightforward :), and I'll also add a comment to indicate that this method is used for loading schema and checking whether the input schema is valid or not.

}
}
public static void readIonSchemaAndGenerate(int size, Schema schema, String format, String outputFile) throws Exception {
// Assume there's only one constraint between schema_header and schema_footer.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Assume there's only one constraint between schema_header and schema_footer.
// Assume there's only one type definition between schema_header and schema_footer.

One type definition can have multiple constraints. I guess you are expecting a single type definition here.

value.writeTo(writer);
private static void writeDataToFile(IonWriter writer, ReparsedType parsedTypeDefinition) throws Exception {
// The first step is to check whether parsedTypeDefinition contains 'valid_values'. The reason we prioritize checking
// 'valid_values' is that the constraint 'type' might not be contained in the type definition, in that case we cannot trigger
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any type definition doesn't contain a type constraint its assumed to be default type:any constraint. So this could be applied to logical constraints as well.
e.g.

type::{
   name: "my_type",
   one_of: [string, int],
   type: any <--- default type `any`
}
type::{
   name: "my_type",
   valid_values: ["a", 5, 4.5],
   type: any <--- default type `any`
}

Definition for type any: represents any of the types listed above as well as types derived from those types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any type definition doesn't contain a type constraint its assumed to be default type:any constraint. So this could be applied to logical constraints as well. e.g.

type::{
   name: "my_type",
   one_of: [string, int],
   type: any <--- default type `any`
}
type::{
   name: "my_type",
   valid_values: ["a", 5, 4.5],
   type: any <--- default type `any`
}

Definition for type any: represents any of the types listed above as well as types derived from those types

Thanks for correction. The assumption of 'constraint type is required when constraint valid_values is not provided' is not aligned with ion schema specification grammar. The logic of getting data type information will refer to this comment. This change will be updated in the next PR.

Comment on lines +68 to +74
/**
* Get the value of constraint 'type' in IonType format.
* @return the value of 'type' in IonType format.
*/
public IonType getIonType() {
return IonType.valueOf(getIsl().get(KEYWORD_TYPE).toString().toUpperCase());
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working for now because we're only supporting basic Ion types for generation. Since later on we'll want to support user-defined types for the type constraint we'll ultimately have to change this- yield a string instead and then look that up by Schema#getType(String) so that we can recursively resolve constraints and generate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working for now because we're only supporting basic Ion types for generation. Since later on we'll want to support user-defined types for the type constraint we'll ultimately have to change this- yield a string instead and then look that up by Schema#getType(String) so that we can recursively resolve constraints and generate.

This logic sounds better. Only one question here: Looking up type definition by Schema#getType(String), does this mean the provided type definition should be identified by name(should contain name constraint)? Cause the parameter required by getType (String name) is the name constraint of the type definition. And if unnamed type definition provided, we are not able to get that.

Comment on lines 300 to 317
public static void writeRequestedSizeFile(int size, IonWriter writer, File file, ReparsedType parsedTypeDefinition) throws Exception {
int currentSize = 0;
int count = 0;
// Determine how many values should be written before the writer.flush(), and this process aims to reduce the execution time of writer.flush().
while (currentSize <= 0.05 * size) {
WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
WriteRandomIonValues.writeDataToFile(writer, parsedTypeDefinition);
count += 1;
writer.flush();
currentSize = (int) file.length();
}
while (currentSize <= size) {
for (int i = 0; i < count; i++) {
WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
WriteRandomIonValues.writeDataToFile(writer, parsedTypeDefinition);
}
writer.flush();
currentSize = (int) file.length();
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use a CountingOutputStream or the like to track size written after each generated value. Then we don't have to check the file repeatedly or use this structure.

You'll want to put the counter between the BufferedOutputStream and the FileOutputStream in formatWriter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use a CountingOutputStream or the like to track size written after each generated value. Then we don't have to check the file repeatedly or use this structure.

You'll want to put the counter between the BufferedOutputStream and the FileOutputStream in formatWriter.

This is the update based on this comment: Refactoring and updating data generating methods.
In this commit, instance outputStream constructed from CountingOutputStream will be the parameter of IonTextWriterBuilder.standard().build(outputStream), then using getCount() to monitor the size of written data.

@linlin-s
Copy link
Contributor Author

linlin-s commented Jun 6, 2022

Encapsulating the schema-related logic is an improvement. A few comments below.

The updates from these comments are included in this commit: Updates PR based on the suggestions from comments.

// 10000T in millis, upper bound exclusive.
final static private BigDecimal MAXIMUM_TIMESTAMP_IN_MILLIS_DECIMAL = new BigDecimal(253402300800000L);
// Create a range which contains the default lower bound and upper bound values.
final static private Range RANGE = new Range(SYSTEM.newList( SYSTEM.newDecimal(62135769600000L), SYSTEM.newDecimal(253402300800000L)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be DEFAULT_RANGE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be DEFAULT_RANGE?

Yes, the constant name should be replaced. Cause DEFAULT_RANGE has been used before, in the next commit it will be replaced with DEFAULT_TIMESTAMP_IN_MILLIS_DECIMAL_RANGE.

public class Range {
private static final String KEYWORD_RANGE = "range";
IonSequence sequence;
public final IonSequence sequence;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend making this private and adding an accessor method with the appropriate visibility. Or, if it's only used by subclasses, consider protected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend making this private and adding an accessor method with the appropriate visibility. Or, if it's only used by subclasses, consider protected.

Sounds good, in the next commit the accessor method will be added.

Adds accessor method to Range.
Removes subclasses of QuantifiableConstraints which don't include addtional methods or attributes.
Updates varaible name from RANGE to DEFAULT_TIMESTAMP_IN_MILLIS_DECIMAL_RANGE.
import com.amazon.ion.system.IonTextWriterBuilder;
import com.amazon.ionschema.Schema;
import com.amazon.ionschema.Type;
import com.google.common.io.CountingOutputStream;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the only thing we're using Guava for, I'd rather just include our own CountingOutputStream in this package and remove the dependency, which is pretty bulky. As you can see from the implementation, it's trivial: https://github.com/google/guava/blob/master/guava/src/com/google/common/io/CountingOutputStream.java

Copy link
Contributor Author

@linlin-s linlin-s Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the only thing we're using Guava for, I'd rather just include our own CountingOutputStream in this package and remove the dependency, which is pretty bulky. As you can see from the implementation, it's trivial: https://github.com/google/guava/blob/master/guava/src/com/google/common/io/CountingOutputStream.java

Indeed, it is not necessary to fetch the whole dependency to use the CountingOutputStream. Our own CountingOutputStream will be included in the following commit.

while (count <= size) {
IonValue constructedData = DataConstructor.constructIonData(parsedTypeDefinition);
constructedData.writeTo(writer);
count = outputStreamCounter.getCount();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order for the count to update, the writer needs to be flushed. We may want to do this after every X values are written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order for the count to update, the writer needs to be flushed. We may want to do this after every X values are written.

Thanks for pointing this out. Yes, we need to flush the buffered stream in writer to make sure the size of written data is up to date.
And here is another thought I have:
The motivation of using counter to count OutputStream is to avoid the original while loop logic and checking the OutputStream size instead of checking the output file size.

Original logic:

 public static void writeRequestedSizeFile(int size, IonWriter writer, File file, IonStruct constraintStruct) throws Exception {
        int currentSize = 0;
        int count = 0;
        // Determine how many values should be written before the writer.flush(), and this process aims to reduce the execution time of writer.flush().
        while (currentSize <= 0.05 * size) {
            WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
            count += 1;
            writer.flush();
            currentSize = (int) file.length();
        }
        while (currentSize <= size) {
            for (int i = 0; i < count; i++) {
                WriteRandomIonValues.writeDataToFile(writer, constraintStruct);
            }
            writer.flush();
            currentSize = (int) file.length();
        }
    }

We do need to flush the writer, but flush after X written data would take the method back to the logic we were trying to avoid. Probably the further consideration/discussion is needed here to come up with a more straightforward way to monitor the size of written data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main thing that is bad about the old logic was repeatedly calling File.length(), which has to go all the way to the file system each call. CountingOutputStream is at least an improvement over that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main thing that is bad about the old logic was repeatedly calling File.length(), which has to go all the way to the file system each call. CountingOutputStream is at least an improvement over that.

Yes, for now it would be fine to keep the logic of flushing after X written data then using CoutingOutputStream as counter to monitor the size. The updates will be included in the following commit.

@@ -0,0 +1,5 @@
package com.amazon.ion.benchmark.schema.constraints;

public interface ReparsedConstraint {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a JavaDoc to explain to a reader why it exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a JavaDoc to explain to a reader why it exists.

The explanation is added in the following commit.

@linlin-s linlin-s force-pushed the refactoring branch 2 times, most recently from 0fc749e to 2a8ca42 Compare June 13, 2022 22:49
}

@Override
public void write(byte[] b, int off, int len) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to override public void write(int b) throws IOException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also need to override public void write(int b) throws IOException.

Right! My bad :( thanks for reminding. It will be resolved in the next commit.

Adds writer.flush() while counting the written data size.
Override write(int b) method in CountingOutputStream.
case KEYWORD_NAME:
case KEYWORD_TYPE:
return;
default:
Copy link

@desaikd desaikd Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also include fields that are open content and which would not be part of ISL constraints.
Adding a comment toConstraint method for this would work.
[EDIT]: As per offline discussion, there is still a recursive type definition change to be addressed along with this one before merging this.

Copy link
Contributor Author

@linlin-s linlin-s Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also include fields that are open content and which would not be part of ISL constraints. Adding a comment toConstraint method for this would work. [EDIT]: As per offline discussion, there is still a recursive type definition change to be addressed along with this one before merging this.

The comment will be added in the next commit. For the recursive type definition change, there might be more potential updates along with this change. I will throw an issue for this comment and contain this change in the future PR.

@linlin-s linlin-s merged commit f6f7fb8 into amazon-ion:master Jun 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants