@JingsongLi
Contributor

@JingsongLi JingsongLi commented Oct 14, 2024

Purpose

At present, Paimon uses BinaryRow to store statistical information. This is usually not a problem, but some businesses have tables with over 3000 columns.

The BinaryRow structure stores each field in a fixed 8 bytes, so for a 3000-column table one BinaryRow takes about 25 KB. SimpleStats holds 3 BinaryRows, which comes to roughly 100 KB per file, and 100,000 files push the metadata to GB-level storage. This is unacceptable.
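The arithmetic above can be reproduced with a rough sketch. It counts only the fixed 8 bytes per field and assumes three row-sized structures per SimpleStats, as the description states; header bits and variable-length data are ignored, so the results are a lower bound and come out slightly below the rounded figures. The class and method names are hypothetical and not part of Paimon.

```java
// Back-of-the-envelope estimate; class and method names are hypothetical.
final class StatsSizeEstimate {

    // Assumption taken from the description: each field occupies a fixed 8 bytes.
    static final long BYTES_PER_FIELD = 8;

    // Fixed-length part of one row; ignores header bits and variable-length data.
    static long rowBytes(int columns) {
        return columns * BYTES_PER_FIELD;
    }

    // Stats for one file, assuming three row-sized structures per SimpleStats.
    static long statsBytesPerFile(int columns) {
        return 3 * rowBytes(columns);
    }

    public static void main(String[] args) {
        int columns = 3000;
        long files = 100_000;
        long perFile = statsBytesPerFile(columns);
        System.out.println("per row:  " + rowBytes(columns) + " bytes");
        System.out.println("per file: " + perFile + " bytes");
        System.out.println("total:    " + perFile * files + " bytes");
    }
}
```

For 3000 columns and 100,000 files this gives 24,000 bytes per row, 72,000 bytes per file, and about 7.2 GB in total, which matches the "GB level" claim.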

This PR:

  1. Introduces metadata.stats-dense-store, which enables a dense mode for storing SimpleStats and valueStatsCols in DataFileMeta.
  2. You can set metadata.stats-mode = none; valueStatsCols will then be an empty list and SimpleStats will be empty.
  3. You can also set fields.b.stats-mode = full to enable stats, and therefore data skipping, for specific columns; the meta storage will then contain only column b.
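The idea behind the dense mode can be sketched as follows: keep statistics only for columns whose mode is not none, and record which columns were kept. This is a minimal illustration under stated assumptions, not the PR's implementation; DenseStatsSketch, Dense, and densify are hypothetical names, and only null counts are modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of dense stats storage; not Paimon's actual classes.
final class DenseStatsSketch {

    // Kept column names plus their stats, stored in the same order.
    static final class Dense {
        final List<String> statsCols = new ArrayList<>();
        final List<Long> nullCounts = new ArrayList<>();
    }

    // Keeps stats only for columns whose effective mode is not "none".
    // fieldModes plays the role of fields.<name>.stats-mode,
    // defaultMode the role of metadata.stats-mode.
    static Dense densify(
            List<String> columns,
            long[] nullCounts,
            Map<String, String> fieldModes,
            String defaultMode) {
        Dense dense = new Dense();
        for (int i = 0; i < columns.size(); i++) {
            String col = columns.get(i);
            String mode = fieldModes.getOrDefault(col, defaultMode);
            if (!"none".equals(mode)) {
                dense.statsCols.add(col);
                dense.nullCounts.add(nullCounts[i]);
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        Dense dense = densify(
                Arrays.asList("a", "b", "c"),
                new long[] {0, 5, 2},
                Collections.singletonMap("b", "full"),
                "none");
        System.out.println(dense.statsCols + " " + dense.nullCounts);
    }
}
```

With the default mode set to none and only b overridden, the stored stats shrink from one entry per table column to a single entry for b.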

Tests

  • AppendOnlyFileStoreTableTest
  • PrimaryKeyFileStoreTableTest

API and Format

Documentation

@wwj6591812
Contributor

Good, we just encountered a similar problem today!

+ " none statistic mode is set.")
.linebreak()
.text(
"Note, When this mode is enabled, the sdk in reading engine requires at least"
Contributor

1. Change "When" to "when".
2. Does the Paimon SDK in the reading engine require at least version 0.9.1, or 1.0.0 or higher?

return fieldId >= SYSTEM_FIELD_ID_START;
}

public static boolean isSystemField(String field) {
Contributor

Why not add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES?

Contributor Author

What would that solve?

Contributor

If KEY_FIELD_PREFIX is not in SYSTEM_FIELD_NAMES, then the function name "isSystemField" is inappropriate.

Contributor Author

So what problem would adding KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES solve? Can you write a unit test demo?

Contributor Author

One solution is to use a SYSTEM_FIELD_PREFIXS list and always use startsWith, but that is not good for performance.

Let's leave it as is for now.
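The trade-off in this thread can be sketched as below. The constant values and the body of isSystemField are assumptions made for illustration, since the diff context here shows only the method signature; isSystemFieldByPrefix is a hypothetical name for the prefix-only alternative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch; constant values and method bodies are assumptions.
final class SystemFieldSketch {

    static final String KEY_FIELD_PREFIX = "_KEY_";

    // Exact names, checked with a hash lookup.
    static final Set<String> SYSTEM_FIELD_NAMES =
            new HashSet<>(Arrays.asList("_SEQUENCE_NUMBER", "_VALUE_KIND"));

    // Prefix-only alternative from the discussion: every entry is a prefix.
    static final String[] SYSTEM_FIELD_PREFIXS = {
        KEY_FIELD_PREFIX, "_SEQUENCE_NUMBER", "_VALUE_KIND"
    };

    // Assumed current shape: one prefix check plus one hash lookup.
    static boolean isSystemField(String field) {
        return field.startsWith(KEY_FIELD_PREFIX) || SYSTEM_FIELD_NAMES.contains(field);
    }

    // Alternative: a linear scan with startsWith over all prefixes.
    static boolean isSystemFieldByPrefix(String field) {
        for (String prefix : SYSTEM_FIELD_PREFIXS) {
            if (field.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSystemField("_KEY_id"));
        System.out.println(SYSTEM_FIELD_NAMES.contains("_KEY_id"));
        System.out.println(isSystemFieldByPrefix("_KEY_id"));
    }
}
```

Simply adding KEY_FIELD_PREFIX to the set would not help by itself, because a set lookup is an exact match and a field such as _KEY_id is not equal to the prefix. The prefix-only variant handles it, but at the cost of a startsWith call per entry on every check.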

*/
public class ProjectedArray implements InternalArray {

protected final int[] indexMapping;
Contributor

Should indexMapping, array and ProjectedArray be private?


this.keyStatsConverter = new SimpleStatsConverter(keyType);
this.valueStatsConverter = new SimpleStatsConverter(valueType);
this.keyStatsConverter = new SimpleStatsConverter(keyType, false);
Contributor

this.keyStatsConverter = new SimpleStatsConverter(keyType);

private final InternalArray array;
private final long notFoundValue;

protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) {
Contributor

Why protected?

fileDeserializer.get(), fileDeserializer.get(), fileDeserializer.get()),
new IndexIncrement(
indexEntrySerializer.deserializeList(view),
indexEntrySerializer.deserializeList(view)));
Contributor

Condition 'version <= 2' is always 'true'

} else if (version == 2) {
DataFileMeta09Serializer serializer = new DataFileMeta09Serializer();
return serializer::deserialize;
} else if (version == 3) {
Contributor

public static final int DATA_FILE_META_VERSION_1 = 1;
public static final int DATA_FILE_META_VERSION_2 = 2;
public static final int DATA_FILE_META_VERSION_3 = 3;

Contributor Author

No need to do this.

Contributor

OK


public static final ConfigOption<String> METADATA_STATS_MODE =
key("metadata." + STATS_MODE_SUFFIX)
key("metadata.stats-mode")
Contributor

Why do this?

Contributor Author

It makes the code more intuitive.

null);
List<DataFileMeta> dataFiles = Collections.singletonList(dataFile);

LinkedHashMap<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>();
Contributor

Map<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>();

Contributor Author

IndexFileMeta requires a LinkedHashMap
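A small sketch of why the declared type matters here: when an API takes a LinkedHashMap parameter, a variable declared as Map cannot be passed without a cast. The firstKey method below is a hypothetical stand-in for such an API, and int[] replaces the Pair type to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in; not the real IndexFileMeta API.
final class DeclaredTypeSketch {

    // An API that asks for LinkedHashMap, for example to rely on insertion order.
    static String firstKey(LinkedHashMap<String, int[]> ranges) {
        // LinkedHashMap iterates in insertion order, so this is the first key put.
        return ranges.keySet().iterator().next();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, int[]> dvRanges = new LinkedHashMap<>();
        dvRanges.put("file-2", new int[] {0, 10});
        dvRanges.put("file-1", new int[] {10, 20});
        System.out.println(firstKey(dvRanges));

        Map<String, int[]> asMap = dvRanges;
        // firstKey(asMap); // would not compile: a Map is not a LinkedHashMap
    }
}
```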

@wwj6591812
Contributor

Please update this doc: https://paimon.apache.org/docs/master/flink/sql-ddl/#specify-statistics-mode

@wwj6591812
Contributor

A question: if I have a Paimon table with 1000 columns and I want only field B to use dense stats, with stats disabled for all other columns, is my configuration correct?

metadata.stats-dense-store=true
metadata.stats-mode=none
fields.b.stats-mode=truncate(16)

public static final ConfigOption<Boolean> METADATA_STATS_DENSE_STORE =
key("metadata.stats-dense-store")
.booleanType()
.defaultValue(false)
Contributor

Why is the default value not true?
Are you worried that many users are running old versions of the Paimon SDK in their reading engines?

Contributor Author

Yes, that's right.

@JingsongLi
Contributor Author

A question: if I have a Paimon table with 1000 columns and I want only field B to use dense stats, with stats disabled for all other columns, is my configuration correct?

metadata.stats-dense-store=true
metadata.stats-mode=none
fields.b.stats-mode=truncate(16)

Correct!
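For reference, the confirmed configuration could be assembled as a plain options map, for example before handing it to whatever table-creation API is in use. The three keys and values are taken from the question; the StatsOptionsExample class and tableOptions method are hypothetical, and how the map is passed to Paimon is intentionally left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the option keys and values come from this thread.
final class StatsOptionsExample {

    static Map<String, String> tableOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        // Store stats densely, keeping only columns that have stats.
        options.put("metadata.stats-dense-store", "true");
        // Disable stats for all columns by default.
        options.put("metadata.stats-mode", "none");
        // Re-enable truncated stats for column b only.
        options.put("fields.b.stats-mode", "truncate(16)");
        return options;
    }

    public static void main(String[] args) {
        tableOptions().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```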

@wwj6591812
Contributor

Looks good to me!
I can't wait to use this feature to solve our online production problems!

+1

@LinMingQiang
Contributor

It should be related to this change: #5035
