-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[core] Introduce metadata.stats-dense-store to reduce meta size for multiple columns table #4322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Good,we just encountered a similar problem today! |
| + " none statistic mode is set.") | ||
| .linebreak() | ||
| .text( | ||
| "Note, When this mode is enabled, the sdk in reading engine requires at least" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1、Change the "When" to "when".
2、the Paimon sdk in reading engine requires at least version 0.9.1 or 1.0.0 or higher?
| return fieldId >= SYSTEM_FIELD_ID_START; | ||
| } | ||
|
|
||
| public static boolean isSystemField(String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What to solve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If KEY_FIELD_PREFIX is not in SYSTEM_FIELD_NAMES, then the funtion name "isSystemField" is inappropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES, what problem can be solved? Can you write a ut demo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A solution is using a SYSTEM_FIELD_PREFIXS, and always using starWith. But it is not good for performance.
Let it go now.
| */ | ||
| public class ProjectedArray implements InternalArray { | ||
|
|
||
| protected final int[] indexMapping; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indexMapping、array and ProjectedArray should be private?
|
|
||
| this.keyStatsConverter = new SimpleStatsConverter(keyType); | ||
| this.valueStatsConverter = new SimpleStatsConverter(valueType); | ||
| this.keyStatsConverter = new SimpleStatsConverter(keyType, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.keyStatsConverter = new SimpleStatsConverter(keyType);
| private final InternalArray array; | ||
| private final long notFoundValue; | ||
|
|
||
| protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why protected?
| private final InternalArray array; | ||
| private final long notFoundValue; | ||
|
|
||
| protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why protected?
| fileDeserializer.get(), fileDeserializer.get(), fileDeserializer.get()), | ||
| new IndexIncrement( | ||
| indexEntrySerializer.deserializeList(view), | ||
| indexEntrySerializer.deserializeList(view))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Condition 'version <= 2' is always 'true'
| } else if (version == 2) { | ||
| DataFileMeta09Serializer serializer = new DataFileMeta09Serializer(); | ||
| return serializer::deserialize; | ||
| } else if (version == 3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static final int DATA_FILE_META_VERSION_1 = 1;
public static final int DATA_FILE_META_VERSION_2= 2;
public static final int DATA_FILE_META_VERSION_3= 3;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
|
|
||
| public static final ConfigOption<String> METADATA_STATS_MODE = | ||
| key("metadata." + STATS_MODE_SUFFIX) | ||
| key("metadata.stats-mode") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can become more intuitive in the code.
| null); | ||
| List<DataFileMeta> dataFiles = Collections.singletonList(dataFile); | ||
|
|
||
| LinkedHashMap<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Map<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IndexFileMeta requires a LinkedHashMap
|
Please modify this doc : https://paimon.apache.org/docs/master/flink/sql-ddl/#specify-statistics-mode |
|
A question : metadata.stats-dense-store=true |
| public static final ConfigOption<Boolean> METADATA_STATS_DENSE_STORE = | ||
| key("metadata.stats-dense-store") | ||
| .booleanType() | ||
| .defaultValue(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the default value is not true?
You are worry about that many users are using old versions of Paimon sdk in their reading engine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is true
Correct! |
|
Looks good to me! +1 |
|
It should be related to this change : #5035 |
Purpose
At present, Paimon uses
BinaryRowto store statistical information. Generally, it is not a problem, but some businesses have fields with over 3000 columns.The
BinaryRowstructure has a characteristic that each field occupies a fixed 8 bytes, and for a 3000 column table, oneBinaryRowhas 25kb of storage.SimpleStatshas 3BinaryRow, then it will be 100 kb storage. 100000 files have GB level storage. This is unacceptable.This PR:
metadata.stats-dense-storeto a dense mode to storeSimpleStatsandvalueStatsColsinDataFileMeta.metadata.stats-mode=none, thenvalueStatsColswill be empty list, andSimpleStatsis empty.fields.b.stats-mode=fullto enable stats for specific columns to enable data skipping, the meta storage will only containbcolumn.Tests
AppendOnlyFileStoreTableTestPrimaryKeyFileStoreTableTestAPI and Format
Documentation