load data from Parquet file (Issues #911) #1173

Conversation
imay left a comment
You changed many files that don't need to be changed. Please adjust your IDE's code style to comply with the current code.
    4: required i64 length;
}

enum EWhence {
Why EWhence? Change it to TSeekWhence; the T prefix means Thrift.
I have changed it.
be/src/exec/scanner_interface.h
Outdated
@@ -0,0 +1,42 @@
The license header is missing.
I have added it.
TTypeNode node;
node.__set_type(TTypeNodeType::SCALAR);
TScalarType scalar_type;
scalar_type.__set_type(TPrimitiveType::INT);
Why did you change this type?
  public void testNormal() throws AnalysisException {
      DataDescription desc = new DataDescription("testTable", null, Lists.newArrayList("abc.txt"),
-             null, null, false, null);
+             null, null, "csv", false, null);
You don't have to change this file; DataDescription has two constructors to handle this.
Ok
1: required TBrokerVersion version;
2: required TBrokerFD fd;
3: required i64 offset;
4: required TSeekWhence whence;
A newly added field should be optional.
ok
    4: required i64 length;
}

enum TSeekWhence {
And you should change the seek function's response to include the resulting offset, like fseek's result.
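For reference, a minimal sketch of those semantics (note that in stdio, fseek itself only returns a status code; the resulting offset comes from ftell, while POSIX lseek returns the new offset directly). Names here are illustrative, not the PR's API:

```cpp
#include <cstdio>

// Seek and report the resulting absolute offset: returns the new
// position on success, -1 on failure, mirroring lseek's contract.
long seek_and_tell(FILE* fp, long offset, int whence) {
    if (fseek(fp, offset, whence) != 0) {
        return -1;  // seek failed
    }
    return ftell(fp);  // absolute position after the seek
}
```

The seek RPC's response struct could then carry this offset alongside the operation status.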
struct TBrokerOpenReaderResponse {
    1: required TBrokerOperationStatus opStatus;
    2: optional TBrokerFD fd;
    3: optional i64 size; // file size
Do we need this size?
When reading a Parquet file over RPC, libparquet.a must know the file size in order to parse the file, for example when it reads the footer information.
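To make that concrete: a Parquet file ends with a Thrift-encoded footer, a 4-byte little-endian footer length, and the 4-byte magic "PAR1", so a reader needs the total size just to locate the metadata. A self-contained sketch in plain stdio (not the PR's reader classes):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reads the footer length from the 8-byte tail of a Parquet file:
// [footer][4-byte little-endian footer length][4-byte magic "PAR1"].
bool read_parquet_footer_length(FILE* fp, long file_size, uint32_t* footer_len) {
    uint8_t tail[8];
    if (fseek(fp, file_size - 8, SEEK_SET) != 0) return false;
    if (fread(tail, 1, 8, fp) != 8) return false;
    if (memcmp(tail + 4, "PAR1", 4) != 0) return false;  // not a Parquet file
    *footer_len = static_cast<uint32_t>(tail[0]) |
                  (static_cast<uint32_t>(tail[1]) << 8) |
                  (static_cast<uint32_t>(tail[2]) << 16) |
                  (static_cast<uint32_t>(tail[3]) << 24);
    return true;
}
```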
  }

- public TBrokerFD openReader(String clientId, String path, long startOffset, Map<String, String> properties) {
+ public TBrokerOpenReaderResponse openReader(String clientId, String path, long startOffset, Map<String, String> properties) {
This is a bad modification. In this class we can't return an RPC response; RPC handling belongs in the RPC layer.
I want to get the file size when the file is opened successfully.
// lineDelimiter
Text.writeString(out, lineDelimiter);
// format type
Text.writeString(out, fileFormat);
You can't write this value without changing META_VERSION; doing so makes the frontend unable to load an old image. For now, we don't persist this value until we have accumulated enough meta changes.
ok
columnSeparator = Text.readString(in);
lineDelimiter = Text.readString(in);

fileFormat = Text.readString(in);
The format should not be persisted here.
OK
be/test/exec/CMakeLists.txt
Outdated
ADD_BE_TEST(plain_text_line_reader_bzip_test)
ADD_BE_TEST(plain_text_line_reader_lz4frame_test)
ADD_BE_TEST(plain_text_line_reader_lzop_test)
#ADD_BE_TEST(broker_parquet_reader_test)
Why isn't this UT compiled?
The UT depends on a real local environment, for example:
std::string path = "hdfs://A.B.C.D:8020/user/hive/warehouse/test.db/parquet_test/000000_0";
be/src/exec/broker_reader.cpp
Outdated
*eof = false;
*bytes_read = response.data.size();
memcpy(out, response.data.data(), *bytes_read);
_cur_offset += *bytes_read;
And you don't set eof to false?
Yes, it isn't needed, because *bytes_read already indicates whether anything was read.
Even if there are bytes read, you must still set eof to false.
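A minimal sketch of that pattern, assuming the surrounding class's _cur_offset member and Status type (rpc_read is a hypothetical stand-in for the broker RPC):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Every exit path assigns both out-parameters, so a caller looping
// until *eof never observes a stale or uninitialized flag.
Status read(int64_t nbytes, uint8_t* out, int64_t* bytes_read, bool* eof) {
    std::string data = rpc_read(_cur_offset, nbytes);  // hypothetical RPC call
    *bytes_read = static_cast<int64_t>(data.size());
    *eof = (*bytes_read == 0);  // set explicitly, even when data came back
    memcpy(out, data.data(), *bytes_read);
    _cur_offset += *bytes_read;
    return Status::OK;
}
```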
be/src/exec/broker_scanner.h
Outdated
  // Get next tuple
- Status get_next(Tuple* tuple, MemPool* tuple_pool, bool* eof);
+ virtual Status get_next(Tuple* tuple, MemPool* tuple_pool, bool* eof);
Suggested change:
- virtual Status get_next(Tuple* tuple, MemPool* tuple_pool, bool* eof);
+ Status get_next(Tuple* tuple, MemPool* tuple_pool, bool* eof) override;
OK, I have changed it.
be/src/exec/broker_scanner.h
Outdated
  // Close this scanner
- void close();
+ virtual void close();
Suggested change:
- virtual void close();
+ void close() override;
OK, I have changed it.
be/src/exec/broker_scanner.h
Outdated
- // Open this scanner, will initialize informtion need to
- Status open();
+ // Open this scanner, will initialize information need to
+ virtual Status open();
Suggested change:
- virtual Status open();
+ Status open() override;
OK, I have changed it.
be/src/exec/ifile_reader.h
Outdated
class Tuple;
class SlotDescriptor;
class MemPool;
class IFileReader {
What is this class supposed to do?
It is used to parse Parquet files. We write the Parquet file's data into Tuple objects, so I defined an interface class.
Why do we need an interface? Isn't a single class enough?
be/src/exec/broker_reader.cpp
Outdated
*eof = false;
*bytes_read = response.data.size();
memcpy(out, response.data.data(), *bytes_read);
_cur_offset += *bytes_read;
You should set _cur_offset to position at the beginning of this function, or something will go wrong.
I have changed it.
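A sketch of the requested fix, with illustrative names (rpc_pread stands in for the broker's positioned-read RPC): synchronize _cur_offset with the caller's position up front, so an earlier seek or short read cannot leave the bookkeeping stale:

```cpp
Status readat(int64_t position, int64_t nbytes, int64_t* bytes_read, void* out) {
    _cur_offset = position;  // sync the offset before issuing the read
    std::string data = rpc_pread(position, nbytes);  // hypothetical RPC call
    *bytes_read = static_cast<int64_t>(data.size());
    memcpy(out, data.data(), *bytes_read);
    _cur_offset += *bytes_read;
    return Status::OK;
}
```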
be/src/exec/broker_scan_node.cpp
Outdated
ScannerInterface *scan = nullptr;
switch (scan_range.ranges[0].format_type) {
case TFileFormatType::FORMAT_PARQUET:
    LOG(INFO) << "Parquet Scanner Create";
Remove this useless log.
be/src/exec/broker_scan_node.cpp
Outdated
    counter);
    break;
default:
    LOG(INFO) << "Broker Scanner Create";
Remove this log.
It has been deleted.
be/src/exec/local_file_reader.cpp
Outdated
    return Status(ss.str());
}
// get file size
fseek(_fp,0L,SEEK_END);
Suggested change:
- fseek(_fp,0L,SEEK_END);
+ fseek(_fp, 0L, SEEK_END);
be/src/exec/local_file_reader.cpp
Outdated
    << ", error=" << strerror_r(errno, err_buf, 64);
    return Status(ss.str());
}
// get file size
You'd better not get the size in the open function; you can get it through a size function.
I have changed it.
be/src/exec/broker_reader.cpp
Outdated
    return Status(ss.str());
}

_file_size = response.size;
For compatibility with old brokers, you should check whether the size field is set in the response:
if (response.__isset.size) {
    _file_size = response.size;
}
be/src/exec/local_file_reader.cpp
Outdated
Status LocalFileReader::readat(int64_t position, int64_t nbytes, int64_t* bytes_read, void* out) {
    if (position != _current_offset) {
        fseek(_fp, position, SEEK_SET);
You should check fseek's result.
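A minimal sketch of that check, assuming the surrounding LocalFileReader members (_fp, _current_offset) and the error-reporting style used elsewhere in this file:

```cpp
if (position != _current_offset) {
    if (fseek(_fp, position, SEEK_SET) != 0) {  // fseek returns 0 on success
        char err_buf[64];
        std::stringstream ss;
        ss << "fseek to " << position << " failed"
           << ", error=" << strerror_r(errno, err_buf, 64);
        return Status(ss.str());
    }
    _current_offset = position;
}
```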
be/src/exec/parquet_reader.cpp
Outdated
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
#include <sstream>
You should include "exec/parquet_reader.h" first.
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Include "exec/parquet_scanner.h" first.
  DataDescription dataDescription = new DataDescription(
-         tbl, null, files, columns, columnSeparator, false, null);
+         tbl, null, files, columns, columnSeparator, format, false, null);
There is no need to change this file, because there is a DataDescription constructor without the format parameter.
columnSeparator = Text.readString(in);
lineDelimiter = Text.readString(in);

fileFormat = Text.readString(in);
If you change this, you should add a META_VERSION check to support old meta images.
ok
Text.writeString(out, columnSeparator);
Text.writeString(out, lineDelimiter);

Text.writeString(out, fileFormat);
If fileFormat is null, this will cause an FE crash.
ok
be/src/exec/parquet_reader.cpp
Outdated
column_index = column_ids[i];  // column index with Parquet field

switch (fieldSchema->field(column_index)->type()->id()) {
case arrow::Type::type::BINARY:
Suggested change:
- case arrow::Type::type::BINARY:
+ case arrow::Type::type::BINARY: {
ok
be/src/exec/parquet_reader.cpp
Outdated
{
    for (auto slot_desc : tuple_slot_descs) {
        // Get the Column Reader for the boolean column
        auto iter = _mapColumn.find(slot_desc->col_name());
Suggested change:
- auto iter = _mapColumn.find(slot_desc->col_name());
+ auto iter = _map_column.find(slot_desc->col_name());
ok
be/src/exec/parquet_reader.h
Outdated
std::unique_ptr<parquet::arrow::FileReader> reader;
std::shared_ptr<parquet::FileMetaData> _file_metadata;
std::map<std::string, int> _mapColumn;  // column-name <---> column-index
std::vector<int> column_ids;
Suggested change:
- std::vector<int> column_ids;
+ std::vector<int> _column_ids;
ok
be/src/exec/parquet_reader.h
Outdated
std::shared_ptr<arrow::RecordBatch> batch;
std::unique_ptr<parquet::arrow::FileReader> reader;
std::shared_ptr<parquet::FileMetaData> _file_metadata;
std::map<std::string, int> _mapColumn;  // column-name <---> column-index
Suggested change:
- std::map<std::string, int> _mapColumn;  // column-name <---> column-index
+ std::map<std::string, int> _map_column;  // column-name <---> column-index
ok
be/src/exec/parquet_reader.h
Outdated
std::shared_ptr<ParquetFile> _parquet;

// parquet file reader object
std::shared_ptr<::arrow::RecordBatchReader> rb_batch;
Suggested change:
- std::shared_ptr<::arrow::RecordBatchReader> rb_batch;
+ std::shared_ptr<::arrow::RecordBatchReader> _rb_batch;
ok
be/src/exec/parquet_reader.h
Outdated
// parquet file reader object
std::shared_ptr<::arrow::RecordBatchReader> rb_batch;
std::shared_ptr<arrow::RecordBatch> batch;
std::unique_ptr<parquet::arrow::FileReader> reader;
Suggested change:
- std::unique_ptr<parquet::arrow::FileReader> reader;
+ std::unique_ptr<parquet::arrow::FileReader> _reader;
ok
be/src/exec/parquet_reader.h
Outdated
// parquet file reader object
std::shared_ptr<::arrow::RecordBatchReader> rb_batch;
std::shared_ptr<arrow::RecordBatch> batch;
Suggested change:
- std::shared_ptr<arrow::RecordBatch> batch;
+ std::shared_ptr<arrow::RecordBatch> _batch;
ok
be/src/exec/parquet_reader.cpp
Outdated
        throw parquet::ParquetException(status.ToString());
    }
}
else if (_current_line_of_group >= _rows_of_group) {  // read next row group
Put the else on the same line as the closing brace: } else if (...) {
ok
}

Status seek(int64_t position) override {
    return Status("Not implementation");
Suggested change:
- return Status("Not implementation");
+ return Status("Not implemented");
}

Status tell(int64_t* position) override {
    return Status("Not implementation");
Suggested change:
- return Status("Not implementation");
+ return Status("Not implemented");
}

Status readat(int64_t position, int64_t nbytes, int64_t* bytes_read, void* out) {
    return Status("Not implementation");
Suggested change:
- return Status("Not implementation");
+ return Status("Not implemented");
be/src/exec/parquet_reader.cpp
Outdated
}
#endif

void ParquetReaderWrap::inital_parquet_reader() {
Suggested change:
- void ParquetReaderWrap::inital_parquet_reader() {
+ void ParquetReaderWrap::init_parquet_reader() {
be/src/exec/parquet_reader.cpp
Outdated
#endif

void ParquetReaderWrap::inital_parquet_reader() {
    //new file reader for parquet file
Suggested change:
- //new file reader for parquet file
+ // new file reader for parquet file
be/src/exec/parquet_reader.cpp
Outdated
    return Status::OK;
}

void ParquetReaderWrap::fill_solt(Tuple* tuple, SlotDescriptor* slot_desc, MemPool* mem_pool, const uint8_t* value, int32_t len) {
Suggested change:
- void ParquetReaderWrap::fill_solt(Tuple* tuple, SlotDescriptor* slot_desc, MemPool* mem_pool, const uint8_t* value, int32_t len) {
+ void ParquetReaderWrap::fill_slot(Tuple* tuple, SlotDescriptor* slot_desc, MemPool* mem_pool, const uint8_t* value, int32_t len) {
be/src/exec/parquet_reader.cpp
Outdated
auto iter = _map_column.find(slot_desc->col_name());
if (iter == _map_column.end()) {
    std::stringstream str_error;
    str_error<<"Invalid Column Name:" << slot_desc->col_name();
Suggested change:
- str_error<<"Invalid Column Name:" << slot_desc->col_name();
+ str_error << "Invalid Column Name:" << slot_desc->col_name();
be/src/exec/parquet_reader.cpp
Outdated
    return Status::OK;
}

void ParquetReaderWrap::set_filed_null(Tuple* tuple, const SlotDescriptor* slot_desc) {
Suggested change:
- void ParquetReaderWrap::set_filed_null(Tuple* tuple, const SlotDescriptor* slot_desc) {
+ void ParquetReaderWrap::set_field_null(Tuple* tuple, const SlotDescriptor* slot_desc) {
be/src/exec/parquet_reader.cpp
Outdated
auto iter = _map_column.find(slot_desc->col_name());
if (iter == _map_column.end()) {
    std::stringstream str_error;
    str_error<<"Invalid Column Name:" << slot_desc->col_name();
Suggested change:
- str_error<<"Invalid Column Name:" << slot_desc->col_name();
+ str_error << "Invalid Column Name:" << slot_desc->col_name();
be/src/exec/parquet_reader.cpp
Outdated
    return status;
}

std::shared_ptr<arrow::Schema> fieldSchema = _batch->schema();
Suggested change:
- std::shared_ptr<arrow::Schema> fieldSchema = _batch->schema();
+ std::shared_ptr<arrow::Schema> field_schema = _batch->schema();
be/src/exec/parquet_reader.cpp
Outdated
arrow::Status status = _rb_batch->ReadNext(&_batch);
if (!status.ok()) {
    LOG(WARNING) << status.ToString();
    throw parquet::ParquetException(status.ToString());
Change the thrown exception to a returned Status; we don't use exceptions in our C++ files.
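A sketch of the requested pattern, reusing the names from the hunk above:

```cpp
arrow::Status status = _rb_batch->ReadNext(&_batch);
if (!status.ok()) {
    LOG(WARNING) << "Failed to read next batch: " << status.ToString();
    return Status(status.ToString());  // propagate the error instead of throwing
}
```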
be/src/exec/parquet_reader.cpp
Outdated
void ParquetReaderWrap::set_field_null(Tuple* tuple, const SlotDescriptor* slot_desc) {
    if (!slot_desc->is_nullable()) {
        throw parquet::ParquetException("Null is not allowed, but Parquet field is NULL.");
Don't use exceptions.
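One possible shape for the fix (a sketch; it assumes the BE's Tuple::set_null and SlotDescriptor::null_indicator_offset helpers): change the return type so the caller can propagate the error:

```cpp
Status ParquetReaderWrap::set_field_null(Tuple* tuple, const SlotDescriptor* slot_desc) {
    if (!slot_desc->is_nullable()) {
        return Status("Null is not allowed, but Parquet field is NULL.");
    }
    tuple->set_null(slot_desc->null_indicator_offset());
    return Status::OK;
}
```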
    _strict_mode(false) {
}

Status BaseScanner::init_expr_ctxes() {
This is the base class's open; child classes should call BaseScanner::open() from their own open().
Suggested change:
- Status BaseScanner::init_expr_ctxes() {
+ Status BaseScanner::open() {
be/src/exec/base_scanner.cpp
Outdated
    return Status::OK();
}

bool BaseScanner::fill_dest_tuple(Tuple* dest_tuple, MemPool* mem_pool) {
You should output the line information here so the user knows which line has bad data.
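A sketch of what that could look like inside fill_dest_tuple (hypothetical names: value, slot_desc, and _line_number stand in for whatever the scanner actually tracks):

```cpp
// When a non-nullable slot has no value, report which source line is
// bad instead of failing silently.
if (value == nullptr && !slot_desc->is_nullable()) {
    std::stringstream error_msg;
    error_msg << "column(" << slot_desc->col_name()
              << ") is not nullable but got NULL at line " << _line_number;
    LOG(WARNING) << error_msg.str();  // or append to the load job's error log
    return false;
}
```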
morningman left a comment
LGTM
…electdb-cloud-dev (20221130 23a144c) (apache#1199) * [feature](selectdb-cloud) Fix file cache metrics nullptr error (apache#1060) * [feature](selectdb-cloud) Fix abort copy when -235 (apache#1039) * [feature](selectdb-cloud) Replace libfdb_c.so to make it compatible with different OS (apache#925) * [feature](selectdb-cloud) Optimize RPC retry in cloud_meta_mgr (apache#1027) * Optimize RETRY_RPC in cloud_meta_mgr * Add random sleep for RETRY_RPC * Add a simple backoff strategy for rpc retry * [feature](selectdb-cloud) Copy into support select by column name (apache#1055) * Copy into support select by column name * Fix broker load core dump due to mis-match of number of columns between remote and schema * [feature](selectdb-cloud) Fix test_dup_mv_schema_change case (apache#1022) * [feature](selectdb-cloud) Make the broker execute on the specified cluster (apache#1043) * Make the broker execute on the specified cluster * Pass the cluster parameter * [feature](selectdb-cloud) Support concurrent BaseCompaction and CumuCompaction on a tablet (apache#1059) * [feature](selectdb-cloud) Reduce meta-service log (apache#1067) * Quote string in the tagged log * Add template to enable customized log for RPC requests * [feature](selectdb-cloud) Use read-only txn + read-write txn for `commit_txn` (apache#1065) * [feature](selectdb-cloud) Pick "[fix](load) fix that load channel failed to be released in time (apache#14119)" commit 3690c4d Author: Xin Liao <liaoxinbit@126.com> Date: Wed Nov 9 22:38:08 2022 +0800 [fix](load) fix that load channel failed to be released in time (apache#14119) * [feature](selectdb-cloud) Add compaction profile log (apache#1072) * [feature](selectdb-cloud) Fix abort txn fail when copy job `getAllFileStatus` exception (apache#1066) * Revert "[feature](selectdb-cloud) Copy into support select by column name (apache#1055)" This reverts commit f1a543e. * [feature](selectdb-cloud) Pick"[fix](metric) fix the bug of not updating the query latency metric apache#14172 (apache#1076)" * [feature](selectdb-cloud) Distinguish KV_TXN_COMMIT_ERR or KV_TXN_CONFLICT while commit failed (apache#1082) * [feature](selectdb-cloud) Support configuring base compaction concurrency (apache#1080) * [feature](selectdb-cloud) Enhance start.sh/stop.sh for selectdb_cloud (apache#1079) * [feature](selectdb-cloud) Add smoke testing (apache#1056) Add smoke test, 1. upload,query http data api. 2. internal, external stage. 3. select,insert * [feature](selectdb-cloud) Disable admin stmt in cloud mode (apache#1064) Disable the following stmt. 
* AdminRebalanceDiskStmt/AdminCancelRebalanceDiskStmt * AdminRepairTableStmt/AdminCancelRepairTableStmt * AdminCheckTabletsStmt * AdminCleanTrashStmt * AdminCompactTableStmt * AdminCopyTabletStmt * AdminDiagnoseTabletStmt * AdminSetConfigStmt * AdminSetReplicaStatusStmt * AdminShowConfigStmt * AdminShowReplicaDistributionStmt * AdminShowReplicaStatusStmt * AdminShowTabletStorageFormatStmt Leaving a backdoor for the user root: * AdminSetConfigStmt * AdminShowConfigStmt * AdminShowReplicaDistributionStmt * AdminShowReplicaStatusStmt * AdminDiagnoseTabletStmt * [feature](selectdb-cloud) Update copy into doc (apache#1063) * [feature](selectdb-cloud) Fix AdminSetConfigStmt cannot work with root (apache#1085) * [feature](selectdb-cloud) Fix userid null lead to checkpoint error (apache#1083) * [feature](selectdb-cloud) Support controling the space used for upload (apache#1091) * [feature](selectdb-cloud) Pick "[fix](sequence) fix that update table core dump with sequence column (apache#13847)" (apache#1092) * [Fix](memory-leak) Fix boost::stacktrace memory leak (1097) * [Fix](selectdb-cloud) Several picks to fix memtracker (apache#1087) * [enhancement](memtracker) Add independent and unique scanner mem tracker for each query (apache#13262) * [enhancement](memory) Print memory usage log when memory allocation fails (apache#13301) * [enhancement](memtracker) Print query memory usage log every second when `memory_verbose_track` is enabled (apache#13302) * [fix](memory) Fix USE_JEMALLOC=true UBSAN compilation error apache#13398 * [enhancement](memtracker) Fix bthread local consume mem tracker (apache#13368) Previously, bthread_getspecific was called every time bthread local was used. In the test at apache#10823, it was found that frequent calls to bthread_getspecific had performance problems. So a cache is implemented on pthread local based on the btls key, but the btls key cannot correctly sense bthread switching. So, based on bthread_self to get the bthread id to implement the cache. * [enhancement](memtracker) Fix brpc causing query mem tracker to be inaccurate apache#13401 * [fix](memtracker) Fix transmit_tracker null pointer because phamp is not thread safe apache#13528 * [enhancement](memtracker) Fix Brpc mem count and refactored thread context macro (apache#13469) * [fix](memtracker) Fix the usage of bthread mem tracker (apache#13708) bthead context init has performance loss, temporarily delete it first, it will be completely refactored in apache#13585. 
* [enhancement](memtracker) Refactor load channel + memtable mem tracker (apache#13795) * [fix](load) Fix load channel mgr lock (apache#13960) hot fix load channel mgr lock * [fix](memtracker) Fix DCHECK !std::count(_consumer_tracker_stack.begin(), _consumer_tracker_stack.end(), tracker) * [tempfix][memtracker] wait pick 0b945fe Co-authored-by: Xinyi Zou <zouxinyi02@gmail.com> * [feature](selectdb-cloud) Add more recycler case (apache#1094) * [feature](selectdb-cloud) Pick "[improvement](load) some simple optimization for reduce load memory policy (apache#14215)" (apache#1096) * [feature](selectdb-cloud) Reduce unnecessary get rowset rpc when prepare compaction (apache#1099) * [feature](selectdb-cloud) Pick "[improvement](load) reduce memory in batch for small load channels (apache#14214)" (apache#1100) * [feature](selectdb-cloud) Pick "[improvement](load) release load channel actively when error occurs (apache#14218)" (apache#1102) * [feature](selectdb-cloud) Print build info of ms/recycler to stdout when launch (apache#1105) * [feature](selectdb-cloud) copy into support select by column name and load with partial columns (apache#1104) e.g. ``` COPY INTO test_table FROM (SELECT col1, col2, col3 FROM @ext_stage('1.parquet')) COPY INTO test_table (id, name) FROM (SELECT col1, col2 FROM @ext_stage('1.parquet')) ``` * [fix](selectdb-cloud) Pick "[Fix](array-type) bugfix for array column with delete condition (apache#13361)" (apache#1109) Fix for SQL with array column: delete from tbl where c_array is null; more info please refer to apache#13360 Co-authored-by: camby <104178625@qq.com> Co-authored-by: cambyzju <zhuxiaoli01@baidu.com> * [feature](selectdb-cloud) Copy into support force (apache#1081) * [feature](selectdb-cloud) Add abort txn, abort tablet job http api (apache#1101) Abort load txn by txn_id: ``` curl "{meta_sevice_ip}:{brpc_port}/MetaService/http/abort_txn?token=greedisgood9999" -d '{ "cloud_unique_id": string, "txn_id": int64 }' ``` Abort load txn by db_id and label: ``` curl "{meta_sevice_ip}:{brpc_port}/MetaService/http/abort_txn?token=greedisgood9999" -d '{ "cloud_unique_id": string, "db_id": int64, "label": string }' ``` Only support abort compaction job currently: ``` curl "{meta_sevice_ip}:{brpc_port}/MetaService/http/abort_tablet_job?token=greedisgood9999" -d '{ "cloud_unique_id": string, "job" : { "idx": {"tablet_id": int64}, "compaction": [{"id": string}] } }' ``` * [feature](selectdb-cloud) Fix external stage data for smoke test and retry to create stage (apache#1119) * [feature](selectdb-cloud) Fix data leaks when truncating table (apache#1114) * Drop cloud partition when truncating table * Add retry strategy for dropCloudMaterializedIndex * [feature](selectdb-cloud) Fix missing library when compiling unit test (apache#1128) * [feature](selectdb-cloud) Validate the object storage when create stage (apache#1115) * [feature](selectdb-cloud) Fix incorrectly setting cumulative point when committing base compaction (apache#1127) * [feature](selectdb-cloud) Fix missing lease when preparing cumulative compaction (apache#1131) * [feature](selectdb-cloud) Fix unbalanced tablet distribution (apache#1121) * Fix the bug of unbalanced tablet distribution * Use replica index hash to BE * [feature](selectdb-cloud) Fix core dump when get tablets info by BE web page (apache#1113) * [feature](selectdb-cloud) Fix start_fe.sh --version (apache#1106) * [feature](selectdb-cloud) Print tablet stats before and after compaction (apache#1132) * Log num rowsets before and after compaction * 
Print tablet stats after committing compaction * [feature](selectdb-cloud) Allow root user execute AlterSystemStmt (apache#1143) * [feature](selectdb-cloud) Fix BE UT (apache#1141) * [feature](selectdb-cloud) Select BE for the first bucket of every partition randomly (apache#1136) * [feature](selectdb-cloud) Fix query_limit int -> int64 (apache#1154) * [feature](selectdb-cloud) Add more cloud recycler case (apache#1116) * add more cloud recycler case * modify cloud recycler case dateset from sf0.1 to sf1 * [feature](selectdb-cloud) Fix misuse of aws transfer which may delete tmp file prematurely (apache#1160) * [feature](selectdb-cloud) Add test for copy into http data api and userId (apache#1044) * Add test for copy into http data api and userId * Add external and internal stage cross use regression case. * [feature](selectdb-cloud) Pass the cloud compaction regression test (apache#1173) * [feature](selectdb-cloud) Modify max_bytes_per_broker_scanner default value to 150G (apache#1184) * [feature](selectdb-cloud) Fix missing lock when calling Tablet::delete_predicates (apache#1182) * [improvement](config)change default remote_fragment_exec_timeout_ms to 30 seconds * [improvement](config) change default value of broker_load_default_timeout_second to 12 hours * [feature](selectdb-cloud) Fix replay copy into (apache#1167) * Add stage ddl regression * fix replay copy into * remove unused log * fix user name * [feature](selectdb-cloud) Fix FE --version option not work after fe started (apache#1161) * [feature](selectdb-cloud) BE accesses object store using HTTP (apache#1111) * [feature](selectdb-cloud) Refactor recycle copy jobs (apache#1062) * [fix](FE) Pick fix from doris master (apache#1177) (apache#1178) Commit: 53e5f39 Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Mon Oct 31 2022 10:19:32 GMT+0800 (China Standard Time) fix result exprs should be substituted in the same way as agg exprs (apache#13744) Commit: a4a9912 Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Thu Nov 03 2022 10:26:59 GMT+0800 (China Standard Time) fix group by constant value bug (apache#13827) Commit: 84b969a Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Thu Nov 10 2022 11:10:42 GMT+0800 (China Standard Time) fix the grouping expr should check col name from base table first, then alias (apache#14077) Commit: ae4f4b9 Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Thu Nov 24 2022 10:31:58 GMT+0800 (China Standard Time) fix having clause should use column name first then alias (apache#14408) * [feature](selectdb-cloud) Deal with getNextTransactionId rpc exception (apache#1181) Before fixing, getNextTransactionId will return -1 if there is RPC exception, it will cause schema change and the previous load task execute in parallel unexpectedly. 
* [feature](selectdb-cloud) Throw exception for unsupported operations in CloudGlobalTransactionMgr (apache#1180) * [improvement](load) Add more log on RPC error (apache#1183) * [feature](selectdb-cloud) Add copy_into case(json, parquet, orc) and tpch_sf1 to smoke test (apache#1140) * [feature](selectdb-cloud) Recycle dropped stage (apache#1071) * log s3 response code * add log in S3Accessor::delete_objects_by_prefix * Fix show copy * remove empty line * [feature](selectdb-cloud) Support bthread for new scanner (apache#1117) * Support bthread for new scanner * Keep the number of remote threads same as local threads * [feature](selectdb-cloud) Implement self-explained cloud unique id for instance id searching (apache#1089) 1. Implement self-explained cloud unique id for instance id searching 2. Fix register core when metaservice start error 3. Fix drop_instance not set mtime 4. Add HTTP API to get instance info ``` curl "127.0.0.1:5008/MetaService/http/get_instance?token=greedisgood9999&cloud_unique_id=regression-cloud-unique-id-fe-1" curl "127.0.0.1:5008/MetaService/http/get_instance?token=greedisgood9999&cloud_unique_id=1:regression_instance0:regression-cloud-unique-id-fe-1" curl "127.0.0.1:5008/MetaService/http/get_instance?token=greedisgood9999&instance_id=regression_instance0" ``` * [improvement](memory) simplify memory config related to tcmalloc and add gc (apache#1191) * [improvement](memory) simplify memory config related to tcmalloc There are several configs related to tcmalloc, users do know how to config them. Actually users just want two modes, performance or compact, in performance mode, users want doris run query and load quickly while in compact mode, users want doris run with less memory usage. If we want to config tcmalloc individually, we can use env variables which are supported by tcmalloc. * [improvement](tcmalloc) add moderate mode and avoid oom with a lot of cache (apache#14374) ReleaseToSystem aggressively when there are little free memory. 
* [feature](selectdb-cloud) Pick "[fix](hashjoin) fix coredump of hash join in ubsan build apache#13479" (apache#1190) commit b5cd167 Author: TengJianPing <18241664+jacktengg@users.noreply.github.com> Date: Thu Oct 20 10:16:19 2022 +0800 [fix](hashjoin) fix coredump of hash join in ubsan build (apache#13479) * [feature](selectdb-cloud) Support close FileWriter without forcing sync data to storage medium (apache#1134) * Trace accumulated time * Support close FileWriter without forcing sync data to storage medium * Avoid trace overhead when disable trace * [feature](selectdb-cloud) Pick "[BugFix](function) fix reverse function dynamic buffer overflow due to illegal character apache#13671" (apache#1146) * pick [opt](exec) Replace get_utf8_byte_length function by array (apache#13664) * pick [BugFix](function) fix reverse function dynamic buffer overflow due to illegal character apache#13671 Co-authored-by: HappenLee <happenlee@hotmail.com> * [feature](selectdb-cloud) Pick "[fix](fe) Inconsistent behavior for string comparison in FE and BE (apache#13604)" (apache#1150) Co-authored-by: xueweizhang <zxw520blue1@163.com> * [feature](selectdb-cloud) Copy into support delete_on condition (apache#1148) * [feature](selectdb-cloud) Pick "[fix](agg)fix group by constant value bug (apache#13827)" (apache#1152) * [fix](agg)fix group by constant value bug * keep only one const grouping exprs if no agg exprs Co-authored-by: starocean999 <40539150+starocean999@users.noreply.github.com> * [feature](selectdb-cloud) Pick "[fix](join)the build and probe expr should be calculated before converting input block to nullable (apache#13436)" (apache#1155) * [fix](join)the build and probe expr should be calculated before converting input block to nullable * remove_nullable can be called on const column Co-authored-by: starocean999 <40539150+starocean999@users.noreply.github.com> * [feature](selectdb-cloud) Pick "[Bug](predicate) fix core dump on bool type runtime filter (apache#13417)" (apache#1156) fix core dump on bool type runtime filter Co-authored-by: Pxl <pxl290@qq.com> * [feature](selectdb-cloud) Pick "[Fix](agg) fix bitmap agg core dump when phmap pointer assert alignment (apache#13381)" (apache#1157) Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com> * [feature](selectdb-cloud) Pick "[Bug](function) fix core dump on case when have 1000 condition apache#13315" (apache#1158) Co-authored-by: Pxl <pxl290@qq.com> * [feature](selectdb-cloud) Pick "[fix](sort)the sort expr nullable info is wrong in some case (apache#12003)" * [feature](selectdb-cloud) Pick "[Improvement](decimal) print decimal according to the real precision and scale (apache#13437)" * [feature](selectdb-cloud) Pick "[bugfix](VecDateTimeValue) eat the value of microsecond in function from_date_format_str (apache#13446)" * [bugfix](VecDateTimeValue) eat the value of microsecond in function from_date_format_str * add sql based regression test Co-authored-by: xiaojunjie <xiaojunjie@baidu.com> Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com> Co-authored-by: meiyi <myimeiyi@gmail.com> Co-authored-by: Xiaocc <598887962@qq.com> Co-authored-by: Lei Zhang <27994433+SWJTU-ZhangLei@users.noreply.github.com> Co-authored-by: Xin Liao <liaoxinbit@126.com> Co-authored-by: Luwei <814383175@qq.com> Co-authored-by: plat1ko <platonekosama@gmail.com> Co-authored-by: deardeng <565620795@qq.com> Co-authored-by: Kidd <107781942+k-i-d-d@users.noreply.github.com> Co-authored-by: Xinyi Zou <zouxinyi02@gmail.com> 
Co-authored-by: zhannngchen <48427519+zhannngchen@users.noreply.github.com> Co-authored-by: camby <104178625@qq.com> Co-authored-by: cambyzju <zhuxiaoli01@baidu.com> Co-authored-by: Yongqiang YANG <98214048+dataroaring@users.noreply.github.com> Co-authored-by: starocean999 <40539150+starocean999@users.noreply.github.com> Co-authored-by: Gabriel <gabrielleebuaa@gmail.com> Co-authored-by: AlexYue <yj976240184@qq.com> Co-authored-by: xueweizhang <zxw520blue1@163.com> Co-authored-by: Pxl <pxl290@qq.com> Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com> Co-authored-by: xiaojunjie <971308896@qq.com> Co-authored-by: xiaojunjie <xiaojunjie@baidu.com>
…-dev (561fddc 20221228) (apache#1304) ``` 20211227 20221228 db04150a8d cd65d15ede v v selectdb-cloud-release-2.0 --o---.-----o------o-----o--o------------------ . \ . \ selectdb-cloud-release-2.1 --o---o \ \ \ \___________ \ \ \ selectdb-cloud-merge-2.0-2.1(tmp) o----o---o / \ selectdb-cloud-dev ----o-----o--------o-----o--o---o--------------- ^ 561fddc 20221228 ``` * [feature](selectdb-cloud) Fix file cache metrics nullptr error (apache#1060) * [feature](selectdb-cloud) Fix abort copy when -235 (apache#1039) * [feature](selectdb-cloud) Replace libfdb_c.so to make it compatible with different OS (apache#925) * [feature](selectdb-cloud) Optimize RPC retry in cloud_meta_mgr (apache#1027) * Optimize RETRY_RPC in cloud_meta_mgr * Add random sleep for RETRY_RPC * Add a simple backoff strategy for rpc retry * [feature](selectdb-cloud) Copy into support select by column name (apache#1055) * Copy into support select by column name * Fix broker load core dump due to mis-match of number of columns between remote and schema * [feature](selectdb-cloud) Fix test_dup_mv_schema_change case (apache#1022) * [feature](selectdb-cloud) Make the broker execute on the specified cluster (apache#1043) * Make the broker execute on the specified cluster * Pass the cluster parameter * [feature](selectdb-cloud) Support concurrent BaseCompaction and CumuCompaction on a tablet (apache#1059) * [feature](selectdb-cloud) Reduce meta-service log (apache#1067) * Quote string in the tagged log * Add template to enable customized log for RPC requests * [feature](selectdb-cloud) Use read-only txn + read-write txn for `commit_txn` (apache#1065) * [feature](selectdb-cloud) Pick "[fix](load) fix that load channel failed to be released in time (apache#14119)" commit 3690c4d Author: Xin Liao <liaoxinbit@126.com> Date: Wed Nov 9 22:38:08 2022 +0800 [fix](load) fix that load channel failed to be released in time (apache#14119) * [feature](selectdb-cloud) Add compaction profile log (apache#1072) * [feature](selectdb-cloud) Fix abort txn fail when copy job `getAllFileStatus` exception (apache#1066) * Revert "[feature](selectdb-cloud) Copy into support select by column name (apache#1055)" This reverts commit f1a543e. * [feature](selectdb-cloud) Pick"[fix](metric) fix the bug of not updating the query latency metric apache#14172 (apache#1076)" * [feature](selectdb-cloud) Distinguish KV_TXN_COMMIT_ERR or KV_TXN_CONFLICT while commit failed (apache#1082) * [feature](selectdb-cloud) Support configuring base compaction concurrency (apache#1080) * [feature](selectdb-cloud) Enhance start.sh/stop.sh for selectdb_cloud (apache#1079) * [feature](selectdb-cloud) Add smoke testing (apache#1056) Add smoke test, 1. upload,query http data api. 2. internal, external stage. 3. select,insert * [feature](selectdb-cloud) Disable admin stmt in cloud mode (apache#1064) Disable the following stmt. 
* AdminRebalanceDiskStmt/AdminCancelRebalanceDiskStmt * AdminRepairTableStmt/AdminCancelRepairTableStmt * AdminCheckTabletsStmt * AdminCleanTrashStmt * AdminCompactTableStmt * AdminCopyTabletStmt * AdminDiagnoseTabletStmt * AdminSetConfigStmt * AdminSetReplicaStatusStmt * AdminShowConfigStmt * AdminShowReplicaDistributionStmt * AdminShowReplicaStatusStmt * AdminShowTabletStorageFormatStmt Leaving a backdoor for the user root: * AdminSetConfigStmt * AdminShowConfigStmt * AdminShowReplicaDistributionStmt * AdminShowReplicaStatusStmt * AdminDiagnoseTabletStmt * [feature](selectdb-cloud) Update copy into doc (apache#1063) * [feature](selectdb-cloud) Fix AdminSetConfigStmt cannot work with root (apache#1085) * [feature](selectdb-cloud) Fix userid null lead to checkpoint error (apache#1083) * [feature](selectdb-cloud) Support controling the space used for upload (apache#1091) * [feature](selectdb-cloud) Pick "[fix](sequence) fix that update table core dump with sequence column (apache#13847)" (apache#1092) * [Fix](memory-leak) Fix boost::stacktrace memory leak (1097) * [Fix](selectdb-cloud) Several picks to fix memtracker (apache#1087) * [enhancement](memtracker) Add independent and unique scanner mem tracker for each query (apache#13262) * [enhancement](memory) Print memory usage log when memory allocation fails (apache#13301) * [enhancement](memtracker) Print query memory usage log every second when `memory_verbose_track` is enabled (apache#13302) * [fix](memory) Fix USE_JEMALLOC=true UBSAN compilation error apache#13398 * [enhancement](memtracker) Fix bthread local consume mem tracker (apache#13368) Previously, bthread_getspecific was called every time bthread local was used. In the test at apache#10823, it was found that frequent calls to bthread_getspecific had performance problems. So a cache is implemented on pthread local based on the btls key, but the btls key cannot correctly sense bthread switching. So, based on bthread_self to get the bthread id to implement the cache. * [enhancement](memtracker) Fix brpc causing query mem tracker to be inaccurate apache#13401 * [fix](memtracker) Fix transmit_tracker null pointer because phamp is not thread safe apache#13528 * [enhancement](memtracker) Fix Brpc mem count and refactored thread context macro (apache#13469) * [fix](memtracker) Fix the usage of bthread mem tracker (apache#13708) bthead context init has performance loss, temporarily delete it first, it will be completely refactored in apache#13585. 
* [enhancement](memtracker) Refactor load channel + memtable mem tracker (apache#13795) * [fix](load) Fix load channel mgr lock (apache#13960) hot fix load channel mgr lock * [fix](memtracker) Fix DCHECK !std::count(_consumer_tracker_stack.begin(), _consumer_tracker_stack.end(), tracker) * [tempfix][memtracker] wait pick 0b945fe Co-authored-by: Xinyi Zou <zouxinyi02@gmail.com> * [feature](selectdb-cloud) Add more recycler case (apache#1094) * [feature](selectdb-cloud) Pick "[improvement](load) some simple optimization for reduce load memory policy (apache#14215)" (apache#1096) * [feature](selectdb-cloud) Reduce unnecessary get rowset rpc when prepare compaction (apache#1099) * [feature](selectdb-cloud) Pick "[improvement](load) reduce memory in batch for small load channels (apache#14214)" (apache#1100) * [feature](selectdb-cloud) Pick "[improvement](load) release load channel actively when error occurs (apache#14218)" (apache#1102) * [feature](selectdb-cloud) Print build info of ms/recycler to stdout when launch (apache#1105) * [feature](selectdb-cloud) copy into support select by column name and load with partial columns (apache#1104) e.g. ``` COPY INTO test_table FROM (SELECT col1, col2, col3 FROM @ext_stage('1.parquet')) COPY INTO test_table (id, name) FROM (SELECT col1, col2 FROM @ext_stage('1.parquet')) ``` * [fix](selectdb-cloud) Pick "[Fix](array-type) bugfix for array column with delete condition (apache#13361)" (apache#1109) Fix for SQL with array column: delete from tbl where c_array is null; more info please refer to apache#13360 Co-authored-by: camby <104178625@qq.com> Co-authored-by: cambyzju <zhuxiaoli01@baidu.com> * [feature](selectdb-cloud) Copy into support force (apache#1081) * [feature](selectdb-cloud) Add abort txn, abort tablet job http api (apache#1101) Abort load txn by txn_id: ``` curl "{meta_sevice_ip}:{brpc_port}/MetaService/http/abort_txn?token=greedisgood9999" -d '{ "cloud_unique_id": string, "txn_id": int64 }' ``` Abort load txn by db_id and label: ``` curl "{meta_sevice_ip}:{brpc_port}/MetaService/http/abort_txn?token=greedisgood9999" -d '{ "cloud_unique_id": string, "db_id": int64, "label": string }' ``` Only support abort compaction job currently: ``` curl "{meta_sevice_ip}:{brpc_port}/MetaService/http/abort_tablet_job?token=greedisgood9999" -d '{ "cloud_unique_id": string, "job" : { "idx": {"tablet_id": int64}, "compaction": [{"id": string}] } }' ``` * [feature](selectdb-cloud) Fix external stage data for smoke test and retry to create stage (apache#1119) * [feature](selectdb-cloud) Fix data leaks when truncating table (apache#1114) * Drop cloud partition when truncating table * Add retry strategy for dropCloudMaterializedIndex * [feature](selectdb-cloud) Fix missing library when compiling unit test (apache#1128) * [feature](selectdb-cloud) Validate the object storage when create stage (apache#1115) * [feature](selectdb-cloud) Fix incorrectly setting cumulative point when committing base compaction (apache#1127) * [feature](selectdb-cloud) Fix missing lease when preparing cumulative compaction (apache#1131) * [feature](selectdb-cloud) Fix unbalanced tablet distribution (apache#1121) * Fix the bug of unbalanced tablet distribution * Use replica index hash to BE * [feature](selectdb-cloud) Fix core dump when get tablets info by BE web page (apache#1113) * [feature](selectdb-cloud) Fix start_fe.sh --version (apache#1106) * [feature](selectdb-cloud) Print tablet stats before and after compaction (apache#1132) * Log num rowsets before and after compaction * 
Print tablet stats after committing compaction * [feature](selectdb-cloud) Allow root user execute AlterSystemStmt (apache#1143) * [feature](selectdb-cloud) Fix BE UT (apache#1141) * [feature](selectdb-cloud) Select BE for the first bucket of every partition randomly (apache#1136) * [feature](selectdb-cloud) Fix query_limit int -> int64 (apache#1154) * [feature](selectdb-cloud) Add more cloud recycler case (apache#1116) * add more cloud recycler case * modify cloud recycler case dateset from sf0.1 to sf1 * [feature](selectdb-cloud) Fix misuse of aws transfer which may delete tmp file prematurely (apache#1160) * [feature](selectdb-cloud) Add test for copy into http data api and userId (apache#1044) * Add test for copy into http data api and userId * Add external and internal stage cross use regression case. * [feature](selectdb-cloud) Pass the cloud compaction regression test (apache#1173) * [feature](selectdb-cloud) Modify max_bytes_per_broker_scanner default value to 150G (apache#1184) * [feature](selectdb-cloud) Fix missing lock when calling Tablet::delete_predicates (apache#1182) * [improvement](config)change default remote_fragment_exec_timeout_ms to 30 seconds * [improvement](config) change default value of broker_load_default_timeout_second to 12 hours * [feature](selectdb-cloud) Fix replay copy into (apache#1167) * Add stage ddl regression * fix replay copy into * remove unused log * fix user name * [feature](selectdb-cloud) Fix FE --version option not work after fe started (apache#1161) * [feature](selectdb-cloud) BE accesses object store using HTTP (apache#1111) * [feature](selectdb-cloud) Refactor recycle copy jobs (apache#1062) * [fix](FE) Pick fix from doris master (apache#1177) (apache#1178) Commit: 53e5f39 Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Mon Oct 31 2022 10:19:32 GMT+0800 (China Standard Time) fix result exprs should be substituted in the same way as agg exprs (apache#13744) Commit: a4a9912 Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Thu Nov 03 2022 10:26:59 GMT+0800 (China Standard Time) fix group by constant value bug (apache#13827) Commit: 84b969a Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Thu Nov 10 2022 11:10:42 GMT+0800 (China Standard Time) fix the grouping expr should check col name from base table first, then alias (apache#14077) Commit: ae4f4b9 Author: starocean999 <40539150+starocean999@users.noreply.github.com> Committer: GitHub <noreply@github.com> Date: Thu Nov 24 2022 10:31:58 GMT+0800 (China Standard Time) fix having clause should use column name first then alias (apache#14408) * [feature](selectdb-cloud) Deal with getNextTransactionId rpc exception (apache#1181) Before fixing, getNextTransactionId will return -1 if there is RPC exception, it will cause schema change and the previous load task execute in parallel unexpectedly. 
* [feature](selectdb-cloud) Throw exceptions for unsupported operations in CloudGlobalTransactionMgr (apache#1180)
* [improvement](load) Add more logging on RPC errors (apache#1183)
* [feature](selectdb-cloud) Add copy_into cases (json, parquet, orc) and tpch_sf1 to the smoke test (apache#1140)
* [feature](selectdb-cloud) Recycle dropped stages (apache#1071)
  * Log the S3 response code
  * Add logging in S3Accessor::delete_objects_by_prefix
  * Fix show copy
  * Remove an empty line
* [feature](selectdb-cloud) Support bthread for the new scanner (apache#1117)
  * Support bthread for the new scanner
  * Keep the number of remote threads the same as the number of local threads
* [feature](selectdb-cloud) Implement a self-explained cloud unique id for instance id searching (apache#1089)
  1. Implement a self-explained cloud unique id for instance id searching
  2. Fix a core dump on registration when the meta-service fails to start
  3. Fix drop_instance not setting mtime
  4. Add an HTTP API to get instance info
  ```
  curl "127.0.0.1:5008/MetaService/http/get_instance?token=greedisgood9999&cloud_unique_id=regression-cloud-unique-id-fe-1"
  curl "127.0.0.1:5008/MetaService/http/get_instance?token=greedisgood9999&cloud_unique_id=1:regression_instance0:regression-cloud-unique-id-fe-1"
  curl "127.0.0.1:5008/MetaService/http/get_instance?token=greedisgood9999&instance_id=regression_instance0"
  ```
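The `1:regression_instance0:...` form in the second call above is what makes the unique id self-explained: the instance id can be recovered from the id itself instead of a meta-service lookup. A minimal parsing sketch, assuming a `<version>:<instance_id>:<unique_id>` layout (the function name is hypothetical):

```cpp
#include <string>

// Hypothetical sketch: parse "1:<instance_id>:<unique_id>" and return the
// embedded instance id; return "" for the legacy one-part form, which still
// needs a lookup against the meta-service.
std::string parse_instance_id(const std::string& cloud_unique_id) {
    std::size_t first = cloud_unique_id.find(':');
    if (first == std::string::npos) return "";               // legacy form
    std::size_t second = cloud_unique_id.find(':', first + 1);
    if (second == std::string::npos) return "";              // malformed
    if (cloud_unique_id.substr(0, first) != "1") return "";  // unknown version
    return cloud_unique_id.substr(first + 1, second - first - 1);
}
```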
* [improvement](memory) Simplify memory config related to tcmalloc and add GC (apache#1191)
  There are several configs related to tcmalloc, and users do not know how to set them. In practice users want just two modes: performance, in which Doris runs queries and loads quickly, and compact, in which Doris runs with less memory usage. To tune tcmalloc individually, the environment variables supported by tcmalloc can be used instead.
  * [improvement](tcmalloc) Add a moderate mode and avoid OOM when there is a lot of cache (apache#14374)
    ReleaseToSystem aggressively when there is little free memory.
* [feature](selectdb-cloud) Pick "[fix](hashjoin) fix coredump of hash join in ubsan build apache#13479" (apache#1190)
  * b5cd167 (TengJianPing, 2022-10-20) [fix](hashjoin) fix coredump of hash join in ubsan build (apache#13479)
* [feature](selectdb-cloud) Support closing a FileWriter without forcing a sync of data to the storage medium (apache#1134)
  * Trace accumulated time
  * Support closing a FileWriter without forcing a sync of data to the storage medium
  * Avoid trace overhead when tracing is disabled
* [feature](selectdb-cloud) Pick "[BugFix](function) fix reverse function dynamic buffer overflow due to illegal character apache#13671" (apache#1146)
  * Pick "[opt](exec) Replace get_utf8_byte_length function by array (apache#13664)"
  * Pick "[BugFix](function) fix reverse function dynamic buffer overflow due to illegal character apache#13671" (Co-authored-by: HappenLee <happenlee@hotmail.com>)
* [feature](selectdb-cloud) Pick "[fix](fe) Inconsistent behavior for string comparison in FE and BE (apache#13604)" (apache#1150) (Co-authored-by: xueweizhang <zxw520blue1@163.com>)
* [feature](selectdb-cloud) `COPY INTO` supports a delete_on condition (apache#1148)
* [feature](selectdb-cloud) Pick "[fix](agg) fix group by constant value bug (apache#13827)" (apache#1152)
  * Fix the group-by-constant-value bug
  * Keep only one const grouping expr if there are no agg exprs
  (Co-authored-by: starocean999 <40539150+starocean999@users.noreply.github.com>)
* [feature](selectdb-cloud) Pick "[fix](join) the build and probe expr should be calculated before converting input block to nullable (apache#13436)" (apache#1155)
  * The build and probe exprs should be calculated before converting the input block to nullable
  * remove_nullable can be called on a const column
  (Co-authored-by: starocean999 <40539150+starocean999@users.noreply.github.com>)
* [feature](selectdb-cloud) Pick "[Bug](predicate) fix core dump on bool type runtime filter (apache#13417)" (apache#1156) (Co-authored-by: Pxl <pxl290@qq.com>)
* [feature](selectdb-cloud) Pick "[Fix](agg) fix bitmap agg core dump when phmap pointer assert alignment (apache#13381)" (apache#1157) (Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>)
* [feature](selectdb-cloud) Pick "[Bug](function) fix core dump on case when have 1000 condition apache#13315" (apache#1158) (Co-authored-by: Pxl <pxl290@qq.com>)
* [feature](selectdb-cloud) Pick "[fix](sort) the sort expr nullable info is wrong in some cases (apache#12003)"
* [feature](selectdb-cloud) Pick "[Improvement](decimal) print decimal according to the real precision and scale (apache#13437)"
* [feature](selectdb-cloud) Pick "[bugfix](VecDateTimeValue) eat the value of microsecond in function from_date_format_str (apache#13446)"
  * Fix from_date_format_str eating the microsecond value
  * Add an SQL-based regression test
  (Co-authored-by: xiaojunjie <xiaojunjie@baidu.com>)
* [feature](selectdb-cloud) Allow ShowProcesslistStmt for normal users (apache#1153)
* [feature](selectdb-cloud) Fix tcmalloc GC not working in some cases (apache#1202)
* [feature](selectdb-cloud) `SHOW DATA` supports db-level stats; add metrics for table data size (apache#1145)
* [feature](selectdb-cloud) Fix a bug in calculating the number of available threads for base compaction (apache#1203)
* [feature](selectdb-cloud) Fix unexpected remaining cluster ids on observers when dropping a cluster (apache#1194)
  There is no `dropCluster` RPC; all clusters are built from tags in the backends info. Previously, the FE master dropped a cluster by counting the clusters retrieved from the meta-service, while observers updated the maps `clusterIdToBackend` and `clusterNameToId` by replaying backend node operations, which led to inconsistency between the FE master and FE observers. Treating empty clusters as dropped clusters keeps them consistent. Check <https://selectdb.feishu.cn/wiki/wikcnqI6HfD5mw8kHoGD5DqDxOe> for more info.
* [feature](selectdb-cloud) Bump version to 2.0.13
* [opt](tcmalloc) Optimize the policy of tcmalloc GC (apache#1214)
  Release memory when memory pressure is above the pressure limit, but keep at least 2% of memory as tcmalloc cache.
* [feature](selectdb-cloud) Fix some bugs of cloud clusters (apache#1213)
  1. Fix executing a load across multiple clusters
  2. Fix `use @cluster` on an FE observer
  3. Fix forwarding without a cloud cluster by setting the cloud cluster when `use cluster` runs on an observer
* [fix](tcmalloc) Do not release cache aggressively when RSS is low (apache#1216)
* [fix](tcmalloc) Fix negative to_free_bytes due to physical_limit (apache#1217); the combined GC policy is sketched below
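Taken together, apache#1191, apache#1214, apache#1216, and apache#1217 describe a GC policy along these lines. This is only a sketch of the stated behavior, assuming a 90% pressure threshold and a hypothetical function name; the gperftools `MallocExtension` calls are real API:

```cpp
#include <gperftools/malloc_extension.h>

#include <cstddef>
#include <cstdint>

// Sketch of the described tcmalloc GC policy: only release cache when memory
// pressure is above the limit, keep at least 2% of the limit cached, and
// clamp the amount to release so it can never go negative (apache#1217).
void tcmalloc_gc(int64_t rss, int64_t physical_limit) {
    // Below the pressure limit: keep the cache for performance (apache#1216).
    if (rss < physical_limit * 9 / 10) return;

    size_t cached = 0;
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.pageheap_free_bytes", &cached);

    int64_t keep = physical_limit / 50;  // keep >= 2% as tcmalloc cache
    int64_t to_free_bytes = static_cast<int64_t>(cached) - keep;
    if (to_free_bytes <= 0) return;  // never release a negative amount

    MallocExtension::instance()->ReleaseToSystem(
        static_cast<size_t>(to_free_bytes));
}
```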
* [feature](selectdb-cloud) Fix old cluster information left in the Context (apache#1220)
* [feature](selectdb-cloud) Add a multi-cluster regression case (apache#1226)
* [feature](selectdb-cloud) Fix too many OBS client logs (apache#1227)
* [fix](memory) Fix memory leak caused by calling boost::stacktrace (apache#14269) (incomplete pick) (apache#1210)
  boost::stacktrace::stacktrace() leaks memory, so use glog's internal function to print stack traces instead. The leak occurs because boost::stacktrace saves state in each thread's thread-local storage and never actively releases it; testing found each thread leaked about 100 MB after calling boost::stacktrace. See boostorg/stacktrace#118 and boostorg/stacktrace#111. (Co-authored-by: Xinyi Zou <zouxinyi02@gmail.com>)
* [feature](selectdb-cloud) Check the md5sum of libfdb.xz (apache#1163)
* [feature](selectdb-cloud) Add multi-cluster regression cases (apache#1231)
  * Add a multi-cluster regression case
  * Refine the code of the multi-cluster regression test
* [fix](memtracker) Fix segment_meta_mem_tracker pick error (stacktrace) (apache#1237)
* [feature](selectdb-cloud) Fix and improve compaction trace (apache#1233)
* [feature](selectdb-cloud) Support cloud cluster in select hints (apache#984), e.g.
  ```
  SELECT /*+ SET_VAR(cloud_cluster = ${cluster_name}) */ * from table
  ```
* [feature](selectdb-cloud) Fix a coredump when loading parquet (apache#1238)
* [feature](selectdb-cloud) Improve FE cluster metrics for monitoring (apache#1232)
* [feature](selectdb-cloud) Add a multi-cluster async copy-into regression case (apache#1242)
  * Add a multi-cluster regression case
  * Refine the code of the multi-cluster regression test
  * Add a multi-cluster async copy-into regression case
* [feature](selectdb-cloud) Add an error-URL regression case (apache#1246)
* [feature](selectdb-cloud) Upgrade the mariadb client version (apache#1240)
  This change _may_ fix "Failed to execute sql: java.lang.ClassCastException: java.util.LinkedHashMap$Entry cannot be cast to java.util.HashMap$TreeNode".
* [feature](selectdb-cloud) Fix replay of copy jobs and the failure message (apache#1239)
* [feature](selectdb-cloud) Fix improper number of input rowsets in cumulative compaction (apache#1235)
  Remove the logic that returns the input rowsets directly when their total size is larger than the promotion size in the cumulative compaction policy.
* Pick "[Feature](runtime-filter) add runtime filter breaking change adapt apache#13246" (apache#1221)
  This commit fixes TPC-DS q85.
* [feature](selectdb-cloud) Update the HTTP API doc (apache#1230)
* [feature](selectdb-cloud) Optimize count/max/min queries by caching the index info at write time (apache#1222)
* [feature](selectdb-cloud) Fix the coredump caused by double prepare in fragment_executor (apache#1249)
* [feature](selectdb-cloud) Clean copy jobs by number (apache#1219)
* [feature](selectdb-cloud) Fix the is_same_v failure in begin_rpc (apache#1250)
* [feature](selectdb-cloud) Change the delete logic of fdbbackup (apache#1248)
* [feature](selectdb-cloud) Fix misuse of AWS transfer which may delete the tmp file prematurely (apache#1159)
  * Fix misuse of the AWS transfer manager
  * Share one TransferManager within an S3FileSystem
  * Fix uploading incorrect data when opening the file failed
  * Add a UT for uploading to S3
* [feature](selectdb-cloud) Bump version to 2.0.14 (apache#1255)
* [feature](selectdb-cloud) Improve copy into with delete-on for json/parquet/orc (apache#1257)
* [feature](selectdb-cloud) Implement tablet balance at the partition level (apache#1247)
* [feature](selectdb-cloud) Add a pad_segment HTTP action to manually overwrite an unrecoverable segment with an empty segment (apache#1254)
* [feature](selectdb-cloud) Fix BE UT (apache#1262)
* [feature](selectdb-cloud) Modify regression cases to adapt to cloud mode (apache#1264)
* [feature][selectdb-cloud] Fix unknown table caused by partition-level balance during replay (apache#1265)
* [feature][selectdb-cloud] Adjust the log level of table creation (apache#1260)
* [feature](selectdb-cloud) Add a config for the number of warn log files (apache#1245)
* [feature](selectdb-cloud) Fix an incorrect variable in the test_multiply case (apache#1266)
* [feature](selectdb-cloud) Check the connection timeout when creating a stage (apache#1253)
* [feature][selectdb-cloud] Add an auth check for undetermined clusters (apache#1258)
* [feature](selectdb-cloud) Meta-service supports configurable rate limiting (apache#1205)
* [feature](selectdb-cloud) Check the file cache config at launch to increase robustness (apache#1269)
* [feature][selectdb-cloud] Add MaxBuildRowsetTime, MaxBuildRowsetTime, UploadSpeed to the tablet sink profile (apache#1252)
* [feature](selectdb-cloud) Fix three regression cases for cloud (apache#1271)
* [feature](selectdb-cloud) Reduce logging of get_tablet_stats (apache#1274)
* [feature](selectdb-cloud) Deprecate max_upload_speed and min_upload_speed in PTabletWriterAddBlockResult
* [feature](selectdb-cloud) Add more brpc configs (`DECLARE_uint64(max_body_size);` and `DECLARE_int64(socket_max_unwritten_bytes);`) to get rid of "overcrowded" errors; see the sketch below
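Both flags above are existing brpc gflags; a hedged sketch of raising them at process startup through the gflags API (the values chosen here are illustrative, not the shipped defaults):

```cpp
#include <gflags/gflags.h>

// Sketch: loosen brpc's limits at startup. max_body_size caps the size of a
// single message brpc accepts; socket_max_unwritten_bytes caps how many
// bytes may queue on a socket before brpc reports it as overcrowded.
void set_brpc_limits() {
    google::SetCommandLineOption("max_body_size", "3147483648");
    google::SetCommandLineOption("socket_max_unwritten_bytes", "1073741824");
}
```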
* Pick "[Chore](regression) Fix wrong result for decimal (apache#13644)"
  * e007343 (Gabriel, 2022-10-26) [Chore](regression) Fix wrong result for decimal (apache#13644)
* [feature](selectdb-cloud) Fix the transfer handle not being initialized (apache#1276)
* [feature](selectdb-cloud) Fix the test_segment_iterator_delete case (apache#1275)
* [feature][selectdb-cloud] Update the copy upload doc (apache#1273)
* [Fix](inverted index) Pick CLucene error processing from dev (apache#1287)
  * [bug][inverted] Fix BE core when a CLuceneError is thrown (apache#1261): catch CLucene errors, add warning logs, and optimize the code
  * [Fix](inverted index) Return an error if inverted index writer init failed (apache#1267)
  * [Fix](segment_writer) Return an error status when creating the segment writer fails
  (Co-authored-by: airborne12 <airborne12@gmail.com>, luennng <luennng@gmail.com>)
* [enhancement] Support converting TYPE_FLOAT in function convert_type_to_primitive (apache#1290)
* [feature][selectdb-cloud] Fix meta-service range get of instances at launch (apache#1293)

Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
Co-authored-by: meiyi <myimeiyi@gmail.com>
Co-authored-by: Xiaocc <598887962@qq.com>
Co-authored-by: Lei Zhang <27994433+SWJTU-ZhangLei@users.noreply.github.com>
Co-authored-by: Xin Liao <liaoxinbit@126.com>
Co-authored-by: Luwei <814383175@qq.com>
Co-authored-by: plat1ko <platonekosama@gmail.com>
Co-authored-by: deardeng <565620795@qq.com>
Co-authored-by: Kidd <107781942+k-i-d-d@users.noreply.github.com>
Co-authored-by: Xinyi Zou <zouxinyi02@gmail.com>
Co-authored-by: zhannngchen <48427519+zhannngchen@users.noreply.github.com>
Co-authored-by: camby <104178625@qq.com>
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
Co-authored-by: Yongqiang YANG <98214048+dataroaring@users.noreply.github.com>
Co-authored-by: starocean999 <40539150+starocean999@users.noreply.github.com>
Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
Co-authored-by: AlexYue <yj976240184@qq.com>
Co-authored-by: xueweizhang <zxw520blue1@163.com>
Co-authored-by: Pxl <pxl290@qq.com>
Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>
Co-authored-by: xiaojunjie <971308896@qq.com>
Co-authored-by: xiaojunjie <xiaojunjie@baidu.com>
Co-authored-by: airborne12 <airborne08@gmail.com>
Co-authored-by: luennng <luennng@gmail.com>
Co-authored-by: airborne12 <airborne12@gmail.com>
Co-authored-by: YueW <45946325+Tanya-W@users.noreply.github.com>
Load data from Parquet files using libarrow.a and libparquet.a (see the sketch below).
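For reference, a minimal sketch of reading a Parquet file into an `arrow::Table` with the Arrow/Parquet C++ libraries this PR links against. The `arrow::io` / `parquet::arrow` calls are real API in Result-based Arrow releases (older releases use out-parameter variants); the function itself is just an illustration, not this PR's scanner code:

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

#include <iostream>
#include <memory>
#include <string>

// Open a local Parquet file and read it fully into an arrow::Table.
arrow::Status read_parquet(const std::string& path) {
    // Memory-mappable local file handle.
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::io::ReadableFile> file,
                          arrow::io::ReadableFile::Open(path));

    // Wrap the Parquet file in an Arrow-aware reader.
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
        file, arrow::default_memory_pool(), &reader));

    // Materialize all row groups and columns as one table.
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadTable(&table));

    std::cout << "rows: " << table->num_rows()
              << ", columns: " << table->num_columns() << std::endl;
    return arrow::Status::OK();
}
```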