Paged queries are truncated with ScyllaDB as the backend store, triggering the "Unexpected fetched page size" error #1340

@philipy1219

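Expected behavior

Paged queries return the complete result set.

Actual behavior

Running a backup with hugegraph-tools 1.4.0 against data stored in a ScyllaDB backend triggers the "Unexpected fetched page size" error.

The source code shows that the error is raised here: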
https://github.com/hugegraph/hugegraph/blob/ed610a0b889cc9a0539c5a43ea3be04ce3e6e940/hugegraph-cassandra/src/main/java/com/baidu/hugegraph/backend/store/cassandra/CassandraEntryIterator.java#L43-L59

We suspect the cause is ScyllaDB's 1 MB page size limit, which is a hard limit and cannot be adjusted. See the following pages:

https://www.scylladb.com/2017/11/17/7-rules-planning-queries-maximum-performance/

https://stackoverflow.com/questions/56697213/when-does-driver-datastax-driver-paging-yields-fewer-pages-than-requested

The error does not occur when Cassandra is used as the backend store.
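To illustrate the suspected cause, here is a minimal sketch in plain Java. It does not use ScyllaDB or the driver: `fetchPage`, `Page`, and the `maxBytes` cap are hypothetical stand-ins for a backend that limits each page by bytes as well as by row count. It shows that a page can come back shorter than the requested size while more data remains, so a reader should stop only when the backend reports that nothing is left.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a backend that caps pages by bytes as well as by row count.
class ShortPageDemo {

    /** One page of rows plus the offset to resume from (-1 when the scan is exhausted). */
    static final class Page {
        final List<String> rows;
        final int nextOffset;

        Page(List<String> rows, int nextOffset) {
            this.rows = rows;
            this.nextOffset = nextOffset;
        }
    }

    /** Returns up to pageSize rows, but stops early once maxBytes would be exceeded. */
    static Page fetchPage(List<String> table, int offset, int pageSize, int maxBytes) {
        List<String> rows = new ArrayList<>();
        int bytes = 0;
        int i = offset;
        while (i < table.size() && rows.size() < pageSize) {
            int rowBytes = table.get(i).length();
            // Always return at least one row so that the scan makes progress
            if (!rows.isEmpty() && bytes + rowBytes > maxBytes) {
                break;
            }
            rows.add(table.get(i));
            bytes += rowBytes;
            i++;
        }
        return new Page(rows, i < table.size() ? i : -1);
    }

    /** Reads everything, using only the resume offset to decide when to stop. */
    static List<String> readAll(List<String> table, int pageSize, int maxBytes) {
        List<String> all = new ArrayList<>();
        int offset = 0;
        while (offset >= 0) {
            Page page = fetchPage(table, offset, pageSize, maxBytes);
            all.addAll(page.rows);
            offset = page.nextOffset;
        }
        return all;
    }

    /** Counts pages that are shorter than requested even though more data remains. */
    static int countShortPages(List<String> table, int pageSize, int maxBytes) {
        int shortPages = 0;
        int offset = 0;
        while (offset >= 0) {
            Page page = fetchPage(table, offset, pageSize, maxBytes);
            if (page.rows.size() < pageSize && page.nextOffset >= 0) {
                shortPages++;
            }
            offset = page.nextOffset;
        }
        return shortPages;
    }

    /** Ten rows of four characters each. */
    static List<String> sampleTable() {
        List<String> table = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            table.add(String.format("r%03d", i));
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> table = sampleTable();
        System.out.println("rows read: " + readAll(table, 4, 10).size());
        System.out.println("short pages with data remaining: " + countShortPages(table, 4, 10));
    }
}
```

A reader that insisted on every non-final page holding exactly `pageSize` rows would reject the short pages counted here, which is the kind of mismatch the error message suggests.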

Following the DataStax Java API, we made the following change:

https://github.com/hugegraph/hugegraph/blob/ed610a0b889cc9a0539c5a43ea3be04ce3e6e940/hugegraph-cassandra/src/main/java/com/baidu/hugegraph/backend/store/cassandra/CassandraEntryIterator.java#L67-L94

We changed the fetch function above to:

protected final boolean fetch() {
    assert this.current == null;
    if (this.next != null) {
        this.current = this.next;
        this.next = null;
    }
    while (this.remaining > 0 && this.rows.hasNext()) {
        if (this.query.paging()) {
            // Added: request more results if some pages have not been fetched yet
            if (!this.results.isFullyFetched()) {
                this.results.fetchMoreResults();
            }
            this.remaining--;
        }
        Row row = this.rows.next();
        BackendEntry merged = this.merger.apply(this.current, row);
        if (this.current == null) {
            // The first time to read
            this.current = merged;
        } else if (merged == this.current) {
            // The next entry belongs to the current entry
            assert merged != null;
        } else {
            // New entry
            assert this.next == null;
            this.next = merged;
            break;
        }
    }
    return this.current != null;
}

This forces an isFullyFetched check for pages that have not been fetched yet and, if any remain, calls fetchMoreResults.
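The idea behind that check can be sketched without a cluster. The `PagedRows` class below is a hypothetical, synchronous stand-in and not the driver's API; it only borrows the names `isFullyFetched` and `fetchMoreResults`. As far as we understand, the real `fetchMoreResults` in the DataStax Java driver 3.x is asynchronous and returns a future, so this sketch does not model that timing. It shows only that an empty buffer is not the end of the data while unfetched pages remain.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, synchronous stand-in for a paged result set; not the driver's API.
class PagedRowsDemo {

    static final class PagedRows implements Iterator<String> {
        // Pages that are still "on the server"
        private final Deque<List<String>> pendingPages;
        // Rows that have already been fetched but not yet consumed
        private final Deque<String> buffered = new ArrayDeque<>();
        int fetchCalls = 0;

        PagedRows(List<List<String>> pages) {
            this.pendingPages = new ArrayDeque<>(pages);
        }

        boolean isFullyFetched() {
            return this.pendingPages.isEmpty();
        }

        void fetchMoreResults() {
            if (!this.pendingPages.isEmpty()) {
                this.buffered.addAll(this.pendingPages.poll());
                this.fetchCalls++;
            }
        }

        @Override
        public boolean hasNext() {
            // A short or even empty page does not mean the scan is finished:
            // keep fetching until a row is buffered or no pages remain.
            while (this.buffered.isEmpty() && !isFullyFetched()) {
                fetchMoreResults();
            }
            return !this.buffered.isEmpty();
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return this.buffered.poll();
        }
    }

    /** Consumes all rows in order. */
    static List<String> drain(PagedRows rows) {
        List<String> all = new ArrayList<>();
        while (rows.hasNext()) {
            all.add(rows.next());
        }
        return all;
    }

    /** Four pages of uneven size, including an empty one. */
    static PagedRows sample() {
        return new PagedRows(Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Collections.singletonList("d"),
                Collections.<String>emptyList(),
                Arrays.asList("e", "f")));
    }

    public static void main(String[] args) {
        PagedRows rows = sample();
        System.out.println(drain(rows) + " fetched in " + rows.fetchCalls + " calls");
    }
}
```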

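With this change, 99.98% of the data can be backed up, but about two-thousandths of the data still cannot be exported. Across repeated runs of the backup command, the IDs of the data that fail to export are not exactly the same.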
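Because the missing IDs vary between runs, one way to narrow the problem down is to compare the missing sets directly. The sketch below is illustrative only: the class `MissingIdDiff`, its helper names, and the sample IDs are hypothetical, and it assumes the expected and exported IDs are available as in-memory sets. An ID that is missing in every run might point to the data itself, while an empty intersection might point to nondeterminism in paging.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical diagnostic helper; not part of hugegraph or hugegraph-tools.
class MissingIdDiff {

    /** Builds a sorted set of IDs from the given values. */
    static Set<String> ids(String... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    /** IDs that were expected but do not appear in one run's export. */
    static Set<String> missing(Set<String> expected, Set<String> exported) {
        Set<String> result = new TreeSet<>(expected);
        result.removeAll(exported);
        return result;
    }

    /** IDs that are missing from every run, i.e. the intersection of the missing sets. */
    static Set<String> missingInEveryRun(Set<String> expected, List<Set<String>> runs) {
        Set<String> common = new TreeSet<>(expected);
        for (Set<String> run : runs) {
            common.retainAll(missing(expected, run));
        }
        return common;
    }

    public static void main(String[] args) {
        Set<String> expected = ids("v1", "v2", "v3", "v4", "v5");
        Set<String> run1 = ids("v1", "v2", "v3", "v5");
        Set<String> run2 = ids("v1", "v3", "v4", "v5");
        System.out.println("missing in run 1: " + missing(expected, run1));
        System.out.println("missing in run 2: " + missing(expected, run2));
        System.out.println("missing in every run: "
                + missingInEveryRun(expected, Arrays.asList(run1, run2)));
    }
}
```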

Paged queries affect several tasks, including scan queries and index rebuilding. Could the hugegraph team please help locate the problem? Thanks.

Status of loaded data

Vertex/Edge summary

  • loaded vertices amount: on the order of hundreds of millions
  • loaded edges amount: on the order of billions

Specifications of environment

  • hugegraph version: 0.10.4
  • operating system: centos 7.4, 16 CPUs, 128G RAM
  • hugegraph backend: scylladb 4.0.4 cluster with 3 nodes, 1 x 1TB SSD disk each node
