refactor(loader): support concurrent readers, short-id & Graphsrc #683
imbajin merged 46 commits into apache:master
Added the method InputReader.multiReaders() and adapted it for all source types.
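The multi-file input part has not been confirmed as complete yet; this is preliminary progress.
Related configuration and detail changes:
1. FileSource gains the dir_filter and extra_date_formats parameters, and its constructor is modified accordingly. It also adds case-insensitive header support for ORC/Parquet files (FileSource.headerCaseSensitive) and splitCount for single-file use, which makes file loading more flexible and compatible.
2. InputSource gains headerCaseSensitive(), which is case-sensitive by default.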
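Multi-file input
FileReader.java
init() now only calls progress(context, struct) and no longer scans files.
File scanning and reader splitting have moved into the split() method, which:
- calls scanReadables() to collect all files
- sorts them
- creates multiple FileReader sub-instances, one per file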
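For readers unfamiliar with the flow, the scan / sort / one-sub-reader-per-file steps can be sketched roughly as follows. This is only an illustration: `SplitSketch`, `FileReaderSketch`, and the list-based `scanReadables` are hypothetical stand-ins, not the loader's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; only the split flow described above is modelled.
class SplitSketch {

    static final class FileReaderSketch {
        private final String file;

        FileReaderSketch(String file) {
            this.file = file;
        }

        String file() {
            return this.file;
        }
    }

    // Placeholder for scanReadables(): the real method walks the source path.
    static List<String> scanReadables(List<String> candidates) {
        return new ArrayList<>(candidates);
    }

    // split(): scan, sort, then create one sub-reader per file.
    static List<FileReaderSketch> split(List<String> candidates) {
        List<String> files = scanReadables(candidates);
        Collections.sort(files);
        List<FileReaderSketch> readers = new ArrayList<>(files.size());
        for (String f : files) {
            readers.add(new FileReaderSketch(f));
        }
        return readers;
    }

    public static void main(String[] args) {
        for (FileReaderSketch r : split(Arrays.asList("b.csv", "a.csv", "c.csv"))) {
            System.out.println(r.file());
        }
    }
}
```

Sorting before splitting keeps the sub-reader order deterministic across runs, which presumably matters when progress is restored by file name.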
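InputProgress.java
New version
- Progress is tracked in a Map of file name -> InputItemProgress
- The loading state (loaded / loading) of multiple files can be tracked at the same time
- Supports multi-threaded concurrency and finer-grained control (for example, confirming the offset of a single file, or marking a single file as loaded)
Interface refactoring
Old version
- loadingItem(): returns the single loadingItem
- addLoadingItem(InputItemProgress): replaces the current loadingItem and moves the old one into loadingItems
- loadingOffset(): returns the current loadingItem.offset()
- markLoaded(boolean markAll):
New version
- loadingItem(String name): looks up the loadingItem for the given file name
- addLoadingItem(String name, InputItemProgress): adds an item keyed by file name
- loadingOffset() is removed, because with multi-file support the offset must be read per file
- markLoaded(Readable readable, boolean markAll):
- if a readable is passed in → the corresponding file is moved from loadingItems to loadedItems
- otherwise (readable=null and markAll=true) → all loadingItems are moved over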
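A minimal sketch of the name-keyed progress tracking described above, under simplifying assumptions: `ProgressSketch` and `ItemProgress` are hypothetical, a plain `String` name stands in for `Readable`, and the real `InputProgress` may differ in synchronization and detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-file progress tracking; not the loader's class.
class ProgressSketch {

    static final class ItemProgress {
        private long offset;

        long offset() {
            return this.offset;
        }

        void offset(long offset) {
            this.offset = offset;
        }
    }

    private final Map<String, ItemProgress> loadedItems = new LinkedHashMap<>();
    private final Map<String, ItemProgress> loadingItems = new LinkedHashMap<>();

    synchronized void addLoadingItem(String name, ItemProgress item) {
        this.loadingItems.put(name, item);
    }

    synchronized ItemProgress loadingItem(String name) {
        return this.loadingItems.get(name);
    }

    // name != null: move that one file; name == null and markAll: move all.
    synchronized void markLoaded(String name, boolean markAll) {
        if (name != null) {
            ItemProgress item = this.loadingItems.remove(name);
            if (item != null) {
                this.loadedItems.put(name, item);
            }
        } else if (markAll) {
            this.loadedItems.putAll(this.loadingItems);
            this.loadingItems.clear();
        }
    }

    synchronized int loadedCount() {
        return this.loadedItems.size();
    }

    synchronized int loadingCount() {
        return this.loadingItems.size();
    }
}
```

With this shape a worker can confirm the offset of its own file via `loadingItem(name).offset(...)` without touching the state of other files.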
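InputProgressDeser.java
Old version
Set<InputItemProgress> loadedItems;
InputItemProgress loadingItem;
A Set stores the completed items and a single object stores the item being loaded.
New version
Map<String, InputItemProgress> loadedItems;
Map<String, InputItemProgress> loadingItems;
Both are now Maps (the key is a string such as a file name or ID), which keeps entries unique, allows fast lookup, and supports multiple concurrent "loading items".
The maps are created with:
Collections.synchronizedMap(InsertionOrderUtil.newMap());
to ensure thread safety and preserve insertion order.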
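One caveat worth illustrating: `Collections.synchronizedMap` only guards individual calls, so iteration still has to be synchronized on the map by the caller. The sketch below assumes `InsertionOrderUtil.newMap()` behaves like an insertion-ordered map and uses `LinkedHashMap` as a stand-in; `SyncMapSketch` and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; LinkedHashMap stands in for InsertionOrderUtil.newMap().
class SyncMapSketch {

    static Map<String, Long> newProgressMap() {
        return Collections.synchronizedMap(new LinkedHashMap<String, Long>());
    }

    // Iterating a synchronized map requires holding its lock manually.
    static List<String> snapshotKeys(Map<String, Long> map) {
        synchronized (map) {
            return new ArrayList<>(map.keySet());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> loading = newProgressMap();
        loading.put("part-2.csv", 0L);
        loading.put("part-1.csv", 0L);
        // Single put/get calls from another thread are safe without extra locking.
        Thread t = new Thread(() -> loading.put("part-3.csv", 0L));
        t.start();
        t.join();
        System.out.println(snapshotKeys(loading));
    }
}
```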
…ogress && adjust some tests
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #683 +/- ##
=============================================
- Coverage 62.49% 51.72% -10.78%
- Complexity 1903 2059 +156
=============================================
Files 262 335 +73
Lines 9541 12520 +2979
Branches 886 1159 +273
=============================================
+ Hits 5963 6476 +513
- Misses 3190 5573 +2383
- Partials 388 471 +83
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/HugeGraphLoader.java
In the finally block of prepareTaskItems():
} finally {
    Set<InputReader> usedReaders = tasks.stream()
                                        .map(item -> item.reader)
                                        .collect(Collectors.toSet());
    for (InputReader r : readers) {
        if (!usedReaders.contains(r)) {
            try {
                r.close();
            } catch (Exception ex) {
                LOG.warn("Failed to close reader: {}", ex.getMessage());
            }
        }
    }
}
Problems:
- Only the "unused" readers are closed; where are the readers in use closed?
- If reader.init() fails, the reader may be left half-initialized and still needs cleanup
- The exception is swallowed (only logged at warn level), which may hide important resource-release failures
Suggestions:
- Make the responsibility for reader lifecycle management explicit
- Use try-with-resources, or ensure unified cleanup after the tasks complete
- Consider whether a reader registry is needed to track every reader that is created
Readers that are in use are cleaned up in asyncLoadStruct.
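The split of responsibilities discussed in this thread (unused readers closed during preparation, used readers closed when their task finishes) might look roughly like the sketch below. `ReaderLifecycleSketch`, its `Reader` type, and the prefix-based selection are hypothetical; this is not the actual prepareTaskItems() or asyncLoadStruct code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two cleanup paths; not the loader's real code.
class ReaderLifecycleSketch {

    static final class Reader implements AutoCloseable {
        final String name;
        boolean closed;

        Reader(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            this.closed = true;
        }
    }

    // Selects the readers that become tasks and closes every other reader.
    static List<Reader> prepare(List<Reader> readers, String keepPrefix) {
        List<Reader> used = new ArrayList<>();
        try {
            for (Reader r : readers) {
                if (r.name.startsWith(keepPrefix)) {
                    used.add(r);
                }
            }
            return used;
        } finally {
            // Runs on both success and failure, so unused readers never leak.
            for (Reader r : readers) {
                if (!used.contains(r)) {
                    r.close();
                }
            }
        }
    }

    // Each used reader is closed when its own task ends, even on failure.
    static void runTask(Reader reader) {
        try (Reader r = reader) {
            // Load lines from r here.
        }
    }
}
```

Tying the close of a used reader to the end of its task with try-with-resources makes the ownership explicit, which addresses the first question raised in the review.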
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/HugeGraphLoader.java
.collect(Collectors.toList());

if (!CollectionUtils.isEmpty(selectedVertexLabels)) {
    vertexLabels =
In createGraphSourceLabels():
Set<String> existedPKs =
        targetClient.schema().getPropertyKeys().stream()
                    .map(pk -> pk.name()).collect(Collectors.toSet());
for (String pkName : label.properties()) {
    PropertyKey pk = sourceClient.schema().getPropertyKey(pkName);
    if (!existedPKs.contains(pk.name())) {
        targetClient.schema().addPropertyKey(pk);
    }
}
Problems:
- Only the PropertyKey name is checked; the data type is not verified to match
- If the target graph already has a PropertyKey with the same name but a different type, the data will be incompatible
- Other PropertyKey attributes (such as Cardinality) are not handled
Suggestions:
- Compare the full PropertyKey definition, including dataType, cardinality, and so on
- If the schemas are incompatible, report a clear error
- Consider adding a force-overwrite option
It may not be necessary
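For reference, the full-definition comparison suggested by the reviewer could be sketched as below. The author considered it possibly unnecessary, so this is only an illustration of the idea; `SchemaCompatSketch` and `KeyDef` are hypothetical simplifications and not the HugeClient PropertyKey class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a name + type + cardinality check.
class SchemaCompatSketch {

    static final class KeyDef {
        final String name;
        final String dataType;
        final String cardinality;

        KeyDef(String name, String dataType, String cardinality) {
            this.name = name;
            this.dataType = dataType;
            this.cardinality = cardinality;
        }

        boolean sameDefinition(KeyDef other) {
            return this.dataType.equals(other.dataType) &&
                   this.cardinality.equals(other.cardinality);
        }
    }

    // true: create the key on the target; false: an identical key exists.
    // Throws when a same-named key with a different definition exists.
    static boolean needsCreate(Map<String, KeyDef> target, KeyDef source) {
        KeyDef existing = target.get(source.name);
        if (existing == null) {
            return true;
        }
        if (!existing.sameDefinition(source)) {
            throw new IllegalStateException(
                      "Incompatible property key '" + source.name + "': target has " +
                      existing.dataType + "/" + existing.cardinality + ", source has " +
                      source.dataType + "/" + source.cardinality);
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, KeyDef> target = new HashMap<>();
        target.put("age", new KeyDef("age", "INT", "SINGLE"));
        System.out.println(needsCreate(target, new KeyDef("name", "TEXT", "SINGLE")));
        System.out.println(needsCreate(target, new KeyDef("age", "INT", "SINGLE")));
    }
}
```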
Pull Request Overview
Copilot reviewed 69 out of 70 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (2)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/builder/EdgeBuilder.java:1
- Corrected duplicate word 'the' in error message.
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/HugeGraphLoader.java:1
- Corrected spelling of 'avaliable' to 'available' in error message (located in HugeClientHolder.java).
hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/reader/jdbc/Fetcher.java
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/filter/util/ShortIdConfig.java
Pull Request Overview
Copilot reviewed 69 out of 70 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (2)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/builder/ElementBuilder.java:1
- Direct use of internal implementation class BuilderImpl instead of the interface. This creates tight coupling to internal APIs that may change.
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/builder/EdgeBuilder.java:1
- Remove duplicate 'the' in error message.
hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/LoadTest.java
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/util/HugeClientHolder.java
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/util/DataTypeUtil.java
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/source/graph/GraphSource.java
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/reader/file/FileLineFetcher.java
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/builder/EdgeBuilder.java
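Code review comments
Thank you for submitting this important PR! This is a major refactor that introduces concurrent loading, Graph-to-Graph migration, and other significant features. My review comments follow: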
Thespica left a comment
LGTM, thank you @sadwitdastreetz!
Purpose of the PR
This PR is part of updating HugeGraphLoader to 2.0 and, most importantly, it is NOT ready yet. It introduces a major refactor and enhancement of the HugeGraph Loader, aiming to improve parallelism, stability, and compatibility during data loading.
It includes:
These changes address issues with performance bottlenecks, Kerberos token expiration, Oracle missing rows, and lack of schema compatibility when importing from another graph.
Main Changes
Loader
Refactored HugeGraphLoader with concurrent loading, Graph source support, and improved error handling.
Major Changes
Concurrency
Graph Source Support
Schema Management
Error Handling
API Changes
Breaking Changes
Source Layer
- dir_filter, extra_date_formats, headerCaseSensitive, and splitCount for flexible directory/file filtering and single-file parallel reading.
- FileFilter + DirFilter, supports recursive directory traversal.
- LoadException.
Reader Layer
- init() and moved scan logic into split() → multiple sub-readers per file.
- DirFilter.
- RowFetcher with streaming JDBCFetcher to avoid Oracle data loss and improve performance.
Progress Layer
- loadingItem to Map<String, InputItemProgress> for multi-file concurrent tracking.
- markLoaded(Readable, boolean) API for fine-grained progress confirmation.
Filter Layer
- ShortIdParser, ShortIdConfig.
- SchemaManagerProxy and VertexLabelProxy using reflection, injecting short-id handling transparently into HugeClient.
Options
- LoadOptions with new cluster, graph, and loading optimization flags (--scatter-sources, --short-id, --restore, etc.).
- dumpParams() to log all runtime parameters.
Others
- GlobalExecutorManager for thread pool management.
- FileLoadTest to adapt to InputProgress refactor.
sequenceDiagram
    autonumber
    actor CLI as User (CLI)
    participant Loader as HugeGraphLoader
    participant Options as LoadOptions
    participant Ctx as LoadContext
    participant Exec as GlobalExecutorManager
    participant Reader as InputReader(s)
    participant Parse as ElementParseGroup
    participant ShortId as ShortIdParser
    participant Client as HugeClient
    CLI->>Loader: new(args) / load()
    Loader->>Options: parse and set parallel/shortId etc.
    Loader->>Ctx: init(Options) -> create indirectClient, filterGroup
    Loader->>Exec: getExecutor(parallel)
    Loader->>Reader: create reader / split() (if multiReaders)
    par Process each InputTaskItem in parallel
        Reader-->>Loader: emit Line or GraphElement
        Loader->>Parse: filter(element)
        Parse-->>Loader: true / false
        alt Passes the filter
            Loader->>ShortId: possibly convert to short ID
            ShortId-->>Loader: update element.id
            Loader->>Client: write (commit/flush triggered by config or per line)
        else Filtered out
            Loader-->>Loader: skip
        end
        Loader->>Ctx: mark progress (loaded_items/loading_items map)
    end
    Loader->>Exec: shutdown()
    Loader-->>CLI: return completion/exception
sequenceDiagram
    autonumber
    participant GSrc as GraphSource
    participant GReader as GraphReader
    participant Fetch as GraphFetcher
    participant Client as HugeClient
    participant Loader as HugeGraphLoader
    GSrc->>GReader: new(GraphSource)
    GReader->>Client: createHugeClient()
    loop Batch fetch
        GReader->>Fetch: queryBatch(offset, size)
        Fetch->>Client: execute query
        Client-->>Fetch: return element batch
        Fetch-->>GReader: elements
        GReader-->>Loader: produce Line (handed to NopBuilder/filter chain)
    end
Does this PR potentially affect the following parts?
LoadOptions extended)
Documentation Status
Doc - TODO(need to update loader usage doc)
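As a rough illustration of the batch-fetch loop in the GraphSource sequence diagram above, the sketch below pages through a source with queryBatch(offset, size) until an empty batch comes back. It assumes an in-memory list in place of HugeClient queries; `BatchFetchSketch` and both method bodies are hypothetical, and the real GraphFetcher may paginate differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of offset/size paging; not the loader's GraphFetcher.
class BatchFetchSketch {

    // Stand-in for a remote query: returns at most `size` items from `offset`.
    static List<String> queryBatch(List<String> all, int offset, int size) {
        if (offset >= all.size()) {
            return new ArrayList<>();
        }
        int end = Math.min(offset + size, all.size());
        return new ArrayList<>(all.subList(offset, end));
    }

    // Keeps fetching until an empty batch signals the end of the source.
    static List<String> readAll(List<String> source, int batchSize) {
        List<String> lines = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<String> batch = queryBatch(source, offset, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            lines.addAll(batch);
            offset += batch.size();
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("v1", "v2", "v3", "v4", "v5");
        System.out.println(readAll(source, 2));
    }
}
```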