Unity device mapping algorithm#1459
Conversation
lockshaw
left a comment
There was a problem hiding this comment.
Can you add compiler-tests to CI and proj?
Also, overall this code could use a bit of simplifying and readability-improving. For a lot of the cases where you're unwrapping types consider just defining equivalent functions for those types, which would help simplify things. Overall, more small functions with better names would probably help over the current large functions which can be a bit hard to actually read through
Reviewed 7 of 15 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/graph_optimize_result.struct.toml line 7 at r2 (raw file):
includes = [ "compiler/machine_mapping.h" ]
Suggestion:
includes = [
"compiler/machine_mapping.dtg.h",
"pcg/parallel_computation_graph/parallel_computation_graph.dtg.h",
]lib/compiler/include/compiler/graph_optimize_state.h line 13 at r2 (raw file):
float runtime; bool operator==(GraphOptimizeState const &other) const;
GraphOptimizeState::operator== should probably have tests
lib/compiler/include/compiler/unity_algorithm.h line 13 at r2 (raw file):
namespace FlexFlow { struct OptimizerConfig {
Move over to dtgen so it's printable, equality-checkable, hashable, etc.?
lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):
bool GraphOptimizeState::operator==(GraphOptimizeState const &other) const { auto layers1 = topological_ordering(graph_optimize_result.pcg);
I'm not sure topological_ordering is the best thing to rely on here. This seems like a graph-level comparison function, so it might be better to figure out what the semantics are and then it can get added to graph so you only have to use it here?
lib/compiler/src/graph_optimize_state.cc line 30 at r2 (raw file):
return !(*this == other); }
Where is operator<? I think it was declared?
lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):
cached_subgraph_costs(cached_subgraph_costs) {} ParallelComputationGraph pcg;
Why move the PCG from OptimalCostFunctor to MachineMappingSearcher?
lib/compiler/src/machine_mapping.cc line 166 at r2 (raw file):
OptimalCostCache &cached_subgraph_costs; struct OptimalCostFunctor {
Why do we have this separate OptimalCostFunctor?
lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):
std::unordered_set<DataflowOutput> split_outputs; for (auto const &[value, _] :
Might be simpler with transform? Also the assert isn't really necessary, get will already throw if you perform an invalid access
lib/compiler/src/machine_mapping.cc line 287 at r2 (raw file):
optimal_result, OptimalCostResult::parallel_combine( std::visit(OptimalCostFunctor(
Why not just pull out a separate function that does std::visit(OptimalCostFunctor(...)) rather than repeat it each time?
lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):
enumerate_machine_views({node}, resource)) { MachineMapping mv_map{{{node, node_machine_views.at(node)}}}; std::unordered_map<OpenDataflowValue, MachineView> machine_views =
Might be cleaner using merge_maps and generate_map?
lib/compiler/src/machine_mapping.cc line 334 at r2 (raw file):
std::vector<std::unordered_map<Node, MachineView>> enumerate_machine_views(std::unordered_set<Node> const &nodes,
Not entirely clear what these do? Can you rename them to something a bit more descriptive?
lib/compiler/test/src/test_optimal_cost.cc line 9 at r2 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("optimal_cost_0") {
Suggestion:
TEST_CASE("optimal_cost does not crash on minimal inputs") {lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):
SubParallelComputationGraph sub_pcg_from_partial_pcg(ParallelComputationGraph const &,
Is this just get_subgraph for PCGs?
lib/substitutions/src/substitutions/sub_parallel_computation_graph.cc line 61 at r2 (raw file):
std::unordered_set<Node> const &nodes) { auto as_open = view_as_labelled_open_dataflow_graph(pcg.raw_graph); OpenDataflowSubgraphResult subgraph_result = get_subgraph(as_open, nodes);
Probably better to just add get_subgraph for LabelledOpenDataflowGraph and then use it here
wmdi
left a comment
There was a problem hiding this comment.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/graph_optimize_result.struct.toml line 7 at r2 (raw file):
includes = [ "compiler/machine_mapping.h" ]
Done.
lib/compiler/include/compiler/graph_optimize_state.h line 13 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
GraphOptimizeState::operator==should probably have tests
Done.
lib/compiler/include/compiler/unity_algorithm.h line 13 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move over to
dtgenso it's printable, equality-checkable, hashable, etc.?
Done.
lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm not sure
topological_orderingis the best thing to rely on here. This seems like a graph-level comparison function, so it might be better to figure out what the semantics are and then it can get added tographso you only have to use it here?
I think I require homomorphism assuming inputs are ordered.
lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why move the PCG from
OptimalCostFunctortoMachineMappingSearcher?
This is to avoid transferring variables that are not changed during DP, i.e., they are 'constants' to the DP, instead of states.
lib/compiler/src/machine_mapping.cc line 166 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why do we have this separate
OptimalCostFunctor?
To separate DP states with DP constants (environments).
lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be simpler with
transform? Also theassertisn't really necessary,getwill already throw if you perform an invalid access
I do not find methods to get the set of left/right values of a bidict
lib/compiler/src/machine_mapping.cc line 287 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just pull out a separate function that does
std::visit(OptimalCostFunctor(...))rather than repeat it each time?
Done.
lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be cleaner using
merge_mapsandgenerate_map?
We do not have the utils required for bidict
lib/compiler/src/machine_mapping.cc line 334 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not entirely clear what these do? Can you rename them to something a bit more descriptive?
Done.
lib/compiler/test/src/test_optimal_cost.cc line 9 at r2 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("optimal_cost_0") {
Done.
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is this just
get_subgraphforPCGs?
Yes
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 7 of 9 files at r3, all commit messages.
Reviewable status: 15 of 17 files reviewed, 9 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/optimal_cost_state.struct.toml line 20 at r3 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Suggestion:
includes = [
"utils/graph/serial_parallel/serial_parallel_decomposition.dtg.h",
"pcg/machine_specification.dtg.h",
"pcg/machine_view.dtg.h",
"utils/graph/open_dataflow_graph/open_dataflow_edge.dtg.h",
"utils/graph/open_dataflow_graph/open_dataflow_value.dtg.h",
]
src_includes = [
"utils/hash/unordered_map.h",
"utils/fmt/unordered_map.h",
]lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
I think I require homomorphism assuming inputs are ordered.
Using a hack like this for now is fine, but eventually it would be good to move over to a more principled algorithm implemented in utils/graph that doesn't make correctness tradeoffs and more explicitly uses the properties of DataflowGraph that you're relying on for efficiency. Until that's implemented let's add a comment in the code that clarifies that this is a hack and links to an issue with the plan for moving this over to something more correct
lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This is to avoid transferring variables that are not changed during DP, i.e., they are 'constants' to the DP, instead of states.
As discussed in the Wednesday meeting, it should be possible to remove OptimalCostFunctor, especially as we're now on C++17 and can use overload
There seem to be two potential designs (either (1) making optimal_cost a method on MachineMappingSearcher, or (2) making MachineMappingSearcher just a wrapper for the global state (probably renamed to something like DPContext) and passing that around between top-level functions as a mutable reference, i.e., essentially doing the passing of this manually)
lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
I do not find methods to get the set of left/right values of a
bidict
See keys in utils/containers/keys.h
lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
We do not have the utils required for
bidict
Not exactly sure the details of simplifying this, but making it more clear what this is doing would be nice (maybe using functions from utils/containers or pulling out some pieces into separate named lambdas)
lib/compiler/test/src/test_graph_optimize_state.cc line 8 at r3 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("graph_optimize_state:equality") {
Suggestion:
TEST_CASE("GraphOptimizeState::operator==") {
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 2 of 9 files at r3.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @wmdi)
lib/compiler/src/graph_optimize_state.cc line 57 at r3 (raw file):
namespace std { size_t hash<::FlexFlow::GraphOptimizeState>::operator()(
See https://reviewable.io/reviews/flexflow/FlexFlow/1459#-O3kIAym5Yc4ECMISkBW. Eventually it might be good to use a proper graph hash like https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash.html#networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash
lib/compiler/test/src/test_graph_optimize_state.cc line 44 at r3 (raw file):
"dense1"); ParallelComputationGraph pcg = builder.pcg;
Suggestion:
ParallelComputationGraph pcg1 = [&] {
ParallelComputationGraphBuilder builder;
ParallelTensorShape input_shape =
ParallelTensorShape{ParallelTensorDims{
FFOrdered<ShardParallelDim>{
ShardParallelDim{32, 2},
ShardParallelDim{16, 1},
},
ReplicaParallelDimSet{
SumDegree{1},
DiscardCopyDegree{1},
},
},
DataType::FLOAT};
parallel_tensor_guid_t input0 =
builder.create_input_tensor(input_shape);
parallel_tensor_guid_t dense0 = builder.dense(input0,
8);
parallel_tensor_guid_t dense1 = builder.dense(dense0,
4);
return builder.pcg
}();lib/compiler/test/src/test_graph_optimize_state.cc line 67 at r3 (raw file):
"dense0"); ParallelComputationGraph pcg_ = builder.pcg;
Suggestion:
ParallelComputationGraph pcg2 = [&] {
ParallelComputationGraphBuilder builder;
parallel_tensor_guid_t input0 =
builder.create_input_tensor(input_shape);
parallel_tensor_guid_t dense0 = builder.dense(input0,
8);
return builder.pcg;
}();lib/compiler/test/src/test_graph_optimize_state.cc line 69 at r3 (raw file):
ParallelComputationGraph pcg_ = builder.pcg; CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) !=
This doesn't really sufficiently test GraphOptimizeState::operator==--from this I'm not sure (1) what role the machine mappings and the cost play in equality, and (2) whether any graphs compare equal at all, or whether all of them just return not equal. A few more test/sub-cases are needed
lib/compiler/test/src/test_graph_optimize_state.cc line 70 at r3 (raw file):
CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) != GraphOptimizeState(GraphOptimizeResult(pcg_, empty_machine_mapping), 0));
Suggestion:
GraphOptimizeState state1 = GraphOptimizeState{
GraphOptimizeResult{pcg, empty_machine_mapping},
0,
};
GraphOptimzieState state2 = GraphOptimizeState{
GraphOptimizeResult{pcg, empty_machine_mapping},
0,
};
CHECK(state1 != state2);lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
Yes
In that case let's rename it to get_pcg_subgraph
wmdi
left a comment
There was a problem hiding this comment.
Reviewable status: 4 of 41 files reviewed, 13 unresolved discussions (waiting on @lockshaw)
lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Using a hack like this for now is fine, but eventually it would be good to move over to a more principled algorithm implemented in
utils/graphthat doesn't make correctness tradeoffs and more explicitly uses the properties ofDataflowGraphthat you're relying on for efficiency. Until that's implemented let's add a comment in the code that clarifies that this is a hack and links to an issue with the plan for moving this over to something more correct
Done.
lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
As discussed in the Wednesday meeting, it should be possible to remove
OptimalCostFunctor, especially as we're now on C++17 and can useoverloadThere seem to be two potential designs (either (1) making
optimal_costa method onMachineMappingSearcher, or (2) makingMachineMappingSearcherjust a wrapper for the global state (probably renamed to something likeDPContext) and passing that around between top-level functions as a mutable reference, i.e., essentially doing the passing ofthismanually)
Done.
lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See
keysinutils/containers/keys.h
Done.
lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not exactly sure the details of simplifying this, but making it more clear what this is doing would be nice (maybe using functions from
utils/containersor pulling out some pieces into separate named lambdas)
Done.
lib/compiler/test/src/test_graph_optimize_state.cc line 8 at r3 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("graph_optimize_state:equality") {
Done.
lib/compiler/test/src/test_graph_optimize_state.cc line 44 at r3 (raw file):
"dense1"); ParallelComputationGraph pcg = builder.pcg;
Done.
lib/compiler/test/src/test_graph_optimize_state.cc line 67 at r3 (raw file):
"dense0"); ParallelComputationGraph pcg_ = builder.pcg;
Done.
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
In that case let's rename it to
get_pcg_subgraph
Done.
lib/compiler/include/compiler/optimal_cost_state.struct.toml line 20 at r3 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Done.
wmdi
left a comment
There was a problem hiding this comment.
Reviewable status: 4 of 41 files reviewed, 13 unresolved discussions (waiting on @lockshaw)
lib/compiler/src/graph_optimize_state.cc line 57 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See https://reviewable.io/reviews/flexflow/FlexFlow/1459#-O3kIAym5Yc4ECMISkBW. Eventually it might be good to use a proper graph hash like https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash.html#networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash
Done.
lib/compiler/test/src/test_graph_optimize_state.cc line 69 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This doesn't really sufficiently test
GraphOptimizeState::operator==--from this I'm not sure (1) what role the machine mappings and the cost play in equality, and (2) whether any graphs compare equal at all, or whether all of them just return not equal. A few more test/sub-cases are needed
This will be fixed in another PR.
lib/compiler/test/src/test_graph_optimize_state.cc line 70 at r3 (raw file):
CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) != GraphOptimizeState(GraphOptimizeResult(pcg_, empty_machine_mapping), 0));
This will be fixed in another PR.
lib/substitutions/src/substitutions/sub_parallel_computation_graph.cc line 61 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably better to just add
get_subgraphforLabelledOpenDataflowGraphand then use it here
Done.
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 27 of 37 files at r4, 6 of 7 files at r5, 1 of 2 files at r6, all commit messages.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 6 at r6 (raw file):
#include "machine_mapping.h" #include "machine_mapping_cache.h" #include "machine_mapping_context.dtg.h"
Suggestion:
#include "compiler/machine_mapping/machine_mapping.h"
#include "compiler/machine_mapping/machine_mapping_cache.h"
#include "compiler/machine_mapping/machine_mapping_context.dtg.h"lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 16 at r6 (raw file):
ParallelComputationGraph const &pcg, std::function<std::unordered_set<MachineView>( ParallelLayerAttrs const &, MachineSpecification const &)> const
This isn't going to allow access to the input and output tensor shapes of the operator, which are probably necessary?
Code quote:
ParallelLayerAttrs const &, Mlib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):
MachineMappingResult get_optimal_machine_mapping_internal(MachineMappingContext &context,
What is this overload for? Why does it not take a SerialParallelDecomposition?
lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 4 at r6 (raw file):
#define _FLEXFLOW_COMPILER_MACHINE_MAPPING_H #include "machine_mapping.dtg.h"
Suggestion:
#include "compiler/machine_mapping/machine_mapping.dtg.h"lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 8 at r6 (raw file):
namespace FlexFlow { MachineMapping combine(MachineMapping const &, MachineMapping const &);
Suggestion:
MachineMapping combine_disjoint_mappings(MachineMapping const &, MachineMapping const &);lib/compiler/include/compiler/machine_mapping/machine_mapping_cache.h line 10 at r6 (raw file):
namespace FlexFlow { class MachineMappingCache {
Does it make sense to have a separate class/object for this, or this not just the std::unordered_map API?
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 7 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Suggestion:
"compiler/machine_mapping/machine_mapping.dtg.h",lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):
[[fields]] name = "cached_subgraph_results" type = "::FlexFlow::MachineMappingCache"
What about moving MachineMappingCache out of this class so MachineMappingContext can be passed as const & and MachineMappingCache alone gets passed as mutable &?
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h line 14 at r6 (raw file):
MachineMappingResult get_infinity_machine_mapping_result(); void minimize_runtime(MachineMappingResult &m1, MachineMappingResult const &m2);
Why this vs just calling min or something?
Code quote:
void minimize_runtime(MachineMappingResult &m1, MachineMappingResult const &m2);lib/compiler/include/compiler/machine_mapping/machine_mapping_result.struct.toml line 9 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Suggestion:
"compiler/machine_mapping/machine_mapping.dtg.h",lib/compiler/src/unity_algorithm.cc line 39 at r6 (raw file):
MachineMappingCache cached_subgraph_costs; DeduplicatedPriorityQueue<GraphOptimizeState> candidates;
Would it be cleaner here to, rather than having a custom GraphOptimizeState type whose hash and operator== ignore the runtime cost, instead have this be a DeduplicatedPriorityQueue<SomeWrapperAroundAPCG, CustomComparator> where the CustomComparator is something like below, where MyCustomType is analagous to SomeWrapperAroundAPCG, and I suppose instead of a std::unordered_map<MyCustomType, float> you'd have a MachineMappingCache
Code snippet:
#include <map>
#include <queue>
#include <unordered_map>
#include <stdlib.h>
#include <iostream>
int guid_ctr = 0;
struct MyCustomType {
MyCustomType()
: guid(guid_ctr++) {}
bool operator==(MyCustomType const &other) const {
return this->guid == other.guid;
}
bool operator!=(MyCustomType const &other) const {
return this->guid != other.guid;
}
int guid;
};
namespace std {
template <>
struct hash<MyCustomType> {
size_t operator()(MyCustomType const &x) const {
return x.guid;
}
};
}
float get_cost(MyCustomType const &) {
return rand();
}
int main() {
std::unordered_map<MyCustomType, float> costs;
auto cached_get_cost = [&](MyCustomType const &x) -> float {
if (costs.find(x) == costs.end()) {
costs.insert({x, get_cost(x)});
}
return costs.at(x);
};
auto my_custom_comparator = [&](MyCustomType const &lhs, MyCustomType const &rhs) -> bool {
return cached_get_cost(lhs) < cached_get_cost(rhs);
};
std::priority_queue<MyCustomType, std::vector<MyCustomType>, decltype(my_custom_comparator)> q{my_custom_comparator};
for (int i = 0; i < 10; i++) {
q.push(MyCustomType{});
}
while (!q.empty()) {
MyCustomType curr = q.top();
q.pop();
std::cout << cached_get_cost(curr) << std::endl;
}
}lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 81 at r6 (raw file):
return enumeration; }
Move to a separate allowed_machine_mappings file, this file is already getting rather complicated/large
Code quote:
std::vector<std::unordered_map<Node, MachineView>>
allowed_machine_mappings(MachineMappingContext const &context,
std::unordered_set<Node> const &nodes,
MachineSpecification const &resource) {
if (nodes.empty()) {
return {{}};
}
Node node = get_first(nodes);
std::vector<std::unordered_map<Node, MachineView>> partial_enumeration =
allowed_machine_mappings(context, set_minus(nodes, {node}), resource);
std::unordered_set<MachineView> allowed_machine_views_for_node =
context.allowed_machine_views(context.pcg.raw_graph.at(node), resource);
std::vector<std::unordered_map<Node, MachineView>> enumeration;
for (MachineView const &mv : allowed_machine_views_for_node) {
for (std::unordered_map<Node, MachineView> const &partial :
partial_enumeration) {
enumeration.push_back(merge_maps(
partial, std::unordered_map<Node, MachineView>{{node, mv}}));
}
}
return enumeration;
}
std::vector<std::unordered_map<DataflowOutput, MachineView>>
allowed_machine_mappings(MachineMappingContext const &context,
std::unordered_set<DataflowOutput> const &values,
MachineSpecification const &resource) {
std::unordered_set<Node> nodes;
for (DataflowOutput const &v : values) {
nodes.insert(v.node);
}
std::vector<std::unordered_map<Node, MachineView>> node_enumeration =
allowed_machine_mappings(context, nodes, resource);
std::vector<std::unordered_map<DataflowOutput, MachineView>> enumeration;
for (std::unordered_map<Node, MachineView> _node_enumeration :
node_enumeration) {
std::unordered_map<DataflowOutput, MachineView> _emumeration;
for (DataflowOutput const &v : values) {
_emumeration.emplace(v, _node_enumeration.at(v.node));
}
enumeration.push_back(_emumeration);
}
return enumeration;
}lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 187 at r6 (raw file):
GraphSplit graph_split = get_graph_split(decompn1, decompn2); OpenDataflowSubgraphResult subgraph_res1 = get_subgraph(
Is there really a point to the subgraphs at this point, or can we just rely on the SP decomposition for this information? I guess it's that the graph cost is cached, and if you cached it for the SP decomposition then you wouldn't be agnostic to different locations of the same graph?
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 192 at r6 (raw file):
sub_pcg_from_full_pcg(context.pcg).raw_graph, graph_split.second); std::unordered_set<DataflowOutput> split_outputs = transform(
FYI you can use parallel_tensor_guid_t for this once #1471 is merged
Code quote:
DataflowOutputlib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 199 at r6 (raw file):
&split_machine_views : allowed_machine_mappings(context, split_outputs, resource)) { std::unordered_map<OpenDataflowValue, MachineView> fixed_machine_views1 =
FYI you can use open_parallel_tensor_guid_t for this once #1471 is merged
Code quote:
OpenDataflowValuelib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 206 at r6 (raw file):
get_open_dataflow_values(subgraph_res2.graph)); for (auto const &[split_value, split_input] :
Suggestion:
for (auto const &[full_graph_value, subgraph_input] :lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 283 at r6 (raw file):
assert(contains( context.allowed_machine_views(context.pcg.raw_graph.at(node), resource), fixed_machine_views.at(any_output)));
Simplify
Code quote:
assert(contains(
context.allowed_machine_views(context.pcg.raw_graph.at(node), resource),
fixed_machine_views.at(any_output)));lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 302 at r6 (raw file):
return OpenDataflowValue(o); }), [&](OpenDataflowValue const &o) { return mv; });
Simplify, probably separate into two expressions
Code quote:
std::unordered_map<OpenDataflowValue, MachineView> output_mv_map =
generate_map(transform(get_outputs(context.pcg.raw_graph, node),
[](DataflowOutput const &o) {
return OpenDataflowValue(o);
}),
[&](OpenDataflowValue const &o) { return mv; });lib/compiler/src/compiler/machine_mapping/machine_mapping_cache.cc line 10 at r6 (raw file):
if (contains_key(cache, state)) { MachineMappingResult result = cache.at(state); return std::make_optional(result);
Suggestion:
return result;lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 100 at r6 (raw file):
}(); ParallelComputationGraph pcg_non_sp = [&] {
These should be in the SUBCASEs as they're not shared between them
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):
CHECK(bool(result.runtime > 0)); // TODO(@Mengdi Wu): fill it with actual cost // CHECK(result.runtime == xx);
Rather than running it, getting a cost, and then just hardcoding that into the file, I'd focus on generating testcases (such as restricting the allowed machine views to only a few, controlling the cost function, etc.) where you can actually set analytically derived values
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):
} SUBCASE("PCG is not sp-izable due to multiple inputs") {
You'd want separate testcases for this too that actually check the graph manipulation, not just get_optimal_machine_mapping-level tests
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 0 at r6 (raw file):
Rename machine_mapping.cc
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 18 at r6 (raw file):
MachineMapping result = combine(machine_mapping_0, machine_mapping_1); CHECK(result == combined); }
Suggestion:
TEST_CASE("combine(MachineMapping, MachineMapping)") {
MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
MachineMapping machine_mapping_0 = MachineMapping{{
{Node(0), machine_view_0},
}};
MachineMapping machine_mapping_1 = MachineMapping{{
{Node(1), machine_view_1},
}};
MachineMapping correct = MachineMapping{{
{Node(0), machine_view_0},
{Node(1), machine_view_1},
}};
MachineMapping result = combine(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 20 at r6 (raw file):
} TEST_CASE("nodes_are_disjoint") {
Suggestion:
TEST_CASE("nodes_are_disjoint(MachineMapping, MachineMapping)") {lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 28 at r6 (raw file):
{{Node(0), machine_view_0}, {Node(1), machine_view_1}}); CHECK(nodes_are_disjoint(machine_mapping_0, machine_mapping_1)); CHECK_FALSE(nodes_are_disjoint(machine_mapping_0, combined));
Suggestion:
MachineMapping machine_mapping_0({{Node(0), machine_view_0}});
SUBCASE("nodes are disjoint") {
MachineMapping machine_mapping_1({{Node(1), machine_view_1}});
bool correct = true;
bool result = nodes_are_disjoint(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}
SUBCASE("nodes are not disjoint") {
MachineMapping machine_mapping_1(
{{Node(0), machine_view_0}, {Node(1), machine_view_1}});
bool correct = false;
bool result = nodes_are_disjoint(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 11 at r6 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("machine_mapping_cache") {
Suggestion:
TEST_CASE("MachineMappingCache") {lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 77 at r6 (raw file):
CHECK(s1.runtime < inf.runtime); CHECK(s2.runtime < inf.runtime); }
Delete
Code quote:
TEST_CASE("get_infinity_machine_mapping_result") {
MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
MachineMapping machine_mapping_empty(
std::unordered_map<Node, MachineView>{});
MachineMapping machine_mapping_0({{Node(0), machine_view_0}});
MachineMapping machine_mapping_1({{Node(1), machine_view_1}});
MachineMapping combined(
{{Node(0), machine_view_0}, {Node(1), machine_view_1}});
MachineMappingResult s0(0, machine_mapping_empty);
MachineMappingResult s1(1, machine_mapping_0);
MachineMappingResult s2(2, machine_mapping_1);
MachineMappingResult inf = get_infinity_machine_mapping_result();
CHECK(s0.runtime < inf.runtime);
CHECK(s1.runtime < inf.runtime);
CHECK(s2.runtime < inf.runtime);
}lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):
SubParallelComputationGraph get_pcg_subgraph(ParallelComputationGraph const &, std::unordered_set<Node> const &);
Suggestion:
std::unordered_set<parallel_layer_guid_t> const &);lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 17 at r6 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Suggestion:
includes = [
"utils/graph/node/node.dtg.h",
"pcg/machine_view.dtg.h",
]
src_includes = [
"utils/hash/unordered_map.h",
"utils/fmt/unordered_map.h",
]lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 21 at r6 (raw file):
[[fields]] name = "machine_views" type = "std::unordered_map<::FlexFlow::Node, ::FlexFlow::MachineView>"
Suggestion:
type = "std::unordered_map<::FlexFlow::parallel_layer_guid_t, ::FlexFlow::MachineView>"
wmdi
left a comment
There was a problem hiding this comment.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 6 at r6 (raw file):
#include "machine_mapping.h" #include "machine_mapping_cache.h" #include "machine_mapping_context.dtg.h"
Done.
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 16 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This isn't going to allow access to the input and output tensor shapes of the operator, which are probably necessary?
I see. Will fix it when merging the machine view PR.
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is this overload for? Why does it not take a
SerialParallelDecomposition?
This is to further split the logics into multiple pieces. get_optimal_machine_mapping wraps global variables as MachineMappingContext and calls this function, and this function performs sp decomposition (and solves the un-sp-izable issue in a later version) and calls other overloads.
lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 4 at r6 (raw file):
#define _FLEXFLOW_COMPILER_MACHINE_MAPPING_H #include "machine_mapping.dtg.h"
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 8 at r6 (raw file):
namespace FlexFlow { MachineMapping combine(MachineMapping const &, MachineMapping const &);
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping_cache.h line 10 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Does it make sense to have a separate class/object for this, or this not just the
std::unordered_mapAPI?
I think it makes sense to wrap up the save and load logic as they handle some existence cases.
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 7 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h line 14 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why this vs just calling
minor something?
This function only compares MachineMappingResults according to the runtime. It does not make sense to have an operator< for MachineMappingResult that only compares runtime, which is required by min.
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.struct.toml line 9 at r5 (raw file):
includes = [ "machine_mapping.dtg.h",
Done.
lib/compiler/src/compiler/machine_mapping/machine_mapping_cache.cc line 10 at r6 (raw file):
if (contains_key(cache, state)) { MachineMappingResult result = cache.at(state); return std::make_optional(result);
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 18 at r6 (raw file):
MachineMapping result = combine(machine_mapping_0, machine_mapping_1); CHECK(result == combined); }
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 20 at r6 (raw file):
} TEST_CASE("nodes_are_disjoint") {
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 28 at r6 (raw file):
{{Node(0), machine_view_0}, {Node(1), machine_view_1}}); CHECK(nodes_are_disjoint(machine_mapping_0, machine_mapping_1)); CHECK_FALSE(nodes_are_disjoint(machine_mapping_0, combined));
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Rename
machine_mapping.cc
What does this mean?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 11 at r6 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("machine_mapping_cache") {
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 77 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 17 at r6 (raw file):
"utils/hash/unordered_map.h", "utils/fmt/unordered_map.h", ]
Done.
wmdi
left a comment
There was a problem hiding this comment.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What about moving
MachineMappingCacheout of this class soMachineMappingContextcan be passed asconst &andMachineMappingCachealone gets passed as mutable&?
I think MachineMappingCache should be part of the context, and it is fine to have a mutable context.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 81 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move to a separate
allowed_machine_mappingsfile, this file is already getting rather complicated/large
Done.
wmdi
left a comment
There was a problem hiding this comment.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 206 at r6 (raw file):
get_open_dataflow_values(subgraph_res2.graph)); for (auto const &[split_value, split_input] :
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 283 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Simplify
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 302 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Simplify, probably separate into two expressions
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 100 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
These should be in the
SUBCASEs as they're not shared between them
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Rather than running it, getting a cost, and then just hardcoding that into the file, I'd focus on generating testcases (such as restricting the allowed machine views to only a few, controlling the cost function, etc.) where you can actually set analytically derived values
In terms of actual cost, I mean the cost model we have for a pcg. Even for analytical results, we need a precise cost model.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You'd want separate testcases for this too that actually check the graph manipulation, not just
get_optimal_machine_mapping-level tests
Done.
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):
SubParallelComputationGraph get_pcg_subgraph(ParallelComputationGraph const &, std::unordered_set<Node> const &);
Why?
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 1 of 37 files at r4, 7 of 17 files at r7, all commit messages.
Reviewable status: 33 of 43 files reviewed, 24 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 12 at r7 (raw file):
allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<Node> const &nodes, MachineSpecification const &resource);
Suggestion:
std::vector<std::unordered_map<parallel_layer_guid_t, MachineView>>
allowed_machine_mappings(MachineMappingContext const &context,
std::unordered_set<parallel_layer_guid_t> const &nodes,
MachineSpecification const &resource);lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 16 at r7 (raw file):
std::vector<std::unordered_map<DataflowOutput, MachineView>> allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<DataflowOutput> const &values,
Suggestion:
std::unordered_set<parallel_tensor_guid_t> const &values,lib/compiler/test/src/test_graph_optimize_state.cc line 70 at r3 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This will be fixed in another PR.
Isn't this a trivial fix?
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
In terms of actual cost, I mean the cost model we have for a pcg. Even for analytical results, we need a precise cost model.
In a way yes, but you control the cost model: you can add a custom implementation of CostEstimator that returns whatever numbers you think would be useful for testing. We absolutely do want to (at some point) test the whole compiler+costestimator+machineviews+everything stack, but even if all of that was present we would also want to have tests that just that compiler is correct in isolation, which is what most of these tests should be--provide an arbitrary tiny set of allowed machine views (say, two of them), create a tiny PCG (with, say, 2 operators in parallel), and write a CostEstimator subclass that gives you back some chosen constants (1 for each of the possible node-machineview assignments). Then solve for what the resulting value of get_optimal_machine_mappings should be, run the function, and check that the returned value is the same as what you solved analytically.
At this point there shouldn't be anything stopping you from testing pretty much the whole Unity DP in isolation, and once everything is merged we'll add a couple end-to-end tests just to make sure there weren't any unexpected issues at the boundaries between components
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 12 at r7 (raw file):
auto allowed_machine_views1 = [&](ParallelLayerAttrs const &, MachineSpecification const &) { // TODO(@Mengdi Wu): Replace it with actual allowed machine views when
Ideally don't do this--these tests are supposed to confirm that get_optimal_machine_mapping is correct, not that get_allowed_machine_views is correct. If you call get_allowed_machine_views in this test and the test fails, you don't know whether it is because get_optimal_machine_mapping is wrong or if it s because get_allowed_machine_views is wrong
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
What does this mean?
Rename from test_machine_mapping.cc to machine_mapping.cc--the current convention is to not have the test_ prefix, as they're already under the test directory
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 25 at r7 (raw file):
combine_disjoint_mappings(machine_mapping_0, machine_mapping_1); CHECK(result == correct); }
Add a SUBCASE to test the failure case
Suggestion:
TEST_CASE("combine_disjoint_mappings(MachineMapping, MachineMappping)") {
MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
SUBCASE("input machine mappings are disjoint") {
MachineMapping machine_mapping_0 = MachineMapping({
{Node(0), machine_view_0},
});
MachineMapping machine_mapping_1 = MachineMapping({
{Node(1), machine_view_1},
});
MachineMapping correct = MachineMapping({
{Node(0), machine_view_0},
{Node(1), machine_view_1},
});
MachineMapping result =
combine_disjoint_mappings(machine_mapping_0, machine_mapping_1);
CHECK(result == correct);
}
SUBCASE("input machine mappings are not disjoint") {
...
}
}lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
Why?
parallel_layer_guid_t is a wrapper for Node that we use for representing Nodes in a ParallelComputationGraph. It primarily improves readability and helps prevent accidentally passing the wrong node to things
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 10 of 17 files at r7.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 10 at r7 (raw file):
std::vector<std::unordered_map<Node, MachineView>> allowed_machine_mappings(MachineMappingContext const &context,
The type signatures of these methods are a bit odd? At least it's not 100% clear for me what they're doing based on the names and input types, especially since there are two overloads? Can you either clarify the names a bit and/or add doxygen docstrings giving a brief description of what they do?
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This is to further split the logics into multiple pieces.
get_optimal_machine_mappingwraps global variables asMachineMappingContextand calls this function, and this function performs sp decomposition (and solves the un-sp-izable issue in a later version) and calls other overloads.
I think it's probably better to think of the SP decomposition as coming from outside of the optimal machine mapping algorithm, as with Pietro's stuff it at least seems more like creating an SP-ization of a given PCG is a separate step than just part of optimal machine mapping algorithm, but this isn't a bit deal and can always get refactored later. That said, it seems a bit odd to have this function, which conceptually seems to do entirely different things from the other get_optimal_machine_mapping_internal overloads, have the same name as them, so if you do prefer to use this structure it should at least be renamed to something clearer--ideally I should be able to make a pretty accurate educated guess of what a function does from its name and arguments, and I don't think that's the case here.
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
I think
MachineMappingCacheshould be part of the context, and it is fine to have a mutable context.
Is there a significant downside to moving it out? It would make the code a bit more readable as it would make it more obvious what gets mutated and what doesn't
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 9 at r7 (raw file):
namespace FlexFlow { std::vector<std::unordered_map<Node, MachineView>>
Might be cleaner to define a machine_mapping.struct.toml wrapper type around std::unordered_map<parallel_layer_guid_t, MachineView> so this turns into the more readable std::vector<MachineMapping>? Also why is this a std::vector and not a std::unordered_set/std::set? Should there be duplicates or does order matter?
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 14 at r7 (raw file):
MachineSpecification const &resource) { if (nodes.empty()) { return {{}};
Why not an empty vector as a return value? Does an empty machine mapping really have a meaning?
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 29 at r7 (raw file):
} } return enumeration;
Suggestion:
Node curr_node = get_first(nodes);
std::unordered_set<Node> other_nodes = set_minus(nodes, {node});
std::vector<std::unordered_map<Node, MachineView>> other_node_mappings_from_recursion =
allowed_machine_mappings(context, other_nodes, resource);
std::unordered_set<MachineView> allowed_machine_views_for_curr_node =
context.allowed_machine_views(context.pcg.raw_graph.at(node), resource);
std::vector<std::unordered_map<Node, MachineView>> result;
for (MachineView const &for_curr_node : allowed_machine_views_for_curr_node) {
for (std::unordered_map<Node, MachineView> const &for_other_nodes :
other_node_mappings_from_recursion) {
MachineMapping merged = merge_disjoint_machine_mappings(
MachineView{{
{curr_node, for_cudd_node},
}},
for_other_nodes,
});
enumeration.push_back(merged);
}
}
return result;lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 39 at r7 (raw file):
for (DataflowOutput const &v : values) { nodes.insert(v.node); }
Suggestion:
std::unordered_set<Node> nodes transform(values, [](DataflowOutput const &v) { return v.node; });lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 47 at r7 (raw file):
for (std::unordered_map<Node, MachineView> _node_enumeration : node_enumeration) { std::unordered_map<DataflowOutput, MachineView> _emumeration;
Don't use leading underscore, if you need to distinguish it from enumeration it's probably better to come up with a clearer name as enumeration is a bit unclear (what is being enumerated? does this have to do with enumerate from utils/containers/enumerate.h?)
Code quote:
_emumeration;lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 257 at r7 (raw file):
std::unordered_map<OpenDataflowValue, MachineView> output_mv_map = generate_map(outputs_of_node, [&](OpenDataflowValue const &o) { return mv; });
Suggestion:
[&](OpenDataflowValue const &) { return mv; });lib/compiler/test/src/test_graph_optimize_state.cc line 69 at r3 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
This will be fixed in another PR.
In that case can you create an issue for that and link it here?
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
Done.
I just see a comment added? Are we planning to add the multiple-input handling in this PR or in a separate one?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 12 at r7 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("MachineMappingCache") { ParallelComputationGraph pcg = [&] {
Why create a PCG here? It looks like the tests don't actually need it--they just need an SPDecomposition object?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 58 at r7 (raw file):
MachineSpecification machine_spec(1, 1, 1, 1, 1); MachineMappingState state0(subgraph0, machine_spec, {});
Prefer curly-brace initialization for consistency with the rest of the FF codebase
Suggestion:
MachineMappingState state0 = MachineMappingState{
subgraph0, machine_spec, {}};lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 72 at r7 (raw file):
cache.save(state0, result0); CHECK(cache.load(state0).value() == result0);
Break into SUBCASEs with titles that explain what you're testing--right now it's just a sea of checks which makes this test rather difficult to understand
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 81 at r7 (raw file):
CHECK(cache.load(state2) == std::nullopt); cache.save(state2, result2);
Include failure cases--what happens if I save a key twice?
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 21 at r7 (raw file):
MachineMappingResult s2(2, machine_mapping_1); MachineMappingResult result0 = sequential_combine(s0, s1);
Each of these checks/function evaluations should have a SUBCASE with a title explaining what it's testing/why it's there. Right now it's just a bunch of checks, which makes it difficult to understand quickly what high-level properties you're trying to verify with each. If you can't come up with a concise property that's being verified, the test might need some restructuring/redesigning
(and same with the parallel_combine testcase below, and the minimize_runtime testcase below that)
For an example of what I mean, see https://github.com/flexflow/FlexFlow/blob/add-candle-uno-pcg/lib/op-attrs/test/src/op-attrs/ops/concat.cc#L175-L312--each of the failure cases and success cases has a separate SUBCASE with a title indicating what is being checked, or for a simpler example, https://github.com/flexflow/FlexFlow/blob/add-candle-uno-pcg/lib/utils/test/src/utils/containers/try_merge_nondisjoint_unordered_maps.cc#L9-L75
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 23 at r7 (raw file):
MachineMappingResult result0 = sequential_combine(s0, s1); CHECK(result0.runtime == 1); CHECK(result0.machine_mapping == machine_mapping_0);
Suggestion:
MachineMappingResult result0 = sequential_combine(s0, s1);
MachineMappingResult correct = MachineMappingResult{machine_mapping_0, 1};
wmdi
left a comment
There was a problem hiding this comment.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 12 at r7 (raw file):
allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<Node> const &nodes, MachineSpecification const &resource);
Done.
lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 16 at r7 (raw file):
std::vector<std::unordered_map<DataflowOutput, MachineView>> allowed_machine_mappings(MachineMappingContext const &context, std::unordered_set<DataflowOutput> const &values,
Done.
lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I think it's probably better to think of the SP decomposition as coming from outside of the optimal machine mapping algorithm, as with Pietro's stuff it at least seems more like creating an SP-ization of a given PCG is a separate step than just part of optimal machine mapping algorithm, but this isn't a bit deal and can always get refactored later. That said, it seems a bit odd to have this function, which conceptually seems to do entirely different things from the other
get_optimal_machine_mapping_internaloverloads, have the same name as them, so if you do prefer to use this structure it should at least be renamed to something clearer--ideally I should be able to make a pretty accurate educated guess of what a function does from its name and arguments, and I don't think that's the case here.
I see. But I don't think it is a problem to have this signature. First, it is only used internally, so it has little effect on other parts. And it actually make the internal implementation more convenient. Second, it actually has the similar functionality as other functions that have the same name but just different format of the inputs. We should name functions according to their functionality instead of the detailed implementation.
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is there a significant downside to moving it out? It would make the code a bit more readable as it would make it more obvious what gets mutated and what doesn't
I feel like we should put global variables in MachineMappingContext as possible and only explicitly pass the variables that belong to DP states. I think it is more clear this way.
lib/compiler/src/unity_algorithm.cc line 39 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it be cleaner here to, rather than having a custom
GraphOptimizeStatetype whose hash andoperator==ignore the runtime cost, instead have this be aDeduplicatedPriorityQueue<SomeWrapperAroundAPCG, CustomComparator>where theCustomComparatoris something like below, whereMyCustomTypeis analagous toSomeWrapperAroundAPCG, and I suppose instead of astd::unordered_map<MyCustomType, float>you'd have aMachineMappingCache
Sounds a good idea and we will leave all the MCMC search part to another PR.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 9 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be cleaner to define a
machine_mapping.struct.tomlwrapper type aroundstd::unordered_map<parallel_layer_guid_t, MachineView>so this turns into the more readablestd::vector<MachineMapping>? Also why is this astd::vectorand not astd::unordered_set/std::set? Should there be duplicates or does order matter?
It is a vector because no deduplication is required here.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 14 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not an empty vector as a return value? Does an empty machine mapping really have a meaning?
It means the only machine mapping for an empty set of input nodes.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 29 at r7 (raw file):
} } return enumeration;
Done.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 39 at r7 (raw file):
for (DataflowOutput const &v : values) { nodes.insert(v.node); }
I don't think we should use transform since it is not a one-to-one mapping but a many-to-one mapping. We should not expect transform to handle many-to-one mapping not matter whether it actually does.
lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 47 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Don't use leading underscore, if you need to distinguish it from
enumerationit's probably better to come up with a clearer name asenumerationis a bit unclear (what is being enumerated? does this have to do withenumeratefromutils/containers/enumerate.h?)
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 187 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is there really a point to the subgraphs at this point, or can we just rely on the SP decomposition for this information? I guess it's that the graph cost is cached, and if you cached it for the SP decomposition then you wouldn't be agnostic to different locations of the same graph?
Done.
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 257 at r7 (raw file):
std::unordered_map<OpenDataflowValue, MachineView> output_mv_map = generate_map(outputs_of_node, [&](OpenDataflowValue const &o) { return mv; });
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
In a way yes, but you control the cost model: you can add a custom implementation of
CostEstimatorthat returns whatever numbers you think would be useful for testing. We absolutely do want to (at some point) test the whole compiler+costestimator+machineviews+everything stack, but even if all of that was present we would also want to have tests that just that compiler is correct in isolation, which is what most of these tests should be--provide an arbitrary tiny set of allowed machine views (say, two of them), create a tiny PCG (with, say, 2 operators in parallel), and write aCostEstimatorsubclass that gives you back some chosen constants (1 for each of the possible node-machineview assignments). Then solve for what the resulting value ofget_optimal_machine_mappingsshould be, run the function, and check that the returned value is the same as what you solved analytically.At this point there shouldn't be anything stopping you from testing pretty much the whole Unity DP in isolation, and once everything is merged we'll add a couple end-to-end tests just to make sure there weren't any unexpected issues at the boundaries between components
Done.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I just see a comment added? Are we planning to add the multiple-input handling in this PR or in a separate one?
In a separate one.
lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 12 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ideally don't do this--these tests are supposed to confirm that
get_optimal_machine_mappingis correct, not thatget_allowed_machine_viewsis correct. If you callget_allowed_machine_viewsin this test and the test fails, you don't know whether it is becauseget_optimal_machine_mappingis wrong or if it s becauseget_allowed_machine_viewsis wrong
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Rename from
test_machine_mapping.cctomachine_mapping.cc--the current convention is to not have thetest_prefix, as they're already under thetestdirectory
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 25 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add a
SUBCASEto test the failure case
Done.
lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 12 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why create a PCG here? It looks like the tests don't actually need it--they just need an SPDecomposition object?
Yes, this PCG is only used to create the SPD
lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
parallel_layer_guid_tis a wrapper forNodethat we use for representingNodes in aParallelComputationGraph. It primarily improves readability and helps prevent accidentally passing the wrong node to things
Done.
lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 21 at r6 (raw file):
[[fields]] name = "machine_views" type = "std::unordered_map<::FlexFlow::Node, ::FlexFlow::MachineView>"
Done.
Marsella8
left a comment
There was a problem hiding this comment.
Reviewed 1 of 3 files at r9, 106 of 150 files at r10, 117 of 145 files at r11, 51 of 60 files at r12, 23 of 23 files at r13, all commit messages.
Reviewable status: all files reviewed, 52 unresolved discussions (waiting on @wmdi)
lib/compiler/include/compiler/machine_mapping/include_unconstrained.struct.toml line 2 at r11 (raw file):
namespace = "FlexFlow" name = "IncludeUnconstrained"
maybe enum instead? (just an idea)
lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h line 9 at r12 (raw file):
namespace FlexFlow { [[nodiscard]] MachineMappingResult infeasible_machine_mapping_result();
any strong reason for [[nodiscard]]?
lib/compiler/include/compiler/machine_mapping/machine_mapping_problem_tree/mm_problem_tree_parallel_split_label.struct.toml line 2 at r11 (raw file):
namespace = "FlexFlow" name = "MMProblemTreeParallelSplitLabel"
why is it empty?
lib/compiler/src/compiler/machine_mapping/get_machine_resource_splits.cc line 27 at r12 (raw file):
result.insert(std::make_pair(sub_resource2, sub_resource1)); }
note that we are only partitioning across GPUs here.
Currently, I consider how MachineSpecification fuses GPU and CPU nodes to be a bit awkard (e.g. it doesn't even have a separate intra and inter node bandwitdh per parameters), I think it should instead have as attributes num_nodes, num_devices_per_node, device_type, inter_node_bandwidth, and intra_node_bandwidth, since this way we can ignore the CPU stuff a lot more and only use it when needed.
Also with the i*=2 in this part of the code feels a bit arbitrary?
lib/compiler/src/compiler/machine_mapping/get_tensor_set_movement_across_split.cc line 4 at r13 (raw file):
#include "compiler/machine_mapping/abstracted_tensor_set_movement/abstracted_tensor_set_movement.h" #include "compiler/machine_mapping/abstracted_tensor_set_movement/get_abstracted_tensor_set_movement_across_split.h" #include "compiler/machine_mapping/transitive_reduced_pcg.h"
some of this imports seem superfluous
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 26 at r13 (raw file):
for (MachineMappingResult const &candidate : candidates) { result = minimize_runtime(result, candidate);
optionally foldl, but it's honestly pretty clear as it is
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 51 at r13 (raw file):
ParallelLayerGuidObliviousMachineMapping mapping = [&] { if (parallel_split_transformation.has_value()
I think this should be if (parallel_split_transformation.has_value()) then evaluate whether it should be LthenR or RthenL, here if the split_transformation has no value it calls it with LthenR
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 97 at r13 (raw file):
MachineMappingResult minimize_runtime(MachineMappingResult const &maybe_m1, MachineMappingResult const &maybe_m2) { FeasibleMachineMappingResult m1 = ({
This pattern to get the feasible results seems to appear fairly often here, maybe pacakge it into a separate function?
lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 111 at r13 (raw file):
}); if (m2.runtime < m1.runtime) {
ternary optionally
lib/compiler/test/src/compiler/machine_mapping/machine_mapping_problem_tree/get_machine_mapping_problem_tree.cc line 63 at r13 (raw file):
UnmappedOpCostEstimateKey input_key = make_input_key(input_shape); PCGBinarySPDecomposition sp_decomposition = \
remove
Code quote:
\lib/local-execution/src/ops/element_binary.h line 1 at r13 (raw file):
#ifndef _FLEXFLOW_ELEMENT_BINARY_H
change for all src/ops/ to also include the path? not a big deal
Code quote:
_FLEXFLOW_ELEMENT_BINARY_Hlib/local-execution/src/ops/layer_norm.h line 1 at r13 (raw file):
#ifndef _FLEXFLOW_RUNTIME_SRC_OPS_LAYER_NORM_H
LOCAL_EXECUTION
Code quote:
RUNTIMElib/utils/include/utils/full_binary_tree/find_paths_to_leaf.h line 16 at r13 (raw file):
template <typename ParentLabel, typename LeafLabel> std::unordered_set<BinaryTreePath> find_paths_to_leaf(FullBinaryTree<ParentLabel, LeafLabel> const &tree,
super clean implementation
lib/utils/include/utils/graph/digraph/algorithms/get_edges_from_subgraph_to_subgraph.h line 8 at r13 (raw file):
std::unordered_set<DirectedEdge> get_edges_from_subgraph_to_subgraph(DiGraphView const &, std::unordered_set<Node> const &,
add variable names for clarity
lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/generic_binary_sp_decomposition_tree/make.h line 7 at r13 (raw file):
namespace FlexFlow {
maybe a couple of using SomeShorterName = GenericBinarySPDecompositionTree<SeriesLabel, ParallelLabel, LeafLabel> and same for the others?
I know it's not really standard for the codebase, but I think in this specific case it could help readability quite a bit.
lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/leaf_only_binary_sp_decomposition_tree/leaf_only_binary_parallel_split_label.struct.toml line 2 at r13 (raw file):
namespace = "FlexFlow" name = "LeafOnlyBinaryParallelSplitLabel"
why empty?
lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/leaf_only_binary_sp_decomposition_tree/leaf_only_binary_series_split_label.struct.toml line 11 at r13 (raw file):
"rapidcheck", ]
why empty?
lib/utils/src/utils/graph/dataflow_graph/algorithms/transitive_reduced_dataflow_graph/transitive_reduced_dataflow_graph.cc line 7 at r13 (raw file):
TransitiveReducedDataflowGraphView get_dataflow_graph_transitive_reduction(DataflowGraphView const &g) { DiGraphView as_digraph = g;
use an explicit to_digraph to cast it (just so this won't break in case we decide to modify the coercion rules)
lib/utils/test/src/utils/containers/unordered_map_from_pairs.cc line 42 at r13 (raw file):
{1, "b"}, };
I think we should throw an error in this case
lib/compiler/src/compiler/machine_mapping/get_allowed_machine_views_list.cc line 44 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be more readable as something like this? We'd need to add a
transform_tuplesfunction or something like it that takes in a container ofstd::tuple<A, B, ...>and a lambda matchingstd::function<T(A const &, B const &, ...)>and combines the destructuring andtransformoperations, but I'm sure I or @Marsella8 would be happy to add this tocontainersas I can think of quite a few places where it would be useful. Not a high priority and the current code is fine as is, just something I was thinking of
yeah seems like a good function to add, let me know if it is needed
lib/compiler/src/compiler/machine_mapping/machine_mapping_constraints.cc line 75 at r13 (raw file):
} else { if (current_machine_view.value() != machine_view) { throw mk_runtime_error(fmt::format("with_additional_layer_machine_views received machine view assignment for layer {} "
with_additional_constraints
Code quote:
with_additional_layer_machine_views
wmdi
left a comment
There was a problem hiding this comment.
Reviewed 2 of 15 files at r1, 1 of 9 files at r3, 9 of 37 files at r4, 1 of 17 files at r7, 4 of 21 files at r8, 1 of 3 files at r9, 103 of 150 files at r10, 84 of 145 files at r11, 40 of 60 files at r12, 18 of 23 files at r13, 83 of 83 files at r14, all commit messages.
Reviewable status: all files reviewed, 52 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I agree up to the point of grouping mutable and non-mutable global variables into the same type as then the mutation properties become unclear (how do I know that your function isn't also mutating the
pcg?), especially since only one of the members ever mutable. This is one case where I have a strong preference that we splitMachineMappingCacheout so that the function signatures remain clear about what is safe from mutation
Done.
Description of changes:
This PR recovers the device mapping algorithm (the DP-based algorithm) used in Unity.
Related Issues:
Linked Issues:
Issues closed by this PR:
This change is