@0x53A
Contributor

0x53A commented Sep 29, 2017

Program Code: https://gist.github.com/0x53A/9da43cb0f7c42f1b629b888fe7a68224
(For the inline version, I marked let iter action (source : seq<'T>) = as inline.)

This code was compiled into a release console exe.
TargetFW: net461
TargetRuntime: I have Win10CU with net47
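
The gist is not reproduced in this thread. As a rough sketch of the two variants being compared, assuming a MySeq module with iter and iterInline as used in the debuggability example further down (the function bodies here are an assumption about a typical implementation, not the gist's actual code):

```fsharp
module MySeq =
    // Ordinary function: the callback arrives as a closure object and is
    // invoked through FSharpFunc.Invoke for every element.
    let iter (action: 'T -> unit) (source: seq<'T>) =
        use e = source.GetEnumerator()
        while e.MoveNext() do
            action e.Current

    // Same body, but marked inline: the F# compiler copies the loop into
    // each call site, where it may be able to avoid the indirect call.
    let inline iterInline (action: 'T -> unit) (source: seq<'T>) =
        use e = source.GetEnumerator()
        while e.MoveNext() do
            action e.Current

// Both variants produce the same result; only the generated code differs.
let s = Seq.init 2000 int64

let a = ref 0L
s |> MySeq.iter (fun i -> a.Value <- a.Value + i)

let b = ref 0L
s |> MySeq.iterInline (fun i -> b.Value <- b.Value + i)

printfn "%d %d" a.Value b.Value
```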

Size:

             AnyCpu (32bit)   x64
inline       45kB             45kB
not inline   55kB             55kB

Speed:

"Benchmark" Program:

open System.Diagnostics
open System.IO

let dir = @"C:\Users\lr\Source\Repos\IterInlineTest\IterInlineTest\bin\Release"

let files = [
    "IterInlineTest-AnyCpu-Inline.exe"
    "IterInlineTest-AnyCpu-NotInline.exe"
    "IterInlineTest-x64-inline.exe"
    "IterInlineTest-x64-NotInline.exe"
]

for i in 1..10 do
    for f in files do
        printfn "%i - %s" i f
        let fullPath = Path.Combine(dir, f)
        let proc = Process.Start(ProcessStartInfo(fullPath, UseShellExecute=false))
        proc.WaitForExit()

Output:

1 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.0717219 seconds
1 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.103574 seconds
1 - IterInlineTest-x64-inline.exe
Time: 6.3089797 seconds
1 - IterInlineTest-x64-NotInline.exe
Time: 6.4022095 seconds
2 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.1084784 seconds
2 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1381134 seconds
2 - IterInlineTest-x64-inline.exe
Time: 6.2733213 seconds
2 - IterInlineTest-x64-NotInline.exe
Time: 6.2912988 seconds
3 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.201093 seconds
3 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1094063 seconds
3 - IterInlineTest-x64-inline.exe
Time: 6.2986958 seconds
3 - IterInlineTest-x64-NotInline.exe
Time: 6.3382166 seconds
4 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.1207324 seconds
4 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1065901 seconds
4 - IterInlineTest-x64-inline.exe
Time: 6.3112234 seconds
4 - IterInlineTest-x64-NotInline.exe
Time: 6.3317693 seconds
5 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.0858645 seconds
5 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1014179 seconds
5 - IterInlineTest-x64-inline.exe
Time: 6.4862915 seconds
5 - IterInlineTest-x64-NotInline.exe
Time: 6.3200616 seconds
6 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.0870462 seconds
6 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1001479 seconds
6 - IterInlineTest-x64-inline.exe
Time: 6.290876 seconds
6 - IterInlineTest-x64-NotInline.exe
Time: 6.2454101 seconds
7 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.1010845 seconds
7 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1180521 seconds
7 - IterInlineTest-x64-inline.exe
Time: 6.3788174 seconds
7 - IterInlineTest-x64-NotInline.exe
Time: 6.3946524 seconds
8 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.0704105 seconds
8 - IterInlineTest-AnyCpu-NotInline.exe
Time: 9.3288509 seconds
8 - IterInlineTest-x64-inline.exe
Time: 6.3131007 seconds
8 - IterInlineTest-x64-NotInline.exe
Time: 6.3239749 seconds
9 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.0876243 seconds
9 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1352227 seconds
9 - IterInlineTest-x64-inline.exe
Time: 6.4821192 seconds
9 - IterInlineTest-x64-NotInline.exe
Time: 6.3198395 seconds
10 - IterInlineTest-AnyCpu-Inline.exe
Time: 8.0812481 seconds
10 - IterInlineTest-AnyCpu-NotInline.exe
Time: 8.1302496 seconds
10 - IterInlineTest-x64-inline.exe
Time: 6.3678513 seconds
10 - IterInlineTest-x64-NotInline.exe
Time: 6.3046622 seconds

As you can see, the x64 build is always faster than the 32-bit build. The results have a lot of jitter, but the inline version is faster in most runs.
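
To put rough numbers on the jitter, the timings can be summarized with a few lines of F#. This is only a sketch: the values are copied from the output above, and the helper names median and inlineWins are made up for this illustration.

```fsharp
// Timings in seconds, copied from the ten runs listed above.
let anyCpuInline    = [ 8.0717219; 8.1084784; 8.201093; 8.1207324; 8.0858645
                        8.0870462; 8.1010845; 8.0704105; 8.0876243; 8.0812481 ]
let anyCpuNotInline = [ 8.103574; 8.1381134; 8.1094063; 8.1065901; 8.1014179
                        8.1001479; 8.1180521; 9.3288509; 8.1352227; 8.1302496 ]
let x64Inline       = [ 6.3089797; 6.2733213; 6.2986958; 6.3112234; 6.4862915
                        6.290876; 6.3788174; 6.3131007; 6.4821192; 6.3678513 ]
let x64NotInline    = [ 6.4022095; 6.2912988; 6.3382166; 6.3317693; 6.3200616
                        6.2454101; 6.3946524; 6.3239749; 6.3198395; 6.3046622 ]

// Median is less sensitive to outliers (such as the 9.3 s run) than the mean.
let median (xs: float list) =
    let sorted = xs |> List.sort |> List.toArray
    let n = sorted.Length
    if n % 2 = 1 then sorted.[n / 2]
    else (sorted.[n / 2 - 1] + sorted.[n / 2]) / 2.0

// Number of runs in which the inline build finished first.
let inlineWins (inl: float list) (notInl: float list) =
    List.zip inl notInl
    |> List.filter (fun (a, b) -> a < b)
    |> List.length

printfn "AnyCpu: median %.4f vs %.4f, inline faster in %d of 10 runs"
    (median anyCpuInline) (median anyCpuNotInline) (inlineWins anyCpuInline anyCpuNotInline)
printfn "x64:    median %.4f vs %.4f, inline faster in %d of 10 runs"
    (median x64Inline) (median x64NotInline) (inlineWins x64Inline x64NotInline)
```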


I also created a "real" benchmark using BenchmarkDotNet: https://gist.github.com/0x53A/320abe9890af709c510f02abdabd410a

BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-6700K CPU 4.00GHz (Skylake), ProcessorCount=8
Frequency=3914059 Hz, Resolution=255.4893 ns, Timer=TSC
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2110.0
  DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2110.0

Method       Mean       Error       StdDev
Inlined      264.2 us   0.9388 us   0.8323 us
NonInlined   271.3 us   3.3525 us   3.1359 us

BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-6700K CPU 4.00GHz (Skylake), ProcessorCount=8
Frequency=3914059 Hz, Resolution=255.4893 ns, Timer=TSC
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2110.0
  DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2110.0

Method       Mean       Error       StdDev
Inlined      204.2 us   0.4885 us   0.4331 us
NonInlined   207.5 us   1.5505 us   1.3745 us

Debuggability:

This one was a bit disappointing:

[<EntryPoint>]
let main argv = 

    let thisIsAnOutsideVar = "hello"
    
    let s = Seq.init 2000 (id >> int64)
    let mutable counter = 0L

    s |> MySeq.iter (fun i ->
        counter <- counter + i)

    s |> MySeq.iterInline (fun i ->
        counter <- counter + i)

        
    printfn "%s" thisIsAnOutsideVar
    printfn "%A" argv

    0 // return an integer exit code

Debug:

Well, i and the closure are visible, as expected:
image

This is a bit disappointing, the call to iter was inlined, but the lambda was not erased, so I am still inside the Invoke and the outside variables are not visible:
image

Release:

image

I couldn't even set a breakpoint into the callback for iterInline...


Conclusion:

Inlining improves performance, but does not improve debuggability. I would still prefer to explicitly erase *.iter to a for loop, because that one also erases the lambda into the outer scope.
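
To illustrate what erasing *.iter to a for loop would mean at the source level, here is a hand-written sketch of the two forms. It is the manual equivalent only, not actual compiler output, and the function names are hypothetical.

```fsharp
// With List.iter the body is a lambda. Because the lambda captures the
// mutable local, the compiler stores `counter` in a ref cell, and a
// breakpoint in the body typically lands in the closure's Invoke method.
let sumWithIter (xs: int64 list) =
    let mutable counter = 0L
    xs |> List.iter (fun i -> counter <- counter + i)
    counter

// With a for loop there is no lambda: the body runs directly in this
// method, so `counter` and all other locals stay in scope for the debugger.
let sumWithFor (xs: int64 list) =
    let mutable counter = 0L
    for i in xs do
        counter <- counter + i
    counter

printfn "%d %d" (sumWithIter [ 1L .. 10L ]) (sumWithFor [ 1L .. 10L ])
```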


Now my question is: Which functions should be inlined?

I strongly assume that any benchmarks for List.iter and Array.iter would give similar results. What about iteri? What about Option.[map/iter/forall]?

It is my strong guess that all functions with a small body that accept a callback would benefit a lot from this. For small functions that don't accept a callback, it is probably not so clear-cut.
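
As an illustration of the kind of function meant here (small body plus callback), a sketch with a hypothetical MyOption module. These are not the FSharp.Core definitions, just stand-ins of the same shape:

```fsharp
module MyOption =
    // Each body is a single match, so copying it into the call site is cheap
    // and gives the optimizer a chance to see the callback directly.
    let inline map (mapping: 'T -> 'U) (opt: 'T option) =
        match opt with
        | Some x -> Some (mapping x)
        | None -> None

    let inline iter (action: 'T -> unit) (opt: 'T option) =
        match opt with
        | Some x -> action x
        | None -> ()

    let inline forall (predicate: 'T -> bool) (opt: 'T option) =
        match opt with
        | Some x -> predicate x
        | None -> true

let doubled = Some 21 |> MyOption.map (fun x -> x * 2)
let allPositive = None |> MyOption.forall (fun (x: int) -> x > 0)
Some "hello" |> MyOption.iter (printfn "%s")
```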

@0x53A
Contributor Author

0x53A commented Sep 29, 2017

For completeness' sake, this compares #3662 against the fsc from the latest FSharp.Compiler.Tools NuGet package:

https://gist.github.com/0x53A/bbc37d5d3c642d1a9d4a459f2598fd27

My PR improves performance by 10%.

I didn't implement the erasure for Seq.iter, only for List.iter, so it can't really be compared to the benchmarks in the first post.

@0x53A changed the title from "mark Seq.iter inline" to "[Discussion] mark Seq.iter inline" on Sep 29, 2017
@saul
Contributor

saul commented Sep 30, 2017

With regards to the debuggability, you can move up the call stack to see the other locals.

@forki
Contributor

forki commented Sep 30, 2017 via email

@zpodlovics

What about functions with lots of code in the function body and/or non-performance-critical code? In some hotspot cases you'll need the inlined version; in other, non-hotspot cases you'll need the non-inlined version. There is no universal solution. How about providing multiple modules with different inlining levels? Specialization would be easy with a local module alias or aliases (if you want to mix inline/noinline as needed), and you could do the specialization step by step for every hotspot.

Something like this:

open System.Runtime.CompilerServices

module FooOperations = begin
  // generic code 
  [<MethodImpl(MethodImplOptions.AggressiveInlining)>]
  let inline bar x = x + 1
end

module FooNoInlining = begin
  [<MethodImpl(MethodImplOptions.NoInlining)>]
  let bar x = FooOperations.bar x
end

module Foo = begin
  let bar x = FooOperations.bar x
end

module FooInlining = begin
  let inline bar x = FooOperations.bar x
end

module FooAggressiveInlining = begin
  [<MethodImpl(MethodImplOptions.AggressiveInlining)>]
  let bar x = FooOperations.bar x
end

Example usage:

module F = Foo
module FI = FooAggressiveInlining

let testF() =
  F.bar 1

let testFI() =
  FI.bar 1

Please note: AggressiveInlining changes the JIT's inlining behaviour. The JIT will try to inline the method even if it exceeds the JIT's usual inlining size limit.

@dsyme
Contributor

dsyme commented Oct 2, 2017

> This is a bit disappointing, the call to iter was inlined, but the lambda was not erased...

Yes, for Debug code I'd imagine that's the case.

> I would still prefer to explicitly erase *.iter to a for loop, because that one also erases the lambda into the outer scope.

Again, I'd prefer a set of orthogonal decisions/optimizations/choices that would work for all code, including user-defined code, rather than just one function in the library.

So let's see whether inline can also achieve improved debugging. The end result of the inlined code is a TAST that contains something like ... let f = (fun ...) in ... f x ... where the let f = ... is the binding for the argument of the inlined function.

Now normally we don't do lambda-propagation of f in Debug code (in Debug code the aim is to avoid "mucking" with the code as much as possible). But perhaps in some (very limited) circumstances we should do lambda-propagation to improve debuggability of inlined code. It's hard to tell immediately what the general criteria would be for that, but perhaps either:

  1. f is a value resulting from a parameter of an inlined function, or just
  2. f is compiler generated

when encountering f in f x. The point where we make this decision is here: https://github.com/Microsoft/visualfsharp/blob/master/src/fsharp/Optimizer.fs#L2498. Perhaps this could be modified to check if we're at the application of a compiler generated f value. (You'd have to pass f0 in here https://github.com/Microsoft/visualfsharp/blob/master/src/fsharp/Optimizer.fs#L2580 and check if it's a compiler generated value.)
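
In source-level terms, the two shapes might be pictured as follows. This is a hand-written illustration of the idea only, not actual TAST or optimizer output, and the function names are hypothetical.

```fsharp
// Shape after inlining an iter-like function, before lambda propagation:
// the argument lambda is still bound to `f` and applied as `f x`.
let afterInlining (source: seq<int64>) =
    let mutable counter = 0L
    let f = (fun i -> counter <- counter + i)  // binding for the inlined argument
    use e = source.GetEnumerator()
    while e.MoveNext() do
        f e.Current                            // the `f x` application
    counter

// Shape after lambda propagation: the lambda body replaces `f x`, so no
// closure remains and `counter` is a plain local of this method.
let afterLambdaPropagation (source: seq<int64>) =
    let mutable counter = 0L
    use e = source.GetEnumerator()
    while e.MoveNext() do
        let i = e.Current
        counter <- counter + i
    counter

let input = Seq.init 2000 int64
printfn "%d %d" (afterInlining input) (afterLambdaPropagation input)
```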

> It is my strong guess that all functions with a small body that accept a callback would benefit a lot from this. For small functions that don't accept a callback, it is probably not so clear-cut.

Yes.

@0x53A
Contributor Author

0x53A commented Oct 2, 2017

So, three tasks:

  • mark all suitable functions (small body + callback) as inline
  • erase lambdas more eagerly in debug mode
  • make sure debug information flows even through the erasure.

The last one is probably the most important - in my example in Release mode, the lambda was inlined, but I couldn't even set a breakpoint.

I will take another stab at this, but as always, it may be a while ;)

Thanks!


Ceterum autem censeo Carthaginem esse delendam.

I still think small targeted semantic optimizations like the seq.map fusion would make sense in the absence of staging.

@dsyme
Copy link
Contributor

dsyme commented Oct 3, 2017

> make sure debug information flows even through the erasure.

Hmmm... I think (not sure) this should "just happen". Debug information gets erased from the implementation of the iteration, but not from the lambda. So I think we should just get sequence points in the lambda as expected. But I'm still not sure what debug experience that will give on stepping, and it might depend on where the lambda is used in the body of the implementation.

@dsyme
Contributor

dsyme commented May 30, 2018

@dotnet-bot test this please

@KevinRansom
Contributor

@0x53A, @dsyme, what do you want to do with this PR?

It's marked as discussion, but nothing much has been said this year.

Can it be closed?

Thanks

Kevin

@0x53A
Contributor Author

0x53A commented Sep 12, 2018

I think the result of the discussion was that yes, marking these functions as inline would be a positive change.

It's just that someone has to do it, and I haven't yet, and probably won't in the next few weeks/months.

I'd close this. If someone else wants to implement it, then great; otherwise I may reopen later.

@0x53A closed this on Sep 12, 2018