Skip to content

Conversation

@dennwc
Copy link
Contributor

@dennwc dennwc commented Dec 2, 2025

This PR refactors and optimises the SIP message parsers.

Short list of changes:

  • Parser will now track the number of bytes consumed, removing the need for hacks in streaming mode.
  • Steaming parser now properly skips CRLF between messages (RFC 3261 - 7.5).
  • Content-Length is now required for streaming mode (RFC 3261 - 7.5).
  • Parser postpones the []byte -> string conversions for headers, improving performance by 4% and reducing allocations by 10%.
  • Streaming parser reuses more code from the regular parser.
  • Parser exposes new API for parsing the message headers without the body.
  • Streaming parser exposes API for getting individual messages and how many bytes were read.
  • Streaming parser can now discard data to recover the connection state after malformed messages.

Benchmarking results

4% faster parsing, up to 10% less allocations.

                               │   old.txt   │               new.txt                │
                               │   sec/op    │    sec/op     vs base                │
ParserStream/NoChunks-16         2.500µ ± 3%   2.377µ ±  4%   -4.94% (p=0.004 n=10)
ParserStream/SingleRoutine-16    2.612µ ± 2%   2.518µ ±  1%   -3.62% (p=0.000 n=10)
ParserStream/Paralel-16          13.18µ ± 6%   13.33µ ±  6%        ~ (p=0.912 n=10)
Parser/SingleRoutine-16          2.843µ ± 4%   2.785µ ±  1%   -2.04% (p=0.000 n=10)
Parser/Paralel-16                14.21µ ± 5%   14.10µ ±  5%        ~ (p=0.239 n=10)
ParserNoData/New-16              2.120µ ± 1%   2.014µ ±  3%   -5.00% (p=0.000 n=10)

                               │    old.txt     │                new.txt                 │
                               │      B/op      │     B/op      vs base                  │
ParserStream/NoChunks-16         2.790Ki ± 0%     2.656Ki ± 0%   -4.80% (p=0.000 n=10)
ParserStream/SingleRoutine-16    2.845Ki ± 0%     2.656Ki ± 0%   -6.63% (p=0.000 n=10)
ParserStream/Paralel-16          3.021Ki ± 0%     2.661Ki ± 0%  -11.90% (p=0.000 n=10)
Parser/SingleRoutine-16          3.117Ki ± 0%     2.984Ki ± 0%   -4.26% (p=0.000 n=10)
Parser/Paralel-16                3.123Ki ± 0%     2.989Ki ± 0%   -4.28% (p=0.000 n=10)
ParserNoData/New-16              2.520Ki ± 0%     2.430Ki ± 0%   -3.57% (p=0.000 n=10)

                               │   old.txt    │              new.txt                │
                               │  allocs/op   │ allocs/op   vs base                 │
ParserStream/NoChunks-16         39.00 ± 0%     37.00 ± 0%  -5.13% (p=0.000 n=10)
ParserStream/SingleRoutine-16    41.00 ± 0%     37.00 ± 0%  -9.76% (p=0.000 n=10)
ParserStream/Paralel-16          41.00 ± 0%     37.00 ± 0%  -9.76% (p=0.000 n=10)
Parser/SingleRoutine-16          44.00 ± 0%     43.00 ± 0%  -2.27% (p=0.000 n=10)
Parser/Paralel-16                44.00 ± 0%     43.00 ± 0%  -2.27% (p=0.000 n=10)
ParserNoData/New-16              32.00 ± 0%     30.00 ± 0%  -6.25% (p=0.000 n=10)

Changes to Parser

Original parsing method is kept unchanged:

func (p *Parser) ParseSIP(data []byte) (Message, error)

Instead, a new method is introduced:

func (p *Parser) Parse(data []byte, stream bool) (Message, int, error)

This method now returns number of bytes used to parse the message. It always stops at the beginning of the line that caused a failure, which helps reuse the same underlying code for streaming mode.

The stream flag adjust the behaviour of the parser slightly, for example, in streaming mode the Content-Length is required and CRLF at the beginning of the message is silently skipped (RFC 3261 - 7.5).

This new method now always returns io.UnexpectedEOF if message was not read completely and more data is required. The old ParseSIP is not affected and still returns ParseEOF error.

Additionally, another new method is added for parsing the message without the body:

func (p *Parser) ParseHeaders(data []byte, stream bool) (Message, int, error)

This allows using the parser more efficiently in proxies that only act on the headers. It will not require allocating the body separately, instead, the proxy may write the headers followed by the body from the original buffer.

Changes to ParserStream

Existing methods for ParserStream are kept unchanged:

func (p *ParserStream) ParseSIPStream(data []byte) ([]Message, error)
func (p *ParserStream) ParseSIPStreamEach(data []byte, cb func(msg Message)) error

New method was added for reading messages separately:

func (p *ParserStream) Write(data []byte) (int, error)
func (p *ParserStream) ParseNext() (Message, int, error)

After using Write to append data to the internal buffer, ParseNext can be called multiple times to get SIP messages. As opposed to ParseSIPStream and ParseSIPStreamEach, the caller can decide when to stop reading the messages.

Under the hood, all ParserStream methods now reuse the methods for parsing headers and start line from the Parser, reducing code duplication and differences in behaviour.

Similar to the new Parse method, ParseNext returns the number of bytes that were used to read the message, or io.UnexpectedEOF in case the message was parsed only partially. Old methods like ParseSIPStreamEach still return ErrParseSipPartial instead.

Additionally, there are a few new methods to control the stream state:

func (p *ParserStream) Buffer() *bytes.Buffer
func (p *ParserStream) Discard(n int)
func (p *ParserStream) Reset()
func (p *ParserStream) Close()

Buffer allows the caller to examine the underlying parser state after an error. ParseNext will return the offset into this buffer at the beginning of the failed line.

Discard allows the caller to drop N first bytes and reset the parser to recover the remaining stream. This API assumes that the caller has some heuristic to decide how and when to recover.

Reset completely resets the internal state and the buffer, allowing reuse of the ParserStream.

Close is similar, but calling it is now required to reuse the underlying buffer for other parsers (using the Pool). Previously, the parser was always discarding the buffer after each message. Now the caller can control it by calling Reset or Close.

@emiago
Copy link
Owner

emiago commented Dec 2, 2025

Like a bomb :), just we are almost at 1.0.0 and I am now not sure where to go with this.
I like the performance improvements and all and would like to merge but...

So do I understand you need some exposure here of Parser because you are running some manual parsing?
I would like to see is there need to expose this right now all, and give chance for merging prio 1.0.0

I have to look more, but here some first looks.
This I think could be better as 2 seperate functions. Reason is that caller has to know this upfront, and
error handling will anyway be determined by flag, but it can be underhood left the rest.

func (p *Parser) Parse(data []byte, stream bool) (Message, int, error)

Also btw reason for ParseSIP and not Parse was historifcal as we had interface usage :)


// headerParserTo generates ToHeader
func headerParserTo(headerName string, headerText string) (header Header, err error) {
func headerParserTo(headerName []byte, headerText string) (header Header, err error) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could understand this comes from line buffer, but value is mostly not used, so not sure do we need this change anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally it would be great to switch values to byte slices - that should give even better performance. But that would be a very large change. So I only did header names for now.

Answering your question, tbh I'm not sure why it helps here specifically. As you mentioned, the string here is unused and is mostly generated from static strings in the lower case header names switch. So it should not matter if we do bytes here. But it does affect performance in the end.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably removing is better, but we can deal with this after.


colonIdx := strings.Index(headerText, ":")
// ParseHeader parses a SIP header from the line and appends it to out.
func (headersParser HeadersParser) ParseHeader(out []Header, line []byte) ([]Header, error) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was avoiding to expose this, but I guess there is use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite happy with the current API having the ParseHeader which avoids the body, so we could unexport it back. The reason I still exposed it is just in case we'll need something more exotic for parsing, while still keeping compatibility with sipgo data structures. We'll need to call this directly then. Maybe the answer is similar to the one for NextLine - move it to another package like parser. Would that work?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Parser is just natural fit for sip package, like json or any other.

// It returns io.ErrUnexpectedEOF is there's no CRLF (\r\n) in the data.
// If there's a CR (\r) which is not followed by LF (\n), a ErrParseLineNoCRLF is returned.
// As a special case, it returns io.EOF if data is empty.
func NextLine(data []byte) ([]byte, int, error) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not have this package exposed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, would it be more acceptable if I introduce a new parser package that will expose this method without polluting the sip package?

Copy link
Owner

@emiago emiago Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @dennwc Can you pls share, why this functions will benefit to have on API level. I am not sure how you use this, but user should be able to parse message without knowing/using some extra traversing. Exposing this I can not see any sense unless you are reimplementing parts of Parser?

No parser is just generic name, it is not Go way. sip is where things need to land. Still I do not see how will change things. Also to mention, I regret on other stuff not putting here, but was hard to change.

Anyway do not worry about naming, I can deal with this, just I want now to keep API on lower exposure and bring up things up more slowly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I was considering to use this functionality to build a message segmenter, that is something that will only mark message boundaries, but won't do the full parsing. But for now it's unused, so we could unexport in back and consider this change separately.

@dennwc
Copy link
Contributor Author

dennwc commented Dec 3, 2025

First, thank you for a quick reply!

So do I understand you need some exposure here of Parser because you are running some manual parsing?

That's correct. We now have a separate SIP proxy that needs the parser, but not the other layers. And we've hit a few issues with it. This is why we'd appreciate a bit more flexibility for the parser, which this PR attempts.

I would like to see is there need to expose this right now all, and give chance for merging prio 1.0.0

No rush at all! We are totally fine with running this code from our own branch for now, until you are ready to upstream these changes in one form or another (1.1.0 maybe?).

This I think could be better as 2 seperate functions.

You mean having two separate parsers? PR still keeps ParseSIP method intact, so it doe use two different function for old vs new signatures and error semantics. Under the hood it's the same code.

@emiago
Copy link
Owner

emiago commented Dec 16, 2025

@dennwc if you can pls rebase, but if you find some werid conflicts let me know. I had some maybe renaming of stuff. I want to consider merging this soon.

@dennwc
Copy link
Contributor Author

dennwc commented Dec 18, 2025

@emiago rebase complete!

type parserState int

var (
ParseMaxMessageLength = 65535
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder was this removed by your rebase. Will need to check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still there, just in a different file.

@dennwc
Copy link
Contributor Author

dennwc commented Dec 23, 2025

Hey @emiago! Is anything else blocking this? I do not want to rush it, just let me know if there's anything else remaining here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants