DRILL-7534: Convert HTTPD Format Plugin to EVF #2112
|
@nielsbasjes |
|
It seems that https://github.com/apache/drill/blob/gh-pages/_docs/connect-a-data-source/plugins/111-httpd-format-plugin.md could use an update as well (at least for the maxErrors addition). I just noticed that the page describes the useragent parsing as your "external" UDF, but that was included in Drill itself last year. |
|
I think the documentation should have an example of the wildcard support. |
|
I was thinking about the concept of the extensions to match the files. I see that this is the way it is configured and used in the To me it would make a lot more sense that the What do you think @cgivre ? |
|
@nielsbasjes I have a question for you regarding the file extension. Right now, Drill uses the file extension to determine which format plugin to use for parsing the file(s). One other option that Drill has is the
    "weblogs": {
      "location": "<path to logs>",
      "writable": false,
      "defaultInputFormat": "httpd"
    }
With that said, I do like the idea of allowing users to define a pattern for filenames that would be associated with a particular file type. I think that might be out of scope for this PR however. |
@dzamo, could you take a look? |
|
Hi @nielsbasjes
Thanks! drill/contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogRecord.java Lines 120 to 140 in a53402e
Here's where the results are being mapped to the setters: drill/contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogRecord.java Lines 321 to 326 in a53402e
|
|
Regarding 1: There is a LogParser Dissector (UserAgentDissector) that uses Yauaa: https://yauaa.basjes.nl/UDF-LogParser.html . Regarding 2: In the LogParser there is already code to further parse the timestamp format into usable parts. For this situation I think the best way to obtain the timestamp in a usable generic form is by retrieving (for example) the |
|
As an experiment I added this to your code: Now I ran into something strange. I see this At this moment I think this is a bug in the Yauaa Dissector. |
|
@nielsbasjes
I'm going to finish the date fields first, then I will do additional experiments on the UA dissector. |
|
@cgivre
|
|
I just now released version 5.6 of the logparser that fixes the "double values" problem. |
|
Hi @nielsbasjes |
|
Cool! Yes, I agree with having the useragent part as a separate addition later on. I wanted to try to make a JUnit test with different config settings (like having multiple logformats). |
@nielsbasjes drill/contrib/format-syslog/src/test/java/org/apache/drill/exec/store/syslog/TestSyslogFormat.java Lines 45 to 60 in 31d6086
This snippet demonstrates how to set up different configs for testing. What I did in that case was define different extensions |
|
I did some testing and found something worth discussing regarding the wildcards. A note about all of these points: I'm fine with just putting a bit of documentation in place that describes these as known limitations.

When I do a "select *" from a table backed by this format and print the result set, I get for "wildcard" scenarios like the query parameters and the cookies options like these:

The first thing I noticed is that the actual values in the data are reflected in the header. I assume this is just the way RowSet::print() works. Do note that if you have a large variety of query parameters in your dataset this may become a big list.

What I find is that these wildcards do not work as I expected when compared with what the underlying parser does. Assuming the URI When I ask for

This "explicit" way of asking for a value is there because the system then does not need to URL-decode the "unwanted" fields (i.e. there is a bit of performance impact if there are a lot of unwanted fields, such as query parameters or cookies, in the line at hand). Note that the underlying parser does support this; the example for Apache Pig makes this the most clear:

Now the response cookies are special because they have limited support for a wildcard in the middle: These are intended so you can ask for something like

Here I found that these also seem to always return null. |
|
Hmmm, the schema of a map seems "funny" in the output. I do and I get as output: |
That is the intended behavior. What should happen is that Drill will create a map of the parsed cookies and uri query. If you don't think this is the most effective way of doing this, I'm definitely open to refactoring it. Just as an FYI, I only chose to do it this way because that's how it was done in the original Drill/HTTPD integration. It might be better to flatten these maps and produce actual columns with the values.
That is correct.
The way Drill works is that it creates a vector for every column it finds. So if you have a URL with params Now, if the next record has
What I think you're getting at here is it might be advantageous to flatten the wildcard fields rather than putting them in a Drill map and in so doing, create many null columns. Is that correct? If so, my thought here is that the best way to go about that would be to add a config option called The advantage that I see in doing this is easier queries. For instance, if you wanted to find particular values from a query string, you could do something like:
    SELECT <fields>
    FROM ...
    WHERE request_firstline_uri_query_aap = 1234
Would that work for you? |
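To make the flattening idea above concrete, here is a minimal Java sketch. It is not the plugin's implementation: the class and method names are hypothetical, the `noot` parameter is only an illustrative second value, and the column prefix is simply modelled on the `request_firstline_uri_query_aap` column used in the query above. URL decoding is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not code from the Drill HTTPD plugin.
class WildcardFlattenSketch {

    /** Splits a raw query string such as "aap=1234&noot=5678" into name/value pairs. */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value); // URL decoding omitted to keep the sketch short
        }
        return params;
    }

    /** Turns each map entry into its own top-level column by prepending the prefix. */
    static Map<String, String> flatten(String prefix, Map<String, String> params) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            columns.put(prefix + entry.getKey(), entry.getValue());
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, String> columns =
            flatten("request_firstline_uri_query_", parseQuery("aap=1234&noot=5678"));
        columns.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With the map form, a record that lacks a parameter simply has a smaller map. With the flattened form, every parameter seen anywhere in the data becomes a column, and that column is null for records that do not carry it, which is the trade-off discussed in this thread.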
|
I'm cool with what makes it work for Drill. |
|
@nielsbasjes |
|
Sounds great! |
@nielsbasjes |
|
I had a look at the tests that check the wildcard flattening and this looks very good to me. +1 |
Docs updated. |
DRILL-7534: Convert HTTPD Format Plugin to EVF
Description
This PR updates the HTTPD format plugin to use the Enhanced Vector Framework (EVF). In theory there are few changes a user might notice.
- maxErrors has been added, which will allow a user to tune how fault tolerant they want Drill to be when reading log files.
- _raw and _matched. They are described in the docs below.
- contrib folder.
- flattenWildcards option, which allows the user to flatten nested fields.

This PR also refactors the code and includes some optimizations which should, in theory, result in faster queries.
In addition, this PR updates the associated User Agent parsing functions with the latest version of the underlying libraries.
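The interplay between maxErrors and the _raw and _matched columns mentioned in this description can be pictured with a small Java model. This is a sketch under stated assumptions, not the plugin's reader: a regular expression stands in for the real log format parser, the class and method names are hypothetical, and the rule "fail once the number of unmatched lines exceeds maxErrors" is an assumption. The only behaviour taken from the documentation in this PR is that a value below 0 ignores all errors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model only; this is not code from the Drill HTTPD plugin.
class MaxErrorsSketch {

    /** One output row: the raw line plus a flag saying whether it matched the format. */
    static final class Row {
        final String raw;
        final boolean matched;

        Row(String raw, boolean matched) {
            this.raw = raw;
            this.matched = matched;
        }
    }

    /**
     * Reads all lines, tagging each with its match status.
     * Assumption: a negative maxErrors means unlimited tolerance; otherwise the
     * read fails as soon as the count of unmatched lines exceeds maxErrors.
     */
    static List<Row> read(List<String> lines, Pattern format, int maxErrors) {
        List<Row> rows = new ArrayList<>();
        int errors = 0;
        for (String line : lines) {
            boolean matched = format.matcher(line).matches();
            if (!matched) {
                errors++;
                if (maxErrors >= 0 && errors > maxErrors) {
                    throw new IllegalStateException("Too many unparseable lines: " + errors);
                }
            }
            rows.add(new Row(line, matched));
        }
        return rows;
    }
}
```

In this model, selecting the rows whose matched flag is false corresponds to asking Drill for the lines that did not match the configured format.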
Documentation
Web Server Log Format Plugin (HTTPD)
This plugin enables Drill to read and query httpd (Apache Web Server) and nginx logs natively. This plugin uses the work by Niels Basjes which is available here: https://github.com/nielsbasjes/logparser.
Configuration
The following fields control how Drill reads web server logs:
- logFormat: The log format string is the format string found in your web server configuration.
- timestampFormat: The format of time stamps in your log files.
- extensions: The file extension of your web server logs.
- maxErrors: Sets the plugin error tolerance. When set to any value less than 0, Drill will ignore all errors.
- flattenWildcards: Flattens nested fields.
Implicit Columns
Data queried by this plugin will return two implicit columns:
- _raw: This returns the raw, unparsed log line.
- _matched: Returns true or false depending on whether the line matched the config string.
Thus, if you wanted to see which lines in your log file were not matching the config, you could use the following query:
Testing
Added additional unit tests for this plugin. Ran all unit tests for the parse_user_agent() UDF as well.