Conversation

@cgivre (Contributor) commented Nov 11, 2020

DRILL-7534: Convert HTTPD Format Plugin to EVF

Description

This PR updates the HTTPD format plugin to use the Enhanced Vector Framework (EVF). Most of the changes are internal, but there are a few that a user might notice:

  1. A new configuration option, maxErrors, allows a user to tune how fault tolerant Drill should be when reading log files.
  2. Two new implicit fields have been added, _raw and _matched. They are described in the docs below.
  3. The plugin now includes a limit pushdown, which significantly improves query times for queries with limits (see the example after this list).
  4. The plugin code now lives in the contrib folder.
  5. A new flattenWildcards option allows the user to flatten nested fields.
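
For example, with the pushdown a query such as the following (file path illustrative, borrowed from this PR's unit tests) stops reading as soon as ten rows have been produced instead of scanning the whole file:

SELECT *
FROM cp.`httpd/hackers-access-small.httpd`
LIMIT 10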

This PR also refactors the code and includes some optimizations which should, in theory, result in faster queries.

In addition, this PR updates the associated user agent parsing functions to use the latest versions of the underlying libraries.

Documentation

Web Server Log Format Plugin (HTTPD)

This plugin enables Drill to read and query httpd (Apache Web Server) and nginx logs natively. It uses the work by Niels Basjes, which is available here: https://github.com/nielsbasjes/logparser.

Configuration

There are several fields you can configure in order for Drill to read web server logs:

  • logFormat: The log format string is the format string found in your web server configuration.
  • timestampFormat: The format of the timestamps in your log files.
  • extensions: The file extension of your web server logs.
  • maxErrors: Sets the plugin error tolerance. When set to any value less than 0, Drill will ignore all errors.
  • flattenWildcards: Flattens nested (wildcard) fields into separate columns.
"httpd" : {
  "type" : "httpd",
  "logFormat" : "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\"",
  "timestampFormat" : "dd/MMM/yyyy:HH:mm:ss ZZ",
  "maxErrors": 0,
  "flattenWildcards": false
}

Implicit Columns

Data queried by this plugin will return two implicit columns:

  • _raw: Returns the raw, unparsed log line.
  • _matched: Returns true or false depending on whether the line matched the logFormat string.

Thus, if you wanted to see which lines in your log file did not match the configured format, you could use the following query:

SELECT _raw
FROM <data>
WHERE _matched = false

Testing

Added additional unit tests for this plugin. Ran all unit tests for the parse_user_agent() UDF as well.

@cgivre cgivre self-assigned this Nov 11, 2020
@cgivre cgivre added the refactoring PR related to code refactoring label Nov 11, 2020
@cgivre (Author) commented Nov 11, 2020

@nielsbasjes
Would you be willing to take a look at this PR?
Thanks!

@nielsbasjes (Contributor) commented:

Seems like https://github.com/apache/drill/blob/gh-pages/_docs/connect-a-data-source/plugins/111-httpd-format-plugin.md could use an update as well (at least the maxErrors addition). I just noticed that the page mentions the user agent parsing as your "external" UDF, but that was included in Drill itself last year.

@nielsbasjes (Contributor) commented:

I think the documentation should have an example of the wildcard support.
Something like showing a select * that returns wildcard values, and then showing how someone can directly query a specific URL parameter or cookie value, would be really helpful; a sketch follows below.
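
For example, something along these lines (a sketch; the file is the test file from this PR, and the parameter name aap is just an illustration):

SELECT `request_firstline_uri_query_$`
FROM cp.`httpd/hackers-access-small.httpd`

and then, to query one specific URL parameter directly:

SELECT `request_firstline_uri_query_aap`
FROM cp.`httpd/hackers-access-small.httpd`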

@nielsbasjes (Contributor) commented:

I was thinking about the concept of using extensions to match the files.
When I look at my own access log files, they are named something like access_log or access_log-20201108, so matching on an extension does not work for me.

I see that this is the way it is configured and used in the EasyFormatPlugin's EasyFormatConfig.
Looking deeper into the code, I see that this list of extensions is used to create a BasicFormatMatcher, which internally stores them as a list of Patterns ... and it also has a constructor that allows you to just specify these patterns.

To me it would make a lot more sense if extensions were replaced by fileNamePatterns.
I'm totally fine with having the same default as it is now: *\.httpd

What do you think, @cgivre?

@cgivre (Author) commented Nov 15, 2020

@nielsbasjes
Thanks for the quick review! I have a few more things to tweak before it's ready for the next round of review, but with the refactoring I've noticed a very significant improvement in query performance on my machine with these log files. Did you see any difference?

I have a question for you regarding the file extension. Right now, Drill uses the file extension to determine which format plugin to use for parsing the file(s). Another option Drill offers is the defaultInputFormat, which is set per workspace. I'd imagine that in a real-world situation, web server logs would be collected, as they are generated, in a directory or series of directories. In that case, you could define a workspace and set its defaultInputFormat to httpd, which tells Drill to use the HTTPD plugin even when the files have no extension.

       "weblogs": {
         "location": "<path to logs>",
         "writable": false,
         "defaultInputFormat": "httpd"
       }

With that said, I do like the idea of allowing users to define a filename pattern that would be associated with a particular file type. I think that might be out of scope for this PR, however.

@cgivre (Author) commented Nov 15, 2020

Seems like https://github.com/apache/drill/blob/gh-pages/_docs/connect-a-data-source/plugins/111-httpd-format-plugin.md could use an update as well (at least the maxErrors addition). I just noticed that the page mentions the user agent parsing as your "external" UDF, but that was included in Drill itself last year.

@dzamo, could you take a look?

@cgivre (Author) commented Nov 15, 2020

Hi @nielsbasjes
I think we're ready for the next round of review. I do have a few questions for you:

  1. Is there a dissector for user agent strings? I'd like to include it if possible so that the user doesn't have to call UDFs to get user agent info. However, this could cause performance degradation, so if it is possible, I might add a config option to disable it by default.

  2. I made some modifications to use the default timestamp parser instead of asking the user for one. (The user can still specify one, but if the field is null it doesn't get passed.) In doing so, I ran into another issue. If you look at the setter below, I still need to produce an actual timestamp so that I can send it to the Drill ScalarWriter. Unfortunately, the parser sends a String, and hence I still need a format string. Is there a way to either:
    a. Get the format string from the parser? Or
    b. Get the parser to return a timestamp of some sort so that I don't have to do the conversion?

Thanks!

@SuppressWarnings("unused")
public void setTimestamp(String field, String value) {
  if (value != null) {
    // Convert the date string into a long
    long ts = 0;
    try {
      Date d = this.dateFormatter.parse(value);
      ts = d.getTime();
    } catch (Exception e) {
      // If the date formatter does not successfully create a date, the timestamp falls back to zero.
      // Do not throw an exception.
    }
    final ScalarWriter tw = times.get(field);
    if (tw != null) {
      logger.trace("Parsed field: {}, as time: {}", field, value);
      tw.setTimestamp(new Instant(ts));
    } else {
      logger.warn("No 'timestamp' writer found for field: {}", field);
    }
  }
}

Here's where the results are being mapped to the setters:

if (parserFieldName.startsWith("TIME.STAMP:")) {
  parser.addParseTarget(this.getClass().getMethod("setTimestamp", String.class, String.class), parserFieldName);
  times.put(parserFieldName, rowWriter.scalar(drillFieldName));
} else {
  parser.addParseTarget(this.getClass().getMethod("set", String.class, String.class), parserFieldName);
  strings.put(parserFieldName, rowWriter.scalar(drillFieldName));
}

@nielsbasjes (Contributor) commented:

Regarding 1: There is a LogParser Dissector (UserAgentDissector) that uses Yauaa: https://yauaa.basjes.nl/UDF-LogParser.html.
You can of course enable/disable this with a setting. I think you can also simply enable it by default, as the parsing is automatically optimized out if you do not ask for the parsed result fields. So if you only select certain fields, the runtime performance difference should be zero.

Regarding 2: The LogParser already contains code to further parse the timestamp into usable parts. For this situation, I think the best way to obtain the timestamp in a usable, generic form is to retrieve (for example) TIME.EPOCH:request.receive.time.epoch. This is a long holding the number of milliseconds since 1970-01-01, which would simplify the code to a setter like public void setTimestamp(long value) that can simply do something like Instant.ofEpochMilli(value). This way you are leveraging the parsing capabilities I've already built into the system.
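
A minimal sketch of what this could look like on the Drill side (hypothetical; it mirrors the parse-target registration shown above and assumes the parser accepts Long-typed setters for EPOCH fields):

// Register the epoch-millis field as a parse target, so the parser hands
// over a Long instead of a raw timestamp string.
parser.addParseTarget(
    this.getClass().getMethod("setTimestamp", String.class, Long.class),
    "TIME.EPOCH:request.receive.time.epoch");

@SuppressWarnings("unused")
public void setTimestamp(String field, Long value) {
  ScalarWriter writer = times.get(field);
  if (writer != null && value != null) {
    // No format string needed: the value is already epoch milliseconds.
    writer.setTimestamp(new Instant(value));
  }
}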

@nielsbasjes (Contributor) commented:

As an experiment I added this to your code:

diff --git a/contrib/format-httpd/pom.xml b/contrib/format-httpd/pom.xml
index 10a9e35b4..02ae984ac 100644
--- a/contrib/format-httpd/pom.xml
+++ b/contrib/format-httpd/pom.xml
@@ -51,6 +51,12 @@
       </exclusions>
     </dependency>
 
+    <dependency>
+      <groupId>nl.basjes.parse.useragent</groupId>
+      <artifactId>yauaa-logparser</artifactId>
+      <version>${yauaa.version}</version>
+    </dependency>
+
     <!-- Test dependencies -->
     <dependency>
       <groupId>org.apache.drill.exec</groupId>
diff --git a/contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdParser.java b/contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdParser.java
index 326a074d1..8a0f23063 100644
--- a/contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdParser.java
+++ b/contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdParser.java
@@ -17,6 +17,7 @@
  */
 package org.apache.drill.exec.store.httpd;
 
+import nl.basjes.parse.useragent.dissector.UserAgentDissector;
 import org.apache.drill.common.expression.SchemaPath;
 import org.apache.drill.common.types.TypeProtos;
 import org.apache.drill.common.types.TypeProtos.MinorType;
@@ -67,6 +68,7 @@ public class HttpdParser {
     } else {
       this.parser = new HttpdLoglineParser<>(HttpdLogRecord.class, logFormat, timestampFormat);
     }
+    this.parser.addDissector(new UserAgentDissector());
     this.requestedColumns = scan.getColumns();
 
     if (timestampFormat != null && !timestampFormat.trim().isEmpty()) {
@@ -119,6 +121,7 @@ public class HttpdParser {
      * because this will be the slowest parsing path possible for the specified format.
      */
     Parser<Object> dummy = new HttpdLoglineParser<>(Object.class, logFormat);
+    dummy.addDissector(new UserAgentDissector());
     dummy.addParseTarget(String.class.getMethod("indexOf", String.class), allParserPaths);
 
     for (final Map.Entry<String, String> entry : requestedPaths.entrySet()) {

Now I ran into something strange.
When I run this test code:

    String sql = "SELECT `request_user-agent`, `request_user-agent_device__name`, `request_user-agent_agent__name__version__major` FROM cp.`httpd/hackers-access-small.httpd` LIMIT 1";
    RowSet results = client.queryBuilder().sql(sql).rowSet();
    results.print();

I see this:

#: `request_user-agent` VARCHAR, `request_user-agent_device__name` VARCHAR, `request_user-agent_agent__name__version__major` VARCHAR
0: "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0", "DesktopDesktop", "Firefox 35Firefox 35"

At this moment I think this is a bug in the Yauaa Dissector.
I'm digging into this.

@cgivre (Author) commented Nov 16, 2020

@nielsbasjes
Thanks for answering the questions. I now have a few more...

  1. Is the field request_referer_time formatted exactly as it comes out of the log file, or is it something that your parser reformats?

  2. Does your parser consistently return the date fields in the format yyyy-mm-dd, or are these also at the mercy of the underlying format? Are time fields also guaranteed to be returned in the format hh:mm:ss?

  3. I also experimented with adding the UA dissector to the parser. It did have a significant speed cost, however. I think the issue is around the dummy parser that is created to build the schema. For some perspective, most of my unit tests completed in under 1 second without the UA dissector and in around 3 seconds with it.

I'm going to finish the date fields first, then I will do additional experiments on the UA dissector.
Thanks!

@nielsbasjes (Contributor) commented:

@cgivre
NOTE: Regarding the double output like "Firefox 35Firefox 35": it is a bug in my parser that causes the setter to be called twice under certain conditions (a reset is done incompletely if you change the settings later). I'm working on a patch, and a new version should be available this week.

  1. A time field like request_receive_time (there is no such thing as request_referer_time) will be exactly as you find it in the log file, as a "raw string".
  2. The original "raw string" is always parsed if you ask for a "dissection" of it (https://github.com/nielsbasjes/logparser/blob/master/httpdlog/httpdlog-parser/src/main/java/nl/basjes/parse/httpdlog/dissectors/TimeStampDissector.java#L418), and from the resulting instance the output for these fields is created in the fully predictable ISO format (https://github.com/nielsbasjes/logparser/blob/master/httpdlog/httpdlog-parser/src/main/java/nl/basjes/parse/httpdlog/dissectors/TimeStampDissector.java#L491); see the illustration after this list.
  3. Yes, there is overhead in figuring out which fields CAN be extracted, so the extra 2 seconds during setup makes sense. Yet at runtime (i.e. hammering through the actual data) this Dissector only has an impact on processing time if you are actually asking for a field that comes from it.
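
To illustrate items 1 and 2 (the values are examples; the dissected field names are assumed from the TimeStampDissector):

TIME.STAMP:request.receive.time      -> "19/Nov/2020:10:14:00 +0100"  (raw, exactly as in the log)
TIME.DATE:request.receive.time.date  -> "2020-11-19"                  (ISO yyyy-MM-dd)
TIME.TIME:request.receive.time.time  -> "10:14:00"                    (ISO HH:mm:ss)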

@nielsbasjes (Contributor) commented:

I just now released version 5.6 of the logparser, which fixes the "double values" problem.
It should arrive on Maven Central soon.
Just make sure you exclude the copy packaged in Yauaa:

    <dependency>
      <groupId>nl.basjes.parse.useragent</groupId>
      <artifactId>yauaa-logparser</artifactId>
      <version>${yauaa.version}</version>
      <exclusions>
        <exclusion>
          <groupId>nl.basjes.parse.httpdlog</groupId>
          <artifactId>httpdlog-parser</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

@cgivre (Author) commented Nov 17, 2020

Hi @nielsbasjes
I fixed all the data type issues and I think this is ready for another round of review. I'm going to hold off on adding the UA parser directly for now, as I'd like to get this committed. I will say that I'm pleasantly surprised by the speed of this parser now: with the sample files I've queried, the results are noticeably faster and in some cases near instantaneous!

@nielsbasjes (Contributor) commented:

Cool! Yes, I agree with having the user agent part as a separate addition later on.

I wanted to try to write a JUnit test with different config settings (like having multiple logFormats).
How do I do that in a JUnit test?

@cgivre (Author) commented Nov 18, 2020

Cool! Yes, I agree with having the user agent part as a separate addition later on.

I wanted to try to write a JUnit test with different config settings (like having multiple logFormats).
How do I do that in a JUnit test?

@nielsbasjes
I updated the parser library to version 5.6. Thanks for the bug fixes. As for your original question, take a look here:

@BeforeClass
public static void setup() throws Exception {
  ClusterTest.startCluster(ClusterFixture.builder(dirTestWatcher).maxParallelization(1));
  defineSyslogPlugin();
}

private static void defineSyslogPlugin() throws ExecutionSetupException {
  Map<String, FormatPluginConfig> formats = new HashMap<>();
  formats.put("sample", new SyslogFormatConfig(
      Collections.singletonList("syslog"), null, null));
  formats.put("flat", new SyslogFormatConfig(
      Collections.singletonList("syslog1"), null, true));

  // Define a temporary plugin for the "cp" storage plugin.
  cluster.defineFormats("cp", formats);
}
This snippet demonstrates how to set up different configs for testing. What I did in that case was define different extensions, syslog and syslog1, for the different configs. In real life this would be done with different workspaces; however, this works for testing.
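
For this plugin, the equivalent could look something like the sketch below (the HttpdLogFormatConfig constructor arguments are assumptions mirroring the config fields documented above, not the verified signature):

private static void defineHttpdPlugin() throws ExecutionSetupException {
  Map<String, FormatPluginConfig> formats = new HashMap<>();
  // Hypothetical argument order: extensions, logFormat, timestampFormat,
  // maxErrors, flattenWildcards.
  formats.put("nested", new HttpdLogFormatConfig(
      Collections.singletonList("httpd"),
      "%h %l %u %t \"%r\" %s %b", null, 0, false));
  formats.put("flat", new HttpdLogFormatConfig(
      Collections.singletonList("httpd1"),
      "%h %l %u %t \"%r\" %s %b", null, 0, true));
  cluster.defineFormats("cp", formats);
}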

@nielsbasjes (Contributor) commented:

I did some testing and found something worth discussing regarding the wildcards.

A note about all of these points: I'm fine with just putting a bit of documentation in place that describes these as known limitations.

When I do a select * on a table backed by this format and print the result set, I get, for the "wildcard" scenarios like query parameters and cookies, entries like these:

`response_cookies_$` STRUCT<`apache` VARCHAR>,
`request_firstline_uri_query_$` STRUCT<`aap` VARCHAR, `res` VARCHAR>,

The first thing I noticed is that the actual values in the data are reflected in the header. I assume this is just the way RowSet::print() works. Do note that if you have a large variety of query parameters in your dataset, this may become a big list.

What I find is that these wildcards do not work as I expected when compared with what the underlying parser does.

Assuming the URI /icons/powered_by_rh.png?aap=noot&res=1024x768

When I ask for request_firstline_uri_query_$, I see in the output something that looks like what I expect: {"noot", "1024x768"}.
However, when I directly try to query a deeper entry like request_firstline_uri_query_aap, I consistently see a null value.

This "explicit" way of asking for a value exists because the system then does not need to URL-decode the "unwanted" fields (i.e. there is a bit of a performance impact when there are a lot of unwanted fields (query parameters / cookies) in the line at hand).

Note that the underlying parser does support this; the example for Apache Pig makes this the most clear:
https://github.com/nielsbasjes/logparser/blob/master/examples/apache-pig/src/main/pig/demo.pig#L34

Now the response cookies are special because they have limited support for a wildcard in the middle:

`response_cookies_$_comment` VARCHAR,
`response_cookies_$_domain` VARCHAR,
`response_cookies_$_expires` TIMESTAMP,
`response_cookies_$_path` VARCHAR,
`response_cookies_$_value` VARCHAR,

See https://github.com/nielsbasjes/logparser/blob/master/httpdlog/httpdlog-parser/src/test/java/nl/basjes/parse/httpdlog/ApacheHttpdLogParserTest.java#L161

These are intended so you can ask for something like STRING:response.cookies.jsessionid.path.

Here I found that these also seem to always return null.

@nielsbasjes (Contributor) commented:

Hmmm, the schema of a map seems "funny" in the output.

I do a SELECT `request_firstline_uri_query_$` FROM ... on a file that contains:

127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?one=1 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?two=2 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?three=3 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?four=4 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?five=5 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?six=6 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?seven=7 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?eight=8 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?nine=9 HTTP/1.1" 200 1213
127.0.0.1 - - [19/Nov/2020:10:14:00 +0100] "GET /paramtest?ten=10 HTTP/1.1" 200 1213

and then run:

    RowSet results = client.queryBuilder().sql(sql).rowSet();
    results.print();

and I get as output:

#: `request_firstline_uri_query_$` STRUCT<`one` VARCHAR, `two` VARCHAR, `three` VARCHAR, `four` VARCHAR, `five` VARCHAR, `six` VARCHAR, `seven` VARCHAR, `eight` VARCHAR, `nine` VARCHAR, `ten` VARCHAR>
0: {"1", null, null, null, null, null, null, null, null, null}
1: {null, "2", null, null, null, null, null, null, null, null}
2: {null, null, "3", null, null, null, null, null, null, null}
3: {null, null, null, "4", null, null, null, null, null, null}
4: {null, null, null, null, "5", null, null, null, null, null}
5: {null, null, null, null, null, "6", null, null, null, null}
6: {null, null, null, null, null, null, "7", null, null, null}
7: {null, null, null, null, null, null, null, "8", null, null}
8: {null, null, null, null, null, null, null, null, "9", null}
9: {null, null, null, null, null, null, null, null, null, "10"}

@cgivre (Author) commented Nov 20, 2020

I did some testing and found something worth discussing regarding the wildcards.

A note about all of these points: I'm fine with just putting a bit of documentation in place that describes these as known limitations.

When I do a select * on a table backed by this format and print the result set, I get, for the "wildcard" scenarios like query parameters and cookies, entries like these:

`response_cookies_$` STRUCT<`apache` VARCHAR>,
`request_firstline_uri_query_$` STRUCT<`aap` VARCHAR, `res` VARCHAR>,

That is the intended behavior: Drill creates a map of the parsed cookies and of the URI query parameters. If you don't think this is the most effective way of doing it, I'm definitely open to refactoring.

Just as an FYI, I only chose to do it this way because that's how it was done in the original Drill/HTTPD integration. It might be better to flatten these maps and produce actual columns with the values.

The first thing I noticed is that the actual values in the data are reflected in the header. I assume this is just the way RowSet::print() works. Do note that if you have a large variety of query parameters in your dataset, this may become a big list.

That is correct.

What I find is that these wildcards do not work as I expected when compared with what the underlying parser does.

Assuming the URI /icons/powered_by_rh.png?aap=noot&res=1024x768

When I ask for request_firstline_uri_query_$, I see in the output something that looks like what I expect: {"noot", "1024x768"}.
However, when I directly try to query a deeper entry like request_firstline_uri_query_aap, I consistently see a null value.

This "explicit" way of asking for a value exists because the system then does not need to URL-decode the "unwanted" fields (i.e. there is a bit of a performance impact when there are a lot of unwanted fields (query parameters / cookies) in the line at hand).

The way Drill works is that it creates a vector for every column it finds. So if you have a URL with params field1 and field2, you'll get vectors for field1 and field2, regardless of whether they are in a map or not.

Now, if the next record has field2 and field3, the result is that field1 will be null for row 2, while field2 and field3 will be populated.

Note that the underlying parser does support this; the example for Apache Pig makes this the most clear:
https://github.com/nielsbasjes/logparser/blob/master/examples/apache-pig/src/main/pig/demo.pig#L34

Now the response cookies are special because they have limited support for a wildcard in the middle:

`response_cookies_$_comment` VARCHAR,
`response_cookies_$_domain` VARCHAR,
`response_cookies_$_expires` TIMESTAMP,
`response_cookies_$_path` VARCHAR,
`response_cookies_$_value` VARCHAR,

See https://github.com/nielsbasjes/logparser/blob/master/httpdlog/httpdlog-parser/src/test/java/nl/basjes/parse/httpdlog/ApacheHttpdLogParserTest.java#L161

These are intended so you can ask for something like STRING:response.cookies.jsessionid.path.

Here I found that these also seem to always return null.

What I think you're getting at here is that it might be advantageous to flatten the wildcard fields rather than putting them in a Drill map and, in so doing, creating many null columns. Is that correct? If so, my thought is that the best way to go about it would be to add a config option called flattenWildcardFields; if the user enables it, they get a column for every value in the wildcard fields rather than a map.

The advantage that I see in doing this is easier queries. For instance, if you wanted to find particular values in a query string, you could do something like:

SELECT <fields>
FROM ...
WHERE request_firstline_uri_query_aap = 1234

Would that work for you?

@nielsbasjes (Contributor) commented:

I'm cool with whatever makes it work for Drill.
Do note that if you have a LOT of query params and you only need a single one, then putting them all into a map will have an impact on performance, as all parameters are extracted, URL-decoded and put into a map. When you only want a single one, the system is optimized to do those steps only for the requested field.
In general I expect the performance impact of this to be tiny.

@cgivre (Author) commented Dec 6, 2020

@nielsbasjes
I added an option to the config which allows the user to flatten the wildcard columns. I think this should address your comments. Thanks for your reviews!

@nielsbasjes (Contributor) commented:

Sounds great!

@cgivre (Author) commented Dec 6, 2020

Sounds great!

@nielsbasjes
If you approve the changes, can you mark the PR with a +1 or approve the review? Thank you very much for both your libraries and for taking the time to review this PR!

@nielsbasjes (Contributor) commented:

I had a look at the tests that check the wildcard flattening, and this looks very good to me.

+1

@cgivre cgivre merged commit e87edad into apache:master Dec 6, 2020
@cgivre cgivre deleted the tmp_http branch December 6, 2020 20:50
@jnturton (Contributor) commented Dec 9, 2020

@dzamo Docs updated.
