Skip to content

Some fixes and tests for spaces/non-ASCII chars in datasource names#6761

Merged
fjy merged 7 commits intoapache:masterfrom
jon-wei:dsname-kafka
Jan 15, 2019
Merged

Some fixes and tests for spaces/non-ASCII chars in datasource names#6761
fjy merged 7 commits intoapache:masterfrom
jon-wei:dsname-kafka

Conversation

@jon-wei
Copy link
Copy Markdown
Contributor

@jon-wei jon-wei commented Dec 20, 2018

This PR fixes URL encoding issues found in some internal API accesses when datasource names contain spaces or non-ASCII characters, and changes the indexing integration tests to use such datasource names.

Not fixed in this PR:

  • The "indexer" tab in the coordinator console has issues with such datasource names, I am not familiar with that code and haven't looked into it much
  • S3 and HDFS deep storage have some URL encoding issues as well, I am working on another patch that will address those and add some tests for those situations

There may be other issues beyond what I mentioned above.

@jon-wei jon-wei added the Bug label Dec 20, 2018
@jon-wei jon-wei force-pushed the dsname-kafka branch 2 times, most recently from 543bf1b to 578f86b Compare December 20, 2018 00:51

try {
// application/x-www-form-urlencoded encodes spaces as "+", but we use this to encode non-form
// data as well, so replace "+" with "%20".
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add this comment to javadoc for this method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to method javadoc

Assert.assertEquals(s1, "aaa%20bbb");

String s2 = StringUtils.urlEncode("fff+ggg");
Assert.assertEquals(s2, "fff%2Bggg");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also do a decode and verify that the original string is read back ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added decode checks

"namespace",
"page",
"regionIsoCode",
"regionName",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove these dimensions ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full story behind this change is:

  • This file is used only by NestedQueryPushDownTest
  • This test wasn't previously being run by Travis (see the change in ci/travis_script_integration_part2.sh)
  • When I started running the test locally and in travis, I was getting test timeout errors because ingesting this task's data took too long, so I shrunk the input data by removing columns
  • The tests run group by (channel, user) queries, so I kept the "page" dimension in there to preserve some level of query time aggregation

@Override
public String getExtraDatasourceNameSuffix()
{
return extraDatasourceNameSuffix;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a use case to make the suffix configurable ?
If not we can just have the suffix in the test datasource name itself, i believe it would simplify the changes. and each test can choose its own datasource name format.

Copy link
Copy Markdown
Contributor Author

@jon-wei jon-wei Jan 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that the non-ASCII characters could make the tests fail for people who do not have the proper locales set up on their system. I had to tweak locale settings on the ubuntu containers used in the integration tests (IT setup uses a shared folder too so the host machine needs to be configured as well), and people may not have any need for such characters in their own development/use cases, so I felt it would be nice to have a way to disable the use of such characters in the tests.

Another reason for making it a "global" thing like this is because I wanted to easily test support across all the use cases being tested in the IT suite (e.g., if I wanted to test support for some other characters across the board, I can just edit this property instead of changing each test individually)

@jon-wei
Copy link
Copy Markdown
Contributor Author

jon-wei commented Jan 9, 2019

@nishantmonu51 Merged with master and fixed conflicts, can you take another look?

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants