Mayank Shrivastava edited this page Jun 29, 2016 · 19 revisions

Step-by-step guide

The Pinot Quickstart console provides a shell interface (pinot-admin) for performing the individual steps needed to upload data to, and run queries against, Pinot2_0. Broadly speaking, the steps are: starting the Pinot2_0 processes, generating and loading data into Pinot2_0, and querying the data. The details of each step, along with examples, are listed below, in the order in which the steps must be performed.

The pinot-admin console currently supports localhost only. Help on using the pinot-admin.sh script is provided toward the end of this article.

Note: To override the default JVM arguments used by the pinot-admin script, set the JAVA_OPTS environment variable to an appropriate value.
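For example, the JVM heap can be sized before invoking the script (the values below are illustrative, not recommendations):

```shell
# Illustrative JVM settings; tune heap sizes for your own hardware.
export JAVA_OPTS="-Xms512M -Xmx1G"
# Any pinot-admin.sh command run in this shell now uses these arguments, e.g.:
#   pinot-admin.sh StartController -controllerPort 9000 -zkAddress "localhost:2181"
```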

Starting the processes:

  • Start Zookeeper: The StartZookeeper command below starts the Zookeeper process on localhost at port 2181. Use the -zkPort option (and others) to start the process on a different port (or apply other customizations). It is not mandatory to use the pinot-admin script to launch Zookeeper; it can be launched by other means (e.g., zkServer.sh from a Zookeeper installation). Regardless of how Zookeeper is launched, note the port at which it runs, as it is needed in the steps below.
pinot-admin.sh StartZookeeper
  • Start Pinot-Controller: The StartController command below starts the Pinot-Controller on localhost at port 9000, connecting it to the Zookeeper server at the provided address.
pinot-admin.sh StartController \
 -controllerPort 9000 \
 -zkAddress "localhost:2181" \
 -dataDir "/tmp/PinotController"
  • Start Pinot-Broker: The StartBroker command below starts the Pinot-Broker process on localhost at the default port (8099) and connects it to the specified Zookeeper server and cluster.
pinot-admin.sh StartBroker -brokerPort 8099 -zkAddress "localhost:2181"
  • Start Pinot-Server: The StartServer command below starts the Pinot-Server process on localhost at the specified port (8098) and connects it to the specified Zookeeper process. The specified data/segment directories are used for temporary storage by the Pinot-Server.
pinot-admin.sh StartServer \
    -serverPort 8098 \
    -dataDir /tmp/data \
    -segmentDir /tmp/segment \
    -zkAddress "localhost:2181"

Generating data and creating Pinot-segments:

  • Generate Data: The GenerateData command takes as input the number of records/files to generate and the Pinot2_0 schema in JSON format, and generates random data in AVRO format according to the specified options. Use the -schemaAnnotationFile option (annotations in JSON) to customize the cardinality of dimension columns and the range of metric and time columns. Examples of a schema file and an annotation file are provided at the end of this article. If the output directory already exists, the command errors out; provide a directory name that does not exist, or use the -overwrite option to overwrite the existing data.
pinot-admin.sh GenerateData \
    -numRecords 1000 \
    -numFiles 1 \
    -schemaFile /tmp/example.sch \
    -outDir PinotData
  • Create Segments: The CreateSegment command below takes the AVRO data generated by the GenerateData command, along with the schema file and table name, and generates Pinot2_0 segments. If the output directory already exists, the command errors out; provide a directory name that does not exist, or use the -overwrite option to overwrite the existing directory.
pinot-admin.sh CreateSegment \
    -schemaFile /tmp/schema.sch \
    -dataDir /tmp/PinotData \
    -tableName myTable \
    -segmentName mySegment \
    -outDir /tmp/PinotSeg
  • Star Tree Segments: Pinot now offers a special indexing scheme for fast aggregations called Star Tree. The data is represented in a tree structure in which each level splits the data on a particular dimension. Each level also contains a node, called the star node, that contains all columns other than the columns the data has been split on so far (including the current level's dimension). All metric columns in the star node are aggregated using the remaining dimension columns as keys. To generate segments in this format, use the CreateSegment command with the following options.
    -enableStarTreeIndex true \
    -starTreeIndexSpecFile <specFile>
  • The index spec file (optional) can be used to configure various parameters of star tree segment generation:
    • maxLeafRecords (default 100,000): A node at any level is not split further if it has fewer than maxLeafRecords records.
    • dimensionsSplitOrder: The order in which dimension columns are picked to split the data. The default order is computed by sorting the dimensions in decreasing order of cardinality.
    • skipStarNodeCreationForDimensions: When the data is split on these dimensions, star node creation is skipped. Default is empty.
    • skipMaterializationForDimensions: These dimensions are removed during materialization. By default, all high-cardinality (> 10,000) dimensions are removed while materializing.
    • skipMaterializationCardinalityThreshold: The cardinality threshold above which dimensions are skipped during materialization. The default value is 10,000.

A sample spec file is provided below.

{
  "maxLeafRecords" : 10000,
  "dimensionsSplitOrder" : [ "dim1", "dim2", "dim3", "dim4", "dim5" ],
  "skipStarNodeCreationForDimensions" : [ "dim5", "dim4" ],
  "skipMaterializationForDimensions" : [ "dim6" ],
  "skipMaterializationCardinalityThreshold" : 10000
}
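Putting the options together, a complete star tree segment build could look like the following (the segment name, output directory, and spec file path are hypothetical):

```shell
pinot-admin.sh CreateSegment \
    -schemaFile /tmp/schema.sch \
    -dataDir /tmp/PinotData \
    -tableName myTable \
    -segmentName myStarTreeSegment \
    -outDir /tmp/PinotStarTreeSeg \
    -enableStarTreeIndex true \
    -starTreeIndexSpecFile /tmp/starTreeSpec.json
```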

Add Table

  • Add Table: The AddTable command below creates a table of the specified name at the specified controller. Use the -filePath option to specify the table configuration file in JSON format; a sample configuration file is provided toward the end of this article. The controller port must match the one at which the StartController command started the Pinot-Controller process. Specify the -exec option to execute the command; if it is not specified, the command to be executed is only printed, not run.
pinot-admin.sh AddTable -filePath ./data/table.json -controllerPort 9000 -exec

Upload and query:

  • Upload Segment: The UploadSegment command below tells the specified Pinot-Controller to upload the data (pinot-segments), which it in turn pushes to the Pinot-Server. The controller port must match the one at which the StartController command started the Pinot-Controller process, and the segment directory must be the one in which the CreateSegment command created the data segments.
pinot-admin.sh UploadSegment -controllerPort 9000 -segmentDir /tmp/PinotSeg
  • Query Data: After the steps above, the data is loaded into Pinot2_0 and ready to be queried. The following example counts the records in which the age column has a value greater than 20:
pinot-admin.sh PostQuery -brokerPort 8099 -query "select count(*) from 'myTable' where age > 20"
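The broker also accepts queries over HTTP. The endpoint path and payload shape below are assumptions that may vary between Pinot versions; verify them against your build before relying on this:

```shell
# Assumed broker endpoint and PQL payload; verify against your Pinot version.
curl -X POST http://localhost:8099/query \
    -d '{"pql": "select count(*) from myTable where age > 20"}'
```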
  • Stop process: The StopProcess command can be used to stop one or more of the processes started above. In the following example, the controller, server, and broker are all stopped at once.
pinot-admin.sh StopProcess -controller -server -broker
  • Delete Cluster: The DeleteCluster command can be used to delete a previously created Helix cluster. The command errors out if it is unable to connect to the provided zkAddress, or if the provided cluster does not exist.
pinot-admin.sh DeleteCluster -clusterName pinotCluster -zkAddress localhost:2181
  • Show Cluster Info: The ShowClusterInfo command prints the cluster info stored in Zookeeper. The command errors out if it is unable to connect to the provided zkAddress, or if the provided cluster does not exist. You can optionally specify table names or tag names to restrict the output.
pinot-admin.sh ShowClusterInfo -clusterName pinotCluster -zkAddress localhost:2181

Using the pinot-admin.sh script: The pinot-admin.sh script is built along with the Pinot2_0 code and is located at:

pinot-tools/target/pinot-tools-pkg/bin/pinot-admin.sh
  • Use the -help option to list the available sub-commands:
pinot-admin.sh -help
Usage: pinot-admin.sh <subCommand>
Valid subCommands are:
    GenerateData
    StartController
    PostQuery
    ....
  • Use <subCommand> -help to list the available options of an individual command:
pinot-admin.sh StartController -help
Usage: StartController
    -clusterName    <string>       : Name of the cluster. (required=true)
    -tableName      <string>       : Name of the table. (required=false)
    -controllerPort <int>          : Port number to start the controller at. (required=true)
    -dataDir        <string>       : Path to directory containing data. (required=false)
    -zkAddress      <http>         : Http address of Zookeeper. (required=true)
    -help                          : Print this message. (required=false)

Schema format examples

Example Schema file (JSON format):

{
  "dimensionFieldSpecs" : [
    {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "Year"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "Quarter"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "Month"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "DayofMonth"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "DayOfWeek"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "FlightDate"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "Carrier"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "Origin"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "OriginCityName"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "OriginState"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "Dest"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "DestCityName"
     },
     {
      "delimiter" : null,
      "dataType" : "STRING",
      "defaultNullValue" : "null",
      "singleValueField" : true,
      "name" : "DestState"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "CRSDepTime"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "DepTime"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "DepDelay"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "TaxiOut"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "TaxiIn"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "CRSArrTime"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "ArrTime"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "ArrDelay"
     },
     {
      "delimiter" : null,
      "dataType" : "INT",
      "defaultNullValue" : -2147483648,
      "singleValueField" : true,
      "name" : "Cancelled"
    }
  ],
  "schemaName" : "OnTime"
}
  • Example Schema Annotation file:
[
  {
    "column" : "name",
    "range" : false,
    "cardinality" : 100
  },
  {
    "column" : "days",
    "range" : true,
    "rangeStart" : 100,
    "rangeEnd"   : 200
  },
  {
    "column" : "percent",
    "range" : true,
    "rangeStart" : 0.1,
    "rangeEnd"   : 99.99
  },
  {
    "column" : "age",
    "range" : false,
    "cardinality" : 100
  }
]
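To apply annotations like those above during data generation, pass the annotation file via the -schemaAnnotationFile option (file paths here are hypothetical):

```shell
pinot-admin.sh GenerateData \
    -numRecords 1000 \
    -numFiles 1 \
    -schemaFile /tmp/example.sch \
    -schemaAnnotationFile /tmp/annotations.json \
    -outDir /tmp/PinotData
```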
Sample Table configuration JSON file:
{
    "tableName":"myTable",
    "segmentsConfig" : {
        "retentionTimeUnit":"DAYS",
        "retentionTimeValue":"700",
        "segmentPushFrequency":"daily",
        "segmentPushType":"APPEND",
        "replication" : "3",
        "schemaName" : "tableSchemaName",
        "timeColumnName" : "timeColumnName",
        "timeType" : "timeType",
        "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
    },
    "tableIndexConfig" : {
        "invertedIndexColumns" : ["column1","column2"],
        "loadMode"  : "HEAP",
        "lazyLoad"  : "false"
    },
    "tenants" : {
        "broker":"brokerOne",
        "server":"serverOne"
    },
    "tableType":"OFFLINE",
    "metadata": {
        "customConfigs" : {
            "d2Name":"Test"
        }
    }
}
