Skip to content

Conversation

@kevinrr888
Copy link
Member

This is part of #4687
Changes:

  • Added new admin check command
  • admin check list prints the checks that can be run along with their descriptions and dependencies
  • admin check run runs all the checks or a specified list of checks (explicitly provide each check, provide a regex of the checks to run, or just run all if no further args are provided). The dependencies are run first and if a checks dependency fails, the check is skipped
  • Current impl of admin check run does not do any actual work besides printing that the check is running (to ensure correct run order) and returning an OK status
  • New IT AdminCheckIT
    • This IT includes a verion of the Checks where no work is actually done (right now, this is equivalent to the actual implementation). This is done so correct run order, correct checks run, etc. can be
      verified without actually running the checks which may take a long time. More tests can be added later to test the actual check functionality when that is implemented.
  • No existing functionality was changed

Questions:

  1. Are new classes and packages in suitable locations?
  2. Is the way I went about implementing the dummy checks in AdminCheckIT okay? Mostly concerned if changing the real static CHECK_RUNNERS field is okay (changes it during the test but asserts that it is reset after the test completes).

(Part of apache#4687)
- Added new `admin check` command
- `admin check list` prints the checks that can be run along with their descriptions and dependencies
- `admin check run` runs all the checks or a specified list of checks (explicitly provide each check, provide a regex of the checks to run, or just run all if no further args are provided). The dependencies are run first and if a checks dependency fails, the check is skipped
- Current impl of `admin check run` does not do any actual work besides printing that the check is running (to ensure correct run order) and returning an `OK` status
- New IT AdminCheckIT
	- This IT includes a verion of the Checks where no work is actually done (right now, this is equivalent to the actual implementation). This is done so correct run order, correct checks run, etc. can be
	verified without actually running the checks which may take a long time. More tests can be added later to test the actual check functionality when that is implemented.
- No existing functionality was changed
@kevinrr888
Copy link
Member Author

Some example outputs and a sketch of the current dependency tree as support:

Dependency tree sketch:

        SYSTEM_CONFIG
              |
              V
        ROOT_METADATA
              |
              V
          ROOT_TABLE
         /          \
   METADATA_TABLE  SYSTEM_FILES
         |
         V
     USER_FILES

Sample outputs:

$ accumulo admin check list

Check Name           | Description                                                                      | Depends on          
------------------------------------------------------------------------------------------------------------------------
SYSTEM_CONFIG        | Validate the system config stored in ZooKeeper                                   |                     
ROOT_METADATA        | Checks integrity of the root tablet metadata stored in ZooKeeper                 | SYSTEM_CONFIG       
ROOT_TABLE           | Scans all the tablet metadata stored in the root table and checks integrity      | ROOT_METADATA       
METADATA_TABLE       | Scans all the tablet metadata stored in the metadata table and checks integrity  | ROOT_TABLE          
SYSTEM_FILES         | Checks that files in system tablet metadata exist in DFS                         | ROOT_TABLE          
USER_FILES           | Checks that files in user tablet metadata exist in DFS                           | METADATA_TABLE      
------------------------------------------------------------------------------------------------------------------------
$ accumulo admin check run

Running check SYSTEM_CONFIG
Check SYSTEM_CONFIG completed with status OK
Running check ROOT_METADATA
Check ROOT_METADATA completed with status OK
Running check ROOT_TABLE
Check ROOT_TABLE completed with status OK
Running check METADATA_TABLE
Check METADATA_TABLE completed with status OK
Running check SYSTEM_FILES
Check SYSTEM_FILES completed with status OK
Running check USER_FILES
Check USER_FILES completed with status OK

Check Name           | Status              
--------------------------------------------------
SYSTEM_CONFIG        | OK                  
ROOT_METADATA        | OK                  
ROOT_TABLE           | OK                  
METADATA_TABLE       | OK                  
SYSTEM_FILES         | OK                  
USER_FILES           | OK                  
--------------------------------------------------
$ accumulo admin check run -p ".*table"

Running check SYSTEM_CONFIG
Check SYSTEM_CONFIG completed with status OK
Running check ROOT_METADATA
Check ROOT_METADATA completed with status OK
Running check ROOT_TABLE
Check ROOT_TABLE completed with status OK
Running check METADATA_TABLE
Check METADATA_TABLE completed with status OK

Check Name           | Status              
--------------------------------------------------
SYSTEM_CONFIG        | OK                  
ROOT_METADATA        | OK                  
ROOT_TABLE           | OK                  
METADATA_TABLE       | OK                  
--------------------------------------------------

In this example, I made ROOT_TABLE fail when it was run and attempted to run all the checks

$ accumulo admin check run

Running check SYSTEM_CONFIG
Check SYSTEM_CONFIG completed with status OK
Running check ROOT_METADATA
Check ROOT_METADATA completed with status OK
Running check ROOT_TABLE
Check ROOT_TABLE completed with status FAILED

Check Name           | Status              
--------------------------------------------------
SYSTEM_CONFIG        | OK                  
ROOT_METADATA        | OK                  
ROOT_TABLE           | FAILED              
METADATA_TABLE       | SKIPPED_DEPENDENCY_FAILED
SYSTEM_FILES         | SKIPPED_DEPENDENCY_FAILED
USER_FILES           | SKIPPED_DEPENDENCY_FAILED
--------------------------------------------------

@kevinrr888 kevinrr888 added this to the 3.1.0 milestone Aug 14, 2024
@kevinrr888 kevinrr888 self-assigned this Aug 14, 2024
@kevinrr888
Copy link
Member Author

This was created as a replacement for #4806 which targeted elasticity. This PR targets main

@kevinrr888
Copy link
Member Author

Relevant review comment from old PR:
#4806 (comment)

@kevinrr888
Copy link
Member Author

@keith-turner question:
Which of these would be the correct way to handle something like this:

$ accumulo admin check run SYSTEM_CONFIG USER_FILES
when SYSTEM_CONFIG fails

Check Name           | Status              
--------------------------------------------------
SYSTEM_CONFIG        | FAILED                  
ROOT_METADATA        | FILTERED_OUT        
ROOT_TABLE           | FILTERED_OUT        
METADATA_TABLE       | FILTERED_OUT        
SYSTEM_FILES         | FILTERED_OUT        
USER_FILES           | OK                  
--------------------------------------------------
Check Name           | Status              
--------------------------------------------------
SYSTEM_CONFIG        | FAILED                  
ROOT_METADATA        | FILTERED_OUT        
ROOT_TABLE           | FILTERED_OUT        
METADATA_TABLE       | FILTERED_OUT        
SYSTEM_FILES         | FILTERED_OUT        
USER_FILES           | SKIPPED_DEPENDENCY_FAILED                  
--------------------------------------------------
Check Name           | Status              
--------------------------------------------------
SYSTEM_CONFIG        | FAILED                  
ROOT_METADATA        | SKIPPED_DEPENDENCY_FAILED         
ROOT_TABLE           | SKIPPED_DEPENDENCY_FAILED         
METADATA_TABLE       | SKIPPED_DEPENDENCY_FAILED         
SYSTEM_FILES         | SKIPPED_DEPENDENCY_FAILED         
USER_FILES           | SKIPPED_DEPENDENCY_FAILED                  
--------------------------------------------------

@keith-turner
Copy link
Contributor

@kevinrr888 they all seems like good ways to handle the situation. In the examples you posted it seems to come down to

  1. Do we only look at the status of direct dependencies or do we analyze transitive dependencies?
  2. When a check is filtered out and its dependency failed, what should it status be?

Personally I would go with the option that is the most straightforward to implement an maintain. Option 1 or 3 seem like they are the most straightforward to implement because they do not do transitive analysis. Seems like option 3 chooses SKIPPED_DEPENDENCY_FAILED when its also filtered out. I lean towards option 3.

@dlmarion dlmarion changed the base branch from main to 3.1 August 26, 2024 11:59
- Added FILTERED_OUT status for checks that are not specified to run
- When checks are specified, no longer run their dependencies first
- Switched from static maps for the fields (CHECK_DESCRIPTION, CHECK_DEPENDENCIES, CHECK_RUNNERS) to instead be declared in the enum constructor.
	- This also meant refactoring AdminCheckIT
@kevinrr888
Copy link
Member Author

Addressed review comments:

  • Added FILTERED_OUT status for checks that are not specified to run
  • When checks are specified, no longer run their dependencies first
  • Switched from static maps for the fields (CHECK_DESCRIPTION, CHECK_DEPENDENCIES, CHECK_RUNNERS) to instead be declared in the enum constructor.
    • This also meant refactoring AdminCheckIT

AdminCheckIT does not have many tests yet to avoid too much refactoring if further structural changes are needed. Will add more after implementation is approved.
Re #4807 (comment): I went with option 3 in b9f6d5d

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This foundation looks really good. I think we can merge this as long as we have blocker issues about creating some of the actual checks.

@kevinrr888
Copy link
Member Author

Once this is merged, I can create the follow on issue for adding the functionality for these checks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create an admin command that performs a comprehensive check for problems

3 participants