
Conversation

kszucs (Member) commented Jul 13, 2020

This uses the existing numpydoc checker in archery. I've enabled a single rule that checks for undocumented arguments (we currently have 189).

Perhaps we should fix these in this PR then enable it globally to keep the docstrings in good shape?

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@kszucs kszucs changed the title [Python][CI] Create a build for validating python docstrings ARROW-13208: [Python][CI] Create a build for validating python docstrings Jun 29, 2021
kszucs (Member, Author) commented Jun 29, 2021

We have many undocumented arguments in the pyarrow API. It would be nice to introduce the numpydoc checks, but at some point we need to enable at least one rule that causes the build to fail.
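For context, the rule in question (PR01) flags any signature parameter that is missing from the docstring's Parameters section. A minimal, self-contained sketch of the idea follows; the `is_boolean` body and the `undocumented_params` helper are illustrative stand-ins, not the pyarrow or numpydoc implementation:

```python
import inspect
import re

def is_boolean(t):
    """Return True if t is a boolean type.

    Parameters
    ----------
    t : DataType
        The data type to check.
    """
    return getattr(t, "id", None) == "bool"

def undocumented_params(func):
    """Toy version of the PR01 check: return signature parameters
    that are missing from the docstring's Parameters section."""
    sig = set(inspect.signature(func).parameters)
    doc = inspect.getdoc(func) or ""
    # numpydoc-style parameter lines look like "name : type"
    documented = set(re.findall(r"^(\*{0,2}\w+) :", doc, re.MULTILINE))
    return sig - documented

print(undocumented_params(is_boolean))  # set() -> nothing flagged
```

A function whose docstring lacks a Parameters section would instead return all of its signature parameters, which is roughly what the log below reports for each symbol.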

cc @pitrou @jorisvandenbossche @amol

@apache apache deleted a comment from github-actions bot Jun 29, 2021
pitrou (Member) commented Jun 29, 2021

@kszucs Why not. Is there a way to selectively disable the check on some functions? (a Python comment perhaps?)

pitrou (Member) commented Jun 29, 2021

@kszucs Can you paste the current error log somewhere?

kszucs (Member, Author) commented Jun 30, 2021

Truncated log with a single rule (PR01) enabled:

...

pyarrow.lib.is_boolean_value
-> pyarrow.lib.is_boolean_value(obj)
PR01: Parameters {'obj'} not documented

pyarrow.types.is_boolean
PR01: Parameters {'t'} not documented

pyarrow.types.is_binary
PR01: Parameters {'t'} not documented

pyarrow.gandiva.module.make_projector
-> pyarrow.gandiva.make_projector(Schema schema, children, MemoryPool pool, unicode selection_mode=u'NONE')
PR01: Parameters {'schema', 'selection_mode', 'pool', 'children'} not documented

pyarrow.gandiva.module.make_filter
-> pyarrow.gandiva.make_filter(Schema schema, Condition condition)
PR01: Parameters {'schema', 'condition'} not documented

pyarrow.lib.frombytes
-> pyarrow.lib.frombytes(o, *, safe=False)
PR01: Parameters {'safe', 'o'} not documented

pyarrow.parquet.write_table
PR01: Parameters {'use_compliant_nested_type', 'use_byte_stream_split', 'where', '**kwargs', 'row_group_size', 'compression_level'} not documented

pyarrow.parquet.write_metadata
PR01: Parameters {'where', 'metadata_collector'} not documented

pyarrow.parquet.read_table
PR01: Parameters {'columns', 'source'} not documented

pyarrow.parquet.read_pandas
PR01: Parameters {'**kwargs', 'source', 'columns'} not documented

pyarrow.parquet.PartitionSet
PR01: Parameters {'keys', 'name'} not documented

pyarrow.parquet.PartitionSet.get_index
PR01: Parameters {'key'} not documented

pyarrow.parquet.ParquetWriter
PR01: Parameters {'writer_engine_version', 'use_compliant_nested_type', 'use_byte_stream_split', 'compression_level'} not documented

pyarrow.parquet.ParquetFile
PR01: Parameters {'read_dictionary'} not documented

pyarrow.parquet.ParquetFile.read_row_groups
PR01: Parameters {'row_groups', 'columns'} not documented

pyarrow.parquet.ParquetFile.read_row_group
PR01: Parameters {'i', 'columns'} not documented

pyarrow.parquet.ParquetFile.read
PR01: Parameters {'columns'} not documented

pyarrow.parquet.ParquetFile.iter_batches
PR01: Parameters {'batch_size', 'row_groups', 'columns'} not documented

pyarrow.parquet.ParquetDatasetPiece
PR01: Parameters {'file_options'} not documented

pyarrow.parquet.ParquetDatasetPiece.read
PR01: Parameters {'use_pandas_metadata'} not documented

pyarrow.parquet.ParquetDataset
PR01: Parameters {'metadata_nthreads'} not documented

pyarrow.parquet.ParquetDataset.read_pandas
PR01: Parameters {'**kwargs'} not documented

pyarrow._csv.write_csv
-> pyarrow._csv.write_csv(data, output_file, write_options=None, MemoryPool memory_pool=None)
PR01: Parameters {'output_file', 'write_options', 'memory_pool', 'data'} not documented

pyarrow._csv.read_csv
-> pyarrow._csv.read_csv(input_file, read_options=None, parse_options=None, convert_options=None, MemoryPool memory_pool=None)
PR01: Parameters {'memory_pool', 'read_options', 'convert_options', 'input_file', 'parse_options'} not documented

pyarrow._csv.open_csv
-> pyarrow._csv.open_csv(input_file, read_options=None, parse_options=None, convert_options=None, MemoryPool memory_pool=None)
PR01: Parameters {'memory_pool', 'read_options', 'convert_options', 'input_file', 'parse_options'} not documented

pyarrow._csv.ReadOptions
-> pyarrow._csv.ReadOptions(use_threads=None, *, block_size=None, skip_rows=None, column_names=None, autogenerate_column_names=None, encoding=u'utf8', skip_rows_after_names=None)
PR01: Parameters {'encoding', 'column_names', 'skip_rows', 'autogenerate_column_names', 'skip_rows_after_names'} not documented

pyarrow._csv.ParseOptions
-> pyarrow._csv.ParseOptions(delimiter=None, *, quote_char=None, double_quote=None, escape_char=None, newlines_in_values=None, ignore_empty_lines=None)
PR01: Parameters {'double_quote', 'delimiter', 'newlines_in_values', 'quote_char', 'escape_char', 'ignore_empty_lines'} not documented

pyarrow._csv.ConvertOptions
-> pyarrow._csv.ConvertOptions(check_utf8=None, *, column_types=None, null_values=None, true_values=None, false_values=None, strings_can_be_null=None, quoted_strings_can_be_null=None, include_columns=None, include_missing_columns=None, auto_dict_encode=None, auto_dict_max_cardinality=None, timestamp_parsers=None)
PR01: Parameters {'true_values', 'column_types', 'quoted_strings_can_be_null', 'strings_can_be_null', 'false_values', 'include_columns', 'auto_dict_max_cardinality', 'include_missing_columns', 'null_values', 'auto_dict_encode', 'timestamp_parsers'} not documented

pyarrow._dataset.TaggedRecordBatch
-> pyarrow._dataset.A combination of a record batch and the fragment it came from.
PR01: Parameters {'record_batch', 'fragment'} not documented

pyarrow._dataset.RowGroupInfo
-> pyarrow._dataset.A wrapper class for RowGroup information
PR01: Parameters {'schema', 'id', 'metadata'} not documented

pyarrow._flight.connect
-> pyarrow._flight.connect(location, **kwargs)
PR01: Parameters {'**kwargs', 'location'} not documented

pyarrow._flight.Ticket
-> pyarrow._flight.Ticket(ticket)
PR01: Parameters {'ticket'} not documented

pyarrow._flight.SchemaResult
-> pyarrow._flight.SchemaResult(Schema schema)
PR01: Parameters {'schema'} not documented

pyarrow._flight.Result
-> pyarrow._flight.Result(buf)
PR01: Parameters {'buf'} not documented

pyarrow._flight.RecordBatchStream
-> pyarrow._flight.RecordBatchStream(data_source, options=None)
PR01: Parameters {'data_source', 'options'} not documented

pyarrow._flight.Location
-> pyarrow._flight.Location(uri)
PR01: Parameters {'uri'} not documented

pyarrow._flight.for_grpc_unix
-> pyarrow._flight.Location.for_grpc_unix(path)
PR01: Parameters {'path'} not documented

pyarrow._flight.for_grpc_tls
-> pyarrow._flight.Location.for_grpc_tls(host, port)
PR01: Parameters {'host', 'port'} not documented

pyarrow._flight.for_grpc_tcp
-> pyarrow._flight.Location.for_grpc_tcp(host, port)
PR01: Parameters {'host', 'port'} not documented

pyarrow._flight.GeneratorStream
-> pyarrow._flight.GeneratorStream(schema, generator, options=None)
PR01: Parameters {'schema', 'generator', 'options'} not documented

pyarrow._flight.FlightWriteSizeExceededError
-> pyarrow._flight.A write operation exceeded the client-configured limit.
PR01: Parameters {'message', 'actual', 'limit'} not documented

pyarrow._flight.FlightServerBase
-> pyarrow._flight.FlightServerBase(location=None, auth_handler=None, tls_certificates=None, verify_client=None, root_certificates=None, middleware=None)
PR01: Parameters {'auth_handler', 'verify_client', 'root_certificates', 'middleware', 'tls_certificates'} not documented

pyarrow._flight.FlightMethod
-> pyarrow._flight.The implemented methods in Flight.
PR01: Parameters {'qualname', 'type', 'names', 'start', 'value', 'module'} not documented

pyarrow._flight.FlightInfo
-> pyarrow._flight.FlightInfo(Schema schema, FlightDescriptor descriptor, endpoints, total_records, total_bytes)
PR01: Parameters {'schema', 'endpoints', 'total_records', 'descriptor', 'total_bytes'} not documented

pyarrow._flight.FlightError
-> pyarrow._flight.FlightError(message=u'', extra_info=b'')
PR01: Parameters {'message', 'extra_info'} not documented

pyarrow._flight.FlightEndpoint
-> pyarrow._flight.FlightEndpoint(ticket, locations)
PR01: Parameters {'ticket', 'locations'} not documented

pyarrow._flight.for_path
-> pyarrow._flight.FlightDescriptor.for_path(*path)
PR01: Parameters {'*path'} not documented

pyarrow._flight.for_command
-> pyarrow._flight.FlightDescriptor.for_command(command)
PR01: Parameters {'command'} not documented

pyarrow._flight.FlightClient
-> pyarrow._flight.FlightClient(location, tls_root_certs=None, *, cert_chain=None, private_key=None, override_hostname=None, middleware=None, write_size_limit_bytes=None, disable_server_verification=None, generic_options=None)
PR01: Parameters {'generic_options', 'override_hostname', 'cert_chain', 'tls_root_certs', 'middleware', 'disable_server_verification', 'private_key', 'write_size_limit_bytes'} not documented

pyarrow._flight.FlightCallOptions
-> pyarrow._flight.FlightCallOptions(timeout=None, write_options=None, headers=None)
PR01: Parameters {'write_options', 'timeout', 'headers'} not documented

pyarrow._flight.DescriptorType
PR01: Parameters {'qualname', 'type', 'names', 'start', 'value', 'module'} not documented

pyarrow._flight.CertKeyPair
-> pyarrow._flight.A TLS certificate and key for use in Flight.
PR01: Parameters {'cert', 'key'} not documented

pyarrow._flight.CallInfo
-> pyarrow._flight.Information about a particular RPC for Flight middleware.
PR01: Parameters {'method'} not documented

pyarrow._flight.BasicAuth
-> pyarrow._flight.BasicAuth(username=None, ***
PR01: Parameters {'password', 'username'} not documented

pyarrow._flight.deserialize
-> pyarrow._flight.BasicAuth.deserialize(string)
PR01: Parameters {'string'} not documented

pyarrow._flight.ActionType
-> pyarrow._flight.A type of action that is executable on a Flight service.
PR01: Parameters {'type', 'description'} not documented

pyarrow._flight.Action
-> pyarrow._flight.Action(action_type, buf)
PR01: Parameters {'buf', 'action_type'} not documented

pyarrow.lib.schema
-> pyarrow.lib.schema(fields, metadata=None)
PR01: Parameters {'fields'} not documented

pyarrow.feather.read_feather
PR01: Parameters {'use_threads'} not documented

pyarrow.lib.concat_tables
-> pyarrow.lib.concat_tables(tables, bool promote=False, MemoryPool memory_pool=None)
PR01: Parameters {'promote'} not documented

pyarrow.lib.is_available
-> pyarrow.lib.Codec.is_available(unicode compression)
PR01: Parameters {'compression'} not documented

pyarrow._json.read_json
-> pyarrow._json.read_json(input_file, read_options=None, parse_options=None, MemoryPool memory_pool=None)
PR01: Parameters {'memory_pool', 'read_options', 'parse_options', 'input_file'} not documented

pyarrow._json.ParseOptions
-> pyarrow._json.ParseOptions(explicit_schema=None, newlines_in_values=None, unexpected_field_behavior=None)
PR01: Parameters {'unexpected_field_behavior', 'explicit_schema', 'newlines_in_values'} not documented

pyarrow.orc.ORCWriter.write
PR01: Parameters {'table'} not documented

Total number of docstring violations: 186

kszucs (Member, Author) commented Jun 30, 2021

> @kszucs Why not. Is there a way to selectively disable the check on some functions? (a Python comment perhaps?)

It retrieves data from the Python symbols rather than parsing the source files. Otherwise we can maintain a list of symbols (modules, for example) to ignore, as well as a list to traverse in the first place.
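The symbol-based approach can be sketched with the standard library alone. This is an illustrative stand-in, not archery's actual traversal code; it is demonstrated on `json` here, whereas archery would walk the pyarrow modules:

```python
import importlib

def public_callables(module_name):
    """Collect a module's public callables by importing it and walking
    its symbols, instead of parsing source files."""
    mod = importlib.import_module(module_name)
    # Prefer the module's declared public API; fall back to dir()
    names = getattr(mod, "__all__", None) or [
        n for n in dir(mod) if not n.startswith("_")
    ]
    return sorted(n for n in names if callable(getattr(mod, n, None)))

print(public_callables("json"))  # includes 'dumps', 'loads', ...
```

Because the check operates on imported objects, skipping a symbol means maintaining an explicit ignore list rather than adding a comment next to its definition.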

jorisvandenbossche (Member) commented:

Note that several of the "errors" in the output come from not using a space before the colon (param: type instead of param : type); those parameters are actually documented, just mis-formatted.

> Otherwise we can maintain a list of symbols (modules for example) to ignore as well as a list to traverse at the first place.

I think your archery command allows specifying which modules to test? So we could start by selectively testing only the ones that pass.
(That still doesn't allow skipping a specific function, though, in case there is a good reason to skip one.)

amol- (Member) commented Jul 14, 2021

A good first step might be to ensure no new ones can be added. In other contexts I used to wc -l the output of flake8 before and after the PR (that is, on master and on the PR branch) to ensure the PR didn't introduce any new errors.

A similar strategy might be applied here: while we fix the existing ones, we can put in place a check that prevents new ones from appearing by simply counting the number of existing errors. (If you want to go further you can compare the list of functions with errors, but in my experience just comparing the count does 90% of the job.)
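This count-based ratchet can be sketched in a few lines. The baseline number and the helper below are hypothetical, not an existing archery feature; the current count would be parsed in CI from the checker's summary line ("Total number of docstring violations: ..."):

```python
# Count-based regression guard: fail only when a PR increases the
# number of docstring violations relative to a recorded baseline.

BASELINE = 186  # violations on master at the time; lower it as they get fixed

def check_violations(current, baseline=BASELINE):
    """Return a nonzero exit status if new violations were introduced."""
    if current > baseline:
        print(f"{current - baseline} new docstring violation(s) introduced")
        return 1
    print(f"OK: {current} violation(s), baseline is {baseline}")
    return 0

print(check_violations(186))
```

The trade-off amol- notes applies: the count can stay flat while one error is fixed and another introduced, so comparing the actual lists is stricter, but the count alone catches most regressions.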

pitrou (Member) commented Jul 21, 2021

@kszucs Can you rebase and fix conflicts?

'pyarrow.dataset',
'pyarrow.feather',
'pyarrow.flight',
# 'pyarrow.flight',
kszucs (Member, Author) commented on the diff:

Temporarily disabled, created a follow-up: https://issues.apache.org/jira/browse/ARROW-14995

return Projector.create(result, pool)


cpdef make_filter(Schema schema, Condition condition):
kszucs (Member, Author) commented on the diff:

These functions are only used from the unit tests, so we should not expose them in the public API.

Created a follow-up https://issues.apache.org/jira/browse/ARROW-14996

kszucs (Member, Author) commented Dec 6, 2021

cc @amol-

@kszucs kszucs requested a review from pitrou December 7, 2021 10:38
kszucs (Member, Author) commented Dec 7, 2021

Created a follow-up to iteratively fix and enable more numpydoc checks: https://issues.apache.org/jira/browse/ARROW-15006

'setuptools_scm'],
'crossbow-upload': ['github3.py', jinja_req, 'ruamel.yaml',
'setuptools_scm'],
'numpydoc': ['numpydoc==1.1.0']
A reviewer (Member) commented on the diff:

Is there a particular reason for pinning to this exact version?

kszucs (Member, Author) replied:

This is the first version that provides the class and function we use from numpydoc, which seem a bit internal-ish at the moment.

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@kszucs kszucs closed this in 2bffb82 Dec 8, 2021
ursabot commented Dec 8, 2021

Benchmark runs are scheduled for baseline = b31dd51 and contender = 2bffb82. 2bffb82 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.74% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


Labels

Component: Gandiva Component: Python needs-rebase A PR that needs to be rebased by the author


5 participants