Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on `main` here
Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Documentation problem
I have noted the following clarity issues in the `read_csv` docstring description of the `header` parameter:
- The behavior associated with `header=None` is not explicitly defined.
- The default behavior is described in terms of `header=0` and `header=None`, neither of which has been clearly explained at that point.
- The relationship between file line numbers (conventionally numbered from 1) and row numbers/indices (indexed from 0) is not described explicitly; it is only implied by examples such as `header=0` meaning the first line.
- The phrase "if column names are passed explicitly" is vague, as it does not say how they are passed (i.e. via the `names` parameter).
- The detailed descriptions do not follow the order of the initial list of accepted values.
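To illustrate the third point, here is a minimal sketch of how `header=0` relates to physical file lines when comments and blank lines are present. The CSV text and variable names are made up for illustration; only `pd.read_csv` and its `comment` and `header` parameters come from pandas.

```python
import io

import pandas as pd

# Hypothetical file contents: a comment on physical line 1, a blank line on
# line 2, the header on line 3, and one data row on line 4.
raw_text = "# exported by some tool\n\ncol_a,col_b\n1,2\n"

# With comment="#" and the default skip_blank_lines=True, the commented and
# blank lines are ignored, so index 0 refers to physical line 3.
df = pd.read_csv(io.StringIO(raw_text), comment="#", header=0)

print(list(df.columns))
print(df.shape)
```

Here `header=0` selects the third physical line of the text, which is the gap between "line number" and "row index" that the current docstring leaves implicit.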
Suggested fix for documentation
Original docstring:
header : int, list of int, None, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to `header=0` and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to `header=None`. Explicitly pass `header=0` to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if `skip_blank_lines=True`, so `header=0` denotes the first line of data rather than the first line of the file.
Issues of clarity noted in the docstring description:
Proposed change to address the issues:
header : int, list of int, None, default 'infer'
Index or indices corresponding to line number(s) in the CSV file that will be read as DataFrame column labels. Index 0 corresponds to the first line in the file (or the first non-blank, non-commented line if `skip_blank_lines=True`). The following arguments are valid:
- Single `int`: denotes the line index at which column labels will be read.
- List of `int`: denotes line indices at which column labels will be read as a multi-index. Note: intervening rows not specified in the list will be skipped (e.g., for `header=[0,1,3]`, the line at index 2 will be skipped).
- `None`: indicates that none of the lines in the file will be interpreted as headers and columns will instead be labelled by column index (or by values passed to the `names` parameter when provided). This is typically for files with no header. If the file has a header which the user intends to override with the `names` parameter, `header` should be assigned 0 instead of None.
- `'infer'` (default): behaves as `header=0` if no `names` were passed, otherwise as `header=None`.
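As a rough check of the single-`int`, `None`, and `'infer'` cases, the sketch below reads the same small CSV text with different arguments. The sample text and the helper `read_sample` are hypothetical and exist only for this illustration; the list-of-`int` case is not covered here.

```python
import io

import pandas as pd

# Hypothetical two-column file with a header line and two data rows.
SAMPLE_TEXT = "a,b\n1,2\n3,4\n"


def read_sample(**kwargs):
    """Read SAMPLE_TEXT with the given read_csv keyword arguments."""
    return pd.read_csv(io.StringIO(SAMPLE_TEXT), **kwargs)


# Single int: line index 0 supplies the column labels.
df_header0 = read_sample(header=0)

# None: no line is treated as a header, so columns are labelled 0 and 1
# and the "a,b" line is read as data.
df_none = read_sample(header=None)

# Default 'infer' without names: same result as header=0.
df_infer = read_sample()

# Default 'infer' with names: same result as header=None, so the original
# "a,b" line stays in the data as the first row.
df_infer_names = read_sample(names=["x", "y"])

# header=0 together with names: the original header line is consumed and
# its labels are replaced by the names passed in.
df_replaced = read_sample(header=0, names=["x", "y"])

for label, frame in [
    ("header=0", df_header0),
    ("header=None", df_none),
    ("infer, no names", df_infer),
    ("infer, names", df_infer_names),
    ("header=0, names", df_replaced),
]:
    print(label, list(frame.columns), len(frame))
```

Comparing the last two calls shows why the proposed text advises `header=0` rather than `None` when overriding an existing header: with `names` alone, the old header line ends up as a data row.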
Note: the proposed wording is more in line with how the `read_excel` function is described, which would also improve consistency between the two similar functions. I would also like to propose further edits to other parameter descriptions in this function, but wanted to gauge support first by focusing on a specific example.