@@ -4,23 +4,13 @@
 
 Comparison with SAS
 ********************
+
 For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__
 this page is meant to demonstrate how different SAS operations would be
 performed in pandas.
 
 .. include:: includes/introduction.rst
 
-.. note::
-
-   Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
-   ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
-   This is often used in interactive work (e.g. `Jupyter notebook
-   <https://jupyter.org/>`_ or terminal) - the equivalent in SAS would be:
-
-   .. code-block:: sas
-
-      proc print data=df(obs=5);
-      run;
 
 Data structures
 ---------------
@@ -120,7 +110,7 @@ The pandas method is :func:`read_csv`, which works similarly.
        "pandas/master/pandas/tests/io/data/csv/tips.csv"
    )
    tips = pd.read_csv(url)
-   tips.head()
+   tips
 
 
 Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify
@@ -138,6 +128,19 @@ In addition to text/csv, pandas supports a variety of other data formats
 such as Excel, HDF5, and SQL databases. These are all read via a ``pd.read_*``
 function. See the :ref:`IO documentation<io>` for more details.
 
+Limiting output
+~~~~~~~~~~~~~~~
+
+.. include:: includes/limit.rst
+
+The equivalent in SAS would be:
+
+.. code-block:: sas
+
+   proc print data=df(obs=5);
+   run;
+
+
 Exporting data
 ~~~~~~~~~~~~~~
 
@@ -173,20 +176,8 @@ be used on new or existing columns.
        new_bill = total_bill / 2;
    run;
 
-pandas provides similar vectorized operations by
-specifying the individual ``Series`` in the ``DataFrame``.
-New columns can be assigned in the same way.
-
-.. ipython:: python
-
-   tips["total_bill"] = tips["total_bill"] - 2
-   tips["new_bill"] = tips["total_bill"] / 2.0
-   tips.head()
-
-.. ipython:: python
-   :suppress:
+.. include:: includes/column_operations.rst
 
-   tips = tips.drop("new_bill", axis=1)
 
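The body of ``includes/column_operations.rst`` is not shown in this diff. Assuming it carries over the removed example, the pandas side can be sketched as below. The small ``tips`` frame is a made-up stand-in for the real dataset, not data from the docs.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset (values invented for illustration).
tips = pd.DataFrame({"total_bill": [17.0, 10.0, 21.0], "tip": [1.0, 1.5, 3.5]})

# Arithmetic on a column is vectorized: it applies to every row at once.
tips["total_bill"] = tips["total_bill"] - 2

# Assigning to a key that does not exist yet creates a new column.
tips["new_bill"] = tips["total_bill"] / 2.0
```

This mirrors the SAS ``DATA`` step above, where both assignments run once per observation.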
 Filtering
 ~~~~~~~~~
@@ -278,18 +269,7 @@ drop, and rename columns.
        rename total_bill=total_bill_2;
    run;
 
-The same operations are expressed in pandas below.
-
-.. ipython:: python
-
-   # keep
-   tips[["sex", "total_bill", "tip"]].head()
-
-   # drop
-   tips.drop("sex", axis=1).head()
-
-   # rename
-   tips.rename(columns={"total_bill": "total_bill_2"}).head()
+.. include:: includes/column_selection.rst
 
 
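As a quick check of the keep, drop, and rename operations that this hunk moves into ``includes/column_selection.rst``, here is a minimal sketch. The frame below is an invented stand-in for ``tips``. Each call returns a new ``DataFrame`` and leaves the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset (values invented for illustration).
tips = pd.DataFrame(
    {
        "sex": ["Female", "Male"],
        "smoker": ["No", "Yes"],
        "total_bill": [16.99, 10.34],
        "tip": [1.01, 1.66],
    }
)

# keep: select a subset of columns with a list of names
kept = tips[["sex", "total_bill", "tip"]]

# drop: remove a column by name
dropped = tips.drop("sex", axis=1)

# rename: map old column names to new ones
renamed = tips.rename(columns={"total_bill": "total_bill_2"})
```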
 Sorting by values
@@ -308,8 +288,8 @@ Sorting in SAS is accomplished via ``PROC SORT``
 String processing
 -----------------
 
-Length
-~~~~~~
+Finding length of string
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS determines the length of a character string with the
 `LENGTHN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__
@@ -327,8 +307,8 @@ functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailin
 .. include:: includes/length.rst
 
 
-Find
-~~~~
+Finding position of substring
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS determines the position of a character in a string with the
 `FINDW <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function.
@@ -342,19 +322,11 @@ you supply as the second argument.
    put(FINDW(sex,'ale'));
    run;
 
-Python determines the position of a character in a string with the
-``find`` function. ``find`` searches for the first position of the
-substring. If the substring is found, the function returns its
-position. Keep in mind that Python indexes are zero-based and
-the function will return -1 if it fails to find the substring.
-
-.. ipython:: python
-
-   tips["sex"].str.find("ale").head()
+.. include:: includes/find_substring.rst
 
 
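The removed paragraph describes zero-based positions and a ``-1`` result when the substring is absent. A small sketch makes both cases concrete. The ``sex`` values here are illustrative, and ``"Other"`` is added only to show the not-found case.

```python
import pandas as pd

# Illustrative values; "Other" is included only to demonstrate the -1 result.
sex = pd.Series(["Female", "Male", "Other"])

# Series.str.find returns the zero-based index of the first match, or -1.
positions = sex.str.find("ale")
```

Here ``positions`` holds ``3``, ``1`` and ``-1`` for the three strings.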
-Substring
-~~~~~~~~~
+Extracting substring by position
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 SAS extracts a substring from a string based on its position with the
 `SUBSTR <https://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function.
@@ -366,17 +338,11 @@ SAS extracts a substring from a string based on its position with the
    put(substr(sex,1,1));
    run;
 
-With pandas you can use ``[]`` notation to extract a substring
-from a string by position locations. Keep in mind that Python
-indexes are zero-based.
-
-.. ipython:: python
-
-   tips["sex"].str[0:1].head()
+.. include:: includes/extract_substring.rst
 
 
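To make the off-by-one difference explicit: SAS ``substr(sex,1,1)`` starts counting at 1, while the pandas slice ``[0:1]`` starts at 0 and excludes the end index. A minimal sketch with illustrative values:

```python
import pandas as pd

# Illustrative values standing in for the "sex" column of tips.
sex = pd.Series(["Female", "Male"])

# Slice each string: start at index 0, stop before index 1 (first character only).
first_char = sex.str[0:1]
```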
-Scan
-~~~~
+Extracting nth word
+~~~~~~~~~~~~~~~~~~~
 
 The SAS `SCAN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__
 function returns the nth word from a string. The first argument is the string you want to parse and the
@@ -394,20 +360,11 @@ second argument specifies which word you want to extract.
    ;;;
    run;
 
-Python extracts a substring from a string based on its text
-by using regular expressions. There are much more powerful
-approaches, but this just shows a simple approach.
+.. include:: includes/nth_word.rst
 
-.. ipython:: python
 
-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
-   firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
-   firstlast
-
-
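One thing worth checking when this example moves to ``includes/nth_word.rst``: the removed ``Last_Name`` line takes column ``0`` of ``rsplit(..., expand=True)``, which yields the first word again rather than the last. The sketch below shows one way to get the SAS ``scan(string, -1)`` behaviour. It is not necessarily what the include file does.

```python
import pandas as pd

firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})

# Split on a space into a list of words per row, then index into that list.
words = firstlast["String"].str.split(" ")
firstlast["First_Name"] = words.str[0]   # like scan(string, 1)
firstlast["Last_Name"] = words.str[-1]   # like scan(string, -1)

# For comparison: column 0 of the expanded rsplit is still the first word.
rsplit_col0 = firstlast["String"].str.rsplit(" ", expand=True)[0]
```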
-Upcase, lowcase, and propcase
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Changing case
+~~~~~~~~~~~~~
 
 The SAS `UPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__
 `LOWCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__ and
@@ -427,27 +384,13 @@ functions change the case of the argument.
    ;;;
    run;
 
-The equivalent Python functions are ``upper``, ``lower``, and ``title``.
+.. include:: includes/case.rst
 
-.. ipython:: python
-
-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["string_up"] = firstlast["String"].str.upper()
-   firstlast["string_low"] = firstlast["String"].str.lower()
-   firstlast["string_prop"] = firstlast["String"].str.title()
-   firstlast
 
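For reference, the three case conversions from the removed block behave as follows on a mixed-case input. The input string is chosen here for illustration so that each method visibly changes something.

```python
import pandas as pd

# Illustrative input with irregular casing.
names = pd.Series(["john SMITH", "jane cook"])

upper = names.str.upper()   # counterpart of SAS UPCASE
lower = names.str.lower()   # counterpart of SAS LOWCASE
proper = names.str.title()  # counterpart of SAS PROPCASE
```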
 Merging
 -------
 
-The following tables will be used in the merge examples
-
-.. ipython:: python
-
-   df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
-   df1
-   df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
-   df2
+.. include:: includes/merge_setup.rst
 
 In SAS, data must be explicitly sorted before merging. Different
 types of joins are accomplished using the ``in=`` dummy
@@ -473,39 +416,15 @@ input frames.
        if a or b then output outer_join;
    run;
 
-pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides
-similar functionality. Note that the data does not have
-to be sorted ahead of time, and different join
-types are accomplished via the ``how`` keyword.
-
-.. ipython:: python
-
-   inner_join = df1.merge(df2, on=["key"], how="inner")
-   inner_join
-
-   left_join = df1.merge(df2, on=["key"], how="left")
-   left_join
-
-   right_join = df1.merge(df2, on=["key"], how="right")
-   right_join
-
-   outer_join = df1.merge(df2, on=["key"], how="outer")
-   outer_join
+.. include:: includes/merge.rst
 
 
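The four join types differ mainly in which keys survive. A sketch with fixed values in place of ``np.random.randn`` makes the row counts easy to verify by hand. The numbers are arbitrary, and the keys match the removed setup block.

```python
import pandas as pd

# Same keys as the removed setup block; values are fixed, arbitrary numbers.
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5.0, 6.0, 7.0, 8.0]})

# No sorting is needed beforehand; `how` selects the join type.
inner_join = df1.merge(df2, on="key", how="inner")  # keys in both frames
left_join = df1.merge(df2, on="key", how="left")    # every key from df1
right_join = df1.merge(df2, on="key", how="right")  # every key from df2
outer_join = df1.merge(df2, on="key", how="outer")  # keys from either frame
```

Because ``D`` appears twice in ``df2``, it contributes two rows to every join that includes it. The overlapping ``value`` columns come back as ``value_x`` and ``value_y``.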
 Missing data
 ------------
 
-Like SAS, pandas has a representation for missing data - which is the
-special float value ``NaN`` (not a number). Many of the semantics
-are the same, for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
+Both pandas and SAS have a representation for missing data.
 
-.. ipython:: python
-
-   outer_join
-   outer_join["value_x"] + outer_join["value_y"]
-   outer_join["value_x"].sum()
+.. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.
 For example, in SAS you could do this to filter missing values.
@@ -522,25 +441,7 @@ For example, in SAS you could do this to filter missing values.
        if value_x ^= .;
    run;
 
-Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
-should be used for comparisons.
-
-.. ipython:: python
-
-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
-
-pandas also provides a variety of methods to work with missing data - some of
-which would be challenging to express in SAS. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
-
-.. ipython:: python
-
-   outer_join.dropna()
-   outer_join.fillna(method="ffill")
-   outer_join["value_x"].fillna(outer_join["value_x"].mean())
+.. include:: includes/missing.rst
 
 
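The missing-data behaviour described in the removed text can be sketched on a hand-built frame. The ``outer_join`` below is a hypothetical stand-in with fixed values, not the output of the merge example. Note that recent pandas releases deprecate ``fillna(method="ffill")`` in favour of ``ffill()``, so the sketch uses the latter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the outer join result, with fixed values.
outer_join = pd.DataFrame(
    {
        "key": ["A", "B", "C", "D"],
        "value_x": [1.0, 2.0, np.nan, 4.0],
        "value_y": [np.nan, 5.0, 6.0, 7.0],
    }
)

# NaN propagates through arithmetic but is skipped by aggregations by default.
total = outer_join["value_x"] + outer_join["value_y"]
col_sum = outer_join["value_x"].sum()

# Comparing against NaN with == never matches, so use isna / notna instead.
eq_mask = outer_join["value_x"] == np.nan
missing_rows = outer_join[pd.isna(outer_join["value_x"])]
present_rows = outer_join[pd.notna(outer_join["value_x"])]

# Common clean-up strategies.
complete = outer_join.dropna()    # drop rows with any missing value
filled_forward = outer_join.ffill()  # carry the previous row's value forward
filled_mean = outer_join["value_x"].fillna(outer_join["value_x"].mean())
```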
 GroupBy
@@ -549,7 +450,7 @@ GroupBy
 Aggregation
 ~~~~~~~~~~~
 
-SAS's PROC SUMMARY can be used to group by one or
+SAS's ``PROC SUMMARY`` can be used to group by one or
 more key variables and compute aggregations on
 numeric columns.
 
@@ -561,14 +462,7 @@ numeric columns.
        output out=tips_summed sum=;
    run;
 
-pandas provides a flexible ``groupby`` mechanism that
-allows similar aggregations. See the :ref:`groupby documentation<groupby>`
-for more details and examples.
-
-.. ipython:: python
-
-   tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+.. include:: includes/groupby.rst
 
 
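A rough pandas counterpart of the ``PROC SUMMARY`` call, using an invented four-row stand-in for ``tips`` so the sums can be checked by eye:

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset (values invented for illustration).
tips = pd.DataFrame(
    {
        "sex": ["Female", "Male", "Male", "Female"],
        "smoker": ["No", "No", "Yes", "No"],
        "total_bill": [10.0, 20.0, 30.0, 5.0],
        "tip": [1.0, 2.0, 3.0, 0.5],
    }
)

# Group by the key columns (like CLASS), pick the numeric columns (like VAR),
# and sum within each group.
tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
```

The result is indexed by the ``(sex, smoker)`` pairs that actually occur in the data, so it has three rows here.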
 Transformation
@@ -597,16 +491,7 @@ example, to subtract the mean for each observation by smoker group.
        if a and b;
    run;
 
-
-pandas ``groupby`` provides a ``transform`` mechanism that allows
-these type of operations to be succinctly expressed in one
-operation.
-
-.. ipython:: python
-
-   gb = tips.groupby("smoker")["total_bill"]
-   tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-   tips.head()
+.. include:: includes/transform.rst
 
 
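The SAS version needs a summary step followed by a merge. ``transform`` folds both into one call because it returns a result aligned row by row with the original frame. A sketch on an invented stand-in for ``tips``:

```python
import pandas as pd

# Hypothetical stand-in for the tips dataset (values invented for illustration).
tips = pd.DataFrame(
    {
        "smoker": ["No", "No", "Yes", "Yes"],
        "total_bill": [10.0, 20.0, 30.0, 50.0],
    }
)

gb = tips.groupby("smoker")["total_bill"]

# transform("mean") returns one value per original row (the mean of its group),
# so it can be subtracted from the column directly.
tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
```

With group means of 15 for ``No`` and 40 for ``Yes``, the adjusted column comes out as ``-5, 5, -10, 10``.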
 By group processing