pysum takes a pandas dataframe (and a few other arguments to customize the output) and creates a markdown, html, or xlsx report with a summary of each of the variables in the dataframe.
The program iterates through each column in the dataframe and, based on its datatype, computes summary statistics and writes them to a table.
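The per-column loop described above can be sketched roughly as follows. This is a minimal illustration, not pysum's actual implementation: the helper name summarize_columns, the row layout, and the choice to show only the three most frequent values are all assumptions made for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_columns(df, round_digits=2):
    """Hypothetical helper: build one summary row per column, chosen by dtype."""
    rows = []
    for col in df.columns:
        s = df[col]
        if is_numeric_dtype(s):
            # Numeric columns get mean and standard deviation
            mean = round(float(s.mean()), round_digits)
            sd = round(float(s.std()), round_digits)
            stats = f"mean (sd): {mean} ({sd})"
        else:
            # Everything else gets the most frequent values (missing excluded)
            top = s.value_counts().head(3)
            stats = ", ".join(f"{value}: {int(n)}" for value, n in top.items())
        rows.append({
            "variable": col,
            "type": str(s.dtype),
            "stats": stats,
            "missing": round(float(s.isna().mean()), round_digits),
        })
    return pd.DataFrame(rows)


demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, None], "g": ["a", "b", "a", None]})
summary = summarize_columns(demo)
print(summary)
```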
The function takes the following arguments:
- dataframe: pandas dataframe. No default. The passed dataframe must also have an attribute called name that carries the name of the dataframe. See the examples for clarification.
- round_digits: Integer. Number of digits to which reported numbers are rounded. Default is 2.
- var_numbers: Boolean. Whether to add a column indicating the column number. Default is True.
- missing_col: Boolean. Adds a column that reports the proportion missing. Default is True.
- max_distinct_values: Numeric. The maximum number of values to display frequencies for. If a variable has more distinct values than this number, the remaining frequencies are reported as a whole, along with the number of additional distinct values. Default is 10.
- max_string_width: Integer. Limits the number of characters displayed in the frequency tables. Default is 25.
- output_type: String. The file format of the output file: xlsx, html, or markdown. Default is html.
- output_file: String. The path and filename to which the script should write the results. Default is summary.html in the local directory.
- append: Boolean. Whether to append the results to an existing file or overwrite it. Default is True. When append is True, the results are appended. When it is False, the file is overwritten.
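Because the dataframe must carry a name attribute, it can help to check for it before summarizing. The guard below is a small sketch; require_name is a hypothetical helper and not part of pysum.

```python
import pandas as pd


def require_name(df):
    """Hypothetical guard: return the frame's name or raise a clear error."""
    # If the frame has a column called "name", df.name refers to that column
    # instead of a plain attribute, so the isinstance check rejects it.
    name = getattr(df, "name", None)
    if not isinstance(name, str) or not name:
        raise ValueError("Set dataframe.name before summarizing, e.g. df.name = 'iris'")
    return name


df = pd.DataFrame({"a": [1, 2, 3]})
df.name = "demo"  # plain Python attribute set on the dataframe
print(require_name(df))
```

Plain attributes set this way are generally not carried over to derived frames such as copies or filtered subsets, so it is safest to set name on the exact object that is passed in.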
The html output also depends on custom.css in the local folder.
The output is an xlsx, html, or markdown file. For numeric columns, it reports by default the mean, standard deviation, minimum, maximum, median, IQR, number of distinct values, percentage valid, and percentage missing.
Definitions of Things in Output
- Valid = entries with non-missing values
- mean (sd) = mean (standard deviation).
- min = minimum
- med = median
- max = maximum
- IQR = Interquartile range
- CV = Coefficient of variation
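The definitions above map onto standard pandas calls. The sketch below assumes CV is computed as sd divided by mean and that percentages are taken over all rows; numeric_summary is a hypothetical helper, and pysum's own calculations may differ in detail.

```python
import pandas as pd


def numeric_summary(s, round_digits=2):
    """Hypothetical helper: summary statistics for one numeric column."""
    valid = s.dropna()  # Valid = entries with non-missing values
    mean, sd = valid.mean(), valid.std()
    q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
    out = {
        "mean": mean,
        "sd": sd,
        "min": valid.min(),
        "med": valid.median(),
        "max": valid.max(),
        "IQR": q3 - q1,          # interquartile range
        "CV": sd / mean,         # coefficient of variation (assumed sd / mean)
        "distinct": valid.nunique(),
        "valid_pct": 100 * len(valid) / len(s),
        "missing_pct": 100 * (len(s) - len(valid)) / len(s),
    }
    return {k: round(float(v), round_digits) for k, v in out.items()}


result = numeric_summary(pd.Series([1, 2, 3, 4, None]))
print(result)
```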
For character columns, it reports frequencies for up to max_distinct_values distinct values, followed by the number of remaining distinct values and their combined percentage. By default, it also reports the percentage of observations that are valid and the percentage that are missing.
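The frequency table for character columns can be sketched as follows. The parameter names mirror the arguments listed earlier, but frequency_table itself, the "[N others]" label, and the choice to compute percentages over valid entries are assumptions made for illustration.

```python
import pandas as pd


def frequency_table(s, max_distinct_values=10, max_string_width=25):
    """Hypothetical helper: top values plus one lumped row for the rest."""
    valid = s.dropna().astype(str)
    counts = valid.value_counts()  # sorted from most to least frequent
    rows = []
    for label, n in counts.head(max_distinct_values).items():
        # Truncate long labels to max_string_width characters
        pct = round(100 * int(n) / len(valid), 2)
        rows.append((label[:max_string_width], int(n), pct))
    rest = counts.iloc[max_distinct_values:]
    if len(rest) > 0:
        # Report the remaining distinct values as a single row
        rest_n = int(rest.sum())
        rows.append((f"[{len(rest)} others]", rest_n, round(100 * rest_n / len(valid), 2)))
    return rows


values = pd.Series(["a", "a", "a", "b", "b", "c", "d", None])
table = frequency_table(values, max_distinct_values=2)
print(table)
```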
Limitations: Dates by default are parsed as characters. Dates are best handled as numeric. But given the variety of formats in which dates appear, no standard support is offered for now.
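One possible workaround, when the date format is known, is to convert the column to a number before summarizing it. The sketch below assumes ISO-style dates (YYYY-MM-DD) and uses days since 1970-01-01 as the numeric scale; both choices are illustrative, and dates_to_days is a hypothetical helper.

```python
import pandas as pd


def dates_to_days(s, fmt="%Y-%m-%d"):
    """Hypothetical helper: parse date strings and return days since 1970-01-01."""
    # Unparseable entries become NaT and end up as missing values
    parsed = pd.to_datetime(s, format=fmt, errors="coerce")
    return (parsed - pd.Timestamp("1970-01-01")).dt.days


raw = pd.Series(["2020-01-01", "2020-01-03", "not a date"])
days = dates_to_days(raw)
print(days)
```

The resulting column is numeric, so it would be summarized with mean, min, max, and the other numeric statistics rather than as a frequency table.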
Install the requirements:

    pip install -r requirements.txt

You also need pandoc to be installed on your machine.

    import pandas
    import pysum

    # Load dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)

    # Pass name of the dataset; required
    dataset.name = 'iris'

    # Default output (html), then xlsx and markdown without appending
    pysum.summarizeDF(dataset)
    pysum.summarizeDF(dataset, output_type="xlsx", append=False)
    pysum.summarizeDF(dataset, output_type="markdown", append=False)

Markdown Output, HTML Output and XLSX Output
The package is based on https://github.com/dcomtois/summarytools