Pandas Correlation Between Categorical and Continuous Columns

Relationships between continuous and categorical variables

Data concepts

Categorical variable

A categorical variable can take on a finite set of values. The simplest form of categorical variable is an indicator variable that has only two values. The two values are typically 0 and 1, although other values are used at times. Other categorical variables take on multiple values. These values are often expressed using descriptive character strings. For example, a categorical variable for rank of a professor might use assistant professor, associate professor, and professor as its values. The values of a categorical variable are sometime referred to as levels.

Observations within a category may be more similar to other observations within the same category and have larger differences with observations in different categories. These relationships are sometime referred to as within group and between groups variation. For example, we would expect the salaries of the assistant professor group to be fairly similar, and to generally be different from the salaries in the professor group. These are the kind of relations that can be explored with graphs.

Exploring - Box plots

A box plot is a graph of the distribution of a continuous variable. The graph is based on the quartiles of the variables. The quartiles divide a set of ordered values into four groups with the same number of observations. The smallest values are in the first quartile and the largest values in the fourth quartiles.

The plot uses a box to show the values that are larger than the first quartile and smaller than the fourth quartile. These are the values that are closest to the center (median) of the values. The values within the first and fourth quartiles are shown as a line. These lines are referred to as whiskers. These are the values that are farthest from the center of the values.

One useful way to explore the relationship between a continuous and a categorical variable is with a set of side by side box plots, one for each of the categories. Similarities and differences between the category levels can be seen in the length and position of the boxes and whiskers.

Examples - R

These examples use the auto.csv data set.

We begin by using similar code as in the prior section to load the tidyverse and import the csv file.

                                      library(tidyverse)

A categorical variable is needed for these examples. The col_types parameter of read_csv() is used to create a factor variable, what R calls a categorical variable. Factor variables in R will be covered in a future chapter. For now you do not need to know any more than we now can use the origin variable as a categorical variable.

                  auto_path <-                                        file.path("..",                    "datasets",                    "auto.csv") auto <-                                        read_csv(auto_path,                    col_types =                    cols(origin =                    col_factor(NULL)))

                  Warning: Missing column names filled in: 'X1' [1]

                                      glimpse(auto)

                  Observations: 392 Variables: 10 $ X1           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15... $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 1... $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6... $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390,... $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190,... $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4... $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.... $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7... $ origin       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1... $ name         <chr> "chevrolet chevelle malibu", "buick skylark 320",...

Exploring - Box plots

This example uses origin as the horizontal variable for a boxplot. This results in the creation of a separate boxplot for each level of the origin variable. All the observation with a value of 1 are used in the leftmost boxplot. Similarly the observations for levels 2 and 3 of origin are used in separate boxplots.

                                          ggplot(data=auto,                      mapping =                      aes(x =                      origin,                      y =                      mpg))                      +                                                                                                              geom_boxplot()                      +                                                                  theme_bw()

The above box plot shows that the distribution of mpg values is different within the three levels of origin. The automobiles at level 1 have a lower median value than the other two levels. The lowest mpg for level 3 is about the median of level 1.

Examples - Python

These examples use the auto.csv data set.

We begin by using similar code as in the prior section to load the packages and import the csv file.

                                      from                    pathlib                    import                    Path                    import                    pandas                    as                    pd                    import                    plotnine                    as                    p9

A categorical variable is needed for these examples. The dtype parameter of read_csv() is used to create a category variable, what pandas calls a categorical variable. category variables will be covered in a future chapter. For now you do not need to know any more than we now can use the origin variable as a categorical variable.

                  auto_path                    =                    Path('..')                    /                    'datasets'                    /                    'Auto.csv'                    auto                    =                    pd.read_csv(auto_path, dtype={'origin':                    'category'})                    print(auto.dtypes)

                  Unnamed: 0         int64 mpg              float64 cylinders          int64 displacement     float64 horsepower         int64 weight             int64 acceleration     float64 year               int64 origin          category name              object dtype: object

Exploring - Box plots

This example uses origin as the horizontal variable for a boxplot. This results in the creation of a separate boxplot for each level of the origin variable. All the observation with a value of 1 are used in the left most boxplot. Similarly the observations for levels 2 and 3 of origin are used in separate boxplots.

                                          print(     p9.ggplot(auto, p9.aes(x=                      'origin', y=                      'mpg'))                      +                      p9.geom_boxplot()                      +                      p9.theme_bw())

                    <ggplot: (-9223371893262612655)>

Exercises

These exercises use the Mroz.csv data set that was imported in the prior section.

Create a boxplot for lwg for women who attended college and women who did not.
Create a boxplot for lwg for men who attended college and men who did not.

princecounwent.blogspot.com

Source: https://ssc.wisc.edu/sscc/pubs/DWE/book/3-3-sect-ggplot-categorical.html

Pandas Correlation Between Categorical and Continuous Columns

Relationships between continuous and categorical variables

Data concepts

Categorical variable

Exploring - Box plots

Examples - R

Exploring - Box plots

Examples - Python

Exploring - Box plots

Exercises

0 Response to "Pandas Correlation Between Categorical and Continuous Columns"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel