This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate functionality that they have seen in the R code in pandas. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas by using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/.
By the end of the article, data analysis users should have a good grasp of the data analysis capabilities of R as compared to pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this article include the following:
- R data types and their pandas equivalents
- Slicing and selection
- Arithmetic operations on datatype columns
- Aggregation and GroupBy
- Matching
- Split-apply-combine
- Melting and reshaping
- Factors and categorical data
R data types
R has five primitive or atomic types:
- Character
- Numeric
- Integer
- Complex
- Logical/Boolean
It also has the following, more complex, container types:
- Vector: This is similar to numpy.array. It can only contain objects of the same type.
- List: This is a heterogeneous container. Its pandas equivalent is the Series.
- DataFrame: This is a heterogeneous 2D container, equivalent to a pandas DataFrame.
- Matrix: This is a homogeneous 2D version of a vector. It is similar to a numpy.matrix.
For this article, we will focus on list and DataFrame, whose pandas equivalents are the Series and the DataFrame.
For more information on R data types, refer to http://www.statmethods.net/input/datatypes.html.
For Numpy data types, refer to http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html and http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html.
R lists
R lists can be created explicitly as a list declaration as shown here:
>h_lst<- list(23,'donkey',5.6,1+4i,TRUE)
>h_lst
[[1]]
[1] 23
[[2]]
[1] "donkey"
[[3]]
[1] 5.6
[[4]]
[1] 1+4i
[[5]]
[1] TRUE
>typeof(h_lst)
[1] "list"
Here is its pandas equivalent: we create a Python list and then construct a Series from it:
In [8]: h_list=[23, 'donkey', 5.6, 1+4j, True]
In [9]: import pandas as pd
h_ser=pd.Series(h_list)
In [10]: h_ser
Out[10]: 0 23
1 donkey
2 5.6
3 (1+4j)
4 True
dtype: object
Note that array indexing starts at 0 in pandas, as opposed to R, where it starts at 1. We can confirm the type of the resulting object as follows:
In [11]: type(h_ser)
Out[11]: pandas.core.series.Series
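To make the zero-based offset concrete, here is a minimal sketch; the .iloc accessor makes positional access explicit:

```python
import pandas as pd

h_list = [23, 'donkey', 5.6, 1 + 4j, True]
h_ser = pd.Series(h_list)

# pandas positions start at 0, so the first element lives at position 0.
first = h_ser.iloc[0]   # R: h_lst[[1]]
third = h_ser.iloc[2]   # R: h_lst[[3]]
```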
R DataFrames
We can construct an R DataFrame by calling the data.frame() constructor and then display it:
>stocks_table<- data.frame(Symbol=c('GOOG','AMZN','FB','AAPL',
'TWTR','NFLX','LINKD'),
Price=c(518.7,307.82,74.9,109.7,37.1,
334.48,219.9),
MarketCap=c(352.8,142.29,216.98,643.55,23.54,20.15,27.31))
>stocks_table
Symbol Price MarketCap
1 GOOG 518.70 352.80
2 AMZN 307.82 142.29
3 FB 74.90 216.98
4 AAPL 109.70 643.55
5 TWTR 37.10 23.54
6 NFLX 334.48 20.15
7 LINKD 219.90 27.31
Here, we construct a pandas DataFrame and display it:
In [29]: stocks_df=pd.DataFrame({'Symbol':['GOOG','AMZN','FB','AAPL',
'TWTR','NFLX','LNKD'],
'Price':[518.7,307.82,74.9,109.7,37.1,
334.48,219.9],
'MarketCap($B)' : [352.8,142.29,216.98,643.55,
23.54,20.15,27.31]
})
stocks_df=stocks_df.reindex(sorted(stocks_df.columns,reverse=True),axis=1)
stocks_df
Out[29]:
Symbol Price MarketCap($B)
0 GOOG 518.70 352.80
1 AMZN 307.82 142.29
2 FB 74.90 216.98
3 AAPL 109.70 643.55
4 TWTR 37.10 23.54
5 NFLX 334.48 20.15
6 LNKD 219.90 27.31
Slicing and selection
In R, we slice objects in the following three ways:
- [: This always returns an object of the same type as the original and can be used to select more than one element.
- [[: This is used to extract a single element of a list or DataFrame; the type of the returned element will not necessarily be a list or DataFrame.
- $: This is used to extract elements of a list or DataFrame by name and is similar to [[.
Here are some slicing examples in R and their equivalents in pandas:
R-matrix and Numpy array compared
Let's see matrix creation and selection in R:
>r_mat<- matrix(2:13,4,3)
>r_mat
[,1] [,2] [,3]
[1,] 2 6 10
[2,] 3 7 11
[3,] 4 8 12
[4,] 5 9 13
To select the first row, we write:
>r_mat[1,]
[1] 2 6 10
To select the second column, we use the following command:
>r_mat[,2]
[1] 6 7 8 9
Let's now see Numpy array creation and selection:
In [60]: a=np.array(range(2,6))
b=np.array(range(6,10))
c=np.array(range(10,14))
In [66]: np_ar=np.column_stack([a,b,c])
np_ar
Out[66]: array([[ 2, 6, 10],
[ 3, 7, 11],
[ 4, 8, 12],
[ 5, 9, 13]])
To select the first row, write the following command:
In [79]: np_ar[0,]
Out[79]: array([ 2, 6, 10])
Indexing is different in R and pandas/Numpy.
In R, indexing starts at 1, while in pandas/Numpy, it starts at 0. Hence, we have to subtract 1 from all indexes when making the translation from R to pandas/Numpy.
To select the second column, write the following command:
In [81]: np_ar[:,1]
Out[81]: array([6, 7, 8, 9])
Another option is to transpose the array first and then select the column, as follows:
In [80]: np_ar.T[1,]
Out[80]: array([6, 7, 8, 9])
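As a sketch of the translation rule described above, R's r_mat[2,3] (row 2, column 3) becomes np_ar[1,2] after subtracting 1 from each index:

```python
import numpy as np

# Same 4x3 array as in the text, built column by column.
np_ar = np.column_stack([np.arange(2, 6), np.arange(6, 10), np.arange(10, 14)])

# R's r_mat[2,3] selects row 2, column 3; in NumPy we subtract 1
# from each index, giving row 1, column 2.
element = np_ar[1, 2]
```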
R lists and pandas series compared
Here is an example of list creation and selection in R:
>cal_lst<- list(weekdays=1:8, mth='jan')
>cal_lst
$weekdays
[1] 1 2 3 4 5 6 7 8
$mth
[1] "jan"
>cal_lst[1]
$weekdays
[1] 1 2 3 4 5 6 7 8
>cal_lst[[1]]
[1] 1 2 3 4 5 6 7 8
>cal_lst[2]
$mth
[1] "jan"
Series creation and selection in pandas is done as follows:
In [92]: cal_df= pd.Series({'weekdays':range(1,9), 'mth':'jan'})
In [93]: cal_df
Out[93]: mth jan
weekdays [1, 2, 3, 4, 5, 6, 7, 8]
dtype: object
In [97]: cal_df[0]
Out[97]: 'jan'
In [95]: cal_df[1]
Out[95]: [1, 2, 3, 4, 5, 6, 7, 8]
In [96]: cal_df[[1]]
Out[96]: weekdays [1, 2, 3, 4, 5, 6, 7, 8]
dtype: object
Here, we see a difference between an R-list and a pandas series from the perspective of the [] and [[]] operators. We can see the difference by considering the second item, which is a character string.
In the case of R, the [] operator produces a container type, that is, a list containing the string, while [[]] produces an atomic type, in this case a character:
>typeof(cal_lst[2])
[1] "list"
>typeof(cal_lst[[2]])
[1] "character"
In the case of pandas, the opposite is true: [] produces the atomic type, while [[]] results in a complex type, that is, a series as follows:
In [99]: type(cal_df[0])
Out[99]: str
In [101]: type(cal_df[[0]])
Out[101]: pandas.core.series.Series
In both R and pandas, the column name can be specified in order to obtain an element.
Specifying column name in R
In R, this can be done with the column name preceded by the $ operator as follows:
>cal_lst$mth
[1] "jan"
> cal_lst$'mth'
[1] "jan"
Specifying column name in pandas
In pandas, we subset elements in the usual way with the column name in square brackets:
In [111]: cal_df['mth']
Out[111]: 'jan'
One area where R and pandas differ is in the subsetting of nested elements. For example, to obtain day 4 from weekdays, we have to use the [[]] operator in R:
>cal_lst[[1]][[4]]
[1] 4
>cal_lst[[c(1,4)]]
[1] 4
However, in the case of pandas, we can just use a double []:
In [132]: cal_df[1][3]
Out[132]: 4
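Chained positional indexing works, but label-based access makes the intent clearer and does not depend on how the entries are ordered; a minimal self-contained sketch (cal_ser stands in for the cal_df series above):

```python
import pandas as pd

cal_ser = pd.Series({'weekdays': list(range(1, 9)), 'mth': 'jan'})

# Select the 'weekdays' entry by label, then index into the nested list.
# Equivalent to R's cal_lst[[1]][[4]].
day4 = cal_ser['weekdays'][3]
```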
R DataFrames versus pandas DataFrames
Selecting data in R DataFrames and pandas DataFrames follows a similar pattern. The following section explains how we perform multi-column selection in both.
Multi-column selection in R
In R, we specify the multiple columns to select by stating them in a vector within square brackets:
>stocks_table[c('Symbol','Price')]
Symbol Price
1 GOOG 518.70
2 AMZN 307.82
3 FB 74.90
4 AAPL 109.70
5 TWTR 37.10
6 NFLX 334.48
7 LINKD 219.90
>stocks_table[,c('Symbol','Price')]
Symbol Price
1 GOOG 518.70
2 AMZN 307.82
3 FB 74.90
4 AAPL 109.70
5 TWTR 37.10
6 NFLX 334.48
7 LINKD 219.90
Multi-column selection in pandas
In pandas, we subset elements in the usual way with the column names in square brackets:
In [140]: stocks_df[['Symbol','Price']]
Out[140]: Symbol Price
0 GOOG 518.70
1 AMZN 307.82
2 FB 74.90
3 AAPL 109.70
4 TWTR 37.10
5 NFLX 334.48
6 LNKD 219.90
In [145]: stocks_df.loc[:,['Symbol','Price']]
Out[145]: Symbol Price
0 GOOG 518.70
1 AMZN 307.82
2 FB 74.90
3 AAPL 109.70
4 TWTR 37.10
5 NFLX 334.48
6 LNKD 219.90
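As a further option (not used in the text above, but part of the standard pandas API), filter(items=...) performs the same column subset:

```python
import pandas as pd

# A small stand-in for the stocks DataFrame used in the text.
stocks_df = pd.DataFrame({'Symbol': ['GOOG', 'AMZN', 'FB'],
                          'Price':  [518.7, 307.82, 74.9],
                          'MarketCap($B)': [352.8, 142.29, 216.98]})

# filter(items=...) keeps only the named columns, like df[['Symbol','Price']].
subset = stocks_df.filter(items=['Symbol', 'Price'])
```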
Arithmetic operations on columns
In R and pandas, we can apply arithmetic operations to data columns in a similar manner. Hence, we can perform arithmetic operations such as addition or subtraction on elements in corresponding positions in two or more DataFrames.
Here, we construct a DataFrame in R with columns labeled x and y, and subtract column y from column x:
>norm_df<- data.frame(x=rnorm(7,0,1), y=rnorm(7,0,1))
>norm_df$x - norm_df$y
[1] -1.3870730 2.4681458 -4.6991395 0.2978311 -0.8492245 1.5851009 -1.4620324
The with operator in R provides another way to express the same operation:
>with(norm_df,x-y)
[1] -1.3870730 2.4681458 -4.6991395 0.2978311 -0.8492245 1.5851009 -1.4620324
In pandas, the same arithmetic operations can be performed on columns; the equivalent of with is the eval method:
In [10]: import pandas as pd
import numpy as np
df = pd.DataFrame({'x': np.random.normal(0,1,size=7), 'y': np.random.normal(0,1,size=7)})
In [11]: df.x-df.y
Out[11]: 0 -0.107313
1 0.617513
2 -1.517827
3 0.565804
4 -1.630534
5 0.101900
6 0.775186
dtype: float64
In [12]: df.eval('x-y')
Out[12]: 0 -0.107313
1 0.617513
2 -1.517827
3 0.565804
4 -1.630534
5 0.101900
6 0.775186
dtype: float64
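The eval method can also assign its result to a new column in one step, similar in spirit to R's transform(norm_df, z = x - y); a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.normal(0, 1, size=7),
                   'y': np.random.normal(0, 1, size=7)})

# eval with an assignment expression returns a new frame with column z added.
df = df.eval('z = x - y')
```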
Aggregation and GroupBy
Sometimes, we may wish to split data into subsets and apply a function such as the mean, max, or min to each subset. In R, we can do this via the aggregate or tapply functions.
Here, we will use the example of a dataset of statistics on the top five strikers of the four clubs that made it to the semi-final of the European Champions League Football tournament in 2014. We will use it to illustrate aggregation in R and its equivalent GroupBy functionality in pandas.
Aggregation in R
In R, aggregation is done using the following command:
> goal_stats=read.csv('champ_league_stats_semifinalists.csv')
>goal_stats
Club Player Goals GamesPlayed
1 Atletico Madrid Diego Costa 8 9
2 Atletico Madrid Arda Turan 4 9
3 Atletico Madrid Raúl García 4 12
4 Atletico Madrid Adrián López 2 9
5 Atletico Madrid Diego Godín 2 10
6 Real Madrid Cristiano Ronaldo 17 11
7 Real Madrid Gareth Bale 6 12
8 Real Madrid Karim Benzema 5 11
9 Real Madrid Isco 3 12
10 Real Madrid Ángel Di María 3 11
11 Bayern Munich Thomas Müller 5 12
12 Bayern Munich Arjen Robben 4 10
13 Bayern Munich Mario Götze 3 11
14 Bayern Munich Bastian Schweinsteiger 3 8
15 Bayern Munich Mario Mandžukić 3 10
16 Chelsea Fernando Torres 4 9
17 Chelsea Demba Ba 3 6
18 Chelsea Samuel Eto'o 3 9
19 Chelsea Eden Hazard 2 9
20 Chelsea Ramires 2 10
We can now compute the goals per game ratio for each striker, to measure their deadliness in front of a goal:
>goal_stats$GoalsPerGame<- goal_stats$Goals/goal_stats$GamesPlayed
>goal_stats
Club Player Goals GamesPlayed GoalsPerGame
1 Atletico Madrid Diego Costa 8 9 0.8888889
2 Atletico Madrid Arda Turan 4 9 0.4444444
3 Atletico Madrid Raúl García 4 12 0.3333333
4 Atletico Madrid Adrián López 2 9 0.2222222
5 Atletico Madrid Diego Godín 2 10 0.2000000
6 Real Madrid Cristiano Ronaldo 17 11 1.5454545
7 Real Madrid Gareth Bale 6 12 0.5000000
8 Real Madrid Karim Benzema 5 11 0.4545455
9 Real Madrid Isco 3 12 0.2500000
10 Real Madrid Ángel Di María 3 11 0.2727273
11 Bayern Munich Thomas Müller 5 12 0.4166667
12 Bayern Munich Arjen Robben 4 10 0.4000000
13 Bayern Munich Mario Götze 3 11 0.2727273
14 Bayern Munich Bastian Schweinsteiger 3 8 0.3750000
15 Bayern Munich Mario Mandžukić 3 10 0.3000000
16 Chelsea Fernando Torres 4 9 0.4444444
17 Chelsea Demba Ba 3 6 0.5000000
18 Chelsea Samuel Eto'o 3 9 0.3333333
19 Chelsea Eden Hazard 2 9 0.2222222
20 Chelsea Ramires 2 10 0.2000000
Let's suppose that we wanted to know the highest goals per game ratio for each team. We would calculate this as follows:
>aggregate(x=goal_stats[,c('GoalsPerGame')], by=list(goal_stats$Club),FUN=max)
Group.1 x
1 Atletico Madrid 0.8888889
2 Bayern Munich 0.4166667
3 Chelsea 0.5000000
4 Real Madrid 1.5454545
The tapply function applies a function to subsets of an array or vector, where the subsets are defined by one or more grouping factors. It can be used as follows:
>tapply(goal_stats$GoalsPerGame,goal_stats$Club,max)
Atletico Madrid Bayern Munich Chelsea Real Madrid
0.8888889 0.4166667 0.5000000 1.5454545
The pandas GroupBy operator
In pandas, we can achieve the same result by using the GroupBy function:
In [6]: import pandas as pd
import numpy as np
In [7]: goal_stats_df=pd.read_csv('champ_league_stats_semifinalists.csv')
In [27]: goal_stats_df['GoalsPerGame']= goal_stats_df['Goals']/goal_stats_df['GamesPlayed']
In [28]: goal_stats_df
Out[28]: Club Player Goals GamesPlayed GoalsPerGame
0 Atletico Madrid Diego Costa 8 9 0.888889
1 Atletico Madrid Arda Turan 4 9 0.444444
2 Atletico Madrid Raúl García 4 12 0.333333
3 Atletico Madrid Adrián López 2 9 0.222222
4 Atletico Madrid Diego Godín 2 10 0.200000
5 Real Madrid Cristiano Ronaldo 17 11 1.545455
6 Real Madrid Gareth Bale 6 12 0.500000
7 Real Madrid Karim Benzema 5 11 0.454545
8 Real Madrid Isco 3 12 0.250000
9 Real Madrid Ángel Di María 3 11 0.272727
10 Bayern Munich Thomas Müller 5 12 0.416667
11 Bayern Munich Arjen Robben 4 10 0.400000
12 Bayern Munich Mario Götze 3 11 0.272727
13 Bayern Munich Bastian Schweinsteiger 3 8 0.375000
14 Bayern Munich Mario Mandžukić 3 10 0.300000
15 Chelsea Fernando Torres 4 9 0.444444
16 Chelsea Demba Ba 3 6 0.500000
17 Chelsea Samuel Eto'o 3 9 0.333333
18 Chelsea Eden Hazard 2 9 0.222222
19 Chelsea Ramires 2 10 0.200000
In [30]: grouped = goal_stats_df.groupby('Club')
In [17]: grouped['GoalsPerGame'].aggregate(np.max)
Out[17]: Club
Atletico Madrid 0.888889
Bayern Munich 0.416667
Chelsea 0.500000
Real Madrid 1.545455
Name: GoalsPerGame, dtype: float64
In [22]: grouped['GoalsPerGame'].apply(np.max)
Out[22]: Club
Atletico Madrid 0.888889
Bayern Munich 0.416667
Chelsea 0.500000
Real Madrid 1.545455
Name: GoalsPerGame, dtype: float64
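Beyond the maximum ratio itself, groupby can also tell us which player achieved it. A minimal sketch on a small subset of the data (so the full CSV is not needed here):

```python
import pandas as pd

# A subset of the striker statistics used in the text.
goal_stats_df = pd.DataFrame({
    'Club':   ['Atletico Madrid', 'Atletico Madrid',
               'Real Madrid', 'Real Madrid'],
    'Player': ['Diego Costa', 'Arda Turan',
               'Cristiano Ronaldo', 'Gareth Bale'],
    'Goals':  [8, 4, 17, 6],
    'GamesPlayed': [9, 9, 11, 12]})
goal_stats_df['GoalsPerGame'] = (goal_stats_df['Goals'] /
                                 goal_stats_df['GamesPlayed'])

# idxmax returns the row label of each group's maximum, so .loc recovers
# the full row, including the player's name.
best_rows = goal_stats_df.loc[
    goal_stats_df.groupby('Club')['GoalsPerGame'].idxmax()]
```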
Comparing matching operators in R and pandas
Here, we will demonstrate the equivalence of the matching operators in R (%in%) and pandas (isin()). In both cases, a logical vector (R) or series (pandas) is produced, indicating the positions at which matches were found.
R %in% operator
Here, we will demonstrate the use of the %in% operator in R:
>stock_symbols=stocks_table$Symbol
>stock_symbols
[1] GOOG AMZN FB AAPL TWTR NFLX LINKD
Levels: AAPL AMZN FB GOOG LINKD NFLX TWTR
>stock_symbols %in% c('GOOG','NFLX')
[1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE
The pandas isin() function
Here is an example of using the pandas isin() function:
In [11]: stock_symbols=stocks_df.Symbol
stock_symbols
Out[11]: 0 GOOG
1 AMZN
2 FB
3 AAPL
4 TWTR
5 NFLX
6 LNKD
Name: Symbol, dtype: object
In [10]: stock_symbols.isin(['GOOG','NFLX'])
Out[10]: 0 True
1 False
2 False
3 False
4 False
5 True
6 False
Name: Symbol, dtype: bool
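The boolean series returned by isin() can also be used directly to filter rows, just as the %in% result can subset an R data frame; a minimal sketch:

```python
import pandas as pd

# A small stand-in for the stocks DataFrame used in the text.
stocks_df = pd.DataFrame({
    'Symbol': ['GOOG', 'AMZN', 'FB', 'AAPL', 'TWTR', 'NFLX', 'LNKD'],
    'Price':  [518.7, 307.82, 74.9, 109.7, 37.1, 334.48, 219.9]})

# Use the boolean mask from isin() to keep only matching rows.
matches = stocks_df[stocks_df['Symbol'].isin(['GOOG', 'NFLX'])]
```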
Logical subsetting
In R as well as in pandas, there is more than one way to perform logical subsetting. Suppose that we wished to display all players with a goals-per-game ratio of at least 0.5; that is, players who average at least one goal every two games.
Logical subsetting in R
Here's how we can do this in R:
- Via a logical slice:
>goal_stats[goal_stats$GoalsPerGame>=0.5,]
Club Player Goals GamesPlayed GoalsPerGame
1 Atletico Madrid Diego Costa 8 9 0.8888889
6 Real Madrid Cristiano Ronaldo 17 11 1.5454545
7 Real Madrid Gareth Bale 6 12 0.5000000
17 Chelsea Demba Ba 3 6 0.5000000
- Via the subset() function:
>subset(goal_stats,GoalsPerGame>=0.5)
Club Player Goals GamesPlayed GoalsPerGame
1 Atletico Madrid Diego Costa 8 9 0.8888889
6 Real Madrid Cristiano Ronaldo 17 11 1.5454545
7 Real Madrid Gareth Bale 6 12 0.5000000
17 Chelsea Demba Ba 3 6 0.5000000
Logical subsetting in pandas
In pandas, we also do something similar:
- Logical slicing:
In [33]: goal_stats_df[goal_stats_df['GoalsPerGame']>=0.5]
Out[33]: Club Player Goals GamesPlayed GoalsPerGame
0 Atletico Madrid Diego Costa 8 9 0.888889
5 Real Madrid Cristiano Ronaldo 17 11 1.545455
6 Real Madrid Gareth Bale 6 12 0.500000
16 Chelsea Demba Ba 3 6 0.500000
- Via the DataFrame.query() operator:
In [36]: goal_stats_df.query('GoalsPerGame>= 0.5')
Out[36]: Club Player Goals GamesPlayed GoalsPerGame
0 Atletico Madrid Diego Costa 8 9 0.888889
5 Real Madrid Cristiano Ronaldo 17 11 1.545455
6 Real Madrid Gareth Bale 6 12 0.500000
16 Chelsea Demba Ba 3 6 0.500000
Split-apply-combine
R has a library called plyr for split-apply-combine data analysis. The plyr library has a function called ddply, which can be used to apply a function to a subset of a DataFrame and then combine the results into another DataFrame.
For more information on ddply, you can refer to the following: http://www.inside-r.org/packages/cran/plyr/docs/ddply
To illustrate, let us consider a subset of a recently created dataset in R, which contains data on flights departing NYC in 2013: http://cran.r-project.org/web/packages/nycflights13/index.html.
Implementation in R
Here, we will install the package in R and load the library:
>install.packages('nycflights13')
...
>library('nycflights13')
>dim(flights)
[1] 336776 16
>head(flights,3)
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
1 2013 1 1 517 2 830 11 UA N14228 1545
2 2013 1 1 533 4 850 20 UA N24211 1714
3 2013 1 1 542 2 923 33 AA N619AA 1141
origin dest air_time distance hour minute
1 EWR IAH 227 1400 5 17
2 LGA IAH 227 1416 5 33
3 JFK MIA 160 1089 5 42
> flights.data=na.omit(flights[,c('year','month','dep_delay','arr_delay','distance')])
>flights.sample<- flights.data[sample(1:nrow(flights.data),100,replace=FALSE),]
>head(flights.sample,5)
year month dep_delay arr_delay distance
155501 2013 3 2 5 184
2410 2013 1 0 4 762
64158 2013 11 -7 -27 509
221447 2013 5 -5 -12 184
281887 2013 8 -1 -10 937
The ddply function enables us to summarize the departure delays (mean, standard deviation) by year and month:
>ddply(flights.sample,.(year,month),summarize, mean_dep_delay=round(mean(dep_delay),2), sd_dep_delay=round(sd(dep_delay),2))
year month mean_dep_delay sd_dep_delay
1 2013 1 -0.20 2.28
2 2013 2 23.85 61.63
3 2013 3 10.00 34.72
4 2013 4 0.88 12.56
5 2013 5 8.56 32.42
6 2013 6 58.14 145.78
7 2013 7 25.29 58.88
8 2013 8 25.86 59.38
9 2013 9 -0.38 10.25
10 2013 10 9.31 15.27
11 2013 11 -1.09 7.73
12 2013 12 0.00 8.58
Let us save the flights.sample dataset to a CSV file so that we can use the data to show us how to do the same thing in pandas:
>write.csv(flights.sample,file='nycflights13_sample.csv', quote=FALSE,row.names=FALSE)
Implementation in pandas
In order to do the same thing in pandas, we read the CSV file saved in the preceding section:
In [40]: flights_sample=pd.read_csv('nycflights13_sample.csv')
In [41]: flights_sample.head()
Out[41]: year month dep_delay arr_delay distance
0 2013 3 2 5 184
1 2013 1 0 4 762
2 2013 11 -7 -27 509
3 2013 5 -5 -12 184
4 2013 8 -1 -10 937
We achieve the same effect as ddply by making use of the groupby() operator:
In [44]: pd.set_option('precision',3)
In [45]: grouped = flights_sample.groupby(['year','month'])
In [48]: grouped['dep_delay'].agg([np.mean, np.std])
Out[48]: mean std
year month
2013 1 -0.20 2.28
2 23.85 61.63
3 10.00 34.72
4 0.88 12.56
5 8.56 32.42
6 58.14 145.78
7 25.29 58.88
8 25.86 59.38
9 -0.38 10.25
10 9.31 15.27
11 -1.09 7.73
12 0.00 8.58
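The grouped result above carries a (year, month) MultiIndex; calling reset_index flattens it back into ordinary columns, which yields a frame shaped like ddply's output. A small sketch with toy delay values (illustrative, not taken from the flights dataset):

```python
import pandas as pd

# Toy departure-delay data: two months of 2013.
flights = pd.DataFrame({'year': [2013] * 4,
                        'month': [1, 1, 2, 2],
                        'dep_delay': [-1, 3, 20, 28]})

# Aggregate per (year, month), then flatten the MultiIndex into columns.
summary = (flights.groupby(['year', 'month'])['dep_delay']
                  .agg(['mean', 'std'])
                  .reset_index())
```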
Reshaping using Melt
The melt function converts data from wide format to long format: the result is a single value column paired with columns identifying each unique ID-variable combination.
The R melt() function
Here, we demonstrate the use of the melt() function in R. It produces long-format data in which the rows are unique variable-value combinations:
>sample4=head(flights.sample,4)[c('year','month','dep_delay','arr_delay')]
> sample4
year month dep_delay arr_delay
155501 2013 3 2 5
2410 2013 1 0 4
64158 2013 11 -7 -27
221447 2013 5 -5 -12
>melt(sample4,id=c('year','month'))
year month variable value
1 2013 3 dep_delay 2
2 2013 1 dep_delay 0
3 2013 11 dep_delay -7
4 2013 5 dep_delay -5
5 2013 3 arr_delay 5
6 2013 1 arr_delay 4
7 2013 11 arr_delay -27
8 2013 5 arr_delay -12
For more information, you can refer to the following: http://www.statmethods.net/management/reshape.html.
The pandas melt() function
In pandas, the melt function is similar:
In [55]: sample_4_df=flights_sample[['year','month','dep_delay', \
'arr_delay']].head(4)
In [56]: sample_4_df
Out[56]: year month dep_delay arr_delay
0 2013 3 2 5
1 2013 1 0 4
2 2013 11 -7 -27
3 2013 5 -5 -12
In [59]: pd.melt(sample_4_df,id_vars=['year','month'])
Out[59]: year month variable value
0 2013 3 dep_delay 2
1 2013 1 dep_delay 0
2 2013 11 dep_delay -7
3 2013 5 dep_delay -5
4 2013 3 arr_delay 5
5 2013 1 arr_delay 4
6 2013 11 arr_delay -27
7 2013 5 arr_delay -12
For more information, refer to: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt.
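Going back from long to wide format, the inverse of melt, can be sketched with pivot_table; the small frame below mirrors the melted output above:

```python
import pandas as pd

# Long-format data, as produced by melt.
long_df = pd.DataFrame({
    'year':     [2013] * 4,
    'month':    [3, 1, 3, 1],
    'variable': ['dep_delay', 'dep_delay', 'arr_delay', 'arr_delay'],
    'value':    [2, 0, 5, 4]})

# The id variables become the index and each level of 'variable'
# becomes a column again; reset_index restores plain columns.
wide_df = long_df.pivot_table(index=['year', 'month'],
                              columns='variable',
                              values='value').reset_index()
```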
Factors/categorical data
R refers to categorical variables as factors. The cut() function enables us to break a continuous numerical variable into ranges and treat the ranges as factors or categorical variables, or to group the levels of a categorical variable into larger bins.
An R example using cut()
Here is an example in R:
>clinical.trial<- data.frame(patient = 1:1000,
+ age = rnorm(1000, mean = 50, sd = 5),
+ year.enroll = sample(paste("19", 80:99, sep = ""),
+ 1000, replace = TRUE))
>summary(clinical.trial)
patient age year.enroll
Min. : 1.0 Min. :31.14 1995 : 61
1st Qu.: 250.8 1st Qu.:46.77 1989 : 60
Median : 500.5 Median :50.14 1985 : 57
Mean : 500.5 Mean :50.14 1988 : 57
3rd Qu.: 750.2 3rd Qu.:53.50 1990 : 56
Max. :1000.0 Max. :70.15 1991 : 55
(Other):654
>ctcut<- cut(clinical.trial$age, breaks = 5)
>table(ctcut)
ctcut
(31.1,38.9] (38.9,46.7] (46.7,54.6] (54.6,62.4] (62.4,70.2]
15 232 558 186 9
The reference for the preceding data can be found at: http://www.r-bloggers.com/r-function-of-the-day-cut/.
The pandas solution
Here is the pandas equivalent of the cut() function explained earlier (this applies to pandas version 0.15 and later):
In [79]: pd.set_option('precision',4)
clinical_trial=pd.DataFrame({'patient':range(1,1001),
'age' : np.random.normal(50,5,size=1000),
'year_enroll': [str(x) for x in np.random.choice(range(1980,2000),size=1000,replace=True)]})
In [80]: clinical_trial.describe()
Out[80]: age patient
count 1000.000 1000.000
mean 50.089 500.500
std 4.909 288.819
min 29.944 1.000
25% 46.572 250.750
50% 50.314 500.500
75% 53.320 750.250
max 63.458 1000.000
In [81]: clinical_trial.describe(include=['O'])
Out[81]: year_enroll
count 1000
unique 20
top 1992
freq 62
In [82]: clinical_trial.year_enroll.value_counts()[:6]
Out[82]: 1992 62
1985 61
1986 59
1994 59
1983 58
1991 58
dtype: int64
In [83]: ctcut=pd.cut(clinical_trial['age'], 5)
In [84]: ctcut.head()
Out[84]: 0 (43.349, 50.052]
1 (50.052, 56.755]
2 (50.052, 56.755]
3 (43.349, 50.052]
4 (50.052, 56.755]
Name: age, dtype: category
Categories (5, object): [(29.91, 36.646] < (36.646, 43.349] < (43.349, 50.052] < (50.052, 56.755] < (56.755, 63.458]]
In [85]: ctcut.value_counts().sort_index()
Out[85]: (29.91, 36.646] 3
(36.646, 43.349] 82
(43.349, 50.052] 396
(50.052, 56.755] 434
(56.755, 63.458] 85
dtype: int64
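pd.cut also accepts explicit breakpoints and labels, mirroring R's cut(x, breaks = c(...), labels = c(...)); a small sketch with hypothetical age values:

```python
import pandas as pd

ages = pd.Series([33, 45, 51, 58, 62])

# Explicit bin edges and labels; intervals are right-closed by default,
# as in R's cut().
age_bands = pd.cut(ages, bins=[30, 40, 50, 60, 70],
                   labels=['30s', '40s', '50s', '60s'])
```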
Summary
In this article, we have attempted to compare key features in R with their pandas equivalents in order to achieve the following objectives:
- To assist R users who may wish to replicate the same functionality in pandas
- To assist any users who, upon reading some R code, may wish to rewrite it in pandas
The reference documentation for this chapter can be found at http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html.
For more information on Pandas you can also refer to https://www.packtpub.com/big-data-and-business-intelligence/mastering-pandas-finance.