This time we've sorted out a large collection of pandas operations for everyone — 20 functions in total, short and concise, sure to make you love pandas even more.

1. ExcelWriter

Data often contains Chinese characters, and if you write it straight to csv the Chinese may come out garbled. Excel is different: `ExcelWriter` is a pandas class that writes a dataframe directly to an Excel file and lets you name the individual sheets.

import pandas as pd
from pandas import ExcelWriter

df1 = pd.DataFrame([["AAA", "BBB"]], columns=["Spam", "Egg"])
df2 = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
with ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

If you have a time variable, you can also specify its output format with the `date_format` parameter. In addition, by setting `mode="a"` you can append to an existing Excel file, which is very flexible.

with ExcelWriter("path_to_file.xlsx", mode="a", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sheet3")

2. pipe

`pipe` can chain multiple custom functions into a single pipeline-style operation, making the whole code more concise and compact.

For example, data-cleaning code is often very messy: deduplication, outlier removal, categorical encoding, and so on. With `pipe`, it looks like this:

import seaborn as sns

diamonds = sns.load_dataset("diamonds")

df_preped = (diamonds.pipe(drop_duplicates)
                     .pipe(remove_outliers, ['price', 'carat', 'depth'])
                     .pipe(encode_categoricals, ['cut', 'color', 'clarity']))

In a word: clean!
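The three helpers chained above are user-defined; the article doesn't show their bodies, so here is a minimal sketch of what they might look like (the implementations below are assumptions for illustration, not the originals):

```python
import pandas as pd

# Hypothetical implementations of the chained helper functions
def drop_duplicates(df):
    return df.drop_duplicates()

def remove_outliers(df, cols, n_std=3):
    # keep only rows within n_std standard deviations for each column
    for col in cols:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] - mean).abs() <= n_std * std]
    return df

def encode_categoricals(df, cols):
    # integer-encode each categorical column in place
    df = df.copy()
    for col in cols:
        df[col] = pd.factorize(df[col])[0]
    return df

# tiny toy frame standing in for the diamonds data
df = pd.DataFrame({
    "price": [100, 105, 98, 100],
    "cut":   ["Ideal", "Good", "Ideal", "Ideal"],
})

df_preped = (df.pipe(drop_duplicates)
               .pipe(remove_outliers, ["price"])
               .pipe(encode_categoricals, ["cut"]))
```

Each `pipe` call receives the dataframe returned by the previous step, so the chain reads top to bottom like a recipe.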

3. factorize

`factorize` works much like sklearn's `LabelEncoder`:

# Mind the [0] at the end
diamonds["cut_enc"] = pd.factorize(diamonds["cut"])[0]

>>> diamonds["cut_enc"].sample(5)

52103    2
39813    0
31843    0
10675    0
6634     0
Name: cut_enc, dtype: int64

The difference is that `factorize` returns a two-element tuple: the encoded codes and a list of the unique categorical values.

codes, unique = pd.factorize(diamonds["cut"], sort=True)

>>> codes[:10]
array([0, 1, 3, 1, 3, 2, 2, 2, 4, 2], dtype=int64)

>>> unique
['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']

4. explode

`explode` can blow up array-like values such as lists into multiple rows:

data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")

data.explode("dirty", ignore_index=True)
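For reference, the call above turns the 6 original rows into 10 — each list element gets its own row, and `ignore_index=True` rebuilds a clean 0..9 index. A quick check:

```python
import pandas as pd

data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")
exploded = data.explode("dirty", ignore_index=True)

print(len(exploded))              # 10 rows after exploding
print(exploded["dirty"].iloc[3])  # 46 — first element of the first list
```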

5. squeeze

Many times we filter with `.loc` expecting a single value back, but what we get is a `Series`. In fact, `squeeze()` solves this perfectly. For example:

# without squeeze
subset = diamonds.loc[diamonds.index < 1, ["price"]]
# with squeeze
subset = diamonds.loc[diamonds.index < 1, ["price"]].squeeze()

As you can see, the squeezed result is a plain `int64` scalar, not a `Series`.
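A self-contained illustration of the difference, using a toy frame instead of the diamonds data:

```python
import pandas as pd

df = pd.DataFrame({"price": [326, 327, 334]})

subset = df.loc[df.index < 1, ["price"]]  # still a 1x1 DataFrame
value = subset.squeeze()                  # collapses to a plain scalar

print(type(subset).__name__)  # DataFrame
print(value)                  # 326
```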

6. between

A dataframe has many filtering methods — `loc`, `isin`, and so on — but there is actually a very concise one built specifically for filtering a range of values: `between`. Its usage is very simple:

diamonds[diamonds["price"].between(3500, 3700, inclusive="neither")].sample(5)
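The same filter on toy prices (the call above assumes the seaborn diamonds dataset); `inclusive` also accepts `"both"`, `"left"`, and `"right"`:

```python
import pandas as pd

df = pd.DataFrame({"price": [3400, 3500, 3600, 3700, 3800]})

# "neither" excludes both endpoints, so only strictly-inside values survive
mask = df["price"].between(3500, 3700, inclusive="neither")
filtered = df[mask]
print(filtered)  # only the 3600 row remains
```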

7. T

`T` is a simple attribute, available on every dataframe, that transposes it. It works especially well when displaying the output of `describe()`.
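For example, `describe()` produces one column per variable; transposing puts variables on the rows, which reads far better when there are many columns. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# variables become rows, summary statistics become columns
summary = df.describe().T
print(summary)
```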


8. pandas styler

pandas can also apply Excel-style conditional formatting to a table, and it only takes a single line of code (though it may require a bit of front-end HTML and CSS knowledge).

>>> diabetes.describe().T.drop("count", axis=1)\

Of course, there are many kinds of conditional formatting.

9. Pandas options

pandas has a large number of global option settings, divided into the following 5 categories:

['compute', 'display', 'io', 'mode', 'plotting']

In general, the `display` options get the most use — things like the maximum number of rows and columns to show, the plotting backend, display precision, and so on.

pd.options.display.max_columns = None
pd.options.display.precision = 5
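These settings apply globally. To change one only temporarily, `pd.option_context` scopes the override to a `with` block and restores the old value afterwards:

```python
import pandas as pd

pd.options.display.max_columns = None  # show all columns
pd.options.display.precision = 5       # 5 digits after the decimal

# temporary override, automatically restored on exiting the block
with pd.option_context("display.precision", 2):
    print(pd.DataFrame({"x": [3.14159]}))

print(pd.options.display.precision)  # back to 5
```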

10. convert_dtypes

As frequent pandas users know, pandas often reads variables in as the `object` type, which breaks subsequent operations. In this situation, `convert_dtypes` can perform a batch conversion: it automatically infers each column's underlying type and converts it.

sample = pd.read_csv(
    ...,  # file path elided in the original
    usecols=["StationId", "CO", "O3", "AQI_Bucket"],
)

>>> sample.dtypes

StationId      object
CO            float64
O3            float64
AQI_Bucket     object
dtype: object

>>> sample.convert_dtypes().dtypes

StationId      string
CO            float64
O3            float64
AQI_Bucket     string
dtype: object
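A self-contained version of the same idea (note that on recent pandas versions floats may also be converted to the nullable `Float64` type, unlike the older output shown above):

```python
import pandas as pd

df = pd.DataFrame({"station": ["DL001", "DL002"], "co": [1.5, None]})
print(df.dtypes)         # station: object, co: float64

# infer and convert to the best available (nullable) dtypes
converted = df.convert_dtypes()
print(converted.dtypes)  # station: string; co: Float64 on recent pandas
```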

11. select_dtypes

When you need to filter by variable type, you can use `select_dtypes` directly, passing `include` and `exclude` to keep or drop types of variables.

# select numeric variables
diamonds.select_dtypes(include=np.number)

# exclude numeric variables
diamonds.select_dtypes(exclude=np.number)

12. mask

`mask` quickly replaces cell values under a custom condition, and shows up often in the source code of third-party libraries. For example, below we want to blank out every cell whose age is not between 50 and 60 — we just write the condition into `cond` and the replacement into `other`.

import numpy as np
import pandas as pd

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60]).to_frame("ages")

ages.mask(cond=~ages["ages"].between(50, 60), other=np.nan)
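Running that call, the two values outside 50-60 (66 and 49) become NaN while everything else is kept:

```python
import numpy as np
import pandas as pd

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60]).to_frame("ages")

# where cond is True, replace the cell with `other`; otherwise keep it
result = ages.mask(cond=~ages["ages"].between(50, 60), other=np.nan)
print(result["ages"].isna().sum())  # 2 values replaced
```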

13. min, max of the column axis

Everyone knows `min` and `max`, but applying them along the column axis (`axis=1`) is probably rare. They can actually be used like this:

import numpy as np
import pandas as pd

index = ["Diamonds", "Titanic", "Iris", "Heart Disease", "Loan Default"]
libraries = ["XGBoost", "CatBoost", "LightGBM", "Sklearn GB"]

df = pd.DataFrame(
    {lib: np.random.uniform(90, 100, 5) for lib in libraries}, index=index
)

>>> df
>>> df.max(axis=1)

Diamonds         99.52684
Titanic          99.63650
Iris             99.10989
Heart Disease    99.31627
Loan Default     97.96728
dtype: float64

14. nlargest、nsmallest


When you want the N rows with the largest or smallest values of a column, `nlargest` and `nsmallest` are more direct than sorting and slicing:

diamonds.nlargest(5, "price")

diamonds.nsmallest(5, "price")
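A self-contained sketch with toy prices in place of the diamonds data:

```python
import pandas as pd

df = pd.DataFrame({"price": [326, 18823, 554, 17000, 2757]})

top2 = df.nlargest(2, "price")      # the two most expensive rows
bottom2 = df.nsmallest(2, "price")  # the two cheapest rows
print(top2)
print(bottom2)
```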

15. idxmax、idxmin

When we use `max` or `min`, with or without a column axis, the value of the maximum/minimum is returned. But sometimes we don't need the value itself — we need its position, because we often have to locate the row first and then operate on it, for example to pull it out or delete it. In pandas this need is very common.

`idxmax` and `idxmin` solve it:

>>> diamonds.price.idxmax()

>>> diamonds.carat.idxmin()
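A self-contained sketch showing that what comes back is the index label of the extreme, not the value:

```python
import pandas as pd

prices = pd.Series([326, 18823, 554])

pos_max = prices.idxmax()  # index label of the maximum value
pos_min = prices.idxmin()  # index label of the minimum value
print(pos_max, pos_min)    # 1 0
```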

16. value_counts

`value_counts` is one of the most heavily used functions in data exploration. By default it does not count null values, but nulls are often exactly what we care about. If you want nulls counted, set the `dropna` parameter to `False`.

ames_housing = pd.read_csv("data/train.csv")

>>> ames_housing["FireplaceQu"].value_counts(dropna=False, normalize=True)

NaN    0.47260
Gd     0.26027
TA     0.21438
Fa     0.02260
Ex     0.01644
Po     0.01370
Name: FireplaceQu, dtype: float64
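The same behavior with toy data: `dropna=False` keeps the NaN bucket, and `normalize=True` turns counts into proportions:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Gd", "TA", np.nan, "Gd", np.nan, np.nan])

# sorted by frequency: NaN 0.5, Gd ~0.33, TA ~0.17
counts = s.value_counts(dropna=False, normalize=True)
print(counts)
```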

17. clip

Outlier handling is a common task in data analysis. The `clip` function makes it easy to cap values that fall outside a variable's range, replacing them with the boundary values.

>>> age.clip(50, 60)
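With the same ages as in the `mask` example, `clip(50, 60)` caps the out-of-range values instead of dropping them:

```python
import pandas as pd

age = pd.Series([55, 52, 50, 66, 57, 59, 49, 60])

# 66 -> 60 and 49 -> 50; everything inside the bounds is unchanged
clipped = age.clip(50, 60)
print(list(clipped))  # [55, 52, 50, 60, 57, 59, 50, 60]
```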

18. at_time、between_time

These two functions are super useful when the time granularity is fine, because they support more granular operations — filtering a specific time of day, or a time-of-day range — down to hours and minutes.

>>> data.at_time("15:00")

>>> data.between_time("09:45", "12:00")
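Both methods require a `DatetimeIndex`; a minimal self-contained sketch:

```python
import pandas as pd

# hourly readings over two days
idx = pd.date_range("2022-01-01", periods=48, freq="h")
data = pd.DataFrame({"value": range(48)}, index=idx)

at_three = data.at_time("15:00")                   # the 15:00 row of each day
morning = data.between_time("09:45", "12:00")      # 10:00-12:00 rows, both days
print(len(at_three), len(morning))                 # 2 6
```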

19. hasnans

pandas provides `hasnans`, a quick way to check whether a given Series contains any null values.

import numpy as np
import pandas as pd

series = pd.Series([2, 4, 6, "sadf", np.nan])

>>> series.hasnans
True

Note that this attribute is only available on `Series`.

20. GroupBy.nth

This function is only available on `GroupBy` objects. Specifically, after grouping, `nth` returns the nth row of each group:

>>> diamonds.groupby("cut").nth(5)
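A self-contained sketch; note that groups without an nth row are silently dropped from the result:

```python
import pandas as pd

df = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Ideal", "Good", "Good"],
    "price": [326, 327, 334, 335, 336],
})

# nth(1) = the second row of each group (0-based)
second_rows = df.groupby("cut").nth(1)
print(second_rows)  # Ideal -> 327, Good -> 336
```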



Finally, let me once again recommend "In-depth and Simple Pandas: Using Python for Data Processing and Analysis", a classic for learning pandas — superb! Its coverage of pandas is particularly complete and detailed, and the last chapter contains a large number of hands-on projects and cases. It's well worth picking up a copy. More information and the table of contents can be found at the links below.