This time I've sorted out a large collection of pandas operations for everyone: 20 functions in total, short and concise, and sure to win you over.
1. ExcelWriter
Data often contains Chinese characters, and if you write it straight to CSV the text can come out garbled. Excel is different: ExcelWriter is a pandas class that writes a dataframe directly to an Excel file, and it lets you name the sheets.
import pandas as pd
from pandas import ExcelWriter

df1 = pd.DataFrame([["AAA", "BBB"]], columns=["Spam", "Egg"])
df2 = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])

with ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")
If you have a time variable, you can also control how it is written with the date_format parameter. In addition, setting mode lets you append to an existing Excel file, which is very flexible.
with ExcelWriter("path_to_file.xlsx", mode="a", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sheet3")
2. pipe
pipe can chain multiple custom functions into a single operation, making the whole code more concise and compact.
For example, data-cleaning code is often messy, with deduplication, outlier removal, categorical encoding, and so on all mixed together. With pipe it looks like this (the three helpers are sketched after the snippet):
import seaborn as sns

diamonds = sns.load_dataset("diamonds")
df_preped = (diamonds
             .pipe(drop_duplicates)
             .pipe(remove_outliers, ['price', 'carat', 'depth'])
             .pipe(encode_categoricals, ['cut', 'color', 'clarity'])
)
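The three helpers are not pandas built-ins. Here is one possible sketch of what they might look like; the names come from the snippet above, but the logic is an assumption, not from the original article:

import pandas as pd

def drop_duplicates(df):
    # drop fully duplicated rows
    return df.drop_duplicates()

def remove_outliers(df, cols):
    # keep rows within 3 standard deviations of the mean for each given column
    for col in cols:
        mean, std = df[col].mean(), df[col].std()
        df = df[(df[col] - mean).abs() <= 3 * std]
    return df

def encode_categoricals(df, cols):
    # label-encode each categorical column
    df = df.copy()
    for col in cols:
        df[col] = pd.factorize(df[col])[0]
    return df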
Two words: clean!
3. factorize
factorize plays a role similar to sklearn's LabelEncoder.
# Mind the [0] at the end
diamonds["cut_enc"] = pd.factorize(diamonds["cut"])[0]
>>> diamonds["cut_enc"].sample(5)
52103    2
39813    0
31843    0
10675    0
6634     0
Name: cut_enc, dtype: int64
The difference is that factorize returns a two-element tuple: the encoded codes and a list of the unique categorical values.
codes, unique = pd.factorize(diamonds["cut"], sort=True)
>>> codes[:10]
array([0, 1, 3, 1, 3, 2, 2, 2, 4, 2], dtype=int64)
>>> unique
['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
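A handy consequence: the original labels can be recovered by indexing unique with codes. A small sketch using the arrays above:

>>> unique[codes[:3]]   # maps the codes 0, 1, 3 back to labels
# -> 'Ideal', 'Premium', 'Good'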
4. explode
explode can expand array-like values, such as lists, into multiple rows.
data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")
data.explode("dirty", ignore_index=True)
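For reference, the call above flattens every list element onto its own row:

>>> data.explode("dirty", ignore_index=True)
  dirty
0     1
1     6
2     7
3    46
4    56
5    49
6    45
7    15
8    10
9    12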
5. squeeze
Many times we filter with .loc expecting a single value, but get back a series instead. In fact, .squeeze() solves this perfectly. For example:
# without squeeze
subset = diamonds.loc[diamonds.index < 1, ["price"]]
# with squeeze
subset.squeeze("columns")

As you can see, the result is reduced along the column axis; on a 1x1 selection like this, a bare .squeeze() collapses it all the way down to an int64 scalar rather than a series.
6. between
A dataframe has many filtering methods, such as loc and isin, but there is also a very concise method dedicated to filtering a range of values: between. It is very simple to use.
diamonds[diamonds["price"]\
.between(3500, 3700, inclusive="neither")].sample(5)
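For comparison, here is the same filter written with plain boolean indexing (between with inclusive="neither" excludes both endpoints):

diamonds[(diamonds["price"] > 3500) & (diamonds["price"] < 3700)].sample(5)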
7. T
T is a simple property available on every dataframe that implements the transpose. It works especially well when displaying the output of describe.
# assuming `boston` is a DataFrame, e.g. the Boston housing data
boston.describe().T.head(10)
8. pandas styler
Like Excel, pandas can apply visual conditional formatting to a table, and it takes only a single line of code (a little front-end HTML and CSS knowledge helps).
# assuming `diabetes` is a DataFrame, e.g. the Pima diabetes dataset
>>> diabetes.describe().T.drop("count", axis=1)\
    .style.highlight_max(color="darkred")

Of course, there are many other kinds of conditional formatting; one more is shown below.
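For instance, a minimal sketch using background_gradient, another built-in Styler method (the colormap choice is arbitrary):

diabetes.describe().T.drop("count", axis=1)\
    .style.background_gradient(cmap="viridis")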
9. Pandas options
pandas offers a large number of global options, divided into the following 5 categories.
dir(pd.options)
['compute', 'display', 'io', 'mode', 'plotting']
In general, the display category gets the most use, covering things like the maximum number of rows and columns shown, the plotting backend, display precision, and so on.
pd.options.display.max_columns = None
pd.options.display.precision = 5
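The same options can also be set and restored by name with the function-style API, pd.set_option and pd.reset_option:

pd.set_option("display.max_columns", None)
pd.reset_option("display.precision")  # back to the default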
10. convert_dtypes
As pandas users know, columns read in as object dtype often cannot be operated on properly afterwards. In this situation, convert_dtypes performs a batch conversion, automatically inferring the best type for each column.
sample = pd.read_csv(
    "data/station_day.csv",
    usecols=["StationId", "CO", "O3", "AQI_Bucket"],
)
>>> sample.dtypes
StationId      object
CO            float64
O3            float64
AQI_Bucket     object
dtype: object
>>> sample.convert_dtypes().dtypes
StationId      string
CO            float64
O3            float64
AQI_Bucket     string
dtype: object
11. select_dtypes
When you need to filter by variable type, you can use select_dtypes directly, passing include and exclude to keep or drop particular dtypes.
# select numeric variables
diamonds.select_dtypes(include=np.number).head()
# exclude numeric variables
diamonds.select_dtypes(exclude=np.number).head()
12. mask
mask quickly replaces cell values wherever a custom condition (cond) holds, filling them with other; you will see it in the source code of many third-party libraries. For example, to blank out every cell whose age falls outside 50 to 60, we just write the condition:
import numpy as np

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60]).to_frame("ages")
ages.mask(cond=~ages["ages"].between(50, 60), other=np.nan)
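mask has a mirror image, where, which keeps values where the condition is True and replaces everything else; the call above can equivalently be written as:

ages.where(cond=ages["ages"].between(50, 60), other=np.nan)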
13. min/max along the column axis
Everyone knows min and max, but it is rare to see them applied along the column axis. They can actually be used like this:
index = ["Diamonds", "Titanic", "Iris", "Heart Disease", "Loan Default"]
libraries = ["XGBoost", "CatBoost", "LightGBM", "Sklearn GB"]
df = pd.DataFrame(
    {lib: np.random.uniform(90, 100, 5) for lib in libraries}, index=index
)
>>> df.max(axis=1)
Diamonds         99.52684
Titanic          99.63650
Iris             99.10989
Heart Disease    99.31627
Loan Default     97.96728
dtype: float64
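The row-wise pattern combines nicely with other axis=1 reductions; for example, idxmax(axis=1) names the winning library per dataset (output omitted since the values above are random):

>>> df.idxmax(axis=1)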
14. nlargest, nsmallest
Sometimes we want not only the min/max value of a column, but its top or bottom N values. That's when nlargest and nsmallest come in handy.
diamonds.nlargest(5, "price")
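And symmetrically, the five cheapest diamonds:

diamonds.nsmallest(5, "price")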

15. idxmax, idxmin
When we use max or min along an axis, the maximum/minimum value itself is returned. But sometimes we don't need the specific value; we need its position, because we often want to lock onto the row and then operate on it as a whole, for example pulling it out or deleting it. This need is quite common, and pandas solves it with idxmax and idxmin.
>>> diamonds.price.idxmax()
27749
>>> diamonds.carat.idxmin()
14
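With the position in hand, locking onto the whole row is one .loc away:

>>> diamonds.loc[diamonds.price.idxmax()]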
16. value_counts
value_counts is used all the time in data exploration. By default it does not count null values, but nulls often matter to us too. To include them, set the dropna parameter to False.
ames_housing = pd.read_csv("data/train.csv")
>>> ames_housing["FireplaceQu"].value_counts(dropna=False, normalize=True)
NaN    0.47260
Gd     0.26027
TA     0.21438
Fa     0.02260
Ex     0.01644
Po     0.01370
Name: FireplaceQu, dtype: float64
17. clip
Outlier handling is a common step in data analysis. The clip function makes it easy to cap values that fall outside a range, replacing them with the boundary values.
# reusing the `ages` frame from section 12
>>> ages.clip(50, 60)
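Hard-coded bounds are not the only option; a common variant (a sketch, not from the original article) clips at empirical quantiles instead:

lower, upper = ages["ages"].quantile([0.05, 0.95])
ages["ages"].clip(lower, upper)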
18. at_time, between_time
These two functions are super useful when the time granularity is fine, because they allow more granular operations, such as filtering a single point in time or a time-of-day range, down to hours and minutes. Note that both require a DatetimeIndex.
# assuming `data` is indexed by a DatetimeIndex, e.g.:
idx = pd.date_range("2021-01-01", periods=100, freq="h")
data = pd.DataFrame({"value": range(100)}, index=idx)

>>> data.at_time("15:00")
>>> data.between_time("09:45", "12:00")

19. hasnans
pandas provides hasnans, a quick way to check whether a given series contains any null value.
series = pd.Series([2, 4, 6, "sadf", np.nan])
>>> series.hasnans
True
Note that this attribute is only available on a series; for a whole dataframe, see the equivalent below.
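There is no dataframe-level hasnans, but the same check can be written with isna (a simple equivalent, not part of the original list):

>>> diamonds.isna().any().any()  # True if any cell in the frame is null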
20. GroupBy.nth
nth is only available on GroupBy objects. Specifically, after grouping, it returns the nth row of each group:
>>> diamonds.groupby("cut").nth(5)
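nth also accepts negative positions and lists of positions (assuming a reasonably recent pandas); for example, the first and last row of every group:

diamonds.groupby("cut").nth([0, -1])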