What Is Feature Engineering and Why It Matters in Machine Learning

Feature engineering is the process of transforming raw data into features that make machine learning models more effective. It is often said that better features beat better algorithms: a simple model with great features often outperforms a complex model trained on raw data. This guide explains the key techniques with code examples.

Key facts at a glance:

- 80% of ML project time is spent on data and feature work
- Better features usually beat better algorithms
- Domain knowledge is the key input to great feature engineering
- AutoML automates feature engineering only partially

1. What is a Feature?

A feature is any measurable property or characteristic of the data used as input to a machine learning model. Raw data (customer birth date) becomes a feature through engineering (age, days since last birthday, birth month). Good features encode the domain knowledge that helps the model make correct predictions.

In machine learning, the distinction between raw data and engineered features is fundamental. Raw data is what you collect — timestamps, text strings, category codes. Features are what you give the model — numerical representations that encode meaning. A date string "2020-03-15" means nothing to a linear regression model. But "account_age_days=1200" and "signup_month=3" give the model something it can learn from. The transformation between the two is feature engineering.

2. Core Feature Engineering Techniques

Python: Feature engineering with pandas
import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'signup_date': pd.to_datetime(['2022-01-15', '2023-06-20', '2021-03-05']),
    'last_login': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-01-10']),
    'revenue': [150, 2500, 450],
    'age': [25, 45, 32],
    'city': ['Boston', 'New York', 'Boston'],
    'n_purchases': [3, 25, 7],
})

# 1. Date/time features
df['account_age_days'] = (pd.Timestamp.now() - df['signup_date']).dt.days
df['days_since_login'] = (pd.Timestamp.now() - df['last_login']).dt.days
df['signup_month'] = df['signup_date'].dt.month
df['signup_day_of_week'] = df['signup_date'].dt.dayofweek

# 2. Ratio/interaction features
df['revenue_per_purchase'] = df['revenue'] / df['n_purchases'].clip(lower=1)  # guard against division by zero
df['purchase_frequency'] = df['n_purchases'] / df['account_age_days'].clip(lower=1)

# 3. Log transform (reduces skew on revenue)
df['log_revenue'] = np.log1p(df['revenue'])  # log1p handles zeros

# 4. Binning (age groups)
df['age_group'] = pd.cut(df['age'],
    bins=[0, 25, 35, 50, 100],
    labels=['young', 'adult', 'mid', 'senior']
)

# 5. Target encoding (encode categorical by target mean)
city_revenue_mean = df.groupby('city')['revenue'].mean()
df['city_avg_revenue'] = df['city'].map(city_revenue_mean)

# 6. Boolean flags
df['is_high_value'] = (df['revenue'] > 1000).astype(int)
df['is_active'] = (df['days_since_login'] < 30).astype(int)

print(df[['revenue', 'log_revenue', 'revenue_per_purchase', 'age_group', 'is_high_value']].head())

3. Key Feature Engineering Categories

Numerical transformations

Log/sqrt for skewed distributions, Min-Max normalization, Z-score standardization, polynomial features (x squared, x cubed). Linear models especially benefit from normalized features. Skewed distributions like revenue often perform better log-transformed.

Categorical encoding

One-hot encoding for low-cardinality categoricals. Target encoding for high-cardinality (cities, zip codes). Ordinal encoding for ordered categories (low/medium/high). The wrong encoding can destroy model performance.
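A sketch of the three encodings on a small illustrative frame; in practice, target-encoding means must be computed on the training fold only to avoid leakage:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'city': ['Boston', 'New York', 'Boston', 'Austin'],
    'plan': ['low', 'high', 'medium', 'low'],
    'revenue': [150, 2500, 450, 90],
})

# One-hot encoding: one binary column per category (low cardinality only)
one_hot = pd.get_dummies(df['city'], prefix='city')

# Ordinal encoding: preserves the low < medium < high ordering
enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['plan_ord'] = enc.fit_transform(df[['plan']]).ravel()

# Target encoding: replace each category with the mean of the target
df['city_target_enc'] = df['city'].map(df.groupby('city')['revenue'].mean())
```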

Time-based features

Extract: day of week, month, hour, quarter. Compute: time since event, recency (days since last purchase), frequency (purchases per day), tenure (account age). Time features capture behavioral patterns and seasonality.
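Calendar features wrap around: December (12) is adjacent to January (1), but as raw integers they are far apart. One common refinement, not shown above, is sin/cos cyclical encoding; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'signup_date': pd.to_datetime(['2022-01-15', '2022-06-20', '2022-12-05'])})
df['month'] = df['signup_date'].dt.month

# Map month onto a circle so months 12 and 1 end up numerically close
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
```

The same trick applies to day of week (period 7) and hour of day (period 24).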

Interaction features

Combine features: revenue/sessions (revenue per visit), purchases times recency, age times income. Domain knowledge drives which interactions are meaningful. Tree models discover interactions automatically; linear models need them explicit.

4. Feature Selection — Remove Noise

Creating features is only half the work. Many engineered features will be redundant, correlated with each other, or simply not useful. Adding too many weak features hurts model performance through the curse of dimensionality and increased overfitting risk. Feature selection identifies which features to keep.

Python: Feature selection methods
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Assumes df holds the engineered features plus a 'target' label column
X = df.drop('target', axis=1)
y = df['target']

# 1. Correlation analysis (remove highly correlated features)
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
X_reduced = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} highly correlated features: {to_drop}")

# 2. Feature importance from Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns)
top_features = importance.nlargest(20).index.tolist()
print("Top 20 features by importance:", top_features)

# 3. Mutual information (captures non-linear relationships)
mi_scores = mutual_info_classif(X, y)
mi_df = pd.DataFrame({'feature': X.columns, 'mi_score': mi_scores})
mi_df = mi_df.sort_values('mi_score', ascending=False)
print(mi_df.head(10))

# 4. Variance threshold (remove near-constant features)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)  # drop features whose variance is below 0.01
X_var = selector.fit_transform(X)
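The SelectKBest and f_classif imports above can be wired up as follows; the data here is synthetic, and k=2 is an arbitrary illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: only f0 and f1 carry signal about the target
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f'f{i}' for i in range(6)])
y = (X['f0'] + 0.5 * X['f1'] + rng.normal(scale=0.1, size=200) > 0).astype(int)

# Keep the k features with the strongest univariate F-test against the target
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)
kept = X.columns[selector.get_support()].tolist()
```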

5. RFM Feature Engineering — A Real-World Example

RFM (Recency, Frequency, Monetary) is one of the most powerful feature engineering frameworks for customer data. It transforms raw transaction history into three high-signal features that predict churn, lifetime value, and purchase probability remarkably well — often outperforming complex deep learning approaches on transactional data.

Python: RFM feature engineering for customer data
import pandas as pd

# Sample transaction data (six customers, so the quintile binning
# below has enough distinct values to form five bins)
transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6],
    'order_date': pd.to_datetime(['2024-01-01', '2024-02-15', '2024-03-10',
                                   '2023-12-01', '2024-01-20', '2024-03-01',
                                   '2023-11-15', '2024-02-01', '2023-10-05',
                                   '2024-03-20']),
    'order_value': [50, 120, 80, 200, 150, 90, 60, 240, 60, 500],
})

snapshot_date = pd.Timestamp('2024-04-01')

# Build RFM features per customer
rfm = transactions.groupby('customer_id').agg(
    recency=('order_date', lambda x: (snapshot_date - x.max()).days),
    frequency=('order_date', 'count'),
    monetary=('order_value', 'sum'),
).reset_index()

# Normalize to 1-5 scores via quintiles (pd.qcut needs enough distinct values per column)
rfm['r_score'] = pd.qcut(rfm['recency'], q=5, labels=[5,4,3,2,1])  # lower recency = better
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), q=5, labels=[1,2,3,4,5])
rfm['m_score'] = pd.qcut(rfm['monetary'], q=5, labels=[1,2,3,4,5])

# Composite RFM score
rfm['rfm_score'] = rfm['r_score'].astype(int) + rfm['f_score'].astype(int) + rfm['m_score'].astype(int)
rfm['segment'] = pd.cut(rfm['rfm_score'],
    bins=[0, 5, 8, 11, 15],
    labels=['at_risk', 'needs_attention', 'loyal', 'champion']
)

print(rfm)

6. Handling Missing Values as Features

Missing values are not just a preprocessing problem — the fact that a value is missing is often itself a highly predictive signal. Imputing the value without capturing the missingness pattern throws away information.

Python: Missing value features
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'income': [50000, None, 120000, None, 80000],
    'credit_score': [720, 680, None, 590, 750],
    'loan_amount': [10000, 25000, 50000, 5000, 15000],
})

# 1. Create missingness indicator features BEFORE imputing
df['income_missing'] = df['income'].isna().astype(int)
df['credit_score_missing'] = df['credit_score'].isna().astype(int)

# 2. Impute with the median (robust to outliers and skew, unlike the mean)
df['income'] = df['income'].fillna(df['income'].median())
df['credit_score'] = df['credit_score'].fillna(df['credit_score'].median())

# Now model sees BOTH the imputed value AND whether it was originally missing
# The missingness pattern itself may be predictive
# (e.g., missing income might indicate self-employment or refusal to disclose)
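The same two-step pattern can be expressed with scikit-learn's SimpleImputer, whose add_indicator flag appends the missingness flags automatically; a sketch on the same illustrative columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'income': [50000.0, np.nan, 120000.0, np.nan, 80000.0],
    'credit_score': [720.0, 680.0, np.nan, 590.0, 750.0],
})

# Median imputation plus one indicator column per feature that had missing values
imputer = SimpleImputer(strategy='median', add_indicator=True)
out = pd.DataFrame(
    imputer.fit_transform(df),
    columns=['income', 'credit_score', 'income_missing', 'credit_score_missing'],
)
```

Because the imputer is a transformer, this version drops straight into the sklearn Pipelines discussed later without custom code.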

Domain knowledge beats algorithmic feature search

The best features come from understanding your domain. For churn prediction: days since last login is more predictive than random polynomial combinations. Talk to domain experts, look at what analysts track manually, and translate that knowledge into features before trying automated approaches.

7. Feature Engineering Pipeline — Production Setup

In production, feature engineering must be reproducible, versioned, and applied consistently to both training and inference data. Using sklearn Pipelines prevents the most common production bug in ML: different preprocessing at training vs. serving time.

Python: sklearn Pipeline for reproducible feature engineering
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Define preprocessing for each column type
numeric_features = ['age', 'income', 'account_age_days']
categorical_features = ['city', 'plan_type']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Combine into a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

# Full ML pipeline: preprocessing + model
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100)),
])

# Fit on training data (X_train/y_train assumed to come from an earlier train/test split)
model_pipeline.fit(X_train, y_train)

# Inference: same transformations applied automatically
predictions = model_pipeline.predict(X_test)

# Save entire pipeline — preprocessing included
import joblib
joblib.dump(model_pipeline, 'model_pipeline.pkl')
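At serving time the saved artifact is loaded and used as one unit, so preprocessing travels with the model. A self-contained round-trip sketch, using a small stand-in pipeline and a hypothetical file name:

```python
import joblib
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-in pipeline and data, purely for illustration
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])
X = pd.DataFrame({'age': [25, 45, 32, 51],
                  'income': [40_000, 90_000, 55_000, 120_000]})
y = [0, 1, 0, 1]
pipe.fit(X, y)

# Training side: persist scaler and model together
joblib.dump(pipe, 'model_pipeline_demo.pkl')

# Serving side: load and predict; no separate preprocessing step to drift
loaded = joblib.load('model_pipeline_demo.pkl')
preds = loaded.predict(X)
```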

Frequently Asked Questions