
Python Machine Learning Basics: A Practical, Hands-On Guide for Intermediate Developers
Dive into Python machine learning with a practical, step-by-step guide that covers core concepts, real code examples, and production considerations. Learn data handling with pandas, model building with scikit-learn, serving via a Python REST API, and validating workflows with pytest.
Introduction
Machine learning (ML) can feel like a black box: math-heavy, jargon-filled, and intimidating. But at its core, ML is about teaching computers to find patterns in data. This guide demystifies the essentials for intermediate Python developers and gives you practical, runnable examples you can reuse in real projects.
You'll learn:
- Key ML concepts and workflow steps.
- How to prepare data with pandas.
- How to build and evaluate a model using scikit-learn.
- How to serve a model with a Python REST API (FastAPI).
- How to write simple tests with pytest and persist models for production.
Prerequisites and Setup
Install required packages:
pip install pandas numpy scikit-learn matplotlib joblib fastapi uvicorn pytest
Why these libraries?
- pandas: data cleaning and analysis.
- numpy: numerical operations.
- scikit-learn: model building and evaluation.
- joblib: model persistence.
- FastAPI + uvicorn: lightweight REST API for serving models.
- pytest: automated testing.
Official documentation:
- Python: https://docs.python.org/3/
- pandas: https://pandas.pydata.org/docs/
- scikit-learn: https://scikit-learn.org/stable/
- FastAPI: https://fastapi.tiangolo.com/
- pytest: https://docs.pytest.org/
Core Concepts — The ML Workflow
Think of ML as a pipeline with these stages:
- Problem definition — classification? regression? clustering?
- Data collection & exploration — load data, check distributions.
- Data preprocessing / feature engineering — handle missing values, encode categories, scale features.
- Model selection — choose algorithms and baseline models.
- Training & validation — split data, cross-validation.
- Evaluation — metrics, confusion matrix, ROC, etc.
- Deployment — serve model (REST API, batch jobs).
- Monitoring & maintenance — drift detection, re-training.
Common challenges to watch for:
- Data quality: missing or biased data.
- Overfitting: model performs well on training, poorly on new data.
- Imbalanced classes: rare events dominating metrics.
- Reproducibility: ensure fixed seeds and save preprocessing steps.
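For the reproducibility point, the habit is simple — a minimal sketch of fixing seeds at the top of a script; scikit-learn objects additionally take an explicit random_state, as the examples later in this guide show:
# Fix global seeds at the top of a script for repeatable runs
import random

import numpy as np

random.seed(42)
np.random.seed(42)
# scikit-learn estimators and splitters also accept random_state=42 directly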
Step-by-Step Example: Binary Classification
We'll build a small classification pipeline using a synthetic dataset. This example is minimal but demonstrates real-world patterns: pandas usage, scikit-learn pipeline, model persistence, REST API, and pytest tests.
1) Data generation and exploration with pandas
# data_prep.py
import pandas as pd
from sklearn.datasets import make_classification
# Generate synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    n_classes=2,
    weights=[0.7, 0.3],
    random_state=42
)

# Convert to pandas DataFrame for exploration
df = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(X.shape[1])])
df['target'] = y

# Quick exploration
print(df.head())
print(df['target'].value_counts(normalize=True))
Explanation, line by line:
- import pandas as pd: import pandas for DataFrame operations.
- from sklearn.datasets import make_classification: helper to synthesize classification data.
- make_classification(...): creates 1000 samples with 10 features (4 informative); weights creates a slight class imbalance and random_state ensures reproducibility.
- pd.DataFrame(...): a pandas DataFrame facilitates exploration and preprocessing.
- df['target'] = y: attaches the labels.
- print(df.head()): shows the top rows (input).
- print(df['target'].value_counts(normalize=True)): shows the class distribution (output), helpful to detect imbalance.
Edge cases:
- Small sample sizes (fewer samples than features) may cause overfitting.
- If features contain NaNs, most scikit-learn models will raise an error — we must impute or drop them.
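If missing values do turn up, one option is to add an imputation step before modeling — a minimal sketch, assuming all features are numeric:
# Impute missing numeric values with the column median
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(df.drop(columns=['target']))
print(np.isnan(X_imputed).sum())  # prints 0 once no NaNs remain
In a Pipeline (introduced next), SimpleImputer would simply become the first step, so the same imputation is applied at inference time.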
2) Train/test split and pipeline with scikit-learn
We'll build a pipeline that scales features and fits a logistic regression classifier.
# model_train.py
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
import pandas as pd
from data_prep import df # re-use DataFrame from previous step
# Split
X = df.drop(columns=['target']).values
y = df['target'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Define pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear', random_state=42))
])

# Cross-validation for baseline
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f'CV ROC AUC scores: {cv_scores}, mean: {cv_scores.mean():.3f}')

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("Classification report:\n", classification_report(y_test, y_pred))

# Persist model
joblib.dump(pipeline, 'pipeline.joblib')
Line-by-line explanation:
- joblib: used to save/load models efficiently.
- train_test_split(..., stratify=y, random_state=42): ensures class proportions are preserved; the seed makes the split reproducible.
- Pipeline([...]): chains preprocessing and estimator; ensures consistent transformations on training and inference.
- StandardScaler(): subtracts the mean and scales to unit variance — important for models like logistic regression.
- LogisticRegression(solver='liblinear'): a simple baseline classifier; liblinear is good for small datasets.
- cross_val_score(..., scoring='roc_auc'): uses 5-fold cross-validation to estimate generalization (ROC AUC is robust for imbalanced data).
- pipeline.fit(...): fits scaler and classifier on training data.
- predict_proba: returns class probabilities, used for ROC AUC and thresholding decisions.
- joblib.dump(...): saves the entire pipeline including scaler and model, which preserves the preprocessing.
Inputs and outputs:
- Input: numpy arrays X_train/X_test and labels.
- Output: the saved pipeline.joblib file and printed metrics.
Edge cases:
- If a categorical variable existed, you'd need OneHotEncoder or OrdinalEncoder in the pipeline.
- For very imbalanced classes, consider oversampling (SMOTE) or class_weight in the classifier (see the sketch below).
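As a concrete example of the class_weight option — a minimal sketch, not a full comparison against SMOTE — you can weight classes inversely to their frequency inside the same pipeline:
# Same pipeline as before, but the classifier reweights the minority class
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

balanced_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear', class_weight='balanced', random_state=42))
])
# Train and evaluate exactly as before, e.g. balanced_pipeline.fit(X_train, y_train)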
3) Serving the model with a Python REST API (FastAPI)
Now expose a simple /predict endpoint that accepts JSON features and returns a prediction and probability.
# api_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, conlist
import joblib
import numpy as np
# Load model at startup
model = joblib.load('pipeline.joblib')

app = FastAPI(title="ML Model API")

# Define input schema: a list of exactly 10 floats
# (pydantic v1 syntax; with pydantic v2, conlist takes min_length/max_length instead)
class PredictRequest(BaseModel):
    features: conlist(float, min_items=10, max_items=10)

class PredictResponse(BaseModel):
    prediction: int
    probability: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    try:
        X = np.array([req.features])  # shape (1, n_features)
        proba = model.predict_proba(X)[0, 1]
        pred = int(model.predict(X)[0])
        return PredictResponse(prediction=pred, probability=float(proba))
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
Line-by-line:
- BaseModel and conlist from pydantic validate payload shape and types automatically.
- model = joblib.load('pipeline.joblib'): loads the pipeline with scaler + classifier.
- The endpoint accepts JSON like {"features": [0.1, -1.2, ...]}.
- We wrap prediction in try/except to return informative HTTP 400 errors for malformed input.
Run the server locally with:
uvicorn api_server:app --reload --port 8000
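To sanity-check the running server, you can POST a request from a small script — a sketch using only the standard library, with placeholder zeros as feature values:
# call_api.py — manual smoke test against the local server
import json
import urllib.request

payload = {"features": [0.0] * 10}  # placeholder values; use real feature values in practice
req = urllib.request.Request(
    "http://localhost:8000/predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # prints {"prediction": ..., "probability": ...}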
Edge cases and safety:
- Protect model access in production (authentication, rate limiting).
- Validate numeric ranges where appropriate.
- Consider batching requests for throughput.
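On the last point, one way to support batching — a sketch only; the /predict_batch route and BatchPredictRequest name are hypothetical, and the code assumes the definitions from api_server.py above (same pydantic v1 conlist syntax) — is to accept multiple rows and run them through the pipeline in a single call:
# Hypothetical batch endpoint, added alongside the code in api_server.py
from typing import List

class BatchPredictRequest(BaseModel):
    rows: List[conlist(float, min_items=10, max_items=10)]

@app.post("/predict_batch")
def predict_batch(req: BatchPredictRequest):
    X = np.array(req.rows)                 # shape (n_rows, n_features)
    probas = model.predict_proba(X)[:, 1]  # one vectorized pass through the pipeline
    preds = model.predict(X)
    return [
        {"prediction": int(p), "probability": float(pr)}
        for p, pr in zip(preds, probas)
    ]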
4) Testing with pytest
Create unit tests for small components: data shapes and API contract.
# test_model.py
import joblib
import numpy as np
from fastapi.testclient import TestClient
from api_server import app

client = TestClient(app)
model = joblib.load('pipeline.joblib')

def test_model_predict_shape():
    # Create a valid input
    X = np.zeros((1, 10))
    proba = model.predict_proba(X)[0, 1]
    assert 0.0 <= proba <= 1.0

def test_api_predict_endpoint():
    payload = {"features": [0.0] * 10}
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    data = response.json()
    assert "prediction" in data and "probability" in data
    assert isinstance(data["prediction"], int)
    assert 0.0 <= float(data["probability"]) <= 1.0
Notes:
- fastapi.testclient.TestClient lets you test endpoints without running a server.
- pytest will discover and run the tests; run pytest -q to execute them.
- The tests check shapes, types, and probability bounds.
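It can also be worth asserting that malformed payloads are rejected — a small additional test; FastAPI returns HTTP 422 when pydantic validation fails:
# test_model.py (continued): malformed input should be rejected by validation
def test_api_rejects_wrong_feature_count():
    payload = {"features": [0.0] * 3}  # too few features for the schema
    response = client.post("/predict", json=payload)
    assert response.status_code == 422  # validation error, handled before our code runs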
Best Practices and Production Considerations
- Use Pipelines: Keep preprocessing and model in a single pipeline to avoid mismatch between training and serving.
- Reproducibility: Set random_state for deterministic behavior during development.
- Versioning: Track model versions and data schema; store metadata (training score, date, hyperparameters) — see the sketch after this list.
- Validation: Use pydantic or marshmallow to validate incoming API payloads.
- Monitoring: Log predictions, inputs, and ground truth (when available) to detect drift.
- Security: Sanitize inputs, use HTTPS, add authentication for APIs.
- Performance: Profile inference latency, use vectorized operations, and consider model quantization or lightweight libraries for edge deployment.
- Parallelism: For high throughput, use worker processes (Gunicorn/Uvicorn with workers) and batching strategies.
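For the versioning bullet above, a lightweight sketch — the file names and metadata fields are illustrative, not a standard — is to write a small JSON file next to each model artifact:
# save_model.py — persist the pipeline alongside simple metadata (illustrative)
import json
from datetime import datetime, timezone

import joblib

def save_model_with_metadata(pipeline, test_score, path_prefix="model_v1"):
    joblib.dump(pipeline, f"{path_prefix}.joblib")
    metadata = {
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "roc_auc_test": float(test_score),
        "pipeline_steps": [name for name, _ in pipeline.steps],
        # stringify values so arbitrary hyperparameters stay JSON-serializable
        "clf_params": {k: str(v) for k, v in pipeline.named_steps["clf"].get_params().items()},
    }
    with open(f"{path_prefix}.json", "w") as f:
        json.dump(metadata, f, indent=2)
At serving time, load the .joblib and .json pair together so you always know which model version is answering requests.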
Common Pitfalls and How to Avoid Them
- Overfitting: Use cross-validation, regularization, and simpler models as baselines.
- Leaky features: Ensure your features do not contain future information (e.g., a timestamp-based target leakage).
- Data drift: Periodically compare training and incoming data distributions (e.g., feature histograms) — see the sketch after this list.
- Wrong evaluation metric: Select metrics aligned with business goals (precision, recall, F1, ROC AUC).
- Ignoring class imbalance: Use stratified splits, class weights, or resampling techniques.
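For the data-drift point, one simple check — a sketch using a two-sample Kolmogorov–Smirnov test; scipy is available because scikit-learn depends on it — compares a feature's training distribution against fresh data:
# drift_check.py — flag a feature whose distribution has shifted
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, new_values, alpha=0.05):
    """Two-sample KS test: a small p-value suggests the distributions differ."""
    _, p_value = ks_2samp(train_values, new_values)
    return p_value < alpha

# Usage sketch with synthetic data: the shifted mean simulates drift
rng = np.random.default_rng(42)
print(feature_drifted(rng.normal(0.0, 1.0, 1000), rng.normal(0.5, 1.0, 1000)))  # True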
Advanced Tips
- For large datasets, use incremental learning (e.g., partial_fit with SGDClassifier) or out-of-core libraries like Dask or scikit-learn's incremental tools.
- For hyperparameter tuning, use GridSearchCV or RandomizedSearchCV; for larger searches, consider Optuna or scikit-optimize (a tuning sketch follows the ColumnTransformer example below).
- Use ColumnTransformer for mixed feature types (numeric vs categorical) inside pipelines, as in the example below.
- For model explainability, try SHAP or LIME to explain predictions to stakeholders.
# Example: ColumnTransformer for mixed feature types
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Suppose the first 3 columns are categorical
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), [0, 1, 2]),
    ('num', StandardScaler(), slice(3, None))
])
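And for the tuning bullet, a minimal GridSearchCV sketch over the same scaler + logistic-regression pipeline (the parameter grid is illustrative, and X_train/y_train are assumed to come from the earlier split):
# tuning_example.py — grid search over the pipeline's hyperparameters
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear', random_state=42))
])

# Parameter names follow the '<step>__<param>' convention from the Pipeline step names
param_grid = {
    'clf__C': [0.01, 0.1, 1.0, 10.0],  # inverse regularization strength
    'clf__penalty': ['l1', 'l2'],      # both are supported by the liblinear solver
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
search.fit(X_train, y_train)  # X_train/y_train from the train_test_split above
print(search.best_params_, round(search.best_score_, 3))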
Performance Considerations
- Vectorize operations with numpy/pandas; avoid Python loops on rows.
- Cache preprocessing steps if they are expensive.
- For inference scaling, use model serving platforms (KFServing, BentoML) or containers with autoscaling.
- Measure latency and throughput under realistic load; use A/B tests for new models.
Visualizing Results (quick tip)
While optional, visual diagnostics help:
- Confusion matrix heatmap to see types of errors.
- ROC curves to choose classification thresholds.
- Feature importance (coefficients for logistic regression or feature_importances_ for tree models).
# Compute ROC curve points (y_test and y_proba come from model_train.py)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
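For the confusion-matrix heatmap mentioned above, a short sketch (reusing y_test and y_pred from the training script; ConfusionMatrixDisplay is available in scikit-learn 1.0+):
# Plot a confusion matrix heatmap for the test predictions
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title("Confusion matrix (test set)")
plt.show()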
Conclusion
You've now seen a full, practical pipeline:
- Data creation and exploration with pandas.
- Model construction, validation, and persistence with scikit-learn and joblib.
- Serving predictions via a lightweight Python REST API (FastAPI).
- Basic automated testing using pytest.
Call to action: Try running the code, modify the synthetic dataset to mirror a real problem, and deploy the API locally. Share your model and tests with teammates — reproducibility is as important as accuracy.
Further Reading and Resources
- scikit-learn tutorials: https://scikit-learn.org/stable/tutorial/index.html
- pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
- FastAPI documentation: https://fastapi.tiangolo.com/
- pytest guide: https://docs.pytest.org/en/stable/getting-started.html
- Official Python docs: https://docs.python.org/3/