denis-ai.bsky.social
@denis-ai.bsky.social
All I want is to deploy in the cloud☁️
Once containerized there are many options😃
See four liners at github.com/DataTalksClu...
November 3, 2025 at 9:10 PM
What about getting uv into a Docker container? 🤔
Just copy binary like this:
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ 😎
November 3, 2025 at 9:00 PM
The uv package manager
is really, really fast. Get it with pip install uv👌
November 3, 2025 at 8:58 PM
Summary of MLZC Week 5
- loading and saving models
- serve model with FastAPI
- manage environments with uv
- package prediction service in Docker container
- deploy to cloud
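A minimal sketch of the first bullet, saving and loading a model with pickle. The dict of weights is a stand-in for a trained model and the file name is hypothetical; neither comes from the course material.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: illustrative weights, not from a real fit.
model = {"intercept": -1.5, "coef": [0.8, -0.3]}

# Hypothetical file name inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "model.bin")

# Save: open in binary write mode and dump the object.
with open(path, "wb") as f_out:
    pickle.dump(model, f_out)

# Load: open in binary read mode. Only unpickle files you trust.
with open(path, "rb") as f_in:
    loaded = pickle.load(f_in)

print(loaded)
```

The same two calls should work for a fitted scikit-learn model, as long as the same library version is installed where the file is loaded.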
November 3, 2025 at 8:56 PM
Summary of MLZC Week 4
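ROC curve: plots true positive rate against false positive rate, so it evaluates performance at all thresholds
K-fold Cross Validation: split the data into k parts (folds), train on k-1 of them, validate on the remaining one, and repeat until every fold has served as the validation set once.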
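A rough sketch of how the folds rotate, in plain Python. The helper name k_fold_indices is made up for illustration, and it does not shuffle; in practice sklearn.model_selection.KFold handles this, including shuffling.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, val_idx in splits:
    # Train on train_idx, score on val_idx, then average the k scores.
    print(len(train_idx), val_idx)
```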
October 23, 2025 at 8:36 PM
Summary of MLZC week 4
Metric: single number to describe model performance
Accuracy: fraction of correct answers
Precision and recall are less misleading than accuracy when classes are imbalanced
October 23, 2025 at 8:31 PM
Dummy binary classifier
Can be very accurate when classes are imbalanced 👌
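To see it with numbers, here is a small sketch with made-up labels (90 customers stay, 10 churn) and a dummy model that always predicts the majority class.

```python
# Illustrative labels: 90 customers stayed (0), 10 churned (1).
y_true = [0] * 90 + [1] * 10

# Dummy model: always predict the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual churners the model caught.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy, recall)
```

Accuracy comes out at 0.9 while recall is 0.0: the model looks good and never finds a single churner.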
October 23, 2025 at 8:27 PM
Accuracy is not enough!
Get precision and recall too.
Bonus: calculate F1 score just in case!
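A small sketch of all three from confusion counts. The counts are invented for illustration, and the helper name is hypothetical; sklearn.metrics has precision_score, recall_score and f1_score for real use.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts."""
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 3))
```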
October 23, 2025 at 8:25 PM
Confusion table helps to avoid confusion ...
When predicting customer churn, consider:
Predicted: no churn (Negative)
Customer didn't churn - True Negative (TN)
Customer churned - False Negative (FN)
Predicted: churn (Positive)
Customer churned - True Positive (TP)
Customer didn't churn - False Positive (FP)
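The four cells can be counted in a few lines. A sketch with made-up labels, where 1 means churn and 0 means no churn; the helper name confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FN, TP, FP for binary labels (1 = churn, 0 = no churn)."""
    counts = {"TN": 0, "FN": 0, "TP": 0, "FP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            # Predicted churn: correct if the customer really churned.
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            # Predicted no churn: correct if the customer really stayed.
            counts["TN" if actual == 0 else "FN"] += 1
    return counts


y_true = [0, 0, 1, 1, 0, 1]  # illustrative ground truth
y_pred = [0, 1, 1, 0, 0, 1]  # illustrative predictions
counts = confusion_counts(y_true, y_pred)
print(counts)
```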
October 23, 2025 at 8:23 PM
sklearn.metrics module
Score functions, performance metrics, pairwise metrics and distance computations.
See scikit-learn.org/stable/modul...
3.4. Metrics and scoring: quantifying the quality of predictions
Which scoring function should I use?: Before we take a closer look into the details of the many scores and evaluation metrics, we want to give some guidance, inspired by statistical decision theory...
October 23, 2025 at 8:17 PM
ROC AUC is a useful metric
Evaluate model with Area Under the ROC curve (AUC). AUC equals 0.5 for a random baseline and 1.0 for an ideal model.
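One way to read AUC: the probability that a randomly picked positive gets a higher score than a randomly picked negative. A brute-force sketch of that reading, with made-up scores; for real work sklearn.metrics.roc_auc_score is the usual choice.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
auc_ideal = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])     # perfect ranking
auc_constant = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # uninformative scores
auc_mixed = roc_auc(y_true, [0.1, 0.4, 0.35, 0.8])    # one pair out of order
print(auc_ideal, auc_constant, auc_mixed)
```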
October 23, 2025 at 8:11 PM
When data preprocessing looks right

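from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),  # scale numeric columns
    ('cat', OneHotEncoder(drop='first'), categorical_features)  # one-hot, drop first level
])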
October 13, 2025 at 10:58 PM
Get the weights of a trained model with:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
...
model.intercept_[0]  # bias term
model.coef_[0]  # one weight per feature
October 13, 2025 at 10:55 PM
Logistic regression is a supervised ML algorithm
Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class. It is used for binary classification, with outputs such as Yes/No or 0/1. The sigmoid function converts the linear score into a probability between 0 and 1.
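A minimal sketch of that last step. The weights, bias and feature values are invented for illustration, and predict_proba here is a hypothetical helper, not the scikit-learn method.

```python
import math


def sigmoid(z):
    """Map any real score to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Linear score (bias + w . x) pushed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Illustrative numbers only: the score is -1.0 + 2.0 - 1.0 = 0.0.
prob = predict_proba(features=[1.0, 2.0], weights=[2.0, -0.5], bias=-1.0)
print(prob)
```

A score of 0 lands exactly on 0.5, which is why 0.5 is the usual default decision threshold.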
October 13, 2025 at 10:49 PM
Make Features Important Again!
Risk ratios, mutual information and Pearson correlation are your little helpers🧑‍🎓
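A sketch of the first helper, assuming the risk ratio is defined as the churn rate within a group divided by the global churn rate. The tiny dataframe and its column names are made up.

```python
import pandas as pd

# Illustrative data: 4 monthly and 4 yearly contracts, 1 = churned.
df = pd.DataFrame({
    "contract": ["monthly"] * 4 + ["yearly"] * 4,
    "churn": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Churn rate over all customers.
global_rate = df["churn"].mean()

# Churn rate within each contract type.
group_rate = df.groupby("contract")["churn"].mean()

# Above 1: the group churns more than average. Below 1: less.
risk_ratio = group_rate / global_rate
print(risk_ratio)
```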
October 13, 2025 at 10:38 PM
One-hot encoding is easy
with DictVectorizer from sklearn.feature_extraction 👌
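A short sketch of how that can look. The records and field names are invented; string values become one column per category, numbers pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Illustrative records: one dict per row.
records = [
    {"contract": "monthly", "tenure": 2},
    {"contract": "yearly", "tenure": 24},
]

# sparse=False returns a plain NumPy array instead of a sparse matrix.
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

print(dv.feature_names_)
print(X)
```

With a dataframe, df.to_dict(orient='records') produces this list-of-dicts shape.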
October 13, 2025 at 10:34 PM
Before model training do that ☝️
- Categorical Variables: convert with one-hot encoding
- Numerical Variables: scale or normalize
- Missing Values: fix with chosen strategy
- Data Split: 60% for training, 20% for validation and the remaining 20% for testing
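A sketch of the last bullet, the 60/20/20 split, done on shuffled row indices. The dataset size and the seed are arbitrary choices for illustration.

```python
import numpy as np

n = 100  # illustrative number of rows

# Size validation and test first; training takes whatever is left,
# so rounding never loses a row.
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# Shuffle row indices with a fixed seed for reproducibility.
rng = np.random.default_rng(42)
idx = rng.permutation(n)

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

These index arrays can then be passed to df.iloc[...] to pick the rows for each subset.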
October 13, 2025 at 10:31 PM
Learnings from Week 3 of ML-Zoomcamp
- Not all features are equal - feature analysis with risk ratios, mutual information and correlation matrices
- One-hot encoding for categorical features
- Classify with logistic regression
- Probabilities for categories and interpretation of weights
October 13, 2025 at 10:27 PM
Data preparation basics - II

Shuffle data using pandas only 👍 The built-in sample() function is ready to help!
pandas.pydata.org/docs/referen...

df_shuffled = df.sample(frac=1, random_state=seed)
df_shuffled.reset_index(drop=True, inplace=True)
pandas.DataFrame.sample — pandas 2.3.3 documentation
October 8, 2025 at 8:47 AM
Data preparation basics
Shuffle your data before splitting it into train, validation, test subsets:
import numpy as np

idx = np.arange(len(df))
np.random.seed(42)  # seed value, fixed for reproducibility
np.random.shuffle(idx)

df_train = df.iloc[idx[:train_size]]
...
October 7, 2025 at 9:57 PM
Learnings from Week 2 of ML-Zoomcamp
- Start with EDA
- Look at target variable distribution
- Make train, val, test datasets
- Matrix manipulations for linear regression
- Code linear regression in a few lines of code
- RMSE
- Feature engineering helps
- Regularization is needed too
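A sketch tying together the matrix, RMSE and regularization bullets: linear regression through the normal equation, with an optional regularization term r added to the diagonal. The function names and the toy data are assumptions, not the course's exact code.

```python
import numpy as np


def train_linear_regression(X, y, r=0.0):
    """Solve (X^T X + r I) w = X^T y, with a bias column prepended to X."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    # r > 0 keeps the system well-conditioned when columns are correlated.
    # Note: this simple version regularizes the bias term as well.
    XTX = X.T @ X + r * np.eye(X.shape[1])
    w_full = np.linalg.solve(XTX, X.T @ y)
    return w_full[0], w_full[1:]


def rmse(y, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_pred) ** 2))


# Toy data generated from y = 1 + 2x, so the fit should recover 1 and 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 1.0 + 2.0 * X[:, 0]

w0, w = train_linear_regression(X, y, r=0.0)
error = rmse(y, w0 + X @ w)
print(w0, w, error)
```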
October 7, 2025 at 9:47 PM
Reduce the skewness in data with np.log1p() 😎
Easy!☝️
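A tiny sketch with made-up prices. np.log1p(x) computes log(1 + x), so zeros are safe, and np.expm1 undoes it when predictions need to go back to the original scale.

```python
import numpy as np

# Illustrative long-tailed values, including a zero.
prices = np.array([0.0, 9.0, 99.0, 999.0])

# log(1 + x): compresses the long tail and is defined at x = 0.
log_prices = np.log1p(prices)

# Inverse transform: exp(x) - 1.
restored = np.expm1(log_prices)

print(log_prices)
print(restored)
```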
October 7, 2025 at 9:39 PM
Little hacks💡to EDA your data - III👌
Have a grasp of values and their uniqueness for every column with:
for col in df.columns:
    print(col)
    print(df[col].unique()[:10])
    print(df[col].nunique())
October 7, 2025 at 9:30 PM
Little hacks💡to EDA your data - II👌
Get number of missing values for every column with df.isnull().sum()
October 7, 2025 at 9:25 PM
Little hacks💡to EDA your data - I👌
After loading dataframe, have a quick look at it with df.T
October 7, 2025 at 9:19 PM