Once containerized, there are many options 😃
See four liners at github.com/DataTalksClu...
Just copy the binary like this:
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ 😎
uv is really, really fast. Get it with pip install uv 👌
- load and save models
- serve the model with FastAPI (sketch below)
- manage environments with uv
- package the prediction service in a Docker container
- deploy to the cloud
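A minimal sketch of the serving step, assuming a (DictVectorizer, model) pair pickled to model.bin — the file name and /predict route are made up:

# serve.py - minimal FastAPI prediction service (hypothetical names)
import pickle
from fastapi import FastAPI

app = FastAPI()

# load a (DictVectorizer, model) pair saved earlier with pickle
with open('model.bin', 'rb') as f_in:
    dv, model = pickle.load(f_in)

@app.post('/predict')
def predict(customer: dict):
    X = dv.transform([customer])         # dict -> feature matrix
    prob = model.predict_proba(X)[0, 1]  # probability of the positive class
    return {'churn_probability': float(prob)}

# run with: uvicorn serve:app --host 0.0.0.0 --port 9696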
ROC curve: evaluates performance at all classification thresholds
K-fold Cross-Validation: train and validate a model on a number of parts (folds) and loop through all fold combinations.
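A minimal K-fold sketch with scikit-learn; df_full_train, the churn column and the train()/predict() helpers are placeholders for your own pipeline:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]
    dv, model = train(df_train)            # placeholder training helper
    y_pred = predict(df_val, dv, model)    # placeholder prediction helper
    scores.append(roc_auc_score(df_val.churn.values, y_pred))

print('mean AUC: %.3f +- %.3f' % (np.mean(scores), np.std(scores)))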
Metric: a single number that describes model performance
Accuracy: fraction of correct answers
Precision & Recall are less misleading when classes are imbalanced
Even a trivial model can be very accurate when classes are imbalanced 👌
Get precision and recall too.
Bonus: calculate F1 score just in case!
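A quick sketch with sklearn.metrics, assuming y_val holds the true labels and y_pred the predicted probabilities:

from sklearn.metrics import precision_score, recall_score, f1_score

churn = y_pred >= 0.5  # threshold probabilities into hard 0/1 predictions

print('precision:', precision_score(y_val, churn))
print('recall:   ', recall_score(y_val, churn))
print('f1:       ', f1_score(y_val, churn))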
When predicting customer churn, consider:
No churn (Negative)
- Customer didn't churn - True Negative (TN)
- Customer churned - False Negative (FN)
Churn (Positive)
- Customer churned - True Positive (TP)
- Customer didn't churn - False Positive (FP)
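All four counts drop out of one sklearn call — a sketch, assuming y_val holds true labels and churn holds hard 0/1 predictions:

from sklearn.metrics import confusion_matrix

# for binary 0/1 labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_val, churn).ravel()
print('TN=%d FP=%d FN=%d TP=%d' % (tn, fp, fn, tp))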
sklearn.metrics: score functions, performance metrics, pairwise metrics and distance computations.
See scikit-learn.org/stable/modul...
Evaluate the model with the Area Under the ROC Curve (AUC). AUC equals 0.5 for a random baseline and 1.0 for an ideal model.
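A one-call sketch, assuming y_val are the true labels and y_pred the predicted probabilities:

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_val, y_pred)
print('AUC: %.3f' % auc)  # 0.5 ~ random baseline, 1.0 ~ ideal model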
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),              # scale numbers
    ('cat', OneHotEncoder(drop='first'), categorical_features)  # one-hot encode
])
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
...
model.intercept_[0]  # bias term
model.coef_[0]       # feature weights
Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that an input belongs to a specific class. It's used for binary classification, with outputs like Yes/No or 0/1. The sigmoid function converts inputs into a probability between 0 and 1.
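The sigmoid itself is a one-liner:

import numpy as np

def sigmoid(z):
    # maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5  - the decision boundary
print(sigmoid(5))   # ~0.993 - confident "yes"
print(sigmoid(-5))  # ~0.007 - confident "no"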
Risk ratios, mutual information and Pearson's correlation are your little helpers 🧑‍🎓
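A small sketch of two of them; df_train, the churn column and the categorical/numerical column lists are assumptions:

from sklearn.metrics import mutual_info_score

def mi_churn(series):
    # mutual information between one categorical column and the target
    return mutual_info_score(series, df_train.churn)

mi = df_train[categorical].apply(mi_churn)           # mutual information
corr = df_train[numerical].corrwith(df_train.churn)  # Pearson correlation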
Do one-hot encoding with DictVectorizer from sklearn.feature_extraction 👌
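A minimal sketch (df_train and the column lists are assumptions):

from sklearn.feature_extraction import DictVectorizer

dicts = df_train[categorical + numerical].to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(dicts)  # one-hot encodes the string columns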
- Categorical Variables: convert with one-hot encoding
- Numerical Variables: scale or normalize
- Missing Values: fix with a chosen strategy
- Data Split: 60% for training, 20% for validation, and the rest for testing
- Not all features are equal - feature analysis with risk ratios, mutual information and correlation matrices
- One-hot encoding for categorical features
- Classify with logistic regression
- Probabilities for categories and interpretation of weights (sketch below)
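A sketch of the probabilities-and-weights part, assuming the dv and model objects from the snippets above:

# soft predictions: probability of churn for each customer
y_pred = model.predict_proba(X_val)[:, 1]

# interpret weights: a larger positive weight pushes towards churn
weights = dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))
print(weights)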
Shuffle data using pandas only 👍 The built-in sample() function is ready to help!
pandas.pydata.org/docs/referen...
df_shuffled = df.sample(frac=1, random_state=seed)
df_shuffled.reset_index(drop=True, inplace=True)
Shuffle your data before splitting it into train, validation, test subsets:
import numpy as np

idx = np.arange(len(df))
np.random.seed(42)  # seed value for reproducibility
np.random.shuffle(idx)
df_train = df.iloc[idx[:train_size]]
...
- Start with EDA
- Look at the target variable distribution
- Make train, val, test datasets
- Matrix manipulations for linear regression
- Code linear regression in a few lines (see the sketch below)
- RMSE
- Feature engineering helps
- Regularization is needed too
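A sketch of the matrix version with L2 regularization — one common way to code it, not necessarily the course's exact function:

import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    # add a bias column of ones
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # normal equation with L2 regularization: w = inv(X.T X + r*I) X.T y
    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])
    w = np.linalg.inv(XTX).dot(X.T).dot(y)

    return w[0], w[1:]  # bias, feature weights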
Easy!☝️
Have a grasp of values and their uniqueness for every column with:
for col in df.columns:
    print(col)
    print(df[col].unique()[:10])
    print(df[col].nunique())
Get number of missing values for every column with df.isnull().sum()
After loading the dataframe, have a quick look at it with df.T