Custom Python Package
Our code repository has a custom Python package. The custom package, imaginatively named my_custom_package, can be found in the git repository in src/my_custom_package and has a very minimalist setup.py file in the
The package is installed in our pipelines from our requirements.txt file, but it can also be installed from the command line as follows:
Here we only have a single Python package for our project, but some projects grow to be quite complex, with multiple Python packages that all provide functionality required for model training.
Our code repository has a number of utility files inside the src/my_custom_package/utils/ directory that contain functionality for interacting with Azure services (blob_storage_interface.py), for storing constants (const.py) and for transforming data (transform_data.py).
As we’ve discussed above, any pre-processing we apply to our training data must also be applied to the data we’ll be using for predictions in our deployed service.
Our function for transforming data will be imported in both our training script and the script we’ll be using to call our model. Depending on the nature of the data transformation, we might instead have included it in the scoring script itself.
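To make the pattern concrete, here is a minimal sketch of one transform function feeding both a training routine and a prediction routine. Everything in it is an assumption for illustration: the names transform, train and predict, the synthetic data and the least-squares fit are stand-ins, and the transform body borrows the column drop that this section arrives at further down.

```python
import numpy as np
import pandas as pd


def transform(X_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the package's shared pre-processing function."""
    return X_data.drop(["D", "I"], axis=1)


def train(X_raw: pd.DataFrame, y: np.ndarray) -> np.ndarray:
    """Fit ordinary least squares on the transformed features."""
    X = transform(X_raw).to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict(X_raw: pd.DataFrame, coef: np.ndarray) -> np.ndarray:
    """Apply the same transform before scoring so the columns match training."""
    return transform(X_raw).to_numpy() @ coef


# Synthetic data (assumed): the target depends only on columns A and E.
rng = np.random.default_rng(0)
X_raw = pd.DataFrame(rng.normal(size=(50, 4)), columns=["A", "D", "E", "I"])
y = 2.0 * X_raw["A"].to_numpy() - 1.0 * X_raw["E"].to_numpy()

coef = train(X_raw, y)
predictions = predict(X_raw.head(3), coef)
```

Because both routines call the same function, a change to the pre-processing only has to be made in one place, and the model never sees differently shaped data at prediction time.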
So what data pre-processing do we need to do? If we look at a correlation plot of the input data, we see that some of the columns are highly correlated.
You can detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector shows which variables are collinear.
If there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero.
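As a rough sketch of this check, the snippet below builds a small synthetic DataFrame (an assumption for illustration, not the data used in this article) in which column "c" is almost exactly "a" plus "b", then inspects the eigenvalues and eigenvectors of its correlation matrix with NumPy. The column names and the 0.1 loading threshold are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumed): "c" is nearly a + b, so a, b and c are collinear.
a = rng.normal(size=n)
b = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + rng.normal(scale=0.01, size=n),
    "d": rng.normal(size=n),
})

corr = df.corr().to_numpy()

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# The eigenvector paired with the smallest eigenvalue shows which columns are involved.
smallest = eigenvalues[0]
loadings = pd.Series(eigenvectors[:, 0], index=df.columns)
collinear_columns = sorted(loadings[loadings.abs() > 0.1].index)

print(eigenvalues.round(4))
print(loadings.round(3))
print(collinear_columns)
```

The columns with large absolute loadings on the near-zero eigenvalue are the ones taking part in the linear dependency, while an unrelated column such as "d" gets a loading close to zero.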
If we take a look at the eigenvalues, there are two that are clearly close to zero.
Upon inspecting the eigenvectors, as well as our correlation plot above, we see that the source of the collinearity is columns "D" and "I". So let’s remove these.
transform_data.py is shown below.
return X_data.drop(['D', 'I'], axis=1)
After removing these columns, none of our eigenvalues are close to zero.
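A before-and-after comparison along these lines can be sketched as follows. The data here is synthetic and only mimics the situation described: columns "D" and "I" are constructed as near-linear combinations of the other columns, and the helper name smallest_eigenvalue is hypothetical.

```python
import numpy as np
import pandas as pd


def smallest_eigenvalue(X_data: pd.DataFrame) -> float:
    """Return the smallest eigenvalue of the correlation matrix of X_data."""
    return float(np.linalg.eigvalsh(X_data.corr().to_numpy())[0])


rng = np.random.default_rng(1)
n = 500
base = pd.DataFrame(rng.normal(size=(n, 3)), columns=["A", "B", "C"])

# Synthetic stand-ins (assumed): "D" and "I" nearly duplicate information
# that is already present in the other columns.
X_data = base.assign(
    D=base["A"] + base["B"] + rng.normal(scale=0.01, size=n),
    I=2.0 * base["C"] + rng.normal(scale=0.01, size=n),
)

before = smallest_eigenvalue(X_data)
after = smallest_eigenvalue(X_data.drop(["D", "I"], axis=1))

print(f"smallest eigenvalue before: {before:.5f}")
print(f"smallest eigenvalue after:  {after:.5f}")
```

With the redundant columns present the smallest eigenvalue sits very near zero, and once they are dropped it moves well away from zero, which is the same signal we look for on the real data.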