Predicting whether a given company is under financial distress based on time-based data for different companies.
The financial stability of a company is dependent on various factors. Predicting financial distress is necessary to take appropriate steps to manage the situation and get the company back on track.
In this article, we predict whether a given set of companies are under financial distress based on 83 time-based factors.
Implementation of the idea on cAInvas — here!
The dataset
The dataset is a CSV file with financial distress prediction for a set of companies.
Along with companies and time periods, there are 83 factors denoted by x1 to x83 that define the financial and non-financial characteristics of the companies. Out of these, x80 is a categorical feature.
The ‘Financial Distress’ column is a continuous variable that can be converted into a two-value column — healthy (0) if value > -0.5, else distressed (1).
Let us understand the dataset we are working with. Each company has 1 or more rows corresponding to various time periods. Looking into how many —
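A minimal sketch of this check, assuming the CSV has been loaded into a pandas DataFrame with the Company and Time columns of the dataset (the file name is an assumption; adjust it to your copy):

```python
import pandas as pd

# Load the dataset (file name is an assumption)
df = pd.read_csv('Financial Distress.csv')

# Count the time periods recorded per company
periods_per_company = df.groupby('Company')['Time'].count()

print('Number of companies:', periods_per_company.shape[0])
print('Companies with fewer than 5 time periods:',
      (periods_per_company < 5).sum())
```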
There are 422 companies in the dataset. A few have fewer than 5 time periods too!
Preprocessing
One hot encoding the input variables
x80 is a categorical variable that has to be one hot encoded, as its values are discrete categories with no ordinal (range) relationship between them.
The drop_first parameter is set to True, so a column with n categories is encoded into n-1 dummy columns instead of n, i.e., each value is represented as an (n-1)-element array. The first category is represented by an array of all 0s, while the i-th of the remaining categories has a 1 at index i-1.
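A sketch of the encoding step using pandas get_dummies:

```python
# One-hot encode x80; drop_first=True yields n-1 dummy columns
x80_dummies = pd.get_dummies(df['x80'], prefix='x80', drop_first=True)

# Replace the original categorical column with its dummy columns
df = pd.concat([df.drop(columns=['x80']), x80_dummies], axis=1)
```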
Creating a time-based data frame
Since this is a time-based dataset, the features are appended to include values from previous timesteps of the same company group.
A time window of 5 is defined, i.e., attribute values from 5 consecutive timesteps are combined to create one row of the final dataset. Companies with fewer time periods than the defined time window are discarded.
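One way to build these windowed rows, assuming df holds the encoded data; the exact windowing in the notebook may differ:

```python
import numpy as np

TIME_WINDOW = 5
feature_cols = [c for c in df.columns
                if c not in ('Company', 'Time', 'Financial Distress')]

samples, labels = [], []
for _, group in df.groupby('Company'):
    group = group.sort_values('Time')
    if len(group) < TIME_WINDOW:
        continue  # discard companies with too few time periods
    values = group[feature_cols].to_numpy()
    target = group['Financial Distress'].to_numpy()
    for end in range(TIME_WINDOW, len(group) + 1):
        # Flatten the last TIME_WINDOW timesteps into a single row
        samples.append(values[end - TIME_WINDOW:end].ravel())
        labels.append(target[end - 1])  # label of the latest period

X_windowed = np.array(samples)
y_raw = np.array(labels)
```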
Binarizing the target variable
We are approaching this as a classification problem, so the continuous target is converted into a binary-valued feature using the condition defined previously — healthy (0) if value > -0.5, else distressed (1).
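A sketch of the thresholding step, applied here to the windowed labels from the previous step:

```python
import numpy as np

# Healthy (0) if the value is greater than -0.5, distressed (1) otherwise
y = (y_raw <= -0.5).astype(int)

print(np.bincount(y))  # class counts reveal the imbalance
```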
This is not a balanced dataset.
Balancing the dataset
Since each one-row sample already packs the values of 5 timesteps, we resample and train on this dataset without a time-series split.
It is an unbalanced dataset. There are two options to balance it:
- upsampling — resample the minority-class values to increase their count in the dataset.
- downsampling — pick n samples from each class label, where n = the number of samples in the class with the least count (here, 83), i.e., reduce the count of the majority-class values in the dataset.
Here, we will be upsampling.
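A sketch of the upsampling step using sklearn.utils.resample (the random_state is an arbitrary choice):

```python
import numpy as np
from sklearn.utils import resample

# Separate the two classes
X_majority = X_windowed[y == 0]   # healthy
X_minority = X_windowed[y == 1]   # distressed

# Upsample the minority class (sampling with replacement)
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=42)

# Recombine into a balanced dataset
X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate([np.zeros(len(X_majority), dtype=int),
                             np.ones(len(X_minority_up), dtype=int)])
```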
Defining the input and output columns for use later —
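A sketch, assuming the balanced arrays are assembled into a DataFrame df_final (the f0, f1, ... column names are placeholders for the flattened features):

```python
import pandas as pd

# Assemble the balanced data into a DataFrame
df_final = pd.DataFrame(
    X_balanced,
    columns=[f'f{i}' for i in range(X_balanced.shape[1])])
df_final['Financial Distress'] = y_balanced

input_columns = [c for c in df_final.columns if c != 'Financial Distress']
output_columns = ['Financial Distress']
print(len(input_columns), 'input columns,',
      len(output_columns), 'output column')
```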
There are 448 input columns and 1 output column.
Train-validation-test split
Using an 80–10–10 ratio to split the data frame into train, validation, and test sets. The train_test_split function of the sklearn.model_selection module is used for this, applied twice. These are then divided into X and y (input and output) for further processing.
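A sketch of the two-stage split (the random_state values are arbitrary):

```python
from sklearn.model_selection import train_test_split

# 80% train; split the remaining 20% evenly into validation and test
train_df, temp_df = train_test_split(df_final, test_size=0.2,
                                     random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5,
                                   random_state=42)

# Separate inputs (X) and outputs (y) for each split
Xtrain, ytrain = train_df[input_columns], train_df[output_columns]
Xval, yval = val_df[input_columns], val_df[output_columns]
Xtest, ytest = test_df[input_columns], test_df[output_columns]
```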
Scaling the values
The ranges of attribute values in the dataset are not the same across attributes. This may result in certain attributes being weighted higher than others. The values of all attributes are scaled to the range [0, 1].
The MinMaxScaler of the sklearn.preprocessing module is used to implement this. The scaler instance is fit on the training data only and then used to transform the train, validation, and test data.
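A sketch of the scaling step:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # scales each column to the range [0, 1]

# Fit on the training data only, then transform all three splits
Xtrain = scaler.fit_transform(Xtrain)
Xval = scaler.transform(Xval)
Xtest = scaler.transform(Xtest)
```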
The model
The model is a simple one with 4 Dense layers, 3 of which have ReLU activation functions and the last one has a Sigmoid activation function that outputs a value in the range [0, 1].
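A sketch of such a model in Keras; the hidden-layer widths here are assumptions, not the notebook's exact values:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(256, activation='relu',
                 input_shape=(Xtrain.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # outputs a value in [0, 1]
])
```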
As it is a binary classification problem, the model is compiled using the binary cross-entropy loss function. The Adam optimizer is used and the accuracy of the model is tracked over epochs.
The EarlyStopping callback of the keras.callbacks module monitors the validation loss and stops training if it doesn't decrease for 5 consecutive epochs. Setting restore_best_weights=True ensures that the weights from the epoch with the lowest validation loss are restored to the model.
The model was first trained with a learning rate of 0.01 and then fine-tuned with a learning rate of 0.001.
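A sketch of the two-phase training; the epoch count is an assumption, as the EarlyStopping callback decides when training actually stops:

```python
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping

cb = [EarlyStopping(monitor='val_loss', patience=5,
                    restore_best_weights=True)]

# Phase 1: train with the higher learning rate
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss='binary_crossentropy', metrics=['accuracy'])
history1 = model.fit(Xtrain, ytrain, validation_data=(Xval, yval),
                     epochs=256, callbacks=cb)

# Phase 2: continue training with the lower learning rate
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy', metrics=['accuracy'])
history2 = model.fit(Xtrain, ytrain, validation_data=(Xval, yval),
                     epochs=256, callbacks=cb)
```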
The model achieved an accuracy of ~93% on the test set.
Plotting a confusion matrix to understand the results better —
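A sketch using sklearn's confusion_matrix and ConfusionMatrixDisplay:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Threshold the sigmoid outputs at 0.5 to obtain class labels
y_pred = (model.predict(Xtest) > 0.5).astype(int).ravel()
y_true = ytest.to_numpy().ravel()

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm,
                       display_labels=['healthy', 'distressed']).plot()
plt.show()
```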
A larger dataset with more instances of financial distress would help in achieving a better test set accuracy. Feel free to play around with the time window to see the variation in results.
The metrics
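The loss and accuracy are tracked over epochs. A minimal sketch of plotting those curves from the Keras History objects returned by fit():

```python
import matplotlib.pyplot as plt

def plot_metric(history, metric):
    # Training vs. validation curves for one metric
    plt.plot(history.history[metric], label='train ' + metric)
    plt.plot(history.history['val_' + metric], label='val ' + metric)
    plt.xlabel('epochs')
    plt.legend()
    plt.show()

plot_metric(history1, 'loss')
plot_metric(history1, 'accuracy')
```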
Prediction
Let’s perform predictions on random test data samples —
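A sketch of a single random prediction, thresholding the sigmoid output at 0.5:

```python
import random

# Pick a random test sample, predict, and compare with the ground truth
i = random.randint(0, len(Xtest) - 1)
prob = model.predict(Xtest[i:i + 1])[0][0]
label = 'distressed' if prob > 0.5 else 'healthy'
actual = int(ytest.to_numpy().ravel()[i])

print(f'Predicted: {label} (p = {prob:.3f}); actual: {actual}')
```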
deepC
The deepC library, compiler, and inference framework is designed to enable and perform deep learning neural networks by focusing on the features of small form-factor devices like microcontrollers, eFPGAs, CPUs, and other embedded devices such as Raspberry Pi, Odroid, Arduino, SparkFun Edge, RISC-V boards, mobile phones, and x86 and ARM laptops, among others.
Compiling the model using deepC —
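On cAInvas, this is typically a one-line notebook cell that invokes the deepCC compiler on the saved Keras model (the file name here is an assumption):

```python
# Save the trained Keras model, then invoke the deepCC compiler
model.save('financial_distress.h5')

!deepCC financial_distress.h5
```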
Head over to the cAInvas platform (link to notebook given earlier) and check out the predictions by the .exe file!
Credits: Ayisha D
Also Read: Windmill Fault Prediction App