Some months ago, I was contacted by a training-fitness startup. They wanted to know whether analytics techniques could help them answer some of their business questions and enhance trainees' performance. I told them that it was indeed a doable task. However, they pointed out that the startup had limited resources, a small budget, and NO data. I realized that the task at hand was a huge challenge. So, how could I tackle it effectively, get the job done, and deliver actionable insight for the startup? It was time, once again, to think outside the box.
The answer was to combine some of the Google Workspace collaborative tools (Docs, Sheets, Slides, Forms) with the TRIFACTA Data Wrangling Tools and Software and the comprehensive package library of the open-source R language to design and implement an insightful, low-cost solution, and to serve the results as fully interactive tables and easy-to-digest visualizations in Google Looker Studio.
To clarify the discussion that follows, I have divided this post into three sections:
Gathering data and identifying key variables: online surveys and Principal Component Analysis - PCA
Graph Analysis and Weighted Association Rules Mining - WARM
Implementation of a Content-Based Recommendation Engine - CBRE
Let's unlock another use case that illustrates how to get the most value out of data and help solve a real-world problem.
Gathering data and identifying key variables: online surveys and Principal Component Analysis - PCA
The main issue I had to address was the absence of reliable data. The easiest and cheapest solution was to use the free Google Forms tool to build a few online surveys and ask the trainees to fill them out. Fortunately, there are plenty of reliable free tutorials and video walkthroughs that explain how to build a survey in Google Forms using question types such as short-answer and paragraph questions, single- and multiple-choice questions, and checkbox questions. Tips for improving online surveys are also available online.
The first survey deployed was designed to explore the trainees' experience and satisfaction. The knowledge gained from analyzing the results with Google Forms' built-in visualizations was used directly to quickly address and fix a few important issues (which had not yet been detected) regarding trainers' performance, logistics, etc.
The figure below shows the surveys designed, using some of the question types already mentioned, to collect trainees' general data, like age, sex, and email (to be used as a unique identifier), among other relevant information.
The figure above depicts some of the trainees' general survey responses, saved as a Google Sheet. Google Sheets is another powerful, easy-to-use, free cloud-based tool that offers many of the features of Excel. Sheets are easy to share, and it is simple to configure the access and roles of multiple users.
As I pointed out in my first post, data preparation is the most important step in any Data Science workflow. It was particularly important in this example because NO data was available to work with. To deliver reliable results, it was imperative to apply advanced data preparation techniques to extract every single drop of knowledge and relevant information encapsulated in the survey responses, all in a short time. Wrangling the data properly was critical to addressing the issues and accomplishing the goals.
The process begins with a Principal Component Analysis (PCA) performed to explore the survey responses and select a reduced set of variables or principal components. Simply put, it was performed to identify the most relevant and informative variables for the upcoming analysis. The figure above illustrates the logic of the data preparation in TRIFACTA. To the right is the recipe implemented to reshape the data and ready it for PCA. As will be discussed in the following sections, recipes were also implemented to reshape the data for visualization, graph analysis, etc.
The R packages FactoMineR and factoextra were leveraged to carry out the analysis and build some insightful visualizations. The packages' ability to handle quantitative and categorical variables was pivotal to tackling the challenge and quickly obtaining reliable and usable results.
The website Statistical Tools for High-Throughput Data Analysis (STHDA) contains tutorials and many examples of applications of the R packages already mentioned. The figure above shows one of the available graphic tools to visually deliver the PCA results. The image can be interpreted as follows: the variables located farthest from the intersection of the Dim1 and Dim2 axes account for the GREATEST VARIABILITY in the data; that is, they are the most impactful variables. For a detailed explanation, additional visual tools, and examples, please explore the STHDA web page. Taking into account the suggestions of the trainers and the startup's domain experts, a set of 16 (out of 27) variables (highlighted inside the dotted blue curve) was selected.
Armed with the PCA results, the next step was to go back to TRIFACTA, reformat the available data, and generate the inputs needed to apply Graph and Weighted Association Rules Mining techniques. The goal was to unearth additional actionable knowledge to help the startup's trainers enhance trainees' performance.
Graph Analysis and Weighted Association Rules Mining - WARM
The figure below shows the results (the trainees' identifiers have been anonymized) of the complex data preparation process in TRIFACTA, which generates the input in a format suitable for the Graph and Association Rules Mining analysis.
The figure below illustrates an example of a graph of the surveyed trainees, built and plotted using the arules and igraph R packages. By tuning the plot function parameters (size, colors, etc.), it's possible to surface a few interesting features, as well as key relationships between some of them. The figure suggests that there is a connection between, for example, BEBE_ALCOHOL_FRECUENTE (frequent alcohol consumption) and other factors that harm the trainees' performance, like LESION_MUSCULAR_ARTICULAR_SI (muscular or joint injuries) and HORAS_DUERME_NOCHE_5-6 (sleep deprivation, 5 to 6 hours of sleep per night).
To follow up on these visual findings and unearth more possible associations between variables, a Weighted Association Rules Mining analysis was performed, applying the workflow discussed in my first post. To unlock the current use case, the apriori and hits algorithms and the same analogy were applied: each trainee is considered a "customer", and the factors mentioned above, like BEBE_ALCOHOL_FRECUENTE (alcohol consumption), HORAS_DUERME_NOCHE_5-6 (sleep deprivation), etc., are the items "bought" by the customers. The task is to uncover the most relevant associations (rules) between the items.
The figures above show Graph and Parallel Coordinate plots, respectively. They were obtained following a procedure similar to the one used to unlock the use case presented in the first post. The Lift metric was also used here to rank the rules or item associations (in the Parallel plot, the thicker the red line, the higher the Lift value). After carefully exploring both figures, it's not hard to conclude that the incidence of muscular and joint injuries could be closely associated with respiratory issues, frequent alcohol consumption, and other unhealthy habits of the trainees.
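To make the Lift ranking more concrete, here is a minimal sketch in plain Python, not the R arules workflow used in the project. It treats each trainee as a "basket" of factors and scores simple one-to-one rules. The transactions are invented for illustration (only the item names come from the survey), the helper names support and pairwise_rules are hypothetical, and the sketch is unweighted: it does not reproduce the hits-based weighting of the actual analysis.

```python
from itertools import combinations

# Hypothetical "baskets": one set of survey factors per trainee.
# The item names appear in the post; the data is invented for illustration.
transactions = [
    {"BEBE_ALCOHOL_FRECUENTE", "LESION_MUSCULAR_ARTICULAR_SI", "HORAS_DUERME_NOCHE_5-6"},
    {"BEBE_ALCOHOL_FRECUENTE", "LESION_MUSCULAR_ARTICULAR_SI"},
    {"HORAS_DUERME_NOCHE_5-6"},
    {"LESION_MUSCULAR_ARTICULAR_SI", "HORAS_DUERME_NOCHE_5-6"},
    {"BEBE_ALCOHOL_FRECUENTE", "LESION_MUSCULAR_ARTICULAR_SI"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(wanted <= basket for basket in transactions) / len(transactions)


def pairwise_rules(transactions):
    """Score every one-to-one rule lhs -> rhs and sort by lift, highest first."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        joint = support({a, b}, transactions)
        if joint == 0:
            continue  # the two items never occur together
        for lhs, rhs in ((a, b), (b, a)):
            confidence = joint / support({lhs}, transactions)
            lift = confidence / support({rhs}, transactions)
            rules.append(
                {"lhs": lhs, "rhs": rhs, "support": joint,
                 "confidence": confidence, "lift": lift}
            )
    return sorted(rules, key=lambda rule: rule["lift"], reverse=True)


rules = pairwise_rules(transactions)
for rule in rules:
    print(f"{rule['lhs']} -> {rule['rhs']}: lift={rule['lift']:.2f}")
```

A Lift value above 1 means the two items appear together more often than would be expected if they were independent, which is why the rules at the top of the ranking are the ones worth inspecting first.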
To facilitate interpretation and add explainability, the rules are tabulated (ranked by the Lift metric) in the figure above, with the top 10 associations highlighted. The trainers can use this actionable insight to address and mitigate the negative factors and associations identified, directly helping to enhance both the health and the performance of the trainees.
Implementation of a Content-Based Recommendation Engine - CBRE
Finally, the startup was interested in exploring innovative ways to optimize and personalize its training programs. They wondered if it was possible to extract additional knowledge from the available survey response data. This is where the idea of designing and implementing a Content-Based Recommendation Engine (CBRE) came in handy. You guessed right... I had to go back to TRIFACTA.
Indeed, to tackle this challenge, it was necessary to implement the most complex data preparation recipes so far. An example is illustrated in the figure above. Again, the trainees' identifiers have been anonymized.
First, one-hot encoding was used to transform the categorical variables into numeric 0/1 codes. This generated the input required to compute a Similarity Matrix (using the Pearson Correlation Coefficient) to compare the trainees. The Similarity Matrix for this example is a square matrix in which each row (and each column) corresponds to a trainee; the diagonal is filled with values equal to 1 (each trainee is identical to himself or herself), and the off-diagonal elements are values between -1 (very different) and 1 (very similar). Calculations were performed once again in the R language framework.
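The two steps just described, one-hot encoding followed by a Pearson-based Similarity Matrix, can be sketched as follows. This is an illustrative Python version, not the R code used in the project: the trainee identifiers, variables, and answers are invented, and the function names one_hot, pearson, and similarity_matrix are hypothetical.

```python
from math import sqrt

# Hypothetical survey answers; identifiers, variables and values are invented.
responses = {
    "T01": {"sex": "F", "goal": "strength", "sleep": "7-8"},
    "T02": {"sex": "F", "goal": "strength", "sleep": "5-6"},
    "T03": {"sex": "M", "goal": "cardio", "sleep": "5-6"},
}


def one_hot(responses):
    """Encode each (variable, value) pair as a 0/1 indicator column."""
    columns = sorted(
        {(var, val) for answers in responses.values() for var, val in answers.items()}
    )
    encoded = {
        trainee: [1 if answers.get(var) == val else 0 for var, val in columns]
        for trainee, answers in responses.items()
    }
    return columns, encoded


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def similarity_matrix(encoded):
    """Square matrix (dict of dicts) with one row and one column per trainee."""
    return {a: {b: pearson(encoded[a], encoded[b]) for b in encoded} for a in encoded}


columns, encoded = one_hot(responses)
similarity = similarity_matrix(encoded)
for trainee, row in similarity.items():
    print(trainee, {other: round(value, 2) for other, value in row.items()})
```

In this toy data, T01 and T03 give opposite answers to every question, so their similarity is -1, while T01 and T02 differ only in their sleep answer and end up with a positive similarity.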
Second, the Similarity Matrix was unpivoted and blended with the trainees' general data and other relevant information. This refined, blended file is the core of the recommender system. The final step in implementing the Content-Based Recommendation Engine consisted of making it easy for the trainers to access and use the latest results, so they were presented as a fully interactive table with controls and as an easy-to-digest and informative visualization. The figure below depicts the built solution.
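As a rough illustration of the unpivot-and-blend step, the sketch below reshapes a small similarity matrix from wide to long format and joins each row with profile data. It is written in Python rather than as a TRIFACTA recipe, and everything in it is an assumption made for the example: the matrix values, the profile fields, the column names reference, trainee, and similarity, and the helper names unpivot and blend.

```python
# Hypothetical wide-format similarity matrix and trainee profiles (invented values).
similarity = {
    "T01": {"T01": 1.0, "T02": 0.33, "T03": -1.0},
    "T02": {"T01": 0.33, "T02": 1.0, "T03": -0.33},
    "T03": {"T01": -1.0, "T02": -0.33, "T03": 1.0},
}
profiles = {
    "T01": {"age": 34, "sex": "F"},
    "T02": {"age": 29, "sex": "F"},
    "T03": {"age": 41, "sex": "M"},
}


def unpivot(matrix):
    """Wide -> long: one (reference, trainee, similarity) row per matrix cell."""
    return [
        {"reference": reference, "trainee": other, "similarity": value}
        for reference, row in matrix.items()
        for other, value in row.items()
    ]


def blend(long_rows, profiles):
    """Left join: attach the compared trainee's profile columns to each row."""
    return [{**row, **profiles.get(row["trainee"], {})} for row in long_rows]


table = blend(unpivot(similarity), profiles)
for row in table[:3]:
    print(row)
```

The long format has one row per pair of trainees, which is the shape that filter controls in a dashboard can work with directly.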
In the figure above (trainees' identifiers are anonymized), taking a trainer or an advanced trainee as a reference and selecting his or her identifier in the REFERENCE dropdown filter, the engine displays, in the table at the bottom, a list of the most similar trainees, sorted in descending order. Results are also displayed at the top right in a compelling visualization. With the SLIDER, it is possible to adjust the range of similarity index values shown. Additional relevant data and information (age, sex, etc.) were included in the table, too. The user can export the filtered data or save it as a Google Sheet.
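The behavior of the REFERENCE filter and the SLIDER can be approximated in a few lines of Python. This is only a sketch of the logic, not the Looker Studio implementation: the rows are invented, the function name most_similar and its min_similarity parameter are hypothetical, and excluding the reference trainee from his or her own list is an assumption of this example.

```python
# Hypothetical long-format rows: one per (reference, trainee) pair, invented values.
rows = [
    {"reference": "T01", "trainee": "T01", "similarity": 1.0, "age": 34},
    {"reference": "T01", "trainee": "T02", "similarity": 0.33, "age": 29},
    {"reference": "T01", "trainee": "T03", "similarity": -1.0, "age": 41},
    {"reference": "T01", "trainee": "T04", "similarity": 0.8, "age": 25},
    {"reference": "T02", "trainee": "T01", "similarity": 0.33, "age": 34},
]


def most_similar(rows, reference, min_similarity=0.0):
    """Trainees compared against `reference`, most similar first.

    `min_similarity` plays the role of the slider; the reference itself is
    dropped from the list (an assumption made for this sketch).
    """
    matches = [
        row for row in rows
        if row["reference"] == reference
        and row["trainee"] != reference
        and row["similarity"] >= min_similarity
    ]
    return sorted(matches, key=lambda row: row["similarity"], reverse=True)


for row in most_similar(rows, "T01", min_similarity=0.3):
    print(row["trainee"], row["similarity"], row["age"])
```

Raising min_similarity narrows the list to the closest matches, while lowering it toward -1 brings back every trainee compared against the selected reference.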
The results delivered by the CBRE can be used immediately by the trainers, for example, to prescribe customized workout routines, nutrition supplements, etc., to the most similar trainees of different levels, relying on processes that have already been tested and refined in the (more advanced) reference group.
Summary
In this post, an End-to-End Data Analytics Solution was designed, built, and deployed. It comprises the construction of improved online surveys using the Google Forms tool; the implementation of complex data preparation recipes in TRIFACTA to transform and reshape the surveys' responses; PCA, Graph, and Weighted Association Rules Mining Analysis carried out by applying algorithms/methods available in the R language framework; and the design and implementation of a Content-Based Recommendation Engine in Google Looker Studio.
This low-cost analytics workflow allowed me to generate the relevant data, reshape and refine it, and visually discover and extract actionable insight that can be used immediately to address the real-life issues of the training-fitness startup. It also helps the startup's trainers and domain experts deliver recommendations and customized workout routines aimed at enhancing the trainees' health and performance. The project results and recommendations are wrapped up in the following video:
In future posts, I'll continue presenting and discussing more relevant real-life use cases. Please stay tuned so you don't miss them. Leave your comments below, and feel free to share this post and contact me if you need additional information.