Fracking: using graph and association rules mining techniques to optimize wells design

Angel Aponte
Mar 15, 2023
Updated: Apr 7, 2023

Hydraulic fracturing, also called fracking, is a WELL-STIMULATION technique involving the fracturing of bedrock formations by a pressurized liquid. The process involves the high-pressure injection of "fracking fluid" (primarily water, containing sand or other proppants suspended with the aid of thickening agents) into a wellbore to create cracks in the deep-rock formations through which natural gas, petroleum, and brine will flow more freely. When the hydraulic pressure is removed from the well, small grains of hydraulic fracturing proppants (sand or aluminum oxide) hold the fractures open.

Hydraulic fracturing is used to increase the rate at which petroleum or natural gas can be recovered from subterranean natural reservoirs. Reservoirs (as depicted in the figure below) are typically porous sandstones, limestones, or dolomite rocks but also include "unconventional reservoirs" such as shale rock and tight sands. Hydraulic fracturing enables the extraction of natural gas and oil from rock formations or unconventional reservoirs deep below the earth's surface (generally 2,000–6,000 m (5,000–20,000 ft)). At such depths, there may be insufficient permeability or reservoir pressure to allow natural gas and oil to flow from the rock into the wellbore at a high economic return. Thus, creating conductive fractures in the rock is instrumental in extraction from naturally impermeable SHALE reservoirs.

Conventional and unconventional reservoirs

Since the early 2000s, advances in drilling and completion technology ("plug-and-perf", "perf", "sliding sleeve", etc.) have made horizontal wellbores, like the ones illustrated in the figure above, much more economical. This is particularly useful in shale formations (Vaca Muerta in Argentina, for example) that don't have sufficient permeability to produce economically with a vertical well. The type of wellbore completion determines how much a formation is fractured and at what locations along the well's horizontal section.

The variables related to the typical (unconventional) reservoir fracking process can be summarized as follows:

The pressure (in psi) used in the injection process
The power (in hp) consumed by operating pumps and other pieces of equipment
The volume (in m3) of water injected
The quantity (in tons) of sand (or other proppants)
Type of completion ("plug-and-perf", "perf" and "sliding sleeve", etc.)
Well's horizontal section length (in m)
The number of fractures

In general, fracking is a complex and expensive process. Therefore, any methodology or procedure that contributes to reducing costs, optimizing well designs, and enhancing the whole process would be welcomed by operators.

Some months ago, while I was presenting to an Oil and Gas company an analytic solution built on data from conventional reservoirs, someone asked me whether it was possible to build something similar and apply it to publicly available fracking data. I immediately started wondering what fracking and retail have in common (data, tons of it!) and how to modify and adapt one of the End-to-End Analytics Solutions from my previous posts to unlock this new and challenging real-world use case.

To clarify the exposition, I've divided the current post into three sections:

Data preparation tool, Principal Components Analysis (PCA), and visual exploration
Weighted Association Rules Mining (WARM): optimizing wells' design parameters
Implementing a Content-Based Recommendation Engine: fine-tune wells' design parameters

Let's move on, hoping this Analytics Solution could help answer some business questions in the fracking realm and be a springboard for future deeper analyses.

Data preparation tool, Principal Components Analysis (PCA), and visual exploration

In previous posts, I've strongly emphasized the pivotal role of data preparation and the TRIFACTA wrangling software. It is available in two versions, on-premise and in the cloud; the latter is part of the Google Cloud Platform ecosystem. To tackle the data challenges of the current real-world use case, TRIFACTA Cloud Dataprep was unleashed; as illustrated in the figure below, it makes it possible to take full advantage of the tool's scalability, connectivity, and data-pipeline capabilities.

The data is a publicly available dataset from CAP-IV Secretaria de Energia. It comprises each well's name, company, area, and reservoir name, along with the well's horizontal-section length, number of fractures, type of completion, pressure, etc., for more than 1,850 wells in unconventional reservoirs in the Neuquen and Rio Negro provinces of Argentina. Additional data on each well's producing life in days (vu) and accumulated liquids production (oil, gas, water, etc.) was also available and was uploaded, together with the fracking data, to a bucket previously created in Google Cloud Storage (GCS).

The figure above illustrates part of the data preparation recipe in TRIFACTA, carried out to generate the input required to perform a Principal Component Analysis (PCA) and identify the most relevant and informative variables. Some of the PCA results can be visually explored in IMAGE#1 and IMAGE#2. The first image shows both quantitative and categorical variables in a general Variables-FAMD plot, and the second shows a Quantitative (Numeric) Variables-FAMD plot; it's pretty clear from the second image which numeric variables, depicted as the longest and thickest red arrows, have the GREATEST VARIABILITY in the data. In summary, the 15 most informative variables (out of 20) were selected for further analysis.
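As a rough illustration of how a principal-component view ranks variables, the sketch below runs a plain PCA on numeric columns only, in pure Python. It is not the FAMD workflow behind the plots above (which also handles categorical variables and was run in R); the column names, the toy values, and the choice of ranking by absolute loading on the first component are all assumptions made for this example.

```python
import math

def standardize(column):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def correlation_matrix(columns):
    """Correlation matrix of equally long numeric columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def first_component(matrix, iterations=200):
    """Leading eigenvector of a symmetric matrix, found by power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Toy, made-up values for three hypothetical numeric variables.
toy = {
    "lateral_length_m": [400, 800, 1200, 1600, 2000, 2400],
    "num_fractures": [8, 15, 25, 31, 38, 48],
    "unrelated_var": [3, 1, 2, 2, 1, 3],
}

loadings = first_component(correlation_matrix(list(toy.values())))

# Rank variables by the size of their loading on the first component.
ranking = sorted(zip(toy, loadings), key=lambda pair: abs(pair[1]), reverse=True)
for name, loading in ranking:
    print(f"{name}: {loading:+.3f}")
```

In this toy data the two strongly correlated columns dominate the first component, while the unrelated one gets a loading close to zero and ends up last in the ranking.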

The freshly refined data, also saved in GCS, was connected directly to Looker Studio, where fully interactive and easy-to-digest visualizations, as shown in the figure above, were built, allowing a more in-depth visual exploration of the most relevant variables identified by the PCA process.

The visual exploration of the 15 most informative variables also surfaced potentially interesting relationships between single variables and certain combinations of them. Following this promising lead, the next step, presented in the following section, was to perform a full Weighted Association Rules Mining (WARM) analysis, applying Market Basket Analysis techniques using, again, the apriori and hits algorithms. To learn more about the apriori and hits algorithms, please read the related discussion in my third post. Uncovering possible relationships between the most informative variables and, in particular, the well's PRODUCING LIFE (vu) and the ACCUMULATED GAS PRODUCTION (agp) would be of paramount interest.

Weighted Association Rules Mining (WARM): optimizing wells' design parameters

Indeed, to perform the WARM analysis and extract actionable relationships between the PCA's most relevant variables using the apriori and hits algorithms, it was imperative to implement an advanced data preparation recipe in TRIFACTA Cloud Dataprep to reshape the data and generate the required input in the right format.

First, the data from the 15 most informative variables were blended with other relevant information, particularly the already mentioned agp and vu data. Next, it was required to convert all numeric variables (integer and decimal) into categorical variables, as illustrated in the figure above. So, for example, the (decimal) variable corresponding to the well's horizontal-section length (hsl), with values distributed between 0m and 2,800m, was transformed into a categorical six-valued variable, i.e., 0m<=hsl<100m, 100m<=hsl<1000m,..., hsl>=2500m. A similar procedure was applied to the variables pressure, the volume of injected water, number of fractures, agp, vu, etc.
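The numeric-to-categorical step can be sketched in a few lines of Python. This is only an illustration: the post quotes the first two intervals and the last one for the horizontal-section length, so the intermediate bin edges below (1500 and 2000) are assumptions, and `to_interval` is a hypothetical helper rather than the TRIFACTA recipe itself.

```python
import bisect

# Lower bin edges in metres. 0, 100, 1000 and 2500 come from the intervals
# quoted in the text; 1500 and 2000 are assumed for illustration.
HSL_EDGES = [0, 100, 1000, 1500, 2000, 2500]

def to_interval(value, edges, name, unit):
    """Return the label of the half-open bin [lo, hi) that contains value."""
    if value < edges[0]:
        raise ValueError(f"{name}={value} is below the first bin edge")
    i = bisect.bisect_right(edges, value) - 1
    lo = edges[i]
    if i == len(edges) - 1:
        # Last bin is open-ended: everything at or above the final edge.
        return f"{name}>={lo}{unit}"
    return f"{lo}{unit}<={name}<{edges[i + 1]}{unit}"

for length in (50, 100, 2499.9, 2800):
    print(length, "->", to_interval(length, HSL_EDGES, "hsl", "m"))
```

The same helper could be reused for pressure, injected water volume, number of fractures, agp, and vu by passing a different list of edges, name, and unit.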

TRIFACTA Cloud Dataprep: WARM input data

Finally, it was necessary to unpivot the blended, transformed data (as shown in the figure above) to obtain a dataset that can be ingested directly by the Market Basket Analysis algorithms and used to carry out the WARM process. With the data ready, it was time to unleash the power of the apriori and hits algorithms and get the most value out of the freshly prepared data.
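To make the unpivot step concrete, here is a minimal Python sketch that turns one wide row per well into (well, item) pairs and then groups them into one "basket" per well. The well names, column names, and values are made up for the example; the actual reshaping in the post was done in TRIFACTA Cloud Dataprep.

```python
# One wide row per well, with already-discretized (categorical) values.
# All names and values here are hypothetical.
wide_rows = [
    {"well": "W-001", "lrh": "2000m<=lrh<2500m",
     "completion": "Tapon-disparo", "agp": "150KMm3<=agp<250KMm3"},
    {"well": "W-002", "lrh": "lrh>=2500m",
     "completion": "sliding sleeve", "agp": None},  # missing value
]

def unpivot(rows, id_column):
    """Wide -> long: one (id, item) pair per non-missing cell."""
    pairs = []
    for row in rows:
        for column, value in row.items():
            if column != id_column and value is not None:
                pairs.append((row[id_column], value))
    return pairs

def to_transactions(pairs):
    """Group long-format pairs into one set of items per well."""
    baskets = {}
    for well, item in pairs:
        baskets.setdefault(well, set()).add(item)
    return baskets

long_rows = unpivot(wide_rows, "well")
transactions = to_transactions(long_rows)
for well, items in transactions.items():
    print(well, sorted(items))
```

Each well then plays the role of a shopping basket, and each categorical value plays the role of a product in it.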

Before engaging with the WARM, it is interesting to visually explore the dataset obtained so far. Using the visualization tools of the arules and igraph R packages, a graph, shown in IMAGE#3, was built. By tuning the graph-plot function parameters (size, colors, etc.), it was possible to visually identify a few interesting features, as well as uncover some relationships, for example, between 150KMm3<=agp<250KMm3 (values of the accumulated gas production between 150K Mm3 and 250K Mm3) and 25Km3<=ainy<50Km3 (volume of injected water between 25K m3 and 50K m3), lrh>=2500m (well's horizontal-section length, called hsl above, greater than or equal to 2,500 m), etc.

With this as a guide, the WARM analysis starts by evaluating the weights corresponding to each variable association using the hits algorithm. The calculated weights were plugged into the apriori algorithm, and the association rules (variable associations) were induced. As mentioned in previous posts, the rules obtained in the WARM process contain antecedent (lhs) and consequence (rhs) sets of variables or factors. So, by properly tuning apriori's parameters, it is possible to induce rules, for example:

with a target or specific element in the CONSEQUENCE rhs-set
or with a selected group of particular elements or factors in the PRECEDENCE lhs-set
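The weighting step mentioned above can be illustrated with a small HITS-style iteration in Python, where wells (transactions) act as hubs and categorical values (items) act as authorities. This is a sketch of the general idea only: the toy baskets are invented, and how the post's R workflow normalizes the weights or feeds them into apriori is not shown in the text, so none of that is reproduced here.

```python
import math

def hits_transaction_weights(transactions, iterations=50):
    """Hub score per transaction on the bipartite transaction-item graph."""
    hubs = [1.0] * len(transactions)
    items = sorted(set().union(*transactions))
    for _ in range(iterations):
        # An item's authority is the sum of hub scores of the baskets holding it.
        auth = {i: sum(h for h, t in zip(hubs, transactions) if i in t)
                for i in items}
        # A basket's hub score is the sum of authorities of its items.
        hubs = [sum(auth[i] for i in t) for t in transactions]
        norm = math.sqrt(sum(h * h for h in hubs))
        hubs = [h / norm for h in hubs]
    return hubs

def weighted_support(itemset, transactions, weights):
    """Share of total weight carried by the transactions containing itemset."""
    covered = sum(w for w, t in zip(weights, transactions) if itemset <= t)
    return covered / sum(weights)

# Hypothetical baskets of categorical well attributes.
baskets = [
    {"lrh>=2500m", "Tapon-disparo", "150KMm3<=agp<250KMm3"},
    {"lrh>=2500m", "Tapon-disparo"},
    {"lrh>=2500m"},
    {"sliding sleeve"},
]

weights = hits_transaction_weights(baskets)
print([round(w, 3) for w in weights])
print(round(weighted_support({"lrh>=2500m"}, baskets, weights), 3))
```

Baskets that share many widely used items receive larger weights, so an itemset's weighted support can differ noticeably from its plain frequency (here 3 out of 4 baskets for `lrh>=2500m`).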

When the rhs (consequence) was set, for example, to {150KMm3<=agp<250KMm3} (target value of the accumulated gas production agp between 150K Mm3 and 250K Mm3, values that are very attractive from an economic-return point of view), the TOP-30 induced rules obtained, sorted in decreasing order of the metric Lift, are shown in the figure below. The green rectangle highlights the five combinations with the highest Lift. It's important to notice that the elements appearing in the lhs sets are by no means random combinations. Remember, the further the Lift value is above 1.00, the smaller the likelihood that the induced rule occurs by chance. So, these induced rules or associations are the strongest, and the recommended values (appearing in the lhs) of pressure, the number of fractures, the volume of water injected, completion type, etc., could be used directly by the domain expert or engineer to improve the design of new unconventional wells.
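For readers who want to see how Lift is obtained, the sketch below computes support, confidence, and lift for rules with a fixed rhs, using the usual definitions: confidence = support(lhs and rhs) / support(lhs), and lift = confidence / support(rhs). It is a brute-force, unweighted illustration on five invented baskets, not the apriori run from the post; the label "agp<150KMm3" is a hypothetical bin added for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_metrics(lhs, rhs, transactions):
    """Support, confidence and lift of the rule lhs => rhs."""
    s_both = support(lhs | rhs, transactions)
    confidence = s_both / support(lhs, transactions)
    lift = confidence / support(rhs, transactions)
    return {"lhs": lhs, "support": s_both, "confidence": confidence, "lift": lift}

TARGET = frozenset({"150KMm3<=agp<250KMm3"})

# Hypothetical baskets, one per well.
wells = [
    frozenset({"lrh>=2500m", "Tapon-disparo", "150KMm3<=agp<250KMm3"}),
    frozenset({"lrh>=2500m", "Tapon-disparo", "150KMm3<=agp<250KMm3"}),
    frozenset({"lrh>=2500m", "sliding sleeve", "agp<150KMm3"}),
    frozenset({"2000m<=lrh<2500m", "Tapon-disparo", "agp<150KMm3"}),
    frozenset({"2000m<=lrh<2500m", "sliding sleeve", "agp<150KMm3"}),
]

# Enumerate every lhs of size 1 or 2 that co-occurs with the target at least once.
items = sorted(set().union(*wells) - TARGET)
rules = []
for size in (1, 2):
    for combo in combinations(items, size):
        lhs = frozenset(combo)
        if support(lhs | TARGET, wells) > 0:
            rules.append(rule_metrics(lhs, TARGET, wells))

rules.sort(key=lambda r: r["lift"], reverse=True)
for r in rules:
    print(sorted(r["lhs"]), f"lift={r['lift']:.2f}")
```

In this toy data, the combination of a long lateral and plug-and-perf completion has the highest lift for the target agp range, which mirrors how the highlighted rules in the figure are read.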

Next, the rhs (consequence) was set, for example, to {600d<=vu<900d} (target value of the well's producing life vu between 600 and 900 days, again very attractive from an economic point of view). The figure below depicts the TOP-30 induced rules (sorted in decreasing order of Lift). As before, the green rectangle highlights the four combinations with the highest Lift. These rules are the strongest, and the recommended values (appearing in the lhs) of the number of fractures, horizontal-section length, pressure, etc., could be used directly by the engineer to improve the design of new unconventional wells.

Finally, if a group of specific elements is included in the PRECEDENCE lhs-set, for example, {2000m<=lrh<2500m, Tapon-disparo, 50Km3<=ainy<75Km3, 10.5Kpsi<=pre<11Kpsi, 16Khp<=w<18Khp, 30<=NUMFRA<40} (value of horizontal-section length between 2,000 m and 2,500 m, completion type plug-and-perf, the volume of water injected between 50K m3 and 75K m3, the pressure between 10.5K psi and 11K psi, power between 16K hp and 18K hp, and the number of fractures between 30 and 40), the TOP-30 induced rules, sorted in descending order of the metric Lift, are shown in the figure below.

Interestingly, as seen in particular in the rules highlighted by the green rectangles, the results in this scenario are also consistent with CONSEQUENCE values of agp and vu that are attractive from an economic-return point of view. As before, the actionable insight unearthed by the WARM analysis could be used immediately by the domain expert/engineer to recommend actions aimed at improving the design of new (unconventional) wells.

Implementing a Content-Based Recommendation Engine: fine-tune wells' design parameters

Also, it would be interesting to explore additional ways to prescribe actions and optimize the proposed designs of new fracking wells, using the data and information already available for well-known unconventional wells (even possibly located in a different geographic location).

In this scenario, consider a group of new locations or proposed unconventional wells (let's call them the "REFERENCE WELLs") whose design parameters have been estimated or generated synthetically, for example. Could they be directly compared with well-known wells (let's call them the "SIMILAR WELLs"), allowing the domain expert/engineer to fine-tune the new well design parameters in advance? This is where implementing a Content-Based Recommendation Engine (CBRE) comes in handy.

First, starting from the original refined data of the 15 most informative variables and the REFERENCE WELLs synthetic data (both uploaded and living in GCS), one-hot encoding can be used in TRIFACTA to transform categorical variables into numeric 0/1 codes and generate the input required to evaluate a Similarity Matrix (using, for example, the Pearson Correlation Coefficient). The Similarity Matrix is a square matrix where each row (column) corresponds to a well and each cell contains a value of the Similarity Index; the diagonal is filled with values equal to 1 (each well is identical to itself), and the off-diagonal elements are values between -1 (very different) and 1 (very similar). The Similarity Matrix evaluation is performed using the R language framework, and the result is saved into GCS.
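A minimal Python sketch of this step, assuming three hypothetical wells described by three categorical columns: the values are one-hot encoded and the Pearson correlation between each pair of encoded rows fills the Similarity Matrix. The well names, column names, and helper functions are illustrative only; the post's own evaluation was done in R.

```python
import math

def one_hot(rows, columns):
    """Encode categorical columns as 0/1 vectors sharing one column layout."""
    categories = sorted({(c, row[c]) for row in rows for c in columns})
    return [[1.0 if row[c] == v else 0.0 for c, v in categories] for row in rows]

def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical wells: one proposed (REF-1) and two already drilled.
names = ["REF-1", "W-A", "W-B"]
wells = [
    {"completion": "Tapon-disparo", "lrh": "2000m<=lrh<2500m", "pre": "10.5Kpsi<=pre<11Kpsi"},
    {"completion": "Tapon-disparo", "lrh": "2000m<=lrh<2500m", "pre": "10.5Kpsi<=pre<11Kpsi"},
    {"completion": "sliding sleeve", "lrh": "lrh>=2500m", "pre": "10.5Kpsi<=pre<11Kpsi"},
]

encoded = one_hot(wells, ["completion", "lrh", "pre"])
similarity = [[pearson(a, b) for b in encoded] for a in encoded]

for name, row in zip(names, similarity):
    print(name, [round(v, 3) for v in row])
```

Wells with identical encoded parameters get a Similarity Index of 1, and wells that share few categories drift toward negative values, which is what the recommendation engine later sorts on.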

Second, back in TRIFACTA Cloud Dataprep, the Similarity Matrix is unpivoted and blended with additional relevant information. The resulting refined dataset, containing REFERENCE and SIMILAR WELL data, will be the core of the recommender system. The latest results are also saved in GCS, ready to be connected to any BI tool.

The final step is the implementation of the Content-Based Recommendation Engine (CBRE), which makes it easier for end users (engineers or domain experts) to use the latest blended, transformed dataset and extract knowledge from it. For this example, the engine has been served in Google Looker Studio as fully interactive and easy-to-digest visualizations, for the best user experience. The figure below depicts the results.

Content-Based Recommendation Engine for fracking well design

Referring to the figure above: after picking a REFERENCE WELL and typing or selecting its name in the REFERENCE dropdown filter, the engine shows, in the table at the bottom, a list of well-known SIMILAR WELLs and their related data and parameters, sorted in descending order of Similarity Index; their names are also displayed, at the top right, in a compelling visualization. The upper and lower bounds of the Similarity Index can be adjusted using the SLIDER. If required, the user can apply additional filters and export the filtered table in a convenient format (EXCEL or CSV) or save it as a Google Sheet.
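The lookup that the dashboard performs can be approximated in a few lines of Python over the unpivoted similarity table. The rows, well names, and the `recommend` helper below are hypothetical; they only mimic the REFERENCE filter, the Similarity Index slider, and the descending sort described above.

```python
# Unpivoted similarity table: (reference well, similar well, similarity index).
# All values are made up for illustration.
similarity_rows = [
    ("REF-1", "W-A", 0.92),
    ("REF-1", "W-B", -0.67),
    ("REF-1", "W-C", 0.55),
    ("REF-2", "W-A", 0.10),
]

def recommend(rows, reference, lower=-1.0, upper=1.0):
    """Similar wells for one reference well, within [lower, upper], best first."""
    matches = [
        (similar, index)
        for ref, similar, index in rows
        if ref == reference and lower <= index <= upper
    ]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# Equivalent to selecting REF-1 and moving the slider to the 0.5 to 1.0 range.
top = recommend(similarity_rows, "REF-1", lower=0.5, upper=1.0)
for well, index in top:
    print(well, index)
```

In practice each returned row would also carry the SIMILAR WELL's design parameters, which is the information the engineer actually reuses.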

The recommendations delivered by the CBRE can be used immediately by the engineer to customize the estimated parameters of the new well, or to prescribe and directly implement in the new well's design values taken from SIMILAR WELL parameters that have already been tested and deployed successfully, among other practical applications.

Summary

In this post, a scalable cloud-based End-to-End Data Analytics solution was presented. It comprised:

construction of a data repository in Google Cloud Storage (GCS)
implementation of complex and advanced data preparation recipes in TRIFACTA Cloud Dataprep to transform and reshape a publicly available hydraulic fracking dataset
PCA, Graph Analysis, and Weighted Association Rules Mining (WARM) analysis, carried out by applying algorithms and methods available in the R language framework
and implementation of a Content-Based Recommendation Engine served in Google Looker Studio

The results of the analyses presented in this post can be used directly by engineers or domain experts, for example, to prescribe and implement in new unconventional wells parameter values taken from SIMILAR WELLs that have been extensively tested and successfully deployed, among other practical applications.

I hope the Analytics Solution presented here helps answer some business questions in the fracking realm and that it will soon be a springboard to broader and deeper analyses. For further information and details, please contact me.

In future posts, I’ll continue unlocking more real-life use cases. Please stay tuned so you don’t miss them. And kindly leave your comments below and share.
