Content-based recommender system: data imputation, similarity index & fuzzy ranking, an O&G use case

Angel Aponte
Apr 7, 2023

Updated: Jun 21, 2023

Continuing the discussion on RECOMMENDATION ENGINES and their applications, in what follows I consider another example of a content-based recommendation system and how FUZZY-LOGIC can improve the recommendations delivered. The dataset comes from a MATURE OIL FIELD. You may be wondering what retail and a mature oil field have in common. The answer: data, tons of it...

Before addressing the recommendation and fuzzy-logic tasks, the missing values must be dealt with. As a rule of thumb, if the proportion of missing values in the data is lower than 10%, it is usually safe to eliminate the instances (rows) containing them. If not, missing-value IMPUTATION is required. In this example, the proportion of missing values was roughly 25%, so a reliable imputation scheme had to be implemented in advance. Two options are to replace the missing values with the column Median or Mean, or to fill in the unknown values using a more sophisticated regression-based procedure. Let's call the former the MFV technique and the latter the ECV procedure. Calculations were performed with the R programming language.
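The decision described above can be sketched in a few lines. The calculations in this post were done in R; the Python below is only an illustrative sketch. The toy well records, the function names, and the choice to measure the missing proportion by rows are all assumptions, not the actual dataset or code.

```python
from statistics import median

def missing_fraction(rows):
    """Share of rows that contain at least one missing value (None)."""
    incomplete = sum(1 for row in rows if any(v is None for v in row.values()))
    return incomplete / len(rows)

def drop_incomplete(rows):
    """Keep only the rows without missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]

def impute_with_median(rows, column):
    """Replace missing values in `column` with the median of the observed ones."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]

# Hypothetical toy records, not the field data used in the post.
wells = [
    {"well": "w-1", "AWP_m3": 120.0},
    {"well": "w-2", "AWP_m3": None},
    {"well": "w-3", "AWP_m3": 80.0},
    {"well": "w-4", "AWP_m3": 100.0},
]

THRESHOLD = 0.10  # the 10% rule of thumb mentioned in the text

if missing_fraction(wells) < THRESHOLD:
    cleaned = drop_incomplete(wells)               # few gaps: drop the rows
else:
    cleaned = impute_with_median(wells, "AWP_m3")  # many gaps: impute instead
```

Swapping `median` for `statistics.mean` gives the Mean variant of the same scheme.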

The figure above shows a snapshot of the dataset; each column represents a RELEVANT FEATURE for the upcoming analysis (the far-left column is the well's identifier). AOP_m3 stands for "Accumulated Oil Production in m3", AWP_m3 is the "Accumulated Water Production in m3", and PF_INDEX is an index (or probability) that runs between 0 (oil absent) and 1 (oil present). To make matters more challenging, the missing values sit in adjacent columns (see columns #2 and #3, counting from RIGHT to LEFT). Implementing MFV is straightforward. In the case of the ECV technique, a double imputation scheme has to be applied: first, the values in column #1 are used to impute the missing values of column #2; second, the values of column #2 (including those already filled in during the first round) are used to fill in the missing values of column #3. The image to the right in the figure illustrates the R snippets deployed.
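The two-round imputation described above can be sketched as follows. This is not the R code shown in the figure: it assumes a plain least-squares line as the regression model, assumes that column #1 is PF_INDEX (inferred from the right-to-left column order), and uses hypothetical toy records.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def impute_by_regression(rows, predictor, target):
    """Fill missing `target` values with predictions from `predictor`.

    Assumes `predictor` has no missing values when this is called.
    """
    known = [(r[predictor], r[target]) for r in rows if r[target] is not None]
    a, b = fit_line([x for x, _ in known], [y for _, y in known])
    return [{**r, target: a + b * r[predictor] if r[target] is None else r[target]}
            for r in rows]

# Hypothetical toy records with gaps in two adjacent columns.
wells = [
    {"well": "w-a", "PF_INDEX": 0.2, "AWP_m3": 200.0, "AOP_m3": 500.0},
    {"well": "w-b", "PF_INDEX": 0.4, "AWP_m3": 400.0, "AOP_m3": 900.0},
    {"well": "w-c", "PF_INDEX": 0.6, "AWP_m3": None, "AOP_m3": None},
    {"well": "w-d", "PF_INDEX": 0.8, "AWP_m3": 800.0, "AOP_m3": 1700.0},
]

# Round 1: column #1 (PF_INDEX) fills the gaps in column #2 (AWP_m3).
round_1 = impute_by_regression(wells, "PF_INDEX", "AWP_m3")
# Round 2: column #2, now complete, fills the gaps in column #3 (AOP_m3).
round_2 = impute_by_regression(round_1, "AWP_m3", "AOP_m3")
```

The order matters: the second round can only run once the first has left column #2 without gaps.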

In the figure below, the results of MFV and ECV for AOP_m3 and AWP_m3 are compared with the original distribution (BLUE). The ECV technique turns out to be the better approach for replacing the missing values of column #2 (AWP_m3) and column #3 (AOP_m3). With the imputation challenge tackled, the recommendation system itself can now be addressed.

Data imputation schemes and original data

The RECOMMENDATION System will be implemented in three STEPS. Step 1: evaluation of the Similarity Matrix and the SIMILARITY INDEX. Step 2, SEARCH: construction of the Similarity Database and evaluation of NEW FEATURES (scores). Step 3, RESULT RANKING: refinement of the ranking results with FUZZY-RULES and FUZZY-RANKING. The figure below illustrates the three steps described.

The figure below shows schematically the implementation of Step 1. The SIMILARITY INDEX is used to compare wells and select the top-5 most similar wells to, for example, well #13 (w-170). So, actions carried out on w-170 could also be PRESCRIBED (RECOMMENDED) for the Target WELLS w-181, w-173,..., even if they are NOT in the same geographic area (well coordinates may or may not be included in the Similarity Matrix evaluation). Likewise, if the Target WELLS don't have enough information to perform an analysis, the gap could be filled in with data from SIMILAR wells of an analog or neighboring field, and vice versa.
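A minimal sketch of Step 1 follows. The post does not state which similarity measure was used, so this example assumes min-max scaled features and a distance-based similarity, 1 / (1 + Euclidean distance). The well names and feature values are hypothetical.

```python
from math import dist

def min_max_scale(vectors):
    """Rescale every feature (column) to the [0, 1] range."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid dividing by zero
    return [[(v - lo) / span for v, lo, span in zip(vec, lows, spans)]
            for vec in vectors]

def similarity_matrix(features):
    """Pairwise similarity: 1.0 for identical wells, lower as they differ."""
    names = list(features)
    scaled = dict(zip(names, min_max_scale([features[n] for n in names])))
    return {a: {b: 1.0 / (1.0 + dist(scaled[a], scaled[b])) for b in names}
            for a in names}

def top_similar(sim, well, k=5):
    """The k wells most similar to `well`, best first, excluding itself."""
    others = [(name, score) for name, score in sim[well].items() if name != well]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical wells described by [AOP_m3, AWP_m3, PF_INDEX].
features = {
    "w-A": [1000.0, 500.0, 0.90],
    "w-B": [1100.0, 520.0, 0.85],
    "w-C": [200.0, 900.0, 0.20],
    "w-D": [950.0, 480.0, 0.95],
}

sim = similarity_matrix(features)
ranked = top_similar(sim, "w-A", k=2)  # the two wells closest to w-A
```

Scaling comes first because the production volumes would otherwise dominate PF_INDEX, which only runs between 0 and 1.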

To refine the ranking and deliver recommendations, Fuzzy-Logic techniques are applied in Step 2 and Step 3. Fuzzy logic is based on Fuzzy Set theory. In STANDARD Set theory, the boundaries between sets are well defined (crisp). In fuzzy sets, boundaries are not sharply defined and may overlap, so an element can belong to several sets to different degrees. Another concept in fuzzy logic is fuzzy rules, which provide a mechanism for dealing with fuzzy antecedents and consequents. This approach allows capturing SUBTLE PATTERNS that strict numeric comparisons can miss. Now, consider the ranking problem while serving recommendations. The Step 1 ranking is crisp, based on the strict sorting of one or more similarity scores. In what follows, the Step 1 sorting system is replaced in Step 2 and Step 3 with a fuzzy-rule-based system.
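The difference between crisp and fuzzy sets can be illustrated with a small sketch. The triangular shapes, the three labels, and the cut-off values below are assumptions chosen for illustration; the post does not specify the membership functions actually used.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0 to 1) in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical overlapping fuzzy sets for a score between 0 and 1.
# "low" and "high" extend past the range so they peak exactly at 0 and 1.
SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}

def crisp_label(score):
    """Crisp sets: every score falls into exactly one class."""
    if score < 0.33:
        return "low"
    if score < 0.66:
        return "medium"
    return "high"

def fuzzify(score):
    """Fuzzy sets: a score belongs to each class to some degree."""
    return {label: triangular(score, *shape) for label, shape in SETS.items()}

label = crisp_label(0.6)   # a single class
memberships = fuzzify(0.6) # mostly "medium", partly "high"
```

A score of 0.6 is simply "medium" in the crisp view, while the fuzzy view also records that it is already partly "high".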

The figure below depicts the general process: additional features (scores) are evaluated, and numeric values are transformed into LINGUISTIC VARIABLES (values are now words in a natural language), which are the inputs to the FUZZY-RULES and are also used to build the MEMBERSHIP FUNCTIONS. Together these make up the FUZZIFICATION or Fuzzy Logic Ranking stage. Then, in the DEFUZZIFICATION stage, the MEMBERSHIP FUNCTIONS are used to transform the linguistic variables back into numeric values; the Refined Ranked results are then delivered.

Fuzzification, defuzzification, and delivery of refined recommendations

At the bottom of the figure above, the Step 1 Target WELLS are compared with the results of the Fuzzy Logic Ranking. Note that well w-181 remains at the top of both lists. However, well w-1002 now pops up, replacing w-173, which no longer appears. Well w-205 moves up with well w-105; w-171 is replaced, too. As before, the improved Fuzzy Logic Ranking recommendations could also be used to fill data gaps with data from SIMILAR wells of analog and neighboring fields, and vice versa.
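The fuzzification, rule evaluation, and defuzzification stages described above can be sketched as a small Mamdani-style system with centroid defuzzification. Everything specific here is an assumption made for illustration: the second input (`extra_score`), the rule table, the triangular membership functions, and the candidate wells are hypothetical and are not the rules or data behind the figure.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0 to 1) in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical linguistic values, shared by both inputs and the output.
SETS = {"low": (-0.5, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.5)}

# Hypothetical rules: IF similarity is A AND extra_score is B THEN rank is C.
RULES = {
    ("high", "high"): "high",     ("high", "medium"): "high",
    ("high", "low"): "medium",    ("medium", "high"): "medium",
    ("medium", "medium"): "medium", ("medium", "low"): "low",
    ("low", "high"): "medium",    ("low", "medium"): "low",
    ("low", "low"): "low",
}

def fuzzify(x):
    """Numeric value -> degree of membership in each linguistic value."""
    return {label: triangular(x, *shape) for label, shape in SETS.items()}

def infer(similarity, extra_score):
    """Fire every rule (AND = min) and keep the strongest activation per output."""
    sim, extra = fuzzify(similarity), fuzzify(extra_score)
    activation = {label: 0.0 for label in SETS}
    for (sim_label, extra_label), out_label in RULES.items():
        strength = min(sim[sim_label], extra[extra_label])
        activation[out_label] = max(activation[out_label], strength)
    return activation

def defuzzify(activation, steps=100):
    """Centroid of the clipped output sets, sampled on a grid over [0, 1]."""
    numerator = denominator = 0.0
    for i in range(steps + 1):
        y = i / steps
        mu = max(min(activation[l], triangular(y, *SETS[l])) for l in SETS)
        numerator += y * mu
        denominator += mu
    return numerator / denominator if denominator else 0.0

def fuzzy_score(similarity, extra_score):
    return defuzzify(infer(similarity, extra_score))

# Hypothetical candidates: (similarity to the reference well, extra score).
candidates = {"w-A": (0.90, 0.20), "w-B": (0.80, 0.90), "w-C": (0.50, 0.50)}

crisp_ranking = sorted(candidates, key=lambda w: candidates[w][0], reverse=True)
fuzzy_ranking = sorted(candidates, key=lambda w: fuzzy_score(*candidates[w]),
                       reverse=True)
```

In this toy setup the crisp ranking sorts on similarity alone, while the fuzzy ranking lets a well with slightly lower similarity but a much better second score move ahead, which is the kind of reshuffling seen in the comparison above.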

This is another example of CROSS-DOMAIN ANALYTICS: techniques and methods typically used in the Retail Business and other knowledge domains are adapted to address issues of the Oil and Gas Industry.

Please leave your comments below, and kindly share and contact me if you require additional information about this or other posts. I'll be glad to answer all your questions.
