Making Edible Oils Safe: Authentication and Quality Assurance of Edible Oils
COM1 Level 3
MR1, COM1-03-19
Abstract:
Edible oils and oil blends provide important micronutrients and macronutrients essential for the human diet. The quality of these edible oils can be compromised either by the intentional replacement of high-quality oils with low-quality oils to increase profit margins or by an unintentional mistake in the production pipeline. To protect the health of consumers and the integrity of the industry, it is crucial to authenticate edible oils and oil blends and to ensure their quality.
Even though NIR spectroscopy is often used for the analysis of edible oils, it is highly prone to unwanted sources of variation called batch effects. Batch effects cause a model built on one batch of spectra to perform sub-optimally when tested on another batch of spectra. These batch effects must be eliminated for reliable analysis. They are generally corrected by building a standardisation (i.e. batch-effect correction) transfer model, such as piecewise direct standardisation (PDS), calibration transfer based on canonical correlation analysis (CTCCA) or principal component based piecewise direct standardisation (PCPDS), between some well-chosen spectra from the different batches. However, these methods struggle to correct batch effects because batch effects are source dependent, complex and often heterogeneous with respect to the phenotype of the spectra. Moreover, shifts in spectra due to batch effects are confounded by shifts resulting from differences in phenotype, especially when the phenotype is quantitative.
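As a rough illustration of the kind of transfer model named above, the following is a minimal sketch of piecewise direct standardisation, not the implementation used in the thesis. It assumes paired transfer spectra measured in both a reference ("master") batch and a new ("slave") batch, and it fits each window by ordinary least squares with an intercept, whereas PDS is more commonly fitted with PCR or PLS. The names fit_pds, apply_pds and half_width, and the synthetic spectra, are illustrative assumptions.

```python
import numpy as np

def fit_pds(master, slave, half_width=2):
    """Fit a banded transfer matrix F and offset b so that slave @ F + b ~ master.

    master, slave: arrays of shape (n_transfer_samples, n_wavelengths),
    holding the same samples measured in the two batches (an assumption).
    """
    n_samples, n_wl = master.shape
    F = np.zeros((n_wl, n_wl))
    b = np.zeros(n_wl)
    for j in range(n_wl):
        # Window of slave wavelengths centred on wavelength j, clipped at the edges.
        lo, hi = max(0, j - half_width), min(n_wl, j + half_width + 1)
        X = np.hstack([np.ones((n_samples, 1)), slave[:, lo:hi]])
        # Plain least squares here; PCR or PLS is the more usual choice.
        coef, *_ = np.linalg.lstsq(X, master[:, j], rcond=None)
        b[j] = coef[0]
        F[lo:hi, j] = coef[1:]
    return F, b

def apply_pds(spectra, F, b):
    """Map spectra from the slave batch into the master batch domain."""
    return spectra @ F + b

# Synthetic example: the slave batch is a scaled and offset copy of the master.
rng = np.random.default_rng(0)
master_transfer = rng.normal(size=(20, 30))
slave_transfer = 1.1 * master_transfer + 0.05
F, b = fit_pds(master_transfer, slave_transfer)

master_new = rng.normal(size=(5, 30))
slave_new = 1.1 * master_new + 0.05
corrected = apply_pds(slave_new, F, b)
```

In this idealised case the batch effect is the same for every sample, so a single transfer model recovers the master spectra. The abstract's point is that real batch effects are heterogeneous across classes, which is where a single global model of this kind tends to fall short.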
Classification of spectra is a very common chemometric problem. A good classification model should not only classify spectra into predefined classes but also detect spectra that belong to none of the predefined classes. It should also classify spectra from a different batch well. Existing models often take a two-step approach of first correcting the spectra for batch effects and then classifying them. In this thesis, a class-specific correction and classification (CSCAC) model is proposed that not only corrects for batch effects but also performs multi-class classification and novel-class detection. Three variants of CSCAC, each with a different transfer model, were built: CSCAC (PDS), CSCAC (CTCCA) and CSCAC (PCPDS). Both the standardisation and the classification performance of the CSCAC models were illustrated on a dataset of 14 edible oils obtained across 5 different batches. The multi-class classification accuracy and the novel-class detection accuracy of CSCAC (PDS) are better than those of PCLDA (principal component linear discriminant analysis) combined with PDS in its stand-alone form, indicating that classification performance improves when transfer models are explicitly used in a class-specific manner within the CSCAC framework. Principal component analysis of spectra corrected by CSCAC models indicates that the spectra cluster well according to their class irrespective of their batch. The within-class sum of squares of spectra corrected by CSCAC models is lower than that of spectra corrected by other models, and the between-class sum of squares is higher. These evaluation metrics indicate that CSCAC models perform better standardisation than the existing batch correction methods.
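The two sum-of-squares metrics mentioned above can be sketched as follows. This is a generic formulation under the usual definitions (squared distances to class means, and class-size-weighted squared distances of class means to the grand mean); the thesis may normalise or compute them differently. The function name within_between_ss and the toy data are illustrative assumptions.

```python
import numpy as np

def within_between_ss(X, labels):
    """Return (within-class SS, between-class SS) for rows of X grouped by labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    wss = 0.0
    bss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        # Spread of each class around its own mean (smaller is tighter).
        wss += ((Xc - mean_c) ** 2).sum()
        # Separation of the class mean from the grand mean, weighted by class size.
        bss += len(Xc) * ((mean_c - grand_mean) ** 2).sum()
    return wss, bss

# Toy corrected "spectra" (two features) from two hypothetical classes A and B.
toy_spectra = [[0, 0], [2, 0], [10, 0], [12, 0]]
toy_classes = ["A", "A", "B", "B"]
wss, bss = within_between_ss(toy_spectra, toy_classes)
```

For well-standardised spectra one would expect spectra of the same class from different batches to sit close together (low within-class value) while different classes stay apart (high between-class value).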
The CSCAC model can also be used as a quality check model to determine whether an edible oil blend has deviated from its quality specifications. This application of CSCAC was illustrated on a peanut oil-maize oil blends dataset obtained across 26 different batches. CSCAC (PDS) performed better than CSCAC (CTCCA) and CSCAC (PCPDS) on this dataset. An overall sensitivity of 0.96 and an overall specificity of 0.996 were obtained for CSCAC (PDS). Deviation tolerance, the percentage of deviation that cannot be consistently detected by the model as failing the quality check, was contained within 1.5% for the CSCAC (PDS) model even in the worst case. The specificity and sensitivity of the CSCAC (PDS) model were also determined to be better than those of existing models.
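For reference, sensitivity and specificity for a binary quality check can be computed as in the sketch below. Which outcome counts as "positive" is an assumption here (a blend that fails the quality check is treated as positive); the abstract does not state the convention used. The function name and the example labels are illustrative and are not results from the thesis.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive.

    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # positives correctly flagged
    fn = sum(1 for t, p in pairs if t and not p)      # positives missed
    tn = sum(1 for t, p in pairs if not t and not p)  # negatives correctly passed
    fp = sum(1 for t, p in pairs if not t and p)      # negatives wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels: 1 = blend fails the quality check, 0 = blend passes.
truth     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, predicted)
```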
Even though NIR spectroscopy is faster and more efficient than traditional chromatographic techniques, obtaining large numbers of spectra in the laboratory is still very laborious and time consuming. The number of training spectra required for the analysis of edible oils and oil blends is quite high. Correction of batch effects also requires additional transfer spectra, which further increases the spectral requirement. To address these issues, a mathematical model for NIR spectra generation (MMGEN) is proposed to artificially generate spectra of oil blends given some known oil blends. Further, MMGEN (shift), a shifted version of the MMGEN algorithm, is proposed to generate spectra that account for batch effects. The close to median score (CMS), the distance between a generated spectrum and the median of the target reference pool, is used to evaluate the goodness of the generation model. The MMGEN and MMGEN (shift) models were used to generate spectra for peanut oil-maize oil blends dataset #2. For blends of brand 1 peanut oil and brand 5 maize oil, around 25% of the spectra generated by MMGEN and around 43% of the spectra generated by MMGEN (shift) have a CMS of less than 3. This indicates that MMGEN (shift) generates spectra closer to the real target spectra in this dataset. The use of MMGEN to generate spectra for 3-component oil blends in the peanut oil-roasted soyabean oil-roasted sunflower oil dataset is also illustrated. Spectra generated by MMGEN using known spectra from the same batch as the target spectra were better than those generated using known spectra from a different batch.
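A score of the kind described above could be computed along the following lines. This is only a sketch: the abstract defines CMS as a distance to the median of the target reference pool but does not say which distance or scaling is used, so the Euclidean distance to the per-wavelength median is an assumption, as are the function name and the toy numbers.

```python
import numpy as np

def close_to_median_score(generated, reference_pool):
    """Distance of each generated spectrum to the per-wavelength median of the pool.

    generated:      array of shape (n_generated, n_wavelengths)
    reference_pool: array of shape (n_reference, n_wavelengths)
    Euclidean distance is an assumption; the thesis may use another metric.
    """
    generated = np.asarray(generated, dtype=float)
    reference_pool = np.asarray(reference_pool, dtype=float)
    median_spectrum = np.median(reference_pool, axis=0)
    return np.linalg.norm(generated - median_spectrum, axis=1)

# Toy example with two "wavelengths".
pool = [[0, 0], [2, 4], [4, 8]]       # per-wavelength median is [2, 4]
candidates = [[2, 4], [5, 8]]
cms = close_to_median_score(candidates, pool)

# Share of generated spectra under a chosen threshold (3, as in the text).
fraction_below_3 = float(np.mean(cms < 3))
```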
Inferring both the qualitative and the quantitative constituents of blended oils simultaneously is termed multi-output analysis, and it is not straightforward. Existing methods often simply build an independent model for each output, but this is disadvantageous because the relationship between the outputs is ignored. A genetic algorithm based qualitative and quantitative analyser, QQAnalyser, is proposed to determine the brands and the percentages of the component oils in the peanut oil-maize oil blends dataset. The MMGEN and MMGEN (shift) algorithms are used in QQAnalyser to evaluate the fitness of the individuals in the population of the genetic algorithm. The working of QQAnalyser is illustrated and its performance is compared with some state-of-the-art techniques. QQAnalyser benefited from a more strategic initialisation design in which some important individuals were explicitly included when initialising the population. The performance of QQAnalyser improved with increasing population size and when the Canberra distance metric was used to evaluate the individuals. It also improved when MMGEN (shift) was used instead of MMGEN to generate the spectra for evaluation.
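The Canberra distance mentioned above is the sum over components of |x_i - y_i| / (|x_i| + |y_i|), with terms where both values are zero contributing nothing. A minimal sketch is given below; how QQAnalyser applies it (for example, between a candidate's generated spectrum and the observed spectrum) is an assumption, and the vectors are made-up values.

```python
def canberra(x, y):
    """Canberra distance between two equal-length sequences of numbers."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        # A 0/0 term is conventionally treated as contributing zero.
        if denom > 0:
            total += abs(a - b) / denom
    return total

# Hypothetical generated and observed values at three wavelengths.
generated_spectrum = [1.0, 2.0, 3.0]
observed_spectrum = [1.0, 4.0, 0.0]
distance = canberra(generated_spectrum, observed_spectrum)
```

Because each term is normalised by the magnitude of the two values, small absolute differences at low-intensity wavelengths carry as much weight as larger differences at high-intensity ones, which may be one reason it suited the evaluation of individuals here.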