On the shared subspace between different seismic data sources

The lack of labeling consistency in seismic fault annotations, together with the high cost of acquiring quality labels, poses a significant obstacle when developing machine learning models for fault segmentation. A common way to address this problem is to resort to alternatives such as training models on large amounts of synthetic data and then finetuning them on a smaller set of real target data. However, this poses a domain adaptation challenge, since the distributional and geophysical properties of the two data sources are generally different. More broadly, harnessing knowledge from alternative label sources in real seismic settings remains an open challenge.

In this abstract, we analyze how fault segmentation models transfer knowledge across different data sources under various training and finetuning conditions. To this end, we leverage a natural seismic survey called the Thebe Gas Field, a synthetic seismic dataset used to train the well-known method FaultSeg3D, and a novel crowdsourced seismic dataset called CRACKS, built on top of the Netherlands F3 block. CRACKS is further subdivided into three label categories: domain-expert, practitioner, and novice-level labels. We use all possible pairwise combinations of these label sources to pretrain and finetune a fault segmentation model, and evaluate the model's performance on CRACKS (expert labels) and on the synthetic data.
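To make the experimental grid concrete, the sketch below enumerates the pretrain/finetune pairs described above. The helper functions (`pretrain`, `finetune`, `evaluate`) and the source identifiers are hypothetical placeholders rather than part of any released codebase; this is a minimal sketch of the protocol under those assumptions, not the actual implementation.

```python
from itertools import product

# Hypothetical identifiers for the label sources discussed in this abstract.
SOURCES = ["cracks_expert", "cracks_practitioner", "cracks_novice", "synthetic", "thebe"]
EVAL_SETS = ["cracks_expert", "synthetic"]  # evaluation targets used in Figures 1 and 2

def pretrain(source):            # placeholder: train a fault segmentation model (e.g., DeepLab) on one source
    ...

def finetune(model, source):     # placeholder: continue training the pretrained model on a second source
    ...

def evaluate(model, eval_set):   # placeholder: report a segmentation metric on the held-out evaluation set
    ...

# Sweep all pretrain/finetune pairs and score each resulting model on both evaluation sets.
results = {}
for pretrain_src, finetune_src in product(SOURCES, repeat=2):
    model = finetune(pretrain(pretrain_src), finetune_src)
    for eval_set in EVAL_SETS:
        results[(pretrain_src, finetune_src, eval_set)] = evaluate(model, eval_set)
```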

Our experiments suggest that the evaluated data sources share a common subspace, and they give us a sense of the structural geometry formed by the data in this space. Figure 1 shows the DeepLab model evaluated under all possible pretraining and finetuning settings on the CRACKS (expert) dataset. We find, for instance, that the best-performing models are those pretrained on synthetic data and then finetuned on the target CRACKS data. This suggests the existence of commonalities between the geophysical properties of the seismic faults in these two data sources, and it further solidifies the role of synthetic data as a viable source of training data for fault delineation models when real seismic data is limited. Furthermore, the distribution of finetuning datasets in Figure 1 (encoded by different colors) establishes a hierarchy of the data sources best aligned to help models recover faults from the F3 block: expertly labeled CRACKS data itself, followed by lower-quality labels (novice and practitioner), synthetic data, and Thebe, in that order. This reveals the surprising fact that, even though it is also real seismic data, Thebe is farther from CRACKS than the synthetic data is.

In Figure 2, we show an analogous experiment: we again evaluate all possible combinations of pretraining and finetuning on our different data sources, but this time we evaluate the resulting models on the synthetic dataset. The synergistic relationship between CRACKS and synthetic data is further corroborated here, where the best-performing models are those pretrained on CRACKS and then finetuned on synthetic data. This highlights the fact that crowdsourced labels hold the potential to improve the performance of models on synthetically generated data.
