Patent Phrase Matching with DeBERTa

nlp
transformers
deberta
patent-matching
Complete solution for US Patent Phrase to Phrase Matching competition using DeBERTa-v3-small model.
Author

Mohammed Adil Siraju

Published

September 26, 2025

This notebook implements a solution for the US Patent Phrase to Phrase Matching competition using DeBERTa-v3-small model. The goal is to predict similarity scores between patent phrase pairs.

Environment Detection

Check if we’re running in a Kaggle environment to handle different execution contexts.

import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

Installing Kaggle API

Install the Kaggle API to download competition datasets programmatically.

# %pip install kaggle
import kaggle

Setup Kaggle Environment

Setting up Kaggle credentials and API access for downloading competition data. We create a credentials file and set appropriate permissions.

Kaggle Credentials Setup

Set up Kaggle API credentials for authentication. Replace with your own credentials from kaggle.json.

# for working with paths in Python, I recommend using `pathlib.Path`
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

Downloading Competition Data

Download and extract the US Patent Phrase Matching competition dataset from Kaggle.

# Download the competition data using Python API
path = Path('data')
if not iskaggle and not path.exists():
    import zipfile
    kaggle.api.competition_download_cli('us-patent-phrase-to-phrase-matching')
    zipfile.ZipFile('us-patent-phrase-to-phrase-matching.zip').extractall(path)

Import and EDA

# %pip install -q datasets
%ls {path}
sample_submission.csv  test.csv  train.csv

Data Loading and Initial EDA

Loading the training data and performing initial exploratory data analysis to understand the structure and content of our dataset.

import pandas as pd
df = pd.read_csv(path/'train.csv')
df
id anchor target context score
0 37d61fd2272659b1 abatement abatement of pollution A47 0.50
1 7b9652b17b68b7a4 abatement act of abating A47 0.75
2 36d72442aefd8232 abatement active catalyst A47 0.25
3 5296b0c19e1ce60e abatement eliminating process A47 0.50
4 54c1e3b9184cb5b6 abatement forest region A47 0.00
... ... ... ... ... ...
36468 8e1386cbefd7f245 wood article wooden article B44 1.00
36469 42d9e032d1cd3242 wood article wooden box B44 0.50
36470 208654ccb9e14fa3 wood article wooden handle B44 0.50
36471 756ec035e694722b wood article wooden material B44 0.75
36472 8d135da0b55b8c88 wood article wooden substrate B44 0.50

36473 rows × 5 columns

df.describe(include='object')
id anchor target context
count 36473 36473 36473 36473
unique 36473 733 29340 106
top 37d61fd2272659b1 component composite coating composition H01
freq 1 152 24 2186

Input Formatting

Creating a structured input format by combining context, target, and anchor texts. This format helps the model understand the relationships between different phrases.

df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.input
0        TEXT1: A47; TEXT2: abatement of pollution; ANC...
1        TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2        TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3        TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4        TEXT1: A47; TEXT2: forest region; ANC1: abatement
                               ...                        
36468    TEXT1: B44; TEXT2: wooden article; ANC1: wood ...
36469    TEXT1: B44; TEXT2: wooden box; ANC1: wood article
36470    TEXT1: B44; TEXT2: wooden handle; ANC1: wood a...
36471    TEXT1: B44; TEXT2: wooden material; ANC1: wood...
36472    TEXT1: B44; TEXT2: wooden substrate; ANC1: woo...
Name: input, Length: 36473, dtype: object
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)
ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

Model Setup and Tokenization

Loading the DeBERTa-v3-small tokenizer and testing it with sample texts to ensure proper tokenization.

model_nm = 'microsoft/deberta-v3-small'
# %pip install transformers tiktoken
# %pip install -U transformers tokenizers SentencePiece
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)
/home/adil/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:564: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
tokz.tokenize("G'day folks, I'm Adil Siraju from kerala")
['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Adil',
 '▁Siraj',
 'u',
 '▁from',
 '▁kerala']
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")
['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']
def tok_func(x):
    return tokz(x['input'])
tok_ds = ds.map(tok_func, batched=True)
row = tok_ds[0]
row['input'], row['input_ids']
('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])
tok_ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
tokz.vocab['▁of']
265
tok_ds = tok_ds.rename_columns({'score':'labels'})
tok_ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
dds = tok_ds.train_test_split(0.25, seed=42)
dds
DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})
show_corr(subset, 'MedHouseVal', 'AveRooms')
array([[1.  , 0.68],
       [0.68, 1.  ]])
show_corr(subset, 'HouseAge', 'AveRooms')
0.6760250732906005
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}

train

from transformers import Trainer, TrainingArguments

Training Configuration

Setting up training parameters including batch size, learning rate, and other hyperparameters for fine-tuning the DeBERTa model.

bs = 128
epochs = 4

lr = 8e-5

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

# %pip install transformers[torch]
# %pip install 'accelerate>=0.26.0'
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/tmp/ipykernel_14095/3597993663.py:2: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
/tmp/ipykernel_14095/3597993663.py:2: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
trainer.train()
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.
[ 11/3424 00:13 < 1:28:24, 0.64 it/s, Epoch 0.02/8]
Epoch Training Loss Validation Loss

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.
[ 11/3424 00:13 < 1:28:24, 0.64 it/s, Epoch 0.02/8]
Epoch Training Loss Validation Loss

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[57], line 1
----> 1 trainer.train()

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/trainer.py:2328, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2326         hf_hub_utils.enable_progress_bars()
   2327 else:
-> 2328     return inner_training_loop(
   2329         args=args,
   2330         resume_from_checkpoint=resume_from_checkpoint,
   2331         trial=trial,
   2332         ignore_keys_for_eval=ignore_keys_for_eval,
   2333     )

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/trainer.py:2672, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2665 context = (
   2666     functools.partial(self.accelerator.no_sync, model=model)
   2667     if i != len(batch_samples) - 1
   2668     and self.accelerator.distributed_type != DistributedType.DEEPSPEED
   2669     else contextlib.nullcontext
   2670 )
   2671 with context():
-> 2672     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
   2674 if (
   2675     args.logging_nan_inf_filter
   2676     and not is_torch_xla_available()
   2677     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2678 ):
   2679     # if loss is nan or inf simply add the average of previous logged losses
   2680     tr_loss = tr_loss + tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/trainer.py:4009, in Trainer.training_step(self, model, inputs, num_items_in_batch)
   4006     return loss_mb.reduce_mean().detach().to(self.args.device)
   4008 with self.compute_loss_context_manager():
-> 4009     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
   4011 del inputs
   4012 if (
   4013     self.args.torch_empty_cache_steps is not None
   4014     and self.state.global_step % self.args.torch_empty_cache_steps == 0
   4015 ):

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/trainer.py:4099, in Trainer.compute_loss(self, model, inputs, return_outputs, num_items_in_batch)
   4097         kwargs["num_items_in_batch"] = num_items_in_batch
   4098     inputs = {**inputs, **kwargs}
-> 4099 outputs = model(**inputs)
   4100 # Save past state if it exists
   4101 # TODO: this needs to be fixed and made cleaner later.
   4102 if self.args.past_index >= 0:

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)
   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1772 else:
-> 1773     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)
   1779 # If we don't have any hooks, we want to skip the rest of the logic in
   1780 # this function, and just call forward.
   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1782         or _global_backward_pre_hooks or _global_backward_hooks
   1783         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1784     return forward_call(*args, **kwargs)
   1786 result = None
   1787 called_always_called_hooks = set()

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/accelerate/utils/operations.py:818, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
    817 def forward(*args, **kwargs):
--> 818     return model_forward(*args, **kwargs)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/accelerate/utils/operations.py:806, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
    805 def __call__(self, *args, **kwargs):
--> 806     return convert_to_fp32(self.model_forward(*args, **kwargs))

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/torch/amp/autocast_mode.py:44, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
     41 @functools.wraps(func)
     42 def decorate_autocast(*args, **kwargs):
     43     with autocast_instance:
---> 44         return func(*args, **kwargs)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:1079, in DebertaV2ForSequenceClassification.forward(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
   1071 r"""
   1072 labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
   1073     Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
   1074     config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
   1075     `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
   1076 """
   1077 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-> 1079 outputs = self.deberta(
   1080     input_ids,
   1081     token_type_ids=token_type_ids,
   1082     attention_mask=attention_mask,
   1083     position_ids=position_ids,
   1084     inputs_embeds=inputs_embeds,
   1085     output_attentions=output_attentions,
   1086     output_hidden_states=output_hidden_states,
   1087     return_dict=return_dict,
   1088 )
   1090 encoder_layer = outputs[0]
   1091 pooled_output = self.pooler(encoder_layer)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)
   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1772 else:
-> 1773     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)
   1779 # If we don't have any hooks, we want to skip the rest of the logic in
   1780 # this function, and just call forward.
   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1782         or _global_backward_pre_hooks or _global_backward_hooks
   1783         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1784     return forward_call(*args, **kwargs)
   1786 result = None
   1787 called_always_called_hooks = set()

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:818, in DebertaV2Model.forward(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, output_attentions, output_hidden_states, return_dict)
    815 if not return_dict:
    816     return (sequence_output,) + encoder_outputs[(1 if output_hidden_states else 2) :]
--> 818 return BaseModelOutput(
    819     last_hidden_state=sequence_output,
    820     hidden_states=encoder_outputs.hidden_states if output_hidden_states else None,
    821     attentions=encoder_outputs.attentions,
    822 )

File <string>:6, in __init__(self, last_hidden_state, hidden_states, attentions)

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/utils/generic.py:392, in ModelOutput.__post_init__(self)
    389 first_field = getattr(self, class_fields[0].name)
    390 other_fields_are_none = all(getattr(self, field.name) is None for field in class_fields[1:])
--> 392 if other_fields_are_none and not is_tensor(first_field):
    393     if isinstance(first_field, dict):
    394         iterator = first_field.items()

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/utils/generic.py:139, in is_tensor(x)
    134 """
    135 Tests if `x` is a `torch.Tensor`, `tf.Tensor`, `jaxlib.xla_extension.DeviceArray`, `np.ndarray` or `mlx.array`
    136 in the order defined by `infer_framework_from_repr`
    137 """
    138 # This gives us a smart order to test the frameworks with the corresponding tests.
--> 139 framework_to_test_func = _get_frameworks_and_test_func(x)
    140 for test_func in framework_to_test_func.values():
    141     if test_func(x):

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/utils/generic.py:124, in _get_frameworks_and_test_func(x)
    113 """
    114 Returns an (ordered since we are in Python 3.7+) dictionary framework to test function, which places the framework
    115 we can guess from the repr first, then Numpy, then the others.
    116 """
    117 framework_to_test = {
    118     "pt": is_torch_tensor,
    119     "tf": is_tf_tensor,
   (...)
    122     "mlx": is_mlx_array,
    123 }
--> 124 preferred_framework = infer_framework_from_repr(x)
    125 # We will test this one first, then numpy, then the others.
    126 frameworks = [] if preferred_framework is None else [preferred_framework]

File ~/miniconda3/envs/fastai/lib/python3.10/site-packages/transformers/utils/generic.py:99, in infer_framework_from_repr(x)
     94 def infer_framework_from_repr(x):
     95     """
     96     Tries to guess the framework of an object `x` from its repr (brittle but will help in `is_tensor` to try the
     97     frameworks in a smart order, without the need to import the frameworks).
     98     """
---> 99     representation = str(type(x))
    100     if representation.startswith("<class 'torch."):
    101         return "pt"

KeyboardInterrupt: 

Prediction and Submission

Loading test data, generating predictions, and creating a submission file in the required format.

eval_df = pd.read_csv(path/'test.csv')
eval_df
id anchor target context
0 4112d61851461f60 opc drum inorganic photoconductor drum G02
1 09e418c93a776564 adjust gas flow altering gas flow F23
2 36baf228038e314b lower trunnion lower locating B60
3 1f37ead645e7f0c8 cap component upper portion D06
4 71a5b6ad068d531f neural stimulation artificial neural network H04
5 474c874d0c07bd21 dry corn dry corn starch C12
6 442c114ed5c4e3c9 tunneling capacitor capacitor housing G11
7 b8ae62ea5e1d8bdb angular contact bearing contact therapy radiation B23
8 faaddaf8fcba8a3f produce liquid hydrocarbons produce a treated stream C10
9 ae0262c02566d2ce diesel fuel tank diesel fuel tanks F02
10 a8808e31641e856d chemical activity dielectric characteristics B01
11 16ae4b99d3601e60 transmit to platform direct receiving H04
12 25c555ca3d5a2092 oil tankers oil carriers B63
13 5203a36c501f1b7c generate in layer generate by layer G02
14 b9fdc772bb8fd61c slip segment slip portion B22
15 7aa5908a77a7ec24 el display illumination G02
16 d19ef3979396d47e overflow device oil filler E04
17 fd83613b7843f5e1 beam traveling direction concrete beam H05
18 2a619016908bfa45 el display electroluminescent C23
19 733979d75f59770d equipment unit power detection H02
20 6546846df17f9800 halocarbyl halogen addition reaction C07
21 3ff0e7a35015be69 perfluoroalkyl group hydroxy A63
22 12ca31f018a2e2b9 speed control means control loop G05
23 03ba802ed4029e4d arm design steel plate F16
24 c404f8b378cbb008 hybrid bearing bearing system F04
25 78243984c02a72e4 end pins end days A44
26 de51114bc0faec3e organic starting organic farming B61
27 7e3aff857f056bf9 make of slabs making cake E04
28 26c3c6dc6174b589 seal teeth teeth whitening F01
29 b892011ab2e2cabc carry by platform carry on platform B60
30 8247ff562ca185cc polls pooling device B21
31 c057aecbba832387 upper clamp arm end visual A61
32 9f2279ce667b21dc clocked storage clocked storage device G01
33 b9ea2b06a878df6f coupling factor turns impedance G01
34 79795133c30ef097 different conductivity carrier polarity H03
35 25522ee5411e63e9 hybrid bearing corrosion resistant F16
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df)
eval_ds = eval_ds.map(tok_func, batched=True)
# Now predict using the properly formatted dataset
preds = trainer.predict(eval_ds).predictions.astype(float)
preds = np.clip(preds, 0, 1)
preds
array([[0.48],
       [0.82],
       [0.34],
       [0.35],
       [0.  ],
       [0.43],
       [0.36],
       [0.05],
       [0.09],
       [1.  ],
       [0.17],
       [0.28],
       [0.67],
       [0.7 ],
       [0.79],
       [0.34],
       [0.22],
       [0.03],
       [0.48],
       [0.25],
       [0.34],
       [0.21],
       [0.08],
       [0.16],
       [0.52],
       [0.  ],
       [0.  ],
       [0.03],
       [0.  ],
       [0.68],
       [0.28],
       [0.04],
       [0.71],
       [0.35],
       [0.35],
       [0.15]])
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_df['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)
862