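Fuzzy matching of entity names using phonetic codes

How do I fuzzy match entity names using phonetic codes in PySpark?

This code uses PySpark to clean entity names, generate a Soundex phonetic code for each word of a name, join two datasets on those codes to produce candidate pairs, and score each pair with the Jaro similarity metric. It is useful for matching similar entity names across two datasets.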

from pyspark.sql import functions as F
from pyspark.sql import types as T
from transforms.api import transform_df, Input, Output
import re
import jellyfish


def _add_phonetic_codes(df):
    # Generate a phonetic (Soundex) code for each part of the name
    df = df.withColumn(
        "name_part", F.split("cleaned_name", " ")
    ).withColumn(
        "name_part", F.explode("name_part")
    ).withColumn(
        "phonetic_code", F.soundex("name_part")
    ).drop("name_part")
    return df


@transform_df(
    Output(),
    entities2=Input(),
    entities1=Input(),
)
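def compute(entities2, entities1):

    # Set up a UDF that cleans the text
    def clean_text(text):
        # Pass nulls through so the UDF does not fail on missing names
        if text is None:
            return None
        # Drop ".", "/" and "-", collapse runs of spaces, then lowercase
        cleaned_text = re.sub(r" +", " ", re.sub(r"[./-]+", "", text)).lower()
        return cleaned_text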

    clean_text_udf = F.udf(clean_text, T.StringType())

    # Clean up the entity names
    entities2 = entities2.withColumn("cleaned_name", clean_text_udf(F.col("name")))
    entities1 = entities1.withColumn("cleaned_name", clean_text_udf(F.col("entity_name")))

    # Add phonetic codes
    entities2 = _add_phonetic_codes(entities2)
    entities1 = _add_phonetic_codes(entities1)

    # Fuzzy join: pair up entities that share at least one phonetic code
    matched_entities = entities1.join(
        entities2, on=["phonetic_code"], how="inner"
    ).select(
        entities1.cleaned_name.alias("cleaned_name1"), entities1.id.alias("entity_id1"),
        entities2.cleaned_name.alias("cleaned_name2"), entities2.id.alias("entity_id2"),
    )
    # The same pair can match on several name parts, so remove duplicates
    matched_entities = matched_entities.dropDuplicates()

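    # Set up a UDF for string comparison; declare a double return type so
    # match_score is numeric (F.udf defaults to a string return type)
    @F.udf(T.DoubleType())
    def jaro_compare(name1, name2):
        # Treat missing names as a non-match
        if name1 is None or name2 is None:
            return None
        return jellyfish.jaro_similarity(name1, name2)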

    # Fuzzy matching: keep pairs whose Jaro similarity exceeds 0.75
    matched_entities = matched_entities.withColumn(
        "match_score", jaro_compare("cleaned_name1", "cleaned_name2")
    )
    matched_entities = matched_entities.filter(F.col("match_score") > 0.75)
    matched_entities = matched_entities.select("entity_id1", "entity_id2")
    return matched_entities
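
The cleaning rule can be tried outside Spark. The following is a minimal sketch that applies the same two regular expressions as clean_text in plain Python. The sample names are made up for illustration, and the None check is an added assumption so that missing names pass through unchanged.

```python
import re
from typing import Optional


def clean_text(text: Optional[str]) -> Optional[str]:
    """Normalize a name: drop '.', '/' and '-', collapse spaces, lowercase."""
    if text is None:
        # Assumption: missing names are passed through rather than raising
        return None
    without_punctuation = re.sub(r"[./-]+", "", text)
    return re.sub(r" +", " ", without_punctuation).lower()


if __name__ == "__main__":
    # Hypothetical sample names, not taken from any real dataset
    for raw in ["A.B.C.  Corp", "Smith-Jones Ltd.", "  Acme  ", None]:
        print(repr(raw), "->", repr(clean_text(raw)))
```

Hyphenated names are joined rather than split ("Smith-Jones" becomes "smithjones"), and leading or trailing spaces are kept as a single space, so trimming may be worth adding if the source data is padded.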

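Explanation

_add_phonetic_codes function:
- Generates a phonetic (Soundex) code for each part of the name and adds it as the phonetic_code column. Because each name is split on spaces and exploded, the result has one row per name part.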
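
To see why a phonetic code works as a join key, the Soundex step can be sketched in plain Python. This is a simplified American Soundex written for illustration; it is not Spark's implementation, and F.soundex may differ in edge cases. The helper name_codes is hypothetical and mirrors the split-and-explode logic by producing one code per name part.

```python
from typing import Set

# Consonant groups that share a Soundex digit
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(word: str) -> str:
    """Return the first letter plus three digits, or '' if there are no letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        # Adjacent letters with the same digit are coded only once
        if digit and digit != previous:
            result += digit
        # H and W do not separate same-coded letters; vowels do
        if c not in "HW":
            previous = digit
    return (result + "000")[:4]


def name_codes(cleaned_name: str) -> Set[str]:
    """Hypothetical helper: one Soundex code per space-separated name part."""
    return {soundex(part) for part in cleaned_name.split(" ") if part}


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))
    # Two names become a candidate pair when they share at least one code
    print(name_codes("john smith") & name_codes("jon smyth"))
```

Because a pair only needs one shared code, the join is deliberately loose. It narrows the candidates, and the Jaro score applied afterwards decides which pairs are kept.
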
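compute function:
- Defines clean_text to clean up the text (removing ".", "/", and "-" characters, collapsing repeated spaces, and lowercasing).
- Wraps it as the UDF clean_text_udf and cleans the name column of entities2 and the entity_name column of entities1.
- Calls _add_phonetic_codes to generate the phonetic codes.
- Joins the entities on phonetic_code and removes duplicates.
- Defines jaro_compare as a UDF that computes the Jaro similarity.
- Keeps the matches with a similarity greater than 0.75 and returns the final entity IDs.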
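
The scoring and filtering steps at the end of compute can also be sketched without Spark or jellyfish. This is a textbook-style pure-Python version of Jaro similarity written for illustration; it is not the jellyfish implementation, and the helper name is_match is hypothetical. The 0.75 threshold is the one used in the transform.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def is_match(name1: str, name2: str, threshold: float = 0.75) -> bool:
    """Hypothetical helper mirroring the filter: keep scores above the threshold."""
    return jaro_similarity(name1, name2) > threshold


if __name__ == "__main__":
    for a, b in [("martha", "marhta"), ("dixon", "dicksonx"), ("abc", "xyz")]:
        print(a, b, round(jaro_similarity(a, b), 4), is_match(a, b))
```

The comparison is strict, so a pair scoring exactly 0.75 is dropped. Raising the threshold trades recall for precision and is best tuned against a sample of known matches.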

Submitted: 2024-05-23
Tags: pyspark, fuzzy matching, phonetic codes, jaro similarity
