Local environment

Python

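Dataset row counts

How can I compute the row counts of many datasets in bulk?

This code uses the Foundry API to trigger row count computation for a list of dataset RIDs. For each dataset, it sends a POST request to the Foundry Stats API with the dataset RID and branch in the request URL.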

import pprint

import requests
from requests.adapters import HTTPAdapter
from urllib3 import Retry

'''
Script will trigger row count computation on the set of provided dataset RIDs
'''

# Base variables
base_url = "https://STACK_NAME.palantircloud.com"
branch = "master"

# List of dataset RIDs
DATASETS_RIDS = [
    "ri.foundry.main.dataset.6d2cd3de-0052-xxxxx-c7ae2c4ab1d8"
]

# Request headers
headers = {
    'Authorization': 'Bearer eyg_PUT_YOUR_TOKEN_HERE_xxxx',
    'Content-Type': 'application/json'
}

# Proxy settings (if needed)
proxyDict = {
    # "https": "https://proxyIfNeeded:port"
}

# Retry settings
retry = Retry(connect=1, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
http = requests.Session()
http.mount("https://", adapter)

# Trigger the row count computation for one dataset
def trigger_row_count(dataset_rid, branch):
    response = http.post(f'{base_url}/foundry-stats/api/stats/datasets/{dataset_rid}/branches/{branch}', headers=headers,
                         proxies=proxyDict)
    # Fail early on an HTTP error instead of parsing an error body as JSON
    response.raise_for_status()
    curr_response = response.json()
    pprint.pprint(curr_response)

    return curr_response

# Trigger the row count computation for each dataset RID
for curr_dataset_rid in DATASETS_RIDS:
    trigger_row_count(curr_dataset_rid, branch)
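
When the list of RIDs is long, one failing dataset should probably not abort the whole run. The following is a minimal sketch of that idea, not part of the original script: `build_stats_url` and `trigger_all` are hypothetical helper names, and the HTTP call is passed in as a `post` callable so the loop can be exercised without network access. The URL path is the one used in the script above.

```python
from typing import Callable, Dict, Iterable

# STACK_NAME is a placeholder, as in the script above.
BASE_URL = "https://STACK_NAME.palantircloud.com"


def build_stats_url(base_url: str, dataset_rid: str, branch: str) -> str:
    """Build the row-count endpoint URL used in the script above."""
    return f"{base_url}/foundry-stats/api/stats/datasets/{dataset_rid}/branches/{branch}"


def trigger_all(
    dataset_rids: Iterable[str],
    branch: str,
    post: Callable[[str], object],
    base_url: str = BASE_URL,
) -> Dict[str, object]:
    """Call post(url) once per RID and collect the outcome per RID.

    A failure for one dataset is recorded and does not stop the loop.
    """
    results: Dict[str, object] = {}
    for rid in dataset_rids:
        url = build_stats_url(base_url, rid, branch)
        try:
            results[rid] = post(url)
        except Exception as exc:  # catch-all for the sketch; narrow it in real use
            results[rid] = f"FAILED: {exc}"
    return results
```

With the session from the script above, `post` could be a small function that calls `http.post(url, headers=headers, proxies=proxyDict)` and returns the parsed response.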

Submitted: 2024-03-26
Tags: export, python, metrics, metadata, local

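Get the superset of all columns across multiple datasets

How can I get the set of all columns across multiple datasets?

This code uses the requests library to fetch the schema of each dataset in a list of target datasets, then iterates over the fields in each schema to build a dictionary that maps every column in the superset to its frequency.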

import pprint

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


'''
Script that generates the superset of columns with their frequency from a set of datasets
'''

headers = {
    'Authorization': 'Bearer eyg_PUT_YOUR_TOKEN_HERE_xxxx',
    'Content-Type': 'application/json',
}

# STACK_NAME is a placeholder; the URL scheme is required by requests
base_url = "https://STACK_NAME.palantircloud.com"
branch = "master"


target_datasets = ["ri.foundry.main.dataset.4c2ac089-xxxx-4df863eaf823"]

# Proxies
proxyDict = {
    #"https": "https://proxyIfNeeded:port"
}

# Retries
retry = Retry(connect=1, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
http = requests.Session()
http.mount("https://", adapter)


# Maps "<column name> - <column type>" to the number of times it was seen
global_list_fields = {}

for curr_dataset in target_datasets:
    # Get schema of the dataset
    print("Step 1. Get Schema of dataset")
    response = http.get(f'{base_url}/foundry-metadata/api/schemas/datasets/{curr_dataset}/branches/{branch}', headers=headers, proxies=proxyDict)
    print("Step 1. Response of getting schema of dataset")
    raw_response = response.text
    print(raw_response)
    # Fail early on an HTTP error instead of parsing an error body as JSON
    response.raise_for_status()
    curr_schema = response.json()
    list_fields = curr_schema["schema"]["fieldSchemaList"]

    for field in list_fields:
        curr_key = f"{field['name']} - {field['type']}"
        # Increment counter
        global_list_fields[curr_key] = global_list_fields.get(curr_key, 0) + 1

print("Unsorted dict")
pprint.pprint(global_list_fields)

# Sort by frequency, ascending
sorted_dict = dict(sorted(global_list_fields.items(), key=lambda item: item[1]))
print("Sorted dict")
# sort_dicts=False (Python 3.8+) keeps the frequency order; by default pprint re-sorts by key
pprint.pprint(sorted_dict, sort_dicts=False)
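
The counting step can also be written with `collections.Counter` from the standard library. This is a sketch under the assumption that each schema response has the `schema.fieldSchemaList` shape the script above reads; `count_columns`, `columns_in_all`, and the sample schemas are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Mapping


def count_columns(schemas: Iterable[Mapping]) -> Counter:
    """Count occurrences of each "name - type" column key across the schemas."""
    counts: Counter = Counter()
    for schema in schemas:
        for field in schema["schema"]["fieldSchemaList"]:
            counts[f"{field['name']} - {field['type']}"] += 1
    return counts


def columns_in_all(counts: Counter, n_datasets: int) -> List[str]:
    """Return the column keys seen n_datasets times, sorted by name."""
    return sorted(key for key, n in counts.items() if n == n_datasets)


# Made-up schemas for illustration only
sample_schemas = [
    {"schema": {"fieldSchemaList": [
        {"name": "id", "type": "STRING"},
        {"name": "amount", "type": "DOUBLE"},
    ]}},
    {"schema": {"fieldSchemaList": [
        {"name": "id", "type": "STRING"},
    ]}},
]

counts = count_columns(sample_schemas)
print(counts.most_common())  # most frequent columns first
print(columns_in_all(counts, len(sample_schemas)))
```

`Counter.most_common()` returns the columns ordered from most to least frequent, which avoids the manual sort. Because the key includes the type, a column that has the same name but different types in two datasets is counted as two separate entries, as in the script above.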

Submitted: 2024-03-26
Tags: python, API, metadata, code repositories, Code Authoring, local

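Get the path of a given resource RID

How do I find the path of a resource from its RID?

This code uses the requests library to send an HTTP GET request with the given RID to the specified host and retrieve the resource's path. It also configures retries and proxies.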

import requests
from requests.adapters import HTTPAdapter
from urllib3 import Retry

'''
Script to return the path of a given Resource IDentifier (RID).
'''

# Headers
headers = {
    'Authorization': 'Bearer xxx',  # Replace 'xxx' with your own bearer token
    'Content-Type': 'application/json',
}

# Host
host = 'host.com:443'

# Proxies (placeholder value; adjust or empty the dict if no proxy is needed)
proxyDict = {
    'https': 'http://proxy.domain.com:3333'
}

# Retries
retry = Retry(connect=1, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
http = requests.Session()
http.mount('https://', adapter)

# Enter the RID of the resource you want the path of
RESOURCE_RID = ''

# Raise an error if the path could not be fetched
try:
    print(f'Fetching path for rid {RESOURCE_RID} ...')
    response = http.get(f'https://{host}/compass/api/resources/{RESOURCE_RID}/path-json', headers=headers, proxies=proxyDict)
    # Treat HTTP error statuses as failures too
    response.raise_for_status()
    print('Completed request')
    print(f'The path is: {response.text}')
except requests.exceptions.RequestException as e:
    # Do not reference `response` here: it is undefined when the request itself failed
    raise Exception(f"An error occurred in the request.\nReturning the path for the resource: {RESOURCE_RID} failed.\nException: {e}") from e
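
The script above sends its request even when `RESOURCE_RID` is left empty. As a minimal sketch of guarding against that and of looking up several RIDs in one pass, the helpers below build the same URL and take the HTTP call as a `get` callable returning `(status_code, body)`. The names `build_path_url` and `resolve_paths` are hypothetical, and skipping non-200 responses is a choice made for this sketch, not documented behavior of the endpoint.

```python
from typing import Callable, Dict, Iterable, Tuple


def build_path_url(host: str, resource_rid: str) -> str:
    """Build the path lookup URL used in the script above."""
    if not resource_rid.strip():
        raise ValueError("resource_rid must not be empty")
    return f"https://{host}/compass/api/resources/{resource_rid}/path-json"


def resolve_paths(
    host: str,
    resource_rids: Iterable[str],
    get: Callable[[str], Tuple[int, str]],
) -> Dict[str, str]:
    """Return {rid: response body} for every RID whose lookup returned HTTP 200.

    RIDs with any other status code are left out of the result.
    """
    paths: Dict[str, str] = {}
    for rid in resource_rids:
        status, body = get(build_path_url(host, rid))
        if status == 200:
            paths[rid] = body
    return paths
```

With the session from the script above, `get` could wrap `http.get(url, headers=headers, proxies=proxyDict)` and return `(response.status_code, response.text)`.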

Submitted: 2024-03-26
Tags: api, python, metadata, local
