Python • Lightweight transforms • Use the latest Ubuntu as a base • Palantir


Note: The accuracy of the following translation has not been verified. It was machine-translated from the original English version using AIP.

# Use the latest Ubuntu image as the base
FROM ubuntu:latest

# Install the required packages
# coreutils: basic file, shell, and text manipulation utilities
# curl: tool for transferring data using URL syntax
# sed: stream editor
# build-essential: packages needed to compile C programs
# gnucobol: COBOL compiler
RUN apt-get update && apt-get install -y coreutils curl sed build-essential gnucobol

# Create a user with UID 5001
RUN useradd --uid 5001 user

# Switch to the newly created user
USER 5001

import polars as pl
from transforms.api import incremental, Input, lightweight, Output, transform


def union_polars_dataframes(dfs):
    # Concatenate the per-file DataFrames vertically (requires a non-empty list)
    return pl.concat(dfs)


@lightweight()
@incremental()
@transform(my_input=Input("my-input"), my_output=Output("my-output"))
def my_incremental_transform(my_input, my_output):
    fs = my_input.filesystem()  # Get the input filesystem
    files = [f.path for f in fs.ls()]  # Collect the paths of the input files
    polars_dataframes = []  # DataFrames parsed from the files are collected here

    for file_path in files:
        # Access the file
        with fs.open(file_path, "rb") as f:
            # <do something with the file>
            # append some data as a dataframe to polars_dataframes
            pass  # placeholder so the block is valid Python

    # Union all the DataFrames into one
    combined_df = union_polars_dataframes(polars_dataframes)
    my_output.write_table(combined_df)  # Write the combined DataFrame to the output

The following code shows an example of parsing Excel files:

from transforms.api import transform, Input, Output, lightweight
import tempfile
import shutil
import polars as pl
import pandas as pd


@lightweight()
@transform(
    my_output=Output("/path/tabular_output_dataset"),
    my_input=Input("/path/input_dataset_without_schema"),
)
def compute(my_input, my_output):
    # Parse each file
    # Open the Excel file at the given path using the provided filesystem
    def read_excel_to_polars(fs, file_path):
        with fs.open(file_path, "rb") as f:
            with tempfile.TemporaryFile() as tmp:
                # Copy the file from the source dataset to the local filesystem
                shutil.copyfileobj(f, tmp)
                tmp.flush()  # shutil.copyfileobj does not flush
                tmp.seek(0)  # Rewind so the reader starts at the beginning

                # Read the Excel file (the local copy is seekable)
                pandas_df = pd.read_excel(tmp)
                # Cast every column to string, so integer columns become strings too
                pandas_df = pandas_df.astype(str)
                # Convert the pandas DataFrame to a polars DataFrame
                return pl.from_pandas(pandas_df)

    fs = my_input.filesystem()
    # List all files in the input dataset
    files = [f.path for f in fs.ls()]

    polars_dataframes = []

    for curr_file_as_row in files:
        # print(curr_file_as_row)
        polars_dataframes.append(read_excel_to_polars(fs, curr_file_as_row))

    def union_polars_dataframes(dfs):
        return pl.concat(dfs)

    # Concatenate all the DataFrames into one
    combined_df = union_polars_dataframes(polars_dataframes)

    my_output.write_table(combined_df)
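
The copy-to-temporary-file step in `read_excel_to_polars` can be illustrated with the standard library alone. This is a minimal sketch: `ForwardOnlyStream` and `spool_to_seekable` are hypothetical names that stand in for a dataset file stream that cannot seek, and the explicit rewind is an addition made here for clarity.

```python
import io
import shutil
import tempfile


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a dataset file stream that cannot seek."""

    def __init__(self, payload: bytes):
        super().__init__()
        self._buffer = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, target) -> int:
        # Fill the caller's buffer with the next chunk of bytes
        chunk = self._buffer.read(len(target))
        target[: len(chunk)] = chunk
        return len(chunk)


def spool_to_seekable(stream):
    """Copy a stream into a temporary file and rewind it for random access."""
    tmp = tempfile.TemporaryFile()
    shutil.copyfileobj(stream, tmp)
    tmp.flush()  # copyfileobj does not flush the destination
    tmp.seek(0)  # rewind so readers start at the first byte
    return tmp


source = ForwardOnlyStream(b"header,value\nrow,1\n")
with spool_to_seekable(source) as local_copy:
    first_line = local_copy.readline()
    local_copy.seek(0)  # random access now works on the local copy
    everything = local_copy.read()
```

Parsers that need to jump around inside a file, as Excel readers typically do, can then operate on the local copy instead of the original stream.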
