Cogrouped map
Cogrouped map operations with pandas instances are supported by DataFrame.groupby().cogroup().applyInPandas(), which allows two PySpark DataFrames to be cogrouped by a common key and a Python function applied to each cogroup. It consists of the following steps:

  • Shuffle the data such that the groups of each DataFrame which share a key are cogrouped together.

  • Apply a function to each cogroup. The input of the function is two pandas.DataFrames (with an optional tuple representing the key). The output of the function is a pandas.DataFrame.

  • Combine the pandas.DataFrames from all groups into a new PySpark DataFrame.
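The three steps above can be sketched in plain pandas, without Spark; the per-key pairing and the final concatenation mimic what cogroup() and the result assembly do conceptually. The frames and the combine function here are illustrative assumptions, not the PySpark implementation:

```python
import pandas as pd

# Two toy frames sharing an "id" key (illustrative data).
left = pd.DataFrame({"id": [1, 1, 2], "v1": [1.0, 3.0, 2.0]})
right = pd.DataFrame({"id": [1, 2], "v2": ["x", "y"]})

def combine(l: pd.DataFrame, r: pd.DataFrame) -> pd.DataFrame:
    # Step 2: the user function sees one cogroup at a time.
    return l.merge(r, on="id")

# Step 1: pair up the per-key groups from both frames (the "cogroup").
keys = sorted(set(left["id"]) | set(right["id"]))
parts = [
    combine(left[left["id"] == k], right[right["id"] == k])
    for k in keys
]

# Step 3: combine the per-group results into one frame.
result = pd.concat(parts, ignore_index=True)
print(result)
```

In Spark the shuffle in step 1 happens across the cluster and step 3 produces a distributed DataFrame, but the per-group contract of the user function is the same.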

To use groupBy().cogroup().applyInPandas(), you must define the following:

  • A Python function that defines the computation for each cogroup.

  • A StructType object or a string that defines the schema of the output PySpark DataFrame.

The column labels of the returned pandas.DataFrame must either match the field names in the defined output schema if they are specified as strings, or match the field data types by position if they are not strings, e.g. integer indices. See pandas.DataFrame for how to label columns when constructing a pandas.DataFrame.
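The two labeling options can be seen in plain pandas (a hedged sketch; the data is made up): string column labels are matched to the output schema field names, while default integer labels would instead be matched to schema fields by position:

```python
import pandas as pd

# String labels: these would be matched to schema field names
# such as "time int, id int, v1 double".
named = pd.DataFrame({"time": [20000101], "id": [1], "v1": [1.0]})

# No explicit labels: pandas assigns integer labels 0, 1, 2,
# so fields would be matched to the schema by position instead.
positional = pd.DataFrame([[20000101, 1, 1.0]])

print(list(named.columns))       # string labels
print(list(positional.columns))  # integer labels
```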
All the data of a cogroup is loaded into memory before the function is applied. This can lead to out-of-memory exceptions, especially if the group sizes are skewed. The maxRecordsPerBatch configuration is not applied to cogroups, and it is up to you to ensure that the cogrouped data fits into the available memory.
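One way to spot the skew this warning describes is to inspect per-key group sizes before cogrouping. A hedged pandas sketch of the idea (on a real Spark DataFrame you would use df.groupBy("id").count() instead; the data here is illustrative):

```python
import pandas as pd

# A deliberately skewed frame: key 1 holds 4 of the 5 rows.
df = pd.DataFrame({"id": [1, 1, 1, 1, 2], "v": range(5)})

# Per-key row counts reveal skewed groups; a dominant key means
# one cogroup must fit almost the whole dataset in memory at once.
sizes = df.groupby("id").size()
print(sizes)
```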
The following example shows how to use groupby().cogroup().applyInPandas() to perform an asof join between two datasets.
import pandas as pd


df1 = spark.createDataFrame(
    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
    ("time", "id", "v1"))

df2 = spark.createDataFrame(
    [(20000101, 1, "x"), (20000101, 2, "y")],
    ("time", "id", "v2"))

def asof_join(l, r):
    # Each call receives the rows for one "id" from both frames.
    return pd.merge_asof(l, r, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time int, id int, v1 double, v2 string").show()
# +--------+---+---+---+
# | time| id| v1| v2|
# +--------+---+---+---+
# |20000101| 1|1.0| x|
# |20000102| 1|3.0| x|
# |20000101| 2|2.0| y|
# |20000102| 2|4.0| y|
# +--------+---+---+---+
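The per-group matching done by asof_join above can be checked with pandas alone. This standalone sketch reproduces the id=1 cogroup from the example (the frames are hand-built to mirror that group):

```python
import pandas as pd

# The id=1 cogroup from the example above, as plain pandas frames.
l = pd.DataFrame({"time": [20000101, 20000102], "id": [1, 1], "v1": [1.0, 3.0]})
r = pd.DataFrame({"time": [20000101], "id": [1], "v2": ["x"]})

# merge_asof matches each left row to the most recent right row
# whose time is <= the left row's time, within the same id.
# Both frames must already be sorted by the "on" column.
out = pd.merge_asof(l, r, on="time", by="id")
print(out)
```

Both left rows pick up v2 == "x": 20000101 matches exactly, and 20000102 falls back to the latest earlier right row, which is why the Spark output above shows "x" for both id=1 rows.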

