Teacher (full name)
Cogrouped map
For cogrouped map operations with pandas, use DataFrame.groupby().cogroup().applyInPandas() to cogroup two PySpark DataFrames by a common key and then apply a Python function to each cogroup. The operation proceeds as follows:

- Shuffle the data so that the groups of each DataFrame that share a key are cogrouped together.
- Apply a function to each cogroup. The input of the function is two pandas.DataFrame objects (with an optional tuple representing the key). The output of the function is a pandas.DataFrame.
- Combine the pandas.DataFrame objects from all groups into a new PySpark DataFrame.

To use groupBy().cogroup().applyInPandas(), you must define the following:

- A Python function that defines the computation for each cogroup.
- A StructType object or a string that defines the schema of the output PySpark DataFrame.

The column labels of the returned pandas.DataFrame must either match the field names in the defined output schema, if they are specified as strings, or match the field data types by position, if they are not strings (for example, integer indices). See pandas.DataFrame for how to label columns when constructing a pandas.DataFrame.

All data for a cogroup is loaded into memory before the function is applied. This can lead to out-of-memory exceptions, especially when the group sizes are skewed. The maxRecordsPerBatch configuration is not applied here, so it is up to you to ensure that the cogrouped data fits into the available memory.

The following example shows how to use groupby().cogroup().applyInPandas() to perform an asof join between two datasets.
    import pandas as pd
    from pyspark.sql import SparkSession

    # Reuse the active session if there is one, otherwise start a new one.
    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame(
        [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
        ("time", "id", "v1"))

    df2 = spark.createDataFrame(
        [(20000101, 1, "x"), (20000101, 2, "y")],
        ("time", "id", "v2"))

    def asof_join(l, r):
        # l and r are the pandas DataFrames holding one cogroup from each side.
        return pd.merge_asof(l, r, on="time", by="id")

    df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
        asof_join, schema="time int, id int, v1 double, v2 string").show()
    # +--------+---+---+---+
    # |    time| id| v1| v2|
    # +--------+---+---+---+
    # |20000101|  1|1.0|  x|
    # |20000102|  1|3.0|  x|
    # |20000101|  2|2.0|  y|
    # |20000102|  2|4.0|  y|
    # +--------+---+---+---+
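To see what the function receives for each cogroup, the per-group computation can be tried locally with pandas alone. The sketch below is not how Spark executes the operation: it emulates the cogrouping with a plain loop, and the helper name emulate_cogroup and the function name asof_join_with_key are hypothetical, introduced only for illustration. It uses the three-argument form of the function, where the first argument is the optional key tuple mentioned above.

```python
import pandas as pd

# Same rows as df1 and df2 in the example above, as plain pandas DataFrames.
left = pd.DataFrame(
    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
    columns=["time", "id", "v1"])
right = pd.DataFrame(
    [(20000101, 1, "x"), (20000101, 2, "y")],
    columns=["time", "id", "v2"])


def asof_join_with_key(key, l, r):
    # key is a tuple of the grouping values, e.g. (1,).
    # l and r contain only the rows of each side that share that key.
    return pd.merge_asof(l, r, on="time", by="id")


def emulate_cogroup(left, right, by, func):
    # Hypothetical local stand-in for groupby().cogroup().applyInPandas():
    # call func once per key found on either side, then concatenate the results.
    keys = sorted(set(left[by]) | set(right[by]))
    parts = []
    for k in keys:
        l = left[left[by] == k].reset_index(drop=True)
        r = right[right[by] == k].reset_index(drop=True)
        parts.append(func((k,), l, r))
    return pd.concat(parts, ignore_index=True)


result = emulate_cogroup(left, right, "id", asof_join_with_key)
print(result)
```

With an active Spark session, the same three-argument function could be passed to applyInPandas in place of asof_join. Because each call holds one whole cogroup in memory, a single very large key makes that one call correspondingly expensive.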
Database protected by copyright ©fayllar.org 2024