
A Practical Guide: 7 Techniques for Handling Missing Values in Python, with Worked Examples

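Ah, missing data. Who hasn't run into it? If you're just getting started, staring at a pile of NaN (null values) is enough to give you a headache. Don't panic! Today we'll use Python to walk through 7 practical techniques for handling missing values, step by step, so you can put them to work right away.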


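1. Just delete it: the cut-the-knot approach

When to use it: few missing values (say, <5%), or values missing with no pattern
Put bluntly: if you don't like the look of it, drop it! With pandas' dropna(), one line of code removes the rows or columns that contain nulls. For example, if an e-commerce dataset is missing the age field for a handful of users, dropping those rows does little harm.

For example: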

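import math
import pandas as pd

# Drop every row that contains at least one missing value
df_clean = df.dropna(axis=0)
# Drop columns with more than 20% missing (keep columns that are at least 80% non-null)
df = df.dropna(thresh=math.ceil(len(df) * 0.8), axis=1)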

⚠️ Pitfalls:

  • If you don't have much data to begin with, this can cut your sample in half
  • Carelessly dropping rows from time series data can break its continuity
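Before reaching for dropna(), it helps to measure how much is actually missing. Here is a minimal sketch, assuming a made-up e-commerce table: the column names user_id, age and coupon, and the 20% cutoff for dropping a column, are illustrative choices rather than anything from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical e-commerce data; the column names are made up for illustration
df = pd.DataFrame({
    "user_id": range(1, 101),
    "age": [np.nan if i % 25 == 0 else 20 + i % 40 for i in range(100)],  # 4% missing
    "coupon": [np.nan if i % 2 == 0 else 1.0 for i in range(100)],        # 50% missing
})

# Fraction of missing values in each column
missing_rate = df.isna().mean()

# Drop the mostly empty columns first, then the few rows that still have gaps
sparse_cols = missing_rate[missing_rate > 0.2].index
df_clean = df.drop(columns=sparse_cols).dropna(axis=0)

# Share of rows lost by deleting; worth checking before committing to it
rows_lost = 1 - len(df_clean) / len(df)
print(missing_rate)
print(f"rows lost: {rows_lost:.0%}")
```

Dropping the sparse column before the rows matters here: calling dropna() on the raw table would throw away half the rows just because of the coupon column.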

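2. Mean/median filling: the all-purpose basic

When to use it: numeric data with a stable distribution
This is the beginner's favorite: replace the nulls with the average, done in seconds with pandas' fillna(). For example, when the average order value field is missing in sales data, filling it with the store-wide average beats leaving it empty.

How to do it: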

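# Fill with the column mean (assign back; newer pandas warns about inplace=True on a single column)
df['sales'] = df['sales'].fillna(df['sales'].mean())
# Fill with the median within each city group
df['salary'] = df.groupby('city')['salary'].transform(lambda x: x.fillna(x.median()))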

💡 Going further:

  • For skewed data the median is more reliable (income data, for instance, often has extreme values)
  • For categorical data try mode filling: if the gender field is missing, fill in the most frequent gender
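The two tips above can be sketched on a toy table. The income and gender values below are invented purely to show how one extreme value drags the mean while the median stays put.

```python
import numpy as np
import pandas as pd

# Toy data (invented): one extreme income and one missing value per column
df = pd.DataFrame({
    "income": [3000, 3200, 3500, 4000, 100000, np.nan],
    "gender": ["F", "M", "F", "F", None, "M"],
})

mean_fill = df["income"].mean()      # pulled far up by the single outlier
median_fill = df["income"].median()  # stays close to a typical value

df["income"] = df["income"].fillna(median_fill)
# mode() returns a Series because ties are possible, so take the first entry
df["gender"] = df["gender"].fillna(df["gender"].mode().iloc[0])

print(mean_fill, median_fill)
print(df)
```

Here the mean comes out at 22740 while the median is 3500, which is why the median is the safer fill for this kind of column.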

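3. Interpolating from neighboring values: the time series lifesaver

When to use it: sensor readings, stock prices, and other time series data
This method is a great fit for data ordered by time! The idea is like connecting the dots: use the values on either side of a gap to estimate the missing ones in between. pandas' interpolate() has linear and polynomial interpolation built in.

Here's a stock data example: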

# Linear interpolation (by default, gaps are filled in the forward direction)
df['close'] = df['close'].interpolate(method='linear')
# Higher-order interpolation (e.g. a cubic spline; requires SciPy)
df['temperature'] = df['temperature'].interpolate(method='spline', order=3)

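❗ Watch out:

  • Avoid this when the data swings sharply; it can produce wildly off values
  • Gaps at the very start or end are not truly interpolated (by default pandas leaves leading NaN untouched and simply repeats the last valid value for trailing NaN), so combine it with other methods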
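One way to handle the edge gaps explicitly is to interpolate only the interior and then copy the nearest valid value outward. This is a small sketch on an invented daily series, and the bfill/ffill step is just one possible choice for the edges.

```python
import numpy as np
import pandas as pd

# Invented daily series with gaps at the start, in the middle, and at the end
s = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, 16.0, np.nan],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# limit_area="inside" fills only gaps that have valid values on both sides
inner = s.interpolate(method="linear", limit_area="inside")

# Then copy the nearest valid value into the leading and trailing gaps
filled = inner.bfill().ffill()

print(inner.tolist())   # [nan, 10.0, 12.0, 14.0, 16.0, nan]
print(filled.tolist())  # [10.0, 10.0, 12.0, 14.0, 16.0, 16.0]
```

Keeping the two steps separate makes it obvious which values were genuinely interpolated and which were just carried over from a neighbor.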

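4. KNN imputation: let the most similar neighbors help

When to use it: high-dimensional data with strongly correlated features
This one is powerful: find the K most similar samples and fill the missing value with their average! Scikit-learn's KNNImputer is a real gem, especially for medical data and similar cases where many indicators are related.

Code demo: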

import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
# Assume df is a DataFrame with age, blood pressure, and blood glucose columns (all numeric)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

📌 Tuning tips:

  • Standardize the data first! Otherwise features with large values dominate the distance calculation
  • K is usually set to 3-5; too large a K pulls in less similar samples and adds noise
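The "standardize first" tip can be sketched as scale, impute, then undo the scaling. The patient table below is invented. StandardScaler is one reasonable scaler choice here, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Invented patient data; the columns sit on very different scales
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 38],
    "blood_pressure": [118, 121, np.nan, 135, 142, 125],
    "blood_glucose": [4.8, 5.1, 5.9, np.nan, 6.8, 5.3],
})

# StandardScaler ignores NaN when fitting and keeps NaN in its output
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Distances are now computed on comparable scales
imputed_scaled = KNNImputer(n_neighbors=3).fit_transform(scaled)

# Undo the scaling so the filled values are back in the original units
df_filled = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled), columns=df.columns, index=df.index
)
print(df_filled.round(2))
```

Without the scaling step, blood pressure (values above 100) would dominate the distance and blood glucose (values around 5) would barely count.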

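5. Random forest prediction: let AI guess the value for you

When to use it: data with complex relationships and a fairly large amount missing
This is the advanced play: treat the field with missing values as the prediction target, treat the other complete fields as features, and predict it with a machine learning model! For example, if customer data is missing the income field, you can predict it from features such as age, occupation, and purchase history.

Three steps: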

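from sklearn.ensemble import RandomForestRegressor

# 1. Split rows with and without the missing field
known = df[df['income'].notnull()]
unknown = df[df['income'].isnull()]

# 2. Train the prediction model (feature columns must be numeric and free of NaN;
#    encode categorical fields such as occupation first)
X = known.drop('income', axis=1)
y = known['income']
model = RandomForestRegressor(random_state=0).fit(X, y)

# 3. Predict and fill
preds = model.predict(unknown.drop('income', axis=1))
df.loc[df['income'].isnull(), 'income'] = preds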

🔥 Strengths:

  • Captures nonlinear relationships between features (age and income, for example, are not related by a straight line)
  • Suits large datasets; on small samples it overfits easily
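Since overfitting is the main risk, it is worth checking how well the model predicts rows where the value is known before trusting its fills. The sketch below runs on synthetic data: the age, spend and income columns and the formula that generates them are invented for illustration. Cross-validated R² is used as a rough proxy for imputation quality.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic customer data (invented relationship between the columns)
rng = np.random.default_rng(0)
n = 300
age = rng.integers(20, 60, n)
spend = rng.normal(2000, 500, n)
income = 3000 + 80 * age + 1.5 * spend + rng.normal(0, 300, n)
df = pd.DataFrame({"age": age, "spend": spend, "income": income})
df.loc[rng.choice(n, 60, replace=False), "income"] = np.nan  # hide 20% of incomes

known = df[df["income"].notna()]
X, y = known.drop(columns="income"), known["income"]

model = RandomForestRegressor(n_estimators=100, random_state=0)
# R^2 on held-out rows with known income: a low score means the fills are guesses
r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(f"cross-validated R^2: {r2:.2f}")

# Only then fit on all known rows and fill the gaps
model.fit(X, y)
mask = df["income"].isna()
df.loc[mask, "income"] = model.predict(df.loc[mask, ["age", "spend"]])
```

If the score is poor, a simpler fill (median, or the missing-flag approach) is probably the more honest choice.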

鍏€佸閲嶆彃琛ワ細涓ヨ皑娲剧殑棣栭€夋柟妗?/h3>

鈥?strong>鈥嬮€傜敤鍦烘櫙鈥?/strong>鈥嬶細瀛︽湳鐮旂┒銆侀噾铻嶉鎺х瓑瑕佹眰楂樼殑鍦烘櫙
杩欐槸缁熻瀛﹀鏈€鐖辩殑鏂规硶鈥斺€斺€?strong>鈥嬪悓涓€涓潙濉簲娆″彇骞冲潎鈥?/strong>鈥嬶紒閫氳繃statsmodels鐨?code>MICE绠楁硶锛岀敓鎴愬涓彲鑳界殑濉厖鍊硷紝缁煎悎鑰冮噺涓嶇‘瀹氭€с€?/p>

鎿嶄綔绀轰緥锛?/p>

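from statsmodels.imputation.mice import MICEData

# Build 5 imputed versions (df must be all numeric, with column names usable in a formula)
imp = MICEData(df)
imputed_versions = []
for _ in range(5):
    imp.update_all()  # one full imputation cycle over every column with missing values
    imputed_versions.append(imp.data.copy())
# Take the first imputed result
imputed_data = imputed_versions[0]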

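🎯 Core value:

  • Closer to the true data distribution than a single fill
  • Especially useful when values are missing systematically (for example, low-income people unwilling to disclose income), as long as the other columns help explain who is missing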

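7. Flagging with a special value: a step back to move forward

When to use it: keeping a fallback when you can't settle on a fill strategy
Really don't know how to fill it? Then turn the missingness itself into information! Add an "is missing" indicator column, and the model may well discover the pattern on its own.

The code is simple: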

# Create the missing-indicator column
df['age_missing'] = df['age'].isnull().astype(int)
# Fill the original field with -999
df['age'] = df['age'].fillna(-999)

🤔 Notes from experience:

  • Especially useful in financial anti-fraud (deliberately leaving certain fields blank may itself signal risk)
  • Decision-tree models can make use of this kind of flag automatically
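A sentinel like -999 works for tree models but can badly distort linear or distance-based ones. A variant worth knowing is to fill with the median and still keep the flag, which scikit-learn's SimpleImputer can do in one step. This is a swapped-in alternative to the -999 approach, shown on invented data. The "_missing" column suffix is my own naming.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data with gaps in both columns
X = pd.DataFrame({
    "age": [23, np.nan, 35, 41, np.nan],
    "income": [3000, 4200, np.nan, 5100, 3900],
})

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(X)

# indicator_.features_ holds the positions of the columns that received a flag
flag_cols = [f"{c}_missing" for c in X.columns[imputer.indicator_.features_]]
X_flagged = pd.DataFrame(out, columns=list(X.columns) + flag_cols)
print(X_flagged)
```

The model still sees which rows were originally empty, but the filled values stay within a sensible range.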

涓汉瑙傜偣鏃堕棿

骞蹭簡杩欎箞澶氬勾鏁版嵁鍒嗘瀽锛屾垜鍙戠幇鏂版墜鏈€瀹规槗鐘袱涓敊锛氳涔堟棤鑴戝叏鍒犳暟鎹紝瑕佷箞闂溂鐢ㄥ潎鍊煎~鍏呫€傚叾瀹炩€?strong>鈥嬫病鏈夋渶濂界殑鏂规硶锛屽彧鏈夋渶閫傚悎鍦烘櫙鐨勬柟妗堚€?/strong>鈥嬧€斺€旀暟鎹噺澶х殑鏃跺€橩NN鐪熼锛屼絾瑕佽В閲婃€у己鐨勫満鏅繕鏄潎鍊煎~鍏呮洿鐩寸櫧銆?/p>

鏈€杩戝府鏈嬪弸澶勭悊杩囦竴缁勫仴韬獳PP鐨勬暟鎹紝閲岄潰鏈?0%鐨勭敤鎴蜂綋閲嶆暟鎹己澶便€傝瘯浜嗛殢鏈烘.鏋楀~鍏呭悗鍙戠幇锛岀敤杩愬姩鏃堕暱+蹇冪巼鏁版嵁棰勬祴鐨勪綋閲嶏紝灞呯劧姣旂湡瀹炴祴閲忓€艰繕鍑嗭紙鍚庢潵鍙戠幇鍥犱负寰堝鐢ㄦ埛鐬庡~浣撻噸锛夈€備綘鐪嬶紝鏈夋椂鍊欑己澶卞€煎鐞嗚繕鑳藉府浣犲彂鐜版暟鎹川閲忛棶棰樺憿锛?/p>

鏈€鍚庤鍙ュぇ瀹炶瘽鈥斺€斺€?strong>鈥嬪崈涓囧埆瑙夊緱鐢ㄤ簡楂樼骇鏂规硶灏变竾浜嬪ぇ鍚夆€?/strong>鈥嬨€傚~瀹岀己澶卞€煎悗涓€瀹氳鍋氫袱浠朵簨锛?銆佸姣斿~鍏呭墠鍚庢暟鎹垎甯冨彉鍖?2銆佽窇涓熀绾挎ā鍨嬬湅鏁堟灉鎻愬崌銆傛瘯绔熷挶浠殑鐩爣涓嶆槸鎶奛aN娑堢伃鍏夛紝鑰屾槸璁╂暟鎹湡姝h兘浜у嚭浠峰€煎鍚э紵

This article is an exclusive original by 嘻道妙招. Reproduction without permission is strictly prohibited.