Memo: How to load large data in chunks
Code I want to reach for when data such as access logs is too large to load and causes an out-of-memory error.
I found it on Kaggle, but I have lost the URL.
Many thanks to the original author.
```python
import pandas as pd
from tqdm import tqdm


def chunk_load(path, file, sample_ratio, seed, usecols=None, chunksize=None,
               encoding=None, sep=None, names=None, dtype_dict=None):
    '''Load data with the chunk method, sampling each chunk as it is read.

    Args :
    -------
    path : str
        Path of the directory containing the data (with trailing slash).
    file : str
        File name.
    sample_ratio : float
        Fraction of each chunk to keep (passed to DataFrame.sample).
    seed : int
        Random state for reproducible sampling.
    usecols : list, default=None
        Columns to read.
    chunksize : int, default=None
        Any positive integer, but not 0. Number of rows per chunk.
    encoding : str, default=None
        File encoding.
    sep : str, default=None
        Field separator.
    names : list, default=None
        Column names.
    dtype_dict : dictionary, default=None
        Dictionary mapping column name to column format. Ex. {'age': 'int8'}

    Returns :
    -------
    data : pandas.DataFrame
    '''
    data_chunk = pd.read_csv(f'{path}{file}', encoding=encoding,
                             chunksize=chunksize, sep=sep, usecols=usecols,
                             names=names, dtype=dtype_dict, header=None)
    data_temp = []
    for chunk in tqdm(data_chunk):
        # Keep only a sample of each chunk so peak memory stays bounded.
        sample_chunk = chunk.sample(frac=sample_ratio, random_state=seed)
        data_temp.append(sample_chunk)
    data = pd.concat(data_temp, axis=0)
    del data_temp, data_chunk, sample_chunk, chunk
    return data


# ex (note the trailing slash: the function concatenates path and file directly)
fdir = '/kaggle/input/xxxxxxxxxxx/'
file = 'xxxxx.txt'
sample_ratio = 0.05
seed = 39
data = chunk_load(fdir, file, sample_ratio, seed, usecols=None,
                  chunksize=10**6, encoding=None, sep='\t', names=None,
                  dtype_dict=None)
```
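Sampling chunk by chunk is what keeps this memory-safe: only one chunk plus the accumulated samples ever live in RAM at once. If memory is still tight, the `dtype_dict` parameter mentioned in the docstring can shrink each chunk before it is sampled. A minimal sketch of that usage, assuming hypothetical column names (`user_id`, `age`, `url`) for a tab-separated log with no header row; adjust the names and dtypes to your own data:

```python
# Hypothetical column names and dtypes, for illustration only.
names = ['user_id', 'age', 'url']
dtype_dict = {'user_id': 'int32', 'age': 'int8', 'url': 'object'}

data = chunk_load(fdir, file, sample_ratio, seed, usecols=None,
                  chunksize=10**6, encoding=None, sep='\t', names=names,
                  dtype_dict=dtype_dict)

# Check the in-memory footprint of the sampled DataFrame.
print(data.memory_usage(deep=True).sum())
```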