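CSV stands for comma-separated values (sometimes also called character-separated values). A CSV file stores tabular data (numbers and text) as plain text, and each line of the file is one data record. Many data files are saved in CSV format, so when Python or Matlab is used to handle large volumes of experimental or computed data, CSV read and write speed matters a great deal.

This test consists of:

1. Read a CSV file containing a 1e6 x 36 matrix of random numbers and record the read time.
2. Write the data that was read to a new CSV file and record the write time.
3. Repeat 10 times and compute the average read and write times.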
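To make the description of the CSV format concrete, here is a minimal sketch that writes a tiny table to CSV text in memory and prints it. The table contents and the column names `id`, `value` and `label` are made up for illustration and are not part of the test.

```python
import io

import pandas as pd

# Hypothetical 2 x 3 table, used only to show what CSV text looks like.
df = pd.DataFrame([[1, 2.5, "a"], [3, 4.5, "b"]], columns=["id", "value", "label"])

buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False: write only the table columns
csv_text = buf.getvalue()
print(csv_text)

# One header line, then one line per data record, fields separated by commas.
lines = csv_text.splitlines()
```

Each element of `lines` is one plain-text row, which is why a 1e6-row matrix turns into a file with about a million lines.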
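First, generate the data files used for the test.

Python:

    import numpy as np
    import pandas as pd

    N = 10          # number of data files
    n = 36          # columns per matrix
    m = int(1e6)    # rows per matrix

    for i in range(N):
        M = np.random.rand(m, n)
        M_df = pd.DataFrame(M)
        # Replace the placeholder with the directory where the data files are stored.
        # Forward slashes also work on Windows and avoid backslash escapes in the string.
        # By default, to_csv also writes the row index as an extra first column.
        M_df.to_csv(f"<storage path for the data files>/dataset{i}.csv")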
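One detail of the `to_csv` call above is worth checking on a small example: with default arguments pandas writes the row index into the file, so reading the file back gives one more column than the original matrix. The sketch below uses a 5 x 3 stand-in matrix and in-memory buffers, both chosen only for illustration, and shows two ways to avoid the extra column.

```python
import io

import numpy as np
import pandas as pd

M_df = pd.DataFrame(np.random.rand(5, 3))  # small stand-in for the 1e6 x 36 matrix

# Default to_csv: the row index is written as an unnamed first column.
buf = io.StringIO()
M_df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(with_index.shape)  # (5, 4): the index came back as an extra column

# Option 1: do not write the index at all.
buf = io.StringIO()
M_df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)
print(no_index.shape)  # (5, 3)

# Option 2: keep the index in the file, but tell read_csv which column it is.
buf = io.StringIO()
M_df.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)
print(restored.shape)  # (5, 3)
```

Whether the index column matters depends on the comparison being made; it adds a little to the file size and to the work done on each read and write.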
Matlab:

    m = 1e6;
    n = 36;
    N = 10;
    for i = 1:N
        data = rand(m, n);
        writematrix(data, "dataset" + num2str(i) + ".csv")
    end
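Before generating ten full-size files, it can help to estimate how large one of them will be. The following is a rough sketch for the pandas-generated file only: it writes a small sample to memory, measures bytes per row, and scales up to 1e6 rows. The sample size of 1000 rows is an arbitrary choice, and the estimate ignores the slightly longer index values in the full file.

```python
import io

import numpy as np
import pandas as pd

n = 36              # columns, as in the test matrix
m = int(1e6)        # rows in the full test matrix
sample_rows = 1000  # hypothetical sample size used for the estimate

# Write a small sample with the same to_csv defaults as the generation script.
sample = pd.DataFrame(np.random.rand(sample_rows, n))
buf = io.StringIO()
sample.to_csv(buf)
sample_bytes = len(buf.getvalue().encode("utf-8"))

# Scale the per-row size up to the full row count.
bytes_per_row = sample_bytes / sample_rows
estimated_mb = bytes_per_row * m / 1e6
print(f"about {bytes_per_row:.0f} bytes per row, roughly {estimated_mb:.0f} MB per file")
```

Most of the size comes from each random number being written at full double precision, which takes close to 20 characters per value including the separator.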
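The generated data files are over 600 MB in size, which should be representative of the data volume in most cases.

In Python, the files were read and written with Pandas, a third-party library commonly used for data analysis.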
    import time

    import numpy as np
    import pandas as pd

    N = 10
    read_time = []
    write_time = []

    for i in range(N):
        # Time reading one data file
        tic = time.time()
        M = pd.read_csv(f"<path the data files are read from>/dataset{i}.csv")
        read_time.append(time.time() - tic)
        print(f"read dataset{i}.csv")

        # Time writing the same data back to a new file
        tic = time.time()
        M.to_csv(f"<path the data files are written to>/dataset{i}_out.csv")
        write_time.append(time.time() - tic)
        print(f"wrote dataset{i}_out.csv")

    print(f"average read time: {np.mean(read_time)}")
    print(f"average write time: {np.mean(write_time)}")
Result of the run:
Matlab:

    clc; clear; close all;
    my_path = "<path the data files are written to>\data_out";
    data_path = "<path the data files are read from>\dataset";

    N = 10;
    read_time = zeros(N, 1);
    write_time = zeros(N, 1);
    for i = 1:N
        tic
        M = readtable(data_path + num2str(i) + ".csv");
        read_time(i) = toc;
        fprintf("read dataset%d.csv\n", i);

        tic
        % don't use csvwrite, it's limited to 5 significant digits of precision
        writetable(M, my_path + num2str(i) + ".csv");
        write_time(i) = toc;
        fprintf("wrote data_out%d.csv\n", i);
    end

    fprintf("average read time: %f\n", mean(read_time))
    fprintf("average write time: %f\n", mean(write_time))
Result of the run:

The comparison shows:

The test results above may differ depending on computer configuration and software versions. If you are interested, try it on your own computer.
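For a quick trial before committing to the full-size files, the Python side of the test can be scaled down. The sketch below is an assumption-laden variant, not the script used above: the function name `benchmark_csv`, its default sizes, the use of a temporary directory, and `time.perf_counter` are all choices made for this illustration. It also reuses a single file across repeats, so operating system caching may make reads look faster than in the ten-file test.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def benchmark_csv(rows=10_000, cols=36, repeats=3):
    """Time pandas CSV reads and writes on a small random matrix.

    rows, cols and repeats are hypothetical defaults chosen so the sketch
    finishes quickly; the full test uses 1e6 rows and 10 repeats.
    """
    read_time, write_time = [], []
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "dataset.csv"
        dst = Path(tmp) / "dataset_out.csv"
        pd.DataFrame(np.random.rand(rows, cols)).to_csv(src)

        for _ in range(repeats):
            # Time the read
            tic = time.perf_counter()
            M = pd.read_csv(src)
            read_time.append(time.perf_counter() - tic)

            # Time writing the same data back out
            tic = time.perf_counter()
            M.to_csv(dst)
            write_time.append(time.perf_counter() - tic)

    return float(np.mean(read_time)), float(np.mean(write_time))


mean_read, mean_write = benchmark_csv()
print(f"average read time:  {mean_read:.4f} s")
print(f"average write time: {mean_write:.4f} s")
```

Calling `benchmark_csv(rows=int(1e6), repeats=10)` would bring the sizes in line with the full test, at the cost of a much longer run.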