Comparing to PyMongo


In this guide, you can learn about the differences between PyMongoArrow and the PyMongo driver. This guide assumes familiarity with basic PyMongo and MongoDB concepts.

Reading Data

The most basic way to read data using PyMongo is:

import pyarrow

coll = db.benchmark
# Exclude _id so that pyarrow can infer an Arrow type for every field
f = list(coll.find({}, projection={"_id": 0}))
table = pyarrow.Table.from_pylist(f)

This works, but you must exclude the _id field. Otherwise, you get the following error:

pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type

The following code example shows a workaround for the preceding error when using PyMongo:

>>> f = list(coll.find({}))
>>> for doc in f:
...     doc["_id"] = str(doc["_id"])
...
>>> table = pyarrow.Table.from_pylist(f)
>>> print(table)
pyarrow.Table
_id: string
x: int64
y: double

Although this avoids the error, the drawback is that Arrow cannot identify _id as an ObjectId: the schema shows _id as a string.

PyMongoArrow supports BSON types through Arrow or Pandas extension types. This lets you avoid the preceding workaround.

>>> from pymongoarrow.api import Schema, find_arrow_all
>>> from pymongoarrow.types import ObjectIdType
>>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
>>> table = find_arrow_all(coll, {}, schema=schema)
>>> print(table)
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
x: int64
y: double

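With this method, Arrow correctly identifies the type. This is of limited use for non-numeric extension types, but it avoids unnecessary casting for certain operations, such as sorting datetimes.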

f = list(coll.find({}, projection={"_id": 0, "x": 0}))
naive_table = pyarrow.Table.from_pylist(f)
schema = Schema({"time": pyarrow.timestamp("ms")})
table = find_arrow_all(coll, {}, schema=schema)
assert (
    table.sort_by([("time", "ascending")])["time"]
    == naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
)

Additionally, PyMongoArrow supports Pandas extension types. With PyMongo, a Decimal128 value behaves as follows:

import pandas as pd
from bson import Decimal128

coll = client.test.test
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
cursor = coll.find({})
df = pd.DataFrame(list(cursor))
print(df.dtypes)
# _id      object
# value    object

The equivalent in PyMongoArrow is:

from pymongoarrow.api import find_pandas_all
coll = client.test.test
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
df = find_pandas_all(coll, {})
print(df.dtypes)
# _id      bson_PandasObjectId
# value    bson_PandasDecimal128

In both cases, the underlying values are instances of the BSON class:

print(df["value"][0])
# Decimal128("0")

Writing Data

Writing data from an Arrow table using PyMongo looks like the following:

data = arrow_table.to_pylist()
db.collname.insert_many(data)

The equivalent in PyMongoArrow is:

from pymongoarrow.api import write
write(db.collname, arrow_table)

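As of PyMongoArrow 1.0, the main advantage of using the write function is that it iterates over the Arrow table, data frame, or NumPy array and does not convert the entire object to a list.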
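To illustrate why iterating instead of materializing matters, here is a minimal sketch of batch-wise iteration in plain Python. It is not PyMongoArrow's actual implementation; iter_batches and batch_size are hypothetical names chosen for this example, and a generator of dictionaries stands in for the rows of a table.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size rows without materializing all rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # Pull only the next batch_size rows from the source.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Rows are produced lazily, so only one batch is built at a time.
rows = ({"x": i} for i in range(5))
batches = list(iter_batches(rows, batch_size=2))
```

In an application, each yielded batch could be passed to insert_many in turn, so peak memory depends on the batch size and not on the size of the whole data set.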

Benchmarks

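The following measurements were taken with PyMongoArrow version 1.0 and PyMongo version 4.4. For insertions, the library performs about the same as conventional PyMongo and uses about the same amount of memory.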

ProfileInsertSmall.peakmem_insert_conventional      107M
ProfileInsertSmall.peakmem_insert_arrow             108M
ProfileInsertSmall.time_insert_conventional         202±0.8ms
ProfileInsertSmall.time_insert_arrow                181±0.4ms
ProfileInsertLarge.peakmem_insert_arrow             127M
ProfileInsertLarge.peakmem_insert_conventional      125M
ProfileInsertLarge.time_insert_arrow                425±1ms
ProfileInsertLarge.time_insert_conventional         440±1ms

For reads, the library is slower for small documents and nested documents, but faster for large documents. It uses less memory in all cases.

ProfileReadSmall.peakmem_conventional_arrow     85.8M
ProfileReadSmall.peakmem_to_arrow               83.1M
ProfileReadSmall.time_conventional_arrow        38.1±0.3ms
ProfileReadSmall.time_to_arrow                  60.8±0.3ms
ProfileReadLarge.peakmem_conventional_arrow     138M
ProfileReadLarge.peakmem_to_arrow               106M
ProfileReadLarge.time_conventional_ndarray      243±20ms
ProfileReadLarge.time_to_arrow                  186±0.8ms
ProfileReadDocument.peakmem_conventional_arrow  209M
ProfileReadDocument.peakmem_to_arrow            152M
ProfileReadDocument.time_conventional_arrow     865±7ms
ProfileReadDocument.time_to_arrow               937±1ms
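To make the read comparison easier to interpret, the short sketch below computes the relative difference between the two approaches from the figures listed above. The numbers are copied from the table; relative_change is a helper name introduced here, and the large-document conventional timing is the entry labelled time_conventional_ndarray.

```python
# Timings (ms) and peak memory (M, as reported) copied from the read benchmarks.
read_results = {
    "small": {"conv_ms": 38.1, "arrow_ms": 60.8, "conv_mem": 85.8, "arrow_mem": 83.1},
    "large": {"conv_ms": 243.0, "arrow_ms": 186.0, "conv_mem": 138.0, "arrow_mem": 106.0},
    "document": {"conv_ms": 865.0, "arrow_ms": 937.0, "conv_mem": 209.0, "arrow_mem": 152.0},
}


def relative_change(new: float, old: float) -> float:
    """Return the change of new relative to old, as a percentage."""
    return (new - old) / old * 100.0


summary = {}
for name, r in read_results.items():
    time_pct = relative_change(r["arrow_ms"], r["conv_ms"])
    mem_pct = relative_change(r["arrow_mem"], r["conv_mem"])
    summary[name] = (time_pct, mem_pct)
    # Negative values mean PyMongoArrow took less time or memory.
    print(f"{name}: time {time_pct:+.1f}%, peak memory {mem_pct:+.1f}%")
```

A positive time percentage with a negative memory percentage, as for the small and nested-document cases, reflects the trade-off described above: slower reads but a lower peak memory footprint.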
