torcharrow.Column.map

Column.map(arg: ty.Union[ty.Dict, ty.Callable], na_action: ty.Literal['ignore', None] = None, dtype: ty.Optional[dt.DType] = None, columns: ty.Optional[ty.List[str]] = None)

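Maps rows according to input correspondence.

Parameters:

arg (dict or callable) – If arg is a dict, the input is mapped through that dict, and values without a mapping become null. If arg is a callable, it is treated as a user-defined function (UDF) and invoked on each element of the input. The callable must be a global function or a method on a class instance; lambdas are not supported.
na_action ("ignore" or None, default None) – If your UDF returns null for null input, selecting "ignore" is an efficiency improvement: map then avoids calling your UDF on null values. If None, the UDF is always called.
dtype (DType, default None) – DType used to force the output type. It is required if the result type != item type.
columns (list of column names, default None) – Determines which columns are provided to the mapping dict or UDF.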
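
The parameter semantics above can be modelled in plain Python. The sketch below is not TorchArrow's implementation: `map_values` is a hypothetical helper that works on a Python list, uses None for null, ignores dtype and columns, and does not model defaultdict.

```python
from typing import Any, Callable, Dict, List, Optional, Union


def map_values(
    values: List[Any],
    arg: Union[Dict, Callable],
    na_action: Optional[str] = None,
) -> List[Any]:
    """Hypothetical pure-Python model of the map semantics (not TorchArrow code)."""
    if na_action not in ("ignore", None):
        raise ValueError("na_action must be 'ignore' or None")
    result = []
    for value in values:
        if value is None and na_action == "ignore":
            # Nulls pass through without consulting the dict or calling the UDF.
            result.append(None)
        elif isinstance(arg, dict):
            # Values without a mapping become null.
            result.append(arg.get(value))
        else:
            # Any other arg is treated as a UDF and called once per element.
            result.append(arg(value))
    return result


def add_ten(num):
    return num + 10


mapped_by_dict = map_values([1, 2, None, 4], {1: 111})
mapped_by_udf = map_values([1, 2, None, 4], add_ten, na_action="ignore")
print(mapped_by_dict)  # [111, None, None, None]
print(mapped_by_udf)   # [11, 12, None, 14]
```

The model leaves out typing entirely; in TorchArrow the result is a typed column, which is why dtype matters when the UDF changes the element type.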

See also

flatmap, filter

Examples

>>> import torcharrow as ta
>>> ta.column([1,2,None,4]).map({1:111})
0  111
1  None
2  None
3  None
dtype: Int64(nullable=True), length: 4, null_count: 3

Using a defaultdict to provide a value for missing keys:

>>> from collections import defaultdict
>>> ta.column([1,2,None,4]).map(defaultdict(lambda: -1, {1:111}))
0  111
1   -1
2   -1
3   -1
dtype: Int64(nullable=True), length: 4, null_count: 0
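
The difference from the plain-dict example comes from how defaultdict treats missing keys in ordinary Python. The sketch below uses only the standard library and makes no claim about how TorchArrow performs the lookup internally; it only shows that indexing a defaultdict calls the default factory, while dict.get does not.

```python
from collections import defaultdict

mapping = defaultdict(lambda: -1, {1: 111})

# dict.get never calls the default factory, so a missing key yields None.
via_get = [mapping.get(key) for key in [1, 2, None, 4]]

# Indexing calls the factory for a missing key (None is a valid key too).
via_index = [mapping[key] for key in [1, 2, None, 4]]

print(via_get)    # [111, None, None, None]
print(via_index)  # [111, -1, -1, -1]
```

Note that indexing also stores the default under the missing key, so `mapping` holds four entries after the second comprehension.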

Using a user-supplied Python function:

>>> def add_ten(num):
...     return num + 10
...
>>> ta.column([1,2,None,4]).map(add_ten, na_action='ignore')
0  11
1  12
2  None
3  14
dtype: Int64(nullable=True), length: 4, null_count: 1

Note that .map(add_ten, na_action=None) in the example above would fail with a type error, since add_ten is not defined for None/null. To pass nulls to a UDF, the UDF needs to be prepared for them:

>>> def add_ten_or_0(num):
...     return 0 if num is None else num + 10
...
>>> ta.column([1,2,None,4]).map(add_ten_or_0, na_action=None)
0  11
1  12
2   0
3  14
dtype: Int64(nullable=True), length: 4, null_count: 0
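
The failure mode of an unguarded UDF can be reproduced in plain Python, without TorchArrow. This is a minimal sketch that calls the two functions directly on None; the `failure` variable is only there for illustration.

```python
def add_ten(num):
    return num + 10


def add_ten_or_0(num):
    return 0 if num is None else num + 10


try:
    add_ten(None)
    failure = None
except TypeError as exc:
    # None + 10 is not defined, so the unguarded UDF raises.
    failure = type(exc).__name__

print(failure)             # TypeError
print(add_ten_or_0(None))  # 0
```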

Mapping to a different type requires the dtype parameter:

>>> import torcharrow.dtypes as dt
>>> ta.column([1,2,None,4]).map(str, dtype=dt.string)
0  '1'
1  '2'
2  'None'
3  '4'
dtype: string, length: 4, null_count: 0
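
The 'None' entry in row 2 follows from ordinary Python: str(None) is the four-character string 'None', not a null, which is why null_count drops to 0. The sketch below shows this on a plain list, together with a hypothetical null-preserving wrapper, `str_or_none`; whether that behaviour is wanted depends on the data.

```python
from typing import Any, Optional


def str_or_none(value: Any) -> Optional[str]:
    """Hypothetical UDF: stringify non-null values, keep nulls as None."""
    return None if value is None else str(value)


values = [1, 2, None, 4]
plain = [str(v) for v in values]
guarded = [str_or_none(v) for v in values]

print(plain)    # ['1', '2', 'None', '4']
print(guarded)  # ['1', '2', None, '4']
```

Going by the na_action description, passing na_action='ignore' should have a similar effect without a wrapper, since the UDF is then not called on nulls.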

Mapping over a DataFrame: the UDF receives the whole row as a tuple:

>>> def add_unary(tup):
...     return tup[0]+tup[1]
...
>>> ta.dataframe({'a': [1,2,3], 'b': [1,2,3]}).map(add_unary, dtype=dt.int64)
0  2
1  4
2  6
dtype: int64, length: 3, null_count: 0
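
What "the whole row as a tuple" means can be illustrated on plain Python data. This is a rough model only, assuming a dict of equal-length lists stands in for the DataFrame; it does not use or describe TorchArrow internals.

```python
def add_unary(tup):
    return tup[0] + tup[1]


data = {"a": [1, 2, 3], "b": [1, 2, 3]}

# Build one tuple per row, in column order, then call the UDF on each tuple.
rows = list(zip(*data.values()))
row_sums = [add_unary(row) for row in rows]

print(rows)      # [(1, 1), (2, 2), (3, 3)]
print(row_sums)  # [2, 4, 6]
```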

Multi-parameter UDFs:

>>> def add_binary(a,b):
...     return a + b
...
>>> ta.dataframe({'a': [1,2,3], 'b': ['a', 'b', 'c'], 'c':[1,2,3]}).map(add_binary, columns = ['a','c'], dtype = dt.int64)
0  2
1  4
2  6
dtype: int64, length: 3, null_count: 0
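
The effect of the columns argument with a multi-parameter UDF can be sketched in plain Python. The model below assumes a dict of lists in place of the DataFrame and is not TorchArrow code: only the listed columns are kept, and each row is unpacked into positional arguments.

```python
def add_binary(a, b):
    return a + b


data = {"a": [1, 2, 3], "b": ["a", "b", "c"], "c": [1, 2, 3]}
columns = ["a", "c"]

# Keep only the requested columns, then unpack each row into positional arguments.
selected_rows = zip(*(data[name] for name in columns))
sums = [add_binary(*row) for row in selected_rows]

print(sums)  # [2, 4, 6]
```

The string column 'b' never reaches the UDF, so it does not matter that it could not be added to an integer.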

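Multi-return UDFs: functions that return more than one column can be specified by returning a DataFrame (also known as a struct column); providing the return dtype is mandatory: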

>>> ta.dataframe({'a': [17, 29, 30], 'b': [3,5,11]}).map(divmod, columns= ['a','b'], dtype = dt.Struct([dt.Field('quotient', dt.int64), dt.Field('remainder', dt.int64)]))
  index    quotient    remainder
-------  ----------  -----------
      0           5            2
      1           5            4
      2           2            8
dtype: Struct([Field('quotient', int64), Field('remainder', int64)]), count: 3, null_count: 0
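
The quotient and remainder columns come straight from the tuple that Python's built-in divmod returns. The sketch below reproduces the table values on plain lists; `DivMod` is a hypothetical NamedTuple that stands in for the struct dtype and is not a TorchArrow type.

```python
from typing import NamedTuple


class DivMod(NamedTuple):
    """Hypothetical record type mirroring the two struct fields."""
    quotient: int
    remainder: int


a_values = [17, 29, 30]
b_values = [3, 5, 11]

# divmod returns a (quotient, remainder) tuple for each pair of inputs.
results = [DivMod(*divmod(a, b)) for a, b in zip(a_values, b_values)]

for row in results:
    print(row.quotient, row.remainder)
# 5 2
# 5 4
# 2 8
```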

UDFs with state can be written by capturing the state in a (data)class and using a method as a delegate:

>>> def fib(n):
...     if n == 0:
...         return 0
...     elif n == 1 or n == 2:
...         return 1
...     else:
...         return fib(n-1) + fib(n-2)
...
>>> from dataclasses import dataclass
>>> @dataclass
... class State:
...     state: int
...     def __post_init__(self):
...         self.state = fib(self.state)
...     def add_fib(self, x):
...         return self.state+x
...
>>> m = State(10)
>>> ta.column([1,2,3]).map(m.add_fib)
0  56
1  57
2  58
dtype: int64, length: 3, null_count: 0
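
Why a method works as a stateful UDF can be seen in plain Python: a bound method carries its instance, and `__post_init__` runs once per instance rather than once per element. The sketch below is a standalone illustration that swaps in an iterative fib for the recursive one used above; it does not call TorchArrow.

```python
from dataclasses import dataclass


def fib(n: int) -> int:
    """Iterative Fibonacci; same values as the recursive version for n >= 0."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


@dataclass
class State:
    state: int

    def __post_init__(self):
        # Runs once per instance, so fib is not recomputed per element.
        self.state = fib(self.state)

    def add_fib(self, x: int) -> int:
        return self.state + x


m = State(10)
udf = m.add_fib  # bound method: carries m along with it

print(m.state)                      # 55
print([udf(x) for x in [1, 2, 3]])  # [56, 57, 58]
```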
