
Conversation

@jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Jan 26, 2021

Rationale

Currently, all our arrays use an Arc<ArrayData>, which they expose via Array::data and Array::data_ref. This adds a level of indirection. As far as I know, the Arc<> is not actually needed anywhere in the current code base.

On #9271 we are observing some performance issues with small arrays, and one of the ideas that came up was to get rid of the Arc and see what happens.

This PR

Well, this PR replaces all Arc<ArrayData> by ArrayData. On the one hand, this means that cloning an array is a tad more expensive (cloning ArrayData vs cloning Arc). On the other hand, it means that the compiler can often optimize the indirection away.

The gist of the benchmarks below is:

  • ~10%-20% improvement over basically everything
  • ~20%-100% improvement in take

There is some noise: some benches that are not expected to be affected show differences anyway.

Personally, I like this PR because it makes working with ArrayData and arrays simpler: no need for Arc::new or as_ref and company (besides the speed).

Questions

  • Does anyone know why we are using Arc<ArrayData> in all arrays?
  • Do you envision an issue with removing the Arc?
  • Would someone be so kind as to run the benches independently, just to be sure?

Benchmarks

# modify Cargo.toml by adding `bench = false` to the [lib] section

git checkout master
cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`

git checkout arcless
cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`
Jorges-MacBook-Pro-2:arrow jorgecarleitao$ critcmp master-437c8c944 arcless-3dbcaca49 -t 10
group                                         arcless-3dbcaca49                       master-437c8c944
-----                                         -----------------                       ----------------
add 512                                       1.00    435.3±8.19ns        ? B/sec     1.30   565.7±81.95ns        ? B/sec
add_nulls_512                                 1.00   451.9±15.41ns        ? B/sec     1.29   581.9±98.83ns        ? B/sec
and                                           1.00  1516.9±39.34ns        ? B/sec     1.21  1842.4±190.71ns        ? B/sec
array_from_vec 128                            1.00   932.0±35.13ns        ? B/sec     1.51  1411.2±475.00ns        ? B/sec
array_from_vec 256                            1.00  1125.6±24.63ns        ? B/sec     1.23  1382.1±201.90ns        ? B/sec
array_from_vec 512                            1.00  1519.9±52.92ns        ? B/sec     1.24  1877.9±368.40ns        ? B/sec
array_slice 128                               1.00    293.3±4.85ns        ? B/sec     1.61   471.3±94.42ns        ? B/sec
array_slice 2048                              1.00    319.7±8.25ns        ? B/sec     1.50  478.2±293.70ns        ? B/sec
array_slice 512                               1.00    293.8±5.91ns        ? B/sec     1.69  496.1±145.06ns        ? B/sec
array_string_from_vec 128                     1.00      3.2±0.10µs        ? B/sec     1.35      4.3±1.68µs        ? B/sec
array_string_from_vec 256                     1.00      4.1±0.13µs        ? B/sec     1.18      4.9±0.94µs        ? B/sec
array_string_from_vec 512                     1.00      5.9±0.11µs        ? B/sec     1.26      7.4±5.14µs        ? B/sec
bench_bool/bench_bool                         1.00      2.0±0.03ms   245.8 MB/sec     1.12      2.3±0.30ms   219.7 MB/sec
buffer_bit_ops and                            1.00   577.0±11.19ns        ? B/sec     1.14  658.5±185.85ns        ? B/sec
cast date32 to date64 512                     1.00      6.8±0.21µs        ? B/sec     1.18      8.0±1.25µs        ? B/sec
cast date64 to date32 512                     1.00      6.8±0.42µs        ? B/sec     1.27      8.6±2.22µs        ? B/sec
cast f32 to string 512                        1.00     52.2±1.37µs        ? B/sec     1.17     61.0±9.24µs        ? B/sec
cast float32 to int32 512                     1.00      2.9±0.07µs        ? B/sec     1.20      3.4±0.52µs        ? B/sec
cast float64 to float32 512                   1.00      3.0±0.07µs        ? B/sec     1.18      3.6±0.47µs        ? B/sec
cast float64 to uint64 512                    1.00      3.4±0.19µs        ? B/sec     1.45      5.0±0.72µs        ? B/sec
cast int32 to float32 512                     1.00      2.9±0.09µs        ? B/sec     1.11      3.3±0.52µs        ? B/sec
cast int32 to float64 512                     1.00      2.7±0.12µs        ? B/sec     1.27      3.4±0.77µs        ? B/sec
cast int32 to int64 512                       1.00      2.9±0.06µs        ? B/sec     1.26      3.6±0.67µs        ? B/sec
cast int32 to uint32 512                      1.00      2.7±0.08µs        ? B/sec     1.23      3.3±0.32µs        ? B/sec
cast int64 to int32 512                       1.00      2.7±0.05µs        ? B/sec     1.23      3.3±0.89µs        ? B/sec
cast time32s to time32ms 512                  1.00  1487.8±52.94ns        ? B/sec     1.11  1648.3±145.69ns        ? B/sec
cast time32s to time64us 512                  1.00      4.8±0.12µs        ? B/sec     1.26      6.0±0.82µs        ? B/sec
cast time64ns to time32s 512                  1.00      9.9±0.24µs        ? B/sec     1.24     12.3±1.87µs        ? B/sec
cast timestamp_ms to i64 512                  1.00   286.7±13.15ns        ? B/sec     1.66   474.9±72.34ns        ? B/sec
cast timestamp_ms to timestamp_ns 512         1.00  1961.6±38.52ns        ? B/sec     1.41      2.8±4.01µs        ? B/sec
cast timestamp_ns to timestamp_s 512          1.00     26.2±0.31ns        ? B/sec     1.12     29.4±3.74ns        ? B/sec
cast utf8 to date32 512                       1.00     43.9±1.05µs        ? B/sec     1.15    50.7±10.33µs        ? B/sec
cast utf8 to f32                              1.00     30.9±0.81µs        ? B/sec     1.21     37.5±5.32µs        ? B/sec
concat str 1024                               1.00      9.9±0.54µs        ? B/sec     1.34     13.2±5.26µs        ? B/sec
divide 512                                    1.00  1411.3±40.15ns        ? B/sec     1.13  1599.1±200.00ns        ? B/sec
divide_nulls_512                              1.00  1419.2±28.43ns        ? B/sec     1.12  1587.9±142.46ns        ? B/sec
eq Float32                                    1.00    103.3±3.33µs        ? B/sec     1.25   129.4±19.72µs        ? B/sec
equal_512                                     1.00     19.2±1.00ns        ? B/sec     2.58    49.5±12.56ns        ? B/sec
equal_bool_512                                1.00     19.1±1.39ns        ? B/sec     2.17     41.5±2.29ns        ? B/sec
equal_bool_513                                1.00     21.1±0.70ns        ? B/sec     2.15     45.4±3.84ns        ? B/sec
equal_nulls_512                               1.00      2.4±0.11µs        ? B/sec     1.11      2.7±0.37µs        ? B/sec
equal_string_512                              1.00     84.5±5.08ns        ? B/sec     1.32    111.9±5.30ns        ? B/sec
equal_string_nulls_512                        1.00      3.6±0.48µs        ? B/sec     1.11      4.0±1.29µs        ? B/sec
filter context f32 high selectivity           1.00   308.9±10.81µs        ? B/sec     1.11   343.2±57.50µs        ? B/sec
filter context f32 low selectivity            1.00      2.7±0.08µs        ? B/sec     1.10      3.0±0.43µs        ? B/sec
filter context string high selectivity        1.00  1091.3±26.04µs        ? B/sec     1.13  1238.6±174.82µs        ? B/sec
filter context u8                             1.00    231.6±4.05µs        ? B/sec     1.21  281.0±112.36µs        ? B/sec
filter context u8 w NULLs                     1.00   500.1±20.38µs        ? B/sec     1.17  586.5±155.02µs        ? B/sec
filter context u8 w NULLs high selectivity    1.00    295.2±5.91µs        ? B/sec     1.17   344.2±63.42µs        ? B/sec
filter context u8 w NULLs low selectivity     1.00      2.9±0.51µs        ? B/sec     1.21      3.5±5.02µs        ? B/sec
filter f32                                    1.00   797.4±33.25µs        ? B/sec     1.22  971.4±135.07µs        ? B/sec
filter u8 low selectivity                     1.00      7.9±0.73µs        ? B/sec     1.12      8.9±1.47µs        ? B/sec
from_slice prepared                           1.15   961.1±24.81µs        ? B/sec     1.00   834.2±59.41µs        ? B/sec
gt Float32                                    1.00    66.2±14.03µs        ? B/sec     1.28     84.6±9.31µs        ? B/sec
gt scalar Float32                             1.49     60.9±8.59µs        ? B/sec     1.00     40.8±4.06µs        ? B/sec
gt_eq Float32                                 1.00   117.3±31.66µs        ? B/sec     1.10   129.6±25.35µs        ? B/sec
json_list_primitive_to_record_batch           1.00     63.3±2.09µs        ? B/sec     1.16    73.4±11.01µs        ? B/sec
json_primitive_to_record_batch                1.00     25.1±0.82µs        ? B/sec     1.11     27.7±4.88µs        ? B/sec
length                                        1.00      2.9±0.10µs        ? B/sec     1.30      3.7±0.77µs        ? B/sec
like_utf8 scalar ends with                    1.00   246.5±13.62µs        ? B/sec     1.19   294.5±45.77µs        ? B/sec
like_utf8 scalar equals                       1.17     98.3±7.08µs        ? B/sec     1.00    83.7±10.13µs        ? B/sec
limit 512, 512                                1.00    291.9±5.45ns        ? B/sec     1.55  451.1±130.11ns        ? B/sec
lt Float32                                    1.00    70.0±19.63µs        ? B/sec     1.26    88.3±12.27µs        ? B/sec
lt scalar Float32                             1.00     62.6±3.01µs        ? B/sec     1.39    87.3±25.97µs        ? B/sec
lt_eq Float32                                 1.00    104.8±8.32µs        ? B/sec     1.43   149.9±56.97µs        ? B/sec
lt_eq scalar Float32                          1.00     82.1±3.59µs        ? B/sec     1.11    91.3±12.27µs        ? B/sec
max nulls 512                                 1.00  1378.4±51.28ns        ? B/sec     1.32  1820.6±260.46ns        ? B/sec
min 512                                       1.00  1510.6±13.17ns        ? B/sec     1.28  1938.2±515.72ns        ? B/sec
min nulls 512                                 1.00  1380.2±51.78ns        ? B/sec     1.44  1980.7±259.55ns        ? B/sec
min nulls string 512                          1.00      7.1±0.12µs        ? B/sec     1.25      8.9±1.66µs        ? B/sec
multiply 512                                  1.00   461.5±42.95ns        ? B/sec     1.30  601.0±102.46ns        ? B/sec
mutable                                       1.00   475.9±15.21µs        ? B/sec     1.19   566.7±77.31µs        ? B/sec
mutable prepared                              1.00   530.8±21.36µs        ? B/sec     1.17   621.0±60.61µs        ? B/sec
mutable str 1024                              1.00  1493.7±56.88µs        ? B/sec     1.19  1776.6±329.88µs        ? B/sec
mutable str nulls 1024                        1.00      4.6±0.30ms        ? B/sec     1.13      5.2±0.51ms        ? B/sec
neq Float32                                   1.00     65.8±1.70µs        ? B/sec     1.39    91.5±20.30µs        ? B/sec
neq scalar Float32                            1.47    97.6±50.68µs        ? B/sec     1.00     66.4±6.21µs        ? B/sec
nlike_utf8 scalar contains                    1.00      2.4±0.03ms        ? B/sec     1.16      2.8±0.41ms        ? B/sec
nlike_utf8 scalar equals                      1.00   218.8±19.42µs        ? B/sec     1.14   249.2±24.71µs        ? B/sec
not                                           1.00   955.5±27.28ns        ? B/sec     1.15  1102.4±153.08ns        ? B/sec
or                                            1.00  1497.9±37.68ns        ? B/sec     1.15  1726.6±88.42ns        ? B/sec
sort 2^10                                     1.00   153.9±11.62µs        ? B/sec     1.14   175.6±25.84µs        ? B/sec
sort 2^12                                     1.00   748.9±51.51µs        ? B/sec     1.13  849.8±109.51µs        ? B/sec
struct_array_from_vec 256                     1.00      7.6±0.23µs        ? B/sec     1.12      8.5±2.51µs        ? B/sec
struct_array_from_vec 512                     1.00     10.0±0.73µs        ? B/sec     1.24     12.4±3.51µs        ? B/sec
subtract 512                                  1.00    434.8±9.25ns        ? B/sec     1.32   574.3±37.34ns        ? B/sec
sum 512                                       1.00   524.8±12.74ns        ? B/sec     1.19  626.2±156.66ns        ? B/sec
take bool 1024                                1.00      3.6±0.27µs        ? B/sec     1.45      5.2±3.11µs        ? B/sec
take bool 512                                 1.00      2.1±0.10µs        ? B/sec     1.23      2.6±0.32µs        ? B/sec
take bool nulls 1024                          1.00      5.1±2.09µs        ? B/sec     1.60      8.1±2.08µs        ? B/sec
take bool nulls 512                           1.00      2.1±0.14µs        ? B/sec     1.82      3.9±0.77µs        ? B/sec
take i32 1024                                 1.00  1583.0±111.03ns        ? B/sec    1.49      2.4±0.58µs        ? B/sec
take i32 512                                  1.00  1058.3±88.60ns        ? B/sec     1.29  1364.4±97.02ns        ? B/sec
take i32 nulls 1024                           1.00  1527.5±54.16ns        ? B/sec     1.82      2.8±1.00µs        ? B/sec
take i32 nulls 512                            1.00  1086.4±124.91ns        ? B/sec    2.34      2.5±1.64µs        ? B/sec
take str 1024                                 1.00      5.9±0.51µs        ? B/sec     1.20      7.1±1.30µs        ? B/sec
take str 512                                  1.00      3.8±0.36µs        ? B/sec     1.21      4.6±0.96µs        ? B/sec
take str null indices 1024                    1.00      5.6±0.16µs        ? B/sec     1.27      7.2±1.63µs        ? B/sec
take str null indices 512                     1.00      3.7±0.31µs        ? B/sec     1.23      4.6±1.32µs        ? B/sec
take str null values 1024                     1.00      5.7±0.14µs        ? B/sec     1.25      7.1±1.02µs        ? B/sec
take str null values null indices 1024        1.00     12.8±0.39µs        ? B/sec     1.14     14.6±1.85µs        ? B/sec

@jorgecarleitao jorgecarleitao changed the title [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays (1.2-2x speedup) Jan 26, 2021
@jorgecarleitao
Member Author

cc @Dandandan ^_^

@Dandandan
Contributor

Nice @jorgecarleitao !! Looking forward to seeing the impact in DataFusion on the slicing of smaller arrays, and maybe other parts as well. I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels based only on removing / not creating or cloning the Arc. Any idea why that could be the case?

Ergonomically it is definitely an improvement; the ArrayData struct is small and should be cheap to clone, and this avoids quite some boilerplate.

@jorgecarleitao jorgecarleitao changed the title [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays (1.2-2x speedup) [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays Jan 26, 2021
@jorgecarleitao
Member Author

I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels only based on removing / not creating or cloning the Arc. Any idea why that could be the case?

I remember that we had a similar issue some time ago where we were using data instead of data_ref, and that was causing a significant performance difference.

My general approach here is to write simpler code in the hope that it helps the compiler.

In practice, I humbly make these offerings to the gods of LLVM and pray for a sign.

@yordan-pavlov
Contributor

@jorgecarleitao that's an interesting idea; simplification makes sense, especially if cloning ArrayData is cheap and there is no obvious reason to use Arc<ArrayData>.

I ran the filter benchmarks with SIMD enabled and got the following results (in comparison to master).
Possibly some improvement, but the results are not very clear - often the differences are small.

filter u8 time: [556.89 us 562.29 us 568.54 us]
change: [-3.0769% +0.2211% +3.2221%] (p = 0.89 > 0.05)

filter u8 high selectivity
time: [13.859 us 13.959 us 14.081 us]
change: [-2.3832% +0.3285% +3.0131%] (p = 0.81 > 0.05)

filter u8 low selectivity
time: [7.7940 us 7.8819 us 7.9793 us]
change: [+8.3043% +11.999% +16.830%] (p = 0.00 < 0.05)

filter context u8 time: [205.24 us 209.06 us 213.16 us]
change: [+0.4175% +5.7203% +11.914%] (p = 0.04 < 0.05)

filter context u8 high selectivity
time: [3.7003 us 3.7456 us 3.8010 us]
change: [-3.2048% -0.7748% +1.5407%] (p = 0.55 > 0.05)

filter context u8 low selectivity
time: [1.3066 us 1.3243 us 1.3443 us]
change: [-3.4133% -1.5255% +0.1897%] (p = 0.11 > 0.05)

filter context u8 w NULLs
time: [496.69 us 501.24 us 506.60 us]
change: [-8.8715% -4.9662% -1.0448%] (p = 0.02 < 0.05)

filter context u8 w NULLs high selectivity
time: [254.92 us 258.17 us 262.51 us]
change: [+8.6837% +13.067% +17.583%] (p = 0.00 < 0.05)

filter context u8 w NULLs low selectivity
time: [1.9758 us 2.0159 us 2.0623 us]
change: [-10.590% -7.4822% -4.2891%] (p = 0.00 < 0.05)

filter f32 time: [779.40 us 789.32 us 801.26 us]
change: [-17.189% -13.338% -9.2795%] (p = 0.00 < 0.05)

filter context f32 time: [485.95 us 495.16 us 506.07 us]
change: [-7.7798% -3.5929% +0.9816%] (p = 0.12 > 0.05)

filter context f32 high selectivity
time: [258.28 us 259.96 us 261.83 us]
change: [+2.0252% +5.7103% +9.4974%] (p = 0.00 < 0.05)

filter context f32 low selectivity
time: [1.9878 us 2.0332 us 2.0857 us]
change: [-7.9039% -4.8554% -1.8226%] (p = 0.00 < 0.05)

filter context string time: [639.51 us 646.56 us 655.15 us]
change: [-9.9164% -5.9396% -2.2137%] (p = 0.00 < 0.05)

filter context string high selectivity
time: [1.0009 ms 1.0168 ms 1.0344 ms]
change: [-5.8412% -1.1777% +4.1292%] (p = 0.65 > 0.05)

filter context string low selectivity
time: [2.9339 us 2.9668 us 3.0060 us]
change: [-6.4880% -2.3484% +1.7663%] (p = 0.29 > 0.05)

Contributor

If both data() and data_ref() return &ArrayData, maybe data_ref() alone is enough and data() should be removed; or possibly change data() to return ArrayData.

Contributor

I understand why we wanted to avoid Array::data().clone() to get ArrayData, but with these 2 functions returning the same reference, I'd opt to make Array::data() -> ArrayData

Contributor

@yordan-pavlov I think it's safer to deprecate data(), then remove it later. We still use it a lot in the codebase, but it's often more performant to use data_ref(), so returning ArrayData doesn't guide users toward the faster path.

Contributor

Yes, deprecating data() is what I meant; immediate removal is not great for backwards compatibility.

@jhorstmann
Contributor

How cheap cloning the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include:

  • the datatype enum, which can include boxes/vecs of nested types and the field metadata
  • the vec of buffers, and cloning a buffer would include cloning the Arc<Bytes>

So for primitive types, removing the Arc should be a benefit because of less indirection when accessing it, though it would still need to allocate vectors and clone the Arc inside each Buffer.

For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.

@jorgecarleitao
Member Author

How cheap cloning the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include:

* the datatype enum, which can include boxes/vecs of nested types and the field metadata

* the vec of buffers, and cloning a buffer would include cloning the `Arc<Bytes>`

So for primitive types, removing the Arc should be a benefit because of less indirection when accessing it, though it would still need to allocate vectors and clone the Arc inside each Buffer.

For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.

I am sorry, I am not sure I follow the arguments:

  • DataType: all our operations already clone DataType, since they create a new ArrayData. Even something as simple as slicing an array requires cloning a DataType: we need to increase the offset, which requires a new ArrayData, which requires a new DataType.
  • Arc<Bytes>: I do not see how using Arc<Bytes> is relevant to whether we should use Arc<ArrayData>. Regardless, as far as I understand, its usage is really important: Bytes is an immutable memory region, and sharing it between arrays imo makes a lot of sense; the most common case is when the null buffer does not change due to an operation. So the two natural options are Rc and Arc, and allowing arrays to be sharable across thread boundaries seems like a good justification for Arc.

@jhorstmann
Contributor

all our operations already clone DataType, since they create a new ArrayData

You are right, I wasn't looking closely enough at the code, or thought only about ArrayData::clone vs Arc<ArrayData>::clone. I still think there might be some overhead there, and it might be worth experimenting with an Arc<DataType> or replacing some of the Box in the DataType enum with Arc in a separate PR. For nested types like List<Dictionary<Int32, Utf8>> I think cloning the type currently involves more allocations than cloning the vec of buffers/child data.

@jorgecarleitao jorgecarleitao changed the title [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays ARROW-11511: [Rust] Replace Arc<ArrayData> by ArrayData in all arrays Feb 5, 2021
@github-actions

github-actions bot commented Feb 5, 2021

@jorgecarleitao
Copy link
Member Author

@nevi-me @sunchao , I would appreciate your feedback here: I can't find a rationale to keep Arc<ArrayData>, but I do not have all the background, and you are probably the most experienced people to give a qualified answer here. I marked you as reviewers as well.

@codecov-io

codecov-io commented Feb 6, 2021

Codecov Report

Merging #9329 (6cdec30) into master (a3e1513) will decrease coverage by 0.28%.
The diff coverage is 92.57%.


@@            Coverage Diff             @@
##           master    #9329      +/-   ##
==========================================
- Coverage   82.32%   82.03%   -0.29%     
==========================================
  Files         245      245              
  Lines       56277    55927     -350     
==========================================
- Hits        46328    45879     -449     
- Misses       9949    10048      +99     
Impacted Files Coverage Δ
rust/arrow/src/ipc/writer.rs 87.44% <ø> (+0.21%) ⬆️
rust/arrow/src/json/writer.rs 87.91% <ø> (+1.74%) ⬆️
...t/datafusion/src/physical_plan/math_expressions.rs 27.27% <ø> (ø)
rust/integration-testing/src/lib.rs 0.00% <0.00%> (ø)
rust/arrow/src/array/array.rs 76.80% <50.00%> (+0.49%) ⬆️
rust/arrow/src/array/data.rs 76.87% <55.55%> (-2.00%) ⬇️
rust/parquet/src/arrow/array_reader.rs 77.61% <66.66%> (ø)
rust/arrow/src/ipc/reader.rs 84.36% <83.33%> (ø)
rust/arrow/src/json/reader.rs 83.45% <83.33%> (-0.06%) ⬇️
rust/arrow/src/array/array_binary.rs 84.53% <85.71%> (-7.09%) ⬇️
... and 58 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a3e1513...6cdec30.

Contributor

@alamb alamb left a comment

I went over this PR carefully and it makes sense to me. I am happy with the performance numbers reported by others, though if we want another opinion I can re-run stuff locally.

My opinion is that if this PR slows anything down, there are other optimizations we should pursue to get the performance back (Arc-ing compound datatypes and special-casing small numbers of Buffers in ArrayData) that would also help the overall system.

still think there might be some overhead there and it might be worth experimenting with an Arc<DataType> or replacing some of the Box in the DataType enum with Arc in a separate PR. For nested types like List<Dictionary<Int32, Utf8>> I think cloning the type currently involves more allocations than cloning the vec of buffers/child data.

I agree with this @jhorstmann -- I think a lot of the arrow codebase assumes that cloning DataType is cheap (which is true for primitives and not true for compound types).

Also, given that ArrayData has several Vecs in it, I suspect an optimization like @jhorstmann was trying here #9330 would make a significant difference too (as most arrays have 1 or 0 items in that Vec)

In practice, I humbly make these offerings to the gods of LLVM and pray for a sign.

I think this is a good sacrifice. 😂

Contributor

What was the reasoning for removing this size test rather than updating it for the new values?

Contributor

@nevi-me nevi-me left a comment

LGTM, just one question

@alamb
Contributor

alamb commented Feb 16, 2021

@jorgecarleitao shall we merge this PR? Or is it still "another of those experiments to gather feedback and share results", as the description says?

jorgecarleitao added a commit that referenced this pull request Feb 17, 2021
Merging #9329 did not go as well as I planned; Fixing error in master.

Closes #9511 from jorgecarleitao/fix_error

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
@alamb
Contributor

alamb commented Feb 18, 2021

I think this one needs a rebase (Jorge fixed the real error in #9511)

@alamb alamb added the needs-rebase A PR that needs to be rebased by the author label Feb 18, 2021
@nevi-me nevi-me removed the needs-rebase A PR that needs to be rebased by the author label Feb 25, 2021
Contributor

@nevi-me nevi-me left a comment

@jorgecarleitao @alamb I've rebased this and fixed some errors in simd.

I have capacity to look at this in more detail, including running benchmarks, so I'll start doing that.
There are 3 questions that Jorge asked which I haven't answered, so I'm marking this as "changes requested" to avoid us prematurely merging this.

I'd also like to hear from @sunchao and @vertexclique around what the change could mean from a memory-usage perspective. I don't have enough experience to tell off-hand.

I'll keep this PR rebased if merge conflicts arise, so I can take that burden off your shoulders, Jorge.

@alamb
Contributor

alamb commented Feb 28, 2021

Thanks @nevi-me . Let me know if there is anything I can do to help

@nevi-me
Contributor

nevi-me commented Mar 6, 2021

Benchmarks on an ARM machine:

nevilledipale@Nevilles-MacBook-Air rust % critcmp master-bfa99d98e arcless-ef2250b9d -t 10
group                           arcless-ef2250b9d                       master-bfa99d98e
-----                           -----------------                       ----------------
add_nulls_512                   1.00   351.5±22.53ns        ? B/sec     1.12   394.1±19.39ns        ? B/sec
array_from_vec 128              1.00   555.6±26.43ns        ? B/sec     1.11   619.1±17.89ns        ? B/sec
array_slice 128                 1.00     98.3±8.62ns        ? B/sec     1.42    139.4±7.01ns        ? B/sec
array_slice 2048                1.00   105.8±13.50ns        ? B/sec     1.37    145.1±8.93ns        ? B/sec
array_slice 512                 1.00    101.9±6.16ns        ? B/sec     1.41    143.3±9.44ns        ? B/sec
cast date32 to date64 512       1.00   246.7±17.81ns        ? B/sec     1.23   303.7±14.81ns        ? B/sec
cast time32s to time32ms 512    1.00   223.1±13.25ns        ? B/sec     1.18   262.3±16.68ns        ? B/sec
cast timestamp_ms to i64 512    1.00     98.6±9.32ns        ? B/sec     1.56    153.7±7.79ns        ? B/sec
divide_scalar 512               1.00   189.9±12.11ns        ? B/sec     1.16    220.4±8.05ns        ? B/sec
divide_scalar_nulls_512         1.00   186.6±14.60ns        ? B/sec     1.21   225.1±11.87ns        ? B/sec
eq Float32                      1.15     40.0±1.54µs        ? B/sec     1.00     34.7±1.30µs        ? B/sec
equal_bool_512                  1.00     25.8±1.34ns        ? B/sec     1.12     28.9±1.15ns        ? B/sec
equal_bool_513                  1.00     27.7±1.36ns        ? B/sec     1.14     31.5±1.42ns        ? B/sec
gt Float32                      1.15     40.6±2.00µs        ? B/sec     1.00     35.2±1.75µs        ? B/sec
gt_eq Float32                   1.20     42.1±2.70µs        ? B/sec     1.00     35.2±1.60µs        ? B/sec
limit 512, 512                  1.00    106.5±6.65ns        ? B/sec     1.31    139.0±7.04ns        ? B/sec
lt Float32                      1.15     40.5±1.97µs        ? B/sec     1.00     35.3±1.73µs        ? B/sec
lt_eq Float32                   1.14     40.3±2.40µs        ? B/sec     1.00     35.2±1.50µs        ? B/sec
max nulls 512                   1.00   889.7±59.93ns        ? B/sec     3.26      2.9±0.11µs        ? B/sec
min nulls 512                   1.00   916.4±70.85ns        ? B/sec     2.64      2.4±0.10µs        ? B/sec
multiply 512                    1.00   361.3±28.15ns        ? B/sec     1.12   403.5±21.82ns        ? B/sec
neq Float32                     1.16     40.6±1.72µs        ? B/sec     1.00     34.9±1.68µs        ? B/sec
take bool 1024                  1.00      4.2±0.30µs        ? B/sec     1.29      5.4±0.29µs        ? B/sec
take bool 512                   1.00  1750.4±100.33ns        ? B/sec    1.38      2.4±0.19µs        ? B/sec
take bool nulls 1024            1.00      3.3±0.16µs        ? B/sec     1.30      4.3±0.28µs        ? B/sec
take bool nulls 512             1.00  1635.3±119.87ns        ? B/sec    1.20  1956.5±83.47ns        ? B/sec
take i32 512                    1.00   421.9±11.67ns        ? B/sec     1.14   480.0±27.45ns        ? B/sec
take str null values 1024       1.00      6.8±0.42µs        ? B/sec     1.25      8.5±1.49µs        ? B/sec

Only equality benches are worse, but not by that much.

@nevi-me
Contributor

nevi-me commented Mar 18, 2021

does anyone know why we are using Arc<ArrayData> in all arrays?

I think it comes down just to memory size, as cloning Arc<T> would give us a pointer-sized variable, compared to cloning T including its DataType.

Do you envision an issue with removing the Arc?

I personally don't. The underlying bytes are still backed by Arc, so we shouldn't use a lot more memory for copies of arrays.
The perf benchmarks are also positive.

Would someone be so kind as to run the benches independently, just to be sure?

I attached my set of benchmark results, but I think the datafusion benchmarks would be more meaningful.
@Dandandan would you be able to run datafusion benchmarks from this branch vs master, and share results?

Thanks

Contributor

@nevi-me nevi-me left a comment

I'm fine with continuing with this change. I've rebased the PR, and added comments on the questions that @jorgecarleitao asked.

Any thoughts @alamb, Jorge, @andygrove?

I've been out of the loop, so I'm not sure if an RFC is retrospectively required for the change, as the PR predates the RFC suggestion.

@alamb
Contributor

alamb commented Mar 19, 2021

@nevi-me I think we should merge this PR:

  1. It cleans the code up
  2. It shows decent performance improvements
  3. It sets us up for future work

@nevi-me
Contributor

nevi-me commented Mar 19, 2021

I concur @alamb, I think the PR's been open long enough for those interested to comment.

I fixed the clippy lint already in master, so we need not worry about it.

@alamb
Contributor

alamb commented Mar 19, 2021

I fixed the clippy lint already in master, so we need not worry about it.

Thanks -- you beat me to it! (#9754)

@Dandandan
Contributor

@nevi-me performance on some queries in h2oai/db-benchmark#182 improves by ~5%.

@nevi-me nevi-me closed this in eebf64b Mar 22, 2021
@jorgecarleitao jorgecarleitao deleted the arcless branch March 22, 2021 15:17
@alamb
Contributor

alamb commented Mar 22, 2021

Thanks @nevi-me for pushing this over the line. 🎉

alamb pushed a commit that referenced this pull request Apr 15, 2021
These tests were all commented out in #9329. Given the PR made changes to the commented-out code, I'm inclined to think this was an accidental omission? If not, happy for this to be closed 😀

FYI @jorgecarleitao

Signed-off-by: Raphael Taylor-Davies <[email protected]>

Closes #10048 from tustvold/enable-transform-tests

Authored-by: Raphael Taylor-Davies <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>


7 participants