
Conversation

@jorgecarleitao
Member

@jorgecarleitao jorgecarleitao commented Jan 26, 2021

Rationale

Currently, all our arrays use an Arc<ArrayData>, which they expose via Array::data and Array::data_ref. This adds a level of indirection. As far as I know, the Arc<> is not actually needed anywhere in the current code base.

On #9271 we are observing some performance issues with small arrays, and one of the ideas that came up was to get rid of the Arc and see what happens.

This PR

Well, this PR replaces all Arc<ArrayData> by ArrayData. On the one hand, this means that cloning an array is a tad more expensive (cloning ArrayData vs cloning Arc). On the other hand, it means that the compiler can often optimize the indirection away.

The gist of the benchmarks below is:

  • ~10%-20% improvement over basically everything
  • ~20%-100% improvement in take

There is some noise: some benches that are not expected to be affected show differences anyway.

Personally, I like this PR because it makes working with ArrayData and arrays simpler: no need for Arc::new or as_ref and company (besides the speed).

Questions

  • Does anyone know why we are using Arc<ArrayData> in all arrays?
  • Do you envision an issue with removing the Arc?
  • Would someone be so kind as to run the benches independently, just to be sure?

Benchmarks

# modify Cargo.toml by adding `bench = false` to the [lib] section

git checkout master
cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`

git checkout arcless
cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`
Jorges-MacBook-Pro-2:arrow jorgecarleitao$ critcmp master-437c8c944 arcless-3dbcaca49 -t 10
group                                         arcless-3dbcaca49                       master-437c8c944
-----                                         -----------------                       ----------------
add 512                                       1.00    435.3±8.19ns        ? B/sec     1.30   565.7±81.95ns        ? B/sec
add_nulls_512                                 1.00   451.9±15.41ns        ? B/sec     1.29   581.9±98.83ns        ? B/sec
and                                           1.00  1516.9±39.34ns        ? B/sec     1.21  1842.4±190.71ns        ? B/sec
array_from_vec 128                            1.00   932.0±35.13ns        ? B/sec     1.51  1411.2±475.00ns        ? B/sec
array_from_vec 256                            1.00  1125.6±24.63ns        ? B/sec     1.23  1382.1±201.90ns        ? B/sec
array_from_vec 512                            1.00  1519.9±52.92ns        ? B/sec     1.24  1877.9±368.40ns        ? B/sec
array_slice 128                               1.00    293.3±4.85ns        ? B/sec     1.61   471.3±94.42ns        ? B/sec
array_slice 2048                              1.00    319.7±8.25ns        ? B/sec     1.50  478.2±293.70ns        ? B/sec
array_slice 512                               1.00    293.8±5.91ns        ? B/sec     1.69  496.1±145.06ns        ? B/sec
array_string_from_vec 128                     1.00      3.2±0.10µs        ? B/sec     1.35      4.3±1.68µs        ? B/sec
array_string_from_vec 256                     1.00      4.1±0.13µs        ? B/sec     1.18      4.9±0.94µs        ? B/sec
array_string_from_vec 512                     1.00      5.9±0.11µs        ? B/sec     1.26      7.4±5.14µs        ? B/sec
bench_bool/bench_bool                         1.00      2.0±0.03ms   245.8 MB/sec     1.12      2.3±0.30ms   219.7 MB/sec
buffer_bit_ops and                            1.00   577.0±11.19ns        ? B/sec     1.14  658.5±185.85ns        ? B/sec
cast date32 to date64 512                     1.00      6.8±0.21µs        ? B/sec     1.18      8.0±1.25µs        ? B/sec
cast date64 to date32 512                     1.00      6.8±0.42µs        ? B/sec     1.27      8.6±2.22µs        ? B/sec
cast f32 to string 512                        1.00     52.2±1.37µs        ? B/sec     1.17     61.0±9.24µs        ? B/sec
cast float32 to int32 512                     1.00      2.9±0.07µs        ? B/sec     1.20      3.4±0.52µs        ? B/sec
cast float64 to float32 512                   1.00      3.0±0.07µs        ? B/sec     1.18      3.6±0.47µs        ? B/sec
cast float64 to uint64 512                    1.00      3.4±0.19µs        ? B/sec     1.45      5.0±0.72µs        ? B/sec
cast int32 to float32 512                     1.00      2.9±0.09µs        ? B/sec     1.11      3.3±0.52µs        ? B/sec
cast int32 to float64 512                     1.00      2.7±0.12µs        ? B/sec     1.27      3.4±0.77µs        ? B/sec
cast int32 to int64 512                       1.00      2.9±0.06µs        ? B/sec     1.26      3.6±0.67µs        ? B/sec
cast int32 to uint32 512                      1.00      2.7±0.08µs        ? B/sec     1.23      3.3±0.32µs        ? B/sec
cast int64 to int32 512                       1.00      2.7±0.05µs        ? B/sec     1.23      3.3±0.89µs        ? B/sec
cast time32s to time32ms 512                  1.00  1487.8±52.94ns        ? B/sec     1.11  1648.3±145.69ns        ? B/sec
cast time32s to time64us 512                  1.00      4.8±0.12µs        ? B/sec     1.26      6.0±0.82µs        ? B/sec
cast time64ns to time32s 512                  1.00      9.9±0.24µs        ? B/sec     1.24     12.3±1.87µs        ? B/sec
cast timestamp_ms to i64 512                  1.00   286.7±13.15ns        ? B/sec     1.66   474.9±72.34ns        ? B/sec
cast timestamp_ms to timestamp_ns 512         1.00  1961.6±38.52ns        ? B/sec     1.41      2.8±4.01µs        ? B/sec
cast timestamp_ns to timestamp_s 512          1.00     26.2±0.31ns        ? B/sec     1.12     29.4±3.74ns        ? B/sec
cast utf8 to date32 512                       1.00     43.9±1.05µs        ? B/sec     1.15    50.7±10.33µs        ? B/sec
cast utf8 to f32                              1.00     30.9±0.81µs        ? B/sec     1.21     37.5±5.32µs        ? B/sec
concat str 1024                               1.00      9.9±0.54µs        ? B/sec     1.34     13.2±5.26µs        ? B/sec
divide 512                                    1.00  1411.3±40.15ns        ? B/sec     1.13  1599.1±200.00ns        ? B/sec
divide_nulls_512                              1.00  1419.2±28.43ns        ? B/sec     1.12  1587.9±142.46ns        ? B/sec
eq Float32                                    1.00    103.3±3.33µs        ? B/sec     1.25   129.4±19.72µs        ? B/sec
equal_512                                     1.00     19.2±1.00ns        ? B/sec     2.58    49.5±12.56ns        ? B/sec
equal_bool_512                                1.00     19.1±1.39ns        ? B/sec     2.17     41.5±2.29ns        ? B/sec
equal_bool_513                                1.00     21.1±0.70ns        ? B/sec     2.15     45.4±3.84ns        ? B/sec
equal_nulls_512                               1.00      2.4±0.11µs        ? B/sec     1.11      2.7±0.37µs        ? B/sec
equal_string_512                              1.00     84.5±5.08ns        ? B/sec     1.32    111.9±5.30ns        ? B/sec
equal_string_nulls_512                        1.00      3.6±0.48µs        ? B/sec     1.11      4.0±1.29µs        ? B/sec
filter context f32 high selectivity           1.00   308.9±10.81µs        ? B/sec     1.11   343.2±57.50µs        ? B/sec
filter context f32 low selectivity            1.00      2.7±0.08µs        ? B/sec     1.10      3.0±0.43µs        ? B/sec
filter context string high selectivity        1.00  1091.3±26.04µs        ? B/sec     1.13  1238.6±174.82µs        ? B/sec
filter context u8                             1.00    231.6±4.05µs        ? B/sec     1.21  281.0±112.36µs        ? B/sec
filter context u8 w NULLs                     1.00   500.1±20.38µs        ? B/sec     1.17  586.5±155.02µs        ? B/sec
filter context u8 w NULLs high selectivity    1.00    295.2±5.91µs        ? B/sec     1.17   344.2±63.42µs        ? B/sec
filter context u8 w NULLs low selectivity     1.00      2.9±0.51µs        ? B/sec     1.21      3.5±5.02µs        ? B/sec
filter f32                                    1.00   797.4±33.25µs        ? B/sec     1.22  971.4±135.07µs        ? B/sec
filter u8 low selectivity                     1.00      7.9±0.73µs        ? B/sec     1.12      8.9±1.47µs        ? B/sec
from_slice prepared                           1.15   961.1±24.81µs        ? B/sec     1.00   834.2±59.41µs        ? B/sec
gt Float32                                    1.00    66.2±14.03µs        ? B/sec     1.28     84.6±9.31µs        ? B/sec
gt scalar Float32                             1.49     60.9±8.59µs        ? B/sec     1.00     40.8±4.06µs        ? B/sec
gt_eq Float32                                 1.00   117.3±31.66µs        ? B/sec     1.10   129.6±25.35µs        ? B/sec
json_list_primitive_to_record_batch           1.00     63.3±2.09µs        ? B/sec     1.16    73.4±11.01µs        ? B/sec
json_primitive_to_record_batch                1.00     25.1±0.82µs        ? B/sec     1.11     27.7±4.88µs        ? B/sec
length                                        1.00      2.9±0.10µs        ? B/sec     1.30      3.7±0.77µs        ? B/sec
like_utf8 scalar ends with                    1.00   246.5±13.62µs        ? B/sec     1.19   294.5±45.77µs        ? B/sec
like_utf8 scalar equals                       1.17     98.3±7.08µs        ? B/sec     1.00    83.7±10.13µs        ? B/sec
limit 512, 512                                1.00    291.9±5.45ns        ? B/sec     1.55  451.1±130.11ns        ? B/sec
lt Float32                                    1.00    70.0±19.63µs        ? B/sec     1.26    88.3±12.27µs        ? B/sec
lt scalar Float32                             1.00     62.6±3.01µs        ? B/sec     1.39    87.3±25.97µs        ? B/sec
lt_eq Float32                                 1.00    104.8±8.32µs        ? B/sec     1.43   149.9±56.97µs        ? B/sec
lt_eq scalar Float32                          1.00     82.1±3.59µs        ? B/sec     1.11    91.3±12.27µs        ? B/sec
max nulls 512                                 1.00  1378.4±51.28ns        ? B/sec     1.32  1820.6±260.46ns        ? B/sec
min 512                                       1.00  1510.6±13.17ns        ? B/sec     1.28  1938.2±515.72ns        ? B/sec
min nulls 512                                 1.00  1380.2±51.78ns        ? B/sec     1.44  1980.7±259.55ns        ? B/sec
min nulls string 512                          1.00      7.1±0.12µs        ? B/sec     1.25      8.9±1.66µs        ? B/sec
multiply 512                                  1.00   461.5±42.95ns        ? B/sec     1.30  601.0±102.46ns        ? B/sec
mutable                                       1.00   475.9±15.21µs        ? B/sec     1.19   566.7±77.31µs        ? B/sec
mutable prepared                              1.00   530.8±21.36µs        ? B/sec     1.17   621.0±60.61µs        ? B/sec
mutable str 1024                              1.00  1493.7±56.88µs        ? B/sec     1.19  1776.6±329.88µs        ? B/sec
mutable str nulls 1024                        1.00      4.6±0.30ms        ? B/sec     1.13      5.2±0.51ms        ? B/sec
neq Float32                                   1.00     65.8±1.70µs        ? B/sec     1.39    91.5±20.30µs        ? B/sec
neq scalar Float32                            1.47    97.6±50.68µs        ? B/sec     1.00     66.4±6.21µs        ? B/sec
nlike_utf8 scalar contains                    1.00      2.4±0.03ms        ? B/sec     1.16      2.8±0.41ms        ? B/sec
nlike_utf8 scalar equals                      1.00   218.8±19.42µs        ? B/sec     1.14   249.2±24.71µs        ? B/sec
not                                           1.00   955.5±27.28ns        ? B/sec     1.15  1102.4±153.08ns        ? B/sec
or                                            1.00  1497.9±37.68ns        ? B/sec     1.15  1726.6±88.42ns        ? B/sec
sort 2^10                                     1.00   153.9±11.62µs        ? B/sec     1.14   175.6±25.84µs        ? B/sec
sort 2^12                                     1.00   748.9±51.51µs        ? B/sec     1.13  849.8±109.51µs        ? B/sec
struct_array_from_vec 256                     1.00      7.6±0.23µs        ? B/sec     1.12      8.5±2.51µs        ? B/sec
struct_array_from_vec 512                     1.00     10.0±0.73µs        ? B/sec     1.24     12.4±3.51µs        ? B/sec
subtract 512                                  1.00    434.8±9.25ns        ? B/sec     1.32   574.3±37.34ns        ? B/sec
sum 512                                       1.00   524.8±12.74ns        ? B/sec     1.19  626.2±156.66ns        ? B/sec
take bool 1024                                1.00      3.6±0.27µs        ? B/sec     1.45      5.2±3.11µs        ? B/sec
take bool 512                                 1.00      2.1±0.10µs        ? B/sec     1.23      2.6±0.32µs        ? B/sec
take bool nulls 1024                          1.00      5.1±2.09µs        ? B/sec     1.60      8.1±2.08µs        ? B/sec
take bool nulls 512                           1.00      2.1±0.14µs        ? B/sec     1.82      3.9±0.77µs        ? B/sec
take i32 1024                                 1.00  1583.0±111.03ns        ? B/sec    1.49      2.4±0.58µs        ? B/sec
take i32 512                                  1.00  1058.3±88.60ns        ? B/sec     1.29  1364.4±97.02ns        ? B/sec
take i32 nulls 1024                           1.00  1527.5±54.16ns        ? B/sec     1.82      2.8±1.00µs        ? B/sec
take i32 nulls 512                            1.00  1086.4±124.91ns        ? B/sec    2.34      2.5±1.64µs        ? B/sec
take str 1024                                 1.00      5.9±0.51µs        ? B/sec     1.20      7.1±1.30µs        ? B/sec
take str 512                                  1.00      3.8±0.36µs        ? B/sec     1.21      4.6±0.96µs        ? B/sec
take str null indices 1024                    1.00      5.6±0.16µs        ? B/sec     1.27      7.2±1.63µs        ? B/sec
take str null indices 512                     1.00      3.7±0.31µs        ? B/sec     1.23      4.6±1.32µs        ? B/sec
take str null values 1024                     1.00      5.7±0.14µs        ? B/sec     1.25      7.1±1.02µs        ? B/sec
take str null values null indices 1024        1.00     12.8±0.39µs        ? B/sec     1.14     14.6±1.85µs        ? B/sec

@jorgecarleitao jorgecarleitao changed the title [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays (1.2-2x speedup) Jan 26, 2021
@jorgecarleitao
Member Author

cc @Dandandan ^_^

@Dandandan
Contributor

Nice @jorgecarleitao !! Looking forward to seeing the impact in DataFusion on the slicing of smaller arrays, and maybe other parts as well. I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels based only on removing / not creating or cloning the Arc. Any idea why that could be the case?

Ergonomically it is definitely an improvement; the ArrayData struct is small and should be cheap to clone, and this avoids quite some boilerplate.

@jorgecarleitao jorgecarleitao changed the title [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays (1.2-2x speedup) [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays Jan 26, 2021
@jorgecarleitao
Member Author

I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels only based on removing / not creating or cloning the Arc. Any idea why that could be the case?

I remember that we had a similar issue some time ago where we were using data instead of data_ref, and that was causing a significant performance difference.

My general approach here is to write simpler code in the hope that it helps the compiler.

In practice, I humbly make these offerings to the gods of LLVM and pray for a sign.

@yordan-pavlov
Contributor

@jorgecarleitao that's an interesting idea; simplification makes sense, especially if cloning ArrayData is cheap and there is no obvious reason to use Arc<ArrayData>.

I ran the filter benchmarks with SIMD enabled and got the following results (in comparison to master).
Possibly some improvement, but the results are not very clear - often the differences are small.

filter u8 time: [556.89 us 562.29 us 568.54 us]
change: [-3.0769% +0.2211% +3.2221%] (p = 0.89 > 0.05)

filter u8 high selectivity
time: [13.859 us 13.959 us 14.081 us]
change: [-2.3832% +0.3285% +3.0131%] (p = 0.81 > 0.05)

filter u8 low selectivity
time: [7.7940 us 7.8819 us 7.9793 us]
change: [+8.3043% +11.999% +16.830%] (p = 0.00 < 0.05)

filter context u8 time: [205.24 us 209.06 us 213.16 us]
change: [+0.4175% +5.7203% +11.914%] (p = 0.04 < 0.05)

filter context u8 high selectivity
time: [3.7003 us 3.7456 us 3.8010 us]
change: [-3.2048% -0.7748% +1.5407%] (p = 0.55 > 0.05)

filter context u8 low selectivity
time: [1.3066 us 1.3243 us 1.3443 us]
change: [-3.4133% -1.5255% +0.1897%] (p = 0.11 > 0.05)

filter context u8 w NULLs
time: [496.69 us 501.24 us 506.60 us]
change: [-8.8715% -4.9662% -1.0448%] (p = 0.02 < 0.05)

filter context u8 w NULLs high selectivity
time: [254.92 us 258.17 us 262.51 us]
change: [+8.6837% +13.067% +17.583%] (p = 0.00 < 0.05)

filter context u8 w NULLs low selectivity
time: [1.9758 us 2.0159 us 2.0623 us]
change: [-10.590% -7.4822% -4.2891%] (p = 0.00 < 0.05)

filter f32 time: [779.40 us 789.32 us 801.26 us]
change: [-17.189% -13.338% -9.2795%] (p = 0.00 < 0.05)

filter context f32 time: [485.95 us 495.16 us 506.07 us]
change: [-7.7798% -3.5929% +0.9816%] (p = 0.12 > 0.05)

filter context f32 high selectivity
time: [258.28 us 259.96 us 261.83 us]
change: [+2.0252% +5.7103% +9.4974%] (p = 0.00 < 0.05)

filter context f32 low selectivity
time: [1.9878 us 2.0332 us 2.0857 us]
change: [-7.9039% -4.8554% -1.8226%] (p = 0.00 < 0.05)

filter context string time: [639.51 us 646.56 us 655.15 us]
change: [-9.9164% -5.9396% -2.2137%] (p = 0.00 < 0.05)

filter context string high selectivity
time: [1.0009 ms 1.0168 ms 1.0344 ms]
change: [-5.8412% -1.1777% +4.1292%] (p = 0.65 > 0.05)

filter context string low selectivity
time: [2.9339 us 2.9668 us 3.0060 us]
change: [-6.4880% -2.3484% +1.7663%] (p = 0.29 > 0.05)

Contributor

If both data() and data_ref() return &ArrayData, maybe data_ref() alone is enough and data() should be removed; or possibly change data() to return ArrayData.

Contributor

I understand why we wanted to avoid Array::data().clone() to get ArrayData, but with these 2 functions returning the same reference, I'd opt to make Array::data() -> ArrayData

Contributor

@yordan-pavlov I think it's safer to deprecate data(), then remove it later. We still use it a lot in the codebase, but it's often more performant to use data_ref(), so returning ArrayData doesn't guide users toward the faster path.

Contributor

Yes, deprecating data() is what I meant; immediate removal is not great for backwards compatibility.

@jhorstmann
Contributor

How cheap cloning the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include:

  • the datatype enum, which can include boxes/vecs of nested types and the field metadata
  • the vec of buffers, and cloning a buffer would include cloning the Arc<Bytes>

So for primitive types, removing the Arc should be a benefit because of less indirection when accessing it, though it would still need to allocate vectors and clone the Arc inside each Buffer.

For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.

@jorgecarleitao
Member Author

How cheap cloning the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include:

* the datatype enum, which can include boxes/vecs of nested types and the field metadata

* the vec of buffers, and cloning a buffer would include cloning the `Arc<Bytes>`

So for primitive types, removing the Arc should be a benefit because of less indirection when accessing it, though it would still need to allocate vectors and clone the Arc inside each Buffer.

For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.

I am sorry, I am not sure I follow the arguments:

  • DataType: all our operations already clone DataType, since they create a new ArrayData. Even something as simple as slicing an array requires cloning a DataType: we need to increase the offset, which requires a new ArrayData, which requires a new DataType.
  • Arc<Bytes>: I do not see how using Arc<Bytes> is relevant to whether we should use Arc<ArrayData>. Regardless, as far as I understand, its usage is really important: Bytes is an immutable memory region, and sharing it between arrays imo makes a lot of sense; the most common case is when the null buffer does not change due to an operation. So the two natural options are Rc and Arc, and allowing arrays to be sharable across thread boundaries seems like a good justification for Arc.

@jhorstmann
Contributor

all our operations already clone DataType, since they create a new ArrayData

You are right, I wasn't looking closely enough at the code, or thought only about ArrayData::clone vs Arc<ArrayData>::clone. I still think there might be some overhead there, and it might be worth experimenting with an Arc<DataType> or replacing some of the Box in the DataType enum with Arc in a separate PR. For nested types like List<Dictionary<Int32, Utf8>> I think cloning the type currently involves more allocations than cloning the vec of buffers/child data.

@jorgecarleitao jorgecarleitao changed the title [Rust] [Experiment] Replace Arc<ArrayData> by ArrayData in all arrays ARROW-11511: [Rust] Replace Arc<ArrayData> by ArrayData in all arrays Feb 5, 2021
@github-actions

github-actions bot commented Feb 5, 2021

@jorgecarleitao
Copy link
Member Author

@nevi-me @sunchao , I would appreciate your feedback here: I can't find a rationale to keep Arc<ArrayData>, but I do not have all the background, and you are probably the most experienced people to give a qualified answer here. I marked you as reviewers as well.

@codecov-io

codecov-io commented Feb 6, 2021

Codecov Report

Merging #9329 (6cdec30) into master (a3e1513) will decrease coverage by 0.28%.
The diff coverage is 92.57%.


@@            Coverage Diff             @@
##           master    #9329      +/-   ##
==========================================
- Coverage   82.32%   82.03%   -0.29%     
==========================================
  Files         245      245              
  Lines       56277    55927     -350     
==========================================
- Hits        46328    45879     -449     
- Misses       9949    10048      +99     
Impacted Files Coverage Δ
rust/arrow/src/ipc/writer.rs 87.44% <ø> (+0.21%) ⬆️
rust/arrow/src/json/writer.rs 87.91% <ø> (+1.74%) ⬆️
...t/datafusion/src/physical_plan/math_expressions.rs 27.27% <ø> (ø)
rust/integration-testing/src/lib.rs 0.00% <0.00%> (ø)
rust/arrow/src/array/array.rs 76.80% <50.00%> (+0.49%) ⬆️
rust/arrow/src/array/data.rs 76.87% <55.55%> (-2.00%) ⬇️
rust/parquet/src/arrow/array_reader.rs 77.61% <66.66%> (ø)
rust/arrow/src/ipc/reader.rs 84.36% <83.33%> (ø)
rust/arrow/src/json/reader.rs 83.45% <83.33%> (-0.06%) ⬇️
rust/arrow/src/array/array_binary.rs 84.53% <85.71%> (-7.09%) ⬇️
... and 58 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a3e1513...6cdec30.

Contributor

@alamb alamb left a comment

I went over this PR carefully and it makes sense to me. I am happy with the performance numbers reported by others, though if we want another opinion I can re-run stuff locally.

My opinion is that if this PR slows anything down, there are other optimizations we should pursue to get the performance back (Arc-ing compound datatypes and special-casing small numbers of Buffers in ArrayData) that would also help the overall system.

still think there might be some overhead there and it might be worth experimenting with an Arc<DataType> or replacing some of the Box in the DataType enum with Arc in a separate PR. For nested types like List<Dictionary<Int32, Utf8>> I think cloning the type currently involves more allocations than cloning the vec of buffers/child data.

I agree with this @jhorstmann -- I think a lot of the arrow codebase assumes that cloning DataType is cheap (which is true for primitives and not true for compound types).

Also, given that ArrayData has several Vecs in it, I suspect an optimization like @jhorstmann was trying here #9330 would make a significant difference too (as most arrays have 1 or 0 items in that Vec)

In practice, I humbly make these offerings to the gods of LLVM and pray for a sign.

I think this is a good sacrifice. 😂

Contributor

What was the reasoning for removing this size test rather than updating it for the new values?

Contributor

@nevi-me nevi-me left a comment

LGTM, just one question

@alamb
Contributor

alamb commented Feb 16, 2021

@jorgecarleitao shall we merge this PR? Or is it still "another of those experiments to gather feedback and share results", as the description says?

jorgecarleitao added a commit that referenced this pull request Feb 17, 2021
Merging #9329 did not go as well as I planned; Fixing error in master.

Closes #9511 from jorgecarleitao/fix_error

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
@alamb
Contributor

alamb commented Feb 18, 2021

I think this one needs a rebase (Jorge fixed the real error in #9511)

@alamb alamb added the needs-rebase A PR that needs to be rebased by the author label Feb 18, 2021
@nevi-me nevi-me removed the needs-rebase A PR that needs to be rebased by the author label Feb 25, 2021
Contributor

@nevi-me nevi-me left a comment

@jorgecarleitao @alamb I've rebased this and fixed some errors in simd.

I have capacity to look at this in more detail, including running benchmarks, so I'll start doing that.
There are 3 questions that Jorge asked which I haven't answered, so I'm marking this as "changes requested" to avoid us prematurely merging this.

I'd also like to hear from @sunchao and @vertexclique around what the change could mean from a memory-usage perspective. I don't have enough experience to tell off-hand.

I'll keep this PR rebased if merge conflicts arise, so I can take that burden off your shoulders, Jorge.

@alamb
Contributor

alamb commented Feb 28, 2021

Thanks @nevi-me . Let me know if there is anything I can do to help

@nevi-me
Contributor

nevi-me commented Mar 6, 2021

Benchmarks on an ARM machine:

nevilledipale@Nevilles-MacBook-Air rust % critcmp master-bfa99d98e arcless-ef2250b9d -t 10
group                           arcless-ef2250b9d                       master-bfa99d98e
-----                           -----------------                       ----------------
add_nulls_512                   1.00   351.5±22.53ns        ? B/sec     1.12   394.1±19.39ns        ? B/sec
array_from_vec 128              1.00   555.6±26.43ns        ? B/sec     1.11   619.1±17.89ns        ? B/sec
array_slice 128                 1.00     98.3±8.62ns        ? B/sec     1.42    139.4±7.01ns        ? B/sec
array_slice 2048                1.00   105.8±13.50ns        ? B/sec     1.37    145.1±8.93ns        ? B/sec
array_slice 512                 1.00    101.9±6.16ns        ? B/sec     1.41    143.3±9.44ns        ? B/sec
cast date32 to date64 512       1.00   246.7±17.81ns        ? B/sec     1.23   303.7±14.81ns        ? B/sec
cast time32s to time32ms 512    1.00   223.1±13.25ns        ? B/sec     1.18   262.3±16.68ns        ? B/sec
cast timestamp_ms to i64 512    1.00     98.6±9.32ns        ? B/sec     1.56    153.7±7.79ns        ? B/sec
divide_scalar 512               1.00   189.9±12.11ns        ? B/sec     1.16    220.4±8.05ns        ? B/sec
divide_scalar_nulls_512         1.00   186.6±14.60ns        ? B/sec     1.21   225.1±11.87ns        ? B/sec
eq Float32                      1.15     40.0±1.54µs        ? B/sec     1.00     34.7±1.30µs        ? B/sec
equal_bool_512                  1.00     25.8±1.34ns        ? B/sec     1.12     28.9±1.15ns        ? B/sec
equal_bool_513                  1.00     27.7±1.36ns        ? B/sec     1.14     31.5±1.42ns        ? B/sec
gt Float32                      1.15     40.6±2.00µs        ? B/sec     1.00     35.2±1.75µs        ? B/sec
gt_eq Float32                   1.20     42.1±2.70µs        ? B/sec     1.00     35.2±1.60µs        ? B/sec
limit 512, 512                  1.00    106.5±6.65ns        ? B/sec     1.31    139.0±7.04ns        ? B/sec
lt Float32                      1.15     40.5±1.97µs        ? B/sec     1.00     35.3±1.73µs        ? B/sec
lt_eq Float32                   1.14     40.3±2.40µs        ? B/sec     1.00     35.2±1.50µs        ? B/sec
max nulls 512                   1.00   889.7±59.93ns        ? B/sec     3.26      2.9±0.11µs        ? B/sec
min nulls 512                   1.00   916.4±70.85ns        ? B/sec     2.64      2.4±0.10µs        ? B/sec
multiply 512                    1.00   361.3±28.15ns        ? B/sec     1.12   403.5±21.82ns        ? B/sec
neq Float32                     1.16     40.6±1.72µs        ? B/sec     1.00     34.9±1.68µs        ? B/sec
take bool 1024                  1.00      4.2±0.30µs        ? B/sec     1.29      5.4±0.29µs        ? B/sec
take bool 512                   1.00  1750.4±100.33ns        ? B/sec    1.38      2.4±0.19µs        ? B/sec
take bool nulls 1024            1.00      3.3±0.16µs        ? B/sec     1.30      4.3±0.28µs        ? B/sec
take bool nulls 512             1.00  1635.3±119.87ns        ? B/sec    1.20  1956.5±83.47ns        ? B/sec
take i32 512                    1.00   421.9±11.67ns        ? B/sec     1.14   480.0±27.45ns        ? B/sec
take str null values 1024       1.00      6.8±0.42µs        ? B/sec     1.25      8.5±1.49µs        ? B/sec

Only equality benches are worse, but not by that much.

@nevi-me
Contributor

nevi-me commented Mar 18, 2021

does anyone know why we are using Arc<ArrayData> in all arrays?

I think it comes down just to memory size, as cloning Arc<T> would give us a pointer-sized variable, compared to cloning T including its DataType.

Do you envision an issue with removing the Arc?

I personally don't. The underlying bytes are still backed by Arc, so we shouldn't use a lot more memory for copies of arrays.
The perf benchmarks are also positive.

Would someone be so kind as to run the benches independently, just to be sure?

I attached my set of benchmark results, but I think the datafusion benchmarks would be more meaningful.
@Dandandan would you be able to run datafusion benchmarks from this branch vs master, and share results?

Thanks

Contributor

@nevi-me nevi-me left a comment

I'm fine with continuing with this change. I've rebased the PR, and added comments on the questions that @jorgecarleitao asked.

Any thoughts @alamb, Jorge, @andygrove?

I've been out of the loop, so I'm not sure if an RFC is retrospectively required for the change, as the PR predates the RFC suggestion.

@alamb
Contributor

alamb commented Mar 19, 2021

@nevi-me I think we should merge this PR:

  1. It cleans the code up
  2. It shows decent performance improvements
  3. It sets us up for future work

@nevi-me
Contributor

nevi-me commented Mar 19, 2021

I concur @alamb, I think the PR's been open long enough for those interested to comment.

I fixed the clippy lint already in master, so we need not worry about it.

@alamb
Contributor

alamb commented Mar 19, 2021

I fixed the clippy lint already in master, so we need not worry about it.

Thanks -- you beat me to it! (#9754)

@Dandandan
Contributor

@nevi-me performance on some queries in h2oai/db-benchmark#182 improves by ~5%.

@nevi-me nevi-me closed this in eebf64b Mar 22, 2021
@jorgecarleitao jorgecarleitao deleted the arcless branch March 22, 2021 15:17
@alamb
Contributor

alamb commented Mar 22, 2021

Thanks @nevi-me for pushing this over the line. 🎉

alamb pushed a commit that referenced this pull request Apr 15, 2021
These tests were all commented out in #9329. Given the PR made changes to the commented-out code, I'm inclined to think this was an accidental omission? If not, happy for this to be closed 😀

FYI @jorgecarleitao

Signed-off-by: Raphael Taylor-Davies <[email protected]>

Closes #10048 from tustvold/enable-transform-tests

Authored-by: Raphael Taylor-Davies <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>


7 participants