diff options
author | Assaf Gordon <assafgordon@gmail.com> | 2013-03-06 18:25:49 -0500 |
---|---|---|
committer | Pádraig Brady <P@draigBrady.com> | 2013-03-25 20:07:14 +0000 |
commit | 20d7bce0f7e57d9a98f0ee811e31c757e9fedfff (patch) | |
tree | 0c3a560e7bbd16762807e0b2fc28d32670549eff /NEWS | |
parent | 4c49dc823ff7da589ae58d8d8313d38a75fc8f64 (diff) | |
download | coreutils-20d7bce0f7e57d9a98f0ee811e31c757e9fedfff.tar.xz |
shuf: use reservoir-sampling for large or unknown sized inputs
Reservoir sampling optimizes selecting K random lines from large or
unknown-sized input: http://en.wikipedia.org/wiki/Reservoir_sampling
Note this also avoids reading any input when -n0 is specified.
* src/shuf.c (main): Use reservoir-sampling when the number of output
lines is known, and the input size is large or unknown.
(input_size): A new function to get the input size for regular files.
(read_input_reservoir_sampling): New function to read lines from input,
keeping only K lines in memory, replacing lines with decreasing prob.
(write_permuted_output_reservoir): New function to output reservoir.
* tests/misc/shuf-reservoir.sh: An expensive_ test using valgrind to
exercise the reservoir-sampling code.
* tests/local.mk: Reference new test.
* NEWS: Mention the improvement.
Diffstat (limited to 'NEWS')
-rw-r--r-- | NEWS | 4 |
1 files changed, 4 insertions, 0 deletions
@@ -25,6 +25,10 @@ GNU coreutils NEWS -*- outline -*- inotify for files on those file systems, rather than the default (for unknown file system types) of issuing a warning and reverting to polling. + shuf outputs subsets of large inputs much more efficiently. + Reservoir sampling is used to limit memory usage based on the number of + outputs, rather than the number of inputs. + ** Build-related factor now builds on aarch64 based systems [bug introduced in coreutils-8.20] |