Runtime Benchmark 
=================

This package includes a runtime benchmark (a script and several datasets) that measures the runtime performance of its similarity measures. Users can use the benchmark to judge whether a similarity measure is fast enough for their purposes, and developers can use it to guide efforts to speed up the measures.

Running the Benchmark
---------------------

The user can run the benchmark as follows:

Step 1: Clone the py_stringmatching package from GitHub using the following command::
    
    git clone https://github.com/anhaidgroup/py_stringmatching.git 

Step 2: Change the working directory to ``py_stringmatching/benchmarks/custom_benchmarks``.

Step 3: In a Python interpreter started from that directory, run the benchmark using the following sequence of commands:

    >>> import py_stringmatching as sm
    >>> from run_benchmark import *
    # create an object for the similarity measure you need to benchmark
    >>> jaccard = sm.Jaccard()                                                                                   
    # create a tokenizer object (in case of token-based measures)            
    >>> ws = sm.WhitespaceTokenizer(return_set = True)
    # Set dataset paths
    >>> short_strings_path = 'datasets/short_strings.csv'
    >>> medium_strings_path = 'datasets/medium_strings.csv'
    >>> long_strings_path = 'datasets/long_strings.csv'
    # Data size (number of string pairs) over which the benchmark should be run
    >>> data_size = 10000
    # Number of times to repeat
    >>> num_repeat = 3
    # Output file where the benchmark results should be written
    >>> output_file = 'benchmark_results.csv'
    # run the benchmark
    >>> run_benchmark(short_strings_path, medium_strings_path, long_strings_path, data_size, jaccard.get_sim_score, ws.tokenize, num_repeat = num_repeat, output_file = output_file)

The benchmark contains three datasets in the ``datasets`` directory: (1) short_strings.csv, (2) medium_strings.csv, and (3) long_strings.csv. Each dataset contains 5000 strings. Specifically, short_strings.csv contains strings with lengths in the range 2-15 (average 10), medium_strings.csv contains strings with lengths in the range 18-39 (average 25), and long_strings.csv contains strings with lengths in the range 60-1726 (average 127).

The above command runs the benchmark for 9 different configurations
(short-short, short-medium, short-long, medium-short, medium-medium, medium-long,
long-short, long-medium, long-long) for the provided similarity measure and
writes the results to the provided output file. See below for additional details.
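
The nine configuration names are the ordered pairs of the three dataset sizes. A minimal sketch of that enumeration, independent of the benchmark script itself (the variable names are illustrative and not taken from ``run_benchmark``)::

    from itertools import product

    sizes = ['short', 'medium', 'long']

    # Every ordered (left, right) pair of dataset sizes, joined with an
    # underscore as in the first column of the results file.
    configurations = ['{}_{}'.format(left, right)
                      for left, right in product(sizes, repeat=2)]

    print(len(configurations))
    print(configurations)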

Interpreting the Results
--------------------------

The benchmark results will be a CSV file containing the following information:

   * Configuration
   * Runtime (in secs) for each run of a configuration (note that each configuration is run ``num_repeat`` times)
   * Average runtime (in secs) for each configuration
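
To make the per-run and average columns concrete, here is a rough sketch of repeated timing. It is an assumption about the general approach, not the code in ``run_benchmark``; the helper ``time_repeated`` and the stand-in measure ``exact_match`` are hypothetical and exist only so the sketch runs without py_stringmatching installed::

    import time

    def time_repeated(func, pairs, num_repeat=3):
        """Time func over all pairs num_repeat times.

        Returns (list of per-run times in seconds, their average).
        """
        run_times = []
        for _ in range(num_repeat):
            start = time.perf_counter()
            for left, right in pairs:
                func(left, right)
            run_times.append(time.perf_counter() - start)
        return run_times, sum(run_times) / len(run_times)

    # Stand-in similarity function (hypothetical, for illustration only).
    def exact_match(left, right):
        return 1.0 if left == right else 0.0

    pairs = [('data science', 'data'), ('madison', 'madison')] * 1000
    run_times, average = time_repeated(exact_match, pairs, num_repeat=3)
    print(len(run_times), average)

With this layout, one row of the results file would correspond to one call such as ``time_repeated(...)`` for a single configuration.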

An example output file will look like this::

    configuration,run_1 (in secs),run_2 (in secs),run_3 (in secs),average (in secs) 
    short_short,0.112642049789,0.112892866135,0.112852096558,0.112795670827         
    short_medium,0.115404129028,0.115512132645,0.115454912186,0.115457057953        
    short_long,0.194123983383,0.193922996521,0.193790912628,0.193945964177          
    medium_short,0.11647105217,0.116579055786,0.116438865662,0.116496324539         
    medium_medium,0.118470907211,0.118409156799,0.118496894836,0.118458986282       
    medium_long,0.206312894821,0.206974983215,0.206708908081,0.206665595373         
    long_short,0.205050945282,0.205410957336,0.205253124237,0.205238342285          
    long_medium,0.217441797256,0.21806883812,0.218235015869,0.217915217082          
    long_long,0.770321846008,0.76869893074,0.768806934357,0.769275903702
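
Because the output is plain CSV, it can be post-processed with the standard library. The sketch below is not part of the benchmark script: it parses three rows copied from the example above, recomputes each average from the per-run columns, and picks the slowest configuration. The helper name ``summarize`` is hypothetical; to read a real results file, pass an open file object instead of the ``io.StringIO`` wrapper::

    import csv
    import io

    # Header and three rows copied from the example output above.
    SAMPLE = '\n'.join([
        'configuration,run_1 (in secs),run_2 (in secs),run_3 (in secs),average (in secs)',
        'short_short,0.112642049789,0.112892866135,0.112852096558,0.112795670827',
        'short_long,0.194123983383,0.193922996521,0.193790912628,0.193945964177',
        'long_long,0.770321846008,0.76869893074,0.768806934357,0.769275903702',
    ])

    def summarize(csv_file):
        """Return {configuration: mean of the run_* columns}."""
        means = {}
        for row in csv.DictReader(csv_file):
            runs = [float(value) for name, value in row.items()
                    if name and name.strip().startswith('run_')]
            means[row['configuration'].strip()] = sum(runs) / len(runs)
        return means

    means = summarize(io.StringIO(SAMPLE))
    slowest = max(means, key=means.get)
    print(slowest, means[slowest])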