Commit f4328cf

Gal Ben David committed
improve the implementation and implement counts methods
1 parent 920f8fc commit f4328cf

5 files changed

Lines changed: 255 additions & 174 deletions

.clang-format

Lines changed: 0 additions & 63 deletions
This file was deleted.

README.md

Lines changed: 33 additions & 25 deletions
@@ -18,6 +18,8 @@
 - [About The Project](#about-the-project)
 - [Built With](#built-with)
 - [Performance](#performance)
+- [500MB File](#500mb-file)
+- [6000MB File](#6000mb-file)
 - [Prerequisites](#prerequisites)
 - [Installation](#installation)
 - [Usage](#usage)
@@ -29,7 +31,10 @@
 
 PySubstringSearch is a library intended for searching over an index file for substring patterns. The library is written in C++ to achieve speed and efficiency. The library also uses the [Msufsort](https://github.com/michaelmaniscalco/msufsort) suffix array construction library for string indexing. The created index consists of the original text and 32-bit suffix array structs. The library relies on a proprietary container protocol to hold the original text along with the index in chunks of 512MB to work around the limitation of the suffix array construction implementation.
 
-The module implements two methods, search_sequential & search_parallel. search_sequential searches through the inner chunks one by one where search_parallel searches concurrently. When dealing with big indices, bigger than 1gb for example, search_parallel would function faster. I advice to check them both with the resulted index to find which one fits better.
+The module implements multiple methods.
+- `search` - search concurrently for a substring that may appear in different entries within the index file. As the index file grows and contains more inner chunks, the benefit of concurrency increases.
+- `count_entries` - return the number of entries in the index file that contain the substring.
+- `count_occurrences` - return the number of occurrences of the substring across all entries. If the substring appears multiple times in the same entry, each occurrence is counted.
 
 
 ### Built With
@@ -39,20 +44,21 @@ The module implements two methods, search_sequential & search_parallel. search_s
 
 ### Performance
 
-| Library | Text Size | Function | Time | #Results | Improvement Factor |
-| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
-| [ripgrepy](https://pypi.org/project/ripgrepy/) | 500mb | Ripgrepy('text_one', '500mb').run().as_string.split('\n') | 127 ms ± 694 µs per loop | 12553 | 1.0x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 500mb | reader.search_sequential('text_one') | 2.48 ms ± 53.4 µs per loop | 12553 | 51.2x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 500mb | reader.search_parallel('text_one') | 3.78 ms ± 350 µs per loop | 12553 | 33.6x |
-| [ripgrepy](https://pypi.org/project/ripgrepy/) | 500mb | Ripgrepy('text_two', '500mb').run().as_string.split('\n') | 127 ms ± 623 µs per loop | 769 | 1.0x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 500mb | reader.search_sequential('text_two') | 156 µs ± 916 ns per loop | 769 | 814.0x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 500mb | reader.search_parallel('text_two') | 251 µs ± 80.2 µs per loop | 769 | 506.0x |
-| [ripgrepy](https://pypi.org/project/ripgrepy/) | 6gb | Ripgrepy('text_one', '6gb').run().as_string.split('\n') | 1.38 s ± 3.82 ms | 206884 | 1.0x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 6gb | reader.search_sequential('text_one') | 93.7 ms ± 2.16 ms per loop | 206884 | 15.3x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 6gb | reader.search_parallel('text_one') | 34.3 ms ± 321 µs per loop | 206884 | 40.5x |
-| [ripgrepy](https://pypi.org/project/ripgrepy/) | 6gb | Ripgrepy('text_two', '6gb').run().as_string.split('\n') | 1.61 s ± 37.2 ms per loop | 6921 | 1.0x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 6gb | reader.search_sequential('text_two') | 2.22 ms ± 79.3 µs per loop | 6921 | 725.2x |
-| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | 6gb | reader.search_parallel('text_two') | 1.38 ms ± 26 µs per loop | 6921 | 1166.6x |
+#### 500MB File
+| Library | Function | Time | #Results | Improvement Factor |
+| ------------- | ------------- | ------------- | ------------- | ------------- |
+| [ripgrepy](https://pypi.org/project/ripgrepy/) | Ripgrepy('text_one', '500mb').run().as_string.split('\n') | 148ms | 2367 | 1.0x |
+| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | reader.search('text_one') | 1.28ms | 2367 | 115.6x |
+| [ripgrepy](https://pypi.org/project/ripgrepy/) | Ripgrepy('text_two', '500mb').run().as_string.split('\n') | 116ms | 159 | 1.0x |
+| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | reader.search('text_two') | 228µs | 159 | 508.7x |
+
+#### 6000MB File
+| Library | Function | Time | #Results | Improvement Factor |
+| ------------- | ------------- | ------------- | ------------- | ------------- |
+| [ripgrepy](https://pypi.org/project/ripgrepy/) | Ripgrepy('text_one', '6000mb').run().as_string.split('\n') | 2.4s | 59538 | 1.0x |
+| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | reader.search('text_one') | 15.4ms | 59538 | 155.8x |
+| [ripgrepy](https://pypi.org/project/ripgrepy/) | Ripgrepy('text_two', '6000mb').run().as_string.split('\n') | 1.5s | 7266 | 1.0x |
+| [PySubstringSearch](https://github.com/Intsights/PySubstringSearch) | reader.search('text_two') | 1.97ms | 7266 | 761.4x |
 
 ### Prerequisites
 
@@ -104,21 +110,23 @@ reader = pysubstringsearch.Reader(
     index_file_path='output.idx',
 )
 
-# lookup for a substring sequentially
-reader.search_sequential('short')
+# look up a substring
+reader.search('short')
 >>> ['some short string']
 
-# lookup for a substring sequentially
-reader.search_sequential('string')
+# look up a substring
+reader.search('string')
 >>> ['some short string', 'another but now a longer string']
 
-# lookup for a substring concurrently
-reader.search_parallel('short')
->>> ['some short string']
+# count the number of occurrences
+# ['some short string', 'another string now, but a longer string']
+reader.count_occurrences('string')
+>>> 3
 
-# lookup for a substring concurrently
-reader.search_parallel('string')
->>> ['some short string', 'another but now a longer string']
+# count the number of entries
+# ['some short string', 'another string now, but a longer string']
+reader.count_entries('string')
+>>> 2
 ```
 
 

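The README text in this commit describes the index as the original text plus a suffix array, queried for substrings. The sketch below illustrates that general idea in pure Python. It is not the library's implementation: the names `build_index`, `match_range` and `search`, the newline separator between entries, and the naive construction are all assumptions made for illustration (the library itself is described as C++ using Msufsort, with a chunked container format that is not modelled here).

```python
from __future__ import annotations

from bisect import bisect_right


def build_index(entries: list[str]) -> tuple[list[str], str, list[int], list[int]]:
    """Join entries into one text and build a naive suffix array over it.

    Assumption: entries are separated by a newline and never contain one.
    """
    text = "\n".join(entries)
    starts = []  # start offset of each entry inside the joined text
    offset = 0
    for entry in entries:
        starts.append(offset)
        offset += len(entry) + 1  # +1 for the separator
    # Naive construction: sort suffix offsets by suffix text (illustration only).
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return entries, text, starts, suffix_array


def match_range(text: str, suffix_array: list[int], pattern: str) -> tuple[int, int]:
    """Return the half-open slot range [lo, hi) of suffixes starting with pattern."""
    m = len(pattern)

    def first_slot(strict: bool) -> int:
        # Binary search over the sorted suffixes, comparing only m-char prefixes.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_slot(strict=False), first_slot(strict=True)


def search(index, pattern: str) -> list[str]:
    """Return the distinct entries that contain pattern, in original order."""
    entries, text, starts, suffix_array = index
    lo, hi = match_range(text, suffix_array, pattern)
    # Map each matching text offset back to the entry it falls in.
    hits = sorted({bisect_right(starts, pos) - 1 for pos in suffix_array[lo:hi]})
    return [entries[i] for i in hits]


index = build_index(["some short string", "another but now a longer string"])
short_hits = search(index, "short")
string_hits = search(index, "string")
```

With the two sample strings from the README usage example, `short_hits` holds only the first entry and `string_hits` holds both. Each lookup costs two binary searches over the suffix array, which is why an index of this kind can answer queries far faster than a full scan of the text.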
setup.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 
 setuptools.setup(
     name='PySubstringSearch',
-    version='0.3.1',
+    version='0.4.0',
     author='Gal Ben David',
     author_email='gal@intsights.com',
     url='https://github.com/Intsights/PySubstringSearch',
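
The two counting methods this commit introduces differ only in what they count: entries that contain the substring, versus every match across all entries. A brute-force reference like the one below can make that distinction concrete, for example as an oracle when testing an index. These are standalone helper functions that merely borrow the README's method names, not the library's code, and whether the library counts overlapping matches is not stated in the commit, so counting them here is an assumption.

```python
def count_occurrences(entries, substring):
    """Total number of matches across all entries (overlapping matches included)."""
    total = 0
    for entry in entries:
        start = entry.find(substring)
        while start != -1:
            total += 1
            # Advance by one character so overlapping matches are also found.
            start = entry.find(substring, start + 1)
    return total


def count_entries(entries, substring):
    """Number of entries that contain the substring at least once."""
    return sum(1 for entry in entries if substring in entry)


# Sample entries taken from the comments in the README usage example.
entries = ["some short string", "another string now, but a longer string"]
occurrences = count_occurrences(entries, "string")
matching_entries = count_entries(entries, "string")
```

For these sample entries `occurrences` is 3 and `matching_entries` is 2, matching the README example: the second entry contains `string` twice but is counted once as an entry.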
