README.md: 33 additions & 25 deletions
@@ -18,6 +18,8 @@
 - [About The Project](#about-the-project)
 - [Built With](#built-with)
 - [Performance](#performance)
+- [500MB File](#500mb-file)
+- [6000MB File](#6000mb-file)
 - [Prerequisites](#prerequisites)
 - [Installation](#installation)
 - [Usage](#usage)
@@ -29,7 +31,10 @@
 
 PySubstringSearch is a library for searching an index file for substring patterns. The library is written in C++ for speed and efficiency. It uses the [Msufsort](https://github.com/michaelmaniscalco/msufsort) suffix array construction library for string indexing. The created index consists of the original text and 32-bit suffix array structs. The library relies on a proprietary container protocol to hold the original text along with the index in chunks of 512 MB, to work around the limitation of the suffix array construction implementation.
 
-The module implements two methods, search_sequential & search_parallel. search_sequential searches through the inner chunks one by one, whereas search_parallel searches them concurrently. When dealing with big indices, bigger than 1 GB for example, search_parallel would be faster. I advise checking both against the resulting index to find which one fits better.
+The module implements multiple methods.
+- `search` - searches concurrently for a substring that exists in different entries within the index file. As the index file grows and spans multiple inner chunks, the benefit of concurrency increases.
+- `count_entries` - returns the number of entries in the index file that contain the substring.
+- `count_occurrences` - returns the number of occurrences of the substring across all entries. If the substring appears multiple times in the same entry, each occurrence is counted.
 
 
 ### Built With
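
To make the three methods added in this hunk concrete, here is a minimal pure-Python sketch of the idea the README describes: text is split into chunks, each chunk gets its own suffix array, and a query is answered by binary-searching every chunk concurrently. This is not the library's C++ implementation or its API. The `Chunk` class, `build_index`, the newline separator, the `max_chunk_chars` parameter, and the function signatures are all hypothetical, the suffix array is built naively rather than with Msufsort, and counting overlapping matches separately is an assumption the README does not confirm.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

SEPARATOR = "\n"  # hypothetical entry delimiter; patterns must not contain it


class Chunk:
    """One chunk: its entries joined into a text, plus a suffix array over it."""

    def __init__(self, entries):
        self.entries = list(entries)
        self.text = SEPARATOR.join(self.entries)
        # Start offset of every entry inside the joined text.
        self.starts, offset = [], 0
        for entry in self.entries:
            self.starts.append(offset)
            offset += len(entry) + len(SEPARATOR)
        # Naive construction for illustration only (the library uses Msufsort).
        self.suffix_array = sorted(range(len(self.text)),
                                   key=lambda i: self.text[i:])

    def _bound(self, pattern, strict):
        """Binary search for the first suffix whose prefix is >= (or >) pattern."""
        lo, hi = 0, len(self.suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = self.suffix_array[mid]
            prefix = self.text[start:start + len(pattern)]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    def hits(self, pattern):
        """Local entry index of every occurrence (one item per occurrence)."""
        if not pattern:
            return []
        first = self._bound(pattern, strict=False)
        last = self._bound(pattern, strict=True)
        return [bisect_right(self.starts, pos) - 1
                for pos in self.suffix_array[first:last]]


def build_index(entries, max_chunk_chars):
    """Group entries into chunks no longer than max_chunk_chars when possible."""
    chunks, current, size = [], [], 0
    for entry in entries:
        if current and size + len(entry) > max_chunk_chars:
            chunks.append(Chunk(current))
            current, size = [], 0
        current.append(entry)
        size += len(entry) + len(SEPARATOR)
    if current:
        chunks.append(Chunk(current))
    return chunks


def _hits_per_chunk(chunks, pattern):
    # One task per chunk; more chunks means more work that can overlap.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda chunk: chunk.hits(pattern), chunks))


def search(chunks, pattern):
    """Entries containing the pattern, each listed once, in index order."""
    found = []
    for chunk, hits in zip(chunks, _hits_per_chunk(chunks, pattern)):
        found.extend(chunk.entries[i] for i in sorted(set(hits)))
    return found


def count_entries(chunks, pattern):
    return len(search(chunks, pattern))


def count_occurrences(chunks, pattern):
    return sum(len(hits) for hits in _hits_per_chunk(chunks, pattern))


index = build_index(["banana", "bandana", "cabana", "apple"], max_chunk_chars=14)
print(search(index, "ana"))             # ['banana', 'bandana', 'cabana']
print(count_entries(index, "ana"))      # 3
print(count_occurrences(index, "ana"))  # 4
```

The difference between the two counters shows up on "banana", which contains "ana" twice: it adds one to `count_entries` and two to `count_occurrences`. Python threads are used here only to mirror the one-task-per-chunk structure; they do not reproduce the speedup of the native implementation.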
@@ -39,20 +44,21 @@ The module implements two methods, search_sequential & search_parallel. search_s
 
 ### Performance
 
-| Library | Text Size | Function | Time | #Results | Improvement Factor |