In a previous article we presented the definition of memory Association by Reinforcements of Contexts (mARC).
In this article, we share some feedback from an experimental demonstration that compares mARC™ to a “Best Class Search” engine. We are not publishing the study in its entirety. You should nevertheless find enough information here to get back to us if you find this experimental demonstration relevant.
Experimental Demonstration:
In order to demonstrate mARC™'s benefits, especially points 1, 2, 4 and 5 described in our previous article, we built an experimental platform that compares a traditional, efficient procedural search engine, in this case «Best Class Search», with a basic search system, a functional clone of «Best Class Search», that uses a mARC™ memory.
Experimentation Goals: Plausibly compare two functionally equivalent systems in order to assess the advantages of implementing mARC™.
Data: The selected corpus consists of Wikipedia articles in French (Fr) and in English (En).
- French base for mARC: 1.0 million articles
- English base for mARC: 3.5 million articles
- French base for «Best Class»: 1.4 million articles
- English base for «Best Class»: 3.9 million articles
* The difference in the number of articles arises because Wikipedia and «Best Class Search», as an up-to-date web search engine, keep evolving, whereas the data we have access to is a Wikipedia snapshot from approximately February 2010.
Experimental Platform:
1) «Best Class Search», restricted to the fr.wikipedia.org and en.wikipedia.org domains.
2) A simulation of a procedural keyword search engine, implementing a pre-commercial version of the mARC™ software. The search and indexing engine is named Syncytiotrophoblaste. It is intended to serve as a basic programming sample using mARC™ memory in the first commercial documentation of mARC™. The user interface (UI) mimics the «Best Class Search» look and feel, including advanced features such as query input assistance and dynamic predictive requests while typing. The UI provides additional key features that are easy to implement thanks to mARC™, such as:
- Search by contextual similarity of articles (SA),
- Image meta-search engine based on the query,
- Dynamic query helper based on associations and shape suggestions.
Syncytiotrophoblaste is not a contextual search engine in itself; such a design would not have allowed a point-to-point comparison with a market reference. The constraint of simulating a «keyword» search mode requires using mARC™ in low-level mode, except for the search by contextual similarity of articles (SA). The indexing algorithm used is also keyword-oriented rather than purely contextual, which overloads the prototype's internal database. In other words, the application's very basic technical design takes only partial advantage of the underlying features of mARC™.
3) Environment: On one side, the distributed «Best Class Search» architecture; on the other, an Intel Core i5 server hosted by OVH. The OVH server hosts two mARCs, one for the French index and one for the English index, plus an Apache web server. The operating system (OS) is Windows 7, virtualized in a VMware partition.
4) Validity of the comparison: It is not obvious at first sight that the two platforms are comparable. However, a somewhat finer analysis of the distributed «Best Class Search» architecture indicates that, as a first approximation, the comparison makes sense. The external engineers we consulted consider the test conditions meaningful, and even that the mARC platform is slightly disadvantaged.
Results: Data Independence
Indexing and queries are handled in exactly the same way on the French and English corpora by the search application. We are able to demonstrate the same behavior on the German, Spanish, Italian, Alsatian and Breton Wikipedia corpora. The version of mARC™ used is based on two simplifying assumptions. Hence, the current version of mARC does not allow us to validate universal data independence. Nevertheless, it proves, at a minimum, that it is independent of the stored language. We therefore have “only” a partial proof of data independence.
Results: Compactness
Here are the different sizes of the elements used by the simulated search engine: mARC™ memory content, indexing information, and reverse indexing information.
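As a rough illustration of how element sizes like these translate into an index/data ratio, the sketch below reuses the ratios quoted in this section (about 50% for the mARC-based index, 100% to 300% for classic full-text indexes). The function names and the 1 GB corpus size are hypothetical, chosen only for illustration; they are not measurements from the study.

```python
def index_data_ratio(index_bytes: float, data_bytes: float) -> float:
    """Return the index size as a fraction of the raw data size."""
    if data_bytes <= 0:
        raise ValueError("data_bytes must be positive")
    return index_bytes / data_bytes


def compactness_gain(classic_ratio: float, marc_ratio: float) -> float:
    """Return how many times smaller the mARC-based index is."""
    return classic_ratio / marc_ratio


# Hypothetical corpus size, for illustration only (not a figure from the study).
DATA_BYTES = 1_000_000_000

# Ratios quoted in the article: about 50% for the mARC-based index,
# 100% to 300% for classic full-text indexes.
marc_ratio = index_data_ratio(0.5 * DATA_BYTES, DATA_BYTES)
classic_low = index_data_ratio(1.0 * DATA_BYTES, DATA_BYTES)
classic_high = index_data_ratio(3.0 * DATA_BYTES, DATA_BYTES)

gain_low = compactness_gain(classic_low, marc_ratio)
gain_high = compactness_gain(classic_high, marc_ratio)
print(f"gain between {gain_low:.0f}x and {gain_high:.0f}x")
```

With these inputs the computed gain falls in the 2 to 6 range reported for the implemented application; the result depends only on the ratios, not on the assumed corpus size.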
We notice that:
1) The size of mARC does not vary linearly with the amount of stored data but, at worst, as the logarithm (log) of the amount of stored data.
2) The index, that is, all the information the search engine needs, weighs about 50% of the initial data size. Today, the full-text indexes included in most SQL servers on the market, and the indexes of search engines such as Indri, Sphinx and others, “cost” between 100% and 300% of the initial data size. We do not know the index/data ratio of «Best Class Search».
Conclusion: mARC™ itself is compact, as stated at the beginning of this study. Indexing applications based on mARC™ are much more compact than similar ones based on linear memory (RAM). For the application implemented here, the gain in index/data ratio between mARC™ indexing and classic indexing is between 2 and 6. An indexing strategy that is less “keyword search” and more contextual (like Similar Article) would easily shrink the footprint of the static indexing information by an order of magnitude (x10), by using mARC™'s dynamic resolution of relationships at runtime.
Results: Speed
This application does not allow a
direct measure of mARC™'s speed. Nevertheless, since the application is built on mARC™, the gain in speed compared to a state-of-the-art linear-memory technology can only be attributed to the use of mARC™.
Protocol:
We used a list of the 100 most popular search requests of 2010 and 2011 on Wikipedia in French and English. A second part of the test uses article titles as new queries, and copies and pastes a portion of an article's text as a query, in order to follow the trend in query size observed in recent years. Each query is run four times. The first one is to
measure the nocache response time, the other three to evaluate the mean of the cached query response time. The real recall rate was also measured. However, it is unnecessary to repeat the search with the omitted results included: the recall rate generally does not vary, or varies by a few units at most. In the case of Syncytiotrophoblaste, the indicated recall rate is always the real one, and all documents can be accessed. We made measurements in 4 cases: Fr, En, Popular Requests, Long Requests. Average query times have been extrapolated using the Pareto rule that generally applies to cache / nocache logic, namely: first query (nocache) * 20% + average of the three following queries (cache) * 80%.
Consistency:
For the tests on the most popular queries, the average «Best Class Search» response times restricted to the fr.wikipedia.org and en.wikipedia.org domains are 119 and 132 ms respectively. The same queries, extended to the whole Web (not published here to avoid overloading the article), give an average response time of about 320 ms, of the same order of magnitude as the 250 ms claimed by «Best Class Search». These results support the consistency of our assumptions: «Best Class Search» focuses on and optimizes specific search areas, such as Wikipedia, and furthermore each server cluster involved in resolving a query is under very little stress, as indicated by the stability of the «Best Class Search» response time. This comparison is plausible, in our view.
Conclusion about Speed:
Reading the results leads to the following conclusions:
- The search application that implements a software mARC™ is at least one order of magnitude faster (factor 10),
- mARC™ implicitly increases data caching by a factor of 2 compared to the most sophisticated solutions currently deployed. We believe that caching mechanisms based on contextual predictions could achieve an improvement of one order of magnitude in that field (factor 10).
-
One mARC™ allows our search application to compute dynamic similarity from all the contexts of a document within an average response time of about 52 ms, which is 3 to 5 times quicker than a simple keyword query on a traditional search engine, and with a result of unquestionable relevance.
Remark: With Syncytiotrophoblaste, once the first result page is accessed, all results are cached. As a result, the average response time per page, with 20 results per page, is about 5 ms. With «Best Class Search», loading another page is equivalent to a non-cached query, which takes 70 to 300 ms each time. Knowing that an average search session generates about 2.5 page requests, one can easily interpolate the average response time for a search engine optimized with mARC™ to less than 5 ms, which corresponds to a ratio of more than 25 compared to a procedural search engine like «Best Class Search». An optimized integration, based on a commercial version of mARC™ and coupled with industrial-grade search engine development, could be expected to deliver not just one order of magnitude (x10) but something closer to two orders of magnitude (x100).
Results: Easy Programming
Below is the PHP code used in the Syncytiotrophoblaste application to run the Similar Article query: 4 elementary calls to mARC™'s API, plus initialization code and results rendering. The complexity of detecting and selecting contexts is outsourced entirely and transparently to mARC™, which handles it within a few milliseconds.
    public function connexearticles($rowid)
    {
        // echo " similar article ";
        $this->s->Execute($this->session, 'CONTEXTS.CLEAR');
    }
Partial Conclusion
Technically, the different characteristics of memory
mARC™ are demonstrated here indirectly. In other words, in the case of a text-type signal, are these contexts directly understandable by a human being? Put another way, does a search engine using a mARC™ bring relevance, in the intuitive sense of the term, compared to a conventional system?
Discussion about evaluation of relevancy
Relevance, although subjective (not modelled), is very real. It is simply not possible to make a physical measurement of relevance. The only valid method would be to provide this comparison platform to a large enough set of Wikipedia users, searching in parallel on «Best Class Search» and Syncytiotrophoblaste, and then to evaluate their respective satisfaction. Please contact us if you would like to discuss this further, as we have left out the second part of our study.
Global Conclusion:
In terms of relevance, automatic selection of the most interesting articles, and exploration of documents in a database, it is clear that the application based on mARC™ offers better results than all current procedural keyword-oriented engines.
All rights reserved.