mARC - Experimental Demonstration

posted Jun 26, 2012, 6:22 AM by Jean-Michel Davault   [ updated Dec 18, 2013, 5:40 AM ]

In a previous article we presented the Definition of memory Association by Reinforcements of Contexts.

In this article, we share some feedback from an experimental demonstration that compares mARC™ to a “Best Class Search” engine. We do not publish our study in full. You should nevertheless find enough information to come back to us if you find this experimental demonstration relevant.

Experimental Demonstration:

In order to demonstrate mARC’s benefits, especially points 1, 2, 4 and 5 described in our previous article, we built an experimental platform that compares a traditional, efficient procedural search engine, in this case «Best Class Search», with a basic search system, a functional clone of «Best Class Search», that uses a mARC™ memory.

Experimentation Goals:

Plausibly compare two functionally equivalent systems in order to check the advantages of implementing mARC™.

Data: The selected corpus consists of Wikipedia articles in French (Fr) and in English (En).

Fr base for mARC: 1.0 million articles
En base for mARC: 3.5 million articles

Fr base for «Best Class»: 1.4 million articles
En base for «Best Class»: 3.9 million articles

* The difference in the number of articles is due to the fact that Wikipedia and «Best Class Search», as an up-to-date web search engine, keep evolving, whereas the data we have access to is a Wikipedia snapshot from approximately February 2010.

Experimental Platform:

1) «Best Class Search» is restricted to fr.wikipedia.org and en.wikipedia.org domains

2) A simulation of a procedural keyword search engine, built on a pre-commercial software version of mARC™.

The search and indexing engine is named Syncytiotrophoblaste. It is intended to serve as a basic programming sample for mARC™ memory in the first commercial documentation of mARC™.

The user interface (UI) mimics the «Best Class Search» look and feel, including advanced features such as query input assistance and dynamic predictive requests while typing.

The UI provides additional key features that are easy to implement thanks to mARC™, such as:

- Search by contextual similarity of articles (SA),
- Meta-search image engine based on the query,
- Dynamic query helper based on suggestions of associations and shapes.

Syncytiotrophoblaste is not a contextual search engine in itself, since such a design would not allow a point-by-point comparison with a market reference.

The constraint of simulating a « keyword » search mode requires using mARC™ in low-level mode, except for the search by contextual similarity of articles (SA).

The indexing algorithm is also keyword oriented rather than purely contextual, which overloads the internal database of the prototype.

In other words, the very basic technical design of the application takes only partial advantage of the underlying features of mARC™.

3) Environment

On one side, the distributed «Best Class Search» architecture; on the other, a single Intel Core i5 server hosted by OVH.

The OVH-hosted server runs two mARCs, one for the French index and one for the English index, plus an Apache web server. The operating system (OS) is Windows 7, virtualized in a VMware partition.

4) Validity of the comparison

It is not obvious, at first sight, that the two platforms are comparable. However, a somewhat finer analysis of the «Best Class Search» distributed architecture indicates that, as a first approximation, the comparison makes sense. The external engineers we consulted consider the test conditions meaningful, and even consider the mARC platform slightly disadvantaged.

Results: Data Independence

The search application handles indexing and queries in exactly the same way on the French and English corpora.

We are able to demonstrate the same behavior on the German, Spanish, Italian, Alsatian and Breton Wikipedia corpora.

The version of mARC™ used relies on two simplifying assumptions:
- The signal is segmented into packets of 8 bits
- The « SPACE » character is used as the primary signal segmenter.

Hence, the current version of mARC does not allow us to validate universal data independence. Nevertheless, it proves, at a minimum, that mARC is independent of the language of the stored data.

Therefore we “only” have a partial proof of data independence.
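
To make the two simplifying assumptions concrete, here is a minimal PHP sketch of such a segmentation. It is an illustration only, not mARC™ code: the function name segmentSignal is hypothetical, and the real memory may process its input quite differently. The point is that the input is handled as raw 8-bit values split on SPACE, so nothing in it depends on the language of the text.

```php
<?php
// Hypothetical helper, not part of mARC: splits a raw signal under the two
// simplifying assumptions (8-bit packets, SPACE as primary segmenter).
function segmentSignal(string $signal): array
{
    $segments = [];
    // Split on the SPACE byte (0x20).
    foreach (explode(' ', $signal) as $chunk) {
        if ($chunk === '') {
            continue; // skip empty chunks produced by repeated spaces
        }
        // unpack('C*') reads the chunk as unsigned 8-bit values (1-indexed),
        // array_values() turns that into a plain 0-indexed list.
        $segments[] = array_values(unpack('C*', $chunk));
    }
    return $segments;
}
```

For example, segmentSignal("ab c") yields two segments, [97, 98] and [99]. An accented character encoded in UTF-8 simply becomes two 8-bit packets inside its segment.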

Results: Compactness

Here are the sizes of the elements used by the simulated search engine: mARC™ memory content, indexing information, and reverse indexing information.
 

      mARC (MB)   Indexing (MB)   Reverse indexing (MB)   Total (MB)   Data (MB)   Ratio (%)
Fr    500         900             731                     2100         4000        52.5
En    600         1600            1500                    3700         11000       33.6

We notice that:

1) The size of mARC does not grow linearly with the amount of stored data but, at worst, as the logarithm (log) of the amount of stored data.

2) The index, that is, all the information the search engine needs, weighs about 50% of the initial data size.

Today, the full-text indexes included in most SQL servers on the market, or the indexes of search engines such as Indri, Sphinx and others, “cost” between 100% and 300% of the initial data size.

We do not know the index/data ratio of «Best Class Search».

In conclusion, mARC™ itself is compact, as stated at the beginning of this study.

mARC™-based indexing applications are much more compact than similar ones based on linear memory (RAM). The gain in index/data ratio between mARC™ indexing and classic indexing is between 2 and 6 for the application implemented here.
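
The arithmetic behind these figures can be checked with a short PHP sketch. The sizes are copied from the table of this article, and the 50%, 100% and 300% values are the ones quoted in the text; the function names are hypothetical and only serve the illustration.

```php
<?php
// Totals and data sizes in MB, copied from the compactness table.
$sizes = [
    'Fr' => ['total' => 2100, 'data' => 4000],
    'En' => ['total' => 3700, 'data' => 11000],
];

// Index/data ratio in percent, rounded to one decimal as in the table.
function indexDataRatio(float $totalMb, float $dataMb): float
{
    return round(100 * $totalMb / $dataMb, 1);
}

// Gain of an index weighing $marcPercent of the data over a classic index
// weighing $classicPercent of the data.
function compactnessGain(float $classicPercent, float $marcPercent): float
{
    return $classicPercent / $marcPercent;
}

foreach ($sizes as $lang => $s) {
    printf("%s: %.1f%%\n", $lang, indexDataRatio($s['total'], $s['data']));
}
// Classic indexes cost 100% to 300% of the data, against about 50% here.
printf("gain: %.0f to %.0f\n", compactnessGain(100, 50), compactnessGain(300, 50));
```

With the roughly 50% ratio quoted in the text, the gain range comes out as 2 to 6, which matches the figures stated above.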

A less “keyword search” and more contextual indexing strategy (like Similar Article) would easily shrink the footprint of the static indexing information by an order of magnitude (x10), using mARC™’s dynamic resolution of relationships at runtime.

Results: Speed

This application does not allow a direct measurement of mARC™’s speed. Nevertheless, since the application relies on mARC™, its gain in speed over a state-of-the-art linear-memory technology can only be attributed to the use of mARC™.
mARC™ never accounts for more than 10% of the search application’s response time; the remainder goes to disk I/O (input/output), data rendering, the API, and communication.

Protocol:

We used a list of the 100 most popular search requests of 2010 and 2011 on Wikipedia in French and English.

A second part of the test uses article titles as queries, as well as a copied and pasted portion of an article’s text, in order to follow the trend toward longer queries observed in recent years.

Each query is run four times: the first run measures the non-cached response time, and the other three are averaged to estimate the cached response time.
The values retained are those measured at the engine level and reported by the two applications being compared. They cover only the time needed to resolve the query, not the whole web process of formatting and transmitting the results.

The real recall rate was also measured.
Caution: in the case of one particular «Best Class Search», for marketing reasons, the displayed recall rate is only a potential figure. Some other engines do not even provide the real recall rate.
Example: “About 72,000 results (0.26 seconds)”
The real recall rate never goes beyond 800 results. The only way to measure it is to go to the last page of results.
Example: «To limit the results to the most relevant pages (total: 606), we have omitted some very similar entries. If you wish, you can repeat the search with the omitted results included.»

However, repeating the search with the omitted results included is unnecessary: the recall rate generally does not change, or changes by a few units at most.

In the case of Syncytiotrophoblaste, the indicated recall rate is always the real one, and all documents can be accessed.

We made measurements in 4 cases: Fr, En, Popular Requests, Long Requests.

Average query times have been extrapolated from the Pareto rule that generally applies to cache / nocache logic, namely: first query (nocache) * 20% + average of the next three queries (cache) * 80%.
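
As an illustration, the weighting rule above can be written as a small PHP function. The function name averageQueryTime is hypothetical, and the timings used in the example that follows are made up for the arithmetic; they are not measurements from the study.

```php
<?php
// Weighted average described above: 20% for the first (non-cached) run,
// 80% for the mean of the following (cached) runs. Times are in ms.
function averageQueryTime(float $noCacheMs, array $cachedMs): float
{
    if (count($cachedMs) === 0) {
        throw new InvalidArgumentException('At least one cached run is required.');
    }
    $cachedMean = array_sum($cachedMs) / count($cachedMs);
    return 0.2 * $noCacheMs + 0.8 * $cachedMean;
}
```

With a hypothetical non-cached run of 100 ms and cached runs of 10, 20 and 30 ms, averageQueryTime(100, [10, 20, 30]) gives 0.2 × 100 + 0.8 × 20 = 36 ms.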

Consistency:

For the most popular queries, the average «Best Class Search» response times, restricted to the fr.wikipedia.org and en.wikipedia.org domains, are 119 ms and 132 ms respectively.

The same queries, extended to the whole Web (not published here for brevity), show an average response time of about 320 ms, of the same order of magnitude as the 250 ms claimed by «Best Class Search».

These results support the consistency of our assumptions: «Best Class Search» focuses on and optimizes specific search areas such as Wikipedia, and furthermore each server cluster that takes part in resolving a query is under very little load, as indicated by the stability of the «Best Class Search» response time. The comparison is plausible, in our view.

Conclusion about Speed:

Reading the results leads to the following conclusions:

- The search application that implements a software mARC™ is at least one order of magnitude faster (factor 10).

- mARC™ implicitly improves data caching by a factor of 2 compared to the most sophisticated solutions currently deployed. We believe that caching mechanisms based on contextual predictions could achieve an improvement of one order of magnitude (factor 10) in that area.

- A single mARC™ gives our search application dynamic similarity across all contexts of a document, with an average response time of about 52 ms, which is 3 to 5 times quicker than a simple keyword query on a traditional search engine, and with a result of unquestionable relevance.
Please note that an imperfect simulation of a contextual search with keywords, using AND and OR operators, implies a combinatorial explosion equivalent to several hundred classic procedural queries, which cannot scale.
Reminder: For a query with n terms, the number of logical queries is of the order of (2^(n-1) - 1) for n > 1.
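
To give a feel for this growth, here is a short PHP sketch that evaluates the reminder’s formula, taking it to be 2^(n-1) - 1 (an assumption about the original superscript, which did not survive formatting). The function name logicalQueryCount is hypothetical.

```php
<?php
// Number of logical queries for a query of n terms, assuming the formula
// reads 2^(n-1) - 1 and is valid for n > 1.
function logicalQueryCount(int $terms): int
{
    if ($terms < 2) {
        throw new InvalidArgumentException('The formula is stated for n > 1.');
    }
    // 1 << k is 2 to the power k for non-negative k.
    return (1 << ($terms - 1)) - 1;
}
```

Under that reading, a 5-term query corresponds to 15 logical queries and a 10-term query to 511, which is consistent with the “several hundred” mentioned above.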

Remark: With Syncytiotrophoblaste, once the first result page has been accessed, all results are cached. As a result, the average response time per page, with 20 results per page, is about 5 ms. With «Best Class Search», loading another page is equivalent to a non-cached query, which takes 70 to 300 ms each time.
Please note that our application performs no query optimization of its own: each new page causes a full re-evaluation of the query, as with «Best Class Search». A trivial optimization would be to keep the query result in a session variable to speed up navigation through the result pages. Response time would then drop to 0.5 ms / page, independently of query complexity.
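
The session-variable optimization mentioned here could look like the following PHP sketch. It is not the Syncytiotrophoblaste code: getResultPage, the $session array (standing in for $_SESSION) and the $runQuery callable are all hypothetical names introduced for the illustration.

```php
<?php
// Hypothetical sketch: $session stands in for $_SESSION, and $runQuery for
// whatever callable resolves the full query against the engine.
function getResultPage(
    array &$session,
    string $query,
    int $page,
    callable $runQuery,
    int $perPage = 20
): array {
    $key = 'results:' . md5($query);
    if (!isset($session[$key])) {
        // First access: resolve the query once and keep the full result list.
        $session[$key] = $runQuery($query);
    }
    // Later pages are served by slicing the stored list, so the query is
    // not re-evaluated when the user moves from page to page.
    $offset = max(0, $page - 1) * $perPage;
    return array_slice($session[$key], $offset, $perPage);
}
```

With this design, the cost of a page after the first one is an array slice, which does not depend on how complex the original query was. In a real web application the stored list would live in $_SESSION after a call to session_start().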

Given that an average search session visits about 2.5 result pages, one can easily estimate the average response time of a search engine optimized with mARC™ at less than 5 ms, which corresponds to a ratio of more than 25 compared to a procedural search engine like «Best Class Search».

An optimized integration, based on a commercial version of mARC™ and coupled with an industrial-grade search engine development, could be expected to gain not just one order of magnitude (x10) but something closer to two orders of magnitude (x100).

Results : Easy Programming

You will find below the PHP code used within the Syncytiotrophoblaste application to run the Similar Article query.

It consists of 4 elementary accesses to mARC™’s API, initialization code and results rendering.

The complexity of detecting and selecting contexts is delegated entirely and transparently to mARC™, which handles it within a few milliseconds.

public function connexearticles($rowid)
{
    // echo " similar article ";
    // Note: the comments below are inferred from the command names only.

    // Clear the session's contexts and results, then select the knowledge base.
    $this->s->Execute($this->session, 'CONTEXTS.CLEAR');
    $this->s->Execute($this->session, 'RESULTS.CLEAR');
    $this->s->Execute($this->session, 'CONTEXTS.SET', 'KNOWLEDGE', $this->knw);

    // Create a context from the source article (row $rowid of wikimaster2).
    $this->s->Execute($this->session, 'CONTEXTS.NEW');
    $this->s->Execute($this->session, 'TABLE:wikimaster2.TOCONTEXT', $rowid);

    // Evaluate, filter and combine contexts.
    $this->s->Execute($this->session, 'CONTEXTS.DUP');
    $this->s->Execute($this->session, 'CONTEXTS.EVALUATE');
    $this->s->Execute($this->session, 'CONTEXTS.FILTERACT', '25', 'true');
    $this->s->Execute($this->session, 'CONTEXTS.NEWFROMSEM', '1', '-1', '-1');
    $this->s->Execute($this->session, 'CONTEXTS.SWAP');
    $this->s->Execute($this->session, 'CONTEXTS.DROP');
    $this->s->Execute($this->session, 'CONTEXTS.SWAP');
    $this->s->Execute($this->session, 'CONTEXTS.DUP');
    $this->s->Execute($this->session, 'CONTEXTS.ROLLDOWN', '3');
    $this->s->Execute($this->session, 'CONTEXTS.UNION');
    $this->s->Execute($this->session, 'CONTEXTS.EVALUATE');
    $this->s->Execute($this->session, 'CONTEXTS.INTERSECTION');
    $this->s->Execute($this->session, 'CONTEXTS.NORMALIZE');
    $this->s->Execute($this->session, 'CONTEXTS.FILTERACT', '25', 'true');

    // Convert the context to results, select on Act > 95, and sort by Act.
    $this->s->Execute($this->session, 'CONTEXTS.TORESULTS', 'false', '25');
    $this->s->Execute($this->session, 'RESULTS.SelectBy', 'Act', '>', '95');
    $this->s->Execute($this->session, 'RESULTS.SortBy', 'Act', 'false');

    // Request the result count and read it back.
    $this->s->Execute($this->session, 'RESULTS.GET', 'ResultCount');
    $count = $this->s->KMResults;
}

Partial Conclusion

Technically, the various characteristics of the mARC™ memory are demonstrated here indirectly.
Nevertheless, it remains to be determined whether the contexts detected and learned by mARC™ are directly usable.

In other words, in the case of a text-type signal, are these contexts directly understandable by a human being?

Another way of looking at it is to ask whether a search engine using a mARC™ brings more relevance, in the intuitive sense of the term, than a conventional system.

Discussion about evaluation of relevancy

Relevance, although subjective (not modelled), is very real. It is simply not possible to make a physical measurement of relevance. The only valid method would be to provide this comparison platform to a large enough set of Wikipedia users, have them search in parallel on «Best Class Search» and Syncytiotrophoblaste, and then evaluate their respective satisfaction.

As we have left out the second part of our study, please contact us if you are interested in discussing it further.

Global Conclusion:

In terms of relevance, automatic selection of the most interesting articles, and exploration of documents in a database, it is clear that the application based on mARC™ offers better results than all current procedural keyword-oriented engines.

All rights reserved.