LuceneModule

If parts of certain nodes are encountered repeatedly in a multilevel, they are added to the document multiple times.

Details

  • Type: Bug
  • Status: In Progress
  • Priority: Major
  • Resolution: Unresolved
  • Affects Version/s: 1.0
  • Fix Version/s: 1.0
  • Component/s: None
  • Description:
    This has little effect on the findability of the document, because for that it hardly matters whether the same text is indexed more than once.

    It does, however, have a severe effect on the scoring.
    See e.g. http://www.lucenetutorial.com/advanced-topics/scoring.html

    It would be better to check whether the values of certain fields were already indexed for this document, and to skip them if so.

    Index definition that was in use when this problem was found (from the EO repository):

        <list path="genres,programs,episodes,mediarel,mediafragments,mediasources" element="episodes" searchdirs="destination">
          <mmsq:constraint field="mediafragments.status" value="3" />
          <mmsq:constraint field="mediasources.format" operator="in" value="1,9,12" />
          <mmsq:constraint field="episodes.title" operator="like" value="%job%" />
          <mmsq:constraint field="episodes.body" operator="like" value="%job%" />
          <mmsq:field name="genres.number" alias="genre" keyword="true" />
          <mmsq:field name="programs.title" boost="5" />
          <mmsq:field name="programs.mediaclasse" alias="mediaclasse" keyword="true" />
          <mmsq:field name="episodes.title" boost="10" />
          <mmsq:field name="episodes.subtitle" />
          <mmsq:field name="episodes.intro" />
          <mmsq:field name="episodes.body" />
          <mmsq:field name="episodes.shorttext" />
          <mmsq:field name="episodes.keywords" alias="keywords" boost="20" keyword="true" split="," />
        </list>

    Here the multilevel is made longer only so that constraints can be added on the extra steps. But if an episode has many related mediafragments, the 'episodes' fields themselves are added to the document very many times, which makes it score very badly because the document becomes so big.
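
    The check proposed above (skip a field value if it was already added to the current document) could be sketched roughly as follows. This is a plain-Java illustration under stated assumptions, not the LuceneModule code: `DedupingDocument`, `add`, `values` and the sample rows are hypothetical names, and a map of sets stands in for the real Lucene document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for one index document; not part of the LuceneModule. */
class DedupingDocument {
    // Field name -> distinct values added so far, in insertion order.
    private final Map<String, Set<String>> fields = new LinkedHashMap<>();

    /** Adds the value unless this field already holds it. Returns true if it was added. */
    boolean add(String field, String value) {
        return fields.computeIfAbsent(field, k -> new LinkedHashSet<>()).add(value);
    }

    /** Values that actually ended up in the document for the given field. */
    List<String> values(String field) {
        return new ArrayList<>(fields.getOrDefault(field, Collections.emptySet()));
    }

    public static void main(String[] args) {
        DedupingDocument doc = new DedupingDocument();
        // Assumed scenario: one episode with three related mediafragments, so the
        // multilevel yields three rows that all carry the same episode title.
        for (int row = 0; row < 3; row++) {
            doc.add("episodes.title", "Job hunting");
        }
        System.out.println(doc.values("episodes.title")); // prints [Job hunting]
    }
}
```

    Without such a check the title would be added three times in this example, once per row of the multilevel.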


Activity

Pierre van Rooden added a comment - 2010-03-23 14:40
The fix as proposed causes problems with document data NOT getting indexed.
The problem occurs if you do something like this:

    <list path="t_stream,t_metadata" searchdirs="destination" element="t_stream">
      <mmsq:constraint field="t_metadata.enduser" value="learner" />
      <mmsq:constraint field="t_metadata.schooltype" value="1" />
      <mmsq:field name="title" boost="2" />
      <mmsq:field name="subtitle" />
      <mmsq:field name="intro" />
      <mmsq:field name="body" />
      <mmsq:relatednodes type="t_metadata">
        <mmsq:field name="enduser" alias="enduser" store="true" keyword="false" />
        <mmsq:field name="minfactor" alias="minfactor" store="true" />
        <mmsq:field name="maxfactor" alias="maxfactor" store="true" />
        <mmsq:relatednodes type="t_keyword">
          <mmsq:field name="name" boost="4" />
        </mmsq:relatednodes>
      </mmsq:relatednodes>
    </list>

In this case, enduser, minfactor, and maxfactor are not indexed because the Lucene module presumes that t_metadata is already indexed.
It is possible to work around this by rewriting your queries as follows:

    <list path="t_stream,t_metadata" searchdirs="destination" element="t_stream">
      <mmsq:constraint field="t_metadata.enduser" value="learner" />
      <mmsq:constraint field="t_metadata.schooltype" value="1" />
      <mmsq:field name="title" boost="2" />
      <mmsq:field name="subtitle" />
      <mmsq:field name="intro" />
      <mmsq:field name="body" />
      <mmsq:field name="t_metadata.enduser" alias="enduser" store="true" keyword="false" />
      <mmsq:field name="t_metadata.minfactor" alias="minfactor" store="true" />
      <mmsq:field name="t_metadata.maxfactor" alias="maxfactor" store="true" />

      <mmsq:relatednodes type="t_metadata">
        <mmsq:relatednodes type="t_keyword">
          <mmsq:field name="name" boost="4" />
          <mmsq:field name="synonyms" />
          <mmsq:field name="typos" />
        </mmsq:relatednodes>
      </mmsq:relatednodes>
    </list>

But that is a bit silly.
At any rate, this is not backward compatible, so I would prefer that this be rolled back (at least in 1.9) until a proper fix is made that does not cause these issues.
Michiel Meeuwissen added a comment - 2010-04-13 10:40
I must say that your query is quite complicated, if not convoluted.

But perhaps it should keep track not only per node, but per node and per field, of whether it was already indexed?
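
The difference between the two kinds of bookkeeping can be sketched in plain Java. This is only an illustration under assumptions, not the LuceneModule code: `IndexedTracker` and `visit` are hypothetical names, node number 42 is made up, and it is assumed that per-node bookkeeping marks a node as indexed as soon as the multilevel touches it, even when no fields of that node were requested at that step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the two bookkeeping strategies; not the LuceneModule code. */
class IndexedTracker {
    private final boolean perField;
    private final Set<String> done = new HashSet<>();
    private final List<String> indexed = new ArrayList<>();

    IndexedTracker(boolean perField) {
        this.perField = perField;
    }

    /** Visits a node and indexes the requested fields, subject to the bookkeeping. */
    void visit(int node, String... fields) {
        if (perField) {
            // Per node, per field: skip only the fields already indexed for this node.
            for (String field : fields) {
                if (done.add(node + "." + field)) {
                    indexed.add(field);
                }
            }
        } else if (done.add(String.valueOf(node))) {
            // Per node: only the first visit of a node may index anything.
            indexed.addAll(Arrays.asList(fields));
        }
    }

    List<String> indexed() {
        return indexed;
    }

    public static void main(String[] args) {
        for (boolean perField : new boolean[] {false, true}) {
            IndexedTracker tracker = new IndexedTracker(perField);
            tracker.visit(42); // multilevel step, present only for the constraints
            tracker.visit(42, "enduser", "minfactor", "maxfactor"); // relatednodes step
            tracker.visit(42, "enduser", "minfactor", "maxfactor"); // same node in a later row
            System.out.println((perField ? "per node, per field: " : "per node: ") + tracker.indexed());
        }
    }
}
```

Under these assumptions the per-node variant indexes nothing for the t_metadata node, while the per-node, per-field variant indexes enduser, minfactor and maxfactor exactly once.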

Dates

  • Created: 2010-01-11 13:08
  • Updated: 2010-04-13 11:07