<sect1 id="rel">
	<title>Relevancy
<indexterm><primary>Relevancy</primary></indexterm>
</title>
	<sect2 id="rel-order">
		<title>Ordering documents</title>
		<para><application>mnoGoSearch</application> sorts results first by <literal>relevancy</literal>
and second by <literal>popularity rank</literal>.</para>


<sect3 id="relevancy"><title>Relevancy calculation</title>

<para>The relevancy of every found document is calculated as 100% multiplied by the cosine of the angle between the weight vector of the query
and the weight vector of the document. The number of vector coordinates is equal to the number of word forms in
the search query multiplied by the number of sections defined in <filename>indexer.conf</filename>. Each coordinate corresponds to
a word from the search query found in one of the document's sections. The value of a coordinate depends on the weight of that section,
defined by the <option>wf</option> parameter (see <xref linkend="search-changeweight"/>),
and on whether the word found is exactly the same as in the search query, or is a word form or a synonym of it.
One more coordinate is equal to the average distance between the searched words in the document. In the query vector, this coordinate is equal to 0.
</para>
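<para>
As a rough illustration of the cosine formula only, consider a query vector Q = (1, 1) and a document vector D = (1, 0),
that is, a document that matches only one of two coordinates. These two-coordinate vectors are hypothetical and are not
actual weights computed by <application>mnoGoSearch</application>:
</para>
<programlisting>
relevancy = 100% * (Q . D) / (|Q| * |D|)

(hypothetical vectors: Q = (1, 1), D = (1, 0))
Q . D = 1*1 + 1*0        = 1
|Q|   = sqrt(1*1 + 1*1)  = 1.4142
|D|   = sqrt(1*1 + 0*0)  = 1

relevancy = 100% * 1 / (1.4142 * 1) = 70.7%
</programlisting>
<para>
A document whose vector points in the same direction as the query vector would score 100%.
</para>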

<para>
In the default configuration, search can produce quite small score values,
because it expects that the words can be found in up to 256 document
sections at the same time. See the description of the
<xref linkend="cmdref-numsections"/> <filename>search.htm</filename> command
for how to specify the actual number of sections used and thus increase score values.
</para>
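<para>
A minimal <filename>search.htm</filename> sketch, assuming a configuration that uses four sections.
The value 4 and the <literal>command value</literal> argument form are assumptions made for illustration;
check the command reference for the exact syntax:
</para>
<programlisting>
# search.htm: number of sections actually in use (4 is a hypothetical value)
NumSections 4
</programlisting>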

<para>
Other commands affecting document order and/or score value are:
<xref linkend="cmdref-datefactor"/>,
<xref linkend="cmdref-docsizeweight"/>,
<xref linkend="cmdref-mincoordfactor"/>,
<xref linkend="cmdref-numdistinctwordfactor"/>,
<xref linkend="cmdref-numwordfactor"/>,
<xref linkend="cmdref-worddistanceweight"/>.
</para>

</sect3>

<sect3 id="poprank"><title>Popularity rank<indexterm><primary>Popularity rank</primary></indexterm></title>
<para>
The popularity rank is calculated in two stages. At the first stage, the value of the <option>Weight</option> parameter
of every server is divided by the number of links from this server. This gives the weight of one link from this server.
At the second stage, for every page, the weights of all links pointing to this page are summed.
This sum is the popularity rank of the page. Self links, i.e. links from a page
to itself, do not affect popularity rank.
</para>
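<para>
The two stages can be traced with a small worked example. All numbers are hypothetical and chosen only for illustration:
</para>
<programlisting>
Stage 1: weight of one link from each server (hypothetical values)
  server A: Weight = 1, 4 outgoing links:  1 / 4 = 0.25 per link
  server B: Weight = 2, 5 outgoing links:  2 / 5 = 0.40 per link

Stage 2: page P is the target of 2 links from server A and 1 link from server B
  popularity rank of P = 2 * 0.25 + 1 * 0.40 = 0.90
</programlisting>
<para>
A link from P to itself would not be counted at the second stage.
</para>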
<para><indexterm><primary>Command</primary><secondary>Weight</secondary></indexterm>
By default, the value of the <option>Weight</option> parameter is equal to 1 for all indexed servers.
You may change this value with the <command>Weight</command> command in the <filename>indexer.conf</filename> file or
directly in the <literal>server</literal> table, if you load the server configuration from this table.
</para>

<para>If you place the
<option><indexterm><primary>Command</primary><secondary>PopRankSkipSameSite</secondary></indexterm>PopRankSkipSameSite yes</option>
command in the <filename>indexer.conf</filename> file, the <command>indexer</command> will take only inter-site links (i.e. links from a page on 
one site to a page on another site) for popularity rank calculation.
</para>

<para>If you place the
<option><indexterm><primary>Command</primary><secondary>PopRankFeedBack</secondary></indexterm>PopRankFeedBack yes</option>
command in the <filename>indexer.conf</filename> file, the <command>indexer</command> will calculate the site weight before calculating
popularity rank. To do that, the <command>indexer</command> sums the popularity ranks of all pages from the same site. If this sum is
greater than 1, the site weight is set to this sum; otherwise, the site weight is set to 1.
</para>

<para>If you place the
<option><indexterm><primary>Command</primary><secondary>PopRankUseTracking</secondary></indexterm>PopRankUseTracking yes</option>
command in the <filename>indexer.conf</filename> file, the <command>indexer</command> will calculate the site weight as the number of
tracked queries restricted to this site.
</para>

<para>If you place the
<option><indexterm><primary>Command</primary><secondary>PopRankUseShowCnt</secondary></indexterm>PopRankUseShowCnt yes</option>
command in the <filename>search.htm</filename> file, then for every result shown to the user, the
corresponding <literal>url.shows</literal> value will be increased by 1 if the relevancy of this result is greater than or equal to
the value specified by the
<option><indexterm><primary>Command</primary><secondary>PopRankShowCntRatio</secondary></indexterm>PopRankShowCntRatio</option>
command (default value is 25.0).
If you place <option>PopRankUseShowCnt yes</option> in the <filename>indexer.conf</filename> file, the <command>indexer</command>
will add to the URL's popularity rank the value of <literal>url.shows</literal> multiplied by the value specified by the
<option><indexterm><primary>Command</primary><secondary>PopRankShowCntWeight</secondary></indexterm>PopRankShowCntWeight</option>
command (default value is 0.01).
</para>
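<para>
Put together, the show-count commands described above might be configured as follows. This is a sketch only:
placing <option>PopRankShowCntRatio</option> in <filename>search.htm</filename> and <option>PopRankShowCntWeight</option>
in <filename>indexer.conf</filename> is an assumption inferred from which program uses each value, and the numbers are
the defaults quoted above.
</para>
<programlisting>
# search.htm (placement of PopRankShowCntRatio here is an assumption)
PopRankUseShowCnt yes
PopRankShowCntRatio 25.0

# indexer.conf (placement of PopRankShowCntWeight here is an assumption)
PopRankUseShowCnt yes
PopRankShowCntWeight 0.01
</programlisting>
<para>
With these values, a hypothetical URL whose <literal>url.shows</literal> counter has reached 200 would get
200 * 0.01 = 2.0 added to its popularity rank.
</para>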


</sect3>

	</sect2>

<sect2 id="score-debug">
  <title>Analyzing score values</title>
  <para>Starting from version 3.3.7, it is possible to debug the
  score values calculated for the documents found. To debug
  a score value, go through these steps:
    <orderedlist>
      <listitem>
        Add this code at the bottom of the <literal>&lt;!--restop--&gt;</literal>
        section of your search template:
        <programlisting>
&lt;!--restop--&gt;
....
[DebugScore: $(DebugScore)]
&lt;!--/restop--&gt;
        </programlisting>
      </listitem>

      <listitem>
        Add this code at the bottom of the <literal>&lt;!--res--&gt;</literal>
        section of your search template:
        <programlisting>
&lt;!--res--&gt;
....
[ID=$(ID)]
&lt;!--/res--&gt;
        </programlisting>
      </listitem>
      <listitem>
        Open <program>search.cgi</program> in your browser and
        run a search query consisting of multiple words.
        You will see the document ID after the usual document information.
      </listitem>
      <listitem>
        Choose a document you want to see score debug information for.
        Remember its ID (let's say the ID is 100).
      </listitem>
      <listitem>
        Go to your browser's location bar, add
      <command>&amp;DebugURLID=100</command> 
      at the very end of the URL and press Enter.
      <note>
        <para>
        The URL will look approximately like this:
          <programlisting>
http://hostname/cgi-bin/search.cgi?q=test+query&amp;DebugURLID=100
          </programlisting>
        </para>
        </note>
      </listitem>
      <listitem>
        Find a line of this format in between the search form and the results:
        <programlisting>
DebugScore: url_id=82 RDsum=98 distance=84 (84/1) minmax=0.99091089
            density=0.00196271 numword=0.90135133 wordform=0.00000000
        </programlisting>
        It will give you an idea of why the score for the chosen document is
        too high or too low, and help you fine-tune various
        parameters like <xref linkend="cmdref-worddistanceweight"/>
        or <xref linkend="cmdref-worddensityfactor"/>.
      </listitem>
    </orderedlist>
    <note>
      <para>
      Score debugging currently works only for queries with multiple search
      words. Queries with a single search word don't return debug information.
      </para>
    </note>
  </para>
</sect2>
	<sect2 id="rel-cwords">
		<title>Crosswords
<indexterm><primary>Crosswords</primary></indexterm>
</title>
		<para>This feature assigns the words found between
<literal>&lt;a href="xxx"&gt;</literal> and <literal>&lt;/a&gt;</literal>
to the document that the link points to.
To enable Crosswords, use the <command><xref linkend="cmdref-crosswords"/>
<indexterm><primary>Command</primary><secondary>CrossWords</secondary></indexterm>
</command> command in
<filename>indexer.conf</filename> and
<filename>search.htm</filename>.</para>
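		<para>For instance, suppose a hypothetical page <literal>a.html</literal> contains the link below.
With Crosswords enabled, the words <literal>annual</literal> and <literal>report</literal> are assigned to
<literal>b.html</literal>, the document given in the link:</para>
		<programlisting>
&lt;a href="http://example.com/b.html"&gt;annual report&lt;/a&gt;
		</programlisting>
		<para>A sketch of the corresponding setting, assuming the usual yes/no argument form; the same line would go in both files:</para>
		<programlisting>
# indexer.conf and search.htm (the yes/no argument form is an assumption)
CrossWords yes
		</programlisting>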

	</sect2>
	<!-- sect2 id="rel-dr">
		<title>$(Score) template variable</title>
		<para>
			<varname>$(Score)</varname> template variable displays number of words from the query found in a document.</para>
	</sect2 -->
</sect1>