1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
|
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<html>
<body>
<h2>GSF Indexing and Querying</h2>
<p>
If you register an indexer with GSF, it will be called at the right times
to extract relevant information from a ParseResult, and store it somewhere
in persistent store. You can later quickly search this index. This lets
you for example implement cross-file behavior such that a user can
search the whole project for functions or classes (via the Open Type
dialog in NetBeans), or cross-file go to declaration, code completion
referencing symbols in other files, etc. And this can be achieved without
having to go and parse other files on the fly when you're trying to
collect project information. The index is always kept in sync by GSF.
At startup, GSF looks at all the files in your project (as determined by
the relevant source directories, see the section on <a href="#Classpath">Classpath</a>)
and checks whether the timestamp is more recent than the previous timestamp
that is recorded in the index. Files that have changed (or have never been
indexed) will be indexed, in a background thread, at startup. Similarly,
as soon as a file deleted, or edited and then closed, the index will be
updated. GSF also tracks whether a file has been edited, such that if you
attempt to access its index when the index is considered dirty, the indexer
will be immediately called such that all clients of the index can assume
the index is always up to date.
</p>
<p>
Indexing is simple. It basically lets you, for any particular source file,
store a series of "Documents", where each document contains a multimap.
Each multimap consists of keys, and one or more values each key.
Furthermore, keys can be designated as "searchable" or not.
This just controls whether you can search the index by the given key. (Keys that
are searchable require a bit more maintenance on the GSF side, which is
why you avoid it for keys you never plan on searching by).
</p>
<p>
It is up to <b>you</b> to decide what information to store in the index,
how to divide it into documents, how to "encode" it into Strings (all
[key,value] pairs are of type [String,String], and so on.
Typically, you will make this decision based on how you can later retrieve
the information. So, you need to think up front about what you will need,
and organize the information based on that.
When you search, you can search for a given key. This will return
you a set of all the <b>documents</b> where the search occurs. This lets
you find other values associated with the [key,value] pair in the
same document.
</p>
<h3>Example: Ruby</h3>
<p>
Here's how the information is currently stored for Ruby.
In Ruby, a file can contain many classes. I decided to put each
class in its own search document. This means that if I for example
search for methods that start with "foo", I can look in the same
document for the "class" attribute and I will have information
about the class containing the method. Here's an example of
a Ruby file and the search documents created by the Ruby language
plugin's indexer:
<br/>
<img src="indexing.png"/>
<br/>
<br/>
As you can see, we get two documents, one for each class in the file.
In the second class, we have two methods. Each method
is recorded with a "method" key. Later, if I call
<code>searchDoc.getValue("method")</code> I will get out two strings,
<code>{"mymethod","myother"}</code>. If I now want to know what
class these methods are coming from, I can call
<code>searchDoc.getValue("class")</code> - or to get the fully qualified
name (fqn), <code>searchDoc.getValue("fqn")</code>. I can also look
up attributes for this class with <code>searchDoc.getValue("attrs")</code>
which for example lets me see if this is a class or a module. I store
a number of attributes - whether a class is documented, or if it
is explicitly to be ignored, and so on. These are used by the code
completion and go to declaration feature implementations to for example
pick the best candidate among multiple possibilities.
There's a more complete list of the data stored in the Ruby index
later in this document.
</p>
<h3>Indexing</h3>
<p>
To index your code, implement the
<a href="org/netbeans/modules/gsf/api/Parser.html">Indexer</a> interface.
First, you must implement the <code>isIndexable</code> method:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
boolean <b>isIndexable</b>(@NonNull ParserFile file);
</pre>
This should just be a quick check to see if the given file should
be indexed by this indexer. Typically, you'll just look at the file
name and make a decision.
<div style="float:right; width: 300px; background: #ccffcc; color: black; border: solid 1px black; padding: 10px">
Indexers shouldn't have to know about potential languages they are embedded in.
GSF has this information already, via the embedding provider registrations.
This should be used somehow such that indexers don't have to know everything.
</div>
For JavaScript for example, we care about
<code>.js</code> files, as well as <code>.html</code>, <code>.jsp</code>
and <code>.erb</code>, since these embedded file types can contain
JavaScript functions we want in the index.
In some cases however, the logic can be slightly more complicated.
In the case of JavaScript again, it's pretty common to have "minimized"
versions of a file, where symbols have been replaced with single letter
names, whitespace removed etc. to make the code as small and fast as
possible. We typically don't want to index these versions, so the
JavaScript indexer checks the file name and looks for name pattern
overlaps. If it for example finds <code>foo-debug.js</code>, <code>foo-min.js</code>
and <code>foo.js</code>, it will return <code>false</code> from <code>isIndexable()</code>
for both <code>foo-debug.js</code> and <code>foo-min.js</code>, and <code>true</code>
only for <code>foo.js</code>.
</p>
<p>
<b>IMPORTANT</b>: You may be tempted to base your decision on the mimetype of
a file, but be careful calling <code>file.getFileObject().getMimeType()</code>.
First, <code>file.getFileObject()</code> can be a performance issue.
During startup, your indexer will be asked by the system for all the
source files in the project if they are indexable. Each <code>getFileObject</code>
can end up doing a lot of work.<br/><br/>
A bigger problem is that when source files are deleted, the index needs to
be cleaned up. The IDE will ask all the indexers if they care about this file,
and if they do, the index entries will be cleared for this file.
(TODO: Perhaps I should just unconditionally try to delete the file in all
indices?)
In any case, when a file is deleted, its <code>FileObject</code> no longer
exists, and calling <code>file.getFileObject()</code> will return null.
You can end up getting an NPE in these cases if you're not careful.
</p>
<p>
The second method you have to implement in your indexer is
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
List<IndexDocument> <b>index</b>(ParserResult result, IndexDocumentFactory factory)
</pre>
Here, all you have to do is look up your own AST from your <code>ParserResult</code>.
Then, look through your AST, pick out the information you want to store, and store
it in the index. You do that by creating one or more documents (by calling
the <code>IndexDocumentFactory</code> which is passed to you above.
The <code>IndexDocumentFactory</code> lets you create an <code>IndexDocument</code>.
And each <code>IndexDocument</code> object has a single method:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
void <b>addPair</b>(String key, String value, boolean searchable);
</pre>
So, all you have to do here is call <code>addPair</code> repeatedly, with key,value pairs.
You also get to decide if your key is searchable. You <b>must</b> be consistent
in how you call this method with respect to the <code>searchable</code> parameter
for a given key in a given document.
</p>
<h3>Searching</h3>
<p>
You can search your index. For the various feature implementations (code completion,
go to declaration, and so on), you are passed a
<a href="org/netbeans/modules/gsf/api/CompilationInfo.html">CompilationInfo</a> instance.
The <code>CompilationInfo</code> holds a lot of vital state: your parser result,
the file object and document corresponding to the current file, and so on.
It also gives you access to the
<a href="org/netbeans/modules/gsf/api/Index.html">Index</a> object.
The <code>Index</code> class has one method:
<div style="float:right; width: 300px; background: #ccffcc; color: black; border: solid 1px black; padding: 10px">
I plan to add a simple <code>SearchContext</code> class which holds some of the
search parameters, and the search method will change into a simple
<code>search(SearchContext)</code> method. I also plan to remove the result list
input parameter and just return it (created by the index) instead.
</div>
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
public abstract void <b>search</b>(
@NonNull final String <b>key</b>,
@NonNull final String <b>value</b>,
@NonNull final NameKind kind,
@NonNull final Set<SearchScope> scope,
@NonNull Set<SearchResult> <b>result</b>,
@NonNull final Set<String> includeKeys);
</pre>
The way this works is that you create your own empty set of SearchResults,
and then you call the search method on the index with a key, a value (which
for example can be a prefix), and so on. This will fill your SearchResult
set with matches.
</p>
<p>
You can now iterate over the matching SearchResults (which will correspond to
<code>IndexDocument</code>s your indexer have created earlier), and for each,
call one of the getValue methods:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
@NonNull String <b>getValue</b>(@NonNull String <b>key</b>);
@NonNull String[] <b>getValues</b>(@NonNull String <b>key</b>);
</pre>
If you know that you're storing just one value for a given in the index
(such as the <code>class</code> or <code>fqn</code> keys in the Ruby index),
you can call the <code>getValue</code> method to get it directly. If you're
storing many values (such as the <code>method</code> key in the Ruby index),
call <code>getValues</code> to get an array of all the results.
</p>
<h3>Classpath: Searching What?</h3>
<p>
In the search section above I glossed over some of the other parameters
in the <code>search</code> method. In particular, the <code>scope</code>
parameter, which lets you control whether to search just the current
project, or just the libraries, or both. But how does GSF know what source
files are relevant in the project? And what about the libraries?
</p>
<p>
This is controlled by a different API in GSF, which you use to integrate
into different project types and tell GSF which directories are relevant
for this project, as well as register libraries. This is described in
more detail in the <a href="classpath.html">Classpath</a> document.
</p>
<h3>Index Abstraction</h3>
<p>
I recommend that you build an abstraction on top of the index. Your various
feature implementations shouldn't go into the index and look for specific keys
and so forth. That will make changing the index (which I will discuss in the
next section) very difficult. Instead, you should create a class which
wraps the index with logical methods. For example, in Ruby, I have a RubyIndex
class which has dedicated methods like
<ul>
<li><code>getClasses()</code></li>
<li><code>getMethods()</code></li>
<li><code>getSubclassesOf()</code></li>
<li><code>getRequires()</code></li>
<li><code>getTransitiveRequires()</code></li>
<li><code>getSuperClass()</code></li>
<li><code>getOverridingMethod()</code></li>
<li><code>getInheritedMethods()</code></li>
</ul>
and so on. This places all the logic of your index schema in two classes -
your index abstraction and your indexer, and you can more easily change it
later.
</p>
<h3>Making Changes</h3>
<p>
During development, including after your first release, you'll probably find
that you want to make changes to the index, either because of performance
reasons, or because you want to store more information to support new features.
However, if you just make the change in your indexer and index query code,
you might have a really hard time dealing with the fact that your users
can have existing indices out there with the old data format.
</p>
<p>
Dealing with this is easy. The <code>Indexer</code> class has these two
methods:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
@NonNull String <b>getIndexerName</b>();
@NonNull String <b>getIndexVersion</b>();
</pre>
Any time you make an incompatible change to the way you are storing data
in the index, just change the value of your <code>getIndexVersion()</code>
method. This will force reindexing of all the code whenever your users
run with your new version of the indexer.
</p>
<p>
The way this works on the implementation side is that each index database
is isolated, and in particular, they are stored in the user directory,
under <code>var/cache/gsf-index/<i>your-index-name</i>/<i>your-index-version</i></code>.
For example, the Ruby indexer names its indexer "ruby", and the JavaScript
indexer names its indexer "javascript". This ensures that these two systems
don't interfere which each others databases. The index version is just a string,
and you can put anything you want here (as long as it's a string that is
filesystem safe, meaning it can be written on all file systems where NetBeans
runs). A good convention is to just use a number - e.g. "42" or "6.5.2", etc.
</p>
<h3>Unit Testing</h3>
<p>
Unit testing your indexer is trivial. Create a unit test class for your
indexer, and make sure it extends <code>GsfTestBase</code>. Then, just
create new tests with the following single line call:
<pre style="background: #ffffcc; color: black; border: solid 1px black; padding: 5px">
public class RubyIndexerTest extends RubyTestBase {
public void testIndexData() throws Exception {
checkIndexer("testfiles/date.rb");
}
public void testRails1() throws Exception {
checkIndexer("testfiles/action_controller.rb");
}
...etc
}
</pre>
All you have to do is put the test files you want into <code>test/unit/data</code>,
refer to them from the <code>checkIndexer</code> call, and the GSF test infrastructure
will do the rest: it will load the file, parse it using your parser,
call your indexer, pretty print it, see if a golden file (named the same as
the input file name, plus the extra extension <code>.indexed</code>) exists,
and if not create it, or otherwise diff the computed results with the golden
file. The test fails if the diffs aren't identical.
</p>
<p>
See the <a href="unit-testing.html">Unit Testing</a> document for more information.
</p>
<h3>Debugging Query Problems</h3>
<p>
Sometimes a feature fails -- for example, code completion doesn't return the
results you expect. One way to debug this is to use the GSF diagnostics tools.
The Lucene Index Browser is described in the
<a href="gsf-tools.html#index-browser">GSF Diagnostics Tools</a> document.
</p>
<h3>Ruby Indexed Data</h3>
<p>
The following table shows the data currently stored in the Ruby index, which
will hopefully give some ideas for how a complete indexer/query system
can work:
<table border="1" style="background: #ffffcc; border-collapse: collapse; border:solid 1px black">
<tr>
<th>Key</th><th>Value</th>
</tr>
<tr>
<td>fqn</td>
<td>Fully qualified name of a class or module, such as "Test::Unit" for the Unit class.</td>
</tr>
<tr>
<td>class</td>
<td>The "basename" of a class or module, such as "Unit" for the Test::Unit class.</td>
</tr>
<tr>
<td>class-ig</td>
<td>A lowercased version of the class name. Used for case insensitive queries. (This shouldn't be necessary).</td>
</tr>
<tr>
<td>extends</td>
<td>The fully qualified name of a superclass, if any. Used to compute inherited methods by iterating through class documents.</td>
</tr>
<tr>
<td>in</td>
<td>The name of the module this class is contained in.</td>
</tr>
<tr>
<td>require</td>
<td>The string required to include this class from other files, e.g. <code>io/nonblock</code> for the nonblock.rb file. Used for require-completion.</td>
</tr>
<tr>
<td>requires</td>
<td>A list of all the files <i>this</i> file requires. Used to compute file inclusion transitively.</td>
</tr>
<tr>
<td>includes</td>
<td>A list of modules included (not the same as requires) by this class or module.</td>
</tr>
<tr>
<td>extendWith</td>
<td>Used to simulate the way classes can dynamically be extended with a given module in Ruby.</td>
</tr>
<tr>
<td>method</td>
<td>A method in the class. This isn't just a method name; it's a pretty complicated encoding
of the method name, its parameter list, its (optional) return type, its (optional) parameter
hints or types, as well as a set of attributes for the method (is documented, is ignored,
etc.)</td>
</tr>
<tr>
<td>field</td>
<td>A field in the class</td>
</tr>
<tr>
<td>attribute</td>
<td>An attribute in the class.</td>
</tr>
<tr>
<td>constant</td>
<td>A constant in the class</td>
</tr>
<tr>
<td>attrs</td>
<td>A set of attributes for the class, written as a hex value string</td>
</tr>
<tr>
<td>dbtable</td>
<td>For active record database table completion, the table name this migration refers to</td>
</tr>
<tr>
<td>dbversion</td>
<td>The version number of this migration (used to apply the migrations in the correct order)</td>
</tr>
<tr>
<td>dbcolumn</td>
<td>A column name to be added, removed or renamed (indicated with + or - after the name)</td>
</tr>
</table>
</p>
<br/>
<span style="color: #cccccc">Tor Norbye <tor@netbeans.org></span>
</body>
</html>
|