/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
/*
* This file is part of the LibreOffice project.
*
* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, You can obtain one at http://mozilla.org/MPL/2.0/.
*/
#include <helpcompiler/HelpIndexer.hxx>
#include <rtl/string.hxx>
#include <rtl/uri.hxx>
#include <o3tl/runtimetooustring.hxx>
#include <osl/file.hxx>
#include <osl/thread.h>
#include <memory>
#include "LuceneHelper.hxx"
#include <CLucene.h>
#include <CLucene/analysis/LanguageBasedAnalyzer.h>
using namespace lucene::document;
HelpIndexer::HelpIndexer(OUString const &lang, OUString const &module,
    OUString const &srcDir, OUString const &outDir)
    : d_lang(lang), d_module(module)
{
    d_indexDir = outDir + OUStringChar('/') + module + ".idxl";
    d_captionDir = srcDir + "/caption";
    d_contentDir = srcDir + "/content";
}
bool HelpIndexer::indexDocuments()
{
    if (!scanForFiles())
        return false;

    try
    {
        OUString sLang = d_lang.getToken(0, '-');
        bool bUseCJK = sLang == "ja" || sLang == "ko" || sLang == "zh";

        // Construct the analyzer appropriate for the given language
        std::unique_ptr<lucene::analysis::Analyzer> analyzer;
        if (bUseCJK)
            analyzer.reset(new lucene::analysis::LanguageBasedAnalyzer(L"cjk"));
        else
            analyzer.reset(new lucene::analysis::standard::StandardAnalyzer());

        OUString ustrSystemPath;
        osl::File::getSystemPathFromFileURL(d_indexDir, ustrSystemPath);

        OString indexDirStr = OUStringToOString(ustrSystemPath, osl_getThreadTextEncoding());
        lucene::index::IndexWriter writer(indexDirStr.getStr(), analyzer.get(), true);

        // Double the limit of tokens allowed per field; otherwise we get a
        // too-many-tokens exception for the ja help. Alternatively, we could
        // ignore the exception and accept truncated results, as Java Lucene
        // apparently does.
        writer.setMaxFieldLength(lucene::index::IndexWriter::DEFAULT_MAX_FIELD_LENGTH*2);

        // Index the identified help files
        Document doc;
        for (auto const& elem : d_files)
        {
            helpDocument(elem, &doc);
            writer.addDocument(&doc);
            doc.clear();
        }

        // Optimize the index
        writer.optimize();
    }
    catch (CLuceneError &e)
    {
        d_error = o3tl::runtimeToOUString(e.what());
        return false;
    }

    return true;
}
bool HelpIndexer::scanForFiles() {
    if (!scanForFiles(d_contentDir)) {
        return false;
    }
    if (!scanForFiles(d_captionDir)) {
        return false;
    }
    return true;
}
bool HelpIndexer::scanForFiles(OUString const & path) {
    osl::Directory dir(path);
    if (osl::FileBase::E_None != dir.open()) {
        d_error = "Error reading directory " + path;
        return false;
    }

    osl::DirectoryItem item;
    osl::FileStatus fileStatus(osl_FileStatus_Mask_FileName | osl_FileStatus_Mask_Type);
    while (dir.getNextItem(item) == osl::FileBase::E_None) {
        // Skip entries whose status cannot be read, rather than acting on
        // stale data left over from the previous item.
        if (item.getFileStatus(fileStatus) != osl::FileBase::E_None) {
            continue;
        }
        if (fileStatus.getFileType() == osl::FileStatus::Regular) {
            d_files.insert(fileStatus.getFileName());
        }
    }

    return true;
}
void HelpIndexer::helpDocument(OUString const & fileName, Document *doc) const {
    // Add the help path as an indexed, untokenized field.
    OUString path = "#HLP#" + d_module + "/" + fileName;
    std::vector<TCHAR> aPath(OUStringToTCHARVec(path));
    doc->add(*_CLNEW Field(_T("path"), aPath.data(), int(Field::STORE_YES) | int(Field::INDEX_UNTOKENIZED)));

    OUString sEscapedFileName =
        rtl::Uri::encode(fileName,
        rtl_UriCharClassUric, rtl_UriEncodeIgnoreEscapes, RTL_TEXTENCODING_UTF8);

    // Add the caption as a field.
    OUString captionPath = d_captionDir + "/" + sEscapedFileName;
    doc->add(*_CLNEW Field(_T("caption"), helpFileReader(captionPath), int(Field::STORE_NO) | int(Field::INDEX_TOKENIZED)));

    // Add the content as a field.
    OUString contentPath = d_contentDir + "/" + sEscapedFileName;
    doc->add(*_CLNEW Field(_T("content"), helpFileReader(contentPath), int(Field::STORE_NO) | int(Field::INDEX_TOKENIZED)));
}
lucene::util::Reader *HelpIndexer::helpFileReader(OUString const & path) {
    osl::File file(path);
    if (osl::FileBase::E_None == file.open(osl_File_OpenFlag_Read)) {
        file.close();
        OUString ustrSystemPath;
        osl::File::getSystemPathFromFileURL(path, ustrSystemPath);
        OString pathStr = OUStringToOString(ustrSystemPath, osl_getThreadTextEncoding());
        return _CLNEW lucene::util::FileReader(pathStr.getStr(), "UTF-8");
    } else {
        return _CLNEW lucene::util::StringReader(L"");
    }
}
/* vim:set shiftwidth=4 softtabstop=4 expandtab: */