File: 0003-Fix-__init__-symbol-issue.patch

python-jieba 0.39-4
  • area: main
  • in suites: buster
  • size: 39,132 kB
  • sloc: python: 194,381; makefile: 5
From: CY Wang <a0953218488@gmail.com>
Date: Mon, 27 Aug 2018 17:05:46 +0800
Subject: Fix  __init__ "-" symbol issue
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

Applied-Upstream: https://github.com/fxsjy/jieba/commit/36a27302ce345a1866d125a9e59bd8611cf06813

Fix the issue where words containing the "-" symbol cannot be analyzed as a whole.

For example, given keywords such as "chap-EX喬沛詩" and "SK-II",
the current version yields "chap", "-", "EX喬沛詩", "SK", "-", "II".

After this change,
the new version yields "chap-EX", "喬沛詩", "SK-II".

Note: I used jieba.load_userdict() and added "chap-EX", "喬沛詩", and "SK-II" to userdict.txt.
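
The effect of the pattern change can be sketched in isolation, using only the
standard re module. This is a minimal illustration, not jieba's own code path:
the two patterns are copied from the diff (written here as raw strings), and
split_blocks is a hypothetical helper introduced only for this example. It does
not load any dictionary, so the later separation of "喬沛詩" from "chap-EX" is
not shown.

```python
import re

# Patterns copied from the diff; raw strings are an assumption of this sketch.
RE_HAN_OLD = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
RE_HAN_NEW = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)


def split_blocks(pattern, text):
    """Hypothetical helper: split text on the pattern and drop empty pieces."""
    return [piece for piece in pattern.split(text) if piece]


# With the old pattern, "-" is not part of the character class, so it
# separates the text into three blocks.
old_blocks = split_blocks(RE_HAN_OLD, "SK-II")

# With the new pattern, "-" is accepted, so the text stays in one block.
new_blocks = split_blocks(RE_HAN_NEW, "SK-II")

print(old_blocks)
print(new_blocks)
```

Because the hyphenated text now stays in a single block, a user-dictionary
entry such as "SK-II" has a chance to match as one word.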
---
 jieba/__init__.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/jieba/__init__.py b/jieba/__init__.py
index 62183a9..45dc908 100644
--- a/jieba/__init__.py
+++ b/jieba/__init__.py
@@ -40,7 +40,10 @@ re_eng = re.compile('[a-zA-Z0-9]', re.U)
 
 # \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
 # \r\n|\s : whitespace characters. Will not be handled.
-re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
+# re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
+# Adding "-" symbol in re_han_default
+re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+)", re.U)
+
 re_skip_default = re.compile("(\r\n|\s)", re.U)
 re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
 re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)