Search Engine
GoSearch!
Parts
1- Loading page content
2- Normalizing the text
3- Storing it in a suitable data structure
4- Searching
Loading content / Crawling
Download the page
Extract the main text
http://www.tutorialspoint.com/python/python_multithreading.htm
...
</div>
<hr>
<div class="pre-btn">
<a href="/python/python_sending_email.htm"><i class="icon icon-arrow-circle-o-left big-font"></i> Previous Page</a>
</div>
<div class="nxt-btn">
<a href="/python/python_xml_processing.htm">Next Page <i class="icon icon-arrow-circle-o-right big-font"></i> </a>
</div>
<div class="clearer"></div>
<hr />
<p>Running several threads is similar to running several different programs concurrently, but with the following benefits −</p>
<ul class="list">
<li><p>Multiple threads within a process share the same data space with the main thread and can therefore share information or communicate with each other more easily than if they were separate processes.</p></li>
<li><p>Threads sometimes called light-weight processes and they do not require much memory overhead; they care cheaper than processes.</p></li>
</ul>
<p>A thread has a beginning, an execution sequence, and a conclusion. It has an instruction pointer that keeps track of where within its context it is currently running.</p>
<ul class="list">
<li><p>It can be pre-empted (interrupted)</p></li>
<li><p>It can temporarily be put on hold (also known as sleeping) while other threads are running - this is called yielding.</p></li>
</ul>
<h2>Starting a New Thread</h2>
....
Running several threads is similar to running several different programs concurrently, but with the following benefits −
Multiple threads within a process share the same data space with the main thread and can therefore share information or communicate with each other more easily than if they were separate processes.
Threads sometimes called light-weight processes and they do not require much memory overhead; they care cheaper than processes.
A thread has a beginning, an execution sequence, and a conclusion. It has an instruction pointer that keeps track of where within its context it is currently running.
It can be pre-empted (interrupted)
It can temporarily be put on hold (also known as sleeping) while other threads are running - this is called yielding.
....
Python Multithreaded Programming
Page title
Main text
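The two crawl steps above (download the page, then pull out its title and text) could be sketched roughly as follows. This is only an illustration built on the Python standard library: `download_page`, `MainTextExtractor` and `extract_title_and_text` are hypothetical names, and the sketch keeps all visible text rather than isolating the main article body the way the slides imply.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def download_page(url):
    """Fetch a page and return its HTML as text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class MainTextExtractor(HTMLParser):
    """Collects the <title> text and the visible text of a page."""

    SKIPPED = {"script", "style"}  # assumption: these never hold readable text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.chunks.append(text)


def extract_title_and_text(html):
    """Return (page title, visible text joined by single spaces)."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)
```

A real crawler would also need to drop navigation text such as the "Previous Page" and "Next Page" links seen in the HTML excerpt; that filtering is left out here.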
Text normalization
Removing stop words (function words and other low-value words)
- Find conjunctions, prepositions, and articles
- Remove them
Stemming: stripping suffixes and prefixes
stemming
cats, catty, catlike => cat
stems, stemmer, stemming, stemmed => stem
argue, argued, argues, arguing => argu
run sever thread is similar run sever differ program concurr follow benefit thread has begin execut sequenc conclus it has instruct pointer that keep track where it context it is current run spawn thread you need call follow method avail thread modul method call enabl fast effici way creat new thread linux window method call return immedi child thread start call function pass list arg when function return thread termin here arg is tupl argu use empti tupl call function pass ani argu kwarg is option dictionari keyword argu when abov code is execut it produc follow result it is veri effect low level thread ...
python multithread program
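A minimal sketch of the normalization step, assuming a tiny hand-written stop-word list and a deliberately crude suffix stripper. `STOP_WORDS`, `SUFFIXES`, `naive_stem` and `normalize` are hypothetical names. The slides do not say which stemmer was used, and this stand-in does not reproduce their output exactly (for example it turns "programming" into "programm", not "program").

```python
import re

# Assumption: a very small stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "with", "for"}

# Assumption: a handful of English suffixes, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]
```

Stop words are removed before stemming so that the stop-word list can be written with ordinary, unstemmed words.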
Storing in a suitable data structure
Counting words
all: 1
code: 4
execut: 5
concurr: 1
follow: 9
pointer: 1
previous: 1
acquir: 4
window: 1
program: 5
has: 3
do: 1
return: 4
python: 6
cannot: 1
mechan: 1
veri: 2
discuss: 1
requir: 1
enabl: 2
specif: 1
...
- Find the unique words in the text
- Use the occurrence count for the initial score
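Counting the normalized words and listing the unique ones can be done with `collections.Counter`. The sketch below is illustrative only; `count_words` is a hypothetical helper and the token list is a short sample taken from the stemmed text above.

```python
from collections import Counter


def count_words(tokens):
    """Return a mapping word -> number of occurrences."""
    return Counter(tokens)


tokens = ["run", "sever", "thread", "is", "similar", "run", "sever"]
counts = count_words(tokens)

# The keys of the counter are exactly the unique words of the text.
unique_words = sorted(counts)
```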
Data structure
Inverted Index+
page_id | word |
---|---|
10, 4, 3, 6 | python |
1, 2, 4 | list |
20, 13, 42, 2, 3 | pip |
20, 13, 4, 2 | modul |
... | ... |
Data structure
Inverted Index+
python: (10|19), (4|11), (3|2), (6|12)
list: (1|35), (2|7), (4|15)
pip: (2|82), (3|56), (42|3), ...
modul: (2|25), (4|101), (13|19), ...
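One way to build such an index is sketched below. It assumes that the second number in each `(page_id|value)` pair is a per-page value derived from the word count; the slides do not state this explicitly, so the raw count is used here as a placeholder. `build_inverted_index` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_inverted_index(pages):
    """Map each word to a list of (page_id, count) pairs.

    `pages` maps page_id -> list of normalized tokens.
    """
    index = defaultdict(list)
    for page_id, tokens in pages.items():
        for word, count in Counter(tokens).items():
            index[word].append((page_id, count))
    return dict(index)


pages = {
    1: ["python", "list", "python"],
    2: ["pip", "python"],
}
index = build_inverted_index(pages)
```

Keeping the value next to the page id means a query can rank pages without touching the page text again.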
Initial scoring
S(p, word) = Max[rep_in_title(word),rep_in_content(word)] * n + Min[rep_in_title(word),rep_in_content(word)]
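The formula above translates directly into a small function. The slides do not define `n`, so the sketch treats it as a weight parameter with an arbitrary default of 10; `initial_score` is a hypothetical name.

```python
def initial_score(rep_in_title, rep_in_content, n=10):
    """S(p, word) = max(title, content) * n + min(title, content).

    `n` is an assumed weight; its actual value is not given in the slides.
    """
    high = max(rep_in_title, rep_in_content)
    low = min(rep_in_title, rep_in_content)
    return high * n + low
```

For example, a word that appears once in the title and six times in the content gets `initial_score(1, 6)`, which is `6 * 10 + 1 = 61` under the assumed `n`.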
Storing word positions
[run, sever, thread, is, similar, run, sever, differ, program, concurr, follow, benefit, ...]
Positions (0-based): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
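Recording positions amounts to walking the token list once with its indices. A minimal sketch, with `word_positions` as a hypothetical helper name, using the tokens shown above:

```python
from collections import defaultdict


def word_positions(tokens):
    """Map each word to the sorted list of its 0-based positions."""
    positions = defaultdict(list)
    for pos, word in enumerate(tokens):
        positions[word].append(pos)
    return dict(positions)


tokens = [
    "run", "sever", "thread", "is", "similar", "run",
    "sever", "differ", "program", "concurr", "follow", "benefit",
]
positions = word_positions(tokens)
```

Because the tokens are visited in order, each position list comes out already sorted, which later makes binary search on it possible.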
Suitable data structure
all: 1
code: 4
execut: 5
concurr: 1
follow: 9
pointer: 1
previous: 1
acquir: 4
0: 2
window: 1
program: 5
4: 1
has: 3
do: 1
return: 4
python: 6
cannot: 1
mechan: 1
veri: 2
2: 1
discuss: 1
requir: 1
enabl: 2
specif: 1
level: 2
list: 1
wait: 2
item: 1
benefit: 1
where: 1
...
all: [136]
code: [91, 194, 291, 319]
execut: [15, 93, 196, 293, 321]
concurr: [9]
follow: [10, 37, 96, 158, 168, ...]
pointer: [21]
previous: [132]
acquir: [225, 245, 259, 263]
0: [250, 254]
window: [52]
program: [8, 328]
4: [120]
has: [13, 19, 147]
do: [167]
return: [55, 67, 222, 252]
python: [118, 204, 326]
cannot: [257]
mechan: [209]
veri: [100, 108]
2: [119]
discuss: [131]
requir: [288]
enabl: [44, 240]
specif: [308]
level: [103, 126]
list: [63]
wait: [244, 270]
item: [310]
benefit: [11]
where: [25]
...
inverted index
word positions
Search!
Simple search: single word
0- Normalize the user's query
1- Find the pages linked to the word
2- Sort by initial score
3- Done!
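The steps above can be sketched as a lookup followed by a sort. The sketch assumes an index that maps each word to `(page_id, score)` pairs and uses plain lowercasing as a stand-in for the real normalizer; `search_single_word` is a hypothetical name.

```python
def search_single_word(query, index, normalize=str.lower):
    """Return page ids for a one-word query, best score first."""
    word = normalize(query.strip())          # step 0: normalize the query
    postings = index.get(word, [])           # step 1: pages linked to the word
    ranked = sorted(postings, key=lambda posting: posting[1], reverse=True)  # step 2
    return [page_id for page_id, _score in ranked]


# Sample postings copied from the inverted index slide.
index = {"python": [(10, 19), (4, 11), (3, 2), (6, 12)]}
results = search_single_word("Python", index)
```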
Simple search: phrase
0- Normalize the user's query
1- Find the pages linked to each word
3- Intersect the per-word results
list comprehension in python => list comprehens python
result = {
"list": [0, 4, 6, 8, 20],
"comprehens": [1, 20, 71, 33, ... ,42],
"python": [14, 20]
}
Simple search: phrase
Intersecting the per-word results
1- Sort the words by the length of their result lists
result = {
"python": [14, 20],
"list": [0, 4, 6, 8, 20],
"comprehens": [1, 20, 71, 33, ... ,42]
}
2- Intersect the lists pairwise, from top to bottom
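Starting from the shortest list keeps the running intersection as small as possible. A minimal sketch, assuming each word maps to a list of page ids; `intersect_results` is a hypothetical name, and the "comprehens" list below is shortened sample data, since the original is elided.

```python
def intersect_results(result):
    """Intersect per-word page lists, shortest list first."""
    if not result:
        return []
    lists = sorted(result.values(), key=len)   # step 1: shortest first
    common = set(lists[0])
    for pages in lists[1:]:                    # step 2: pairwise, top to bottom
        common &= set(pages)
        if not common:
            break                              # nothing left to intersect
    return sorted(common)


result = {
    "list": [0, 4, 6, 8, 20],
    "comprehens": [1, 20, 71, 33, 42],  # illustrative, shortened
    "python": [14, 20],
}
pages = intersect_results(result)
```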
Simple search: phrase
Scoring consecutive words
4- Add a bonus to the score of pages that contain the words of the phrase consecutively.
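def apply_positions_score(word, next_word, page, score=0):
    # Positions of two consecutive query words inside the page.
    first = get_positions(word, page)  # first = [0, 3, 5, 8, 21, ...]
    second = get_positions(next_word, page)  # second = [1, 6, 9, 12, 18, ...]
    for pos in first:
        # Index in second of the first position bigger than pos.
        start = find_first_bigger_number(...)
        dif = second[start] - pos
        if dif == 1:  # next_word comes right after word
            score += 50
    return score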
5- Sort by the computed score
6- Done!
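A runnable variant of the consecutive-word bonus is sketched below. It swaps the slide's `find_first_bigger_number` helper for `bisect.bisect_right` from the standard library, which assumes both position lists are sorted, and it adds a bounds check for the case where no later position exists. `adjacency_bonus` is a hypothetical name; the default bonus of 50 is the value from the slide.

```python
from bisect import bisect_right


def adjacency_bonus(first, second, bonus=50):
    """Add `bonus` for every occurrence of word directly followed by next_word.

    `first` and `second` are sorted position lists of the two words.
    """
    score = 0
    for pos in first:
        # Index of the first position in `second` that is bigger than `pos`.
        start = bisect_right(second, pos)
        if start < len(second) and second[start] - pos == 1:
            score += bonus
    return score


# Position lists from the slide, without the elided tails.
first = [0, 3, 5, 8, 21]
second = [1, 6, 9, 12, 18]
score = adjacency_bonus(first, second)
```

With these lists the pairs (0, 1), (5, 6) and (8, 9) are adjacent, so the sketch returns 150.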
Advanced search
and
Same as phrase search
not
Remove the pages that also appear in the result lists of the forbidden words
or
Give a higher score to pages that have results for the OR words.
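The three operators could be combined along these lines. This is only a sketch under several assumptions: the index maps words to `(page_id, score)` pairs, the base score of a page is the sum of its scores for the AND words, OR words only boost pages that already satisfy the AND part, and the bonus size (`or_bonus`) is arbitrary. `advanced_search` is a hypothetical name.

```python
def advanced_search(index, must, must_not=(), should=(), or_bonus=10):
    """Return page ids ranked by score for an and/not/or query."""

    def postings(word):
        return dict(index.get(word, []))  # page_id -> score

    # and: keep only pages that contain every required word.
    candidates = None
    for word in must:
        pages = set(postings(word))
        candidates = pages if candidates is None else candidates & pages
    candidates = candidates or set()

    # not: drop pages that appear in a forbidden word's results.
    for word in must_not:
        candidates -= set(postings(word))

    scores = {
        page: sum(postings(word).get(page, 0) for word in must)
        for page in candidates
    }

    # or: boost pages that also contain an optional word.
    for word in should:
        for page in candidates & set(postings(word)):
            scores[page] += or_bonus

    return sorted(scores, key=lambda page: (-scores[page], page))


index = {
    "python": [(1, 5), (2, 3), (3, 4)],
    "list": [(1, 6), (2, 2), (3, 1)],
    "pip": [(3, 9)],
    "modul": [(2, 1)],
}
```

For instance, `advanced_search(index, must=["python", "list"], must_not=["pip"])` drops page 3 and ranks page 1 ahead of page 2; adding `should=["modul"]` lifts page 2 to the top.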
Statistics
Number of pages: 5209
Number of words: 14093
Number of word positions: 662077
Number of index entries: 302980
GoSearch!
By Amirhossein Kazemnejad