Search Engine

GoSearch!

Parts

1- Loading page content

2- Text normalization

3- Storing in a suitable data structure

4- Search

Loading content / crawling

Downloading the page

Extracting the main text

http://www.tutorialspoint.com/python/python_multithreading.htm

1

http://www.tutorialspoint.com/python/python_multithreading.htm

2

...
</div>
<hr>
<div class="pre-btn">
<a href="/python/python_sending_email.htm"><i class="icon icon-arrow-circle-o-left big-font"></i> Previous Page</a>
</div>
<div class="nxt-btn">
<a href="/python/python_xml_processing.htm">Next Page <i class="icon icon-arrow-circle-o-right big-font"></i> </a>
</div>
<div class="clearer"></div>
<hr />
<p>Running several threads is similar to running several different programs concurrently, but with the following benefits −</p>
<ul class="list">
<li><p>Multiple threads within a process share the same data space with the main thread and can therefore share information or communicate with each other more easily than if they were separate processes.</p></li>
<li><p>Threads sometimes called light-weight processes and they do not require much memory overhead; they care cheaper than processes.</p></li>
</ul>
<p>A thread has a beginning, an execution sequence, and a conclusion. It has an instruction pointer that keeps track of where within its context it is currently running.</p>
<ul class="list">
<li><p>It can be pre-empted (interrupted)</p></li>
<li><p>It can temporarily be put on hold (also known as sleeping) while other threads are running - this is called yielding.</p></li>
</ul>
<h2>Starting a New Thread</h2>
....
http://www.tutorialspoint.com/python/python_multithreading.htm

3

Running several threads is similar to running several different programs concurrently, but with the following benefits −

Multiple threads within a process share the same data space with the main thread and can therefore share information or communicate with each other more easily than if they were separate processes.

Threads sometimes called light-weight processes and they do not require much memory overhead; they care cheaper than processes.

A thread has a beginning, an execution sequence, and a conclusion. It has an instruction pointer that keeps track of where within its context it is currently running.

It can be pre-empted (interrupted)

It can temporarily be put on hold (also known as sleeping) while other threads are running - this is called yielding.

....
Python Multithreaded Programming

Page title

Main text
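The extraction stage shown above (raw HTML in, page title and text out) can be sketched with Python's standard `html.parser` module. This is only a minimal illustration: the names `PageExtractor` and `extract` are hypothetical, and unlike a real main-text extractor it keeps all visible text rather than separating navigation from content. Downloading itself could be done with `urllib.request.urlopen`, which is left out here so the sketch runs offline.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title> text and the visible text of a page."""

    SKIPPED_TAGS = {"script", "style"}  # content of these tags is never shown

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(text)
        else:
            self.text_parts.append(text)


def extract(html):
    """Return (title, text) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.text_parts)


sample = (
    "<html><head><title>Python Multithreaded Programming</title></head>"
    "<body><p>It can be pre-empted (interrupted)</p>"
    "<script>var x = 1;</script></body></html>"
)
title, text = extract(sample)
print(title)
print(text)
```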

Text normalization

Removing useless letters and words (stop words)

  • Find conjunctions, prepositions, and articles
  • Remove them
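A minimal sketch of the stop-word step, assuming a tiny hand-written stop-word list (the list GoSearch actually uses is not given in the slides, and `remove_stop_words` is a hypothetical name):

```python
import re

# Assumed, deliberately small stop-word list: a few conjunctions,
# prepositions, and articles. The real list is not part of the slides.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "to", "with"}


def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


tokens = remove_stop_words(
    "A thread has a beginning, an execution sequence, and a conclusion."
)
print(tokens)
```

A set is used for the stop words so that each membership check is constant time, which matters once thousands of pages are processed.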

Stemming: finding the root and removing suffixes and prefixes

stemming
cats, catty, catlike → cat
stems, stemmer, stemming, stemmed → stem
argue, argued, argues, arguing → argu
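To make the idea concrete, here is a toy suffix stripper. It is not the stemmer used to produce the output in these slides (that one is not named, and forms such as "argu" suggest a more complete rule set); the suffix list and the name `toy_stem` are assumptions for illustration only.

```python
# Assumed toy suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word):
    """Strip one common suffix, then undo consonant doubling."""
    for suffix in SUFFIXES:
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "stemm" -> "stem": drop one of two identical trailing consonants.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


for w in ["stems", "stemmer", "stemming", "stemmed", "cats"]:
    print(w, "->", toy_stem(w))
```

A toy like this mishandles many words (for example "catty" or "argue"), which is why search engines normally rely on an established stemming algorithm instead.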
run sever thread is similar run sever differ program concurr follow benefit thread has begin execut sequenc conclus it has instruct pointer that keep track where it context it is current run spawn thread you need call follow method avail thread modul method call enabl fast effici way creat new thread linux window method call return immedi child thread start call function pass list arg when function return thread termin here arg is tupl argu use empti tupl call function pass ani argu kwarg is option dictionari keyword argu when abov code is execut it produc follow result it is veri effect low level thread ...

4

python multithread program 

Storing in a suitable data structure

Counting words

all: 1
code: 4
execut: 5
concurr: 1
follow: 9
pointer: 1
previous: 1
acquir: 4
window: 1
program: 5
has: 3
do: 1
return: 4
python: 6
cannot: 1
mechan: 1
veri: 2
discuss: 1
requir: 1
enabl: 2
specif: 1
...

  • Find the unique words of the text

  • Use the repetition count for initial scoring
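Counting the normalized words, as in the listing above, maps directly onto `collections.Counter`. The sketch below uses only the first few words of the normalized sample text, so its counts are smaller than those on the slide.

```python
from collections import Counter

# First words of the normalized sample text from the slides.
normalized = (
    "run sever thread is similar run sever differ "
    "program concurr follow benefit"
)

counts = Counter(normalized.split())  # word -> number of repetitions
unique_words = set(counts)            # the unique words of the text

print(counts["run"], counts["thread"], len(unique_words))
```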

Data structure

Inverted Index+
page_id             word
10, 4, 3, 6         python
1, 2, 4             list
20, 13, 42, 2, 3    pip
20, 13, 4, 2        modul
...                 ...

Data structure

Inverted Index+
python: (10|19),(4|11),(3|2),(6|12)
list: (1|35),(2|7),(4|15)
pip: (2|82),(3|56),(42|3),...
modul: (2|25),(4|101),(13|19),...
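A sketch of how such an index could be built, assuming each pair is read as (page_id|score); the slides do not label the two numbers, so that reading, the input shape, and the name `build_inverted_index` are all assumptions.

```python
from collections import defaultdict


def build_inverted_index(page_scores):
    """Turn {page_id: {word: score}} into {word: [(page_id, score), ...]}."""
    index = defaultdict(list)
    for page_id, word_scores in page_scores.items():
        for word, score in word_scores.items():
            index[word].append((page_id, score))
    return dict(index)


# A few of the pairs from the slide, regrouped per page.
pages = {
    1: {"list": 35},
    2: {"list": 7, "pip": 82, "modul": 25},
    4: {"python": 11, "list": 15, "modul": 101},
}
index = build_inverted_index(pages)
print(index["list"])
```

Inverting the mapping this way means a query only touches the entries for its own words instead of scanning every page.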

Initial scoring

S(p, word) = 
Max[rep_in_title(word),rep_in_content(word)] * n + Min[rep_in_title(word),rep_in_content(word)]
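The formula translates almost literally into code. The value of n is not given in the slides, so the default below is an assumption chosen only to make the example concrete.

```python
ASSUMED_N = 10  # assumption: the slides do not state the value of n


def initial_score(rep_in_title, rep_in_content, n=ASSUMED_N):
    """S(p, word) = max(title, content) * n + min(title, content)."""
    larger = max(rep_in_title, rep_in_content)
    smaller = min(rep_in_title, rep_in_content)
    return larger * n + smaller


# A word that appears once in the title and six times in the content.
print(initial_score(rep_in_title=1, rep_in_content=6))
```

With n larger than any realistic repetition count, the larger of the two counts dominates the score and the smaller one acts as a tie-breaker.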

Storing word positions

word:     run  sever  thread  is  similar  run  sever  differ  program  concurr  follow  benefit  ...
position: 0    1      2       3   4        5    6      7       8        9        10      11       12
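Recording positions amounts to enumerating the normalized word list. A minimal sketch, with `word_positions` as a hypothetical name:

```python
from collections import defaultdict


def word_positions(tokens):
    """Map each word to the sorted list of indices where it occurs."""
    positions = defaultdict(list)
    for i, word in enumerate(tokens):
        positions[word].append(i)
    return dict(positions)


tokens = (
    "run sever thread is similar run sever differ "
    "program concurr follow benefit"
).split()
positions = word_positions(tokens)
print(positions["run"], positions["program"], positions["benefit"])
```

Because the words are visited in order, every position list comes out already sorted, which is what makes a binary search over it possible later.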

Suitable data structure

all: 1
code: 4
execut: 5
concurr: 1
follow: 9
pointer: 1
previous: 1
acquir: 4
0: 2
window: 1
program: 5
4: 1
has: 3
do: 1
return: 4
python: 6
cannot: 1
mechan: 1
veri: 2
2: 1
discuss: 1
requir: 1
enabl: 2
specif: 1
level: 2
list: 1
wait: 2
item: 1
benefit: 1
where: 1
...
all: [136]
code: [91, 194, 291, 319]
execut: [15, 93, 196, 293, 321]
concurr: [9]
follow: [10, 37, 96, 158, 168, ...]
pointer: [21]
previous: [132]
acquir: [225, 245, 259, 263]
0: [250, 254]
window: [52]
program: [8, 328]
4: [120]
has: [13, 19, 147]
do: [167]
return: [55, 67, 222, 252]
python: [118, 204, 326]
cannot: [257]
mechan: [209]
veri: [100, 108]
2: [119]
discuss: [131]
requir: [288]
enabl: [44, 240]
specif: [308]
level: [103, 126]
list: [63]
wait: [244, 270]
item: [310]
benefit: [11]
where: [25]
...

5

6

inverted index

word positions

Search!

Simple search: single word

0- Normalize the user's query

1- Find the pages linked to the word

2- Sort by initial score

3- Done!
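The single-word steps above can be sketched as one lookup followed by one sort. The sketch assumes the query word has already been normalized and that the index stores (page_id, score) pairs; `search_word` is a hypothetical name.

```python
def search_word(index, stemmed_word):
    """Return page ids for one normalized word, best initial score first."""
    postings = index.get(stemmed_word, [])  # [(page_id, score), ...]
    ranked = sorted(postings, key=lambda pair: pair[1], reverse=True)
    return [page_id for page_id, _ in ranked]


# The "python" entry from the inverted index slide.
index = {"python": [(10, 19), (4, 11), (3, 2), (6, 12)]}
print(search_word(index, "python"))
```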

Simple search: phrase

0- Normalize the user's query

1- Find the pages linked to each word

3- Intersect the results of the words

list comprehension in python => list comprehens python
result = {
    "list": [0, 4, 6, 8, 20],
    "comprehens": [1, 20, 71, 33, ... ,42],
    "python": [14, 20]
}

Simple search: phrase

Intersecting the results of the words

1- Sort the words by the length of their result lists

result = {
    "python": [14, 20],
    "list": [0, 4, 6, 8, 20],
    "comprehens": [1, 20, 71, 33, ... ,42]
}

2- Intersect pairwise, from top to bottom
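A sketch of the two intersection steps, using Python sets. The name `intersect_results` is hypothetical, and the elided entries of the "comprehens" list are simply left out of the sample data.

```python
def intersect_results(result):
    """Intersect per-word page lists, starting from the shortest list."""
    lists = sorted(result.values(), key=len)  # step 1: shortest first
    if not lists:
        return set()
    common = set(lists[0])
    for pages in lists[1:]:                   # step 2: pairwise, top to bottom
        common &= set(pages)
        if not common:                        # nothing left to intersect
            break
    return common


result = {
    "list": [0, 4, 6, 8, 20],
    "comprehens": [1, 20, 71, 33, 42],
    "python": [14, 20],
}
print(intersect_results(result))
```

Starting from the shortest list keeps every intermediate result as small as possible, so later intersections have less work to do.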

Simple search: phrase

Applying the score for consecutive words

4- Apply a score to pages that contain the words of the phrase consecutively.

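from bisect import bisect_right

def apply_positions_score(word, next_word, page, score=0):
    # get_positions is the lookup into the stored word positions (sorted lists)
    first = get_positions(word, page)        # first = [0, 3, 5, 8, 21, ...]
    second = get_positions(next_word, page)  # second = [1, 6, 9, 12, 18, ...]

    for pos in first:
        # index of the first position in `second` that is bigger than `pos`
        start = bisect_right(second, pos)
        if start == len(second):
            break  # next_word never occurs after this position

        dif = second[start] - pos
        if dif == 1:  # next_word directly follows word
            score += 50
    return score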

5- Sort by the computed score

6- Done!

Advanced search

and

Same as phrase search

not

Remove pages that appear in the result lists of the forbidden words

or

Give a higher score to pages that appear in the results of the OR words.
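The three operators can be combined in one small function. This is a sketch under assumptions: the function name, the argument layout, and the bonus value `or_bonus` are all hypothetical, since the slides do not say how much extra score an OR match earns.

```python
def advanced_search(and_lists, not_lists, or_lists, scores, or_bonus=10):
    """Rank pages for a query with AND, NOT, and OR words.

    and_lists / not_lists / or_lists: one list of page ids per query word.
    scores: {page_id: score computed so far}.
    """
    # and: intersect the result lists, as in phrase search
    if and_lists:
        candidates = set(and_lists[0]).intersection(*and_lists[1:])
    else:
        candidates = set()

    # not: drop every page that appears in a forbidden word's results
    for pages in not_lists:
        candidates -= set(pages)

    # or: add a bonus for each OR word whose results contain the page
    final = {}
    for page in candidates:
        bonus = sum(or_bonus for pages in or_lists if page in pages)
        final[page] = scores.get(page, 0) + bonus

    return sorted(final, key=lambda page: final[page], reverse=True)


ranked = advanced_search(
    and_lists=[[2, 4, 13, 20], [2, 4, 20]],
    not_lists=[[4]],
    or_lists=[[20, 42]],
    scores={2: 30, 20: 25},
)
print(ranked)
```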

Statistics

Number of pages: 5209

Number of words: 14093

Number of word positions: 662077

Number of index entries: 302980

GoSearch!

By Amirhossein Kazemnejad