## Web information retrieval project

### Linguistic processing

#### CACM collection

Here is the analysis obtained for the CACM collection:

```
****************** Count tokens ******************

Total count of tokens : 	108,447
Vocabulary size: 		    11,627


****** Count tokens for half the collection ******

Total count of tokens : 	30,052
Vocabulary size: 		    6,049


******** Heap's law parameters estimation ********

b: 	0.509
k: 	31.7

estimation of vocabulary size for 1M tokens : 36034
```
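
The Heaps' law parameters above can be recovered from the two measurements alone. The sketch below assumes the law has the form V = k * T^b (V the vocabulary size, T the token count) and that the parameters were obtained by solving this equation for the full and the half collection; the function names `estimate_heaps` and `predict_vocabulary` are hypothetical and not taken from the project code.

```python
import math


def estimate_heaps(tokens_full, vocab_full, tokens_half, vocab_half):
    """Solve V = k * T**b for (k, b) from two (T, V) measurements."""
    # Dividing the two equations eliminates k and leaves b.
    b = math.log(vocab_full / vocab_half) / math.log(tokens_full / tokens_half)
    # Substitute b back into the full-collection equation to get k.
    k = vocab_full / tokens_full ** b
    return k, b


def predict_vocabulary(k, b, tokens):
    """Extrapolate the vocabulary size for a given token count."""
    return k * tokens ** b


# Counts reported above for the CACM collection.
k, b = estimate_heaps(108_447, 11_627, 30_052, 6_049)
print(f"b: {b:.3f}")
print(f"k: {k:.1f}")
print(f"vocabulary for 1M tokens: {predict_vocabulary(k, b, 1_000_000):.0f}")
```

With the CACM counts this should land close to the reported parameters. Small differences in the extrapolated vocabulary size are possible, depending on how the original script rounds intermediate values.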

Zipf's law plots:

![zipf_law](/graphs/cacm_zipf_law.png)
![zipf_law_logs](/graphs/cacm_zipf_law_logs.png)
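
As an illustration of the data behind such plots, here is a minimal sketch of how rank and frequency pairs can be computed and how the slope of the log-log curve can be estimated. It assumes the collection is already tokenized into a list of strings; `rank_frequency`, `log_log_slope`, and the toy token list are hypothetical and not taken from the project code.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent term first."""
    counts = Counter(tokens)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def log_log_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Toy input standing in for a tokenized collection.
toy_tokens = "the cat and the dog and the bird".split()
pairs = rank_frequency(toy_tokens)
print(pairs)
print(log_log_slope(pairs))
```

Plotting the pairs directly gives the first kind of graph, and plotting their logarithms gives the second. Under Zipf's law the log-log curve is roughly a straight line with a slope near -1.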

#### CS276 collection

Here is the analysis obtained for the CS276 collection:

```
****************** Count tokens ******************

Total count of tokens : 	25,498,340
Vocabulary size: 		    347,071


****** Count tokens for half the collection ******

Total count of tokens : 	14,332,579
Vocabulary size: 		    196,989


******** Heap's law parameters estimation ********

b: 	0.983
k: 	0.0181

estimation of vocabulary size for 1M tokens : 14374
```

Zipf's law plots:

![zipf_law](/graphs/cs276_zipf_law.png)
![zipf_law_logs](/graphs/cs276_zipf_law_logs.png)