## Web Information Retrieval Project

### Linguistic Processing

#### CACM Collection

Here is the analysis obtained for the CACM collection:

```
****************** Count tokens ******************

Total count of tokens : 	108,447
Vocabulary size: 		11,627


****** Count tokens for half the collection ******

Total count of tokens : 	30,052
Vocabulary size: 		6,049


******** Heap's law parameters estimation ********

b: 	0.509
k: 	31.7

estimation of vocabulary size for 1M tokens : 36034

```
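
The parameters above are consistent with a two-point fit of Heaps' law, V = k · T^b, using the full collection and the half collection as the two measurements. The following is a minimal sketch of that calculation, not the project's actual code: the function names `estimate_heaps` and `predict_vocabulary` are hypothetical, and the assumption is that `b` comes from the ratio of the two log-differences and `k` is then solved from the full-collection point.

```python
import math


def estimate_heaps(tokens_full, vocab_full, tokens_half, vocab_half):
    """Solve V = k * T**b for (k, b) from two (T, V) measurements."""
    # Taking logs of both measurements and subtracting eliminates k.
    b = math.log(vocab_full / vocab_half) / math.log(tokens_full / tokens_half)
    # Substitute b back into the full-collection measurement to get k.
    k = vocab_full / tokens_full ** b
    return k, b


def predict_vocabulary(k, b, tokens):
    """Vocabulary size predicted by Heaps' law for a given token count."""
    return k * tokens ** b


# Token and vocabulary counts reported for the CACM collection.
k, b = estimate_heaps(108_447, 11_627, 30_052, 6_049)
print(f"b: {b:.3f}")
print(f"k: {k:.1f}")
print(f"vocabulary for 1M tokens: {predict_vocabulary(k, b, 1_000_000):.0f}")
```

With the CACM counts this should land close to the reported `b`, `k`, and 1M-token estimate. A fit through only two points is sensitive to how the half collection is chosen, so the values are best read as rough estimates.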

#### CS276 Collection

Here is the analysis obtained for the CS276 collection:

```
****************** Count tokens ******************

Total count of tokens : 	17,879,253
Vocabulary size: 		337,191


****** Count tokens for half the collection ******

Total count of tokens : 	9,958,569
Vocabulary size: 		191,499


******** Heap's law parameters estimation ********

b: 	0.967
k: 	0.0328

estimation of vocabulary size for 1M tokens : 20755
```