Commit f5b729ed authored by Prot Alexandre

first draft of rapport.md

parent 71037300
## Web Information Retrieval Project
### Linguistic Processing
#### CACM Collection
Here is the analysis obtained for the CACM collection:
```
****************** Count tokens ******************
Total count of tokens : 108,447
Vocabulary size: 11,627
****** Count tokens for half the collection ******
Total count of tokens : 30,052
Vocabulary size: 6,049
******** Heap's law parameters estimation ********
b: 0.509
k: 31.7
estimation of vocabulary size for 1M tokens : 36034
```
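
The parameters reported above are consistent with fitting Heaps' law, `V = k * T^b`, through two measurements: the full collection and half of it. The following is a minimal sketch under that assumption, not necessarily how the project's script computes them. The function names `estimate_heaps_params` and `predict_vocabulary` are hypothetical and chosen only for illustration.

```python
import math


def estimate_heaps_params(tokens_full, vocab_full, tokens_half, vocab_half):
    """Solve V = k * T**b for (k, b) from two (T, V) measurements."""
    # Dividing the two equations eliminates k:
    #   V_full / V_half = (T_full / T_half) ** b
    b = math.log(vocab_full / vocab_half) / math.log(tokens_full / tokens_half)
    # Substitute b back into the full-collection equation to recover k.
    k = vocab_full / tokens_full ** b
    return k, b


def predict_vocabulary(k, b, tokens):
    """Vocabulary size predicted by Heaps' law for a given token count."""
    return k * tokens ** b


# Token and vocabulary counts taken from the CACM output above.
k, b = estimate_heaps_params(108_447, 11_627, 30_052, 6_049)
print(f"b: {b:.3f}")
print(f"k: {k:.1f}")
print(f"estimated vocabulary for 1M tokens: {predict_vocabulary(k, b, 1_000_000):.0f}")
```

Run on the CACM counts, this should give values close to the reported `b` and `k`. Small differences in the last digit can come from rounding. Two points determine the two parameters exactly, so the fit says nothing about how well the law holds across the collection. A least-squares fit over more sample sizes would be more robust.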
#### CS276 Collection
Here is the analysis obtained for the CS276 collection:
```
****************** Count tokens ******************
Total count of tokens : 17,879,253
Vocabulary size: 337,191
****** Count tokens for half the collection ******
Total count of tokens : 9,958,569
Vocabulary size: 191,499
******** Heap's law parameters estimation ********
b: 0.967
k: 0.0328
estimation of vocabulary size for 1M tokens : 20755
```