Explanations
instances of the word "review" and its variations
Negative Logits
fol       -0.18
gow       -0.16
uraa      -0.16
bell      -0.16
arr       -0.16
hower     -0.15
geber     -0.15
abyrin    -0.15
373       -0.14
htub      -0.14
Positive Logits
able      0.26
ees       0.21
ers       0.20
ables     0.18
ee        0.18
/meta     0.17
uated     0.17
ABLE      0.17
iger      0.17
/comment  0.16
Activations Density: 0.027%