INDEX
Explanations
expressions of opinion or personal judgment
phrases related to speech and expression of thoughts
Negative Logits
unal          -0.63
agascar       -0.62
pered         -0.62
aired         -0.60
Flavoring     -0.59
astern        -0.58
ockets        -0.58
actionDate    -0.57
agra          -0.57
recomm        -0.56
Positive Logits
'        1.29
'[       1.28
hey      1.20
"'       1.19
`        1.14
\"       1.08
'(       1.07
wow      0.99
"        0.98
hello    0.97
Activations Density 0.148%