INDEX
Explanations
descriptive phrases related to detailed presentations or explanations
suggestions or recommendations
Negative Logits
Pigs     -0.61
Kills    -0.56
kills    -0.52
ittens   -0.50
hate     -0.50
indal    -0.50
hates    -0.48
KO       -0.48
kill     -0.47
shit     -0.47
Positive Logits
nonetheless      0.75
nevertheless     0.73
wondering        0.69
cautiously       0.67
pmwiki           0.65
ãĤ¦ãĤ¹           0.65
understandably   0.63
etheless         0.63
hopeful          0.60
parting          0.59
Activations Density: 0.850%
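
The density figure is not defined on this page. A common reading, assumed here, is the fraction of token positions on which the feature has a nonzero activation. The sketch below illustrates that reading only; the function name activation_density, the threshold parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 2000 token positions, 17 of them with a nonzero activation.
sample = [0.0] * 1983 + [1.2] * 17
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.850% would mean the feature is active on roughly 17 of every 2000 tokens, so it is sparse but not rare.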