Explanations
statements related to personal opinions or declarations
Negative Logits
luaj        -0.78
opez        -0.78
ngth        -0.73
elve        -0.73
styles      -0.70
dies        -0.70
aneers      -0.69
alties      -0.67
anas        -0.66
ãĤ¨ãĥ«      -0.66
Positive Logits
happening     0.93
true          0.75
unacceptable  0.73
nonsense      0.71
untrue        0.71
textbook      0.70
HUGE          0.69
purely        0.67
blat          0.67
assuming      0.65
Activation Density: 0.758%
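
The page does not say how the density figure is computed. A minimal sketch, assuming it means the fraction of token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and not taken from the page:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 8 of them active.
toy_activations = [0.0] * 992 + [1.5] * 8
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.758% would mean the feature fires on roughly 7 or 8 of every 1000 tokens sampled.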