Explanations
mentions of enjoyable food items or rewards
references to treats or special foods
Negative Logits
constitu     -0.69
autonomous   -0.67
moot         -0.65
ova          -0.64
Karin        -0.62
seless       -0.62
dc           -0.59
condem       -0.59
Citiz        -0.59
Dani         -0.59
Positive Logits
ises         1.12
ties         0.95
ise          0.95
nels         0.94
itionally    0.90
pieces       0.90
piece        0.90
orial        0.87
terson       0.85
ery          0.85
Activations Density 0.024%