INDEX
Explanations
references to the concept of "prison"
references to prison
NEGATIVE LOGITS
lass      -0.74
issan     -0.69
udden     -0.69
thora     -0.68
oric      -0.67
Bundes    -0.66
laus      -0.66
yip       -0.65
///       -0.63
idy       -0.61
POSITIVE LOGITS
prisons       0.93
prison        0.93
inmates       0.92
prison        0.92
inmate        0.85
barr          0.82
confinement   0.82
jail          0.81
incarcer      0.80
sentences     0.80
Activations Density 0.019%