INDEX
Explanations
violence-related words and phrases
Negative Logits
ÄŁ          -0.62
guesses     -0.61
Tsukuyomi   -0.59
edin        -0.59
afar        -0.57
Paula       -0.57
Mb          -0.56
ij士        -0.55
omething    -0.54
vous        -0.54
Positive Logits
rice        1.03
hemat       1.03
hetically   0.98
terson      0.98
abase       0.98
hens        0.95
rix         0.93
ric         0.93
itudes      0.93
roph        0.92
Activations Density: 0.032%