Explanations
mentions of physical harm or destruction by fire
references to fire, burning, and related injuries or destruction
Negative Logits
onsense    -0.75
ournal     -0.74
reluct     -0.71
awaru      -0.69
egal       -0.68
udeau      -0.67
ensical    -0.66
alian      -0.66
ortun      -0.66
remlin     -0.65
Positive Logits
burning    1.14
burn       1.12
ished      1.04
burns      1.04
ishing     1.01
hotter     0.99
burning    0.98
burned     0.95
ishes      0.89
burner     0.86
Activations Density: 0.018%