feat: expand tool catalog and improve 'what' search recall

- Add 25 new tool and dockerfile entries to the README.md catalog.
- Increase 'what' shortlist limit to 100 for better search recall.
- Update 'what' default model to gemma4 and make JSON parsing more robust.
tke
2026-05-18 13:21:23 +02:00
parent ac3245b78f
commit ae5d503268
2 changed files with 71 additions and 18 deletions
+28
@@ -92,6 +92,11 @@ Format: `path | goal | usage`. This section is intentionally compact so `what` c
 - `tools/data/between` | goal: print text between delimiters | usage: `tools/data/between START END < file.txt`
 - `tools/data/csv_get` | goal: extract selected CSV fields quickly | usage: `tools/data/csv_get file.csv column`
 - `tools/data/csv2dot` | goal: turn CSV relationships into Graphviz dot edges | usage: `tools/data/csv2dot`
+- `config/visidata/plugins/ioc.py` | goal: VisiData plugin for IOC types (domains, URLs, hashes) with VT/MB integration | usage: `vd --plugin config/visidata/plugins/ioc.py ...`
+- `config/visidata/plugins/iptype.py` | goal: VisiData plugin for IP and CIDR types with enrichment | usage: `vd --plugin config/visidata/plugins/iptype.py ...`
+- `config/visidata/plugins/lookupcore.py` | goal: Core lookup and caching logic for VisiData plugins | usage: `(internal use)`
+- `config/visidata/scripts/validate_ioclib.py` | goal: Offline validation for IOC parsing logic | usage: `python3 config/visidata/scripts/validate_ioclib.py`
+- `config/visidata/scripts/validate_ip_lookups.py` | goal: Offline validation for IP lookup logic | usage: `python3 config/visidata/scripts/validate_ip_lookups.py`
 ### Hashing And Archives
@@ -115,6 +120,10 @@ Format: `path | goal | usage`. This section is intentionally compact so `what` c
 - `tools/cloud/speech.py` | goal: run cloud-backed speech or transcription tasks | usage: `python3 tools/cloud/speech.py input`
 - `tools/cloud/vqa3.py` | goal: classify images with a local or model-backed VQA workflow | usage: `python3 tools/cloud/vqa3.py image.jpg`
 - `tools/cloud/youtube_resolve.sh` | goal: resolve direct media URLs from YouTube-like inputs | usage: `tools/cloud/youtube_resolve.sh URL`
+- `tools/dockerpullsave.py` | goal: download Docker images as tarballs without requiring a Docker daemon | usage: `python3 tools/dockerpullsave.py image:tag`
+- `scripts/proxy/install_proxy.sh` | goal: installer for the Dumb Pipe Proxy Bridge service | usage: `scripts/proxy/install_proxy.sh`
+- `scripts/proxy/bridge.js` | goal: Node.js proxy bridge with keyring authentication support | usage: `node scripts/proxy/bridge.js`
+- `scripts/proxy/setup.js` | goal: Interactive setup for storing proxy credentials in the system keyring | usage: `node scripts/proxy/setup.js`
 ### Formats, System, And Text Experiments
@@ -128,9 +137,28 @@ Format: `path | goal | usage`. This section is intentionally compact so `what` c
 - `tools/system/ltop.py` | goal: show the most frequent lines from a stream like `top` | usage: `tail -f log | python3 tools/system/ltop.py`
 - `tools/system/noerr` | goal: run a command with stderr suppressed | usage: `tools/system/noerr some command`
 - `tools/system/wipe.sh` | goal: perform destructive wipe or cleanup steps | usage: `tools/system/wipe.sh target`
+- `tools/system/copy_firefox_extension.sh` | goal: sync Firefox extensions between profiles (e.g. internet to intranet) | usage: `tools/system/copy_firefox_extension.sh [ext_name]`
 - `tools/text/probability.py` | goal: run a small text probability experiment | usage: `python3 tools/text/probability.py`
 - `tools/text/depth` | goal: inspect text depth or nesting characteristics | usage: `tools/text/depth input.txt`
+### Container Recipes
+- `dockerfiles/firefox.dockerfile` | goal: Docker recipe for a containerized Firefox with VNC access | usage: `docker build -f dockerfiles/firefox.dockerfile .`
+- `dockerfiles/kali.dockerfile` | goal: Docker recipe for a Kali Linux base image | usage: `docker build -f dockerfiles/kali.dockerfile .`
+- `dockerfiles/plaso.dockerfile` | goal: Docker recipe for the Plaso (log2timeline) forensic tool | usage: `docker build -f dockerfiles/plaso.dockerfile .`
+- `dockerfiles/volatility/Dockerfile` | goal: Docker recipe for Volatility memory forensics | usage: `docker build -f dockerfiles/volatility/Dockerfile .`
+- `dockerfiles/regripper/Dockerfile` | goal: Docker recipe for RegRipper (Windows registry analysis) | usage: `docker build -f dockerfiles/regripper/Dockerfile .`
+- `dockerfiles/pdf-analysis/pdf-analysis.dockerfile` | goal: Docker recipe for a PDF analysis environment with peepdf and DidierStevensSuite | usage: `docker build -f dockerfiles/pdf-analysis/pdf-analysis.dockerfile .`
+- `dockerfiles/flatpdf/Dockerfile` | goal: Docker recipe for a PDF flattening environment using pdftk | usage: `docker build -f dockerfiles/flatpdf/Dockerfile .`
+- `dockerfiles/tools/clamav.dockerfile` | goal: Docker recipe for a ClamAV scanner | usage: `docker build -f dockerfiles/tools/clamav.dockerfile .`
+- `dockerfiles/tools/john.dockerfile` | goal: Docker recipe for John the Ripper (Kali-based) | usage: `docker build -f dockerfiles/tools/john.dockerfile .`
+- `dockerfiles/tools/nmap.dockerfile` | goal: Docker recipe for an Nmap scanner | usage: `docker build -f dockerfiles/tools/nmap.dockerfile .`
+- `dockerfiles/tools/tcpdump.dockerfile` | goal: Docker recipe for tcpdump packet capture | usage: `docker build -f dockerfiles/tools/tcpdump.dockerfile .`
+- `dockerfiles/cherokee/cherokee.dockerfile` | goal: Docker recipe for the Cherokee web server | usage: `docker build -f dockerfiles/cherokee/cherokee.dockerfile .`
+- `dockerfiles/logstash/logstash.conf` | goal: Sample Logstash configuration for various ingestion cases | usage: `docker run -v $(pwd)/logstash.conf:/usr/share/logstash/pipeline/logstash.conf logstash`
+- `dockerfiles/build_firefox.sh` | goal: Build and run the Firefox VNC container | usage: `dockerfiles/build_firefox.sh`
+- `dockerfiles/build_kali.sh` | goal: Build the Kali Linux Docker image | usage: `dockerfiles/build_kali.sh`
 ### CTF Helpers
 - `tools/ctf/filtertext.py` | goal: filter challenge text to useful fragments | usage: `python3 tools/ctf/filtertext.py input.txt`
+43 -18
@@ -21,7 +21,7 @@ from pathlib import Path
 REPO_ROOT = Path(__file__).parent.resolve()
 README_PATH = REPO_ROOT / "README.md"
-DEFAULT_MODEL = os.environ.get("WHAT_OLLAMA_MODEL", "ministral-3:3b")
+DEFAULT_MODEL = os.environ.get("WHAT_OLLAMA_MODEL", "gemma4")
 CATALOG_HEADING = "## Tool Catalog"
 ENTRY_RE = re.compile(
     r"^- `([^`]+)` \| goal: (.*?) \| usage: (.*)$"
@@ -123,13 +123,20 @@ def build_prompt(query: str, entries: list[dict[str, str]]) -> str:
     return f"""You are selecting tools from a repository catalog.
 Use only the catalog below. Prefer direct matches. Use archived tools only if they clearly fit the request.
-Return strict JSON only. The response must be a JSON array with up to 8 objects.
-Each object must contain:
-- "path": exact catalog path
-- "reason": one short sentence
+Return strict JSON matching this schema exactly:
+{{
+  "results": [
+    {{
+      "path": "exact catalog path",
+      "reason": "one short sentence explaining why this tool matches"
+    }}
+  ]
+}}
-Do not invent paths. Do not include markdown.
-Prefer the entry whose action best matches the query: compare beats hash for comparison queries, open beats convert for opening queries, and mount beats inspect for mount queries.
+Constraints:
+- The "results" array must contain up to 8 objects.
+- Do not invent paths.
+- Prefer the entry whose action best matches the query: compare beats hash for comparison queries, open beats convert for opening queries, and mount beats inspect for mount queries.
 Query: {query}
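The prompt is built with an f-string, so the literal braces of the JSON schema have to be doubled. The following is a minimal sketch of that escaping, not the commit's actual `build_prompt`: the function name `schema_prompt_sketch` and the shortened prompt text are illustrative assumptions.

```python
import json


def schema_prompt_sketch(query: str) -> str:
    # Hypothetical, shortened stand-in for build_prompt(); only the brace
    # escaping mirrors the hunk above.
    return f"""Return strict JSON matching this schema exactly:
{{
  "results": [
    {{
      "path": "exact catalog path",
      "reason": "one short sentence"
    }}
  ]
}}
Query: {query}
"""


prompt = schema_prompt_sketch("compare two files")

# Drop the first line and everything from "Query:" on; what remains is the
# rendered schema example, which should now be plain JSON with single braces.
schema_text = prompt.split("\n", 1)[1].rsplit("Query:", 1)[0]
schema = json.loads(schema_text)
print(schema["results"][0]["path"])
```

If a brace were left undoubled, Python would try to interpolate it as a replacement field, so parsing the rendered example is a cheap sanity check on the template.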
@@ -142,7 +149,7 @@ def tokenize(text: str) -> set[str]:
     return set(TOKEN_RE.findall(text.lower()))
-def shortlist_entries(query: str, entries: list[dict[str, str]], limit: int = 28) -> list[dict[str, str]]:
+def shortlist_entries(query: str, entries: list[dict[str, str]], limit: int = 100) -> list[dict[str, str]]:
     query_tokens = tokenize(query)
     if not query_tokens:
         return entries[:limit]
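The rest of `shortlist_entries` is not visible in this hunk, so the ranking step below is an assumption rather than the repository's code. It is a rough sketch of one common approach: score each entry by token overlap with the query, then keep the top `limit` entries. The `TOKEN_RE` pattern, the `path`/`goal`/`usage` dictionary keys, and the name `shortlist_sketch` are all illustrative.

```python
import re

# Assumption: the real TOKEN_RE is not shown in the diff; this pattern is illustrative.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> set[str]:
    return set(TOKEN_RE.findall(text.lower()))


def shortlist_sketch(query: str, entries: list[dict[str, str]], limit: int = 100) -> list[dict[str, str]]:
    """Rank entries by how many query tokens they share and keep the top `limit`."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return entries[:limit]
    scored = []
    for index, entry in enumerate(entries):
        haystack = " ".join((entry["path"], entry["goal"], entry["usage"]))
        overlap = len(query_tokens & tokenize(haystack))
        if overlap:
            # Negative overlap sorts best matches first; index keeps catalog order on ties.
            scored.append((-overlap, index, entry))
    scored.sort(key=lambda item: item[:2])
    return [entry for _, _, entry in scored[:limit]]


catalog = [
    {"path": "tools/data/between", "goal": "print text between delimiters",
     "usage": "tools/data/between START END < file.txt"},
    {"path": "tools/system/noerr", "goal": "run a command with stderr suppressed",
     "usage": "tools/system/noerr some command"},
]
hits = shortlist_sketch("hide stderr output", catalog)
print([entry["path"] for entry in hits])
```

With a scheme like this, a larger `limit` means more weak matches reach the model, which trades a longer prompt for better recall.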
@@ -163,28 +170,45 @@ def shortlist_entries(query: str, entries: list[dict[str, str]], limit: int = 28
 def extract_json_array(output: str) -> list[dict[str, str]]:
-    match = re.search(r"\[\s*\{.*\}\s*\]", output, re.DOTALL)
+    # Step 1: Clean and find the root object boundary if Ollama prefixes anything
+    match = re.search(r"\{\s*.*\}\s*", output, re.DOTALL)
     payload = match.group(0) if match else output
-    data = json.loads(payload)
-    if not isinstance(data, list):
-        raise WhatError("Model output must be a JSON array.")
+    try:
+        # ALLOW literal newlines/control characters inside string properties
+        data = json.loads(payload, strict=False)
+    except json.JSONDecodeError as exc:
+        raise WhatError(f"Failed to parse model output as JSON: {exc}")
+    if not isinstance(data, dict):
+        raise WhatError("Model output must be a root JSON object.")
+    # Step 2: Safe navigation into the expected schema array
+    results_list = data.get("results")
+    if results_list is None:
+        raise WhatError("Missing 'results' key in model JSON response.")
+    if not isinstance(results_list, list):
+        raise WhatError("The 'results' property must be a JSON array.")
+    # Step 3: Extract and normalize items
     normalized: list[dict[str, str]] = []
-    for item in data:
+    for item in results_list:
         if not isinstance(item, dict):
             continue
         path = str(item.get("path", "")).strip()
-        reason = str(item.get("reason", "")).strip()
+        # Clean up any literal newlines the model injected into the text
+        reason = str(item.get("reason", "")).replace("\n", " ").strip()
         if path:
             normalized.append({"path": path, "reason": reason})
     return normalized
 def run_ollama_once(prompt: str, model: str) -> str:
     try:
         result = subprocess.run(
-            ["ollama", "run", model, prompt],
+            ["ollama", "run", "--format", "json", "--hidethinking", model, prompt],
             capture_output=True,
             text=True,
             timeout=60,
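The switch to `json.loads(payload, strict=False)` matters when a model emits a raw newline inside a string value. The sketch below illustrates that behaviour with a made-up model reply. `parse_results_sketch` is a simplified, hypothetical stand-in for `extract_json_array` and raises `ValueError` instead of the repository's `WhatError`.

```python
import json


def parse_results_sketch(raw: str) -> list[dict[str, str]]:
    # strict=False lets the parser accept raw control characters (such as a
    # literal newline) inside string values, which strict mode rejects.
    data = json.loads(raw, strict=False)
    results = data.get("results") if isinstance(data, dict) else None
    if not isinstance(results, list):
        raise ValueError("expected a JSON object with a 'results' array")
    normalized = []
    for item in results:
        if not isinstance(item, dict):
            continue
        path = str(item.get("path", "")).strip()
        # Flatten any newline that survived parsing into a single space.
        reason = str(item.get("reason", "")).replace("\n", " ").strip()
        if path:
            normalized.append({"path": path, "reason": reason})
    return normalized


# Made-up model output with a literal newline inside the "reason" string.
raw = '{"results": [{"path": "tools/system/noerr", "reason": "suppresses\nstderr"}]}'

try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

parsed = parse_results_sketch(raw)
print(strict_ok, parsed)
```

The default strict parser rejects this input, while the lenient parse followed by the newline replacement yields a clean single-line reason.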
@@ -206,8 +230,9 @@ def run_ollama(prompt: str, model: str) -> list[dict[str, str]]:
         return extract_json_array(first_output)
     except (json.JSONDecodeError, WhatError):
         repair_prompt = (
-            "Rewrite the following response as strict JSON only.\n"
-            'Target format: [{"path":"exact catalog path","reason":"short reason"}]\n'
+            "Rewrite the following response as strict JSON matching the target schema.\n"
+            "Target format:\n"
+            '{\n "results": [{"path":"exact catalog path","reason":"short reason"}]\n}\n'
             "Do not add markdown or commentary.\n\n"
             f"Response to repair:\n{first_output}\n"
         )
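The hunk above shows only part of `run_ollama`, so the following is a hedged sketch of the parse-then-repair pattern it appears to follow, not the actual function. The names `parse_with_repair` and `fake_model` are hypothetical, and the model call is replaced by a plain callable so the example runs without Ollama.

```python
import json
from typing import Callable


def parse_with_repair(prompt: str, ask_model: Callable[[str], str]) -> dict:
    """Parse the first reply; on failure, ask once more with a repair prompt."""
    first_output = ask_model(prompt)
    try:
        return json.loads(first_output, strict=False)
    except json.JSONDecodeError:
        repair_prompt = (
            "Rewrite the following response as strict JSON matching the target schema.\n"
            "Do not add markdown or commentary.\n\n"
            f"Response to repair:\n{first_output}\n"
        )
        # A second failure propagates to the caller; there is no further retry.
        return json.loads(ask_model(repair_prompt), strict=False)


calls = []


def fake_model(prompt: str) -> str:
    # Illustrative stand-in for the model: chatty on the first call,
    # clean JSON when it receives the repair prompt.
    calls.append(prompt)
    if prompt.startswith("Rewrite the following response"):
        return '{"results": [{"path": "tools/data/between", "reason": "matches"}]}'
    return "Sure! Here is the tool: tools/data/between"


result = parse_with_repair("find text between markers", fake_model)
print(result["results"][0]["path"], len(calls))
```

Limiting the repair to a single retry bounds the worst-case latency at two model calls per query.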
@@ -300,4 +325,4 @@ def main() -> int:
 if __name__ == "__main__":
     raise SystemExit(main())