NowConnect and LDAP connector reliability enhancements

Scope

  • Services: nowconnect-cloud, nowconnect-premise
  • Client path: LDAP over NowConnect tunnel (addomainopenldap)
  • Connector: CRUDService/src/connectors/ldap_connector.py

What changed

NowConnect cloud observability

  • Added Prometheus metrics in nowconnect_cloud/metrics.py:

    • nowconnect_write_drain_seconds{connector,phase}: histogram of socket write and drain latency.
      • write: time to enqueue bytes on the socket's transport buffer
      • drain: time to flush the buffer to the kernel (backpressure visibility)
    • nowconnect_inbound_queue_size{connector}: gauge of per‑session inbound queue size (agent→cloud)
    • nowconnect_listener_errors_total{connector,type}: counter for listener/session errors
      • type=write | session | pdp
    • Existing counters remain: tcp_connections_total, tcp_bytes_total, FIN/RST, PDP decisions, connection_duration_seconds.
  • Instrumentation in nowconnect_cloud/listeners.py (a combined sketch of the metric definitions and this write path follows this list):

    • a2c (agent→cloud) path:
      • Records the inbound queue size (q.qsize()) for each frame
      • Times the write and drain phases and records them in the histogram
      • Increments listener error counter on write failure
    • PDP checks:
      • On PDP error, increments the PDP error-decision counter and the listener error counter (type=pdp)
    • The session exception path increments the listener error counter (type=session)
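
A minimal sketch of how these pieces fit together, using prometheus_client. The metric names and label sets match the list above, but the frame loop, queue, and writer objects are illustrative assumptions, not the actual listeners.py code:

  import time
  from prometheus_client import Counter, Gauge, Histogram

  WRITE_DRAIN_SECONDS = Histogram(
      "nowconnect_write_drain_seconds",
      "Socket write and drain latency",
      ["connector", "phase"],  # phase: write | drain
  )
  INBOUND_QUEUE_SIZE = Gauge(
      "nowconnect_inbound_queue_size",
      "Per-session inbound (agent→cloud) queue size",
      ["connector"],
  )
  LISTENER_ERRORS = Counter(
      "nowconnect_listener_errors_total",
      "Listener/session errors",
      ["connector", "type"],  # type: write | session | pdp
  )

  async def forward_a2c_frame(connector, q, writer):
      """Forward one queued agent→cloud frame to the client socket, timed.

      q is an asyncio.Queue of frames; writer is an asyncio.StreamWriter.
      """
      INBOUND_QUEUE_SIZE.labels(connector).set(q.qsize())
      frame = await q.get()
      try:
          t0 = time.monotonic()
          writer.write(frame)  # enqueue bytes on the transport buffer
          WRITE_DRAIN_SECONDS.labels(connector, "write").observe(time.monotonic() - t0)
          t1 = time.monotonic()
          await writer.drain()  # wait for the buffer to flush (backpressure)
          WRITE_DRAIN_SECONDS.labels(connector, "drain").observe(time.monotonic() - t1)
      except OSError:
          LISTENER_ERRORS.labels(connector, "write").inc()
          raise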

LDAP connector idempotency

  • In update_group_members, group membership changes are now idempotent (see the sketch after this list):
    • Treats LDAP result 16 (noSuchAttribute) on delete and result 20 (typeOrValueExists) on add as success, logged at info level.
    • Reduces flakiness during high‑churn tests and retries: benign duplicates no longer fail the flow.
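
A minimal sketch of the pattern, assuming the python-ldap client; the helper name and the member attribute are illustrative, not the actual update_group_members implementation:

  import logging

  import ldap  # python-ldap

  log = logging.getLogger(__name__)

  def apply_membership_change(conn, group_dn, op, member_dn):
      """Apply one MOD_ADD / MOD_DELETE on a group's member attribute, idempotently."""
      try:
          conn.modify_s(group_dn, [(op, "member", [member_dn.encode()])])
      except ldap.NO_SUCH_ATTRIBUTE:     # result 16: member already absent on delete
          log.info("member %s already absent from %s; treated as success", member_dn, group_dn)
      except ldap.TYPE_OR_VALUE_EXISTS:  # result 20: member already present on add
          log.info("member %s already present in %s; treated as success", member_dn, group_dn)

Callers pass ldap.MOD_ADD or ldap.MOD_DELETE; any other result code still raises and fails the flow as before.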

Why it matters

  • Reliability
    • Idempotent group membership updates prevent benign failures when tests or workflows retry a change, remove already‑removed members, or re‑add existing members.
  • Observability
    • Per‑write and per‑drain latency exposes backpressure and downstream slowness.
    • Queue size highlights burst pressure on agent→cloud delivery.
    • Explicit listener error counters (write/session/PDP) allow clean alerting without log parsing.

How to use

  • Metrics endpoint
    • Inside nowconnect-cloud container: http://localhost:8765/metrics
    • Prometheus scrape example:
      - job_name: 'nowconnect'
        static_configs:
          - targets: ['nowconnect-cloud:8765']
        metrics_path: /metrics
  • Grafana panels (suggested)

    • Write/drain latency:
      • p50/p95 for nowconnect_write_drain_seconds{phase="write"}, {phase="drain"} by connector
    • Queue pressure:
      • nowconnect_inbound_queue_size by connector (gauge; aggregate with max_over_time)
    • Traffic:
      • rate(nowconnect_tcp_bytes_total[1m]) by (direction,connector)
      • rate(nowconnect_tcp_connections_total[5m]) by (connector)
    • Errors:
      • increase(nowconnect_listener_errors_total[5m]) by (connector,type)
      • increase(nowconnect_pdp_decisions_total{result="error"}[5m]) by (connector)
  • Alerting (examples; a rules‑file sketch with the full PromQL follows this list)

    • Backpressure: p95 write or drain > 100ms for 5m (connector=addomain)
    • Queue overrun early‑warning: nowconnect_inbound_queue_size > 75% of queue_depth_per_cid for 2m
    • Listener error spike: increase(nowconnect_listener_errors_total[5m]) > threshold
    • PDP decision errors: increase(nowconnect_pdp_decisions_total{result="error"}[5m]) > 0
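
A sketch of these alerts as a Prometheus rules file. The first expr is also the full p95 query for the write/drain Grafana panels above; group and alert names, severities, and the listener‑error threshold are placeholder assumptions:

  groups:
    - name: nowconnect
      rules:
        - alert: NowConnectWriteBackpressure
          expr: >
            histogram_quantile(0.95,
              sum by (le, connector, phase)
                (rate(nowconnect_write_drain_seconds_bucket[5m]))) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "p95 {{ $labels.phase }} latency > 100ms on {{ $labels.connector }}"
        - alert: NowConnectListenerErrorSpike
          expr: increase(nowconnect_listener_errors_total[5m]) > 10  # placeholder threshold
          labels:
            severity: warning
          annotations:
            summary: "Listener errors ({{ $labels.type }}) rising on {{ $labels.connector }}"
        - alert: NowConnectPDPDecisionErrors
          expr: increase(nowconnect_pdp_decisions_total{result="error"}[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "PDP decision errors on {{ $labels.connector }}"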

Runbooks

  • Validate tunnel and metrics

    • Generate read‑only LDAP traffic via nowconnect-cloud:389 (example commands at the end of this section)
    • Check counters: nowconnect_tcp_connections_total and nowconnect_tcp_bytes_total increase; nowconnect_write_drain_seconds and queue‑size series are present; write/drain should sit near 0–10ms in a local environment
  • Investigate latency regressions

    • High drain latency → target or network backpressure
    • High write latency → CPU saturation in hub (consider more workers) or GIL contention
    • Growing queue size → agent or target slow; verify agent connectivity and OpenLDAP health
  • Investigate errors

    • nowconnect_listener_errors_total{type="session"} rising → inspect cloud logs for exceptions and stack traces
    • {type="write"} rising → socket write failures; inspect Docker network and target service health
    • PDP errors → verify PDP URL, timeouts, and PDP service logs
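
Example commands for the validation runbook above. The host and ports come from this page; the base DN and bind options are placeholders for your directory:

  # Generate read-only LDAP traffic through the tunnel (anonymous bind shown)
  ldapsearch -x -H ldap://nowconnect-cloud:389 -b "dc=example,dc=com" "(objectClass=*)" dn

  # From inside the nowconnect-cloud container, confirm the counters moved
  curl -s http://localhost:8765/metrics | grep -E \
    'nowconnect_(tcp_connections_total|tcp_bytes_total|write_drain_seconds|inbound_queue_size)'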

Developer guidance

  • Where to extend metrics

    • nowconnect_cloud/listeners.py:
      • Add additional labels if you route multiple connectors
      • Consider measuring per‑frame size buckets if needed (hypothetical sketch at the end of this section)
    • Hub‑level metrics (nowconnect_cloud/hub.py) can expose per‑session counts if deeper visibility is required.
  • Connector behavior

    • Idempotency is currently scoped to LDAP group membership adds/deletes:
      • Codes handled: 16 (noSuchAttribute), 20 (typeOrValueExists)
    • Extend similar patterns to other modify paths where safe and expected.
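
A hypothetical sketch of the per‑frame size idea from "Where to extend metrics" above; the metric name and buckets are assumptions:

  from prometheus_client import Histogram

  FRAME_BYTES = Histogram(
      "nowconnect_frame_bytes",  # hypothetical name
      "Per-frame payload size on the agent→cloud path",
      ["connector"],
      buckets=(256, 1024, 4096, 16384, 65536),
  )

  # In the listener's frame loop:
  #     FRAME_BYTES.labels(connector).observe(len(frame))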

Compatibility and deployment

  • Backward‑compatible: new metrics and idempotency do not change API surfaces.
  • Requires rebuild/redeploy of nowconnect-cloud for metrics and crud-service for connector changes.
  • No config changes required. Optional: expose the cloud hub's 8765/metrics endpoint externally via Traefik (hypothetical labels below) for Prometheus instances outside the Docker network.
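
If you do expose it, a hypothetical docker-compose fragment using Traefik v2 labels (the router/service names and host rule are assumptions):

  services:
    nowconnect-cloud:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.nowconnect-metrics.rule=Host(`metrics.example.internal`) && Path(`/metrics`)"
        - "traefik.http.services.nowconnect-metrics.loadbalancer.server.port=8765"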

Success criteria (visibility)

  • Grafana shows low, non‑zero write/drain histograms; queue size near zero at steady state.
  • Listener error counters near zero; PDP errors absent on healthy systems.
  • During perf tests: byte rates scale, latency p95 meets SLOs, no error spikes or sustained queue growth.

See also:

  • services/nowconnect/how-to/operational-validation-health
  • services/crud-service/explanation/ldap-connector-idempotency