Data Code 101
datacode101.bsky.social
Data / Software Engineering
- Typically 30–60% fewer tokens than JSON
- Explicit lengths and fields enable validation
- Removes redundant punctuation (braces, brackets, most quotes)
- Indentation-based structure: like YAML, it uses whitespace instead of braces
- Tabular arrays: declare keys once, stream data as rows
November 6, 2025 at 6:01 AM
JSON:

{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}

TOON:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
November 6, 2025 at 6:01 AM
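
To make the "declare keys once, stream data as rows" idea concrete, here is a minimal Python sketch that turns a uniform list of dicts into the tabular form shown above. It is not the official TOON library: the function name `encode_table` is made up for this example, and it skips the quoting and escaping a real encoder needs for values that contain commas, colons, or newlines.

```python
def encode_table(name, rows):
    """Encode a non-empty, uniform list of dicts as a TOON-style tabular array.

    Illustrative only: no quoting or escaping of special characters.
    """
    if not rows:
        raise ValueError("this sketch only handles non-empty arrays")

    # The field list is declared once, taken from the first row.
    fields = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != fields:
            raise ValueError("tabular form needs identical keys in every row")

    keys = ",".join(fields)
    # Header carries the explicit length and the field names.
    lines = [f"{name}[{len(rows)}]{{{keys}}}:"]
    for row in rows:
        # Each row is indented and holds only the values, in field order.
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)


users = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]
print(encode_table("users", users))
```

The explicit `[2]` length in the header is what lets a consumer check that no rows were dropped or invented.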
RAG is not just an integration problem. It’s a design problem. Each layer of this stack requires deliberate choices that impact latency, quality, explainability, and cost.

If you're serious about GenAI, it's time to think in terms of stacks—not just models.
October 27, 2025 at 10:36 AM
Evaluation

Tools like Ragas, TruLens, and Giskard bring much-needed observability, measuring hallucinations, relevance, grounding, and model behavior under pressure.
October 27, 2025 at 10:36 AM
Text Embeddings

The quality of retrieval starts here. Open-source models (Nomic, SBERT, BGE) are gaining ground, but proprietary offerings (OpenAI, Google, Cohere) still dominate enterprise use.
October 27, 2025 at 10:36 AM
Open LLM Access

Platforms like Hugging Face, Ollama, Groq, and Together AI abstract away infra complexity and speed up experimentation across models.
October 27, 2025 at 10:36 AM
Data Extraction (Web + Docs)

Whether you're crawling the web (Crawl4AI, Firecrawl) or parsing PDFs (LlamaParse, Docling), raw data access is non-negotiable. No context means no quality answers.
October 27, 2025 at 10:36 AM
Vector Database

Chroma, Qdrant, Weaviate, Milvus, and others power the retrieval engine behind every RAG system. Low-latency search, hybrid scoring, and scalable indexing are key to relevance.
October 27, 2025 at 10:36 AM
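
As a rough illustration of hybrid scoring, the sketch below blends a dense similarity score with a keyword score. Everything here is an assumption made for the example: the function names, the `alpha` weight, and the naive word-overlap measure. Production vector databases typically use BM25 or a similar lexical ranker and their own fusion methods, so treat this as a toy model of the idea rather than any engine's actual API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of query words that also appear in the document text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hybrid_score(query, query_vec, doc_text, doc_vec, alpha=0.5):
    """Weighted blend: alpha for the dense score, (1 - alpha) for keywords."""
    dense = cosine(query_vec, doc_vec)
    sparse = keyword_overlap(query, doc_text)
    return alpha * dense + (1 - alpha) * sparse


score = hybrid_score("red apple", [1.0, 0.0], "a red apple pie", [1.0, 0.0])
print(score)
```

Raising `alpha` favors semantic matches; lowering it favors exact term matches, which tends to help with product codes and rare names.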
Frameworks

LangChain, LlamaIndex, Haystack, and txtai are now essential for building orchestrated, multi-step AI workflows. These tools handle chaining, memory, routing, and tool-use logic behind the scenes.
October 27, 2025 at 10:36 AM
LLMs (Open vs Closed)

Open models like Llama 3, Phi-4, and Mistral offer control and customization. Closed models (OpenAI's GPT models, Claude, Gemini) bring powerful performance with less overhead. Your tradeoff: flexibility vs convenience.
October 27, 2025 at 10:36 AM
EtLT (Extract, transform, Load, Transform) (2/2)

Best for scenarios requiring strict data security/compliance (pre-load masking) while still benefiting from the speed and flexibility of cloud data warehouse transformations.
October 19, 2025 at 4:45 AM
EtLT (Extract, transform, Load, Transform) (1/2)

Attempts to balance the data governance of ETL with the speed and flexibility of ELT. A minimal transformation step is performed before loading. It covers essential tasks such as data cleaning, basic formatting, and masking sensitive data for immediate compliance.
October 19, 2025 at 4:45 AM
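
A small sketch of the EtLT flow, using an in-memory SQLite database as a stand-in for a cloud warehouse. The table name, column names, and the `mask_email` helper are hypothetical, and a truncated unsalted hash is used only to keep the example short; it is not a recommendation for real compliance work.

```python
import hashlib
import sqlite3


def mask_email(email):
    """Light pre-load step: replace the raw email with a short hash."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(normalized).hexdigest()[:12]


def run_etlt(source_rows):
    # Small "t": mask and tidy before anything reaches the warehouse.
    staged = [(uid, mask_email(email), plan.strip())
              for uid, email, plan in source_rows]

    # "L": load the already-masked rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users_staged (id INTEGER, email_hash TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO users_staged VALUES (?, ?, ?)", staged)

    # Big "T": heavier shaping runs as SQL inside the target system.
    summary = conn.execute(
        "SELECT LOWER(plan) AS plan, COUNT(*) FROM users_staged "
        "GROUP BY LOWER(plan) ORDER BY plan"
    ).fetchall()
    stored = [r[0] for r in conn.execute("SELECT email_hash FROM users_staged")]
    conn.close()
    return summary, stored


rows = [
    (1, "Alice@Example.com", " Pro "),
    (2, "bob@example.com", "free"),
    (3, "carol@example.com", "PRO"),
]
summary, stored_hashes = run_etlt(rows)
print(summary)
```

The point of the split is that raw emails never land in the warehouse, while the aggregation still uses the warehouse's own compute.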
ELT (Extract, Load, Transform) (2/2)

Transformation is implemented inside the target system (e.g., a modern cloud data warehouse like Snowflake or BigQuery, or a data lake). Highly scalable for massive and diverse (structured/unstructured) datasets.
October 19, 2025 at 4:45 AM
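
For contrast, here is a minimal ELT sketch: raw rows are loaded untouched, and all cleanup happens afterwards as SQL inside the target. SQLite in memory stands in for Snowflake or BigQuery, and the `raw_orders` and `orders_clean` tables are invented for the example.

```python
import sqlite3


def run_elt(raw_rows):
    conn = sqlite3.connect(":memory:")

    # Extract + Load: land the data as-is, messy strings and all.
    conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

    # Transform: runs inside the target system, after loading.
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT id,
               CAST(amount AS REAL) AS amount,
               UPPER(TRIM(country)) AS country
        FROM raw_orders
        WHERE amount IS NOT NULL
        """
    )
    rows = conn.execute(
        "SELECT id, amount, country FROM orders_clean ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


cleaned = run_elt([(1, "19.90", " us "), (2, None, "de"), (3, "5", "De")])
print(cleaned)
```

Because the raw table is kept, the transformation can be rewritten and rerun later without going back to the source system.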
ELT (Extract, Load, Transform) (1/2)

Modern Approach: Became popular with the rise of cloud-native data warehouses offering cheap storage and elastic compute. Raw, unprepared data is loaded immediately, offering faster data ingestion and near real-time analytics.
October 19, 2025 at 4:45 AM
ETL (Extract, Transform, Load) (2/2)

Transformation runs on a dedicated, separate staging server or processing engine outside the target data warehouse. Latency is typically higher, as the data must wait for the transformation to complete before loading.
October 19, 2025 at 4:45 AM
ETL (Extract, Transform, Load) (1/2)

Traditional Approach: Older methodology common with on-premises data warehouses where compute was limited and expensive. Data is cleaned and standardized, and sensitive information can be masked, before it enters the final warehouse.
October 19, 2025 at 4:45 AM