项目整体结构

1. Databases 数据库
A database is an organized collection of inter-related data that models some aspect of the real world (e.g., modeling the students in a class or a digital music store). Databases are the core component of most computer applications. Computer Science is essentially taking in some input, performing some operations, and then producing the output — which can also be seen as a form of database. 数据库是一个有组织的、相互关联的数据集合,用于模拟现实世界的某个方面(例如,模拟一个班级的学生或数字音乐商店)。数据库是大多数计算机应用的核心组件。计算机科学本质上是在接收一些输入、执行一些操作,然后产生输出——这也可以被视为一种数据库。
People often confuse “databases” with “database management systems (DBMS)” (e.g., MySQL, Oracle, MongoDB, Snowflake). 人们常常混淆“数据库”与“数据库管理系统(DBMS)”(例如,MySQL、Oracle、MongoDB、Snowflake)。 A DBMS is software that manages a database. Among many things, a DBMS is responsible for inserting, deleting, and retrieving data from a database. DBMS 是管理数据库的软件。在许多方面,DBMS 负责向数据库插入、删除和检索数据。
2. Flat File Strawman 平面文件稻草人
Consider that we want to store data for a music store like Spotify.
考虑我们想要存储类似 Spotify 的音乐商店的数据。
We want to hold information about artists and the albums those artists have released.
我们想要存储有关艺术家以及这些艺术家发行过的专辑的信息。
This database has two entities: an Artists entity and an Albums entity.
这个数据库有两个实体:一个 Artists 实体和一个 Albums 实体。
We choose to store the database’s records as comma-separated value (CSV) files that the DBMS manages, and each entity is stored in its own file (artists.csv and albums.csv).
我们选择将数据库的记录存储为 DBMS 管理的逗号分隔值(CSV)文件,并且每个实体都存储在其自己的文件( artists.csv 和 albums.csv )中。
Example schema: 示例架构:
- Artists: name, year, country 艺术家:姓名,年份,国家
- Albums: name, artist, year 专辑:名称,艺术家,年份
Example CSV for artists: 艺术家示例 CSV:
"Wu-Tang Clan", 1992, "USA"
"Notorious BIG", 1992, "USA"
"GZA", 1990, "USA"
Example Python-like pseudocode: 示例 Python 风格伪代码:
for line in file.readlines():
record = parse(line)
if record[0] == "GZA":
print(int(record[1]))
Issues with Flat Files 平面文件的问题
-
Data Integrity: 数据完整性:
- How do we ensure that the artist is the same for each album entry? 我们如何确保每张专辑条目中的艺术家都是相同的?
- What if someone overwrites the album year with an invalid string? 如果有人用无效字符串覆盖了专辑年份,会怎样?
- What if an album has multiple artists? 如果一张专辑有多个艺术家呢?
- What happens if we delete an artist who has albums? 如果我们删除一个有专辑的艺术家会怎样?
-
Implementation: 实现:
- How do you find a particular record? 如何找到一个特定的记录?
- What if another application (possibly on another machine) wants to use the same data? 如果另一个应用程序(可能在另一台机器上)想要使用相同的数据会怎样?
- What if two threads write to the same file simultaneously? 如果两个线程同时写入同一个文件会怎样?
-
Durability: 持久性:
- What if the machine crashes during an update? 如果在更新过程中机器崩溃会怎样?
- What if we want replication across machines for high availability? 如果我们需要跨机器进行复制以实现高可用性呢?
- Some systems may sacrifice durability for performance. 有些系统可能会为了性能而牺牲持久性。
3. Database Management System (DBMS) 数据库管理系统
A DBMS allows applications to store and analyze information in a database. 数据库管理系统 (DBMS) 允许应用程序在数据库中存储和分析信息。 A general-purpose DBMS supports definition, creation, querying, update, and administration of databases in accordance with some data model. 一种通用型数据库管理系统支持根据某种数据模型进行数据库的定义、创建、查询、更新和管理。
Examples of Data Models 数据模型示例
- Relational (most common) 关系型(最常见)
- NoSQL (key/value, document, graph) NoSQL(键/值、文档、图)
- Array / Matrix / Vector (for ML) 数组 / 矩阵 / 向量(用于机器学习)
A schema is a description of a particular collection of data using a given data model — defining its structure and meaning. 模式是使用给定数据模型对特定数据集合的描述——定义其结构和含义。
Common Data Models 常见数据模型
| Type类型 | Examples / Usage示例 / 使用 |
|---|---|
| Relational关系 | Most DBMSs大多数数据库管理系统 |
| Key/Value键/值 | Simple Apps / Caching简单应用 / 缓存 |
| Graph图 | NoSQL |
| Document / XML / Object文档 / XML / 对象 | NoSQL |
| Wide-Column / Column-family宽列 / 列族 | NoSQL |
| Array / Matrix / Vector数组 / 矩阵 / 向量 | ML / ScientificML / 科学 |
| Hierarchical层次结构 | Obsolete / Legacy过时/遗留 |
| Network网络 | Obsolete / Legacy过时 / 遗留 |
| Multi-Value多值 | Obsolete / Legacy过时 / 遗留 |
Early DBMSs 早期的数据库管理系统
In the 1960s, DBMSs (e.g., IDS, IMS, CODASYL) required procedural queries. 20 世纪 60 年代,数据库管理系统(如 IDS、IMS、CODASYL)需要过程性查询。 Developers had to write code specifying access paths and ordering — meaning changes to the data required rewriting code. 开发者必须编写代码来指定访问路径和排序——这意味着数据的变化需要重写代码。 Queries involved explicit loops and navigation orders; thus, assumptions could easily break as data grew. 查询涉及显式循环和导航顺序;因此,随着数据的增长,假设很容易被打破。
4. Relational Model 关系模型
Proposed by Ted Codd at IBM in the late 1960s. 由 IBM 的 Ted Codd 于 20 世纪 60 年代末提出。
The relational data model defines three main concepts: 关系数据模型定义了三个主要概念:
- Structure: 结构: Relations (tables) defined by a set of attributes, each with a domain of values. 关系(表)由一组属性定义,每个属性具有一个值域。
- Integrity: 完整性: Constraints to ensure data validity (e.g., age cannot be negative). 用于确保数据有效性的约束(例如,年龄不能为负)。
- Manipulation: 操作: Declarative API for data access and modification using relations (sets). 使用关系(集合)进行数据访问和修改的声明式 API。
Advantages 优点
- Data Independence: Applications are isolated from physical data representation. 数据独立性:应用程序与物理数据表示隔离。
- Optimization: DBMS can reorganize data storage or execution plans automatically. 优化:DBMS 可以自动重新组织数据存储或执行计划。
Core Concepts 核心概念
- Relation: Unordered set of tuples representing entities. 关系:表示实体的无序元组集合。
- Tuple: Set of attribute values (can be atomic, list, or nested). 元组:属性值的集合(可以是原子值、列表或嵌套)。
- Primary Key: Uniquely identifies a tuple. 主键:唯一标识一个元组。
- Foreign Key: Maps an attribute to a tuple in another relation. 外键:将一个属性映射到另一个关系中的元组。
- Constraint: User-defined conditions that must hold (e.g.,
NOT NULL,UNIQUE,FOREIGN KEY). 约束:用户定义的条件,必须成立(例如,NOT NULL,UNIQUE,FOREIGN KEY)。
5. Data Manipulation Languages (DMLs) 数据操作语言
DMLs are APIs that allow applications to store and retrieve data from a DBMS. DMLs 是允许应用程序存储和从 DBMS 检索数据的 API。
Two main classes: 两大主要类别:
-
Procedural: 过程式: The query specifies how to retrieve the data. 查询指定了如何检索数据。 Example: Iterate through all records in a loop. 示例:在循环中遍历所有记录。
-
Non-Procedural (Declarative): 非过程式(声明式): The query specifies what data to retrieve, not how. 查询指定了要检索什么数据,而不是如何检索。 Example: 示例:
SELECT COUNT(*) FROM artist;
6. Relational Algebra 关系代数
A formal system of operations for retrieving and manipulating tuples in relations. 一种用于检索和操作关系中的元组的正式运算系统。 Each operator takes one or more relations as input and outputs a new relation. 每个运算符以一个或多个关系作为输入,并输出一个新的关系。 Operators can be chained to build complex queries. 操作符可以串联起来构建复杂的查询。
Select (σ) 选择
Filters tuples satisfying a predicate. 过滤满足谓词的元组。
Syntax: 语法: σpredicate(R)
Example: 示例: σa_id=’a2’(R)
SQL Equivalent: SQL 等价:
SELECT * FROM R WHERE a_id = 'a2';
Projection (π) 投影
Selects specific attributes from a relation. 从关系中选取特定属性。
Syntax: 语法: πA1,A2,...,An(R)
Example: 示例: πb_id-100, a_id(σa_id=’a2’(R))
SQL Equivalent: SQL 等价:
SELECT b_id-100, a_id FROM R WHERE a_id = 'a2';
Union (∪) 并集
Combines tuples from both relations (must have same schema). 合并两个关系中的元组(必须具有相同的模式)。
Syntax: 语法: R ∪ S
SQL Equivalent: SQL 等价:
(SELECT * FROM R)
UNION ALL
(SELECT * FROM S);
Intersection (∩) 交集
Returns tuples present in both relations. 返回同时存在于两个关系中的元组。
Syntax: 语法: R ∩ S
SQL Equivalent: SQL 等价:
(SELECT * FROM R)
INTERSECT
(SELECT * FROM S);
Difference (−) 差异
Returns tuples in the first relation but not in the second. 返回第一个关系中存在但第二个关系中不存在的元组。
Syntax: 语法: R − S
SQL Equivalent: SQL 等价:
(SELECT * FROM R)
EXCEPT
(SELECT * FROM S);
Product (×) 笛卡尔积
Cartesian product of two relations. 两个关系的笛卡尔积。
Syntax: 语法: R × S
SQL Equivalent: SQL 等价:
SELECT * FROM R CROSS JOIN S;
-- or simply
SELECT * FROM R, S;
Join (▷◁) 连接
Combines tuples from two relations based on shared attributes. 根据共享属性组合两个关系中的元组。
Syntax: 语法: R ▷◁ S
SQL Equivalent: SQL 等价:
SELECT * FROM R JOIN S USING (ATTRIBUTE1, ATTRIBUTE2...);
Observation 观察
Relational Algebra defines both operations and order of execution. 关系代数定义了操作和执行顺序。
Example comparison: 示例比较:
- σb_id=102(R ▷◁ S): Join first, then filter. 先连接,再过滤。
- (R ▷◁ σb_id=102(S)): Filter S first, then join. 先过滤 S,再连接。
Although results are the same, the second is much more efficient if S is large. 尽管结果相同,但如果 S 很大,第二种方法效率高得多。
In SQL, the user specifies what to compute, not how to compute it — the DBMS performs query optimization. 在 SQL 中,用户指定要计算什么,而不是如何计算——数据库管理系统执行查询优化。
7. Other Data Models 其他数据模型
Document Data Model 文档数据模型
- Collection of hierarchical documents with field/value pairs. 包含字段/值对的层次化文档集合。
- Fields can contain:
字段可以包含:
- Scalars 标量
- Arrays 数组
- Pointers to other documents 指向其他文档
- Modern systems use JSON; older ones used XML or custom objects. 现代系统使用 JSON;旧系统使用 XML 或自定义对象。
- Solves some use cases but retains flat-file problems. 解决了一些用例,但保留了平面文件问题。
Vector Data Model 向量数据模型
- Represents 1D arrays for nearest-neighbor (NN) search (exact or approximate). 表示用于最近邻(NN)搜索的一维数组(精确或近似)。
- Used in semantic search, especially with embeddings from ML models (e.g., transformers). 用于语义搜索,尤其是在使用机器学习模型(如 transformers)的嵌入时。
- Supports integration with tools such as LangChain and OpenAI APIs. 支持与 LangChain 和 OpenAI API 等工具集成。
- Many relational DBMSs now support vector indexes (e.g.,
pgvector) for efficient NN search. 许多关系型数据库管理系统现在支持向量索引(例如pgvector),以实现高效的最近邻搜索。