Overview
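Apache Cassandra is an open-source, distributed NoSQL wide-column database built for very large data volumes. Its masterless architecture has no single point of failure, and it supports high availability across multiple data centers with linear scalability. It suits write-heavy workloads such as time-series data, IoT, and messaging systems. Queries are written in CQL, a SQL-like query language. Cassandra was originally developed at Facebook and later donated to the Apache Software Foundation.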
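| Requirement | Details |
|---|---|
| Operating system | Linux (Ubuntu 20.04+ / CentOS 8+ recommended), macOS, Windows (via Docker) |
| Java runtime | OpenJDK 11 for Cassandra 4.x; OpenJDK 17 is supported from Cassandra 5.0 |
| Memory | 4 GB minimum, 8 GB+ recommended |
| Disk | SSD recommended, at least 10 GB free |
| Python | Python 3.6+ (required by the cqlsh command-line tool) |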
# Install Java
sudo apt update && sudo apt install -y openjdk-11-jdk
# Add the Cassandra repository
echo "deb https://debian.cassandra.apache.org 41x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt update
# Install Cassandra
sudo apt install -y cassandra
# Start the service
sudo systemctl start cassandra
sudo systemctl enable cassandra
# Check the node status
nodetool status
cqlsh
# Pull the image and start a single node
docker run --name cassandra -p 9042:9042 -d cassandra:latest
# Wait for startup to finish
docker exec cassandra nodetool status
# Open cqlsh
docker exec -it cassandra cqlsh
brew install cassandra
brew services start cassandra
cqlsh
wget https://dlcdn.apache.org/cassandra/4.1.5/apache-cassandra-4.1.5-bin.tar.gz
tar -xzf apache-cassandra-4.1.5-bin.tar.gz
cd apache-cassandra-4.1.5
bin/cassandra -f # run in the foreground
Cassandra needs permission to lock memory. To fix this:
# Add to /etc/security/limits.conf
cassandra - memlock unlimited
cassandra - nofile 100000
Confirm that Cassandra is listening. Check these settings in conf/cassandra.yaml:
listen_address: localhost
rpc_address: localhost
native_transport_port: 9042
To use different ports, change storage_port (default 7000) and native_transport_port (default 9042) in conf/cassandra.yaml.
A Cassandra cluster takes roughly 30 to 60 seconds to initialize. Wait, then retry:
docker logs -f cassandra # follow the startup progress
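The wait-and-retry advice can also be scripted. The following is a minimal sketch, not part of the driver: `connect_with_retry`, its parameters, and the default attempt count and delay are hypothetical names and values chosen for illustration. It takes any zero-argument callable that either returns a session or raises.

```python
import time


def connect_with_retry(connect, attempts=10, delay_seconds=5.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    `connect` is any zero-argument callable that returns a session or raises.
    `sleep` is injectable so the helper can be exercised without real waiting.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:  # a real script might catch only driver errors
            last_error = exc
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"node not reachable after {attempts} attempts") from last_error
```

With cassandra-driver installed, one possible call is `connect_with_retry(lambda: Cluster(['127.0.0.1'], port=9042).connect())`, which retries for a little under a minute with the defaults assumed here.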
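Create your first Cassandra keyspace, create a table, and run CRUD operations to see where CQL resembles SQL and where it differs.
Run the following statements in cqlsh, in order: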
-- 1. Create a keyspace (specifying the replication strategy)
CREATE KEYSPACE university
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};
-- 2. Switch to that keyspace
USE university;
-- 3. Create a table (note that the PRIMARY KEY is a partition key plus a clustering key)
CREATE TABLE students (
    department TEXT,
    student_id UUID,
    name TEXT,
    age INT,
    email TEXT,
    gpa FLOAT,
    enrolled_date DATE,
    PRIMARY KEY (department, student_id)
);
-- 4. Insert data (the full primary key must be provided)
INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
VALUES ('Computer Science', uuid(), '张三', 21, 'zhangsan@example.com', 3.8, '2024-09-01');
INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
VALUES ('Computer Science', uuid(), '李四', 22, 'lisi@example.com', 3.5, '2024-09-01');
INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
VALUES ('Mathematics', uuid(), '王五', 20, 'wangwu@example.com', 3.9, '2024-09-01');
-- 5. Query (querying by partition key is the most efficient)
SELECT * FROM students WHERE department = 'Computer Science';
-- 6. Update (effectively also an insert, because Cassandra writes are upserts)
UPDATE students SET gpa = 3.9 WHERE department = 'Computer Science' AND student_id = <some-uuid>;
-- 7. Delete
DELETE FROM students WHERE department = 'Computer Science' AND student_id = <some-uuid>;
-- 8. ALLOW FILTERING (scans the whole table; use with caution in production!)
SELECT * FROM students WHERE gpa > 3.5 ALLOW FILTERING;
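To make the upsert and ALLOW FILTERING remarks concrete, here is a toy in-memory model of the students table. It is only a sketch of the access pattern, not of how Cassandra stores data: `ToyTable` and its methods are hypothetical names, and small integers stand in for the UUID clustering key.

```python
class ToyTable:
    """Rows grouped by partition key, then addressed by clustering key."""

    def __init__(self):
        # partition key -> {clustering key -> {column name -> value}}
        self._partitions = {}

    def upsert(self, department, student_id, **columns):
        """Model both INSERT and UPDATE: set columns, creating the row if needed."""
        partition = self._partitions.setdefault(department, {})
        partition.setdefault(student_id, {}).update(columns)

    def by_partition(self, department):
        """Model WHERE department = ?: read one partition, ordered by clustering key."""
        partition = self._partitions.get(department, {})
        return [partition[key] for key in sorted(partition)]

    def scan_filter(self, predicate):
        """Model ALLOW FILTERING: visit every row in every partition."""
        visited, matches = 0, []
        for partition in self._partitions.values():
            for row in partition.values():
                visited += 1
                if predicate(row):
                    matches.append(row)
        return matches, visited


table = ToyTable()
table.upsert("Computer Science", 1, name="张三", gpa=3.8)
table.upsert("Computer Science", 2, name="李四", gpa=3.5)
table.upsert("Mathematics", 3, name="王五", gpa=3.9)
table.upsert("Computer Science", 1, gpa=3.9)  # an "UPDATE" of an existing row
table.upsert("Physics", 4, gpa=3.0)           # an "UPDATE" of a missing row creates it
```

In this model `by_partition` reads a single dictionary entry, while `scan_filter` reports how many rows it had to visit. That is the reason filtering on a non-key column such as gpa gets more expensive as the table grows.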
# pip install cassandra-driver
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider  # only needed when authentication is enabled
import uuid
# Connect to the cluster
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()
# Create the keyspace
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS university
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('university')
# Create the table
session.execute("""
    CREATE TABLE IF NOT EXISTS students (
        department TEXT,
        student_id UUID,
        name TEXT,
        age INT,
        email TEXT,
        gpa FLOAT,
        enrolled_date DATE,
        PRIMARY KEY (department, student_id)
    )
""")
# Prepare the insert statement once
insert_stmt = session.prepare("""
    INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
    VALUES (?, ?, ?, ?, ?, ?, ?)
""")
# Execute the prepared statement once per row (sequential writes, not a CQL BATCH)
session.execute(insert_stmt, ['Computer Science', uuid.uuid4(), '张三', 21, 'zhangsan@example.com', 3.8, '2024-09-01'])
session.execute(insert_stmt, ['Mathematics', uuid.uuid4(), '王五', 20, 'wangwu@example.com', 3.9, '2024-09-01'])
# Query by partition key
rows = session.execute("SELECT * FROM students WHERE department = 'Computer Science'")
for row in rows:
    print(f"{row.name} - GPA: {row.gpa}")
cluster.shutdown()
department | student_id | age | email | enrolled_date | gpa | name
------------------+--------------------------------------+-----+-----------------------+---------------+-----+------
Computer Science | 3b8a1f4e-... | 21 | zhangsan@example.com | 2024-09-01 | 3.8 | 张三
Computer Science | 7c2d6f9a-... | 22 | lisi@example.com | 2024-09-01 | 3.5 | 李四
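Traditional relational databases such as MySQL hit a single-node bottleneck: read and write load concentrates on the primary. Cassandra is decentralized instead: all nodes are peers, with no primary/replica distinction. In short, Cassandra combines the wide-column model of Google BigTable with the distributed hashing of Amazon Dynamo.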
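The "distributed hashing" half of that equation can be sketched as a small token ring. This is a simplified illustration only: `TokenRing` is a hypothetical class, MD5 stands in for the Murmur3 hash used by Cassandra's default partitioner, and the replica walk ignores racks and data centers.

```python
import bisect
import hashlib


class TokenRing:
    """Toy consistent-hash ring: every node owns several tokens on the ring."""

    def __init__(self, nodes, vnodes=8):
        # Each node gets `vnodes` pseudo-random positions (tokens) on the ring.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(key):
        # MD5 is a stand-in here; Cassandra's default partitioner uses Murmur3.
        return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

    def replicas(self, partition_key, replication_factor=1):
        """Walk clockwise from the key's token and collect distinct nodes."""
        start = bisect.bisect_left(self._tokens, self._token(partition_key))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replication_factor:
                break
        return owners


ring = TokenRing(["node-a", "node-b", "node-c"])
print(ring.replicas("Computer Science", replication_factor=2))
```

Any client that knows the ring can compute the owners of a key locally. This is why no coordinating primary is needed to decide where a row lives.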
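| Concept | SQL analogue | Description |
|---|---|---|
| Keyspace | Database | Data container; defines the replication strategy |
| Table | Table | Column family; the actual storage structure |
| Partition Key | No direct analogue | Determines which node stores the data |
| Clustering Key | ORDER BY column | Sort order within a partition |
| TTL | No direct analogue | Time after which data expires automatically |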
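Cassandra data modeling is not about normalization; it is query-driven.
❌ SQL mindset: design the entity-relationship diagram first
✅ CQL mindset: list every query first, then derive the table structure from the queries
Rules:
An IoT platform needs to store one temperature reading per device per minute and must support: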
CREATE KEYSPACE iot_data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE iot_data;
CREATE TABLE sensor_readings (
    device_id UUID,
    date TEXT, -- '2024-09-01'
    timestamp TIMESTAMP,
    temperature DOUBLE,
    humidity DOUBLE,
    status TEXT,
    PRIMARY KEY ((device_id, date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
AND default_time_to_live = 2592000; -- rows expire automatically after 30 days
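Design analysis:
(device_id, date) is a composite partition key: all readings from one device on one day live in the same partition
timestamp DESC: the newest readings are returned first
from cassandra.cluster import Cluster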
import uuid, random, time
from datetime import datetime, timezone
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot_data')
insert_stmt = session.prepare("""
    INSERT INTO sensor_readings (device_id, date, timestamp, temperature, humidity, status)
    VALUES (?, ?, ?, ?, ?, ?)
""")
device_id = uuid.uuid4()
for i in range(100):
    # Use an explicit UTC time: the driver interprets naive datetimes as UTC
    now = datetime.now(timezone.utc)
    date = now.strftime('%Y-%m-%d')
    session.execute(insert_stmt, [
        device_id,
        date,
        now,
        round(random.uniform(20, 30), 2),
        round(random.uniform(40, 60), 2),
        'normal'
    ])
    time.sleep(0.1)
print("Finished writing 100 rows")
-- Latest 10 readings for one device
SELECT * FROM sensor_readings
WHERE device_id = <UUID> AND date = '2024-09-01'
LIMIT 10;
-- Range query (timestamp is a clustering key, so ranges are supported)
SELECT * FROM sensor_readings
WHERE device_id = <UUID>
AND date = '2024-09-01'
AND timestamp > '2024-09-01T08:00:00'
AND timestamp < '2024-09-01T18:00:00';
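-- Cross-date query (has to read more than one partition)
-- Data for one device over the last 3 days (run 3 queries, or use IN, although IN across partitions is not recommended)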
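The "one query per day" option can be sketched on the client side as follows. This is an illustrative helper rather than a driver feature: `recent_buckets` and `fetch_recent` are hypothetical names, and `session` is assumed to be any object with an `execute(statement, parameters)` method, such as a cassandra-driver session.

```python
from datetime import date, timedelta


def recent_buckets(days, today=None):
    """Return the date-bucket strings for the last `days` days, newest first."""
    today = today or date.today()
    return [(today - timedelta(days=n)).isoformat() for n in range(days)]


def fetch_recent(session, statement, device_id, days, today=None):
    """Run one single-partition query per day bucket and concatenate the rows."""
    rows = []
    for bucket in recent_buckets(days, today):
        # Each call touches exactly one (device_id, date) partition.
        rows.extend(session.execute(statement, [device_id, bucket]))
    return rows
```

With a prepared statement such as `session.prepare("SELECT * FROM sensor_readings WHERE device_id = ? AND date = ?")`, calling `fetch_recent(session, stmt, device_id, 3)` issues three single-partition reads. Because buckets are visited newest first and each partition is clustered by timestamp DESC, the combined rows stay in newest-first order. The bucket strings must be derived the same way the writer derives the date column, so pass `today` explicitly if the writer uses a different time zone than the local one.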