Apache Cassandra

Tech Stack

Database

nosql, wide-column, distributed, high-availability, big-data

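Overview

Apache Cassandra is an open-source, distributed, wide-column NoSQL database designed to handle large volumes of data. It uses a masterless architecture with no single point of failure, and it supports high availability across multiple data centers along with linear scalability. It suits write-heavy workloads such as time-series data, IoT, and messaging systems. Queries are written in CQL, a SQL-like query language. Cassandra was originally developed at Facebook and later donated to the Apache Software Foundation.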

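Installation

Apache Cassandra Installation Guide

1. Prerequisites

Requirement	Description
Operating system	Linux (Ubuntu 20.04+ / CentOS 8+ recommended), macOS, Windows (via Docker)
Java runtime	OpenJDK 11 or 17 (Java 17 requires Cassandra 5.0 or later; the 4.1 packages used below run on Java 8 or 11)
Memory	Minimum 4 GB; 8 GB+ recommended
Disk	SSD recommended; at least 10 GB of free space
Python	Python 3.6+ (required by the cqlsh command-line tool)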

2. Installation Commands

Linux (Debian/Ubuntu)

# Install Java
sudo apt update && sudo apt install -y openjdk-11-jdk

# Add the Cassandra repository (41x selects the 4.1 release series)
echo "deb https://debian.cassandra.apache.org 41x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt update

# Install Cassandra
sudo apt install -y cassandra

# Start the service and enable it at boot
sudo systemctl start cassandra
sudo systemctl enable cassandra

# Check node status, then open a CQL shell
nodetool status
cqlsh

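Docker (recommended for beginners)

# Pull the image and start a single node
docker run --name cassandra -p 9042:9042 -d cassandra:latest

# Wait for startup to finish (the node should report UN, i.e. Up/Normal)
docker exec cassandra nodetool status

# Open cqlsh
docker exec -it cassandra cqlsh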

macOS

brew install cassandra
brew services start cassandra
cqlsh

Install from the binary tarball

wget https://dlcdn.apache.org/cassandra/4.1.5/apache-cassandra-4.1.5-bin.tar.gz
tar -xzf apache-cassandra-4.1.5-bin.tar.gz
cd apache-cassandra-4.1.5
bin/cassandra -f    # run in the foreground

3. Common Installation Problems

Q1: Startup fails with "Unable to lock JVM memory"

Cassandra needs permission to lock memory. To fix it:

# Add to /etc/security/limits.conf
cassandra  -  memlock  unlimited
cassandra  -  nofile   100000

Q2: cqlsh connection refused

Confirm that Cassandra is listening. Check these settings in conf/cassandra.yaml:

listen_address: localhost
rpc_address: localhost
native_transport_port: 9042

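Q3: Port conflict on 7000 or 9042

Change storage_port (default 7000) and native_transport_port (default 9042) in conf/cassandra.yaml.

Q4: cqlsh is unavailable right after the Docker container starts

A Cassandra node takes roughly 30-60 seconds to initialize. Wait, then retry:

docker logs -f cassandra  # follow startup progress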
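
The wait-and-retry advice above can also be scripted. The following is a minimal sketch using only the Python standard library: it polls the native transport port until it accepts a TCP connection. The function name `wait_for_port` and its parameters are hypothetical, and an open port only shows that the listener is up, not that the node has finished every startup task.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 90.0, interval_s: float = 2.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: give up after the deadline, otherwise retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Calling `wait_for_port("127.0.0.1", 9042)` before launching cqlsh would block until the container is reachable or 90 seconds have passed.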

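Example

Cassandra Hello World: Keyspace and Table Operations

Goal

Create your first Cassandra keyspace, create a table, and run CRUD operations to see where CQL resembles SQL and where it differs.

Complete Code

Run the following statements in cqlsh, in order: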

-- 1. Create a keyspace (specifying the replication strategy)
CREATE KEYSPACE university
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};

-- 2. Switch to the keyspace
USE university;

-- 3. Create a table (the PRIMARY KEY is a partition key plus a clustering key)
CREATE TABLE students (
    department TEXT,
    student_id UUID,
    name TEXT,
    age INT,
    email TEXT,
    gpa FLOAT,
    enrolled_date DATE,
    PRIMARY KEY (department, student_id)
);

-- 4. Insert data (the full primary key is required)
INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
VALUES ('Computer Science', uuid(), '张三', 21, 'zhangsan@example.com', 3.8, '2024-09-01');

INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
VALUES ('Computer Science', uuid(), '李四', 22, 'lisi@example.com', 3.5, '2024-09-01');

INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
VALUES ('Mathematics', uuid(), '王五', 20, 'wangwu@example.com', 3.9, '2024-09-01');

-- 5. Query (lookups by partition key are the most efficient)
SELECT * FROM students WHERE department = 'Computer Science';

-- 6. Update (in effect also an insert, because Cassandra writes are upserts)
UPDATE students SET gpa = 3.9 WHERE department = 'Computer Science' AND student_id = <some-UUID>;

-- 7. Delete
DELETE FROM students WHERE department = 'Computer Science' AND student_id = <some-UUID>;

-- 8. ALLOW FILTERING (scans the whole table; use with great care in production!)
SELECT * FROM students WHERE gpa > 3.5 ALLOW FILTERING;

Python driver version

# pip install cassandra-driver
from cassandra.cluster import Cluster
import uuid

# Connect to the cluster
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()

# Create the keyspace
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS university
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('university')

# Create the table
session.execute("""
    CREATE TABLE IF NOT EXISTS students (
        department TEXT,
        student_id UUID,
        name TEXT,
        age INT,
        email TEXT,
        gpa FLOAT,
        enrolled_date DATE,
        PRIMARY KEY (department, student_id)
    )
""")

# Prepare the insert statement
insert_stmt = session.prepare("""
    INSERT INTO students (department, student_id, name, age, email, gpa, enrolled_date)
    VALUES (?, ?, ?, ?, ?, ?, ?)
""")

# Execute the prepared statement once per row (individual writes, not a CQL batch)
session.execute(insert_stmt, ['Computer Science', uuid.uuid4(), '张三', 21, 'zhangsan@example.com', 3.8, '2024-09-01'])
session.execute(insert_stmt, ['Mathematics', uuid.uuid4(), '王五', 20, 'wangwu@example.com', 3.9, '2024-09-01'])

# Query
rows = session.execute("SELECT * FROM students WHERE department = 'Computer Science'")
for row in rows:
    # FLOAT is 32-bit, so format to one decimal to hide rounding noise
    print(f"{row.name} - GPA: {row.gpa:.1f}")

cluster.shutdown()

Expected output (from the step 5 query in cqlsh; UUIDs and row order will differ)

 department       | student_id                           | age | email                 | enrolled_date | gpa | name
------------------+--------------------------------------+-----+-----------------------+---------------+-----+------
 Computer Science | 3b8a1f4e-...                        |  21 | zhangsan@example.com  |    2024-09-01 | 3.8 |   张三
 Computer Science | 7c2d6f9a-...                        |  22 | lisi@example.com      |    2024-09-01 | 3.5 |   李四

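Key Points

Cassandra's PRIMARY KEY = Partition Key + Clustering Key
A WHERE clause must restrict the partition key; otherwise the query needs ALLOW FILTERING (avoid it in production)
Writes have upsert semantics (update the row if it exists, insert it if it does not)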
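
A toy in-memory model can make the first and third points concrete. It is not how Cassandra stores data (there is no replication, no on-disk format, no write timestamps); `ToyTable` is a hypothetical class that only mimics the addressing rule: a row is located by partition key, then by clustering key, and a write to an existing key overwrites the columns it names.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in: partition key -> clustering key -> row columns."""

    def __init__(self):
        self.partitions = defaultdict(dict)

    def write(self, partition_key, clustering_key, **columns):
        # INSERT and UPDATE behave alike: create the row or overwrite the given columns.
        row = self.partitions[partition_key].setdefault(clustering_key, {})
        row.update(columns)

    def read_partition(self, partition_key):
        # Rows come back sorted by clustering key, as they are within a partition.
        rows = self.partitions.get(partition_key, {})
        return [(ck, rows[ck]) for ck in sorted(rows)]


table = ToyTable()
table.write("Computer Science", 1, name="张三", gpa=3.8)
table.write("Computer Science", 1, gpa=3.9)   # same primary key: only gpa is overwritten
table.write("Computer Science", 2, name="李四", gpa=3.5)
print(table.read_partition("Computer Science"))
```

The second `write` does not fail and does not create a duplicate; it silently replaces `gpa`, which is the behavior to keep in mind when an INSERT reuses an existing primary key.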

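Tutorial

Apache Cassandra from Zero to Practice

1. Background and Concepts

1.1 Why Cassandra?

A traditional relational database such as MySQL has a single-point bottleneck: read and write load concentrates on the primary. Cassandra is decentralized instead: every node is a peer, with no primary/replica distinction. Its design combines the wide-column data model of Google Bigtable with the distributed hashing of Amazon Dynamo.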

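1.2 Core Concepts at a Glance

Concept	SQL analogue	Description
Keyspace	Database	Data container; defines the replication strategy
Table	Table	Formerly called a column family; the actual storage structure
Partition Key	No direct analogue	Determines which node stores the data
Clustering Key	ORDER BY column	Sort order within a partition
TTL	No direct analogue	Time after which data expires automatically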
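
To illustrate how a partition key decides where a row lives, here is a minimal consistent-hashing sketch. It rests on several assumptions: SHA-256 from the standard library stands in for the hash (Cassandra's default partitioner is Murmur3), each node holds a single token, and the node names are hypothetical. Real clusters use virtual nodes and replicate each partition to several nodes.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a 64-bit token (SHA-256 stand-in, not Cassandra's Murmur3)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    def __init__(self, nodes):
        # Each node owns the ring segment that ends at its own token.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def owner(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around the ring.
        i = bisect.bisect_left(self.tokens, token(partition_key))
        return self.ring[i % len(self.ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
for dept in ("Computer Science", "Mathematics"):
    print(dept, "->", ring.owner(dept))
```

Because the owner depends only on the hash of the partition key, every row with the same key lands on the same node, which is why a query that supplies the partition key can be routed directly.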

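1.3 Data Modeling Principles

Cassandra modeling is query-driven rather than normalized.

❌ SQL mindset: design the entity-relationship diagram first
✅ CQL mindset: list every query first, then derive the table structures from them

Rules:

One query = one table (data duplication is acceptable)
WHERE conditions must restrict the Partition Key
No JOINs (denormalize and pre-aggregate the data instead)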
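
As a sketch of the "one query = one table" rule, the snippet below fans a single logical write out to two denormalized tables, one per query. The table names `students_by_department` and `students_by_email` and the helper `statements_for` are hypothetical; the function only builds CQL strings and parameter tuples, in the `%s` placeholder style that the Python driver accepts for non-prepared statements.

```python
import uuid


def statements_for(student: dict) -> list:
    """Return one (cql, params) pair per query-specific table."""
    return [
        # Serves the query "list all students in a department"
        (
            "INSERT INTO students_by_department (department, student_id, name, email) "
            "VALUES (%s, %s, %s, %s)",
            (student["department"], student["student_id"], student["name"], student["email"]),
        ),
        # Serves the query "find a student by email"
        (
            "INSERT INTO students_by_email (email, student_id, name, department) "
            "VALUES (%s, %s, %s, %s)",
            (student["email"], student["student_id"], student["name"], student["department"]),
        ),
    ]


student = {
    "department": "Computer Science",
    "student_id": uuid.uuid4(),
    "name": "张三",
    "email": "zhangsan@example.com",
}
for cql, params in statements_for(student):
    print(cql, params)
```

The same data is stored twice on purpose: each table has the partition key its query filters on, so neither query needs a JOIN or ALLOW FILTERING.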

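2. Step-by-Step Practice: Building a Time-Series Sensor Data Platform

Scenario

An IoT platform stores one temperature reading per device per minute and must support:

Querying the most recent N records for a device
Querying all readings for a device on a given day
Automatically purging data older than 30 days

Step 1: Design the table schema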

CREATE KEYSPACE iot_data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE iot_data;

CREATE TABLE sensor_readings (
    device_id UUID,
    date TEXT,          -- '2024-09-01'
    timestamp TIMESTAMP,
    temperature DOUBLE,
    humidity DOUBLE,
    status TEXT,
    PRIMARY KEY ((device_id, date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
   AND default_time_to_live = 2592000; -- expire automatically after 30 days

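Design analysis:

(device_id, date) composite partition key: all readings for one device on one day share a partition
timestamp DESC: the newest readings are returned first
TTL = 30 days (30 × 86,400 = 2,592,000 seconds): old data is cleaned up automatically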
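
The numbers behind this design are easy to check. The sketch below derives the TTL value and the number of rows per partition under the scenario's one-reading-per-minute assumption. `partition_key_for` is a hypothetical helper showing how the (device_id, date) bucket could be computed from a reading's time, assuming day boundaries are taken in UTC.

```python
import uuid
from datetime import datetime, timedelta, timezone

TTL_SECONDS = int(timedelta(days=30).total_seconds())   # value used for default_time_to_live
ROWS_PER_PARTITION = 24 * 60                           # one reading per minute for one day


def partition_key_for(device_id: uuid.UUID, ts: datetime) -> tuple:
    """Return the (device_id, date) bucket a reading taken at ts falls into."""
    return (device_id, ts.astimezone(timezone.utc).strftime("%Y-%m-%d"))


device = uuid.UUID("12345678-1234-5678-1234-567812345678")
morning = datetime(2024, 9, 1, 8, 0, tzinfo=timezone.utc)
evening = datetime(2024, 9, 1, 18, 0, tzinfo=timezone.utc)
next_day = datetime(2024, 9, 2, 0, 0, tzinfo=timezone.utc)

print(TTL_SECONDS, ROWS_PER_PARTITION)
# Readings from the same day share a partition; midnight starts a new one.
print(partition_key_for(device, morning) == partition_key_for(device, evening))
print(partition_key_for(device, morning) == partition_key_for(device, next_day))
```

Bucketing by day keeps each partition bounded at 1,440 rows per device, whereas partitioning by device_id alone would let a partition grow for as long as the device keeps reporting.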

Step 2: Simulate data writes

from cassandra.cluster import Cluster
import uuid, random, time
from datetime import datetime, timezone

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot_data')

# Prepare once, then reuse the statement for every write
insert_stmt = session.prepare("""
    INSERT INTO sensor_readings (device_id, date, timestamp, temperature, humidity, status)
    VALUES (?, ?, ?, ?, ?, ?)
""")

device_id = uuid.uuid4()
for i in range(100):
    # Use UTC so the date bucket agrees with the stored timestamp
    now = datetime.now(timezone.utc)
    date = now.strftime('%Y-%m-%d')

    session.execute(insert_stmt, [
        device_id,
        date,
        now,
        round(random.uniform(20, 30), 2),   # temperature
        round(random.uniform(40, 60), 2),   # humidity
        'normal'
    ])
    time.sleep(0.1)

print("Finished writing 100 readings")
cluster.shutdown()

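Step 3: Common queries

-- Latest 10 readings for a device on a given day (a single partition)
SELECT * FROM sensor_readings 
WHERE device_id = <UUID> AND date = '2024-09-01' 
LIMIT 10;

-- Range query (timestamp is a clustering key, so range predicates are supported)
SELECT * FROM sensor_readings 
WHERE device_id = <UUID> 
  AND date = '2024-09-01'
  AND timestamp > '2024-09-01T08:00:00'
  AND timestamp < '2024-09-01T18:00:00';

-- Cross-date query (touches more than one partition)
-- To read a device's last 3 days, run 3 queries, one per date; IN also works but is not recommended across partitions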
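
One way to read the last few days is to issue one query per daily partition and concatenate the results. The sketch below assumes the `sensor_readings` table from step 1 and a `session` object with an `execute(cql, params)` method like the Python driver's. `recent_dates` and `readings_for_last_days` are hypothetical helpers, and `FakeSession` exists only so the example runs without a cluster.

```python
from datetime import date, timedelta

QUERY = "SELECT * FROM sensor_readings WHERE device_id = %s AND date = %s"


def recent_dates(today: date, days: int) -> list:
    """Date buckets for the last `days` days, newest first, formatted like the date column."""
    return [(today - timedelta(days=i)).isoformat() for i in range(days)]


def readings_for_last_days(session, device_id, today: date, days: int = 3) -> list:
    """Run one single-partition query per day and merge the rows, newest day first."""
    rows = []
    for bucket in recent_dates(today, days):
        rows.extend(session.execute(QUERY, (device_id, bucket)))
    return rows


class FakeSession:
    """Stand-in that records the parameters and returns one placeholder row per call."""

    def __init__(self):
        self.calls = []

    def execute(self, cql, params):
        self.calls.append(params)
        return [{"date": params[1]}]


fake = FakeSession()
result = readings_for_last_days(fake, "device-1", date(2024, 9, 3))
print([row["date"] for row in result])
```

Since each partition is already clustered by timestamp in descending order and the days are visited newest first, the merged list stays in newest-first order without an extra sort.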

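3. Questions for Reflection

Why does Cassandra discourage ALLOW FILTERING? In which scenarios is it acceptable?
If the number of devices surges to 1 million, how should the partition key design change? Hint: consider hot spots.
With a 30-day TTL, is data deleted exactly after 30 days, or only approximately? Why?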
