A friendly note: the Hugging Face site is blocked in mainland China, so throughout this Hugging Face series "surfboard" means getting around the firewall with a proxy or VPN.

Loading and saving datasets

Loading a dataset online

This step requires a surfboard.

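from datasets import load_dataset
# Load the train split of the SST-2 subset of the GLUE benchmark.
dataset = load_dataset(path='glue',name='sst2',split='train')
# path is the dataset name, name selects the subset (configuration), split selects which split to load.
dataset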

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 67349
})

This is the train split of the SST-2 subset of GLUE. It contains 67,349 examples, each with three fields: sentence (the text), label, and idx (the index).
We can look more closely at what the dataset contains.

type(dataset['sentence']),type(dataset['label']),type(dataset['idx'])
dataset['sentence'][0],dataset['label'][0],dataset['idx'][0]

(<class 'list'>, <class 'list'>, <class 'list'>)
('hide new secretions from the parental units ', 0, 0)

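Saving and loading a dataset locally

# Save the dataset to disk.
path = './sst2_train'  # example directory, chosen here only for illustration
dataset.save_to_disk(  # a single Dataset takes dataset_path; a DatasetDict takes dataset_dict_path
  dataset_path=path
)
# Load the dataset back from disk.
from datasets import load_from_disk
dataset = load_from_disk(path) # path points to a directory rather than a single file;
                               # the directory holds all files that make up the dataset.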

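Operations on a dataset

Sorting

The rows may start out in no particular order; they can be sorted by a column.

print(dataset['label'][:10])   # labels of the first 10 rows
sorted_dataset = dataset.sort('label')  # sort the rows by label
print(sorted_dataset['label'][:10])  # first 10 labels after sorting
print(sorted_dataset['label'][-10:]) # last 10 labels after sorting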

[0, 0, 1, 0, 0, 0, 1, 1, 0, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Shuffling

Besides sorting, the rows can also be shuffled.

shuffled_dataset = sorted_dataset.shuffle(seed=42)
shuffled_dataset['label'][:10]

[1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

Selecting rows

select() picks out rows by index.

dataset.select([0,10,20,30,40,50])

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 6
})

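Filtering

Filtering requires a rule that you define yourself.

# filter() takes a function that decides, for each row, whether to keep it.
# Here the condition is that sentence starts with "that".
def f(data):
  return data['sentence'].startswith('that')
dataset.filter(f)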

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 1033
})

Splitting

train_test_split() splits a dataset into a training set and a test set.

dataset.train_test_split(test_size=0.2) # test_size=0.2 puts 20% of the rows in the test set.

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 53879
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 13470
    })
})
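The row counts above can be checked by hand. The sketch below assumes that the test share is rounded up to a whole number of rows and that the training set receives the remainder; this is consistent with the numbers printed above, but treat it as an assumption about the library's rounding. The helper name `split_sizes` is made up for this example.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

train_rows, test_rows = split_sizes(67349, 0.2)
print(train_rows, test_rows)  # 53879 13470
```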

shard() splits the data into n parts of nearly equal size.

dataset.shard(num_shards=4,index=2) # num_shards is the number of parts; index selects which part to return.

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 16837
})

Note that shards are counted from 0.
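Since 67349 is not divisible by 4, the shards cannot all be the same size. The sketch below is plain Python and does not call the library: it shows two common ways of assigning row indices to shards, strided and contiguous. Which layout `shard()` uses depends on its `contiguous` argument and on the library version, so the helper functions here are illustrative assumptions. Both layouts give 16837 rows for index 2.

```python
def strided_shard(n_rows: int, num_shards: int, index: int) -> range:
    # Shard i takes rows i, i + num_shards, i + 2 * num_shards, ...
    return range(index, n_rows, num_shards)

def contiguous_shard(n_rows: int, num_shards: int, index: int) -> range:
    # Shard i takes one consecutive block; the first (n_rows % num_shards)
    # shards receive one extra row each.
    div, mod = divmod(n_rows, num_shards)
    start = div * index + min(index, mod)
    return range(start, start + div + (1 if index < mod else 0))

n_rows, num_shards = 67349, 4
strided_sizes = [len(strided_shard(n_rows, num_shards, i)) for i in range(num_shards)]
contiguous_sizes = [len(contiguous_shard(n_rows, num_shards, i)) for i in range(num_shards)]
print(strided_sizes)     # [16838, 16837, 16837, 16837]
print(contiguous_sizes)  # [16838, 16837, 16837, 16837]
```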

Operations on columns

rename_column() renames a column (field) of the dataset.

dataset.rename_column('sentence','text') # rename the 'sentence' column to 'text'

Dataset({
    features: ['text', 'label', 'idx'],
    num_rows: 67349
})

remove_columns() deletes columns.

dataset.remove_columns(['sentence','label']) # pass the columns to delete as a list; here 'sentence' and 'label' are removed

Dataset({
    features: ['idx'],
    num_rows: 67349
})

Mapping a function

Sometimes we want to modify every row of a dataset. map() walks over the rows and applies the change.

def f(data):
  # Add a prefix to the sentence field of one row.
  data['sentence'] = 'My sentence: ' + data['sentence']
  return data

mapped_dataset = dataset.map(f)

print(dataset['sentence'][20])
print(mapped_dataset['sentence'][20])

equals the original and in some ways even betters it 
My sentence: equals the original and in some ways even betters it 

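map() takes a function as its argument, and that function performs the modification. It can change the values themselves, as in this example where a prefix is added to the sentence field, or it can add columns, delete columns, or change the data format.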

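Speeding things up with batching

Methods such as filter() and map() run a function over the whole dataset, and batching speeds them up by reducing the number of function calls. Batching is off by default, so the function is called once per row and the number of calls equals the number of rows, which adds up quickly on a large dataset. A batched function processes the rows one batch at a time, so far fewer calls are needed.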

def f(data):
  text = data['sentence']  # in batched mode this is a list of strings
  text = ['My sentence: ' + i for i in text]
  data['sentence'] = text
  return data

mapped_dataset = dataset.map(
  function=f,
  batched=True,    # with batched=True and batch_size=1024, f receives up to 1024 rows per call,
  batch_size=1024, # which cuts the number of calls by roughly a factor of 1024 but needs more memory.
  num_proc=4  # run in 4 worker processes (not threads); this also affects speed, so tune it to your machine.
)
print(dataset['sentence'][10])
print(mapped_dataset['sentence'][10])

goes to absurd lengths 
My sentence: goes to absurd lengths 
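To make the saving concrete, the plain-Python sketch below counts function calls for the two styles on 67349 placeholder strings. It does not use the library at all; the loop only mimics what a batched map does, and the names `add_prefix` and `add_prefix_batch` are made up for the example.

```python
sentences = [f"row {i}" for i in range(67349)]  # placeholder rows, not real SST-2 text
calls = {"per_example": 0, "batched": 0}

def add_prefix(sentence):
    calls["per_example"] += 1
    return "My sentence: " + sentence

def add_prefix_batch(batch):
    calls["batched"] += 1
    return ["My sentence: " + s for s in batch]

# One call per row.
one_by_one = [add_prefix(s) for s in sentences]

# One call per slice of up to 1024 rows.
batch_size = 1024
in_batches = []
for start in range(0, len(sentences), batch_size):
    in_batches.extend(add_prefix_batch(sentences[start:start + batch_size]))

print(calls)  # {'per_example': 67349, 'batched': 66}
```

Both styles produce the same result; only the number of calls differs, 66 instead of 67349.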

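Setting the data format

set_format() changes the format in which the data is returned.

dataset.set_format(
  type='torch',  # target format; common choices are numpy, torch, tensorflow and pandas.
  columns=['label'], # the columns to convert.
  output_all_columns=True) # whether to also return the other columns; True keeps them.
dataset[20]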

{'label': tensor(1), 'sentence': 'equals the original and in some ways even betters it ', 'idx': 20}

Converting between file formats

Export the data to CSV.

dataset.to_csv(path_or_buf='./glue.csv')

Load the CSV data.

csv_dataset = load_dataset(path='csv',data_files='./glue.csv',split='train')
csv_dataset[20]

{'sentence': 'equals the original and in some ways even betters it ', 'label': 1, 'idx': 20}

Export the data to JSON and load it back.

dataset.to_json(path_or_buf='./glue.json')
json_dataset = load_dataset(path='json',data_files='./glue.json',split='train')
json_dataset[20]

{'sentence': 'equals the original and in some ways even betters it ', 'label': 1, 'idx': 20}