Preparing to use my own pretraining data

Following the linked article, I am rehearsing how to use model data that has been downloaded to a local disk. As it describes, first run:

git lfs install
git clone https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2

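As an aside, I did not really know what lfs was, so I looked it up: it stands for Large File Storage, a Git extension for handling large files. The repository itself keeps only small pointer files, and the real contents are downloaded separately.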
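If git lfs was not set up before cloning, the large weight files can end up as tiny pointer files instead of the real data. Below is a minimal sketch for checking that, assuming the clone directory name used in this post. The helper names `is_lfs_pointer` and `find_lfs_pointers` are my own, not part of Git or any library; the only fact relied on is that a Git LFS pointer file begins with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like an LFS pointer, not real content."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(repo_dir):
    """List files under repo_dir that are still LFS pointers (skips .git)."""
    repo = Path(repo_dir)
    return sorted(
        p for p in repo.rglob("*")
        if p.is_file()
        and ".git" not in p.relative_to(repo).parts
        and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    # Assumed path: the directory created by the git clone above.
    for p in find_lfs_pointers("./bert-base-japanese-char-v2"):
        print("still a pointer:", p)
```

If this prints nothing, no pointer files were found. If it lists files, running `git lfs pull` inside the cloned directory should fetch the real contents.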

jupyter notebook

Launch Jupyter Notebook with the command above, then run the following in a cell:

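import copy
import torch
import numpy as np

from transformers import AutoTokenizer, AutoModel, BertJapaneseTokenizer, BertForMaskedLM

# Local directory created by the git clone above
model_path = './bert-base-japanese-char-v2'

# Generic Auto* classes: tokenizer plus the bare encoder (no task head)
bert_tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Explicit BERT classes: Japanese tokenizer plus the masked-LM head used for fill-in-the-blank
tokenizer = BertJapaneseTokenizer.from_pretrained(model_path)
bert_mlm = BertForMaskedLM.from_pretrained(model_path)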

Running this code prints the following warning:

Some weights of the model checkpoint at ./bert-base-japanese-char-v2 were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

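So a warning appears, but the data does seem to have loaded. The two unused weights belong to the next-sentence-prediction head (cls.seq_relationship), which BertForMaskedLM does not have, so this looks like the "expected" case described in the message.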
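To convince myself the warning is harmless, here is a tiny sketch that checks whether every key listed as unused belongs to the next-sentence-prediction head. The key list is copied from the warning above. The helper name `only_nsp_head_unused` is hypothetical, and the check uses only the `cls.seq_relationship.` prefix that appears in the message.

```python
# Prefix of the next-sentence-prediction (NSP) head parameters,
# taken from the key names shown in the warning.
NSP_PREFIX = "cls.seq_relationship."


def only_nsp_head_unused(unused_keys):
    """Return True if every unused checkpoint key is part of the NSP head.

    An empty list also returns True: nothing was dropped at all.
    """
    return all(key.startswith(NSP_PREFIX) for key in unused_keys)


# Keys copied from the warning message.
unused = ["cls.seq_relationship.weight", "cls.seq_relationship.bias"]
print(only_nsp_head_unused(unused))
```

If encoder keys (for example ones starting with `bert.`) showed up in that list, the check would fail, and that would be a sign the checkpoint and the model class really do not match.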
At this point, running the script from Chapter 5 (#5-7) of 「BERTによる自然言語処理入門」 (an introductory book on NLP with BERT) gives:

今日は海へ行く。
今日は街へ行く。
今日は東へ行く。
今日は南へ行く。
今日は家へ行く。
今日は北へ行く。
今日は外へ行く。
今日は西へ行く。
今日は山へ行く。
今日は空へ行く。

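The blank is filled with words that are at least not wrong. But after trying a variety of other sentences, it seems that only single-character words ever come out. The results look plausible, yet nothing longer than one character is ever chosen... so what am I doing wrong?
Looking more closely, I realized I had been using the char (character-level) model, where each token, and therefore each filled-in [MASK], is a single character. I will switch to the WordPiece version later.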
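One way to confirm this without running the model is to look at the vocabulary itself. The sketch below counts tokens by character length. It assumes the cloned directory contains a `vocab.txt` with one token per line, as is usual for BERT checkpoints. The function name `token_length_histogram` is my own.

```python
from collections import Counter


def token_length_histogram(vocab_path):
    """Count vocabulary tokens by character length.

    Special tokens such as [MASK] are skipped, and the '##' prefix that marks
    a subword continuation is stripped before measuring the length.
    """
    counts = Counter()
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            token = line.rstrip("\n")
            if not token or (token.startswith("[") and token.endswith("]")):
                continue
            if token.startswith("##") and len(token) > 2:
                token = token[2:]
            counts[len(token)] += 1
    return counts


if __name__ == "__main__":
    # Assumed location of the vocabulary file inside the cloned directory.
    hist = token_length_histogram("./bert-base-japanese-char-v2/vocab.txt")
    for length, n in sorted(hist.items()):
        print(f"{length} character(s): {n} tokens")
```

For a character-level vocabulary I would expect almost everything to land in the 1-character bucket, while a WordPiece vocabulary should show many longer tokens.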