Character-Based Neural Network Language Model in Keras


What is a Language Model

A language model predicts the next word in a sequence based on the specific words that have come before it.
It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.


1. Preparation of Data

The first step is to prepare the data so that it can be used to train the neural network model. For the dataset, we will use a simple nursery rhyme called Sing a Song of Sixpence.

Load Data

First, we need to save the poem to a .txt file (in my case it is named poem.txt) and load it into memory, assigning its contents to a raw_text variable.
After that, we have to clean the text. We will not do much to it here. Specifically, we will strip all of the new line characters so that we have one long sequence of characters separated only by white space.
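A minimal sketch of the loading and cleaning steps, assuming the file is named poem.txt (only the first verse of the rhyme is shown here; the full poem would be written the same way):

```python
# Write the poem to 'poem.txt' (normally you would create this file by hand),
# then load it into memory and strip the new line characters.
poem = """Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie."""

with open('poem.txt', 'w') as f:
    f.write(poem)

def load_doc(filename):
    # open the file as read only and return its contents
    with open(filename, 'r') as file:
        return file.read()

raw_text = load_doc('poem.txt')

# replace new lines so we have one long sequence of characters
# separated only by white space
raw_text = ' '.join(raw_text.split())
```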

Create Sequences

Now that we have one long sequence of characters with no new line characters, we can create the input-output sequences used to train the model.
Each input sequence will be 10 characters with one output character, making each sequence 11 characters long.
We can create the sequences by enumerating the characters in the text, starting at the 11th character (index 10).
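The windowing step above can be sketched as follows (a short stand-in string is used here in place of the full cleaned poem):

```python
# Stand-in for the cleaned text from the previous step.
raw_text = 'Sing a song of sixpence, A pocket full of rye.'

# Each input is 10 characters, plus 1 output character.
length = 10
sequences = []
for i in range(length, len(raw_text)):
    # select an 11-character window ending at position i
    seq = raw_text[i - length:i + 1]
    sequences.append(seq)

print('Total Sequences: %d' % len(sequences))
```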

Save Sequence

Finally, we save the prepared sequences to a file, one sequence per line. If you open the text file, the first few lines will look like this:
"Sing a song
ing a song
ng a song o
g a song of
 a song of
a song of s
 song of si
song of six
ong of sixp
ng of sixpe"
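The save step can be sketched with a small helper; a short stand-in list replaces the full set of sequences here:

```python
# Stand-in for the list of 11-character sequences built above.
sequences = ['Sing a song', 'ing a song ', 'ng a song o']

def save_doc(lines, filename):
    # join the sequences with new lines and write them to file
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

save_doc(sequences, 'char_sequences.txt')
```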

2. Train Model

The model will read encoded characters and predict the next character in the sequence. A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

Load Data

The first step is to load the prepared character sequence data from 'char_sequences.txt'.
We can use the same load_doc() function developed in the previous section. Once loaded, we split the text by new line to give a list of sequences ready to be encoded.
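A sketch of the reload step (for a self-contained example, a few sequences are written to the file first; in practice the file was created in the previous section):

```python
# For illustration only: write a few sequences to the file.
with open('char_sequences.txt', 'w') as f:
    f.write('Sing a song\ning a song \ng a song o')

def load_doc(filename):
    # open the file as read only and return its contents
    with open(filename, 'r') as file:
        return file.read()

raw_text = load_doc('char_sequences.txt')

# one sequence per line, ready to be encoded
lines = raw_text.split('\n')
```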

Encode Sequences

The sequences of characters must be encoded as integers.
This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers.
We can create the mapping given a sorted set of unique characters in the raw input data. The mapping is a dictionary of character values to integer values.
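The mapping and encoding can be sketched like this, with a small stand-in for the loaded list of lines (on the full poem the vocabulary works out to 38 characters):

```python
# Stand-in for the list of loaded sequences.
lines = ['Sing a song', 'ing a song ']

# build the mapping from a sorted set of unique characters
chars = sorted(list(set(''.join(lines))))
mapping = dict((c, i) for i, c in enumerate(chars))

# encode each sequence of characters as a sequence of integers
sequences = [[mapping[ch] for ch in line] for line in lines]

vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
```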


Split Inputs and Output

First, each encoded sequence is split into input and output: the first 10 integers are the input and the last integer is the output. Next, we need to one hot encode each character. That is, each character becomes a vector as long as the vocabulary (38 elements) with a 1 marked for the specific character. This provides a more precise input representation for the network. It also provides a clear objective for the network to predict, where a probability distribution over characters can be output by the model and compared to the ideal case of all 0 values with a 1 for the actual next character. We can use the to_categorical() function in the Keras API to one hot encode the input and output sequences.
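A sketch of the split and one hot encoding, using two stand-in encoded sequences (the import path shown is for the Keras bundled with TensorFlow; standalone Keras uses `from keras.utils import to_categorical`):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical  # path may differ with standalone Keras

# Stand-in for the encoded sequences and vocabulary size from the previous step.
vocab_size = 38
sequences = np.array([[5, 17, 22, 14, 0, 9, 0, 27, 23, 22, 14],
                      [17, 22, 14, 0, 9, 0, 27, 23, 22, 14, 0]])

# split into input (first 10 integers) and output (last integer)
X, y = sequences[:, :-1], sequences[:, -1]

# one hot encode: each integer becomes a 38-element vector
X = np.array([to_categorical(x, num_classes=vocab_size) for x in X])
y = to_categorical(y, num_classes=vocab_size)

print(X.shape, y.shape)
```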

Fit Model

The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences.
Rather than specify these numbers, we use the second and third dimensions on the X input data. This is so that if we change the length of the sequences or size of the vocabulary, we do not need to change the model definition.
The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error.
The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.
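The model definition and training can be sketched as below. Random stand-in data replaces the real X and y here so the snippet is self-contained; the tutorial itself trains for 100 epochs on the prepared sequences:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

# Stand-in data: in practice X and y come from the previous step.
vocab_size = 38
seqs = np.random.randint(0, vocab_size, size=(20, 11))
X = to_categorical(seqs[:, :-1], num_classes=vocab_size)
y = to_categorical(seqs[:, -1], num_classes=vocab_size)

# Sequence length and vocabulary size are read from X itself, so the
# definition adapts if either changes.
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()

# the tutorial uses 100 epochs; a small number is used here for speed
model.fit(X, y, epochs=5, verbose=0)
```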

3. Generate Text

We must provide sequences of 10 characters as input to the model in order to start the generation process. We will pick these manually. A given input sequence will need to be prepared in the same way as preparing the training data for the model.
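A sketch of the generation loop, assuming the trained model and the character mapping from the previous sections (the original tutorial used model.predict_classes(), which has since been removed from Keras; np.argmax over model.predict() is the equivalent):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    """Generate n_chars characters, starting from a seed string."""
    in_text = seed_text
    for _ in range(n_chars):
        # encode the current text exactly as the training data was prepared
        encoded = [mapping[ch] for ch in in_text]
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict the next character and map the integer back to a character
        yhat = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]
        for ch, index in mapping.items():
            if index == yhat:
                in_text += ch
                break
    return in_text

# usage, assuming `model` and `mapping` from the previous sections:
# print(generate_seq(model, mapping, 10, 'Sing a son', 20))
```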

Output


Summary

In this tutorial, you discovered how to develop a character-based neural language model. Specifically, you learned:
- How to prepare text for character-based language modeling.
- How to develop a character-based language model using LSTMs.
- How to use a trained character-based language model to generate text.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.
