Transfer labels to Fonduer
In this example, we will transfer the labels we manually annotated in Label Studio to Fonduer to be used as gold labels for evaluation.
Before any labels are annotated, please ensure that the document representations in Fonduer and Label Studio are the same. Otherwise, the labels might not be transferable! See example_document_converter for further information.
Fonduer setup
How Fonduer is set up can affect whether labels can be transferred between the two systems. Fonduer therefore has to be configured so that it does not modify the documents.
Read export
First, we read the export from Label Studio. Some of the information in the export can be used later to configure our data model in Fonduer.
import os
project_name = "mails_sm"
conn_string = "postgresql://postgres:postgres@127.0.0.1:5432/"
dataset_path = "data/mails"
export_path = os.path.join(dataset_path, "export.json")
documents_path = os.path.join(dataset_path, "documents")
from LabelstudioToFonduer.to_fonduer import parse_export
export = parse_export(export_path)
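To get a feeling for what the export contains before handing it to parse_export, the file can also be inspected with the standard library. The following is a minimal sketch, not the way parse_export works internally. It assumes the JSON layout that Label Studio commonly produces for HTML labeling (tasks with an "annotations" list, each holding "result" items whose "value" carries "text" and "hypertextlabels"). The sample data and the helper name spans_by_label are made up for illustration, so check the field names against your own export.

```python
import json
from collections import defaultdict

# Hypothetical miniature export with a single task; real exports hold more fields.
SAMPLE_EXPORT = json.loads("""
[
  {
    "id": 1,
    "data": {"text": "<html>...</html>"},
    "annotations": [
      {
        "result": [
          {"type": "hypertextlabels",
           "value": {"text": "Senior Developer", "hypertextlabels": ["Title"]}},
          {"type": "hypertextlabels",
           "value": {"text": "17.11.2022", "hypertextlabels": ["Date"]}}
        ]
      }
    ]
  }
]
""")


def spans_by_label(tasks):
    """Collect the annotated span texts for every label name."""
    spans = defaultdict(list)
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                value = result.get("value", {})
                # Assumed key for HTML labeling; other setups may use "labels".
                for label in value.get("hypertextlabels", []):
                    spans[label].append(value.get("text", ""))
    return dict(spans)


labels = spans_by_label(SAMPLE_EXPORT)
print(labels)
```

For a real export, replace SAMPLE_EXPORT with the result of json.load on the file at export_path.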
Create the Fonduer project
After that, we create the project in Fonduer:
from LabelstudioToFonduer.fonduer_tools import save_create_project
save_create_project(conn_string=conn_string, project_name=project_name)
from fonduer import Meta, init_logging
init_logging(log_dir=os.path.join(dataset_path, "logs"))
session = Meta.init(conn_string + project_name).Session()
[2022-11-17 12:07:26,581][INFO] fonduer.meta:49 - Setting logging directory to: data/mails/logs/2022-11-17_12-07-26
[2022-11-17 12:07:26,582][INFO] fonduer.meta:135 - Connecting user:postgres to 127.0.0.1:5432/mails_sm
[2022-11-17 12:07:26,657][INFO] fonduer.meta:162 - Initializing the storage schema
Fonduer might read the documents with the wrong encoding, which causes errors. To avoid this, a dedicated HTMLDocPreprocessor can be used. LabelStudio_to_Fonduer provides a slightly modified HTMLDocPreprocessor named My_HTMLDocPreprocessor as a starting point.
The processor can be imported like this:
from LabelstudioToFonduer.document_processor import My_HTMLDocPreprocessor
from fonduer.parser import Parser
doc_preprocessor = My_HTMLDocPreprocessor(documents_path, max_docs=100)
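Encoding problems are easier to handle when they are found before parsing. The sketch below lists files in a directory that cannot be decoded as UTF-8. It assumes UTF-8 is the encoding you expect, which the text above does not specify, and the helper name find_non_utf8_files is made up for this example. It is independent of My_HTMLDocPreprocessor and only reads the raw bytes.

```python
import os
import tempfile


def find_non_utf8_files(directory):
    """Return the names of files in `directory` that are not valid UTF-8."""
    bad = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            raw = handle.read()
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)
    return bad


# Demo on a throwaway directory with one UTF-8 and one Latin-1 file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "ok.html"), "w", encoding="utf-8") as f:
        f.write("<p>Grüße</p>")
    with open(os.path.join(tmp, "latin1.html"), "w", encoding="latin-1") as f:
        f.write("<p>Grüße</p>")
    non_utf8 = find_non_utf8_files(tmp)

print(non_utf8)
```

Running the function on documents_path before importing would show which files may need re-encoding.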
Setup lingual parser
By default, Fonduer uses a lingual parser that splits sentences based on the spaCy split_sentences function. While this method generally performs well, it struggles with abbreviations and special punctuation.
If our labels contain punctuation or abbreviations, we need to use a modified lingual_parser.
LabelStudio_to_Fonduer comes with a modified version that splits sentences only on the . character and can handle given exceptions.
To add exceptions and use this ModifiedSpacyParser, we can use this code:
from LabelstudioToFonduer.lingual_parser import ModifiedSpacyParser
exceptions = [".NET", "Sr.", ".WEB", ".de", "Jr.", "Inc.", "Senior.", "p.", "m."]
my_parser = ModifiedSpacyParser(lang="en", split_exceptions=exceptions)
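The idea of splitting only on periods while protecting a list of exceptions can be illustrated in plain Python. This is a simplified sketch of the concept, not the implementation of ModifiedSpacyParser. The function name split_on_periods and the sample sentence are made up, and the exception lookup is a naive substring match.

```python
def split_on_periods(text, exceptions=()):
    """Split `text` after each '.' unless the period belongs to an exception."""
    protected = set()
    for exc in exceptions:
        start = text.find(exc)
        while start != -1:
            # Remember the position of every '.' inside this occurrence.
            for offset, char in enumerate(exc):
                if char == ".":
                    protected.add(start + offset)
            start = text.find(exc, start + 1)

    sentences, begin = [], 0
    for index, char in enumerate(text):
        if char == "." and index not in protected:
            sentences.append(text[begin:index + 1].strip())
            begin = index + 1
    tail = text[begin:].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]


text = "John Smith Jr. joined Acme Inc. in 2020. He writes .NET code."
naive = split_on_periods(text)
guarded = split_on_periods(text, exceptions=["Jr.", "Inc.", ".NET"])
print(naive)
print(guarded)
```

Without exceptions the sample breaks into five fragments, cutting through "Jr.", "Inc." and ".NET". With the exceptions it yields two sentences. Because the match is a plain substring search, an exception such as "Inc." would also protect the period in a word like "Zinc.", so exception lists should be kept specific.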
Import documents
Once the pipeline is set up, we can import our documents.
corpus_parser = Parser(
    session,
    lingual_parser=my_parser,
    structural=True,
    lingual=True,
    flatten=[],
)
corpus_parser.apply(doc_preprocessor, parallelism=8)
[2022-11-17 12:08:40,144][INFO] fonduer.utils.udf:67 - Running UDF...
from fonduer.parser.models import Document, Sentence
print(f"Documents: {session.query(Document).count()}")
print(f"Sentences: {session.query(Sentence).count()}")
docs = session.query(Document).order_by(Document.name).all()
Documents: 10
Sentences: 1336
Setup Fonduer data model
In this step, we create the Fonduer data model, on which the labeling functions and the later pipeline steps build. For further information, please refer to the Fonduer documentation.
As we already have some labeled data, we can derive some starting values to create the Fonduer data model. This configuration is highly dependent on the data we have.
It is worth testing the pipeline in advance to make sure Fonduer does not change any document and that all annotated spans can be detected. Therefore, we will not spend much time on labeling functions and only set up rudimentary Fonduer processing for now. Once we have ensured that the pipeline works for our data, we will come back to that.
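One simple way to test whether annotated spans survive parsing is to check that each span still occurs verbatim in the parsed text. The sketch below assumes you have the parsed text of a document as one string (for example by joining the text of its parsed sentences) and the annotated spans as a list of strings. The helper names and the sample data are made up for illustration.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace so line breaks do not cause false alarms."""
    return " ".join(text.split())


def missing_spans(document_text, annotated_spans):
    """Return the annotated spans that do not occur verbatim in the text."""
    haystack = normalize_whitespace(document_text)
    return [
        span for span in annotated_spans
        if normalize_whitespace(span) not in haystack
    ]


# Made-up example data; in practice both values come from your own project.
parsed_text = "Subject: Senior Developer\nSent: 17.11.2022"
spans = ["Senior Developer", "17.11.2022", "Team Lead"]
not_found = missing_spans(parsed_text, spans)
print(not_found)
```

A non-empty result points to spans that were altered or split during parsing, which is worth investigating before any labeling functions are written.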
from fonduer.candidates.models import mention_subclass
Title = mention_subclass("Title")
Date = mention_subclass("Date")
from fonduer.candidates import MentionNgrams
title_ngrams = MentionNgrams(n_max=export.ngrams("Title")[1] + 5, n_min=export.ngrams("Title")[0])
date_ngrams = MentionNgrams(n_max=export.ngrams("Date")[1] + 5, n_min=export.ngrams("Date")[0])
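The n-gram bounds above are derived from the labeled spans. As a rough illustration of how such bounds can be obtained, the sketch below computes the shortest and longest span length in whitespace-separated tokens. It does not claim to reproduce export.ngrams. The function name ngram_bounds and the sample spans are made up, and whitespace tokens can differ from the tokens Fonduer's parser produces, which is one plausible reason to add a safety margin to the upper bound.

```python
def ngram_bounds(spans):
    """Return (shortest, longest) span length measured in whitespace tokens."""
    lengths = [len(span.split()) for span in spans if span.strip()]
    if not lengths:
        raise ValueError("no non-empty spans given")
    return min(lengths), max(lengths)


# Made-up spans for illustration only.
title_spans = ["Developer", "Senior Software Developer", "Team Lead"]
n_min, n_max = ngram_bounds(title_spans)
margin = 5  # same safety margin as used for n_max above
print(n_min, n_max + margin)
```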
from fonduer.candidates.matchers import LambdaFunctionMatcher
title = export.lable_entitis("Title")
date = export.lable_entitis("Date")
def is_title(mention):
    # True only if the mention text matches an annotated Title span
    return mention.get_span() in title

def is_date(mention):
    # True only if the mention text matches an annotated Date span
    return mention.get_span() in date
title_matcher = LambdaFunctionMatcher(func=is_title)
date_matcher = LambdaFunctionMatcher(func=is_date)
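When more labels are added, writing one matcher function per label becomes repetitive. The sketch below builds such predicates with a small factory. It assumes the annotated spans are plain strings and that a mention exposes get_span() as in the matcher functions above. FakeMention is a hypothetical stand-in used only so the example runs without Fonduer, and make_span_matcher is a made-up helper name.

```python
class FakeMention:
    """Hypothetical stand-in that exposes only get_span(), for demonstration."""

    def __init__(self, text):
        self._text = text

    def get_span(self):
        return self._text


def make_span_matcher(allowed_spans):
    """Build a predicate that accepts mentions whose text was annotated."""
    allowed = frozenset(allowed_spans)  # constant-time membership checks

    def matcher(mention):
        return mention.get_span() in allowed

    return matcher


# Made-up spans for illustration only.
is_title_demo = make_span_matcher(["Senior Developer", "Team Lead"])
hit = is_title_demo(FakeMention("Team Lead"))
miss = is_title_demo(FakeMention("Intern"))
print(hit, miss)
```

The returned function always yields an explicit True or False, so it can be passed as func wherever a matcher predicate is expected.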
from fonduer.candidates import MentionExtractor
mention_extractor = MentionExtractor(
    session,
    [Title, Date],
    [title_ngrams, date_ngrams],
    [title_matcher, date_matcher],
)
from fonduer.candidates.models import Mention
mention_extractor.apply(docs)
num_title = session.query(Title).count()
num_date = session.query(Date).count()
print(f"Total Mentions: {session.query(Mention).count()} ({num_title} titles, {num_date} dates)")
from fonduer.candidates.models import candidate_subclass
TitleDate = candidate_subclass("TitleDate", [Title, Date])
from fonduer.candidates import CandidateExtractor
candidate_extractor = CandidateExtractor(session, [TitleDate])
candidate_extractor.apply(docs)
[2022-11-17 12:08:48,140][INFO] fonduer.candidates.mentions:467 - Clearing table: title
[2022-11-17 12:08:48,161][INFO] fonduer.candidates.mentions:467 - Clearing table: date
[2022-11-17 12:08:48,163][INFO] fonduer.utils.udf:67 - Running UDF...
Total Mentions: 24 (15 titles, 9 dates)
[2022-11-17 12:08:50,906][INFO] fonduer.candidates.candidates:138 - Clearing table title_date (split 0)
[2022-11-17 12:08:50,919][INFO] fonduer.utils.udf:67 - Running UDF...
Load gold label
To use our gold data in Fonduer, it is finally time to transfer the labels from Label Studio to Fonduer.
To do so, we create a converter instance from LabelStudioToFonduer based on our parsed export and the Fonduer session.
Then we use the is_gold function of our converter as a labeling function in the Fonduer Labeler.
from LabelstudioToFonduer.to_fonduer import ToFonduer
converter = ToFonduer(label_studio_export=export, fonduer_session=session)
from fonduer.supervision.models import GoldLabel
from fonduer.supervision import Labeler
labeler = Labeler(session, [TitleDate])
labeler.apply(
    docs=docs,
    lfs=[[converter.is_gold]],
    table=GoldLabel,
    train=True,
    parallelism=8,
)
[2022-11-17 12:08:54,374][INFO] fonduer.supervision.labeler:330 - Clearing Labels (split ALL)
/usr/local/lib/python3.7/site-packages/fonduer/supervision/labeler.py:340: SAWarning: Coercing Subquery object into a select() for use in IN(); please pass a select() construct explicitly
  query = self.session.query(table).filter(table.candidate_id.in_(sub_query))
[2022-11-17 12:08:54,379][INFO] fonduer.utils.udf:67 - Running UDF...
To check if we were successful, we can count the transferred labels.
train_cands = candidate_extractor.get_candidates()
all_gold = labeler.get_gold_labels(train_cands)
print("Gold labels found:", all_gold[0].sum(), "from", len(export.documents))
print("Documents successfully transferred:")
golds = []
for k, v in zip(all_gold[0], train_cands[0]):
    if k:
        golds.append(v)
        print(v.document.name)
Gold labels found: 9 from 9
Documents successfully transferred:
file_0
file_1
file_2
file_3
file_4
file_5
file_6
file_7
file_8
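When the counts do not match, it helps to know which documents are missing a gold label. The sketch below works on plain Python lists: one gold flag per candidate, the document name of each candidate, and the names of all documents expected from the export. How you obtain these lists from your own session is left open. The helper name untransferred_documents and the sample values are made up for illustration.

```python
def untransferred_documents(gold_flags, candidate_doc_names, expected_doc_names):
    """Return expected documents that have no candidate flagged as gold."""
    transferred = {
        name
        for flag, name in zip(gold_flags, candidate_doc_names)
        if flag
    }
    return sorted(set(expected_doc_names) - transferred)


# Made-up values: four candidates spread over three documents.
flags = [1, 0, 1, 0]
cand_docs = ["file_0", "file_0", "file_1", "file_2"]
expected = ["file_0", "file_1", "file_2"]
missing = untransferred_documents(flags, cand_docs, expected)
print(missing)
```

An empty list means every expected document received at least one gold label.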
/usr/local/lib/python3.7/site-packages/fonduer/candidates/candidates.py:201: SAWarning: Coercing Subquery object into a select() for use in IN(); please pass a select() construct explicitly
  .filter(candidate_class.id.in_(sub_query))