Add examples for evaluators
KarolinaMiq authored and KaQuMiQ committed Jul 19, 2024
1 parent d15e145 commit f6f90e7
Showing 8 changed files with 494 additions and 70 deletions.
83 changes: 69 additions & 14 deletions src/draive/evaluators/text_coherence.py
@@ -20,33 +20,35 @@ class CoherenceScore(DataModel):
Keep this document open while reviewing, and refer to it as needed.
Evaluation Criteria:
- Coherence (1-5) - the collective quality of all sentences.
+ Coherence (0.0-4.0) - the collective quality of all sentences.
We align this dimension with the DUC (Document Understanding Conference) quality question of \
structure and coherence, whereby the text should be well-structured and well-organized.
The compared text should not just be a heap of related information, but should build from sentence
to sentence into a coherent body of information about a topic.
Rating Scale:
- 1: Very low coherence - the text is chaotic, lacking logical connections between sentences.
- 2: Low coherence - some connections are visible, but the overall structure is weak.
- 3: Moderate coherence - the text has a noticeable structure, but with some shortcomings.
- 4: Good coherence - the text is well-organized with minor imperfections.
- 5: Excellent coherence - the text is exemplarily structured, with smooth transitions between ideas.
+ 0.0: Very low coherence - the text is chaotic, lacking logical connections between sentences.
+ 1.0: Low coherence - some connections are visible, but the overall structure is weak.
+ 2.0: Moderate coherence - the text has a noticeable structure, but with some shortcomings.
+ 3.0: Good coherence - the text is well-organized with minor imperfections.
+ 4.0: Excellent coherence - the text is exemplarily structured, with smooth transitions \
+ between ideas.
Evaluation Steps:
1. Read the reference text carefully and identify the main topic and key points.
2. Read the compared text and compare it to the reference text.
Check if the compared text covers the main topic and key points of the reference text, \
and if it presents them in a clear and logical order.
- 3. Assign a coherence score from 1 to 5 based on the provided criteria.
+ 3. Assign a coherence score from 0.0 to 4.0 based on the provided criteria.
+ Important: The score must be a decimal number from 0.0 to 4.0. 4.0 is the maximum, \
+ do not exceed this value.
"""

- INPUT: str = """\
- Reference text:
- {reference}
+ INPUT: str = """
+ Reference text: {reference}
- Compared text:
- {compared}
+ Compared text: {compared}
"""


@@ -60,6 +62,59 @@ async def text_coherence_evaluator(
CoherenceScore,
instruction=INSTRUCTION,
input=INPUT.format(reference=reference, compared=compared),
+ examples=[
+ (
+ INPUT.format(
+ reference=(
+ "Solar energy is a renewable energy source that is gaining popularity. "
+ "Solar panels convert sunlight into electricity. "
+ "This technology is environmentally friendly and can reduce electricity "
+ "bills. However, installing solar panels requires an initial investment "
+ "and is dependent on weather conditions."
+ ),
+ compared=(
+ "Solar panels are on roofs. Energy is important. "
+ "The sun shines brightly. Electricity bills can be high. "
+ "Technology is developing fast. People like to save money."
+ ),
+ ),
+ CoherenceScore(score=0.0),
+ ),
+ (
+ INPUT.format(
+ reference=(
+ "Coffee is a popular beverage worldwide. "
+ "It's made from roasted coffee beans. Caffeine in coffee "
+ "can boost energy and alertness. However, excessive consumption may "
+ "lead to sleep issues."
+ ),
+ compared=(
+ "Coffee is drunk by many people. It comes from beans that are roasted. "
+ "Caffeine makes you feel more awake. "
+ "Drinking too much coffee might make it hard to sleep. "
+ "Some people add milk or sugar to their coffee."
+ ),
+ ),
+ CoherenceScore(score=2.0),
+ ),
+ (
+ INPUT.format(
+ reference=(
+ "Honey is a natural sweetener produced by bees. "
+ "It has antibacterial properties and is rich in antioxidants. "
+ "People use honey in cooking, as a spread, and for medicinal "
+ "purposes. However, it's high in calories and should be consumed "
+ "in moderation."
+ ),
+ compared=(
+ "Bees create honey, a natural sweetener with multiple benefits. "
+ "Its antibacterial and antioxidant-rich composition makes it valuable "
+ "for culinary, nutritional, and medicinal uses. While versatile, "
+ "honey's high caloric content necessitates mindful consumption."
+ ),
+ ),
+ CoherenceScore(score=4.0),
+ ),
+ ],
)

- return model.score / 5
+ return model.score / 4
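
The switch from `model.score / 5` to `model.score / 4` follows from the new scale: with ratings in 0.0-4.0, dividing by the maximum maps the result onto 0.0-1.0, whereas a 1-5 rating divided by 5 could never fall below 0.2. A minimal sketch of that normalization is shown below. The helper `normalize_score` and its clamping step are assumptions for illustration and are not part of this commit, which only performs the division.

```python
MAX_SCORE: float = 4.0  # upper bound of the rating scale stated in the prompt


def normalize_score(raw: float, maximum: float = MAX_SCORE) -> float:
    """Map a raw 0.0-maximum rating onto the 0.0-1.0 range.

    Hypothetical helper for illustration; the commit itself only divides by 4.
    """
    # Clamp first, in case the model ignores the "do not exceed" instruction.
    clamped = min(max(raw, 0.0), maximum)
    return clamped / maximum
```

Clamping is one possible policy; rejecting out-of-range values with an error would be an equally reasonable choice.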
86 changes: 77 additions & 9 deletions src/draive/evaluators/text_conciseness.py
@@ -20,25 +20,34 @@ class ConcisenessScore(DataModel):
Keep this document open while reviewing, and refer to it as needed.
Evaluation Criteria:
- Conciseness (1-5) - the extent to which the compared text is brief and to the point \
+ Conciseness (0.0-4.0) - the extent to which the compared text is brief and to the point \
while still covering all key information.
A concise compared text avoids unnecessary details and repetition.
Annotators should penalize compared texts that are overly verbose or include irrelevant information.
Rating Scale:
- 1: Very low conciseness - the text is excessively verbose with much irrelevant information.
- 2: Low conciseness - the text contains unnecessary details and some irrelevant information.
- 3: Moderate conciseness - the text is somewhat concise but could be more focused.
- 4: Good conciseness - the text is mostly concise with minimal unnecessary information.
- 5: Excellent conciseness - the text is highly concise, containing only essential information.
+ 0.0: Very low conciseness - the text is excessively verbose with much irrelevant information.
+ 1.0: Low conciseness - the text contains unnecessary details and some irrelevant information.
+ 2.0: Moderate conciseness - the text is somewhat concise but could be more focused.
+ 3.0: Good conciseness - the text is mostly concise with minimal unnecessary information.
+ 4.0: Excellent conciseness - the text is highly concise, containing only essential information.
Evaluation Steps:
1. Read the derived text and the reference text carefully.
2. Compare the compared text to the reference text and identify the main \
points of the reference text.
3. Assess how well the compared text covers the main points of the reference text, \
and how much irrelevant or redundant information it contains.
- 4. Assign a conciseness score from 1 to 5 based on the provided criteria.
+ 4. Assign a conciseness score from 0.0 to 4.0 based on the provided criteria.
+ Important: The score must be a decimal number from 0.0 to 4.0. 4.0 is the maximum, \
+ do not exceed this value.
"""

+ INPUT: str = """
+ Reference text: {reference}
+ Compared text: {compared}
+ """


@@ -52,6 +61,65 @@ async def text_conciseness_evaluator(
ConcisenessScore,
instruction=INSTRUCTION,
input=f"Reference text: {reference}\n\nCompared text: {compared}",
+ examples=[
+ (
+ INPUT.format(
+ reference=(
+ "Solar energy is a renewable energy source that is gaining popularity. "
+ "Solar panels convert sunlight into electricity. "
+ "This technology is environmentally friendly and can reduce electricity "
+ "bills. However, installing solar panels requires an initial investment and "
+ "is dependent on weather conditions."
+ ),
+ compared=(
+ "Did you know that solar energy is becoming super popular these days? "
+ "It's this amazing, eco-friendly way to make electricity using "
+ "the sun's rays. People are getting really excited about it! Basically, "
+ "you put these special panels on your roof, and they soak up the sunlight "
+ "like a sponge. Then, through some pretty cool science stuff, "
+ "they turn that sunlight into electricity you can use in your house. "
+ "It's pretty neat, right? And get this - it can actually help you save "
+ "money on your electricity bills in the long run. But here's the thing: "
+ "you've got to shell out some cash upfront to get those panels installed. "
+ "It's kind of like buying a fancy coffee machine - costs a bit at first, "
+ "but then you save on all those coffee shop visits."
+ ),
+ ),
+ ConcisenessScore(score=0.0),
+ ),
+ (
+ INPUT.format(
+ reference=(
+ "Coffee is a popular beverage worldwide. "
+ "It's made from roasted coffee beans. Caffeine in coffee "
+ "can boost energy and alertness. However, excessive consumption may "
+ "lead to sleep issues."
+ ),
+ compared=(
+ "Coffee is a widely consumed beverage made from roasted coffee beans. "
+ "It contains caffeine, which can enhance energy and alertness. However, "
+ "drinking too much coffee may cause sleep problems. "
+ "People enjoy coffee for its taste and stimulating effects, but it's "
+ "important to consume it in moderation."
+ ),
+ ),
+ ConcisenessScore(score=2.0),
+ ),
+ (
+ INPUT.format(
+ reference=(
+ "The water cycle, also known as the hydrologic cycle, "
+ "describes the continuous movement of water within the Earth and "
+ "atmosphere. It involves processes such as evaporation, condensation, "
+ "precipitation, and runoff."
+ ),
+ compared=(
+ "The water cycle is the continuous movement of water on Earth. "
+ "It includes evaporation, condensation, precipitation, and runoff."
+ ),
+ ),
+ ConcisenessScore(score=4.0),
+ ),
+ ],
)

- return model.score / 5
+ return model.score / 4
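
Each entry in `examples` pairs a rendered prompt with the score the model is expected to return for it. The sketch below shows one way such pairs can be assembled. It assumes a plain dataclass `Score` as a stand-in for the `DataModel` score classes and a hypothetical `make_example` helper; neither appears in the commit, and the blank line inside the template is likewise an assumption.

```python
from dataclasses import dataclass

# Template with the same two placeholders as INPUT in the diff.
INPUT: str = """
Reference text: {reference}

Compared text: {compared}
"""


@dataclass(frozen=True)
class Score:
    """Stand-in for the DataModel score class; not the draive type."""

    score: float


def make_example(reference: str, compared: str, score: float) -> tuple[str, Score]:
    """Build one (rendered input, expected output) few-shot pair."""
    rendered = INPUT.format(reference=reference, compared=compared)
    return (rendered, Score(score=score))


# One pair, reusing the water cycle texts from the diff above.
examples = [
    make_example(
        reference=(
            "The water cycle, also known as the hydrologic cycle, "
            "describes the continuous movement of water within the Earth and "
            "atmosphere. It involves processes such as evaporation, condensation, "
            "precipitation, and runoff."
        ),
        compared=(
            "The water cycle is the continuous movement of water on Earth. "
            "It includes evaporation, condensation, precipitation, and runoff."
        ),
        score=4.0,
    ),
]
```

Keeping the examples at the low, middle and high ends of the scale, as the commit does with 0.0, 2.0 and 4.0, gives the model anchors across the whole range.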
76 changes: 67 additions & 9 deletions src/draive/evaluators/text_consistency.py
@@ -20,19 +20,19 @@ class ConsistencyScore(DataModel):
Keep this document open while reviewing, and refer to it as needed.
Evaluation Criteria:
- Consistency(1-5) - the factual alignment between the reference text and the compared text.
+ Consistency(0.0-4.0) - the factual alignment between the reference text and the compared text.
A factually consistent compared text contains only statements that are entailed \
by the reference text.
Annotators should penalize compared texts that contain hallucinated facts.
Rating Scale:
- 1: Very low consistency - the text contains multiple hallucinated facts \
+ 0.0: Very low consistency - the text contains multiple hallucinated facts \
or significant misalignments with the reference text.
- 2: Low consistency - the text has several instances of information not supported by \
+ 1.0: Low consistency - the text has several instances of information not supported by \
the reference text.
- 3: Moderate consistency - the text is mostly consistent but contains a few unsupported statements.
- 4: Good consistency - the text is largely consistent with minor discrepancies.
- 5: Excellent consistency - the text is fully consistent with the reference text, \
+ 2.0: Moderate consistency - the text is mostly consistent but contains a few unsupported statements.
+ 3.0: Good consistency - the text is largely consistent with minor discrepancies.
+ 4.0: Excellent consistency - the text is fully consistent with the reference text, \
containing only supported information.
Evaluation Steps:
@@ -41,7 +41,16 @@ class ConsistencyScore(DataModel):
of the reference text.
3. Assess how well the compared text covers the main points of the reference text \
and how much irrelevant or redundant information it contains.
- 4. Assign a consistency score from 1 to 5 based on the provided criteria.
+ 4. Assign a consistency score from 0.0 to 4.0 based on the provided criteria.
+ Important: The score must be a decimal number from 0.0 to 4.0. 4.0 is the maximum, \
+ do not exceed this value.
"""

+ INPUT: str = """
+ Reference text: {reference}
+ Compared text: {compared}
+ """


@@ -55,6 +64,55 @@ async def text_consistency_evaluator(
ConsistencyScore,
instruction=INSTRUCTION,
input=f"Reference text: {reference}\n\nCompared text: {compared}",
+ examples=[
+ (
+ INPUT.format(
+ reference=(
+ "Dolphins are intelligent marine mammals. They use echolocation "
+ "to navigate and hunt. Dolphins live in social groups called pods."
+ ),
+ compared=(
+ "Dolphins are smart fish that can fly short distances. They use sonar "
+ "to talk to whales. Dolphins live in families and go to school "
+ "to learn hunting techniques."
+ ),
+ ),
+ ConsistencyScore(score=0.0),
+ ),
+ (
+ INPUT.format(
+ reference=(
+ "Coffee is a popular beverage worldwide. "
+ "It's made from roasted coffee beans. Caffeine in coffee "
+ "can boost energy and alertness. However, excessive consumption may "
+ "lead to sleep issues."
+ ),
+ compared=(
+ "Coffee is a widely consumed drink around the world. It's produced "
+ "by roasting coffee beans. The caffeine in coffee can increase energy "
+ "levels and improve alertness. However, drinking too much coffee might "
+ "cause sleep problems. Coffee is also known to improve memory and reduce "
+ "the risk of certain diseases."
+ ),
+ ),
+ ConsistencyScore(score=2.0),
+ ),
+ (
+ INPUT.format(
+ reference=(
+ "Photosynthesis is the process by which plants use sunlight to "
+ "produce energy. It requires water, carbon dioxide, and chlorophyll. "
+ "Oxygen is released as a byproduct of photosynthesis."
+ ),
+ compared=(
+ "Plants carry out photosynthesis to create energy from sunlight. "
+ "This process needs water, carbon dioxide, and the green pigment "
+ "chlorophyll. As plants photosynthesize, "
+ "they release oxygen into the environment."
+ ),
+ ),
+ ConsistencyScore(score=4.0),
+ ),
+ ],
)

- return model.score / 5
+ return model.score / 4
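
In this file the few-shot examples are rendered through `INPUT.format(...)`, while the `input=` argument is still built with an inline f-string. The sketch below compares the two renderings. It assumes the template has a blank line between its two fields (the page rendering drops empty lines, so this cannot be confirmed from the diff), and the function names are hypothetical.

```python
# Template modeled on the diff; the blank line between the two fields is an
# assumption, since the page rendering drops empty lines.
INPUT: str = """
Reference text: {reference}

Compared text: {compared}
"""


def render_with_template(reference: str, compared: str) -> str:
    """Render the way the few-shot examples are built."""
    return INPUT.format(reference=reference, compared=compared)


def render_inline(reference: str, compared: str) -> str:
    """Render the way the evaluator's `input=` argument is built."""
    return f"Reference text: {reference}\n\nCompared text: {compared}"


templated = render_with_template("A", "B")
inline = render_inline("A", "B")

# Under the stated assumption, the two differ only in surrounding newlines.
same_after_strip = templated.strip() == inline
```

If that holds for the real template, passing `INPUT.format(reference=reference, compared=compared)` as `input=`, as the coherence evaluator does, would keep the live prompt and the examples in exactly the same shape.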