Feature/merge works clean #1288

Closed
wants to merge 270 commits into from
Changes from 250 commits
Commits
999b950
Fix fiction filter
olovy Sep 8, 2021
447c571
Naming
olovy Sep 8, 2021
9af6957
Add more generic titles
olovy Sep 9, 2021
784ca67
Add more MARC fiction terms
olovy Sep 9, 2021
c855694
Add naive contribution linking
olovy Sep 9, 2021
5dc8c89
Fix typo
olovy Sep 9, 2021
cea11e4
Add naive contribution linking
olovy Sep 9, 2021
922fa43
Add naive contribution linking
olovy Sep 9, 2021
3a1806c
Add naive contribution linking
olovy Sep 9, 2021
1fed900
Add naive contribution linking
olovy Sep 9, 2021
5800fb1
Add naive contribution linking
olovy Sep 9, 2021
d1c535d
Add naive contribution linking
olovy Sep 9, 2021
fa23f10
Add naive contribution linking
olovy Sep 9, 2021
a17c60b
Add naive contribution linking
olovy Sep 9, 2021
ecb83c5
Add naive contribution linking
olovy Sep 9, 2021
4fbaed0
Add naive contribution linking
olovy Sep 9, 2021
080681f
Add naive contribution linking
olovy Sep 9, 2021
be56fc7
Clean up work clustering script
olovy Sep 10, 2021
d36545a
Update initial cluster search parameters
olovy Sep 10, 2021
ee68978
Merge summary
olovy Sep 10, 2021
bffbac9
New method for picking final work title
olovy Sep 10, 2021
b116fdd
New method for picking final work title
olovy Sep 10, 2021
678d91c
New method for picking final work title
olovy Sep 10, 2021
e4dcefb
Display work a bit nicer in show2
olovy Sep 10, 2021
b9a0013
Add verbose mode to link-contribution
olovy Sep 10, 2021
ed998a2
Handle extra chars in lifeSpan check
olovy Sep 10, 2021
abfa245
Fix linkContribution storeAtomicUpdate
olovy Sep 10, 2021
d8db886
Add more testes for extent
olovy Sep 14, 2021
b249707
Add script for moving Elib cover designers
olovy Sep 14, 2021
748073e
Add script for moving Elib cover designers
olovy Sep 14, 2021
7e6d8bb
Add script for moving Elib cover designers
olovy Sep 14, 2021
f7a335e
Add script for moving Elib cover designers
olovy Sep 14, 2021
62a0c74
Add script for moving Elib cover designers
olovy Sep 14, 2021
e6421d6
Add script for moving Elib cover designers
olovy Sep 14, 2021
0c70a74
Add script for moving Elib cover designers
olovy Sep 14, 2021
3cfeeed
Add script for moving Elib cover designers
olovy Sep 14, 2021
ef816c1
Add script for moving Elib cover designers
olovy Sep 14, 2021
51c0915
Add script for moving summary supplied by some providers from work to…
olovy Sep 14, 2021
f894a6c
Add script for moving Elib cover designers
olovy Sep 14, 2021
d490547
Add script for moving summary supplied by some providers from work to…
olovy Sep 14, 2021
88eb8b5
Add script for moving summary supplied by some providers from work to…
olovy Sep 14, 2021
487ae86
Add script for moving Elib cover designers
olovy Sep 14, 2021
ac1475d
Add script for moving Elib cover designers
olovy Sep 14, 2021
e72772e
Add script for moving Elib cover designers
olovy Sep 14, 2021
591bada
Add script for moving Elib cover designers
olovy Sep 14, 2021
30e631e
Add script for moving Elib cover designers
olovy Sep 15, 2021
2267fa1
Add more generic titles
olovy Sep 15, 2021
0f2ff25
Let intendedAudience marc/Juvenile match empty
olovy Sep 15, 2021
2a0d82f
Add filter for translation without translator
olovy Sep 15, 2021
d3d3bc7
Handle translationOf with different type
olovy Sep 15, 2021
e616c23
Handle translationOf with different type
olovy Sep 15, 2021
2fa6e7e
Link contribution: More aggressive normalization of name strings
olovy Sep 15, 2021
850a4cd
Move summary to instance before merging works
olovy Sep 15, 2021
47efe14
Move summary to instance before merging works
olovy Sep 15, 2021
1122166
No more drama
olovy Sep 15, 2021
b439e05
Add placeholder technicalNote to extracted work
olovy Sep 15, 2021
b13f73f
Also output clusters of size 1 when fitering
olovy Sep 16, 2021
d62f2b7
Move HTML stuff to its own file
olovy Sep 16, 2021
93837be
Print report when merging
olovy Sep 16, 2021
b4c66a3
mkdir report dir
olovy Sep 16, 2021
1ca8d97
Print report when merging
olovy Sep 16, 2021
720635d
Print report when merging
olovy Sep 16, 2021
1a3cbe4
Print report when merging
olovy Sep 16, 2021
253bb8c
Print report when merging
olovy Sep 16, 2021
aebb252
Print report when merging
olovy Sep 16, 2021
dd5c803
Print report when merging
olovy Sep 16, 2021
201ab1e
Fix store()
olovy Sep 16, 2021
d823566
Fix URI
olovy Sep 16, 2021
0366e1f
Fix URI
olovy Sep 16, 2021
fba2b53
Add script for removing language from title
olovy Sep 16, 2021
982b58f
Add script for removing language from title
olovy Sep 16, 2021
ea4ad2d
Add script for removing language from title
olovy Sep 16, 2021
7c410ed
Add script for removing language from title
olovy Sep 16, 2021
6a36adb
Add script for removing language from title
olovy Sep 16, 2021
8d574d7
Add script for removing language from title
olovy Sep 16, 2021
4e61faf
Add script for removing language from title
olovy Sep 16, 2021
09c39ae
Add script for removing language from title
olovy Sep 16, 2021
e7e8c33
Check isTranslationWithoutTranslator
olovy Sep 16, 2021
8b1d5c5
Check isTranslationWithoutTranslator
olovy Sep 16, 2021
d71d931
Try new rule for anonymous translations
olovy Sep 16, 2021
1ee5edb
Don't check extent
olovy Sep 16, 2021
68efdda
Fix -s2
olovy Sep 16, 2021
d3d9f61
Add filter -qm
olovy Sep 16, 2021
bd02be4
Add filter -qm
olovy Sep 16, 2021
8877dcf
Add script for moving up contributions from responsibilityStatement
kwahlin Sep 24, 2021
1988b46
Don't lose summaries when merging works
olovy Oct 6, 2021
3cf7756
Drop generic subtitles when picking extracted work title
olovy Oct 6, 2021
3d0e5fd
Drop generic subtitles when picking extracted work title
olovy Oct 6, 2021
8a4c4e1
Update ignored-subtitles.txt
olovy Oct 8, 2021
7a1679a
Make likelyRolePattern stricter and make roleSpecified stat clearer
kwahlin Oct 14, 2021
2a93628
Make thread safe
kwahlin Feb 8, 2022
52fda10
Properly remove contributor after match
kwahlin Feb 8, 2022
6ce4913
Add missing dependency
olovy Apr 26, 2022
3b095c5
remove language only from work titles
kwahlin May 11, 2022
a2c8021
filter out works having contribution in relationship
kwahlin May 11, 2022
896f6ef
Add revert method
kwahlin May 17, 2022
8ce37f0
Feature/resp statement to contribution (#1118)
kwahlin May 20, 2022
06ceba6
Enable filtering clusters by contribution role
kwahlin May 20, 2022
8b1add4
Add date catalogue to work report link
kwahlin May 20, 2022
0a6635a
Add [Publit] to move-summaries-to-instance script
olovy May 20, 2022
66cc802
Show reproductionOf links within cluster
kwahlin May 20, 2022
7dcc2eb
Use date as job identifier
kwahlin May 20, 2022
0a30419
Change catalogue structure
kwahlin May 20, 2022
7b3ac57
Handle labels in list
kwahlin May 20, 2022
18c7eb9
Fix revert
kwahlin May 23, 2022
5034adf
Fix removal of works
kwahlin May 23, 2022
fea0050
Handle exception when failing to find doc to remove
kwahlin May 23, 2022
7dbf9c4
Get encodingLevel properly
kwahlin May 23, 2022
8fc45b1
Save updates in right place
kwahlin May 23, 2022
be0c1e7
Don't reset changed flag to false
kwahlin May 23, 2022
09d019d
Prefix numPages
kwahlin May 24, 2022
4bcf364
Fix genreForm merging
kwahlin Jun 3, 2022
70e6cb5
Try new method for picking best work title
kwahlin Jun 16, 2022
e20d332
Add grep for safety
kwahlin Jun 16, 2022
cadd656
Fix argument types
kwahlin Jun 20, 2022
2e8dedd
Set generationDate/generationProcess before saving any updates
kwahlin Jun 20, 2022
cddf246
Drop all subtitles
kwahlin Jul 5, 2022
b477d39
Exclude local subject entities from merge unless type is ComplexSubject
kwahlin Jul 8, 2022
c9b5fdd
Exclude local genreForm entities from merge
kwahlin Jul 8, 2022
72f62e5
Add showHubs option
kwahlin Jul 12, 2022
3d7c5c4
Fix title overview
kwahlin Jul 12, 2022
7e27730
Only show hubs/clusters with more than 1 member
kwahlin Jul 12, 2022
2054cf8
Improve view for unstored extracted works while getting rid of the Do…
kwahlin Jul 13, 2022
dbdc53b
Use Doc instead of Doc2, filter on derivedFrom
kwahlin Jul 13, 2022
c9f5cb1
Add command line option for adding 9pu codes to illustrators
kwahlin Jul 13, 2022
36dae38
Fix view for titles
kwahlin Jul 13, 2022
92bb0e0
Filter out fields with null values properly
kwahlin Jul 18, 2022
00ec6d4
Actually copy work
kwahlin Jul 18, 2022
1ed9517
Print html only if cluster generates new merged work
kwahlin Jul 18, 2022
0196509
Don't parse editors for now
kwahlin Jul 19, 2022
73e2aa0
Pile ids vertically in _derivedFrom
kwahlin Jul 19, 2022
09e2624
Don't add title source until after partition and prefer work title wh…
kwahlin Jul 21, 2022
bb5d956
Put scripts in separate files to run normally with whelktool
kwahlin Aug 17, 2022
02e080c
WIP: refactor: separate Doc business and display logic
olovy Aug 18, 2022
17501e7
Clean up
olovy Aug 18, 2022
01963b8
Move out unneeded stuff from Util
kwahlin Aug 19, 2022
5157aac
Minor fixes
kwahlin Aug 25, 2022
0d103fd
Adjust for when there will be linked works already
kwahlin Aug 30, 2022
0d7d354
Add new files too...
kwahlin Aug 30, 2022
7addd79
Don't report if not stored
kwahlin Aug 31, 2022
5e8473d
Split instead of tokenize
kwahlin Sep 1, 2022
a84e109
Add missing size()
kwahlin Sep 1, 2022
6960b38
Remove exit method
kwahlin Sep 1, 2022
7202c02
handle titles in translationOf
kwahlin Jan 24, 2023
9212e28
Remove unneeded method
kwahlin Jan 25, 2023
5397fd7
Fix reproductionOf display
kwahlin Jan 25, 2023
9f6c1e2
Add draft for scripted job and update some whelktool scripts accordingly
kwahlin Jan 27, 2023
b55c2a5
Change order
kwahlin Jan 29, 2023
f0dd340
Go back to prioritizing work titles over instance titles and drop gen…
kwahlin Jan 30, 2023
31be1d2
Add flag option for report dir
kwahlin Jan 31, 2023
18009af
Take into account that translationOf can be list when comparing
kwahlin Jan 31, 2023
ce4f020
Add missing parameter for storeAtomicUpdate
kwahlin Jan 31, 2023
a976b27
Add num-threads flag
kwahlin Jan 31, 2023
6975a70
Remove return statement that prevents instances from being linked to …
kwahlin Jan 31, 2023
af342a9
Add missing parameter for storeAtomicUpdate again
kwahlin Jan 31, 2023
c2ec727
Fix report uri in technicalNote
kwahlin Jan 31, 2023
d0208ef
Copy executor service from Whelktool
kwahlin Mar 23, 2023
e8a906e
Make possible to pass Whelktool flags to wrapper script and print wha…
kwahlin Mar 23, 2023
26be9ae
Add script for adding missing translationOf to translations
kwahlin Mar 24, 2023
df50014
Better naming
kwahlin Mar 27, 2023
7e89e44
Prepare scripts to be run regularly
kwahlin Apr 18, 2023
88b8f74
Set generationProcess and generationDate for new works too
kwahlin Apr 18, 2023
23c6a66
Don't filter out anonymous translations and include translationOf nor…
kwahlin Apr 19, 2023
5725555
Always produce report with split clusters
kwahlin Apr 25, 2023
c79cba7
More concise where clause
kwahlin Apr 26, 2023
613b557
Enable cluster reporting with already merged works
kwahlin Apr 27, 2023
5aa753d
Reverse direction when comparing encoding levels since higher index m…
kwahlin Apr 27, 2023
60f6977
Skip adapter -> editor normalization for now
kwahlin Apr 27, 2023
5956a39
Add generic subtitle 'berättelser för barn'
kwahlin May 2, 2023
6680c15
Keep blank gf terms / subjects
kwahlin May 2, 2023
f05bb6c
Add apostrophe-like symbol to noise list
kwahlin May 3, 2023
339d9f3
Match untitled translationOf with those that have a title and pick be…
kwahlin May 4, 2023
58fe196
Exclude tactile text works from selection
kwahlin May 5, 2023
e1c04a8
Merge SAB classification only when codes are equal
kwahlin May 5, 2023
7b5f03a
Fix classification display
kwahlin May 5, 2023
8b0b4ef
Add missing parenthesis
kwahlin May 5, 2023
0ac3a71
Redirect annoying Whelktool output
kwahlin May 5, 2023
547b0ed
Interlink cluster members with closeMatch
kwahlin May 8, 2023
99f7fda
Use right method for loading docs when collecting title clusters
kwahlin May 9, 2023
7c5c564
Ignore too large result sets in find-work-clusters.groovy
olovy May 10, 2023
5916f7f
Remove ES query operators from e.g. titles in find-work-clusters
olovy May 10, 2023
87d06df
Don't do fuzzy title search in find-work-clusters
olovy May 10, 2023
ed6d139
Refine closeMatch linkage and start general restructuring (WIP)
kwahlin May 12, 2023
a30001c
Add urval to generic titles
kwahlin May 16, 2023
3ef5f8c
WIP: More refactoring
kwahlin May 17, 2023
6ecffec
Make report methods work after refactoring
kwahlin May 17, 2023
b44be93
Remove unused methods
kwahlin May 17, 2023
470b686
Don't save unmodified documents
kwahlin May 17, 2023
3613e0b
Allow copying contribution from records with lower encoding level
kwahlin May 17, 2023
13e7b8b
Match only titles of type Title, not ParallelTitle
kwahlin May 17, 2023
9b842e4
Ignore more subtitles
kwahlin May 17, 2023
8101f9f
Add more selection criteria (LXL-4147)
kwahlin May 17, 2023
972f255
Append partNumber/partName to mainTitle when copying title from insta…
kwahlin May 17, 2023
7dbbecb
Link instances to work, not to self
kwahlin May 17, 2023
6c8097c
More refactoring (towards eliminating Datatool)
kwahlin May 19, 2023
90ef6ec
Add ʼ character to noise list
kwahlin May 23, 2023
f3f88ce
Output only clusters with at least two records when filtering Swedish…
kwahlin Jun 1, 2023
e65665d
Bin some redundant classifications when merging
kwahlin May 23, 2023
0e25cfd
Trim whitespace from SAB codes
kwahlin May 24, 2023
4c23451
Filter out records whose titles have more than one part from selection
kwahlin Jun 22, 2023
238eece
Filter out manuscripts from selection
kwahlin Jun 22, 2023
9a242da
Drop generic subtitles appearing as substrings
kwahlin Jun 22, 2023
75520fd
Make saogf/L%C3%A4ttl%C3%A4st distinguishing
kwahlin Jun 22, 2023
1f61554
Look for saogf/Handskrifter too when deciding if manuscript
kwahlin Jun 22, 2023
85756cb
Create all necessary directories before writing reports
kwahlin Jun 26, 2023
60627cc
Make sure incompatible works never end up in same group when partitio…
kwahlin Jun 26, 2023
77743b3
WIP: Add new script for *all* contribution related normalizations on …
kwahlin Jun 1, 2023
2d05453
Implement more normalization methods and extensive reporting
kwahlin Jun 13, 2023
251721c
Add missing translationOf if translator in contribution
kwahlin Jun 14, 2023
11e868a
Remove replaced scripts
kwahlin Jun 14, 2023
8e3dae2
Be more restrictive about adding local entities in contribution (avoi…
kwahlin Jun 15, 2023
39845cc
Finalize script
kwahlin Jun 21, 2023
96b1d99
Add script for moving some illustrators in work clusters to instance
kwahlin Jun 16, 2023
2921926
Look for more gf terms when deciding if illustrator should be moved
kwahlin Jun 21, 2023
4fded0a
Move more roles to instance
kwahlin Jun 27, 2023
aa64202
Include contributions to instance in wrapper script
kwahlin Jun 27, 2023
1c8f416
Add script for specifying designer roels in elib records (and move un…
kwahlin Jun 27, 2023
9668408
Move more roles to instance by looking at domain
kwahlin Jun 27, 2023
29c74e9
Report all moved roles
kwahlin Jun 27, 2023
fd1094f
Map Formgivare to bookDesigner instead of designer
kwahlin Jun 27, 2023
151ccec
Fix report dir typo
kwahlin Jun 27, 2023
535c196
Fix role link
kwahlin Jun 27, 2023
72147e8
Remove print
kwahlin Jun 27, 2023
4ba9df9
Make blank gf term Lättläst distinguishing when comparing works
kwahlin Jun 28, 2023
ecaa721
Allow different initials when either of the compared names is a mononym
kwahlin Jun 28, 2023
b7a4463
Add more criteria for when to move roles to instance
kwahlin Jun 29, 2023
7845894
Exclude anonymous translations from selection
kwahlin Jun 29, 2023
cd7c561
WIP: Start moving stuff out from WorkTool to separate scripts
kwahlin Jun 30, 2023
95743e9
Make all steps runnable without WorkTool
kwahlin Jul 4, 2023
dc31b3c
Write to multi-work report on the fly
kwahlin Jul 5, 2023
dcd2638
Avoid saving same document several times
kwahlin Jul 5, 2023
b1ac57c
Add display scripts and remove WorkTool altogether
kwahlin Jul 6, 2023
bd8ba05
Fix loading resources
kwahlin Jul 7, 2023
97ea3d5
Make sure that there are matching local works before updating existin…
kwahlin Jul 7, 2023
1350820
Minor fixes
kwahlin Jul 7, 2023
3601aac
WIP: Move work code to separate module
kwahlin Aug 16, 2023
8a03071
Add more dependencies to facilitate code navigation in IntelliJ
kwahlin Aug 17, 2023
856c441
Fix import paths
kwahlin Aug 17, 2023
43e7f84
Remove mergeworks package from whelktool
kwahlin Aug 17, 2023
04ca5ba
Set docItem only when necessary
kwahlin Aug 18, 2023
e3b6a16
Remove unnecessary dependency that was only used experimentally
kwahlin Aug 18, 2023
b7c3f59
Clean up no longer valid stuff
kwahlin Aug 18, 2023
812c8ef
Add reportsDir to WhelkTool.gdsl
kwahlin Aug 18, 2023
d13d423
Remove/revert stuff not actively used at this point
kwahlin Aug 18, 2023
7b5734f
Merge branch 'develop' into feature/merge-works-clean
kwahlin Aug 18, 2023
78e1c8e
Fix typos
olovy Aug 18, 2023
e0fe8d9
Rename asciiFold -> removeDiacritics
olovy Aug 18, 2023
a8e5b34
Fix broken test
olovy Aug 18, 2023
d80cb71
Fix comparison of 'lättläst' terms and also add barngf/Lättlästa böcker
kwahlin Aug 18, 2023
1d5b092
Include physicalDetailsNote in hmtl report
kwahlin Aug 18, 2023
1 change: 1 addition & 0 deletions librisworks/.gitignore
@@ -0,0 +1 @@
/reports
37 changes: 37 additions & 0 deletions librisworks/build.gradle
@@ -0,0 +1,37 @@
apply plugin: 'application'
apply plugin: 'groovy'

sourceSets {
    scripts {
        groovy { srcDir 'scripts' }
    }
}

repositories {
    mavenCentral()
}

dependencies {
    implementation project(':whelktool')
    compileOnly "org.codehaus.groovy:groovy:${groovyVersion}"
    compileOnly project(':whelk-core')
    scriptsCompileOnly sourceSets.main.output
    scriptsCompileOnly project(':whelk-core')
}

jar {
    manifest {
        attributes "Main-Class": "whelk.datatool.WhelkTool",
                // log4j uses multi-release to ship different stack walking implementations for different java
                // versions. Since we repackage everything as a fat jar, that jar must also be multi-release.
                "Multi-Release": true
    }

    duplicatesStrategy = DuplicatesStrategy.EXCLUDE
    from {
        configurations.runtimeClasspath.collect {
            it.isDirectory() ? it : project.zipTree(it).matching {
            }
        }
    }
}
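The manifest comment above says the fat jar must itself be declared multi-release. One way to confirm that on a built jar is to inspect its manifest. The following is a minimal sketch, not part of the PR: the function name `is_multi_release` is hypothetical, and it assumes the manifest text is piped in on stdin.

```shell
#!/bin/bash
# Hypothetical helper, not part of this PR.
# Reads a MANIFEST.MF from stdin and succeeds only if it declares
# "Multi-Release: true". Manifest lines normally end in CRLF, so carriage
# returns are stripped before matching.
is_multi_release() {
    tr -d '\r' | grep -qi '^Multi-Release: true$'
}
```

Assuming `unzip` is available and the jar path matches `JAR_FILE` in run.sh, usage could look like `unzip -p build/libs/librisworks.jar META-INF/MANIFEST.MF | is_multi_release && echo "multi-release jar"`.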
124 changes: 124 additions & 0 deletions librisworks/run.sh
@@ -0,0 +1,124 @@
#!/bin/bash

count_lines() {
    # Print the number of lines in file $1, or 0 if the file does not exist.
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

if ! [[ "$1" =~ ^(local|dev|dev2|qa|stg|prod)$ ]]; then
    echo "Missing or invalid environment"
    exit 1
fi

ENV=$1
ARGS="${@:2}"
NUM_CLUSTERS=0

JAR_FILE=build/libs/librisworks.jar

SCRIPTS_DIR=scripts
REPORT_DIR=reports/merge-works/$ENV-$(date +%Y%m%d)

mkdir -p $REPORT_DIR/{clusters,normalizations,merged-works}

CLUSTERS_DIR=$REPORT_DIR/clusters
NORMALIZATIONS_DIR=$REPORT_DIR/normalizations

FIND_CLUSTERS=$CLUSTERS_DIR/find-clusters
ALL_CLUSTERS=$CLUSTERS_DIR/1-all.tsv
MERGED_CLUSTERS=$CLUSTERS_DIR/2-merged.tsv
TITLE_CLUSTERS=$CLUSTERS_DIR/3-title-clusters.tsv
SWEDISH_FICTION=$CLUSTERS_DIR/4-swedish-fiction.tsv
NO_ANONYMOUS_TRANSLATIONS=$CLUSTERS_DIR/5-no-anonymous-translations.tsv

LANGUAGE_IN_TITLE=$NORMALIZATIONS_DIR/1-titles-with-language
ELIB_DESIGNERS=$NORMALIZATIONS_DIR/2-elib-cover-designer
CONTRIBUTION=$NORMALIZATIONS_DIR/3-contribution
ROLES_TO_INSTANCE=$NORMALIZATIONS_DIR/4-roles-to-instance

# Clustering step 1 TODO: run only on recently updated records after first run
echo "Finding new clusters..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -jar $JAR_FILE \
    $ARGS --report $FIND_CLUSTERS $SCRIPTS_DIR/find-work-clusters.groovy >$ALL_CLUSTERS 2>/dev/null
NUM_CLUSTERS=$(count_lines $ALL_CLUSTERS)
echo "$NUM_CLUSTERS clusters found"
if [ $NUM_CLUSTERS == 0 ]; then
    exit 0
fi

# Clustering step 2
echo
echo "Merging clusters..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$ALL_CLUSTERS -jar $JAR_FILE \
    $ARGS $SCRIPTS_DIR/merge-clusters.groovy >$MERGED_CLUSTERS 2>/dev/null
NUM_CLUSTERS=$(count_lines $MERGED_CLUSTERS)
echo "Merged into $NUM_CLUSTERS clusters"
if [ $NUM_CLUSTERS == 0 ]; then
    exit 0
fi

# Clustering step 3
echo
echo "Finding title clusters..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$MERGED_CLUSTERS -jar $JAR_FILE \
    $ARGS $SCRIPTS_DIR/title-clusters.groovy >$TITLE_CLUSTERS 2>/dev/null
NUM_CLUSTERS=$(count_lines $TITLE_CLUSTERS)
echo "$NUM_CLUSTERS title clusters found"
if [ $NUM_CLUSTERS == 0 ]; then
    exit 0
fi

# Filter: Swedish fiction
echo
echo "Filtering on Swedish fiction..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$TITLE_CLUSTERS -jar $JAR_FILE \
    $ARGS $SCRIPTS_DIR/swedish-fiction.groovy >$SWEDISH_FICTION 2>/dev/null
NUM_CLUSTERS=$(count_lines $SWEDISH_FICTION)
echo "Found $NUM_CLUSTERS title clusters with Swedish fiction"
if [ $NUM_CLUSTERS == 0 ]; then
    exit 0
fi

# Normalization
echo
echo "Removing language from work titles..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$SWEDISH_FICTION -jar $JAR_FILE \
    $ARGS --report $LANGUAGE_IN_TITLE $SCRIPTS_DIR/language-in-work-title.groovy 2>/dev/null
echo "$(count_lines $LANGUAGE_IN_TITLE/MODIFIED.txt) records affected, report in $LANGUAGE_IN_TITLE"

echo
echo "Specifying designer roles in Elib records..." # NOTE: Not dependent on clustering, can be run anytime after ContributionByRoleStep has been deployed.
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -jar $JAR_FILE \
    $ARGS --report $ELIB_DESIGNERS $SCRIPTS_DIR/elib-unspecified-contributor.groovy 2>/dev/null
echo "$(count_lines $ELIB_DESIGNERS/MODIFIED.txt) records affected, report in $ELIB_DESIGNERS"

echo
echo "Normalizing contribution..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$SWEDISH_FICTION -jar $JAR_FILE \
    $ARGS --report $CONTRIBUTION $SCRIPTS_DIR/normalize-contribution.groovy 2>/dev/null
echo "$(count_lines $CONTRIBUTION/MODIFIED.txt) records affected, report in $CONTRIBUTION"

echo
echo "Moving roles to instance..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$SWEDISH_FICTION -jar $JAR_FILE \
    $ARGS --report $ROLES_TO_INSTANCE $SCRIPTS_DIR/contributions-to-instance.groovy 2>/dev/null
echo "$(count_lines $ROLES_TO_INSTANCE/MODIFIED.txt) records affected, report in $ROLES_TO_INSTANCE"

# Filter: Drop anonymous translations
echo "Filtering out anonymous translations..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$SWEDISH_FICTION -jar $JAR_FILE \
    $ARGS $SCRIPTS_DIR/drop-anonymous-translations.groovy >$NO_ANONYMOUS_TRANSLATIONS 2>/dev/null
NUM_CLUSTERS=$(count_lines $NO_ANONYMOUS_TRANSLATIONS)
echo "$NUM_CLUSTERS clusters ready for merge"
if [ $NUM_CLUSTERS == 0 ]; then
    exit 0
fi

# Merge
echo
echo "Merging..."
time java -Dxl.secret.properties=$HOME/secret.properties-$ENV -Dclusters=$NO_ANONYMOUS_TRANSLATIONS -jar $JAR_FILE \
    $ARGS --report $REPORT_DIR/merged-works $SCRIPTS_DIR/merge-works.groovy 2>/dev/null
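The clustering and filter steps in run.sh all follow the same pattern: run a script, write its stdout to a TSV file, count the lines, and stop if nothing came out. The sketch below shows how that pattern could be factored into one helper. It is an illustration only and not part of the PR; `run_step` is a hypothetical name, and `count_lines` is redefined here just to keep the example self-contained.

```shell
#!/bin/bash
# Hypothetical refactoring sketch, not part of this PR.

count_lines() {
    # Print the number of lines in file $1, or 0 if the file does not exist.
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

run_step() {
    # Usage: run_step <label> <output-file> <command> [args...]
    # Runs the command with stdout redirected to the output file (stderr is
    # discarded, as in run.sh) and prints how many clusters were written.
    local label=$1 out=$2
    shift 2
    echo "$label..."
    "$@" > "$out" 2>/dev/null
    local n
    n=$(count_lines "$out")
    echo "$n clusters"
    # Return non-zero when the step produced no output, so the caller can stop.
    [ "$n" -gt 0 ]
}
```

A caller would then write something like `run_step "Finding title clusters" "$TITLE_CLUSTERS" java ... || exit 0`. The normalization steps would not fit this helper unchanged, since they report through `--report` rather than stdout.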
96 changes: 96 additions & 0 deletions librisworks/scripts/contributions-to-instance.groovy
@@ -0,0 +1,96 @@
import whelk.Whelk

import static se.kb.libris.mergeworks.Util.Relator
import static whelk.JsonLd.ID_KEY
import static whelk.JsonLd.TYPE_KEY

report = getReportWriter('report.tsv')

def ids = new File(System.getProperty('clusters')).collect { it.split('\t').collect { it.trim() } }.flatten()

def whelk = getWhelk()
def instanceRolesByDomain = whelk.resourceCache.relators.findResults {
    if (it.domain) {
        def domain = whelk.jsonld.toTermKey(it.domain[ID_KEY])
        if (whelk.jsonld.isSubClassOf(domain, 'Embodiment')) it.subMap([ID_KEY])
    }
}
def instanceRoles = instanceRolesByDomain + [Relator.ILLUSTRATOR, Relator.AUTHOR_OF_INTRO, Relator.AUTHOR_OF_AFTERWORD].collect { [(ID_KEY): it.iri] }

selectByIds(ids) { bib ->
    Map instance = bib.graph[1]
    def work = instance.instanceOf
    def contribution = work?.contribution

    if (!contribution) return

    def ill = [(ID_KEY): Relator.ILLUSTRATOR.iri]

    def modified = false

    contribution.removeAll { c ->
        // Primary contributions always stay on the work.
        if (isPrimaryContribution(c)) return false

        def toInstance = asList(c.role).intersect(instanceRoles)
        // Keep the illustrator role on the work for 9pu contributions, picture books, comics and still images.
        if (toInstance.contains(ill)) {
            if (has9pu(c) || isPictureBook(work) || isComics(work, bib.whelk) || isStillImage(work)) {
                toInstance.remove(ill)
            }
        }
        if (toInstance) {
            // Copy the contribution to the instance with only the moved roles, then strip those roles from the work.
            instance['contribution'] = asList(instance['contribution']) + c.clone().tap { it['role'] = toInstance }
            c['role'] = asList(c.role) - toInstance
            modified = true
            report.println([bib.doc.shortId, toInstance.collect { it[ID_KEY].split('/').last() }].join('\t'))
            incrementStats('moved to instance', toInstance)
            // Remove the work contribution entirely if no roles remain on it.
            return c.role.isEmpty()
        }

        return false
    }

    if (contribution.isEmpty()) {
        work.remove('contribution')
    }

    if (modified) {
        bib.scheduleSave()
    }
}

boolean isPrimaryContribution(Map contribution) {
    contribution[TYPE_KEY] == 'PrimaryContribution'
}

boolean has9pu(Map contribution) {
    asList(contribution.role).contains([(ID_KEY): Relator.PRIMARY_RIGHTS_HOLDER.iri])
}

boolean isStillImage(Map work) {
    asList(work.contentType).contains([(ID_KEY): 'https://id.kb.se/term/rda/StillImage'])
}

boolean isPictureBook(Map work) {
    def picBookTerms = [
            'https://id.kb.se/term/barngf/Bilderb%C3%B6cker',
            'https://id.kb.se/term/barngf/Sm%C3%A5barnsbilderb%C3%B6cker'
    ].collect { [(ID_KEY): it] }

    return asList(work.genreForm).any { it in picBookTerms }
}

boolean isComics(Map work, Whelk whelk) {
    def comicsTerms = [
            'https://id.kb.se/term/saogf/Tecknade%20serier',
            'https://id.kb.se/term/barngf/Tecknade%20serier',
            'https://id.kb.se/term/gmgpc/swe/Tecknade%20serier',
            'https://id.kb.se/marc/ComicOrGraphicNovel',
            'https://id.kb.se/marc/ComicStrip'
    ].collect { [(ID_KEY): it] }

    return asList(work.genreForm).any {
        it in comicsTerms
                || it[ID_KEY] && whelk.relations.isImpliedBy('https://id.kb.se/term/saogf/Tecknade%20serier', it[ID_KEY])
                || asList(work.classification).any { it.code?.startsWith('Hci') }
    }
}
23 changes: 23 additions & 0 deletions librisworks/scripts/display-clusters.groovy
@@ -0,0 +1,23 @@
import se.kb.libris.mergeworks.Doc
import se.kb.libris.mergeworks.Html

htmlReport = getReportWriter('clusters.html')

htmlReport.println(Html.START)

new File(System.getProperty('clusters')).splitEachLine(~/[\t ]+/) { cluster ->
    List<Doc> docs = Collections.synchronizedList([])

    selectByIds(cluster) {
        it.getVersions()
                .reverse()
                .find { getAtPath(it.data, it.workIdPath) == null }
                ?.with { docs.add(new Doc(getWhelk, it)) }
    }

    docs.each { it.addComparisonProps() }

    htmlReport.println(Html.clusterTable(docs) + Html.HORIZONTAL_RULE)
}

htmlReport.println(Html.END)
55 changes: 55 additions & 0 deletions librisworks/scripts/display-works.groovy
@@ -0,0 +1,55 @@
import se.kb.libris.mergeworks.Doc
import se.kb.libris.mergeworks.Html
import se.kb.libris.mergeworks.WorkComparator

import static se.kb.libris.mergeworks.Util.partition

htmlReport = getReportWriter('works.html')

htmlReport.println(Html.START)

new File(System.getProperty('clusters')).splitEachLine(~/[\t ]+/) { cluster ->
    List<Doc> docs = Collections.synchronizedList([])

    selectByIds(cluster) {
        it.getVersions()
                .reverse()
                .find { getAtPath(it.data, it.workIdPath) == null }
                ?.with { docs.add(new Doc(getWhelk, it)) }
    }

    WorkComparator c = new WorkComparator(WorkComparator.allFields(docs))

    def workClusters = workClusters(docs, c).findAll { it.size() > 1 }

    workClusters.collect { [createNewWork(c.merge(it))] + it }
            .each { htmlReport.println(Html.clusterTable(it) + Html.HORIZONTAL_RULE) }
}

htmlReport.println(Html.END)

Collection<Collection<Doc>> workClusters(Collection<Doc> docs, WorkComparator c) {
    docs.each { it.addComparisonProps() }

    def workClusters = partition(docs, { Doc a, Doc b -> c.sameWork(a, b) })
            .each { work -> work.each { doc -> doc.removeComparisonProps() } }

    return workClusters
}

Doc createNewWork(Map workData) {
    workData['@id'] = "TEMPID#it"
    Map data = [
            "@graph": [
                    [
                            "@id"       : "TEMPID",
                            "@type"     : "Record",
                            "mainEntity": ["@id": "TEMPID#it"],
                    ],
                    workData
            ]
    ]

    return new Doc(create(data))
}
17 changes: 17 additions & 0 deletions librisworks/scripts/drop-anonymous-translations.groovy
@@ -0,0 +1,17 @@
import se.kb.libris.mergeworks.Doc

new File(System.getProperty('clusters')).splitEachLine(~/[\t ]+/) { cluster ->
    List<Doc> docs = Collections.synchronizedList([])
    selectByIds(cluster) {
        docs.add(new Doc(it))
    }

    def filtered = docs.split { it.instanceData }
            .with { local, linked ->
                linked + local.findAll { Doc d -> !d.isAnonymousTranslation() }
            }

    if (filtered.size() > 1) {
        println(filtered.collect { Doc d -> d.shortId() }.join('\t'))
    }
}
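The script above prints one cluster per line as tab-separated short ids, and only when more than one record remains after filtering. A quick sanity check on such an output file could look like the sketch below. It is not part of the PR, and the function name `small_clusters` is hypothetical. The field separator mirrors the `[\t ]+` pattern the scripts use when reading cluster files.

```shell
#!/bin/bash
# Hypothetical check, not part of this PR.
# Prints "<line number>: <line>" for every non-empty line of the cluster file
# $1 that holds fewer than two ids. No output means every cluster has at
# least two members.
small_clusters() {
    awk -F'[\t ]+' 'NF > 0 && NF < 2 { print NR ": " $0 }' "$1"
}
```

Run against the file named by `NO_ANONYMOUS_TRANSLATIONS` in run.sh, this should print nothing if the `filtered.size() > 1` guard works as intended.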