This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

When run on kenkyuusha certain headers are incomplete #5

Open
rtega opened this issue Jul 20, 2018 · 8 comments

Comments

@rtega

rtega commented Jul 20, 2018

The heading of たしなむ is "heading": "嗜む" while it should be "たしなむ【嗜む】"

@rtega
Author

rtega commented Jul 20, 2018

I added the following line at line 143 of book.c:
if(strstr(result,"嗜む")) { printf("boef: %s %i %i\n",result,position->page,position->offset); }
which yields the following result:
boef: tashinamu <たしなむ【嗜む】> 30827 984
boef: たしなむ【嗜む】 <..> 138094 1506

boef: たしなむ【嗜む】 33548 130
boef: たしなむ【嗜む】 <..> 138094 1506

boef: 嗜む 38028 1326
boef: たしなむ【嗜む】 <..> 138094 1506
Basically, what's happening is that three headers in the dictionary all refer to the same article, and only the last one is exported.

@rtega
Author

rtega commented Jul 20, 2018

Basically, things go wrong in book_undupe(book). We need to be smarter about what we are removing.

@rtega
Author

rtega commented Jul 20, 2018

I would propose keeping the heading with the most content when removing duplicates in book_undupe(book). I don't understand your code at first glance. Could you have a look at it?

@rtega
Author

rtega commented Jul 21, 2018

I replaced the undupe code with this quicksort plus a duplicate-removal pass. The resulting file is a bit smaller, but it seems to work as it should.
```c
void swap(Book_Entry* a, Book_Entry* b)
{
    Book_Entry t = *a;
    *a = *b;
    *b = t;
}

int partition_entries(Book_Entry arr[], int low, int high)
{
    Book_Entry * pivot = &arr[high]; // pivot
    int i = (low - 1); // Index of smaller element

    for (int j = low; j <= high - 1; j++)
    {
        // If current element is smaller than or
        // equal to pivot
        if (arr[j].text.page < pivot->text.page)
        {
            i++;    // increment index of smaller element
            swap(&arr[i], &arr[j]);
        }
        if (arr[j].text.page == pivot->text.page)
        {
            if (arr[j].text.offset < pivot->text.offset)
            {
                i++;
                swap(&arr[i], &arr[j]);
                if (arr[j].text.offset == pivot->text.offset)
                {
                    if (strlen(arr[j].heading.text) <= strlen(pivot->heading.text))
                    {
                        i++;
                        swap(&arr[i], &arr[j]);
                    }
                }
            }
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

/* The main function that implements QuickSort
   arr[] --> Array to be sorted,
   low --> Starting index,
   high --> Ending index */
void quickSort_entries(Book_Entry arr[], int low, int high)
{
    if (low < high)
    {
        /* pi is partitioning index, arr[p] is now
           at right place */
        int pi = partition_entries(arr, low, high);

        // Separately sort elements before
        // partition and after partition
        quickSort_entries(arr, low, pi - 1);
        quickSort_entries(arr, pi + 1, high);
    }
}

int removeDuplicates_subbook(Book_Subbook* subbook)
{
    int n = subbook->entry_count;
    Book_Entry * arr = subbook->entries;
    // Return, if array is empty
    // or contains a single element
    if (n == 0 || n == 1)
        return n;

    Book_Entry * temp = malloc(n * sizeof(Book_Entry));

    // Start traversing elements
    int j = 0;
    for (int i = 0; i < n - 1; i++)

        // If current element is not equal
        // to next element then store that
        // current element
        if ((arr[i].text.page != arr[i+1].text.page) || (arr[i].text.offset != arr[i+1].text.offset))
            temp[j++] = arr[i];

    // Store the last element as whether
    // it is unique or repeated, it hasn't
    // stored previously
    temp[j++] = arr[n-1];

    // Modify original array
    for (int i = 0; i < j; i++)
        arr[i] = temp[i];

    subbook->entry_count = j;
    free(temp);
    return j;
}

static void subbook_undupe(Book_Subbook* subbook) {
    quickSort_entries(subbook->entries, 0, subbook->entry_count - 1);
    removeDuplicates_subbook(subbook);
}
```
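
The ordering that the hand-written partition above seems to be aiming for (page, then offset, then heading length) can also be expressed as a comparator for the standard `qsort`. The following is only a minimal sketch under stated assumptions: `Pos`, `Heading`, and `Entry` are simplified stand-ins that mirror the fields used in this thread, not zero-epwing's real definitions, and `compare_entries` and `undupe_sorted` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the fields used in this thread (assumption). */
typedef struct { int page; int offset; } Pos;
typedef struct { const char *text; } Heading;
typedef struct { Heading heading; Pos text; } Entry;

/* Order by page, then offset, then heading length (longest first). */
static int compare_entries(const void *lhs, const void *rhs) {
    const Entry *a = lhs;
    const Entry *b = rhs;
    if (a->text.page != b->text.page)
        return a->text.page < b->text.page ? -1 : 1;
    if (a->text.offset != b->text.offset)
        return a->text.offset < b->text.offset ? -1 : 1;
    size_t la = strlen(a->heading.text);
    size_t lb = strlen(b->heading.text);
    if (la != lb)
        return la > lb ? -1 : 1;
    return 0;
}

/* Sort, then keep only the first entry of each (page, offset) run.
   Returns the new entry count. */
static size_t undupe_sorted(Entry *entries, size_t count) {
    if (count < 2)
        return count;
    qsort(entries, count, sizeof *entries, compare_entries);
    size_t kept = 1;
    for (size_t i = 1; i < count; i++) {
        const Entry *prev = &entries[kept - 1];
        if (entries[i].text.page != prev->text.page ||
            entries[i].text.offset != prev->text.offset)
            entries[kept++] = entries[i];
    }
    return kept;
}
```

Because the longest heading sorts first within each run, the first entry of a run is the one to keep, and no temporary array is needed. Note that `strlen` counts bytes rather than characters, so with UTF-8 headings a long romanized heading can outrank a shorter kana one.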

@rtega
Author

rtega commented Jul 21, 2018

It crashes on gakken though.

@rtega
Author

rtega commented Jul 21, 2018

And it doesn't work as it should. Working on an updated version.

@FooSoft
Owner

FooSoft commented Jul 22, 2018

I think the easiest fix is just to check lengths when looking for dupes. If there is a dupe with a longer header length, swap it with the current entry and delete the dupe. You shouldn't have to sort anything.
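
As a rough illustration of that idea, here is a minimal sketch that compares heading lengths while scanning for dupes and never sorts. It is not zero-epwing's actual undupe code: `Pos`, `Heading`, and `Entry` are simplified stand-in types mirroring the fields mentioned in this thread, and `same_target` and `undupe_keep_longest` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the fields used in this thread (assumption). */
typedef struct { int page; int offset; } Pos;
typedef struct { const char *text; } Heading;
typedef struct { Heading heading; Pos text; } Entry;

/* Two entries are dupes when they point at the same article text. */
static int same_target(const Entry *a, const Entry *b) {
    return a->text.page == b->text.page && a->text.offset == b->text.offset;
}

/* Remove dupes in place, keeping the longest heading for each target.
   Returns the new entry count. Entry order is not preserved. */
static size_t undupe_keep_longest(Entry *entries, size_t count) {
    for (size_t i = 0; i < count; i++) {
        size_t j = i + 1;
        while (j < count) {
            if (!same_target(&entries[i], &entries[j])) {
                j++;
                continue;
            }
            /* The dupe has a longer heading: let it take over slot i. */
            if (strlen(entries[j].heading.text) > strlen(entries[i].heading.text))
                entries[i] = entries[j];
            /* Delete slot j by moving the last entry into it. */
            entries[j] = entries[count - 1];
            count--;
            /* Do not advance j: the slot now holds an unexamined entry. */
        }
    }
    return count;
}
```

This plain nested scan is quadratic in the number of entries, so it is only meant to show the keep-the-longer-heading rule; a real implementation would presumably plug the length check into whatever lookup the existing dupe detection already uses.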

That being said, I'm not sure you actually want to use headers for anything. All of that information can be found in the entry text, and you are going to have to parse all of that stuff out with regex anyway. Honestly, if anything, this made me wonder if I should even be exporting the headers out of zero-epwing as AFAIK they are just some weird artifact of the EPWING format.

@rtega
Author

rtega commented Jul 22, 2018

For reference articles you don't have a header in the entry text itself:
"heading": "¶両三日 <りょう2【両】>",
"text": "・両三日 two or three days; a couple of days\n"
I guess you really want to keep the info in the heading in that case. Take the example of 普通高等学校:
"heading": "¶普通高等学校 <こうとうがっこう【高等学校】>",
"text": "普通高等学校 a general [an ordinary, an academic] high school.\nこうとうかん【高等官】 {{w_46695}}(k{{n_41528}}t{{n_41528}}kan)\n"
The heading refers to 高等学校, while the text refers to 高等官, so I think you want to keep the info in the heading.

Looking at your code for removing dupes, I don't see how you can get at the entry you are comparing against from a Page pointer alone.
