- Name:
Firstname Lastname
- Land:
Switzerland
- Type:
hcp
orhco
- Numbers: No inverted comma, use Dot for decimal.
Right:
1000.50
, wrong:1'000,50
name,location,country,address,plz,uci,donations_grants,sponsorship,registration_fees,travel_accommodation,fees,related_expenses,total,type
Max Mustermann,Zürich,CH,Strasse 5,,,,,315.0,487.0,,,802.0,hcp
The export function will add the column source
. Dont do it manually.
type,donations_grants,sponsorship,registration_fees,travel_accommodation,fees,related_expenses,total
hcp_amount,,,0.0,0.0,2400.0,0.0,2400.0
hco_amount,0.0,109439.0,0.0,0.0,0.0,0.0,109439.0
hcp_count,,,0.0,0.0,2.0,0.0,2.0
hco_count,0.0,3.0,0.0,0.0,0.0,0.0,3.0
rnd,,,,,,,1761943.0
This Functions have to be in the list extraction
basic_string_conversion
df_export = basic_string_conversion(df_export)
Does some cleaning:
Switzerland
toCH
- Removes
Dr. Md. Med etc.
export_list
export_list(df_export, 'NAME_OF_COMPANY')
Safes the dataframe to the right folder and checks the dataframe. It calculates, checks for carinations, reorder the datasets, add new fields, etc. This is your final function in each script!
export_accumulation
export_accumulation(df_export, 'astrazeneca')
Similar to export_list
. This is your final function!
write_to_excel (not in production!)
df_export = write_to_excel(df_export, 'tmp.xlsx'[, open=False])
Safes the dataframe as Excel and may open it directly. Use this during development, but not at the end!
write_to_csv (not in production!)
df_export = write_to_excel(df_export, 'novartis'[, index=False])
Similar to write_to_excel
. Not in production!
amounts_to_number
df_export = amounts_to_number(df_export)
Converts all number columns (sponsorship, donations, etc.
) to actually floats. Use this in every script if possible. Only avoid, if cells have text, which you cant convert automaticly (as in scanned pdfs)
cleanup_number
df_export = cleanup_number(df_export)
Cleans the number fields (strings at the moment), to that you can call amounts_to_number
The function actually calls:
replace_apostrophe
remove_chf
remove_spaces
replace_apostrophe
df_export = replace_apostrophe(df_export)
Removes apostrophes in number fields
remove_chf
df_export = remove_chf(df_export)
Removes CHF
in all number fields
remove_spaces
df_export = remove_spaces(df_export)
Remove spaces in all number fields
remove_dots
df_export = remove_dots(df_export)
Removes all dots in number fields.
remove_comma
df_export = remove_comma(df_export)
Removes all commas in number fields.
remove_in_numbers
df_export = remove_in_numbers(df_export, 'X')
Removes X
in number fields
replace_comma_to_dot
df_export = replace_comma_to_dot(df_export)
Replaces comma with dots in all number fields
replace_in_number
df_export = replace_in_number(df_export, 'X', 'Y')
Replaces X
with Y
in all numberfields
add_empty_accumulation
df_export = add_empty_accumulation(df_export, "rnd")
Adds en empty accumulation line.
add_accumulation
df_export = add_accumulation(df_export, AccType.hcp_amount, fees = 14389.37, total = 14389.37)
Adds an accumulation manually. You can use all parameters which are allowed in the final table.
set_type_by_alphabetical_order
df_export = set_type_by_alphabetical_order(df_export)
Adds the type based on the alphabets. Means: The moment, a name is not anymore in alphabetical order, it has to be the start of the hco-group.
Example:
Banz Meier
Clus Hannes
Dinkelberger Peter <- not in order anymore, must be HCP
Kantonsspital Sarnen
Kantonsspital Zürich
Possible to change HCP and HCO
df_export = set_type_by_alphabetical_order(df_export, hcp_before_hco = False)
If ordered by lastname and lastname is at position 2:
df_export = set_type_by_alphabetical_order(df_export, lastname_before_name = False)
add_type_by_index
#Add Type by Index
add_type_by_index(df_export, 220, [type_before = 'hcp', type_after = 'hco'])
add_warning(manually=True) #Writes a warning to the output
Add type by index: Needs a number. Avoid this if possible.
Type by Textstring
Add the type by search for a text string
#Add Type
index_hco = df_export[df_export['name'].str.contains("INDIVIDUAL NAMED DISCLOSURE", na=False)].index[0]
df_export['type'] = np.where(df_export.index < index_hco, 'hcp', 'hco')
sum_amounts
df_export = sum_amounts(df_export)
Sums all number fields. Only use, when total not provided
revert_name
df_export = revert_name(df_export)
Changes Lastname To Surname. If more than two names, writes a warning to a separat column.
remove_carination
df_export = remove_carination(df_export[, substr=""])
Removes carination (\r and \n) in all columns
add_uci
df_export = add_uci()
Add an empty UCI-Column
remove_empty_columns (depricated)
df_export = remove_empty_columns(df_export)
Remove all columns which are empty. Dont use this, actually...
columns_to_string
df_export = columns_to_string(df_export)
Converts all collumns to strings. Sometimes you need this before shifting a columns.
Remove empty rows
df_export = df_export.dropna(subset=['donations_grants', 'sponsorship', 'registration_fees', 'travel_accommodation', 'fees', 'related_expenses', 'total'], how='all')
Shift rows
df_export[df_export['name'].notna()] = df_export[df_export['name'].notna()].shift(1, axis='columns')
IMPORTANT: Sometimes you have to remove all NaNs in the columns or you will lose data. Look at Celgene
Reorder Columns
df_export = df_export[fix_columns[:-1]]
Unlock PDF
import pikepdf
pdf = pikepdf.open('pkk-erfassungmepha-pharma-ag20180427_eng_final.pdf')
pdf.save('unlocked.pdf')
Remove Name in Address
Sometimes you see the company name again in the address field. Like:
"Stiftung SONK","Stiftung SONK Rorschacher Strasse 150"
for index, row in df_export.iterrows():
df_export.loc[index, 'address'] = row['address'].replace(row['name'] + ' ', '')
Remove Title
#Remove Titles
df_export['name'] = df_export.name.str.replace(regex_title, '', regex=True, case=False)
df_export['name'] = df_export.name.str.strip()
Extract PLZ from location
#Copy PLZ from Location to PLZ
df_export['plz'] = df_export.location.str.extract(r' (\d{4})')
df_export['location'] = df_export.location.str.replace(r' (\d{4})', '')
Extract PLZ from Address and remove in Address
#Copy PLZ from Location to PLZ
df_export['plz'] = df_export.address.str.extract(r' (\d{4})')
df_export['address'] = df_export.address.str.replace(r'(,\s\d{4}.*)', '')
Set Columns manually
columns = [17.8, 165.2, 248.3, 421.9, 480.8, 533.5, 589.2, 643.9, 698.6, 757.6, 773.4]
df_hcp = tabula.read_pdf("hcp_2019_offenlegung.pdf", pages='all', lattice=False, columns=columns, guess=False, pandas_options={'header': None})
- Duplicated entries: GlaxoSmithKline
- Beautiful Soup: Lundbeck
- PDF Unlock: Mepha
- Type by Text-String (
INDIVIDUAL NAMED DISCLOSURE
oä): Bristol - Accumulation manually Jansen-Cilag
- Nothing to declare: SFL Pharma