Tags: scrapinghub/mrjob
upload_dirs, pre-filters
* automatically tarball and upload directories with --dir, setup hooks (Yelp#23)
* specify path for inter-step output with --step-output-dir (Yelp#263)
* jobs:
* better --help printout
* deprecated option groups in MRJobs
* deprecated MRJob.get_all_option_groups()
* overriding *_pre_filter() methods in MRJob works again (Yelp#1521)
* all step types accept jobconf (Yelp#1447)
* quieted warning about SORT_VALUES on Hadoop 2 (Yelp#1286)
* all runners:
* wrap tasks that require pipes with sh_bin, not bash (Yelp#1330)
* local runner:
* allows non-zero exit status from pre-filters (Yelp#1524)
* pre-filters can now handle compressed input (Yelp#1061)
* EMR runner:
* fetch logs from task nodes as well as core nodes (Yelp#1400)
* use ListInstances rather than dfsadmin to get node list (Yelp#1345)
* moved mrjob.util.bunzip2_stream() to mrjob.cat
* moved mrjob.util.gunzip_stream() to mrjob.cat
* deprecated:
* mrjob.util.args_for_opt_dest_subset()
* mrjob.util.bash_wrap()
* mrjob.util.populate_option_groups_with_options()
* mrjob.util.scrape_options_and_index_by_dest()
* mrjob.util.tar_and_gz()
* SSHFilesystem.ssh_slave_hosts()
Spark
* EMR and Hadoop runners:
* full support for Spark (Yelp#1320)
* includes spark() method in MRJob and SparkStep/SparkScriptStep
* can use environment variables and ~ in hadoop_streaming_jar option
* EMR runner:
* default AMI version is now 4.8.2 (Yelp#1486)
* default instance type is m1.large when running Spark jobs (Yelp#1465)
* added debug logging for matching available pooled clusters (Yelp#1449)
* defaults to cheapest instance type that will work (Yelp#1369)
* master bootstrap script always created when pooling
* no longer crashes when trying to use missing ssh binary (Yelp#1474)
* pooled clusters may have 1000 steps (Yelp#1463)
* failed jobs no longer reported as 100% complete (Yelp#793)
* All runners:
* py_files option for Spark and streaming steps (Yelp#1375)
* bootstrap mrjob with a .zip rather than a tarball
* options refactor, added missing command-line switches (Yelp#1439)
* mrjob terminate-idle-clusters works with all step types (Yelp#1363)
* log interpretation:
* dropped unnecessary container-to-attempt-ID mapping (Yelp#1487)
* more efficient search for task log errors (Yelp#1450)
* cleaner error messages when bootstrapped mrjob won't compile
* JarSteps:
* now support libjars, jobconf (Yelp#1481)
* JarStep.{INPUT,OUTPUT} are deprecated (use mrjob.step.{INPUT,OUTPUT})
* is_uri() now only matches URIs containing "://" (Yelp#1455)
* works in Anaconda3 Jupyter Notebook (Yelp#1441)
* deprecated mrjob.parse.is_windows_path()
* deprecated mrjob.parse.parse_key_value_list()
* deprecated mrjob.parse.parse_port_range_list()
* deprecated mrjob.util.scrape_options_into_new_groups()
* deprecated non-strict protocols (Yelp#1452)
* deprecated python_archives (Yelp#1056)
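The is_uri() note above says that only strings containing "://" now count as URIs. A minimal sketch of that rule follows; the function name looks_like_uri() is hypothetical and this is not mrjob's actual implementation, only an illustration of the behavior the note describes.

```python
def looks_like_uri(path):
    """Hypothetical helper: treat a string as a URI only if it contains '://'.

    This mirrors the rule stated in the changelog entry, not mrjob's code.
    """
    return '://' in path


# A bucket URI contains '://', so it counts as a URI.
print(looks_like_uri('s3://bucket/input.txt'))    # True

# A Windows-style path has a colon but no '://', so it does not.
print(looks_like_uri(r'C:\data\input.txt'))       # False

# A plain relative path is not a URI either.
print(looks_like_uri('data/input.txt'))           # False
```

Requiring "://" rather than just a colon keeps drive-letter paths from being mistaken for URIs, which may be why the separate Windows-path check is listed as deprecated in the same release.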
pooling auto-recovery
* jobs:
* pass_through_option(), for existing command-line options (Yelp#1075)
* MRJob.options.runner now defaults to None, not 'inline' or 'local'
* runners:
* all:
* names of uploaded files now never start with . or _ (Yelp#1200)
* Hadoop:
* log parsing:
* handles more log4j patterns (Yelp#1405)
* gracefully handles IOError from exists() (Yelp#1355)
* fixed crash bug in Hadoop FS on Python 3 (Yelp#1396)
* EMR:
* pooling auto-recovers from joining a cluster that self-terminated (Yelp#708)
* log fetching uses sudo on 4.3.0+ AMIs (Yelp#1244)
* fixed broken --ssh-bind-ports switch (Yelp#1402)
* idle termination script now only runs on master node (Yelp#1398)
* ssh tunnel connects to internal IP of resource manager (Yelp#1397)
* AWS credentials no longer logged in verbose mode (Yelp#1353)
* many option names are now more generic (Yelp#1247):
* ami_version -> image_version
* aws_availability_zone -> zone
* aws_region -> region
* check_emr_status_every -> check_cluster_every
* ec2_core_instance_bid_price -> core_instance_bid_price
* ec2_core_instance_type -> core_instance_type
* ec2_instance_type -> instance_type
* ec2_master_instance_bid_price -> master_instance_bid_price
* ec2_master_instance_type -> master_instance_type
* ec2_task_instance_bid_price -> task_instance_bid_price
* ec2_task_instance_type -> task_instance_type
* emr_tags -> tags
* num_ec2_core_instances -> num_core_instances
* num_ec2_task_instances -> num_task_instances
* s3_log_uri -> cloud_log_dir
* s3_sync_wait_time -> cloud_fs_sync_secs
* s3_tmp_dir -> cloud_tmp_dir
* s3_upload_part_size -> cloud_upload_part_size
* num_ec2_instances is deprecated (use num_core_instances)
* ec2_slave_instance_type is deprecated (use core_instance_type)
* hadoop_streaming_jar_on_emr is deprecated (Yelp#1405)
* hadoop_streaming_jar handles this instead with file:// URIs
* bootstrap_python does nothing on AMI 4.6.0+, as not needed (Yelp#1358)
* mrjob audit-emr-usage should show less/no API throttling warnings (Yelp#1091)
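The option renames listed above can be applied mechanically to an existing set of runner options. The sketch below assumes the options are already loaded into a plain dict; RENAMED_OPTS and rename_opts() are hypothetical names, not part of mrjob. It covers only the straight "old -> new" renames, and leaves out the options listed as deprecated (num_ec2_instances, ec2_slave_instance_type, hadoop_streaming_jar_on_emr).

```python
# Old-to-new option names, copied from the rename list in the changelog.
RENAMED_OPTS = {
    'ami_version': 'image_version',
    'aws_availability_zone': 'zone',
    'aws_region': 'region',
    'check_emr_status_every': 'check_cluster_every',
    'ec2_core_instance_bid_price': 'core_instance_bid_price',
    'ec2_core_instance_type': 'core_instance_type',
    'ec2_instance_type': 'instance_type',
    'ec2_master_instance_bid_price': 'master_instance_bid_price',
    'ec2_master_instance_type': 'master_instance_type',
    'ec2_task_instance_bid_price': 'task_instance_bid_price',
    'ec2_task_instance_type': 'task_instance_type',
    'emr_tags': 'tags',
    'num_ec2_core_instances': 'num_core_instances',
    'num_ec2_task_instances': 'num_task_instances',
    's3_log_uri': 'cloud_log_dir',
    's3_sync_wait_time': 'cloud_fs_sync_secs',
    's3_tmp_dir': 'cloud_tmp_dir',
    's3_upload_part_size': 'cloud_upload_part_size',
}


def rename_opts(opts):
    """Return a copy of opts with old option names replaced by new ones.

    Hypothetical helper. If both the old and the new name are present,
    the value given under the new name is kept.
    """
    renamed = {}
    for name, value in opts.items():
        new_name = RENAMED_OPTS.get(name, name)
        # An old-style name never overrides a value already set
        # under the new name.
        if new_name in renamed and name != new_name:
            continue
        renamed[new_name] = value
    return renamed


old_opts = {'ami_version': '4.8.2', 'ec2_instance_type': 'm1.large'}
print(rename_opts(old_opts))
```

Letting the new name win when both are present is a design choice of this sketch; the changelog does not say how mrjob itself resolves such a conflict.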
libjars
* jobs:
* LIBJARS and libjars method (Yelp#1341)
* runners:
* all:
* .cpython-3*.pyc files no longer included when bootstrapping mrjob
* local: *PATH envvars combined with local separator (Yelp#1321)
* Hadoop and EMR:
* libjars option (Yelp#198)
* fixes to ordering of generic and JAR-specific options (Yelp#1331, Yelp#1332)
* Hadoop:
* more default log dirs (Yelp#1339)
* hadoop_tmp_dir handles ~ and envvars (Yelp#1322) (broken in v0.5.0)
* EMR:
* determine cause of failure of bootstrap scripts (Yelp#370)
* master bootstrap script now redirects stdout to stderr
* emr_configurations option (Yelp#1276)
* subnet option (Yelp#1323)
* SSH tunnel opened as soon as cluster is ready (Yelp#1115)
* SSH tunnel leaves stdin alone (Yelp#1161)
* combine_lists() treats dicts as values, not sequences
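The last note above says combine_lists() treats dicts as values rather than sequences. The following sketch shows what that distinction means when merging list-valued options. combine_lists_sketch() is a hypothetical stand-in, not mrjob's code; skipping None and treating strings and bytes as single values are assumptions made for the illustration.

```python
def combine_lists_sketch(*seqs):
    """Concatenate sequences into one list.

    Hypothetical stand-in for the behavior described in the changelog:
    dicts are appended whole rather than iterated over. As assumptions
    of this sketch, strings, bytes and other non-iterables are also
    treated as single values, and None is skipped.
    """
    result = []
    for seq in seqs:
        if seq is None:
            continue
        is_scalar = (isinstance(seq, (str, bytes, dict))
                     or not hasattr(seq, '__iter__'))
        if is_scalar:
            result.append(seq)   # keep the value intact
        else:
            result.extend(seq)   # splice in the items of a real sequence
    return result


# The dict stays one item; iterating it would have added only its keys.
print(combine_lists_sketch(['a'], {'Classification': 'core-site'}, ('b',)))
```

The distinction matters for options whose items are themselves dicts: iterating over a dict would silently drop its values.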
initial Cloud Dataproc support
* basic support for Google Cloud Dataproc (Yelp#1243)
* lacks log interpretation, JarStep support
* on EMR, wait for steps to complete in correct order (Yelp#1316)
* correctly handle ~ in include path in mrjob.conf (Yelp#1308)
* new emr_applications option (Yelp#1293)
* fix running deprecated tools with python -m (Yelp#1312)
* fix ssh tunneling to 2.x AMIs on EMR in VPCs (Yelp#1311)
post-release bugfixes
* strict_protocols in mrjob.conf is no longer ignored (Yelp#1302)
* check_input_paths in mrjob.conf is no longer ignored
* partitioner() is no longer ignored, fixing SORT_VALUES (Yelp#1294)
* --partitioner switch is deprecated
* improved probable cause of error from pre-YARN logs (Yelp#1288)
* ssh_bind_ports now defaults to (x)range, not list (Yelp#1284)
* mrjob terminate-idle-clusters handles debugging jar from boto 2.40.0 (Yelp#1306)
the future is in the past
* supports Python 3 (Yelp#989)
* requires boto 2.35.0 or newer (Yelp#980)
* removed many workarounds for S3 and EMR (Yelp#980), IAM (Yelp#1062)
* jobs:
* is_mapper_or_reducer() is now is_task() (Yelp#1072)
* mr() no longer takes positional arguments (Yelp#814)
* removed jar() (use mrjob.step.JarStep)
* removed testing methods parse_counters() and parse_output()
* protocols:
* protocols are strict by default (Yelp#724)
* JSON protocols use ujson when available, then simplejson (Yelp#1002, Yelp#1266)
* can explicitly choose Standard, Simple or Ultra JSON protocol
* raw protocols handle bytes or unicode depending on Python version
* can explicitly choose Text or Bytes protocol
* mrjob.step:
* JarStep only takes "args" and "main_class" keyword args
* removed MRJobStep (use MRStep)
* runners:
* All runners:
* totally revamped log handling (Yelp#1123)
* runner status/log messages are less noisy (Yelp#1044)
* don't bootstrap mrjob if interpreter is set (Yelp#1041)
* fs methods path_exists() and path_join() are now exists() and join()
* deprecation warning: use runner.fs explicitly (Yelp#1146)
* changes to cleanup options:
* removed IS_SUCCESSFUL (use ALL)
* LOCAL_SCRATCH is now LOCAL_TMP (Yelp#318)
* new HADOOP_TMP option handles HDFS cleanup (Yelp#1261)
* REMOTE_SCRATCH is now CLOUD_TMP (Yelp#1261)
* base_tmp_dir option is now local_tmp_dir (Yelp#318)
* non-inline runners raise StepFailedException on step failure (Yelp#1219)
* steps_python_bin defaults to current python interpreter (Yelp#1038)
* _job_name is now _job_key (Yelp#982)
* EMR:
* default AWS region is us-west-2 (Yelp#1025)
* default instance type is m1.medium (Yelp#992)
* visible_to_all_users defaults to true (Yelp#1016)
* matches your minor version of Python 2 on 3.x and 4.x AMIs (Yelp#1265)
* 4.x AMIs are supported (Yelp#1105)
* added --release-label switch (--ami-version 4.x.y also works)
* can fetch counters and probable cause of failure on 3.x and 4.x AMIs
* SSH tunnel now works on 3.x and 4.x AMIs (Yelp#1013)
* ssh_tunnel_to_job_tracker option is now ssh_tunnel
* correctly fetch step logs by step ID (Yelp#1117)
* bootstrap_python option
* s3_scratch_uri option is now s3_tmp_dir (Yelp#318)
* aws_region is no longer inferred from s3_tmp_dir
* create/select temp bucket in same region as EMR jobs (Yelp#687)
* added iam_endpoint option (Yelp#1067)
* removed s3_conn args from methods in EMRJobRunner and S3Filesystem
* S3 Filesystem:
* connect to each S3 bucket on appropriate endpoint (Yelp#1028)
* fall back to default if we can't get bucket location (Yelp#1170)
* removed special treatment of _$folder$ keys
* removed deprecated S3Filesystem method get_s3_folder_keys()
* recurse "subdirectories" even if uri lacks trailing / (Yelp#1183)
* removed iam_job_flow_role option (use iam_instance_profile)
* custom hadoop_streaming_jar gets properly uploaded
* job cleanup temporarily disabled (Yelp#1241)
* pooling respects key pair (Yelp#1230)
* idle cluster self-termination respects non-streaming jobs (Yelp#1145)
* deprecated "latest" AMI version not passed through to EMR (Yelp#1269)
* emr_job_flow_id option is now cluster_id (Yelp#1082)
* emr_job_flow_pool_name is now pool_name (Yelp#1082)
* pool_emr_job_flows is now pool_clusters (Yelp#1082)
* Hadoop:
* works out-of-the-box on most Hadoop setups (Yelp#1160)
* works out-of-the-box inside EMR (2.x, 3.x, and 4.x AMIs)
* counters are parsed from Hadoop binary stderr in YARN (Yelp#1153)
* can find logs and probable cause of failure in YARN (Yelp#1195)
* will search in <output dir>/_logs, to support Cloudera (Yelp#565)
* HDFS Filesystem:
* use fs -ls -R and fs -rm -R in YARN (Yelp#1152)
* mkdir() now uses -p on YARN (Yelp#991)
* fs.du() now works on YARN (Yelp#1155)
* fs.du() now returns 0 for nonexistent files instead of erroring
* fs.rm() now uses -skipTrash
* dropped support for Hadoop prior to 0.20.203 (Yelp#1208)
* added hadoop_log_dirs option
* hdfs_scratch_dir option is now hadoop_tmp_dir (Yelp#318)
* hadoop_home is deprecated
* uses -D and correct property name when step has no reduces (Yelp#1213)
* Inline/Local:
* runner.fs raises IOError if passed URIs (Yelp#1185)
* version-agnostic by default (Yelp#735)
* removed ignored hadoop_extra_args and hadoop_streaming_jar opts (Yelp#1275)
* inline runner uses multiple splits by default (Yelp#1276)
* removed mrjob.compat.get_jobconf_value() (use jobconf_from_env())
* removed mrjob.compat methods to support Hadoop prior to 0.20.203:
* supports_combiners_in_hadoop_streaming()
* supports_new_distributed_cache_options()
* uses_generic_jobconf()
* removed mrjob.conf.combine_cmd_lists()
* removed fetch-logs tool (Yelp#1127)
* mrjob subcommands use "cluster" rather than "job-flow" (Yelp#1082)
* create-job-flow is now create-cluster
* terminate-idle-job-flows is now terminate-idle-clusters
* terminate-job-flow is now terminate-cluster
* Python-version-specific mrjob-x and mrjob-x.y commands (Yelp#1104)
* use followlinks=True with os.walk()
* all internal constants/functions/methods explicitly start with _ (Yelp#681)
* mrjob.util:
* file_ext() takes filename, not path
* random_identifier() moved here from mrjob.aws
* buffer_iterator_to_line_iterator() is now to_lines()
* to_lines() no longer appends a newline to data (Yelp#819)
* removed extract_dir_for_tar()
* gunzip_stream() now yields chunks, not lines
* removed hash_object()
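The mrjob.util notes above describe a stream that yields arbitrary chunks and a to_lines() helper that turns chunks back into lines without appending a newline to the final piece of data. A minimal sketch of that chunk-to-line conversion follows; chunks_to_lines() is a hypothetical name and this is not mrjob's implementation.

```python
def chunks_to_lines(chunks):
    """Yield lines from an iterable of byte chunks of arbitrary size.

    Hypothetical sketch. Each complete line keeps its trailing newline.
    If the data does not end with a newline, the last line is yielded
    as-is, with no newline appended.
    """
    leftover = b''
    for chunk in chunks:
        data = leftover + chunk
        pieces = data.split(b'\n')
        # The final piece is an incomplete line (or b'' if the data
        # ended exactly on a newline); hold it for the next chunk.
        leftover = pieces.pop()
        for piece in pieces:
            yield piece + b'\n'
    if leftover:
        yield leftover


# Line boundaries need not line up with chunk boundaries.
for line in chunks_to_lines([b'a\nb', b'c\n', b'd']):
    print(line)
```

Leaving the final line untouched means the concatenated output is byte-for-byte identical to the input, which a helper that appends a newline cannot guarantee.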
config files
* PyYAML>=3.08 is required
* !clear tag in conf files (Yelp#1162)
* combine_lists() and combine_path_lists() can handle scalars (Yelp#1172)
* include: paths in conf files are relative to real path of conf file (Yelp#1166)
* mrjob.conf.combine_cmd_lists() is deprecated (Yelp#1168)
* EMR runner: pool_wait_minutes can now be loaded from mrjob.conf (Yelp#1070)