Tags · scrapinghub/mrjob

v0.5.8

upload_dirs, pre-filters

 * automatically tarball and upload directories with --dir, setup hooks (Yelp#23)
 * specify path for inter-step output with --step-output-dir (Yelp#263)
 * jobs:
   * better --help printout
   * deprecated option groups in MRJobs
   * deprecated MRJob.get_all_option_groups()
   * overriding *_pre_filter() methods in MRJob works again (Yelp#1521)
   * all step types accept jobconf (Yelp#1447)
   * quieted warning about SORT_VALUES on Hadoop 2 (Yelp#1286)
 * all runners:
   * wrap tasks that require pipes with sh_bin, not bash (Yelp#1330)
 * local runner:
   * allows non-zero exit status from pre-filters (Yelp#1524)
   * pre-filters can now handle compressed input (Yelp#1061)
 * EMR runner:
   * fetch logs from task nodes as well as core nodes (Yelp#1400)
     * use ListInstances rather than dfsadmin to get node list (Yelp#1345)
 * moved mrjob.util.bunzip2_stream() to mrjob.cat
 * moved mrjob.util.gunzip_stream() to mrjob.cat
 * deprecated:
   * mrjob.util.args_for_opt_dest_subset()
   * mrjob.util.bash_wrap()
   * mrjob.util.populate_option_groups_with_options()
   * mrjob.util.scrape_options_and_index_by_dest()
   * mrjob.util.tar_and_gz()
   * SSHFilesystem.ssh_slave_hosts()
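
A pre-filter is a shell command piped in front of a task, which is why such tasks need a shell wrapper at all. The following is a minimal sketch of that wrapping idea, not mrjob's own code: the helper name `wrap_in_shell`, the example command strings, and the `sh -ex` value standing in for sh_bin are all assumptions made for illustration.

```python
import shlex


def wrap_in_shell(stages, sh_bin=("sh", "-ex")):
    """Join command stages with pipes and wrap them in a shell invocation.

    Hypothetical helper for illustration only; not part of mrjob.
    """
    if len(stages) == 1:
        # A single command has no pipe, so it can run without a shell.
        return shlex.split(stages[0])
    # Pipes only work inside a shell, so hand the whole pipeline to sh_bin.
    pipeline = " | ".join(stages)
    return list(sh_bin) + ["-c", pipeline]


# Example: a pre-filter (grep) feeding an assumed mapper command.
args = wrap_in_shell(["grep -v '^#'", "python my_job.py --step-num=0 --mapper"])
print(args)
```

Passing a different `sh_bin` tuple (for example `("bash",)`) changes only the wrapper, not the pipeline string.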

Feb 1, 2017
cc64250

v0.5.7

Spark

 * EMR and Hadoop runners:
   * full support for Spark (Yelp#1320)
     * includes spark() method in MRJob and SparkStep/SparkScriptStep
   * can use environment variables and ~ in hadoop_streaming_jar option
 * EMR runner:
   * default AMI version is now 4.8.2 (Yelp#1486)
   * default instance type is m1.large when running Spark jobs (Yelp#1465)
   * added debug logging for matching available pooled clusters (Yelp#1449)
   * defaults to cheapest instance type that will work (Yelp#1369)
   * master bootstrap script always created when pooling
   * no longer crashes when trying to use missing ssh binary (Yelp#1474)
   * pooled clusters may have 1000 steps (Yelp#1463)
   * failed jobs no longer reported as 100% complete (Yelp#793)
 * all runners:
   * py_files option for Spark and streaming steps (Yelp#1375)
   * bootstrap mrjob with a .zip rather than a tarball
   * options refactor, added missing command-line switches (Yelp#1439)
 * mrjob terminate-idle-clusters works with all step types (Yelp#1363)
 * log interpretation:
   * dropped unnecessary container-to-attempt-ID mapping (Yelp#1487)
   * more efficient search for task log errors (Yelp#1450)
   * cleaner error messages when bootstrapped mrjob won't compile
 * JarSteps:
   * now support libjars, jobconf (Yelp#1481)
   * JarStep.{INPUT,OUTPUT} are deprecated (use mrjob.step.{INPUT,OUTPUT})
 * is_uri() now only matches URIs containing "://" (Yelp#1455)
 * works in Anaconda3 Jupyter Notebook (Yelp#1441)
 * deprecated mrjob.parse.is_windows_path()
 * deprecated mrjob.parse.parse_key_value_list()
 * deprecated mrjob.parse.parse_port_range_list()
 * deprecated mrjob.util.scrape_options_into_new_groups()
 * deprecated non-strict protocols (Yelp#1452)
 * deprecated python_archives (Yelp#1056)
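
The is_uri() change above can be pictured with a small sketch. This is only the rule as stated in the note (a URI must contain "://"), written as a hypothetical function `looks_like_uri`; it is not the library's implementation, and the sample paths are made up.

```python
def looks_like_uri(path):
    """Return True only for strings containing a "://" scheme separator.

    Hypothetical stand-in for the rule described in the release note.
    """
    return "://" in path


# A drive-letter path contains a colon but no "://", so it is not a URI.
for candidate in ["s3://bucket/key", "hdfs:///user/me",
                  "C:\\data\\input.txt", "data/input.txt"]:
    print(candidate, looks_like_uri(candidate))
```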

Dec 19, 2016
b7c3056

v0.5.6

dataproc crash fix

 * Dataproc runner:
   * fix Hadoop version crash on unknown image version (Yelp#1428)
 * EMR and Hadoop runners:
   * prioritize task errors as probable cause of failure (Yelp#1429)
   * ignore Java stack trace in task stderr logs (Yelp#1430)

Sep 12, 2016
d9949ab

v0.5.5

missing ami_version option

 * EMR runner:
   * ami_version option is deprecated rather than removed, as it was in v0.5.4 (Yelp#1421)
   * update memory/CPU stats for EC2 instances for pooling (Yelp#1414)
   * pooling treats application names as case-insensitive (Yelp#1417)

Sep 5, 2016
4ffc582

v0.5.4

pooling auto-recovery

 * jobs:
   * pass_through_option(), for existing command-line options (Yelp#1075)
     * MRJob.options.runner now defaults to None, not 'inline' or 'local'
 * runners:
   * all:
     * names of uploaded files now never start with . or _ (Yelp#1200)
   * Hadoop:
     * log parsing:
       * handles more log4j patterns (Yelp#1405)
       * gracefully handles IOError from exists() (Yelp#1355)
     * fixed crash bug in Hadoop FS on Python 3 (Yelp#1396)
   * EMR:
     * pooling auto-recovers from joining a cluster that self-terminated (Yelp#708)
     * log fetching uses sudo on 4.3.0+ AMIs (Yelp#1244)
     * fixed broken --ssh-bind-ports switch (Yelp#1402)
     * idle termination script now only runs on master node (Yelp#1398)
     * ssh tunnel connects to internal IP of resource manager (Yelp#1397)
     * AWS credentials no longer logged in verbose mode (Yelp#1353)
     * many option names are now more generic (Yelp#1247)
       * ami_version -> image_version
       * aws_availability_zone -> zone
       * aws_region -> region
       * check_emr_status_every -> check_cluster_every
       * ec2_core_instance_bid_price -> core_instance_bid_price
       * ec2_core_instance_type -> core_instance_type
       * ec2_instance_type -> instance_type
       * ec2_master_instance_bid_price -> master_instance_bid_price
       * ec2_master_instance_type -> master_instance_type
       * ec2_task_instance_bid_price -> task_instance_bid_price
       * ec2_task_instance_type -> task_instance_type
       * emr_tags -> tags
       * num_ec2_core_instances -> num_core_instances
       * num_ec2_task_instances -> num_task_instances
       * s3_log_uri -> cloud_log_dir
       * s3_sync_wait_time -> cloud_fs_sync_secs
       * s3_tmp_dir -> cloud_tmp_dir
       * s3_upload_part_size -> cloud_upload_part_size
     * num_ec2_instances is deprecated (use num_core_instances)
     * ec2_slave_instance_type is deprecated (use core_instance_type)
     * hadoop_streaming_jar_on_emr is deprecated (Yelp#1405)
       * hadoop_streaming_jar handles this instead with file:// URIs
     * bootstrap_python does nothing on AMI 4.6.0+, as not needed (Yelp#1358)
 * mrjob audit-emr-usage should show fewer or no API throttling warnings (Yelp#1091)
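
When updating an existing configuration, the EMR option renames listed above can be applied mechanically. The sketch below copies the old-to-new pairs from that list into a dict; the helper `migrate_opts` and the sample option values are hypothetical and not part of mrjob.

```python
# Old EMR option name -> more generic name, copied from the v0.5.4 list above.
RENAMED_OPTS = {
    "ami_version": "image_version",
    "aws_availability_zone": "zone",
    "aws_region": "region",
    "check_emr_status_every": "check_cluster_every",
    "ec2_core_instance_bid_price": "core_instance_bid_price",
    "ec2_core_instance_type": "core_instance_type",
    "ec2_instance_type": "instance_type",
    "ec2_master_instance_bid_price": "master_instance_bid_price",
    "ec2_master_instance_type": "master_instance_type",
    "ec2_task_instance_bid_price": "task_instance_bid_price",
    "ec2_task_instance_type": "task_instance_type",
    "emr_tags": "tags",
    "num_ec2_core_instances": "num_core_instances",
    "num_ec2_task_instances": "num_task_instances",
    "s3_log_uri": "cloud_log_dir",
    "s3_sync_wait_time": "cloud_fs_sync_secs",
    "s3_tmp_dir": "cloud_tmp_dir",
    "s3_upload_part_size": "cloud_upload_part_size",
}


def migrate_opts(opts):
    """Return a copy of opts with old option names replaced by new ones.

    Hypothetical helper. If both an old and a new name are present,
    the value given under the new name wins.
    """
    migrated = {}
    # First pass: translate values stored under old names.
    for name, value in opts.items():
        if name in RENAMED_OPTS:
            migrated[RENAMED_OPTS[name]] = value
    # Second pass: names that were not renamed override translated ones.
    for name, value in opts.items():
        if name not in RENAMED_OPTS:
            migrated[name] = value
    return migrated


old_opts = {"ami_version": "4.8.2", "aws_region": "us-west-2",
            "s3_tmp_dir": "s3://my-bucket/tmp/"}
print(migrate_opts(old_opts))
```

The deprecated num_ec2_instances and ec2_slave_instance_type options are left out of the dict on purpose, since the list describes them as deprecations with a suggested replacement rather than as plain renames.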

Aug 27, 2016
ebc4943

v0.5.3

libjars

 * jobs:
   * LIBJARS and libjars method (Yelp#1341)
 * runners:
   * all:
     * .cpython-3*.pyc files no longer included when bootstrapping mrjob
   * local:
     * *PATH envvars combined with local separator (Yelp#1321)
   * Hadoop and EMR:
     * libjars option (Yelp#198)
     * fixes to ordering of generic and JAR-specific options (Yelp#1331, Yelp#1332)
   * Hadoop:
     * more default log dirs (Yelp#1339)
     * hadoop_tmp_dir handles ~ and envvars (Yelp#1322) (broken in v0.5.0)
   * EMR:
     * determine cause of failure of bootstrap scripts (Yelp#370)
       * master bootstrap script now redirects stdout to stderr
     * emr_configurations option (Yelp#1276)
     * subnet option (Yelp#1323)
     * SSH tunnel opened as soon as cluster is ready (Yelp#1115)
     * SSH tunnel leaves stdin alone (Yelp#1161)
 * combine_lists() treats dicts as values, not sequences
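
To make the combine_lists() note concrete, here is a rough sketch of list combining in which dicts (and strings) count as single values rather than as sequences to be expanded. The function `combine_lists_sketch` is a hypothetical illustration, not mrjob's code, and skipping None is an assumption of this sketch.

```python
def combine_lists_sketch(*seqs):
    """Concatenate sequences, treating strings and dicts as single values.

    Hypothetical illustration of the behavior described in the note.
    """
    result = []
    for seq in seqs:
        if seq is None:
            # Assumption: None means "not set" and contributes nothing.
            continue
        if isinstance(seq, (str, bytes, dict)):
            # Iterable, but treated as one value rather than expanded.
            result.append(seq)
        else:
            try:
                result.extend(seq)
            except TypeError:
                # Non-iterable scalars (e.g. numbers) are appended as-is.
                result.append(seq)
    return result


print(combine_lists_sketch(["a"], {"b": 1}, None, "c", ["d"], 5))
```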

Jul 16, 2016
208130d

v0.5.2

initial Cloud Dataproc support

 * basic support for Google Cloud Dataproc (Yelp#1243)
   * lacks log interpretation, JarStep support
 * on EMR, wait for steps to complete in correct order (Yelp#1316)
 * correctly handle ~ in include path in mrjob.conf (Yelp#1308)
 * new emr_applications option (Yelp#1293)
 * fix running deprecated tools with python -m (Yelp#1312)
 * fix ssh tunneling to 2.x AMIs on EMR in VPCs (Yelp#1311)

May 23, 2016
511c558

v0.5.1

post-release bugfixes

 * strict_protocols in mrjob.conf is no longer ignored (Yelp#1302)
 * check_input_paths in mrjob.conf is no longer ignored
 * partitioner() is no longer ignored, fixing SORT_VALUES (Yelp#1294)
   * --partitioner switch is deprecated
 * improved probable cause of error from pre-YARN logs (Yelp#1288)
 * ssh_bind_ports now defaults to (x)range, not list (Yelp#1284)
 * mrjob terminate-idle-clusters handles debugging jar from boto 2.40.0 (Yelp#1306)

Apr 29, 2016
c3ce8a8

v0.5.0

the future is in the past

 * supports Python 3 (Yelp#989)
 * requires boto 2.35.0 or newer (Yelp#980)
   * removed many workarounds for S3 and EMR (Yelp#980), IAM (Yelp#1062)
 * jobs:
   * is_mapper_or_reducer() is now is_task() (Yelp#1072)
   * mr() no longer takes positional arguments (Yelp#814)
   * removed jar() (use mrjob.step.JarStep)
   * removed testing methods parse_counters() and parse_output()
   * protocols:
     * protocols are strict by default (Yelp#724)
     * JSON protocols use ujson when available, then simplejson (Yelp#1002, Yelp#1266)
       * can explicitly choose Standard, Simple or Ultra JSON protocol
     * raw protocols handle bytes or unicode depending on Python version
       * can explicitly choose Text or Bytes protocol
   * mrjob.step:
     * JarStep only takes "args" and "main_class" keyword args
     * removed MRJobStep (use MRStep)
 * runners:
   * All runners:
     * totally revamped log handling (Yelp#1123)
     * runner status/log messages are less noisy (Yelp#1044)
     * don't bootstrap mrjob if interpreter is set (Yelp#1041)
     * fs methods path_exists() and path_join() are now exists() and join()
     * deprecation warning: use runner.fs explicitly (Yelp#1146)
     * changes to cleanup options:
       * removed IS_SUCCESSFUL (use ALL)
       * LOCAL_SCRATCH is now LOCAL_TMP (Yelp#318)
       * new HADOOP_TMP option handles HDFS cleanup (Yelp#1261)
       * REMOTE_SCRATCH is now CLOUD_TMP (Yelp#1261)
     * base_tmp_dir option is now local_tmp_dir (Yelp#318)
     * non-inline runners raise StepFailedException on step failure (Yelp#1219)
     * steps_python_bin defaults to current python interpreter (Yelp#1038)
     * _job_name is now _job_key (Yelp#982)
   * EMR:
     * default AWS region is us-west-2 (Yelp#1025)
     * default instance type is m1.medium (Yelp#992)
     * visible_to_all_users defaults to true (Yelp#1016)
     * matches your minor version of Python 2 on 3.x and 4.x AMIs (Yelp#1265)
     * 4.x AMIs are supported (Yelp#1105)
       * added --release-label switch (--ami-version 4.x.y also works)
     * can fetch counters and probable cause of failure on 3.x and 4.x AMIs
     * SSH tunnel now works on 3.x and 4.x AMIs (Yelp#1013)
       * ssh_tunnel_to_job_tracker option is now ssh_tunnel
     * correctly fetch step logs by step ID (Yelp#1117)
     * bootstrap_python option
     * s3_scratch_uri option is now s3_tmp_dir (Yelp#318)
     * aws_region is no longer inferred from s3_tmp_dir
     * create/select temp bucket in same region as EMR jobs (Yelp#687)
     * added iam_endpoint option (Yelp#1067)
     * removed s3_conn args from methods in EMRJobRunner and S3Filesystem
     * S3 Filesystem:
       * connect to each S3 bucket on appropriate endpoint (Yelp#1028)
         * fall back to default if we can't get bucket location (Yelp#1170)
       * removed special treatment of _$folder$ keys
         * removed deprecated S3Filesystem method get_s3_folder_keys()
       * recurse "subdirectories" even if uri lacks trailing / (Yelp#1183)
     * removed iam_job_flow_role option (use iam_instance_profile)
     * custom hadoop_streaming_jar gets properly uploaded
     * job cleanup temporarily disabled (Yelp#1241)
     * pooling respects key pair (Yelp#1230)
     * idle cluster self-termination respects non-streaming jobs (Yelp#1145)
     * deprecated "latest" AMI version not passed through to EMR (Yelp#1269)
     * emr_job_flow_id option is now cluster_id (Yelp#1082)
     * emr_job_flow_pool_name is now pool_name (Yelp#1082)
     * pool_emr_job_flows is now pool_clusters (Yelp#1082)
   * Hadoop:
     * works out of the box on most Hadoop setups (Yelp#1160)
     * works out of the box inside EMR (2.x, 3.x, and 4.x AMIs)
     * counters are parsed from Hadoop binary stderr in YARN (Yelp#1153)
     * can find logs and probable cause of failure in YARN (Yelp#1195)
       * will search in <output dir>/_logs, to support Cloudera (Yelp#565)
     * HDFS Filesystem:
       * use fs -ls -R and fs -rm -R in YARN (Yelp#1152)
       * mkdir() now uses -p on YARN (Yelp#991)
       * fs.du() now works on YARN (Yelp#1155)
       * fs.du() now returns 0 for nonexistent files instead of erroring
       * fs.rm() now uses -skipTrash
     * dropped support for Hadoop prior to 0.20.203 (Yelp#1208)
     * added hadoop_log_dirs option
     * hdfs_scratch_dir option is now hadoop_tmp_dir (Yelp#318)
     * hadoop_home is deprecated
     * uses -D and correct property name when step has no reduces (Yelp#1213)
   * Inline/Local:
     * runner.fs raises IOError if passed URIs (Yelp#1185)
     * version-agnostic by default (Yelp#735)
     * removed ignored hadoop_extra_args and hadoop_streaming_jar opts (Yelp#1275)
     * inline runner uses multiple splits by default (Yelp#1276)
 * removed mrjob.compat.get_jobconf_value() (use jobconf_from_env())
 * removed mrjob.compat methods to support Hadoop prior to 0.20.203:
   * supports_combiners_in_hadoop_streaming()
   * supports_new_distributed_cache_options()
   * uses_generic_jobconf()
 * removed mrjob.conf.combine_cmd_lists()
 * removed fetch-logs tool (Yelp#1127)
 * mrjob subcommands use "cluster" rather than "job-flow" (Yelp#1082)
   * create-job-flow is now create-cluster
   * terminate-idle-job-flows is now terminate-idle-clusters
   * terminate-job-flow is now terminate-cluster
 * Python-version-specific mrjob-x and mrjob-x.y commands (Yelp#1104)
 * use followlinks=True with os.walk()
 * all internal constants/functions/methods explicitly start with _ (Yelp#681)
 * mrjob.util:
   * file_ext() takes filename, not path
   * random_identifier() moved here from mrjob.aws
   * buffer_iterator_to_line_iterator() is now to_lines()
     * to_lines() no longer appends a newline to data (Yelp#819)
   * removed extract_dir_for_tar()
   * gunzip_stream() now yields chunks, not lines
   * removed hash_object()
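
The mrjob.util notes on to_lines() and on gunzip_stream() yielding chunks fit together: a stream of arbitrary chunks can be regrouped into lines, and the last line is passed through without a newline being appended. Below is a minimal sketch of that behavior under the name `to_lines_sketch`, which is hypothetical and not the library's implementation.

```python
def to_lines_sketch(chunks):
    """Yield lines from an iterable of byte chunks.

    Hypothetical illustration: a trailing partial line is yielded as-is,
    with no newline appended.
    """
    leftover = b""
    for chunk in chunks:
        pieces = (leftover + chunk).split(b"\n")
        # Every piece except the last is a complete line.
        for piece in pieces[:-1]:
            yield piece + b"\n"
        # The last piece may be continued by the next chunk.
        leftover = pieces[-1]
    if leftover:
        yield leftover


# Chunk boundaries need not line up with line boundaries.
lines = list(to_lines_sketch([b"a\nb", b"c\nd"]))
print(lines)  # [b'a\n', b'bc\n', b'd']
```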

Mar 28, 2016
7ac9e9c

v0.4.6

config files

 * PyYAML>=3.08 is required
 * !clear tag in conf files (Yelp#1162)
 * combine_lists() and combine_path_lists() can handle scalars (Yelp#1172)
 * include: paths in conf files are relative to real path of conf file (Yelp#1166)
 * mrjob.conf.combine_cmd_lists() is deprecated (Yelp#1168)
 * EMR runner: pool_wait_minutes can now be loaded from mrjob.conf (Yelp#1070)

Nov 9, 2015
898e6d1
